CSV

CSV Parser

Saddle features a very fast CSV parser. It strives to allocate as little as possible and to branch as little as possible during parsing. For example, it can parse numeric tables without ever allocating a String (except for the header).

The CSV parsing logic itself is published in the saddle-io module, which is dependency-free. The Saddle-specific parts are in the saddle-core module.

import scala.io.Source
import org.saddle._
val irisURL = "https://gist.githubusercontent.com/pityka/d05bb892541d71c2a06a0efb6933b323/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv"
// irisURL: String = "https://gist.githubusercontent.com/pityka/d05bb892541d71c2a06a0efb6933b323/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv"
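// Parse the first four (numeric) columns of the remote file into a Frame.
// Note: .toOption.get throws if parsing fails; handle the failure case in real code.
val iris: Frame[Int, String, Double] = csv.CsvParser.parseInputStreamWithHeader[Double](
      inputStream = new java.net.URL(irisURL).openStream,
      cols = List(0, 1, 2, 3),
      recordSeparator = "\n").toOption.get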
// iris: Frame[Int, String, Double] = [150 x 4]
//        sepal_length sepal_width petal_length petal_width 
//        ------------ ----------- ------------ ----------- 
//   0 ->       5.1000      3.5000       1.4000      0.2000 
//   1 ->       4.9000      3.0000       1.4000      0.2000 
//   2 ->       4.7000      3.2000       1.3000      0.2000 
//   3 ->       4.6000      3.1000       1.5000      0.2000 
//   4 ->       5.0000      3.6000       1.4000      0.2000 
// ...
// 145 ->       6.7000      3.0000       5.2000      2.3000 
// 146 ->       6.3000      2.5000       5.0000      1.9000 
// 147 ->       6.5000      3.0000       5.2000      2.0000 
// 148 ->       6.2000      3.4000       5.4000      2.3000 
// 149 ->       5.9000      3.0000       5.1000      1.8000 
//

Limitations

  • recordSeparator must be a String of length one or two.
  • fieldSeparator must be a single Char.
  • Doubled quotes are not turned back into a single quote. The CSV RFC states that quotes inside a quoted field (" hi "quote" ") must be doubled (" hi ""quote"" "). The parser returns these as doubled.
  • The CSV parser is tuned for fast parsing of trusted (conforming) input. Parsing arbitrarily broken CSV files is out of scope.
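
Because doubled quotes are returned as-is, callers that need the original text can collapse them after parsing. The following is a minimal sketch of such a post-processing step. QuoteUnescape and undoubleQuotes are hypothetical names and are not part of Saddle. The sketch assumes the value passed in is the content of a single field, without its enclosing quotes.

```scala
// Hypothetical helper, NOT part of Saddle: collapses RFC 4180 doubled quotes
// in a field value that the parser returned unchanged.
object QuoteUnescape {

  // Replaces every pair of consecutive double quotes with a single double quote.
  // Assumes `field` is the content of one field, without enclosing quotes.
  def undoubleQuotes(field: String): String =
    field.replace("\"\"", "\"")

  def main(args: Array[String]): Unit = {
    val raw = " hi \"\"quote\"\" " // field content as returned by the parser
    println(undoubleQuotes(raw))
  }
}
```

Applying this only to the String columns that may contain quotes keeps the numeric parsing path free of extra allocation.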

CSV Writer

A simple CSV writer is provided in the org.saddle.csv.CsvWriter object.