Data frame

Homogeneous table with row and column index (data frame): Frame[RX,CX,T]

A Frame combines a Mat with a row index and a column index, which together provide a way to index into the Mat.

A Frame is represented internally as a sequence of column Vec instances all sharing the same row index.

Factories

import org.saddle._
import org.saddle.order._
import org.saddle.ops.BinOps._
 val v = Vec(1, 2)                              // given the following
// v: Vec[Int] = [2 x 1]
// 1
// 2
//
 val u = Vec(3, 4)
// u: Vec[Int] = [2 x 1]
// 3
// 4
// 
 val s = Series(Vec(1,3,2,4), Index("c", "b", "a", "b")).sortedIx
// s: Series[String, Int] = [4 x 1]
// a ->  2
// b ->  3
// b ->  4
// c ->  1
// 
 val s2 = Series("a" -> 1, "b" -> 2)
// s2: Series[String, Int] = [2 x 1]
// a ->  1
// b ->  2
// 
 val t = Series("b" -> 3, "c" -> 4)
// t: Series[String, Int] = [2 x 1]
// b ->  3
// c ->  4
// 

 Frame(v, u)                                    // two-column frame
// res0: Frame[Int, Int, Int] = [2 x 2]
//       0  1 
//      -- -- 
// 0 ->  1  3 
// 1 ->  2  4 
//

 Frame("x" -> v, "y" -> u)                      // with column index
// res1: Frame[Int, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// 0 ->  1  3 
// 1 ->  2  4 
//

 Frame(s2, t)                                    // aligned along rows
// res2: Frame[String, Int, Int] = [3 x 2]
//       0  1 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//

 Frame("x" -> s2, "y" -> t)                      // with column index
// res3: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//

 Frame(Seq(s2, t), Index("x", "y"))              // explicit column index
// res4: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//

 Frame(Seq(v, u), Index(0, 1), Index("x", "y")) // row & col indexes specified explicitly
// res5: Frame[Int, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// 0 ->  1  3 
// 1 ->  2  4 
//

 Frame(Seq(v, u), Index("a", "b"))              // col index specified
// res6: Frame[Int, String, Int] = [2 x 2]
//       a  b 
//      -- -- 
// 0 ->  1  3 
// 1 ->  2  4 
//

The factory methods that construct a Frame from Series columns come in two flavors, which differ in how they handle non-unique indices (duplicate row index values) in the series.

  • The Frame.apply methods create a full cross product of the respective indices. For a given row index value, every matching item in one series is paired with every matching item in the others, so duplicated index values can cause a combinatorial explosion in the number of rows of the resulting Frame.
  • The Frame.fromCols methods disambiguate the non-unique indices before joining. This avoids the combinatorial increase in the number of rows, at the cost of joining items with the same index value arbitrarily.

The following example shows the difference between Frame.fromCols and Frame.apply; note the rows with index 0:

Frame.fromCols(
        Series(0 -> 1, 2 -> 2, 1 -> 3, 0 -> 4),
        Series(1 -> 1, 2 -> 2, 0 -> 3, 0 -> 4),
        Series(0 -> 1, 1 -> 2, 2 -> 3, 0 -> 4)
      )
// res7: Frame[Int, Int, Int] = [4 x 3]
//       0  1  2 
//      -- -- -- 
// 0 ->  1  3  1 
// 2 ->  2  2  3 
// 1 ->  3  1  2 
// 0 ->  4  4  4 
// 

  Frame.apply(
        Series(0 -> 1, 2 -> 2, 1 -> 3, 0 -> 4),
        Series(1 -> 1, 2 -> 2, 0 -> 3, 0 -> 4),
        Series(0 -> 1, 1 -> 2, 2 -> 3, 0 -> 4)
      )
// res8: Frame[Int, Int, Int] = [10 x 3]
//       0  1  2 
//      -- -- -- 
// 0 ->  1  3  1 
// 0 ->  1  3  4 
// 0 ->  1  4  1 
// 0 ->  1  4  4 
// 0 ->  4  3  1 
// 0 ->  4  3  4 
// 0 ->  4  4  1 
// 0 ->  4  4  4 
// 2 ->  2  2  3 
// 1 ->  3  1  2 
//
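
The 10 rows of the Frame.apply result can be accounted for by simple counting: each row key contributes the product of its multiplicities across the input series. The sketch below reproduces that arithmetic in plain Scala. It does not use saddle, `crossProductRows` is a hypothetical helper introduced only for illustration, and it assumes every key occurs at least once in every column, as in the example above.

```scala
// Plain-Scala model of the row count of a full cross-product join.
// Not saddle code; `crossProductRows` is a hypothetical helper.
object CrossProductRows {
  // Each column is represented only by its row keys; values do not affect the count.
  // Assumption: every key occurs at least once in every column.
  def crossProductRows(columns: Seq[Seq[Int]]): Map[Int, Int] = {
    val keys = columns.flatten.distinct
    keys.map { k =>
      // multiplicity of key k in each column, multiplied together
      k -> columns.map(_.count(_ == k)).product
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Row keys of the three series used in the example above
    val cols = Seq(
      Seq(0, 2, 1, 0),
      Seq(1, 2, 0, 0),
      Seq(0, 1, 2, 0)
    )
    val perKey = crossProductRows(cols)
    println(perKey)            // rows contributed by each key
    println(perKey.values.sum) // total number of rows
  }
}
```

Key 0 occurs twice in each of the three series, so it alone yields 2 × 2 × 2 = 8 rows; keys 1 and 2 contribute one row each, for 10 in total. Frame.fromCols, by contrast, keeps the row count at 4.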

Operations

You’ll notice that if an index is not provided, a default Int index is set, ranging from 0 up to (but not including) the length of the data.

If you want to set or reset the index, these methods are your friends:

val f = Frame("x" -> s2, "y" -> t)
// f: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
// 

 f.setRowIndex(org.saddle.index.IndexIntRange(f.numRows))
// res9: Frame[Int, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// 0 ->  1 NA 
// 1 ->  2  3 
// 2 -> NA  4 
// 
 f.setColIndex(Index("p", "q"))
// res10: Frame[String, String, Int] = [3 x 2]
//       p  q 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
// 
 f.resetRowIndex
// res11: Frame[Int, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// 0 ->  1 NA 
// 1 ->  2  3 
// 2 -> NA  4 
// 
 f.resetColIndex
// res12: Frame[String, Int, Int] = [3 x 2]
//       0  1 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//

(Note: frame f will carry through the next examples.)

You also have the following index transformation tools at hand:

f.mapRowIndex { case rx => rx+1 }
// res13: Frame[String, String, Int] = [3 x 2]
//        x  y 
//       -- -- 
// a1 ->  1 NA 
// b1 ->  2  3 
// c1 -> NA  4 
// 
f.mapColIndex { case cx => cx+2 }
// res14: Frame[String, String, Int] = [3 x 2]
//      x2 y2 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//

Let’s next look at how to extract data from the Frame.

f.rowAt(2)    // extract row at offset 2, as Series
// res15: Series[String, Int] = [2 x 1]
// x -> NA
// y ->  4
//
 f.rowAt(1,2)  // extract frame of rows 1 & 2
// res16: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// b ->  2  3 
// c -> NA  4 
//
 f.rowAt(1->2) // extract frame of rows 1 & 2
// res17: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// b ->  2  3 
// c -> NA  4 
//

 f.colAt(1)    // extract col at offset 1, as Series
// res18: Series[String, Int] = [3 x 1]
// a -> NA
// b ->  3
// c ->  4
//
 f.colAt(0,1)  // extract frame of cols 0 & 1
// res19: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//
 f.colAt(0->1) // extract frame of cols 0 & 1
// res20: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//

rowAt and colAt are used under the hood for the at extractor:

f.at(1,1)              // Scalar value
// res21: scalar.Scalar[Int] = Value(el = 3)
 f.at(Array(1,2), 0)    // extract rows 1,2 of column 0
// res22: Series[String, Int] = [2 x 1]
// b ->  2
// c -> NA
//

If you want more control over slicing, you can use these methods:

f.colSlice(0,1)        // frame slice consisting of column 0
// res23: Frame[String, String, Int] = [3 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
// c -> NA 
//
 f.rowSlice(0,3,2)      // row slice from 0 until 3, striding by 2
// res24: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// c -> NA  4 
//

Of course, this is a bi-indexed data structure, so we can use its indexes to select data by key:

f.row("a")             // one-row frame for key "a", with all columns
// res25: Frame[String, String, Int] = [1 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
//
 f.col("x")             // one-column frame for key "x", with all rows
// res26: Frame[String, String, Int] = [3 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
// c -> NA 
//
 f.row("a", "c")        // select two rows
// res27: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// c -> NA  4 
//
 f.row("a"->"b")        // slice two rows (index must be sorted)
// res28: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
//
 f.row(Seq("a", "c"):_*)   // another way to select
// res29: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// c -> NA  4 
//

A more explicit way to slice with keys is shown below; here you can specify whether the right bound is inclusive or exclusive. Again, to slice, the index keys must be ordered.

f.rowSliceBy("a", "b", inclusive=false)
// res30: Frame[String, String, Int] = [1 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// 
 f.colSliceBy("x", "x", inclusive=true)
// res31: Frame[String, String, Int] = [3 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
// c -> NA 
//
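
The effect of the inclusive flag can be modelled without saddle. The following plain-Scala sketch treats an index as a sorted vector of keys; `sliceBy` is a hypothetical helper, and the linear filter merely stands in for whatever lookup saddle actually performs on an ordered index.

```scala
// Plain-Scala model of key-based slicing with an inclusive/exclusive right bound.
// Not saddle code; `sliceBy` is a hypothetical helper.
object SliceByKeys {
  // Keeps keys k with from <= k <= to (inclusive) or from <= k < to (exclusive).
  // Assumption: sortedKeys is already in ascending order.
  def sliceBy(sortedKeys: Vector[String], from: String, to: String, inclusive: Boolean): Vector[String] =
    sortedKeys.filter { k =>
      val aboveLower = k.compareTo(from) >= 0
      val belowUpper = if (inclusive) k.compareTo(to) <= 0 else k.compareTo(to) < 0
      aboveLower && belowUpper
    }

  def main(args: Array[String]): Unit = {
    val rowKeys = Vector("a", "b", "c") // row index of f
    val colKeys = Vector("x", "y")      // column index of f
    println(sliceBy(rowKeys, "a", "b", inclusive = false)) // keys kept by the row slice
    println(sliceBy(colKeys, "x", "x", inclusive = true))  // keys kept by the column slice
  }
}
```

With the exclusive bound only row "a" remains, and with the inclusive bound column "x" is kept, matching the shapes of the two results above.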

The row and col methods are used under the hood for the apply method:

f("a", "x")             // extract a one-element frame by keys
// res32: Frame[String, String, Int] = [1 x 1]
//       x 
//      -- 
// a ->  1 
//
 f("a"->"b", "x")        // two-row, one-column frame
// res33: Frame[String, String, Int] = [2 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
//
 f(Vec("a", "c").toArray, "x")   // rows "a" and "c" of column "x", extracting by key rather than slicing
// res34: Frame[String, String, Int] = [2 x 1]
//       x 
//      -- 
// a ->  1 
// c -> NA 
//

The methods for extracting multiple rows shown above can of course be applied to columns as well.

You can also split up the Frame by key or index:

f.colSplitAt(1)          // split into two frames at column 1
// res35: (Frame[String, String, Int], Frame[String, String, Int]) = (
//   [3 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
// c -> NA 
// ,
//   [3 x 1]
//       y 
//      -- 
// a -> NA 
// b ->  3 
// c ->  4 
// 
// )
 f.colSplitBy("y")
// res36: (Frame[String, String, Int], Frame[String, String, Int]) = (
//   [3 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
// c -> NA 
// ,
//   [3 x 1]
//       y 
//      -- 
// a -> NA 
// b ->  3 
// c ->  4 
// 
// )

 f.rowSplitAt(1)
// res37: (Frame[String, String, Int], Frame[String, String, Int]) = (
//   [1 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// ,
//   [2 x 2]
//       x  y 
//      -- -- 
// b ->  2  3 
// c -> NA  4 
// 
// )
 f.rowSplitBy("b")
// res38: (Frame[String, String, Int], Frame[String, String, Int]) = (
//   [1 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// ,
//   [2 x 2]
//       x  y 
//      -- -- 
// b ->  2  3 
// c -> NA  4 
// 
// )

You can extract the first or last few rows or columns:

f.head(2)                // operates on rows
// res39: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
//
 f.tail(2)
// res40: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// b ->  2  3 
// c -> NA  4 
// 
 f.headCol(1)             // operates on cols
// res41: Frame[String, String, Int] = [3 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
// c -> NA 
//
 f.tailCol(1)
// res42: Frame[String, String, Int] = [3 x 1]
//       y 
//      -- 
// a -> NA 
// b ->  3 
// c ->  4 
//

Or the first & last of some key (which is helpful when you’ve got a multi-key index):

f.first("b")              // first row indexed by "b" key
// res43: Series[String, Int] = [2 x 1]
// x ->  2
// y ->  3
//
 f.last("b")               // last row indexed by "b" key
// res44: Series[String, Int] = [2 x 1]
// x ->  2
// y ->  3
//
 f.firstCol("x")
// res45: Series[String, Int] = [3 x 1]
// a ->  1
// b ->  2
// c -> NA
// 
 f.lastCol("x")
// res46: Series[String, Int] = [3 x 1]
// a ->  1
// b ->  2
// c -> NA
//

There are a few other methods of extracting data:

import org.saddle.linalg._
 f.filter { case s => s.toVec.map(_.toDouble).mean2 > 2.0 }  // any column whose series satisfies predicate
// res47: Frame[String, String, Int] = Empty Frame
 f.filterIx { case x => x == "x" }    // col where index matches key "x"
// res48: Frame[String, String, Int] = [3 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
// c -> NA 
//
 f.where(Vec(false, true))            // extract second column
// res49: Frame[String, String, Int] = [3 x 1]
//       y 
//      -- 
// a -> NA 
// b ->  3 
// c ->  4 
//

There are analogous methods to operate on rows rather than columns:

rfilter
rfilterIx
rwhere

And so on. In general, methods operate column-wise, whereas their r-prefixed counterparts operate row-wise.

You can drop cols (rows) containing any NA values:

f.dropNA
// res50: Frame[String, String, Int] = Empty Frame
 f.rdropNA
// res51: Frame[String, String, Int] = [1 x 2]
//       x  y 
//      -- -- 
// b ->  2  3 
//

Let’s take a look at some operations we can do with Frames. We can do all the normal binary math operations with Frames, with either a scalar value or with another Frame. When two frames are involved, they are reindexed along both axes to match the outer join of their indices, but any missing observation in either will carry through the calculations.

f + 1
// res52: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  2 NA 
// b ->  3  4 
// c -> NA  5 
// 
 f * f
// res53: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  4  9 
// c -> NA 16 
// 
 val g = Frame("y"->Series("b"->5, "d"->10))
// g: Frame[String, String, Int] = [2 x 1]
//       y 
//      -- 
// b ->  5 
// d -> 10 
// 
 f + g                      // one non-NA entry, ("b", "y", 8)
// res54: Frame[String, String, Int] = [4 x 2]
//       x  y 
//      -- -- 
// a -> NA NA 
// b -> NA  8 
// c -> NA NA 
// d -> NA NA 
//
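
The alignment rule behind f + g can be modelled in a few lines of plain Scala. This is only a sketch of the semantics described above, not saddle's implementation: a frame is represented as a map from (row key, column key) to a value, an absent cell stands for NA, and `add` is a hypothetical helper.

```scala
// Plain-Scala model of outer-join alignment with NA propagation.
// Not saddle code; `Cells` and `add` are hypothetical names.
object OuterJoinAdd {
  // (rowKey, colKey) -> value; an absent entry models NA
  type Cells = Map[(String, String), Int]

  def add(rows1: Seq[String], cols1: Seq[String], a: Cells,
          rows2: Seq[String], cols2: Seq[String], b: Cells): Map[(String, String), Option[Int]] = {
    // Outer join: the result covers the union of the keys on both axes
    val rows = (rows1 ++ rows2).distinct.sorted
    val cols = (cols1 ++ cols2).distinct.sorted
    (for (r <- rows; c <- cols) yield {
      // The sum is defined only if both operands are present; otherwise it is NA (None)
      val sum = for (x <- a.get((r, c)); y <- b.get((r, c))) yield x + y
      (r, c) -> sum
    }).toMap
  }

  def main(args: Array[String]): Unit = {
    val f: Cells = Map(("a", "x") -> 1, ("b", "x") -> 2, ("b", "y") -> 3, ("c", "y") -> 4)
    val g: Cells = Map(("b", "y") -> 5, ("d", "y") -> 10)
    val result = add(Seq("a", "b", "c"), Seq("x", "y"), f, Seq("b", "d"), Seq("y"), g)
    // Print only the non-NA cells of the aligned sum
    println(result.collect { case (k, Some(v)) => k -> v })
  }
}
```

The result has 4 × 2 = 8 cells, of which only ("b", "y") holds a value, because it is the only cell present in both operands.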

You can effectively supply your own binary frame operation using joinMap, which lets you control the join style on rows and columns:

f.joinMap(g, rhow=index.LeftJoin, chow=index.LeftJoin) { case (x, y) => x + y }
// res55: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a -> NA NA 
// b -> NA  8 
// c -> NA NA 
//

If you want simply to align one frame to another without performing an operation, use the following method:

val (fNew, gNew) = f.align(g, rhow=index.LeftJoin, chow=index.OuterJoin)
// fNew: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
// 
// gNew: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a -> NA NA 
// b -> NA  5 
// c -> NA NA 
//

If you want to treat a Frame as a matrix to use in linear algebraic fashion, call the toMat method.

We can sort a frame in various ways:

f.sortedRIx                // sorted by row index
// res56: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//
 f.sortedCIx                // sorted by col index
// res57: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//
 f.sortedRows(0,1)          // sort rows by (primary) col 0 and (secondary) col 1
// res58: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// c -> NA  4 
// a ->  1 NA 
// b ->  2  3 
//
 f.sortedCols(1,0)          // sort cols by (primary) row 1 and (secondary) row 0
// res59: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//

We can also sort by an ordering provided by the result of a function acting on rows or cols:

f.sortedRowsBy { case r => r.at(0) }   // sort rows by first element of row
// res60: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//
 f.sortedColsBy { case c => c.at(0) }   // sort cols by first element of col
// res61: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
//

There are several mapping functions:

f.mapValues { case t => t + 1 }        // add one to each element of frame
// res62: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// a ->  2 NA 
// b ->  3  4 
// c -> NA  5 
//
 import org.saddle.linalg._
 f.mapVec { case v => v.map(_.toDouble).demeaned }      // map over each col vec of the frame
// res63: Frame[String, String, Double] = [3 x 2]
//       x  y 
//      -- -- 
// a -> NA NA 
// b -> NA NA 
// c -> NA NA 
//
 f.reduce { case s => s.toVec.map(_.toDouble).mean2 }          // collapse each col series to a single value
// res64: Series[String, Double] = [2 x 1]
// x -> NA
// y -> NA
//
 f.transform { case s => s.reversed }   // transform each series; outerjoin results
// res65: Frame[String, String, Int] = [3 x 2]
//       x  y 
//      -- -- 
// c -> NA  4 
// b ->  2  3 
// a ->  1 NA 
//

We can mask out values:

 f.mask(_ > 2)                          // mask out values > 2
 f.mask(Vec(false, true, true))         // mask out rows 1 & 2 (keep row 0)

Columns (rows) containing only NA values can be dropped as follows:

f.mask(Vec(true, false, false)).rsqueeze   // drop rows that contain only NA values
// res66: Frame[String, String, Int] = [2 x 2]
//       x  y 
//      -- -- 
// b ->  2  3 
// c -> NA  4 
//
 f.rmask(Vec(false, true)).squeeze          // takes "x" column
// res67: Frame[String, String, Int] = [3 x 1]
//       x 
//      -- 
// a ->  1 
// b ->  2 
// c -> NA 
//
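
The difference between dropping on any NA (dropNA) and dropping on all NA (squeeze) is easy to miss. The plain-Scala sketch below models it with columns of Option values; it does not use saddle, and `dropAnyNA` and `dropAllNA` are hypothetical helpers that mirror the behaviour shown in the outputs above.

```scala
// Plain-Scala model of "drop if any NA" versus "drop if all NA" on columns.
// Not saddle code; the helper names are hypothetical.
object DropVsSqueeze {
  type Columns = Map[String, Vector[Option[Int]]] // None models NA

  // Keep only columns without a single NA (models the dropNA behaviour)
  def dropAnyNA(cols: Columns): Columns =
    cols.filter { case (_, values) => values.forall(_.isDefined) }

  // Keep columns that have at least one value (models the squeeze behaviour)
  def dropAllNA(cols: Columns): Columns =
    cols.filter { case (_, values) => values.exists(_.isDefined) }

  def main(args: Array[String]): Unit = {
    val f: Columns = Map(
      "x" -> Vector(Some(1), Some(2), None),
      "y" -> Vector(None, Some(3), Some(4))
    )
    // Column "y" fully masked, as after f.rmask(Vec(false, true))
    val maskedY: Columns = f.updated("y", Vector.fill(3)(Option.empty[Int]))

    println(dropAnyNA(f).keySet)       // columns surviving "any NA" dropping
    println(dropAllNA(f).keySet)       // columns surviving "all NA" dropping
    println(dropAllNA(maskedY).keySet) // columns surviving after masking "y"
  }
}
```

Both columns of f contain an NA, so dropping on any NA leaves nothing, while dropping on all NA removes a column only once every value in it has been masked.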

We can groupBy in order to combine or transform:

import org.saddle.linalg._
 f.groupBy(_ == "a").combine(_.count)       // # obs in each column that have/not row key "a"
// res68: Frame[Boolean, String, Int] = [2 x 2]
//           x  y 
//          -- -- 
// false ->  1  2 
//  true ->  1  0 
//
 f.groupBy(_ == "a").transform(_.map(_.toDouble).demeaned)  // contrived, but you get the idea hopefully!
// res69: Frame[String, String, Double] = [3 x 2]
//           x       y 
//      ------ ------- 
// a -> 0.0000      NA 
// b ->     NA -0.5000 
// c ->     NA  0.5000 
//
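
To see where the counts in the groupBy(...).combine(_.count) result come from, the grouping can be replayed in plain Scala. This is a sketch of the semantics only, not saddle's implementation: `countByGroup` is a hypothetical helper, and NA values are modelled as None.

```scala
// Plain-Scala model of grouping rows by a key predicate and counting non-NA values per column.
// Not saddle code; `countByGroup` is a hypothetical helper.
object GroupCount {
  def countByGroup(
      rowKeys: Vector[String],
      columns: Map[String, Vector[Option[Int]]],
      group: String => Boolean
  ): Map[Boolean, Map[String, Int]] = {
    // Row offsets belonging to each group key
    val offsetsByGroup = rowKeys.indices.groupBy(i => group(rowKeys(i)))
    offsetsByGroup.map { case (g, offsets) =>
      // For each column, count the non-NA values among the rows of this group
      g -> columns.map { case (name, values) =>
        name -> offsets.count(i => values(i).isDefined)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val rowKeys = Vector("a", "b", "c")
    val columns: Map[String, Vector[Option[Int]]] = Map(
      "x" -> Vector(Some(1), Some(2), None),
      "y" -> Vector(None, Some(3), Some(4))
    )
    println(countByGroup(rowKeys, columns, _ == "a"))
  }
}
```

The true group holds only row "a" (one value in x, none in y); the false group holds rows "b" and "c" (one value in x, two in y), which agrees with the frame printed above.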

We can join against another frame, or against a series.

f.rconcat(g, how=index.LeftJoin)              
// res70: Frame[String, String, Int] = [3 x 3]
//       x  y  y 
//      -- -- -- 
// a ->  1 NA NA 
// b ->  2  3  5 
// c -> NA  4 NA 
//               
 f.concat(g, how=index.LeftJoin)              
// res71: Frame[String, String, Int] = [5 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
// b -> NA  5 
// d -> NA 10 
//               
 f.cbind(g, how=index.LeftJoin)              
// res72: Frame[String, String, Int] = [3 x 3]
//       x  y  y 
//      -- -- -- 
// a ->  1 NA NA 
// b ->  2  3  5 
// c -> NA  4 NA 
//               
 f.rbind(g, how=index.LeftJoin)              
// res73: Frame[String, String, Int] = [5 x 2]
//       x  y 
//      -- -- 
// a ->  1 NA 
// b ->  2  3 
// c -> NA  4 
// b -> NA  5 
// d -> NA 10 
//

Btw, to join a Frame to a series, the call looks like this:

s.joinF(g, how=index.LeftJoin)
// res74: Frame[String, Int, Int] = [3 x 2]
//       0  1 
//      -- -- 
// b ->  3  5 
// b ->  4  5 
// d -> NA 10 
//