Rover

Simple, powerful data frames for Ruby

:mountain: Designed for data exploration and machine learning, and powered by numo-narray-alt

:evergreen_tree: Uses Vega for visualization

Installation

Add this line to your application’s Gemfile:

gem "rover-df"

Intro

A data frame is an in-memory table, a useful data structure for data analysis and machine learning. It stores data by column, which makes operations on whole columns fast.
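
To make the columnar idea concrete, here is a minimal plain-Ruby sketch that turns row hashes into one array per column. It does not use Rover and is not how Rover is implemented; `to_columns` is a hypothetical helper introduced only for illustration.

```ruby
# Row-oriented data: one hash per record.
rows = [
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
]

# Hypothetical helper (not part of Rover): collect one array per column.
def to_columns(rows)
  rows.each_with_object({}) do |row, columns|
    row.each { |key, value| (columns[key] ||= []) << value }
  end
end

columns = to_columns(rows)

# A column operation now reads a single array instead of every row hash.
total = columns[:a].sum # 6
```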

Creating Data Frames

From an array

Rover::DataFrame.new([
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
])

From a hash

Rover::DataFrame.new({
  a: [1, 2, 3],
  b: ["one", "two", "three"]
})

From Active Record

Rover::DataFrame.new(User.all)

From a CSV

Rover.read_csv("file.csv")
# or
Rover.parse_csv("CSV,data,string")

From Parquet (requires the red-parquet gem)

Rover.read_parquet("file.parquet")
# or
Rover.parse_parquet("PAR1...")

Attributes

Get number of rows

df.count

Get column names

df.keys

Check if a column exists

df.include?(name)

Selecting Data

Select a column

df[:a]

Select multiple columns

df[[:a, :b]]

Select first rows

df.head
# or
df.first(5)

Select last rows

df.tail
# or
df.last(5)

Select rows by index

df[1]
# or
df[1..3]
# or
df[[1, 4, 5]]

Iterate over rows

df.each_row { |row| ... }

Iterate over a column

df[:a].each { |item| ... }
# or
df[:a].each_with_index { |item, index| ... }

Filtering

Filter on a condition

df[df[:a] == 100]
df[df[:a] != 100]
df[df[:a] > 100]
df[df[:a] >= 100]
df[df[:a] < 100]
df[df[:a] <= 100]

In

df[df[:a].in?([1, 2, 3])]
df[df[:a].in?(1..3)]
df[df[:a].in?(["a", "b", "c"])]

Not in

df[!df[:a].in?([1, 2, 3])]

And, or, and exclusive or

df[(df[:a] > 100) & (df[:b] == "one")] # and
df[(df[:a] > 100) | (df[:b] == "one")] # or
df[(df[:a] > 100) ^ (df[:b] == "one")] # xor
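
Each comparison yields one boolean per row, and `&`, `|`, and `^` combine those booleans element-wise. The parentheses are required because Ruby's `&`, `|`, and `^` bind more tightly than the comparison operators. A plain-Ruby sketch of the idea, using ordinary arrays rather than Rover's column type (the `combine` helper is hypothetical):

```ruby
# Hypothetical helper: combine two boolean masks element-wise with :&, :| or :^.
def combine(left, right, op)
  left.zip(right).map { |l, r| l.public_send(op, r) }
end

a = [50, 150, 200]
b = ["one", "one", "two"]

a_mask = a.map { |v| v > 100 }    # [false, true, true]
b_mask = b.map { |v| v == "one" } # [true, true, false]

both   = combine(a_mask, b_mask, :&) # [false, true, false]
either = combine(a_mask, b_mask, :|) # [true, true, true]
one_of = combine(a_mask, b_mask, :^) # [true, false, true]

# Keep only the values whose mask entry is true.
selected = a.zip(both).select { |_, keep| keep }.map(&:first) # [150]
```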

Operations

Basic operations

df[:a] + 5
df[:a] - 5
df[:a] * 5
df[:a] / 5
df[:a] % 5
df[:a] ** 2
df[:a].sqrt
df[:a].cbrt
df[:a].abs

Rounding

df[:a].round
df[:a].ceil
df[:a].floor

Logarithm

df[:a].ln # or log
df[:a].log(5)
df[:a].log10
df[:a].log2

Exponentiation

df[:a].exp
df[:a].exp2

Trigonometric functions

df[:a].sin
df[:a].cos
df[:a].tan
df[:a].asin
df[:a].acos
df[:a].atan

Hyperbolic functions

df[:a].sinh
df[:a].cosh
df[:a].tanh
df[:a].asinh
df[:a].acosh
df[:a].atanh

Error function

df[:a].erf
df[:a].erfc

Summary statistics

df[:a].count
df[:a].sum
df[:a].mean
df[:a].median
df[:a].percentile(90)
df[:a].min
df[:a].max
df[:a].std
df[:a].var

Count occurrences

df[:a].tally

Cross tabulation

df[:a].crosstab(df[:b])
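
A cross tabulation counts how often each pair of values occurs together. The following plain-Ruby sketch shows the computation on two arrays; it illustrates the concept only and makes no claim about the shape of the result Rover returns. `crosstab_counts` is a hypothetical helper.

```ruby
# Hypothetical helper: nested counts of (x, y) pairs.
def crosstab_counts(xs, ys)
  counts = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  xs.zip(ys).each { |x, y| counts[x][y] += 1 }
  counts
end

crosstab_counts([1, 1, 2], ["one", "two", "one"])
# => {1 => {"one" => 1, "two" => 1}, 2 => {"one" => 1}}
```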

Grouping

Group

df.group(:a).count

Works with all summary statistics

df.group(:a).max(:b)

Multiple groups

df.group(:a, :b).count
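
For intuition, grouping partitions the rows by the values of the group columns and then applies a statistic to each partition. A rough plain-Ruby equivalent on row hashes (the helper names are hypothetical, and the return shapes are not meant to match Rover's):

```ruby
# Hypothetical helper: count rows per combination of the given keys.
def group_count(rows, *keys)
  rows.map { |row| row.values_at(*keys) }.tally
end

# Hypothetical helper: maximum of `column` within each group of `key`.
def group_max(rows, key, column)
  rows.group_by { |row| row[key] }
      .transform_values { |group| group.map { |row| row[column] }.max }
end

rows = [
  {a: 1, b: 10},
  {a: 1, b: 30},
  {a: 2, b: 20}
]

group_count(rows, :a)     # {[1] => 2, [2] => 1}
group_count(rows, :a, :b) # {[1, 10] => 1, [1, 30] => 1, [2, 20] => 1}
group_max(rows, :a, :b)   # {1 => 30, 2 => 20}
```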

Visualization

Add Vega to your application’s Gemfile:

gem "vega"

And use:

df.plot(:a, :b)

Specify the chart type (line, pie, column, bar, area, or scatter)

df.plot(:a, :b, type: "pie")

Group data

df.plot(:a, :b, group: :c)

Stacked columns or bars

df.plot(:a, :b, group: :c, stacked: true)

Updating Data

Add a new column

df[:a] = 1
# or
df[:a] = [1, 2, 3]

Update a single element

df[:a][0] = 100

Update multiple elements

df[:a][0..2] = 1
# or
df[:a][0..2] = [1, 2, 3]

Update all elements

df[:a] = df[:a].map { |v| v.gsub("a", "b") }
# or
df[:a].map! { |v| v.gsub("a", "b") }

Update elements matching a condition

df[:a][df[:a] > 100] = 0

Clamp

df[:a].clamp!(0, 100)

Delete columns

df.delete(:a)
# or
df.except!(:a, :b)

Rename columns

df.rename(a: :new_a, b: :new_b)
# or
df[:new_a] = df.delete(:a)

Sort rows

df.sort_by! { |r| r[:a] }

Clear all data

df.clear

Combining Data Frames

Add rows

df.concat(other_df)

Add columns

df.merge!(other_df)

Inner join

df.inner_join(other_df)
# or
df.inner_join(other_df, on: :a)
# or
df.inner_join(other_df, on: [:a, :b])
# or
df.inner_join(other_df, on: {df_col: :other_df_col})

Left join

df.left_join(other_df)
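
The difference between the two joins can be sketched in plain Ruby on arrays of row hashes. This assumes conventional SQL-style semantics (an inner join drops unmatched left rows, a left join keeps them with `nil` in the right-hand columns); `join_rows` and its `keep_unmatched:` option are hypothetical and not part of Rover.

```ruby
# Hypothetical helper: join two arrays of row hashes on the `on` keys.
def join_rows(left, right, on:, keep_unmatched: false)
  # Index the right side by its key values for fast lookup.
  index = right.group_by { |row| row.values_at(*on) }
  right_only = right.flat_map(&:keys).uniq - on

  left.flat_map do |l|
    matches = index[l.values_at(*on)]
    if matches
      matches.map { |r| l.merge(r) }
    elsif keep_unmatched
      # Left join: keep the row and fill the right-hand columns with nil.
      [l.merge(right_only.to_h { |key| [key, nil] })]
    else
      # Inner join: drop rows without a match.
      []
    end
  end
end

left  = [{a: 1, b: "one"}, {a: 2, b: "two"}]
right = [{a: 1, c: "x"}]

join_rows(left, right, on: [:a])
# => [{a: 1, b: "one", c: "x"}]
join_rows(left, right, on: [:a], keep_unmatched: true)
# => [{a: 1, b: "one", c: "x"}, {a: 2, b: "two", c: nil}]
```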

Encoding

One-hot encoding

df.one_hot

Drop a variable in each category to avoid the dummy variable trap

df.one_hot(drop: true)
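
The trap arises because the indicator columns for a category always sum to 1 in every row, so any one of them is redundant once the others are known. A plain-Ruby sketch of the idea follows. `one_hot_columns` is a hypothetical helper, and dropping the first category is an assumption made for illustration; which category Rover drops and how it names the new columns are not specified here.

```ruby
# Hypothetical helper: one 0/1 indicator array per distinct value.
def one_hot_columns(values, drop: false)
  categories = values.uniq
  # Assumption for this sketch: drop the first category seen.
  categories = categories.drop(1) if drop
  categories.to_h do |category|
    [category, values.map { |v| v == category ? 1 : 0 }]
  end
end

one_hot_columns(%w[x y x])
# => {"x" => [1, 0, 1], "y" => [0, 1, 0]}
one_hot_columns(%w[x y x], drop: true)
# => {"y" => [0, 1, 0]}
```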

Conversion

Array of hashes

df.to_a

Hash of arrays

df.to_h

Numo array

df.to_numo

CSV

df.to_csv

Parquet (requires the red-parquet gem)

df.to_parquet

Types

You can specify column types when creating a data frame

Rover::DataFrame.new(data, types: {"a" => :int64, "b" => :float64})

Or

Rover.read_csv("data.csv", types: {"a" => :int64, "b" => :float64})

Supported types are:

  • boolean - :bool
  • float - :float64, :float32
  • integer - :int64, :int32, :int16, :int8
  • unsigned integer - :uint64, :uint32, :uint16, :uint8
  • object - :object
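
When choosing an integer type, the bit width in its name determines the range of values it can hold. The arithmetic can be sketched as follows (`int_range` is a hypothetical helper, not part of Rover):

```ruby
# Hypothetical helper: value range for a two's-complement integer of `bits` width.
def int_range(bits, signed: true)
  if signed
    (-(2**(bits - 1)))..(2**(bits - 1) - 1)
  else
    0..(2**bits - 1)
  end
end

int_range(8)                 # -128..127 (:int8)
int_range(8, signed: false)  # 0..255 (:uint8)
int_range(16)                # -32768..32767 (:int16)
```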

Get column types

df.types

For a specific column

df[:a].type

Change the type of a column

df[:a].to!(:int32)

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone https://github.com/ankane/rover.git
cd rover
bundle install
bundle exec rake test