Rover
Simple, powerful data frames for Ruby
:mountain: Designed for data exploration and machine learning, and powered by numo-narray-alt
:evergreen_tree: Uses Vega for visualization
Installation
Add this line to your application’s Gemfile:
gem "rover-df"
Intro
A data frame is an in-memory table, a useful data structure for data analysis and machine learning. Data is stored by column, which makes operations on columns fast.
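To make the columnar idea concrete, here is a minimal plain-Ruby sketch (no gems required). It only illustrates the layout and is not Rover's implementation; the variable names are made up for the example.

```ruby
# Row-oriented layout: one hash per row.
rows = [
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
]

# Column-oriented layout: one array per column.
columns = {
  a: rows.map { |r| r[:a] },
  b: rows.map { |r| r[:b] }
}

# A column-wise operation reads a single array...
sum_columnar = columns[:a].sum

# ...while the row-oriented version has to visit every row hash.
sum_rows = rows.sum { |r| r[:a] }
```

Both sums are equal; the difference is that the columnar version never looks at column b.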
Creating Data Frames
From an array
Rover::DataFrame.new([
{a: 1, b: "one"},
{a: 2, b: "two"},
{a: 3, b: "three"}
])
From a hash
Rover::DataFrame.new({
a: [1, 2, 3],
b: ["one", "two", "three"]
})
From Active Record
Rover::DataFrame.new(User.all)
From a CSV
Rover.read_csv("file.csv")
# or
Rover.parse_csv("CSV,data,string")
From Parquet (requires the red-parquet gem)
Rover.read_parquet("file.parquet")
# or
Rover.parse_parquet("PAR1...")
Attributes
Get number of rows
df.count
Get column names
df.keys
Check if a column exists
df.include?(name)
Selecting Data
Select a column
df[:a]
Select multiple columns
df[[:a, :b]]
Select first rows
df.head
# or
df.first(5)
Select last rows
df.tail
# or
df.last(5)
Select rows by index
df[1]
# or
df[1..3]
# or
df[[1, 4, 5]]
Iterate over rows
df.each_row { |row| ... }
Iterate over a column
df[:a].each { |item| ... }
# or
df[:a].each_with_index { |item, index| ... }
Filtering
Filter on a condition
df[df[:a] == 100]
df[df[:a] != 100]
df[df[:a] > 100]
df[df[:a] >= 100]
df[df[:a] < 100]
df[df[:a] <= 100]
In
df[df[:a].in?([1, 2, 3])]
df[df[:a].in?(1..3)]
df[df[:a].in?(["a", "b", "c"])]
Not in
df[!df[:a].in?([1, 2, 3])]
And, or, and exclusive or
df[(df[:a] > 100) & (df[:b] == "one")] # and
df[(df[:a] > 100) | (df[:b] == "one")] # or
df[(df[:a] > 100) ^ (df[:b] == "one")] # xor
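The filters above can be read as boolean masks: a comparison produces one true/false value per row, and indexing with the mask keeps the rows marked true. The plain-Ruby sketch below mimics that with arrays; it is an illustration of the idea, not Rover's internals. It also shows why the parentheses matter: in Ruby, & binds more tightly than > and ==, so each comparison has to be wrapped before combining.

```ruby
a = [50, 150, 200]
b = ["one", "two", "one"]

# Each comparison yields one boolean per row.
mask_a = a.map { |v| v > 100 }      # [false, true, true]
mask_b = b.map { |v| v == "one" }   # [true, false, true]

# Combine the masks element by element.
and_mask = mask_a.zip(mask_b).map { |x, y| x & y }  # [false, false, true]
or_mask  = mask_a.zip(mask_b).map { |x, y| x | y }  # [true, true, true]
xor_mask = mask_a.zip(mask_b).map { |x, y| x ^ y }  # [true, true, false]

# Keep only the rows whose mask entry is true.
kept = a.zip(b).select.with_index { |_, i| and_mask[i] }  # [[200, "one"]]
```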
Operations
Basic operations
df[:a] + 5
df[:a] - 5
df[:a] * 5
df[:a] / 5
df[:a] % 5
df[:a] ** 2
df[:a].sqrt
df[:a].cbrt
df[:a].abs
Rounding
df[:a].round
df[:a].ceil
df[:a].floor
Logarithm
df[:a].ln # or log
df[:a].log(5)
df[:a].log10
df[:a].log2
Exponentiation
df[:a].exp
df[:a].exp2
Trigonometric functions
df[:a].sin
df[:a].cos
df[:a].tan
df[:a].asin
df[:a].acos
df[:a].atan
Hyperbolic functions
df[:a].sinh
df[:a].cosh
df[:a].tanh
df[:a].asinh
df[:a].acosh
df[:a].atanh
Error function
df[:a].erf
df[:a].erfc
Summary statistics
df[:a].count
df[:a].sum
df[:a].mean
df[:a].median
df[:a].percentile(90)
df[:a].min
df[:a].max
df[:a].std
df[:a].var
Count occurrences
df[:a].tally
Cross tabulation
df[:a].crosstab(df[:b])
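A cross tabulation counts how often each pair of values occurs across two columns. A minimal plain-Ruby sketch of that computation follows. The nested-hash result is chosen for the example only; the shape of what Rover returns is not shown here.

```ruby
a = ["x", "x", "y"]
b = ["p", "q", "p"]

# Count each (a, b) pair; missing pairs default to 0.
counts = Hash.new(0)
a.zip(b).each { |pair| counts[pair] += 1 }

# One entry per distinct value of a, one inner entry per distinct value of b.
table = a.uniq.to_h do |row|
  [row, b.uniq.to_h { |col| [col, counts[[row, col]]] }]
end
```

Here table["y"]["q"] is 0 because that combination never appears in the data.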
Grouping
Group
df.group(:a).count
Works with all summary statistics
df.group(:a).max(:b)
Multiple groups
df.group(:a, :b).count
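Conceptually, grouping splits the rows by the value of one or more columns and then computes a summary per group. The plain-Ruby sketch below shows the same idea with group_by; it is an analogy, not Rover's implementation, and the sample rows are made up.

```ruby
rows = [
  {a: "x", b: 1},
  {a: "x", b: 5},
  {a: "y", b: 3}
]

# Split rows by the value of column a.
groups = rows.group_by { |r| r[:a] }

# Rows per group (compare with group(:a).count).
counts = groups.transform_values(&:size)

# Largest b per group (compare with group(:a).max(:b)).
max_b = groups.transform_values { |rs| rs.map { |r| r[:b] }.max }

# Grouping by several columns uses a composite key.
pair_counts = rows.group_by { |r| [r[:a], r[:b]] }.transform_values(&:size)
```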
Visualization
Add Vega to your application’s Gemfile:
gem "vega"
And use:
df.plot(:a, :b)
Specify the chart type (line, pie, column, bar, area, or scatter)
df.plot(:a, :b, type: "pie")
Group data
df.plot(:a, :b, group: :c)
Stacked columns or bars
df.plot(:a, :b, group: :c, stacked: true)
Updating Data
Add a new column
df[:a] = 1
# or
df[:a] = [1, 2, 3]
Update a single element
df[:a][0] = 100
Update multiple elements
df[:a][0..2] = 1
# or
df[:a][0..2] = [1, 2, 3]
Update all elements
df[:a] = df[:a].map { |v| v.gsub("a", "b") }
# or
df[:a].map! { |v| v.gsub("a", "b") }
Update elements matching a condition
df[:a][df[:a] > 100] = 0
Clamp
df[:a].clamp!(0, 100)
Delete columns
df.delete(:a)
# or
df.except!(:a, :b)
Rename columns
df.rename(a: :new_a, b: :new_b)
# or
df[:new_a] = df.delete(:a)
Sort rows
df.sort_by! { |r| r[:a] }
Clear all data
df.clear
Combining Data Frames
Add rows
df.concat(other_df)
Add columns
df.merge!(other_df)
Inner join
df.inner_join(other_df)
# or
df.inner_join(other_df, on: :a)
# or
df.inner_join(other_df, on: [:a, :b])
# or
df.inner_join(other_df, on: {df_col: :other_df_col})
Left join
df.left_join(other_df)
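The difference between the two joins is which rows survive: an inner join keeps only rows with a match in both data frames, while a left join keeps every row from the left side. The plain-Ruby sketch below shows this on arrays of hashes joined on column a. It is an illustration only; using nil for unmatched values is an assumption of the sketch, not a statement about how Rover represents them.

```ruby
left = [
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
]
right = [
  {a: 1, c: "x"},
  {a: 3, c: "z"},
  {a: 4, c: "w"}
]

# Index the right side by the join key for quick lookup.
index = right.group_by { |r| r[:a] }

# Inner join: drop left rows that have no match.
inner = left.flat_map { |l| (index[l[:a]] || []).map { |r| l.merge(r) } }

# Left join: keep unmatched left rows, filling c with nil (assumption).
left_joined = left.flat_map do |l|
  matches = index[l[:a]]
  matches ? matches.map { |r| l.merge(r) } : [l.merge(c: nil)]
end
```

The row with a: 2 appears only in left_joined, and the row with a: 4 from the right side appears in neither result.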
Encoding
One-hot encoding
df.one_hot
Drop a variable in each category to avoid the dummy variable trap
df.one_hot(drop: true)
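One-hot encoding turns a categorical column into one 0/1 column per category. With all categories kept, the new columns sum to 1 in every row, so any one of them can be derived from the others; dropping one removes that redundancy, which is the dummy variable trap mentioned above. A plain-Ruby sketch of the idea follows. The column naming (color_red and so on) and the choice to drop the first category are assumptions made for the example, not Rover's documented behavior.

```ruby
colors = ["red", "green", "red", "blue"]
categories = colors.uniq

# One 0/1 column per category.
one_hot = categories.to_h do |c|
  ["color_#{c}", colors.map { |v| v == c ? 1 : 0 }]
end

# Every row has exactly one 1 across the full set of columns.
row_sums = one_hot.values.transpose.map(&:sum)

# Drop one category (here, the first) to remove the redundancy.
dropped = one_hot.reject { |name, _| name == "color_#{categories.first}" }
```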
Conversion
Array of hashes
df.to_a
Hash of arrays
df.to_h
Numo array
df.to_numo
CSV
df.to_csv
Parquet (requires the red-parquet gem)
df.to_parquet
Types
You can specify column types when creating a data frame
Rover::DataFrame.new(data, types: {"a" => :int64, "b" => :float64})
Or
Rover.read_csv("data.csv", types: {"a" => :int64, "b" => :float64})
Supported types are:
- boolean - :bool
- float - :float64, :float32
- integer - :int64, :int32, :int16, :int8
- unsigned integer - :uint64, :uint32, :uint16, :uint8
- object - :object
Get column types
df.types
For a specific column
df[:a].type
Change the type of a column
df[:a].to!(:int32)
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/rover.git
cd rover
bundle install
bundle exec rake test