spark-connect (Ruby)


A pure-Ruby client for Apache Spark Connect - the gRPC-based, decoupled client/server protocol for Apache Spark.

spark-connect lets you build and run Spark DataFrame queries from Ruby against a remote Spark cluster, with an API that closely mirrors PySpark. No JVM, no local Spark installation, no spark-submit - just a gRPC connection to a Spark Connect server.

require "spark-connect"

spark = SparkConnect::SparkSession.builder
                                  .remote("sc://localhost:15002")
                                  .get_or_create

F = SparkConnect::F

spark.range(1, 1_000)
     .select(F.col("id"), (F.col("id") % 3).alias("bucket"))
     .group_by("bucket")
     .agg(F.count("*").alias("n"), F.sum("id").alias("total"))
     .order_by("bucket")
     .show

spark.stop

Output:
+------+---+------+
|bucket|  n| total|
+------+---+------+
|     0|333|166833|
|     1|333|166167|
|     2|333|166500|
+------+---+------+
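
The figures in that table can be cross-checked without a cluster. The sketch below is plain Ruby with no Spark involved; it assumes `spark.range(1, 1_000)` is end-exclusive, as in PySpark, and recomputes the same buckets locally.

```ruby
# Plain-Ruby cross-check of the quickstart aggregation (no Spark involved).
ids = (1...1_000) # end-exclusive range: 1 through 999

summary = ids.group_by { |id| id % 3 }            # bucket => [ids...]
             .sort_by { |bucket, _members| bucket } # order_by("bucket")
             .map { |bucket, members| [bucket, members.size, members.sum] }

# Each entry is [bucket, n, total].
summary.each { |bucket, n, total| puts format("%6d|%3d|%6d", bucket, n, total) }
# summary == [[0, 333, 166833], [1, 333, 166167], [2, 333, 166500]]
```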

What it supports

spark-connect implements the Spark Connect DataFrame, SQL, Structured Streaming, and Declarative Pipelines APIs -- everything except user-defined functions (UDFs) and the foreach/foreachBatch streaming sinks, which depend on the server executing user code written in the client's language, and Spark ships no Ruby worker runtime. (The separate, experimental MLlib-over-Connect surface is also out of scope.)

Results decode through Apache Arrow into ordered, name-addressable Rows. Method names are snake_case (idiomatic Ruby) with camelCase aliases for the common PySpark names (groupBy, withColumn, orderBy, createDataFrame, ...), so PySpark code ports almost verbatim.
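
To illustrate the two ideas in that paragraph, here is a minimal sketch in plain Ruby. `SketchRow` and `SketchFrame` are hypothetical stand-ins written for this example, not the gem's classes; they only show how a row can be both ordered and name-addressable, and how `alias_method` can expose a camelCase name next to a snake_case one.

```ruby
# Hypothetical stand-ins for illustration; not the gem's real classes.
class SketchRow
  def initialize(names, values)
    @names = names
    @values = values
  end

  # Integer keys give positional access; String/Symbol keys look up by column name.
  def [](key)
    return @values[key] if key.is_a?(Integer)

    index = @names.index(key.to_s)
    raise KeyError, "no such column: #{key}" if index.nil?

    @values[index]
  end

  # Column order is preserved because Ruby hashes keep insertion order.
  def to_h
    @names.zip(@values).to_h
  end
end

class SketchFrame
  def with_column(name, value)
    "with_column(#{name}, #{value})"
  end

  def group_by(*cols)
    "group_by(#{cols.join(', ')})"
  end

  # camelCase aliases dispatch to the same method bodies.
  alias_method :withColumn, :with_column
  alias_method :groupBy, :group_by
end

row = SketchRow.new(%w[bucket n total], [0, 333, 166_833])
puts row[0]        # positional access
puts row["total"]  # access by name
puts SketchFrame.new.groupBy("dept") == SketchFrame.new.group_by("dept")
```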

Requirements

  • Ruby >= 3.1
  • Apache Arrow C++/GLib system libraries (required by the red-arrow dependency).
  • A reachable Spark Connect server. This client is generated against the Spark Connect 4.1 protocol and supports Apache Spark 3.5 and above.

See the installation guide for details.
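
The two version floors above can be checked programmatically. This is a minimal sketch using `Gem::Version`, which ships with Ruby; the helper name `meets_minimum?` and both constants are made up for this example, and the Spark version string would have to come from wherever you record your server's version.

```ruby
# Gem::Version is part of RubyGems, which Ruby loads by default.
MIN_RUBY  = Gem::Version.new("3.1")
MIN_SPARK = Gem::Version.new("3.5")

# Hypothetical helper: true when version_string is at least the given minimum.
def meets_minimum?(version_string, minimum)
  Gem::Version.new(version_string) >= minimum
end

unless meets_minimum?(RUBY_VERSION, MIN_RUBY)
  warn "Ruby #{RUBY_VERSION} is older than the required #{MIN_RUBY}"
end

puts meets_minimum?("3.5.1", MIN_SPARK) # a Spark 3.5.x server passes
puts meets_minimum?("3.4.3", MIN_SPARK) # a Spark 3.4.x server does not
```

Comparing through `Gem::Version` rather than as strings avoids mistakes such as "3.10" sorting below "3.5".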

Installation

gem install spark-connect

Or in a Gemfile:

gem "spark-connect"

Running a local Spark Connect server

# Download a Spark distribution (4.1.0 shown here; 3.5+ also works)
curl -fsSL https://archive.apache.org/dist/spark/spark-4.1.0/spark-4.1.0-bin-hadoop3.tgz | tar xz
cd spark-4.1.0-bin-hadoop3

# Start the Connect server (requires Java 17+)
./sbin/start-connect-server.sh --jars "$(pwd)/jars/spark-connect_2.13-4.1.0.jar"

The server listens on sc://localhost:15002 by default.
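
Before pointing a session at the server, it can help to confirm that something is accepting connections on that port. The sketch below uses only Ruby's standard `socket` library; `port_open?` is a hypothetical helper, and it checks TCP reachability only, not that the listener actually speaks Spark Connect.

```ruby
require "socket"

# Hypothetical helper: true if a TCP connection to host:port succeeds in time.
def port_open?(host, port, timeout: 2)
  # With a block, Socket.tcp closes the socket afterwards and returns the block's value.
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  # Connection refused, timed out, unreachable host, or name resolution failure.
  false
end

if port_open?("localhost", 15_002)
  puts "Something is listening on localhost:15002"
else
  puts "Nothing is listening on localhost:15002 yet"
end
```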

Connecting

Connection strings follow the standard Spark Connect grammar:

# Plaintext, local
SparkConnect::SparkSession.builder.remote("sc://localhost:15002").get_or_create

# TLS + bearer token (token implies SSL)
SparkConnect::SparkSession.builder
  .remote("sc://spark.example.com:443/;token=#{ENV['SPARK_TOKEN']};user_id=alice")
  .get_or_create

Supported parameters: token, user_id, user_agent, use_ssl, session_id, and any x-* custom gRPC headers.
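
To make the shape of these strings concrete, here is a rough sketch of a parser in plain Ruby. It is not the gem's parser: `parse_remote` and `DEFAULT_CONNECT_PORT` are names invented for this example, IPv6 hosts and percent-encoding are ignored, and the SSL rule simply encodes the "token implies SSL" note above.

```ruby
DEFAULT_CONNECT_PORT = 15_002

# Hypothetical helper: splits "sc://host[:port]/;key=value;..." into its parts.
def parse_remote(url)
  match = %r{\Asc://(?<host>[^/:;]+)(?::(?<port>\d+))?(?:/(?<rest>.*))?\z}.match(url)
  raise ArgumentError, "not a Spark Connect URL: #{url}" if match.nil?

  # Everything after the slash is a list of ;key=value pairs.
  params = match[:rest].to_s.split(";").reject(&:empty?).to_h do |pair|
    key, value = pair.split("=", 2)
    raise ArgumentError, "malformed parameter: #{pair}" if value.nil?

    [key, value]
  end

  {
    host: match[:host],
    port: (match[:port] || DEFAULT_CONNECT_PORT).to_i,
    params: params,
    # Per the note above, supplying a token implies SSL.
    use_ssl: params["use_ssl"] == "true" || params.key?("token")
  }
end

p parse_remote("sc://localhost:15002")
p parse_remote("sc://spark.example.com:443/;token=abc;user_id=alice")
```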

A quick tour

F = SparkConnect::F
T = SparkConnect::Types

# Build a DataFrame from local Ruby data
df = spark.create_data_frame([
  { "name" => "alice", "dept" => "eng", "salary" => 120 },
  { "name" => "bob",   "dept" => "eng", "salary" => 100 },
  { "name" => "carol", "dept" => "ops", "salary" => 110 },
])

# Transform and aggregate
df.where(F.col("salary") >= 105)
  .group_by("dept")
  .agg(F.avg("salary").alias("avg_salary"), F.count("*").alias("headcount"))
  .order_by(F.col("avg_salary").desc)
  .show

# Window functions
w = SparkConnect::Window.partition_by("dept").order_by(F.col("salary").desc)
df.with_column("rank", F.rank.over(w)).show

# Schemas
df.print_schema
df.schema.simple_string  #=> "struct<name:string,dept:string,salary:bigint>"

# SQL with parameters
spark.sql("SELECT * FROM VALUES (1), (2), (3) AS t(x) WHERE x > :min", { min: 1 }).show
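
When porting or debugging, it can be useful to know what the tour's filter, aggregation, and window steps should produce. The sketch below recomputes them locally with plain Ruby collections on the same three rows; it does not touch Spark, and it assumes Spark's usual `rank` semantics (one plus the number of rows in the partition that sort strictly ahead).

```ruby
rows = [
  { "name" => "alice", "dept" => "eng", "salary" => 120 },
  { "name" => "bob",   "dept" => "eng", "salary" => 100 },
  { "name" => "carol", "dept" => "ops", "salary" => 110 }
]

# where(salary >= 105), then group_by(dept)
kept = rows.select { |r| r["salary"] >= 105 }
by_dept = kept.group_by { |r| r["dept"] }

# agg(avg(salary), count(*)), then order by avg_salary descending
aggregated = by_dept.map { |dept, members|
  { "dept" => dept,
    "avg_salary" => members.sum { |r| r["salary"] }.fdiv(members.size),
    "headcount" => members.size }
}.sort_by { |r| -r["avg_salary"] }

# rank() over (partition by dept order by salary desc):
# 1 + number of rows in the same dept with a strictly higher salary.
ranked = rows.map do |r|
  higher = rows.count { |o| o["dept"] == r["dept"] && o["salary"] > r["salary"] }
  r.merge("rank" => higher + 1)
end

aggregated.each { |r| puts r.inspect }
ranked.each { |r| puts r.inspect }
```

Here only alice and carol pass the filter, so each department has a headcount of 1, and bob ranks second within eng.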

Documentation

Full documentation, including guides for every part of the API, lives at https://hyukjinkwon.github.io/spark-connect-ruby/.

Runnable examples/ cover quickstart, transformations, aggregations, joins, window functions, SQL, reading/writing, local data, and NA/stat helpers.

Compatibility

The client is generated against the Spark Connect 4.1 protocol and supports Apache Spark 3.5 and above (the Spark Connect wire protocol is backward compatible across these releases).

Development

git clone https://github.com/HyukjinKwon/spark-connect-ruby
cd spark-connect-ruby
bundle install

bundle exec rake spec      # unit specs (no server required)
bundle exec rake rubocop   # lint
bundle exec rake yard      # API docs

# Integration specs against a live server
SPARK_REMOTE=sc://localhost:15002 bundle exec rspec spec/integration

# Regenerate the protobuf/gRPC stubs from the vendored .proto files
bin/generate-protos

See CONTRIBUTING.md.