Class: Polars::LazyFrame

Inherits:
Object
Defined in:
lib/polars/lazy_frame.rb

Overview

Representation of a lazy computation graph/query against a DataFrame.

Constructor Details

#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false, height: nil) ⇒ LazyFrame

Create a new LazyFrame.



# File 'lib/polars/lazy_frame.rb', line 8

def initialize(
  data = nil,
  schema: nil,
  schema_overrides: nil,
  strict: true,
  orient: nil,
  infer_schema_length: N_INFER_DEFAULT,
  nan_to_null: false,
  height: nil
)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      strict: strict,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null,
      height: height
    )
    .lazy
    ._ldf
  )
end

Class Method Details

.deserialize(source, format: "binary") ⇒ LazyFrame

Note:

This function uses marshaling if the logical plan contains Ruby UDFs, and as such inherits the security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.

Note:

Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.

Read a logical plan from a file to construct a LazyFrame.

Examples:

lf = Polars::LazyFrame.new({"a" => [1, 2, 3]}).sum
bytes = lf.serialize
Polars::LazyFrame.deserialize(StringIO.new(bytes)).collect
# =>
# shape: (1, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 6   │
# └─────┘

Parameters:

  • source (Object)

    Path to a file or a file-like object, meaning any object that has a read method, such as a File or StringIO.

  • format ('binary', 'json') (defaults to: "binary")

    The format with which the LazyFrame was serialized. Options:

    • "binary": Deserialize from binary format (bytes). This is the default.
    • "json": Deserialize from JSON format (string).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 76

def self.deserialize(source, format: "binary")
  if Utils.pathlike?(source)
    source = Utils.normalize_filepath(source)
  end

  if format == "binary"
    raise Todo unless RbLazyFrame.respond_to?(:deserialize_binary)
    deserializer = RbLazyFrame.method(:deserialize_binary)
  elsif format == "json"
    deserializer = RbLazyFrame.method(:deserialize_json)
  else
    msg = "`format` must be one of {{'binary', 'json'}}, got #{format.inspect}"
    raise ArgumentError, msg
  end

  _from_rbldf(deserializer.(source))
end
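
The source parameter accepts either a path or an object with a read method. That distinction can be sketched in plain Ruby as follows. The helper read_plan_bytes is hypothetical and written only for illustration; it is not part of Polars and makes no claim about how the native deserializer consumes its input.

```ruby
require "stringio"

# Hypothetical helper (not a Polars API): return the raw bytes behind a
# source that is either a file-like object or a filesystem path.
def read_plan_bytes(source)
  if source.respond_to?(:read)
    # File-like: anything with a #read method, e.g. File or StringIO.
    source.read
  else
    # Path-like: treat the value as a path and read the file in binary mode.
    File.binread(source.to_s)
  end
end

io = StringIO.new("plan-bytes".b)
puts read_plan_bytes(io).bytesize
```

Duck typing on read is what lets a StringIO wrapping the output of serialize stand in for a file on disk, as in the example above.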

Instance Method Details

#bottom_k(k, by:, reverse: false) ⇒ LazyFrame

Return the k smallest rows.

Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this method if you need sorted output.

Examples:

Get the rows which contain the 4 smallest values in column b.

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [2, 1, 1, 3, 2, 1]
  }
)
lf.bottom_k(4, by: "b").collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 1   │
# │ a   ┆ 1   │
# │ c   ┆ 1   │
# │ a   ┆ 2   │
# └─────┴─────┘

Get the rows which contain the 4 smallest values when sorting on columns a and b.

lf.bottom_k(4, by: ["a", "b"]).collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 1   │
# │ a   ┆ 2   │
# │ b   ┆ 1   │
# │ b   ┆ 2   │
# └─────┴─────┘

Parameters:

  • k (Integer)

    Number of rows to return.

  • by (Object)

    Column(s) used to determine the bottom rows. Accepts expression input. Strings are parsed as column names.

  • reverse (Object) (defaults to: false)

    Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing an array of booleans.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 902

def bottom_k(
  k,
  by:,
  reverse: false
)
  by = Utils.parse_into_list_of_expressions(by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  _from_rbldf(_ldf.bottom_k(k, by, reverse))
end
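
The selection rules described for bottom_k (multiple by columns, a reverse flag broadcast per column, nulls never preferred) can be modeled in plain Ruby. This is only a sketch over an array of row hashes; bottom_k_rows and compare_nulls_last are hypothetical names, and the real method runs inside the Polars engine.

```ruby
# Compare two values so that nil always loses, whatever the direction.
def compare_nulls_last(a, b, reverse)
  return 0 if a.nil? && b.nil?
  return 1 if a.nil?
  return -1 if b.nil?
  reverse ? b <=> a : a <=> b
end

# Pure-Ruby model of bottom_k over an array of row hashes.
def bottom_k_rows(rows, k, by:, reverse: false)
  keys = Array(by)
  # Broadcast a single boolean to one flag per key column.
  flags = reverse.is_a?(Array) ? reverse : [reverse] * keys.length
  raise ArgumentError, "reverse must match by in length" unless flags.length == keys.length

  sorted = rows.sort do |x, y|
    # The first key column that differs decides the ordering.
    cmps = keys.zip(flags).map { |key, rev| compare_nulls_last(x[key], y[key], rev) }
    cmps.find { |c| c != 0 } || 0
  end
  sorted.first(k)
end

rows = [
  {"a" => "a", "b" => 2}, {"a" => "b", "b" => 1}, {"a" => "a", "b" => 1},
  {"a" => "b", "b" => 3}, {"a" => "b", "b" => 2}, {"a" => "c", "b" => 1}
]
p bottom_k_rows(rows, 4, by: ["a", "b"])
```

Ruby's sort is not stable, so rows that tie on every key may come back in any order. That mirrors the note above that the output order of bottom_k is not guaranteed.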

#cache ⇒ LazyFrame

Cache the result once the execution of the physical plan hits this node.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1915

def cache
  _from_rbldf(_ldf.cache)
end

#cast(dtypes, strict: true) ⇒ LazyFrame

Cast LazyFrame column(s) to the specified dtype(s).

Examples:

Cast specific frame columns to the specified dtypes:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => [Date.new(2020, 1, 2), Date.new(2021, 3, 4), Date.new(2022, 5, 6)]
  }
)
lf.cast({"foo" => Polars::Float32, "bar" => Polars::UInt8}).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬────────────┐
# │ foo ┆ bar ┆ ham        │
# │ --- ┆ --- ┆ ---        │
# │ f32 ┆ u8  ┆ date       │
# ╞═════╪═════╪════════════╡
# │ 1.0 ┆ 6   ┆ 2020-01-02 │
# │ 2.0 ┆ 7   ┆ 2021-03-04 │
# │ 3.0 ┆ 8   ┆ 2022-05-06 │
# └─────┴─────┴────────────┘

Cast all frame columns matching one dtype (or dtype group) to another dtype:

lf.cast({Polars::Date => Polars::Datetime}).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────────────────────┐
# │ foo ┆ bar ┆ ham                 │
# │ --- ┆ --- ┆ ---                 │
# │ i64 ┆ f64 ┆ datetime[μs]        │
# ╞═════╪═════╪═════════════════════╡
# │ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
# │ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
# │ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
# └─────┴─────┴─────────────────────┘

Cast all frame columns to the specified dtype:

lf.cast(Polars::String).collect.to_h(as_series: false)
# => {"foo"=>["1", "2", "3"], "bar"=>["6.0", "7.0", "8.0"], "ham"=>["2020-01-02", "2021-03-04", "2022-05-06"]}

Parameters:

  • dtypes (Hash)

    Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.

  • strict (Boolean) (defaults to: true)

    Raise an error if a cast could not be done (for instance, due to an overflow).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1968

def cast(dtypes, strict: true)
  if !dtypes.is_a?(Hash)
    return _from_rbldf(_ldf.cast_all(dtypes, strict))
  end

  cast_map = {}
  dtypes.each do |c, dtype|
    dtype = Utils.parse_into_dtype(dtype)
    cast_map.merge!(
      c.is_a?(::String) ? {c => dtype} : Utils.expand_selector(self, c).to_h { |x| [x, dtype] }
    )
  end

  _from_rbldf(_ldf.cast(cast_map, strict))
end
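
The three accepted shapes of dtypes (a single dtype, a mapping keyed by column name, and a mapping keyed by something that selects columns) all reduce to one per-column map. The sketch below models that expansion in plain Ruby. It is an illustration under assumptions: build_cast_map is a hypothetical helper, dtypes are represented as symbols, and a non-string key is treated as "match every column with this dtype", which simplifies what a real selector can express.

```ruby
# schema: {"column name" => dtype symbol}
# dtypes: a single dtype symbol, or a Hash keyed by column name or by dtype.
def build_cast_map(schema, dtypes)
  # A single dtype applies to every column.
  return schema.keys.to_h { |name| [name, dtypes] } unless dtypes.is_a?(Hash)

  cast_map = {}
  dtypes.each do |key, target|
    if key.is_a?(String)
      # String keys name one column directly.
      cast_map[key] = target
    else
      # Any other key is expanded to all columns whose dtype matches it.
      schema.each { |name, dtype| cast_map[name] = target if dtype == key }
    end
  end
  cast_map
end

schema = {"foo" => :i64, "bar" => :f64, "ham" => :date}
p build_cast_map(schema, {"foo" => :f32, "bar" => :u8})
p build_cast_map(schema, {date: :datetime})
p build_cast_map(schema, :str)
```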

#clear(n = 0) ⇒ LazyFrame

Create an empty copy of the current LazyFrame.

The copy has an identical schema but no data.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil],
  }
).lazy
lf.clear.collect
# =>
# shape: (0, 3)
# ┌─────┬─────┬──────┐
# │ a   ┆ b   ┆ c    │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ f64 ┆ bool │
# ╞═════╪═════╪══════╡
# └─────┴─────┴──────┘
lf.clear(2).collect
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ a    ┆ b    ┆ c    │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ f64  ┆ bool │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2020

def clear(n = 0)
  DataFrame.new(schema: schema).clear(n).lazy
end
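
As the example output shows, clear keeps the schema and produces n rows that are entirely null (none when n is 0). A plain-Ruby model of that result, using a hypothetical clear_columns helper over a name-to-dtype hash, might look like this:

```ruby
# Build column data for a "cleared" frame: same column names, n nil rows each.
def clear_columns(schema, n = 0)
  schema.keys.to_h { |name| [name, Array.new(n)] }
end

schema = {"a" => :i64, "b" => :f64, "c" => :bool}
p clear_columns(schema)
p clear_columns(schema, 2)
```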

#collect(engine: "auto", background: false, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame

Materialize this LazyFrame into a DataFrame.

By default, all query optimizations are enabled. The optimization passes that run can be adjusted through the optimizations parameter.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.all.sum).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ a   ┆ 4   ┆ 10  │
# │ b   ┆ 11  ┆ 10  │
# │ c   ┆ 6   ┆ 1   │
# └─────┴─────┴─────┘

Parameters:

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • background (Boolean) (defaults to: false)

    Run the query in the background and get a handle to the query. This handle can be used to fetch the result or cancel the query.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.

Returns:

  • (DataFrame)

# File 'lib/polars/lazy_frame.rb', line 1019

def collect(
  engine: "auto",
  background: false,
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if engine == "streaming"
    Utils.issue_unstable_warning("streaming mode is considered unstable.")
  end

  ldf = _ldf.with_optimizations(optimizations._rboptflags)
  if background
    Utils.issue_unstable_warning("background mode is considered unstable.")
    return InProcessQuery.new(ldf.collect_concurrently)
  end

  Utils.wrap_df(ldf.collect(engine))
end

#collect_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Note:

This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.

Evaluate the query in streaming mode and get a generator that returns chunks.

This allows streaming results that are larger than RAM to be written to disk.

The query will always be fully executed unless stop is called, so you should call next until all chunks have been seen.

Parameters:

  • chunk_size (Integer) (defaults to: nil)

    The number of rows that are buffered before a chunk is given.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • lazy (Boolean) (defaults to: false)

    Start the query when first requesting a batch.

  • engine (String) (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (Object) (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

Returns:

  • (Object)

# File 'lib/polars/lazy_frame.rb', line 1875

def collect_batches(
  chunk_size: nil,
  maintain_order: true,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  ldf = _ldf.with_optimizations(optimizations._rboptflags)
  inner = ldf.collect_batches(
    engine,
    maintain_order,
    chunk_size,
    lazy
  )
  CollectBatches.new(inner)
end
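
The advice to keep calling next until every chunk has been seen follows the usual external-iterator pattern. The interface of the object returned by collect_batches is not shown in this section, so the sketch below uses a plain Ruby Enumerator as a stand-in; drain_batches is a hypothetical helper and each_slice only imitates chunk_size buffering.

```ruby
# Pull every chunk from an external iterator, one next call at a time.
def drain_batches(rows, chunk_size)
  # Without a block, each_slice returns an Enumerator that yields chunks.
  batches = rows.each_slice(chunk_size)
  seen = []
  loop do
    # next raises StopIteration once the source is exhausted;
    # Kernel#loop rescues it and ends the loop cleanly.
    seen << batches.next
  end
  seen
end

p drain_batches((1..7).to_a, 3)
```

Processing each chunk as it arrives, rather than accumulating all of them as this sketch does, is what keeps memory use bounded for results larger than RAM.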

#collect_schema ⇒ Schema

Resolve the schema of this LazyFrame.

Examples:

Determine the schema.

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
lf.collect_schema
# => Polars::Schema({"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String})

Access various properties of the schema.

schema = lf.collect_schema
schema["bar"]
# => Polars::Float64
schema.names
# => ["foo", "bar", "ham"]
schema.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]
schema.length
# => 3

Returns:

  • (Schema)

# File 'lib/polars/lazy_frame.rb', line 1070

def collect_schema
  Schema.new(_ldf.collect_schema, check_dtypes: false)
end

#columns ⇒ Array

Get the column names.

Examples:

df = (
   Polars::DataFrame.new(
     {
       "foo" => [1, 2, 3],
       "bar" => [6, 7, 8],
       "ham" => ["a", "b", "c"]
     }
   )
   .lazy
   .select(["foo", "bar"])
)
df.columns
# => ["foo", "bar"]

Returns:

  • (Array)

# File 'lib/polars/lazy_frame.rb', line 112

def columns
  _ldf.collect_schema.keys
end

#count ⇒ LazyFrame

Return the number of non-null elements for each column.

Examples:

lf = Polars::LazyFrame.new(
  {"a" => [1, 2, 3, 4], "b" => [1, 2, 1, nil], "c" => [nil, nil, nil, nil]}
)
lf.count.collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 4   ┆ 3   ┆ 0   │
# └─────┴─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 5453

def count
  _from_rbldf(_ldf.count)
end
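
Counting non-null elements per column can be stated in a few lines of plain Ruby. This is a model of the semantics only, with non_null_counts as a hypothetical helper over a hash of column arrays:

```ruby
# For each column, count the entries that are not nil.
def non_null_counts(columns)
  columns.transform_values { |values| values.count { |v| !v.nil? } }
end

data = {"a" => [1, 2, 3, 4], "b" => [1, 2, 1, nil], "c" => [nil, nil, nil, nil]}
p non_null_counts(data)
```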

#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame

Note:

The median is included by default as the 50th percentile.

Note:

This method does not maintain the laziness of the frame, and will collect the final result. This could potentially be an expensive operation.

Note:

We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.

Creates a summary of statistics for a LazyFrame, returning a DataFrame.

Examples:

Show default frame statistics:

lf = Polars::LazyFrame.new(
  {
    "float" => [1.0, 2.8, 3.0],
    "int" => [40, 50, nil],
    "bool" => [true, false, true],
    "str" => ["zz", "xx", "yy"],
    "date" => [Date.new(2020, 1, 1), Date.new(2021, 7, 5), Date.new(2022, 12, 31)]
  }
)
lf.describe
# =>
# shape: (9, 6)
# ┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────────┐
# │ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                    │
# │ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                     │
# │ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                     │
# ╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════════╡
# │ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                       │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                       │
# │ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 UTC │
# │ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                    │
# │ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01              │
# │ 25%        ┆ 2.8      ┆ 40.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 50%        ┆ 2.8      ┆ 50.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 75%        ┆ 3.0      ┆ 50.0     ┆ null     ┆ null ┆ 2022-12-31              │
# │ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31              │
# └────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────────┘

Customize which percentiles are displayed, applying linear interpolation:

lf.describe(
  percentiles: [0.1, 0.3, 0.5, 0.7, 0.9],
  interpolation: "linear"
)
# =>
# shape: (11, 6)
# ┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────────┐
# │ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                    │
# │ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                     │
# │ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                     │
# ╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════════╡
# │ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                       │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                       │
# │ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 UTC │
# │ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                    │
# │ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01              │
# │ …          ┆ …        ┆ …        ┆ …        ┆ …    ┆ …                       │
# │ 30%        ┆ 2.08     ┆ 43.0     ┆ null     ┆ null ┆ 2020-11-26              │
# │ 50%        ┆ 2.8      ┆ 45.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 70%        ┆ 2.88     ┆ 47.0     ┆ null     ┆ null ┆ 2022-02-07              │
# │ 90%        ┆ 2.96     ┆ 49.0     ┆ null     ┆ null ┆ 2022-09-13              │
# │ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31              │
# └────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────────┘

Parameters:

  • percentiles (Array) (defaults to: [0.25, 0.5, 0.75])

    One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].

  • interpolation ('nearest', 'higher', 'lower', 'midpoint', 'linear', 'equiprobable') (defaults to: "nearest")

    Interpolation method used when calculating percentiles.

Returns:

  • (DataFrame)

# File 'lib/polars/lazy_frame.rb', line 394

def describe(
  percentiles: [0.25, 0.5, 0.75],
  interpolation: "nearest"
)
  schema = collect_schema.to_h

  if schema.empty?
    msg = "cannot describe a LazyFrame that has no columns"
    raise TypeError, msg
  end

  # create list of metrics
  metrics = ["count", "null_count", "mean", "std", "min"]
  if (quantiles = Utils.parse_percentiles(percentiles)).any?
    metrics.concat(quantiles.map { |q| "%g%%" % [q * 100] })
  end
  metrics.append("max")

  skip_minmax = lambda do |dt|
    dt.nested? || [Categorical, Enum, Null, Object, Unknown].include?(dt)
  end

  # determine which columns will produce std/mean/percentile/etc
  # statistics in a single pass over the frame schema
  has_numeric_result, sort_cols = Set.new, Set.new
  metric_exprs = []
  null = F.lit(nil)

  schema.each do |c, dtype|
    is_numeric = dtype.numeric?
    is_temporal = !is_numeric && dtype.temporal?

    # counts
    count_exprs = [
      F.col(c).count.name.prefix("count:"),
      F.col(c).null_count.name.prefix("null_count:")
    ]
    # mean
    mean_expr =
      if is_temporal || is_numeric || dtype == Boolean
        F.col(c).mean
      else
        null
      end

    # standard deviation, min, max
    expr_std = is_numeric ? F.col(c).std : null
    min_expr = !skip_minmax.(dtype) ? F.col(c).min : null
    max_expr = !skip_minmax.(dtype) ? F.col(c).max : null

    # percentiles
    pct_exprs = []
    quantiles.each do |p|
      if is_numeric || is_temporal
        pct_expr =
          if is_temporal
            F.col(c).to_physical.quantile(p, interpolation: interpolation).cast(dtype)
          else
            F.col(c).quantile(p, interpolation: interpolation)
          end
        sort_cols.add(c)
      else
        pct_expr = null
      end
      pct_exprs << pct_expr.alias("#{p}:#{c}")
    end

    if is_numeric || dtype.nested? || [Null, Boolean].include?(dtype)
      has_numeric_result.add(c)
    end

    # add column expressions (in end-state 'metrics' list order)
    metric_exprs.concat(
      [
        *count_exprs,
        mean_expr.alias("mean:#{c}"),
        expr_std.alias("std:#{c}"),
        min_expr.alias("min:#{c}"),
        *pct_exprs,
        max_expr.alias("max:#{c}")
      ]
    )
  end

  # calculate requested metrics in parallel, then collect the result
  df_metrics = (
    (
      # if more than one quantile, sort the relevant columns to make them O(1)
      # TODO: drop sort once we have efficient retrieval of multiple quantiles
      sort_cols ? with_columns(sort_cols.map { |c| F.col(c).sort }) : self
    )
    .select(*metric_exprs)
    .collect
  )

  # reshape wide result
  n_metrics = metrics.length
  column_metrics =
    schema.length.times.map do |n|
      df_metrics.row(0)[(n * n_metrics)...((n + 1) * n_metrics)]
    end

  summary = schema.keys.zip(column_metrics).to_h

  # cast by column type (numeric/bool -> float), (other -> string)
  schema.each_key do |c|
    summary[c] =
      summary[c].map do |v|
        if v.nil? || v.is_a?(Hash)
          nil
        else
          if has_numeric_result.include?(c)
            if v == true
              1.0
            elsif v == false
              0.0
            else
              v.to_f
            end
          else
            "#{v}"
          end
        end
      end
  end

  # return results as a DataFrame
  df_summary = Polars.from_hash(summary)
  df_summary.insert_column(0, Polars::Series.new("statistic", metrics))
  df_summary
end
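
Two details of describe are easy to check in isolation: how the statistic labels are built from percentiles, and what "linear" interpolation computes. The sketch below reuses the label format string from the source above and adds a textbook linear-interpolation quantile. Both describe_metric_labels and linear_quantile are hypothetical helpers, and the quantile function is an assumption about the definition rather than the engine's code, although it agrees with the example table.

```ruby
# Row labels in the order describe emits them.
def describe_metric_labels(percentiles)
  labels = ["count", "null_count", "mean", "std", "min"]
  # "%g" drops trailing zeros, so 0.25 becomes "25%" rather than "25.0%".
  labels.concat(percentiles.map { |q| "%g%%" % [q * 100] })
  labels << "max"
end

# Quantile with linear interpolation between the two nearest sorted values.
def linear_quantile(values, q)
  sorted = values.compact.sort
  return nil if sorted.empty?

  pos = q * (sorted.length - 1) # fractional index into the sorted values
  lower = pos.floor
  upper = pos.ceil
  sorted[lower] + (pos - lower) * (sorted[upper] - sorted[lower])
end

p describe_metric_labels([0.25, 0.5, 0.75])
p linear_quantile([40, 50, nil], 0.5)
```

With the default percentiles the label list has nine entries, matching the nine rows of the first example; five percentiles give eleven.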

#drop(*columns, strict: true) ⇒ LazyFrame

Remove one or more columns from a LazyFrame.

Examples:

Drop a single column by passing the name of that column.

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
lf.drop("ham").collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ f64 │
# ╞═════╪═════╡
# │ 1   ┆ 6.0 │
# │ 2   ┆ 7.0 │
# │ 3   ┆ 8.0 │
# └─────┴─────┘

Drop multiple columns by passing a selector.

lf.drop(Polars.cs.numeric).collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ ham │
# │ --- │
# │ str │
# ╞═════╡
# │ a   │
# │ b   │
# │ c   │
# └─────┘

Use positional arguments to drop multiple columns.

lf.drop("foo", "ham").collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ bar │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 6.0 │
# │ 7.0 │
# │ 8.0 │
# └─────┘

Parameters:

  • columns (Object)
    • Name of the column that should be removed.
    • List of column names.
  • strict (Boolean) (defaults to: true)

    Validate that all column names exist in the current schema, and raise an exception if any do not.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3688

def drop(*columns, strict: true)
  selectors = []
  columns.each do |c|
    if c.is_a?(Enumerable)
      selectors += c
    else
      selectors += [c]
    end
  end

  drop_cols = Utils.parse_list_into_selector(selectors, strict: strict)
  _from_rbldf(_ldf.drop(drop_cols._rbselector))
end
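
The argument handling in drop (positional names and arrays of names are flattened into one list, then optionally validated) can be modeled without Polars. In this sketch remaining_columns is a hypothetical helper, and ArgumentError stands in for whatever error the library actually raises when strict is true.

```ruby
# Return the column names left after dropping the requested ones.
def remaining_columns(names, *columns, strict: true)
  # Accept both drop("a", "b") and drop(["a", "b"]); a String is not Enumerable.
  requested = columns.flat_map { |c| c.is_a?(Enumerable) ? c.to_a : [c] }

  missing = requested - names
  if strict && !missing.empty?
    raise ArgumentError, "unknown columns: #{missing.inspect}"
  end

  names - requested
end

names = ["foo", "bar", "ham"]
p remaining_columns(names, "ham")
p remaining_columns(names, ["foo", "ham"])
p remaining_columns(names, "missing", strict: false)
```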

#drop_nans(subset: nil) ⇒ LazyFrame

Drop all rows that contain one or more NaN values.

The original order of the remaining rows is preserved.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [-20.5, Float::NAN, 80.0],
    "bar" => [Float::NAN, 110.0, 25.5],
    "ham" => ["xxx", "yyy", nil]
  }
)
lf.drop_nans.collect
# =>
# shape: (1, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ f64  ┆ f64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 80.0 ┆ 25.5 ┆ null │
# └──────┴──────┴──────┘
lf.drop_nans(subset: ["bar"]).collect
# =>
# shape: (2, 3)
# ┌──────┬───────┬──────┐
# │ foo  ┆ bar   ┆ ham  │
# │ ---  ┆ ---   ┆ ---  │
# │ f64  ┆ f64   ┆ str  │
# ╞══════╪═══════╪══════╡
# │ NaN  ┆ 110.0 ┆ yyy  │
# │ 80.0 ┆ 25.5  ┆ null │
# └──────┴───────┴──────┘

Parameters:

  • subset (Object) (defaults to: nil)

    Column name(s) for which NaN values are considered; if set to nil (default), use all columns (note that only floating-point columns can contain NaNs).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4633

def drop_nans(subset: nil)
  selector_subset = nil
  if !subset.nil?
    selector_subset = Utils.parse_list_into_selector(subset)._rbselector
  end
  _from_rbldf(_ldf.drop_nans(selector_subset))
end
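
The row-level rule behind drop_nans, including the subset option, can be sketched in plain Ruby. Note that nil is not NaN, so the row whose ham value is null survives. drop_nan_rows is a hypothetical helper over an array of row hashes, not the library implementation.

```ruby
# Remove rows that hold NaN in any of the considered columns.
def drop_nan_rows(rows, subset: nil)
  rows.reject do |row|
    keys = subset || row.keys
    # Only Float values can be NaN; nil and strings never match.
    keys.any? { |k| row[k].is_a?(Float) && row[k].nan? }
  end
end

rows = [
  {"foo" => -20.5, "bar" => Float::NAN, "ham" => "xxx"},
  {"foo" => Float::NAN, "bar" => 110.0, "ham" => "yyy"},
  {"foo" => 80.0, "bar" => 25.5, "ham" => nil}
]
p drop_nan_rows(rows).length
p drop_nan_rows(rows, subset: ["bar"]).length
```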

#drop_nulls(subset: nil) ⇒ LazyFrame

Drop all rows that contain one or more null values.

The original order of the remaining rows is preserved.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, nil, 8],
    "ham" => ["a", "b", nil]
  }
)
lf.drop_nulls.collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘
lf.drop_nulls(subset: Polars.cs.integer).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1   ┆ 6   ┆ a    │
# │ 3   ┆ 8   ┆ null │
# └─────┴─────┴──────┘

Parameters:

  • subset (Object) (defaults to: nil)

    Column name(s) for which null values are considered. If set to nil (default), use all columns.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4682

def drop_nulls(subset: nil)
  selector_subset = nil
  if !subset.nil?
    selector_subset = Utils.parse_list_into_selector(subset)._rbselector
  end
  _from_rbldf(_ldf.drop_nulls(selector_subset))
end

#dtypes ⇒ Array

Get dtypes of columns in LazyFrame.

Examples:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]

Returns:

  • (Array)

# File 'lib/polars/lazy_frame.rb', line 130

def dtypes
  _ldf.collect_schema.values
end

#explain(format: "plain", optimized: true, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ String

Create a string representation of the query plan.

Different optimizations can be turned on or off.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
lf.group_by("a", maintain_order: true).agg(Polars.all.sum).sort(
  "a"
).explain

Returns:

  • (String)

# File 'lib/polars/lazy_frame.rb', line 543

def explain(
  format: "plain",
  optimized: true,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if engine == "streaming"
    Utils.issue_unstable_warning("streaming mode is considered unstable.")
  end

  if optimized
    ldf = _ldf.with_optimizations(optimizations._rboptflags)
    if format == "tree"
      return ldf.describe_optimized_plan_tree
    else
      return ldf.describe_optimized_plan
    end
  end

  if format == "tree"
    _ldf.describe_plan_tree
  else
    _ldf.describe_plan
  end
end

#explode(columns, *more_columns, empty_as_null: true, keep_nulls: true) ⇒ LazyFrame

Explode lists to long format.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["a", "a", "b", "c"],
    "numbers" => [[1], [2, 3], [4, 5], [6, 7, 8]],
  }
).lazy
df.explode("numbers").collect
# =>
# shape: (8, 2)
# ┌─────────┬─────────┐
# │ letters ┆ numbers │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ a       ┆ 1       │
# │ a       ┆ 2       │
# │ a       ┆ 3       │
# │ b       ┆ 4       │
# │ b       ┆ 5       │
# │ c       ┆ 6       │
# │ c       ┆ 7       │
# │ c       ┆ 8       │
# └─────────┴─────────┘

Parameters:

  • columns (Object)

    Column names, expressions, or a selector defining them. The underlying columns being exploded must be of the List or Array data type.

  • more_columns (Array)

    Additional names of columns to explode, specified as positional arguments.

  • empty_as_null (Boolean) (defaults to: true)

    Explode an empty list/array into a null.

  • keep_nulls (Boolean) (defaults to: true)

    Explode a null list/array into a null.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4510

def explode(
  columns,
  *more_columns,
  empty_as_null: true,
  keep_nulls: true
)
  subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector(
    more_columns
  )
  _from_rbldf(_ldf.explode(subset._rbselector, empty_as_null, keep_nulls))
end
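
Exploding a list column into long format can be modeled on an array of row hashes. In this sketch explode_rows is a hypothetical helper handling a single column, and it assumes that turning empty_as_null or keep_nulls off means the affected row produces no output rows; the parameter descriptions above only state what happens when the flags are on.

```ruby
# Emit one output row per list element, copying the other columns.
def explode_rows(rows, column, empty_as_null: true, keep_nulls: true)
  rows.flat_map do |row|
    values = row[column]
    if values.nil?
      # Null list: one null row, or nothing (assumed) when keep_nulls is off.
      keep_nulls ? [row.merge(column => nil)] : []
    elsif values.empty?
      # Empty list: one null row, or nothing (assumed) when empty_as_null is off.
      empty_as_null ? [row.merge(column => nil)] : []
    else
      values.map { |v| row.merge(column => v) }
    end
  end
end

rows = [
  {"letters" => "a", "numbers" => [1]},
  {"letters" => "a", "numbers" => [2, 3]},
  {"letters" => "b", "numbers" => [4, 5]},
  {"letters" => "c", "numbers" => [6, 7, 8]}
]
p explode_rows(rows, "numbers").map { |r| [r["letters"], r["numbers"]] }
```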

#fill_nan(value) ⇒ LazyFrame

Note:

Floating-point NaN (Not a Number) values are not missing values. To replace missing values, use fill_null instead.

Fill floating point NaN values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13],
  }
).lazy
df.fill_nan(99).collect
# =>
# shape: (4, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ f64  ┆ f64  │
# ╞══════╪══════╡
# │ 1.5  ┆ 0.5  │
# │ 2.0  ┆ 4.0  │
# │ 99.0 ┆ 99.0 │
# │ 4.0  ┆ 13.0 │
# └──────┴──────┘

Parameters:

  • value (Object)

    Value to fill the NaN values with.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4249

def fill_nan(value)
  if !value.is_a?(Expr)
    value = F.lit(value)
  end
  _from_rbldf(_ldf.fill_nan(value._rbexpr))
end
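
The distinction drawn in the note, that NaN is a float value while null is a missing value, is visible in plain Ruby too, where Float::NAN and nil are unrelated objects. A minimal model of fill_nan on one column, with fill_nan_column as a hypothetical helper:

```ruby
# Replace NaN entries only; nil (missing) entries are left untouched.
def fill_nan_column(values, replacement)
  values.map { |v| v.is_a?(Float) && v.nan? ? replacement : v }
end

p fill_nan_column([1.5, 2.0, Float::NAN, nil], 99.0)
```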

#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ LazyFrame

Fill null values using the specified value or strategy.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, nil, 4],
    "b" => [0.5, 4, nil, 13]
  }
)
lf.fill_null(99).collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 99  ┆ 99.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "forward").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "max").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "zero").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 0   ┆ 0.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4174

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  if !value.nil?
    if value.is_a?(Expr)
      dtypes = nil
    elsif value.is_a?(TrueClass) || value.is_a?(FalseClass)
      dtypes = [Boolean]
    elsif matches_supertype && (value.is_a?(Integer) || value.is_a?(Float))
      dtypes = [
        Int8,
        Int16,
        Int32,
        Int64,
        Int128,
        UInt8,
        UInt16,
        UInt32,
        UInt64,
        Float32,
        Float64,
        Decimal.new
      ]
    elsif value.is_a?(Integer)
      dtypes = [Int64]
    elsif value.is_a?(Float)
      dtypes = [Float64]
    elsif value.is_a?(::Date)
      dtypes = [Date]
    elsif value.is_a?(::String)
      dtypes = [String, Categorical]
    else
      # fallback; anything not explicitly handled above
      dtypes = nil
    end

    if dtypes
      return with_columns(
        F.col(dtypes).fill_null(value, strategy: strategy, limit: limit)
      )
    end
  end

  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end
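
The three strategies shown in the examples above can be pictured on a single column in plain Ruby. This is only a sketch of the semantics, not how Polars implements them: `fill_nulls` is a hypothetical helper, and it covers only the "forward", "max" and "zero" strategies that appear above.

```ruby
# Hypothetical helper (not part of Polars): fills nil entries of one column.
def fill_nulls(values, strategy)
  case strategy
  when "forward"
    last = nil
    # Carry the most recent non-nil value forward; leading nils stay nil.
    values.map { |v| v.nil? ? last : (last = v) }
  when "max"
    max = values.compact.max
    values.map { |v| v.nil? ? max : v }
  when "zero"
    # Use a zero of the same kind as the column's first non-nil value.
    zero = values.compact.first.is_a?(Float) ? 0.0 : 0
    values.map { |v| v.nil? ? zero : v }
  else
    raise ArgumentError, "unsupported strategy: #{strategy}"
  end
end

a = [1, 2, nil, 4]
p fill_nulls(a, "forward") # => [1, 2, 2, 4]
p fill_nulls(a, "max")     # => [1, 2, 4, 4]
p fill_nulls(a, "zero")    # => [1, 2, 0, 4]
```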

#filter(*predicates, **constraints) ⇒ LazyFrame

Filter the rows in the LazyFrame based on one or more predicate expressions.

Examples:

Filter on one condition:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3, nil, 4, nil, 0],
    "bar" => [6, 7, 8, nil, nil, 9, 0],
    "ham" => ["a", "b", "c", nil, "d", "e", "f"]
  }
)
lf.filter(Polars.col("foo") > 1).collect
# =>
# shape: (3, 3)
# ┌─────┬──────┬─────┐
# │ foo ┆ bar  ┆ ham │
# │ --- ┆ ---  ┆ --- │
# │ i64 ┆ i64  ┆ str │
# ╞═════╪══════╪═════╡
# │ 2   ┆ 7    ┆ b   │
# │ 3   ┆ 8    ┆ c   │
# │ 4   ┆ null ┆ d   │
# └─────┴──────┴─────┘

Filter on multiple conditions:

lf.filter((Polars.col("foo") < 3) & (Polars.col("ham") == "a")).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Provide multiple filters using *args syntax:

lf.filter(
  Polars.col("foo") == 1,
  Polars.col("ham") == "a"
).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Provide multiple filters using **kwargs syntax:

lf.filter(foo: 1, ham: "a").collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Filter on an OR condition:

lf.filter(
  (Polars.col("foo") == 1) | (Polars.col("ham") == "c")
).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Filter by comparing two columns against each other:

lf.filter(
  Polars.col("foo") == Polars.col("bar")
).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 0   ┆ 0   ┆ f   │
# └─────┴─────┴─────┘
lf.filter(
  Polars.col("foo") != Polars.col("bar")
).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • predicates (Array)

    Expression(s) that evaluate to a boolean Series.

  • constraints (Hash)

    Column filters; use name: value to keep rows where the named column equals the supplied value. Each constraint behaves the same as Polars.col(name).eq(value), and is implicitly joined with the other filter conditions using &.

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 2139

def filter(*predicates, **constraints)
  if constraints.empty?
    # early-exit conditions (exclude/include all rows)
    if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass))
      return dup
    end
    if predicates.length == 1 && predicates[0].is_a?(FalseClass)
      return clear
    end
  end

  _filter(
    predicates: predicates,
    constraints: constraints,
    invert: false
  )
end
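
How predicates and keyword constraints combine can be modelled row by row in plain Ruby. This is a sketch under stated assumptions, not the Polars implementation: `filter_rows` is a hypothetical helper, rows are plain hashes, and a predicate is a lambda that returns true, false, or nil (standing in for null).

```ruby
# Hypothetical row-wise model of #filter (illustration only).
def filter_rows(rows, *predicates, **constraints)
  # Each keyword constraint acts like an equality predicate on that column;
  # comparing a null value yields null rather than false.
  constraints.each do |name, value|
    key = name.to_s
    predicates << ->(row) { row[key].nil? ? nil : row[key] == value }
  end
  # All conditions are ANDed; a row is kept only if every one is strictly true.
  rows.select { |row| predicates.all? { |pred| pred.call(row) == true } }
end

rows = [
  {"foo" => 1, "bar" => 6, "ham" => "a"},
  {"foo" => 2, "bar" => 7, "ham" => "b"},
  {"foo" => nil, "bar" => nil, "ham" => nil}
]
p filter_rows(rows, foo: 1, ham: "a")
# => [{"foo"=>1, "bar"=>6, "ham"=>"a"}]
```

With no predicates and no constraints the sketch keeps every row, which corresponds to the early exit in the method body above.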

#first ⇒ LazyFrame

Get the first row of the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 5, 3],
    "b" => [2, 4, 6]
  }
)
lf.first.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 4039

def first
  slice(0, 1)
end

#gather(indices, null_on_oob: false) ⇒ LazyFrame

Note:

This functionality is experimental. It may be changed at any point without it being considered a breaking change.

Selects rows from this LazyFrame at the given indices.

Examples:

lf = Polars::LazyFrame.new({"x" => [2, 1, 0], "s" => ["foo", "bar", "baz"]})
lf.gather([2, 0, 0]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ x   ┆ s   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 0   ┆ baz │
# │ 2   ┆ foo │
# │ 2   ┆ foo │
# └─────┴─────┘
lf.gather([0, 10, 1], null_on_oob: true).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ x    ┆ s    │
# │ ---  ┆ ---  │
# │ i64  ┆ str  │
# ╞══════╪══════╡
# │ 2    ┆ foo  │
# │ null ┆ null │
# │ 1    ┆ bar  │
# └──────┴──────┘
idxs = Polars::LazyFrame.new({"i" => [1, 10, 0], "b" => [true, false, true]})
lf.gather(idxs.filter(Polars.col("b")).select(Polars.col("i"))).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ x   ┆ s   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ bar │
# │ 2   ┆ foo │
# └─────┴─────┘

Parameters:

  • indices (Object)

    The indices of the rows to select.

    Because there is no LazySeries, a single-column LazyFrame may also be passed as indices.

  • null_on_oob (Boolean) (defaults to: false)

    If true, an out-of-bounds index produces a null row instead of raising an error.

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 3543

def gather(indices, null_on_oob: false)
  if !indices.is_a?(LazyFrame)
    if indices.is_a?(::Array)
      indices_expr = F.lit(Series.new("", indices, dtype: Int64))
    else
      indices_expr = wrap_expr(Utils.parse_into_expression(indices))
    end
    indices = select(indices_expr)
  end

  _from_rbldf(_ldf.gather(indices._ldf, null_on_oob))
end
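
The effect of null_on_oob can be illustrated with a plain-Ruby model over an array of row hashes. `gather_rows` is a hypothetical helper, not a Polars method, and treating negative indices as out of bounds is an assumption of this sketch rather than documented behavior.

```ruby
# Hypothetical model of row gathering (illustration only).
def gather_rows(rows, indices, null_on_oob: false)
  keys = rows.first ? rows.first.keys : []
  indices.map do |i|
    if i >= 0 && i < rows.length
      rows[i]
    elsif null_on_oob
      # Out-of-bounds index: emit a row of nils with the same columns.
      keys.to_h { |k| [k, nil] }
    else
      raise IndexError, "index #{i} is out of bounds"
    end
  end
end

rows = [
  {"x" => 2, "s" => "foo"},
  {"x" => 1, "s" => "bar"},
  {"x" => 0, "s" => "baz"}
]
p gather_rows(rows, [0, 10, 1], null_on_oob: true)
# => [{"x"=>2, "s"=>"foo"}, {"x"=>nil, "s"=>nil}, {"x"=>1, "s"=>"bar"}]
```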

#gather_every(n, offset: 0) ⇒ LazyFrame

Take every nth row in the LazyFrame and return as a new LazyFrame.

Examples:

s = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [5, 6, 7, 8]}).lazy
s.gather_every(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 5   │
# │ 3   ┆ 7   │
# └─────┴─────┘

Parameters:

  • n (Integer)

    Gather every n-th row.

  • offset (Integer) (defaults to: 0)

    Starting index.

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 4101

def gather_every(n, offset: 0)
  select(F.col("*").gather_every(n, offset))
end
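
As a rough plain-Ruby picture of how n and offset interact, the sketch below skips `offset` values and then keeps the first of every group of n. `every_nth` is a hypothetical helper that works on a single array, not on a LazyFrame.

```ruby
# Hypothetical helper (illustration only): keep every n-th value from offset.
def every_nth(values, n, offset: 0)
  raise ArgumentError, "n must be positive" unless n.positive?

  # Drop the first `offset` values, then take the head of each chunk of n.
  values.drop(offset).each_slice(n).map(&:first)
end

p every_nth([1, 2, 3, 4], 2)            # => [1, 3]
p every_nth([1, 2, 3, 4], 2, offset: 1) # => [2, 4]
```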

#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy

Start a group by operation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.col("b").sum).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 4   │
# │ b   ┆ 11  │
# │ c   ┆ 6   │
# └─────┴─────┘

Parameters:

  • by (Array)

    Column(s) to group by.

  • maintain_order (Boolean) (defaults to: false)

    Make sure that the order of the groups remains consistent. This is more expensive than a default group by.

  • named_by (Hash)

    Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (LazyGroupBy)


# File 'lib/polars/lazy_frame.rb', line 2443

def group_by(*by, maintain_order: false, **named_by)
  exprs = Utils.parse_into_list_of_expressions(*by, **named_by)
  lgb = _ldf.group_by(exprs, maintain_order)
  LazyGroupBy.new(lgb)
end
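
The aggregation in the example above can be reproduced on plain row hashes to show what a grouped sum computes. This is only an analogy: `group_sum` is a hypothetical helper, and the stable group order comes from Ruby's insertion-ordered Hash, which loosely corresponds to maintain_order: true.

```ruby
# Hypothetical helper (illustration only): sum `value` within each `key` group.
def group_sum(rows, key, value)
  rows.group_by { |row| row[key] }
      .transform_values { |group| group.sum { |row| row[value] } }
end

rows = [
  {"a" => "a", "b" => 1}, {"a" => "b", "b" => 2}, {"a" => "a", "b" => 3},
  {"a" => "b", "b" => 4}, {"a" => "b", "b" => 5}, {"a" => "c", "b" => 6}
]
p group_sum(rows, "a", "b") # => {"a"=>4, "b"=>11, "c"=>6}
```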

#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ LazyGroupBy

Group based on a time value (or an index value of type Int32 or Int64).

Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window can be seen as a rolling window whose size is determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

  • every: interval of the window
  • period: length of the window
  • offset: offset of the window

The every, period and offset arguments are created with the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a group_by_dynamic on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10
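
To make the string language above concrete, here is a small plain-Ruby tokenizer for it. It is a sketch, not the parser Polars uses: `parse_duration` and `DURATION_TOKEN` are hypothetical names, and accepting a leading "-" is an assumption based on the negative default offset built in the method body.

```ruby
# Two-letter units come first in the alternation so that "ms" and "mo"
# are not read as "m" followed by stray characters.
DURATION_TOKEN = /(\d+)(ns|us|ms|mo|s|m|h|d|w|y|i)/

# Hypothetical helper (illustration only): split a duration string into parts.
def parse_duration(str)
  negative = str.start_with?("-")
  body = negative ? str[1..-1] : str
  tokens = body.scan(DURATION_TOKEN)
  # Reject empty input and input with characters the tokens do not cover.
  unless !tokens.empty? && tokens.map(&:join).join == body
    raise ArgumentError, "invalid duration: #{str.inspect}"
  end
  parts = tokens.map { |count, unit| [Integer(count, 10), unit] }
  { negative: negative, parts: parts }
end

p parse_duration("3d12h4m25s")[:parts]
# => [[3, "d"], [12, "h"], [4, "m"], [25, "s"]]
```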

Examples:

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "n" => 0..6
  }
)
# =>
# shape: (7, 2)
# ┌─────────────────────┬─────┐
# │ time                ┆ n   │
# │ ---                 ┆ --- │
# │ datetime[μs]        ┆ i64 │
# ╞═════════════════════╪═════╡
# │ 2021-12-16 00:00:00 ┆ 0   │
# │ 2021-12-16 00:30:00 ┆ 1   │
# │ 2021-12-16 01:00:00 ┆ 2   │
# │ 2021-12-16 01:30:00 ┆ 3   │
# │ 2021-12-16 02:00:00 ┆ 4   │
# │ 2021-12-16 02:30:00 ┆ 5   │
# │ 2021-12-16 03:00:00 ┆ 6   │
# └─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

df.group_by_dynamic("time", every: "1h", closed: "right").agg(
  [
    Polars.col("time").min.alias("time_min"),
    Polars.col("time").max.alias("time_max")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬─────────────────────┬─────────────────────┐
# │ time                ┆ time_min            ┆ time_max            │
# │ ---                 ┆ ---                 ┆ ---                 │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
# └─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result.

df.group_by_dynamic(
  "time", every: "1h", include_boundaries: true, closed: "right"
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (4, 4)
# ┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# └─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

With closed: "left", the right end of each interval is excluded.

df.group_by_dynamic("time", every: "1h", closed: "left").agg(
  [
    Polars.col("time").count.alias("time_count"),
    Polars.col("time").alias("time_agg_list")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬─────────────────────────────────┐
# │ time                ┆ time_count ┆ time_agg_list                   │
# │ ---                 ┆ ---        ┆ ---                             │
# │ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]              │
# ╞═════════════════════╪════════════╪═════════════════════════════════╡
# │ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-… │
# │ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-… │
# │ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-… │
# │ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]           │
# └─────────────────────┴────────────┴─────────────────────────────────┘

With closed: "both", the time values at the window boundaries belong to two groups.

df.group_by_dynamic("time", every: "1h", closed: "both").agg(
  [Polars.col("time").count.alias("time_count")]
)
# =>
# shape: (5, 2)
# ┌─────────────────────┬────────────┐
# │ time                ┆ time_count │
# │ ---                 ┆ ---        │
# │ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 3          │
# │ 2021-12-16 01:00:00 ┆ 3          │
# │ 2021-12-16 02:00:00 ┆ 3          │
# │ 2021-12-16 03:00:00 ┆ 1          │
# └─────────────────────┴────────────┘

Dynamic group bys can also be combined with grouping on normal keys.

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "groups" => ["a", "a", "a", "b", "b", "a", "a"]
  }
)
df.group_by_dynamic(
  "time",
  every: "1h",
  closed: "both",
  group_by: "groups",
  include_boundaries: true
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (7, 5)
# ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
# │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
# │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
# └────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic group by on an index column.

df = Polars::DataFrame.new(
  {
    "idx" => Polars.arange(0, 6, eager: true),
    "A" => ["A", "A", "B", "B", "B", "C"]
  }
)
df.group_by_dynamic(
  "idx",
  every: "2i",
  period: "3i",
  include_boundaries: true,
  closed: "right"
).agg(Polars.col("A").alias("A_agg_list"))
# =>
# shape: (4, 4)
# ┌─────────────────┬─────────────────┬─────┬─────────────────┐
# │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
# │ ---             ┆ ---             ┆ --- ┆ ---             │
# │ i64             ┆ i64             ┆ i64 ┆ list[str]       │
# ╞═════════════════╪═════════════════╪═════╪═════════════════╡
# │ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
# │ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
# │ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
# │ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
# └─────────────────┴─────────────────┴─────┴─────────────────┘

Parameters:

  • index_column (Object)

    Column used to group based on the time window, often of type Date/Datetime. This column must be sorted in ascending order; if it is not, the output will not make sense.

    In case of a dynamic group by on indices, the dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

  • every (Object)

    Interval of the window.

  • period (Object) (defaults to: nil)

    Length of the window, if nil it is equal to 'every'.

  • offset (Object) (defaults to: nil)

    Offset of the window. If nil and period is also nil, it is equal to negative every; if nil and period is given, it is zero.

  • include_boundaries (Boolean) (defaults to: false)

    Add the lower and upper bound of the window to the "_lower_boundary" and "_upper_boundary" columns. This will impact performance because it is harder to parallelize.

  • closed ("right", "left", "both", "none") (defaults to: "left")

    Define whether the temporal window interval is closed or not.

  • label ('left', 'right', 'datapoint') (defaults to: "left")

    Define which label to use for the window:

    • 'left': lower boundary of the window
    • 'right': upper boundary of the window
    • 'datapoint': the first value of the index column in the given window. If you don't need the label to be at one of the boundaries, choose this option for maximum performance.

  • group_by (Object) (defaults to: nil)

    Also group by this column/these columns.

  • start_by ('window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday') (defaults to: "window")

    The strategy to determine the start of the first window by.

    • 'window': Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
    • 'datapoint': Start from the first encountered data point.
    • a day of the week (only takes effect if every contains 'w'):

      • 'monday': Start the window on the Monday before the first data point.
      • 'tuesday': Start the window on the Tuesday before the first data point.
      • ...
      • 'sunday': Start the window on the Sunday before the first data point.

    The resulting window is then shifted back until the earliest datapoint is in or in front of it.

Returns:

  • (LazyGroupBy)


# File 'lib/polars/lazy_frame.rb', line 2806

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  group_by: nil,
  start_by: "window"
)
  index_column = Utils.parse_into_expression(index_column, str_as_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)
  every = Utils.parse_as_duration_string(every)

  rbexprs_by = group_by.nil? ? [] : Utils.parse_into_list_of_expressions(group_by)
  lgb = _ldf.group_by_dynamic(
    index_column,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by
  )
  LazyGroupBy.new(lgb)
end
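
Two parts of this method can be modelled in plain Ruby: the defaulting of period and offset (copied from the method body above) and how closed decides which values fall in a window. The helper names are hypothetical, and the window lower boundaries are passed in explicitly because how Polars chooses the first window is not modelled here.

```ruby
# Mirrors the nil handling at the top of group_by_dynamic.
def resolve_window(every:, period: nil, offset: nil)
  offset = (period.nil? ? "-#{every}" : "0ns") if offset.nil?
  period = every if period.nil?
  { every: every, period: period, offset: offset }
end

# Hypothetical membership test for an integer window [lower, lower + period].
def in_window?(value, lower, period, closed)
  upper = lower + period
  case closed
  when "left"  then lower <= value && value < upper
  when "right" then lower < value && value <= upper
  when "both"  then lower <= value && value <= upper
  when "none"  then lower < value && value < upper
  else raise ArgumentError, "unknown closed: #{closed}"
  end
end

# Map each window's lower boundary to the values it contains.
def assign_windows(values, lowers, period, closed)
  lowers.to_h do |lower|
    [lower, values.select { |v| in_window?(v, lower, period, closed) }]
  end
end

p resolve_window(every: "1h")
# => {:every=>"1h", :period=>"1h", :offset=>"-1h"} (Ruby 3.4+ prints {every: "1h", ...})

# Boundaries taken from the index example above (every: "2i", period: "3i").
p assign_windows((0..5).to_a, [-2, 0, 2, 4], 3, "right")
# => {-2=>[0, 1], 0=>[1, 2, 3], 2=>[3, 4, 5], 4=>[5]}
```

Because the period (3) is longer than the step between windows (2), values such as 1, 3 and 5 land in two windows each, which is the overlap described at the top of this method's documentation.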

#head(n = 5) ⇒ LazyFrame

Get the first n rows.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.head.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# └─────┴─────┘
lf.head(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 3944

def head(n = 5)
  slice(0, n)
end

#include?(key) ⇒ Boolean

Check if LazyFrame includes key.

Returns:

  • (Boolean)


# File 'lib/polars/lazy_frame.rb', line 167

def include?(key)
  columns.include?(key)
end

#interpolate ⇒ LazyFrame

Interpolate intermediate values. The interpolation method is linear.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 9, 10],
    "bar" => [6, 7, 9, nil],
    "baz" => [1, nil, nil, 9]
  }
).lazy
df.interpolate.collect
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────────┐
# │ foo  ┆ bar  ┆ baz      │
# │ ---  ┆ ---  ┆ ---      │
# │ f64  ┆ f64  ┆ f64      │
# ╞══════╪══════╪══════════╡
# │ 1.0  ┆ 6.0  ┆ 1.0      │
# │ 5.0  ┆ 7.0  ┆ 3.666667 │
# │ 9.0  ┆ 9.0  ┆ 6.333333 │
# │ 10.0 ┆ null ┆ 9.0      │
# └──────┴──────┴──────────┘

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 5042

def interpolate
  select(F.col("*").interpolate)
end
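
Linear interpolation over a single column can be sketched in plain Ruby as follows. `interpolate_linear` is a hypothetical helper, not the Polars implementation; it fills nil runs that sit between two known values and leaves leading or trailing nils untouched, matching the example output above.

```ruby
# Hypothetical helper (illustration only): linear interpolation of nil gaps.
def interpolate_linear(values)
  out = values.map { |v| v.nil? ? nil : v.to_f }
  known = out.each_index.reject { |i| out[i].nil? }
  # Walk consecutive pairs of known positions and fill the gap between them.
  known.each_cons(2) do |lo, hi|
    step = (out[hi] - out[lo]) / (hi - lo)
    ((lo + 1)...hi).each { |i| out[i] = out[lo] + step * (i - lo) }
  end
  out
end

p interpolate_linear([1, nil, 9, 10]) # => [1.0, 5.0, 9.0, 10.0]
p interpolate_linear([6, 7, 9, nil])  # => [6.0, 7.0, 9.0, nil]
```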

#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil) ⇒ LazyFrame

Add a join operation to the Logical Plan.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
other_df = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
).lazy
df.join(other_df, on: "ham").collect
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "full").collect
# =>
# shape: (4, 5)
# ┌──────┬──────┬──────┬───────┬───────────┐
# │ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
# │ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
# │ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
# ╞══════╪══════╪══════╪═══════╪═══════════╡
# │ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
# │ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
# │ null ┆ null ┆ null ┆ z     ┆ d         │
# │ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
# └──────┴──────┴──────┴───────┴───────────┘
df.join(other_df, on: "ham", how: "left").collect
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# │ 3   ┆ 8.0 ┆ c   ┆ null  │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "semi").collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6.0 ┆ a   │
# │ 2   ┆ 7.0 ┆ b   │
# └─────┴─────┴─────┘
df.join(other_df, on: "ham", how: "anti").collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • other (LazyFrame)

    Lazy DataFrame to join with.

  • left_on (Object) (defaults to: nil)

    Join column of the left DataFrame.

  • right_on (Object) (defaults to: nil)

    Join column of the right DataFrame.

  • on (Object) (defaults to: nil)

    Join column of both DataFrames. If set, left_on and right_on should be nil.

  • how ("inner", "left", "full", "semi", "anti", "cross") (defaults to: "inner")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • validate ('m:m', 'm:1', '1:m', '1:1') (defaults to: "m:m")

    Checks if join is of specified type.

    • many_to_many - “m:m”: default, does not result in checks
    • one_to_one - “1:1”: check if join keys are unique in both left and right datasets
    • one_to_many - “1:m”: check if join keys are unique in left dataset
    • many_to_one - “m:1”: check if join keys are unique in right dataset

  • nulls_equal (Boolean) (defaults to: false)

    Join on null values. By default null values will never produce matches.

  • allow_parallel (Boolean) (defaults to: true)

    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

  • force_parallel (Boolean) (defaults to: false)

    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

  • coalesce (Boolean) (defaults to: nil)

    Coalescing behavior (merging of join columns).

    • nil: -> join specific.
    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.

  • maintain_order ('none', 'left', 'right', 'left_right', 'right_left') (defaults to: nil)

    Which DataFrame row order to preserve, if any. Do not rely on any observed ordering without explicitly setting this parameter, as your code may break in a future release. Not specifying any ordering can improve performance. Supported for inner, left, right and full joins.

    • none No specific ordering is desired. The ordering might differ across Polars versions or even between different runs.
    • left Preserves the order of the left DataFrame.
    • right Preserves the order of the right DataFrame.
    • left_right First preserves the order of the left DataFrame, then the right.
    • right_left First preserves the order of the right DataFrame, then the left.

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 3317

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  nulls_equal: false,
  allow_parallel: true,
  force_parallel: false,
  coalesce: nil,
  maintain_order: nil
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if maintain_order.nil?
    maintain_order = "none"
  end

  if how == "outer"
    how = "full"
  elsif how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf,
        [],
        [],
        allow_parallel,
        nulls_equal,
        force_parallel,
        how,
        suffix,
        validate,
        maintain_order,
        coalesce
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.parse_into_list_of_expressions(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.parse_into_list_of_expressions(left_on)
    rbexprs_right = Utils.parse_into_list_of_expressions(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      nulls_equal,
      how,
      suffix,
      validate,
      maintain_order,
      coalesce
    )
  )
end
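
The difference between the "inner", "left", "semi" and "anti" strategies can be seen in a plain-Ruby model over arrays of row hashes. This is a sketch under stated assumptions, not the Polars implementation: `join_rows` is a hypothetical helper, it joins on a single shared column, null keys never match (as with the default nulls_equal: false), and clashing column names are not suffixed.

```ruby
# Hypothetical single-key join over row hashes (illustration only).
def join_rows(left, right, on:, how: "inner")
  index = right.group_by { |row| row[on] }
  right_cols = (right.first ? right.first.keys : []) - [on]
  nulls = right_cols.to_h { |col| [col, nil] }
  # Right-side rows matching a left row; a nil key matches nothing.
  matches = ->(row) { row[on].nil? ? [] : index.fetch(row[on], []) }

  case how
  when "inner"
    left.flat_map { |l| matches.call(l).map { |r| l.merge(r) } }
  when "left"
    left.flat_map do |l|
      found = matches.call(l)
      found.empty? ? [l.merge(nulls)] : found.map { |r| l.merge(r) }
    end
  when "semi" then left.reject { |l| matches.call(l).empty? }
  when "anti" then left.select { |l| matches.call(l).empty? }
  else raise ArgumentError, "unsupported how: #{how}"
  end
end

left = [
  {"foo" => 1, "ham" => "a"}, {"foo" => 2, "ham" => "b"}, {"foo" => 3, "ham" => "c"}
]
right = [
  {"apple" => "x", "ham" => "a"}, {"apple" => "y", "ham" => "b"}, {"apple" => "z", "ham" => "d"}
]
p join_rows(left, right, on: "ham", how: "anti")
# => [{"foo"=>3, "ham"=>"c"}]
```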

#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ LazyFrame

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the join_asof key.

For each row in the left DataFrame:

  • A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
  • A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.

The default is "backward".
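
The search rules above can be written out in plain Ruby for a single left key. This is a sketch, not the Polars implementation: `asof_match` is a hypothetical helper, the right keys are assumed to be sorted in ascending order, and resolving "nearest" ties toward the earlier key is an assumption of this sketch.

```ruby
require "date"

# Hypothetical helper (illustration only): pick the matching right-hand key.
def asof_match(left_key, right_keys, strategy: "backward")
  case strategy
  when "backward"
    # Last right key that is less than or equal to the left key.
    right_keys.select { |k| k <= left_key }.last
  when "forward"
    # First right key that is greater than or equal to the left key.
    right_keys.find { |k| k >= left_key }
  when "nearest"
    # Key with the smallest absolute distance; min_by keeps the first on ties.
    right_keys.min_by { |k| (k - left_key).abs }
  else
    raise ArgumentError, "unknown strategy: #{strategy}"
  end
end

years = (2016..2020).map { |y| Date.new(y, 1, 1) }
puts asof_match(Date.new(2018, 8, 1), years)                      # 2018-01-01
puts asof_match(Date.new(2018, 8, 1), years, strategy: "forward") # 2019-01-01
```

A result of nil means no right key satisfies the rule, for example a "backward" search for a left key earlier than every right key.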

Examples:

gdp = Polars::LazyFrame.new(
  {
    "date" => Polars.date_range(
      Date.new(2016, 1, 1),
      Date.new(2020, 1, 1),
      "1y",
      eager: true
    ),
    "gdp" => [4164, 4411, 4566, 4696, 4827]
  }
)
gdp.collect
# =>
# shape: (5, 2)
# ┌────────────┬──────┐
# │ date       ┆ gdp  │
# │ ---        ┆ ---  │
# │ date       ┆ i64  │
# ╞════════════╪══════╡
# │ 2016-01-01 ┆ 4164 │
# │ 2017-01-01 ┆ 4411 │
# │ 2018-01-01 ┆ 4566 │
# │ 2019-01-01 ┆ 4696 │
# │ 2020-01-01 ┆ 4827 │
# └────────────┴──────┘
population = Polars::LazyFrame.new(
  {
    "date" => [Date.new(2016, 3, 1), Date.new(2018, 8, 1), Date.new(2019, 1, 1)],
    "population" => [82.19, 82.66, 83.12]
  }
).sort("date")
population.collect
# =>
# shape: (3, 2)
# ┌────────────┬────────────┐
# │ date       ┆ population │
# │ ---        ┆ ---        │
# │ date       ┆ f64        │
# ╞════════════╪════════════╡
# │ 2016-03-01 ┆ 82.19      │
# │ 2018-08-01 ┆ 82.66      │
# │ 2019-01-01 ┆ 83.12      │
# └────────────┴────────────┘

Note how the dates don't quite match. If we join them using join_asof and strategy: "backward", then each date from population which doesn't have an exact match is matched with the closest earlier date from gdp:

population.join_asof(gdp, on: "date", strategy: "backward").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 4566 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
population.join_asof(
  gdp, on: "date", strategy: "backward", coalesce: false
).collect
# =>
# shape: (3, 4)
# ┌────────────┬────────────┬────────────┬──────┐
# │ date       ┆ population ┆ date_right ┆ gdp  │
# │ ---        ┆ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ date       ┆ i64  │
# ╞════════════╪════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 2016-01-01 ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 2018-01-01 ┆ 4566 │
# │ 2019-01-01 ┆ 83.12      ┆ 2019-01-01 ┆ 4696 │
# └────────────┴────────────┴────────────┴──────┘

If we instead use strategy: "forward", then each date from population which doesn't have an exact match is matched with the closest later date from gdp:

population.join_asof(gdp, on: "date", strategy: "forward").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4411 │
# │ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
population.join_asof(gdp, on: "date", strategy: "nearest").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
gdp_dates = Polars.date_range(
  Date.new(2016, 1, 1), Date.new(2020, 1, 1), "1y", eager: true
)
gdp2 = Polars::LazyFrame.new(
  {
    "country" => ["Germany"] * 5 + ["Netherlands"] * 5,
    "date" => Polars.concat([gdp_dates, gdp_dates]),
    "gdp" => [4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909]
  }
).sort("country", "date")
gdp2.collect
# =>
# shape: (10, 3)
# ┌─────────────┬────────────┬──────┐
# │ country     ┆ date       ┆ gdp  │
# │ ---         ┆ ---        ┆ ---  │
# │ str         ┆ date       ┆ i64  │
# ╞═════════════╪════════════╪══════╡
# │ Germany     ┆ 2016-01-01 ┆ 4164 │
# │ Germany     ┆ 2017-01-01 ┆ 4411 │
# │ Germany     ┆ 2018-01-01 ┆ 4566 │
# │ Germany     ┆ 2019-01-01 ┆ 4696 │
# │ Germany     ┆ 2020-01-01 ┆ 4827 │
# │ Netherlands ┆ 2016-01-01 ┆ 784  │
# │ Netherlands ┆ 2017-01-01 ┆ 833  │
# │ Netherlands ┆ 2018-01-01 ┆ 914  │
# │ Netherlands ┆ 2019-01-01 ┆ 910  │
# │ Netherlands ┆ 2020-01-01 ┆ 909  │
# └─────────────┴────────────┴──────┘
pop2 = Polars::LazyFrame.new(
  {
    "country" => ["Germany"] * 3 + ["Netherlands"] * 3,
    "date" => [
      Date.new(2016, 3, 1),
      Date.new(2018, 8, 1),
      Date.new(2019, 1, 1),
      Date.new(2016, 3, 1),
      Date.new(2018, 8, 1),
      Date.new(2019, 1, 1)
    ],
    "population" => [82.19, 82.66, 83.12, 17.11, 17.32, 17.40]
  }
).sort("country", "date")
pop2.collect
# =>
# shape: (6, 3)
# ┌─────────────┬────────────┬────────────┐
# │ country     ┆ date       ┆ population │
# │ ---         ┆ ---        ┆ ---        │
# │ str         ┆ date       ┆ f64        │
# ╞═════════════╪════════════╪════════════╡
# │ Germany     ┆ 2016-03-01 ┆ 82.19      │
# │ Germany     ┆ 2018-08-01 ┆ 82.66      │
# │ Germany     ┆ 2019-01-01 ┆ 83.12      │
# │ Netherlands ┆ 2016-03-01 ┆ 17.11      │
# │ Netherlands ┆ 2018-08-01 ┆ 17.32      │
# │ Netherlands ┆ 2019-01-01 ┆ 17.4       │
# └─────────────┴────────────┴────────────┘
pop2.join_asof(gdp2, by: "country", on: "date", strategy: "nearest", check_sortedness: false).collect
# =>
# shape: (6, 4)
# ┌─────────────┬────────────┬────────────┬──────┐
# │ country     ┆ date       ┆ population ┆ gdp  │
# │ ---         ┆ ---        ┆ ---        ┆ ---  │
# │ str         ┆ date       ┆ f64        ┆ i64  │
# ╞═════════════╪════════════╪════════════╪══════╡
# │ Germany     ┆ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ Germany     ┆ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ Germany     ┆ 2019-01-01 ┆ 83.12      ┆ 4696 │
# │ Netherlands ┆ 2016-03-01 ┆ 17.11      ┆ 784  │
# │ Netherlands ┆ 2018-08-01 ┆ 17.32      ┆ 910  │
# │ Netherlands ┆ 2019-01-01 ┆ 17.4       ┆ 910  │
# └─────────────┴────────────┴────────────┴──────┘

Parameters:

  • other (LazyFrame)

    Lazy DataFrame to join with.

  • left_on (String) (defaults to: nil)

    Join column of the left DataFrame.

  • right_on (String) (defaults to: nil)

    Join column of the right DataFrame.

  • on (String) (defaults to: nil)

    Join column of both DataFrames. If set, left_on and right_on should be nil.

  • by_left (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • by_right (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • by (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • strategy ("backward", "forward", "nearest") (defaults to: "backward")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • tolerance (Object) (defaults to: nil)

    Numeric tolerance. When set, a row is only joined if the nearest keys are within this distance. For an asof join on columns of dtype "Date", "Datetime", "Duration" or "Time", use the following string language instead:

    • 1ns (1 nanosecond)
    • 1us (1 microsecond)
    • 1ms (1 millisecond)
    • 1s (1 second)
    • 1m (1 minute)
    • 1h (1 hour)
    • 1d (1 day)
    • 1w (1 week)
    • 1mo (1 calendar month)
    • 1y (1 calendar year)
    • 1i (1 index count)

    Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

  • allow_parallel (Boolean) (defaults to: true)

    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

  • force_parallel (Boolean) (defaults to: false)

    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

  • coalesce (Boolean) (defaults to: true)

    Coalescing behavior (merging of join columns).

    • true: Always coalesce join columns.
    • false: Never coalesce join columns. Note that joining on any expression other than a plain col turns off coalescing.

  • allow_exact_matches (Boolean) (defaults to: true)

    Whether exact matches are valid join predicates.

    • If true, allow matching with the same on value (i.e. less-than-or-equal-to / greater-than-or-equal-to).
    • If false, don't match the same on value (i.e., strictly less-than / strictly greater-than).

  • check_sortedness (Boolean) (defaults to: true)

    Check the sortedness of the asof keys. If the keys are not sorted, Polars raises an error or, when the by argument is used, emits a warning instead. The warning might become a hard error in the future.
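
The strategy, allow_exact_matches and tolerance parameters can be illustrated with a plain-Ruby model that matches a single key against sorted right-hand keys. This is only a sketch of the documented semantics, not the Polars implementation; asof_match is a hypothetical helper, and applying allow_exact_matches to the "nearest" strategy is an assumption.

```ruby
# Hypothetical helper: model as-of matching of one left key against right keys.
# Not part of the Polars API; it only mirrors the parameter descriptions above.
def asof_match(key, right_keys, strategy: "backward", allow_exact_matches: true, tolerance: nil)
  # Drop exact matches when they are not allowed.
  candidates = right_keys.reject { |r| r == key && !allow_exact_matches }

  match =
    case strategy
    when "backward" then candidates.select { |r| r <= key }.max
    when "forward" then candidates.select { |r| r >= key }.min
    when "nearest" then candidates.min_by { |r| (r - key).abs }
    else raise ArgumentError, "unknown strategy: #{strategy.inspect}"
    end

  # No candidate, or candidate too far away: no match.
  return nil if match.nil?
  return nil if tolerance && (match - key).abs > tolerance

  match
end

right = [10, 20, 30]
p asof_match(24, right)                              # => 20
p asof_match(24, right, strategy: "forward")         # => 30
p asof_match(26, right, strategy: "nearest")         # => 30
p asof_match(20, right, allow_exact_matches: false)  # => 10
p asof_match(24, right, tolerance: 3)                # => nil
```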

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3104

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str,
      coalesce,
      allow_exact_matches,
      check_sortedness
    )
  )
end
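
The source above passes a String tolerance through as tolerance_str and any other value as tolerance_num. To make the string language concrete, the sketch below converts a combined duration such as "3d12h4m25s" into nanoseconds in plain Ruby. duration_to_ns and UNIT_NS are hypothetical names, a day is assumed to be exactly 24 hours, and the calendar and index units (mo, y, i) are rejected because they have no fixed length; Polars itself may resolve them differently.

```ruby
# Hypothetical lookup table: fixed-length units of the duration string language.
UNIT_NS = {
  "ns" => 1,
  "us" => 1_000,
  "ms" => 1_000_000,
  "s" => 1_000_000_000,
  "m" => 60 * 1_000_000_000,
  "h" => 3_600 * 1_000_000_000,
  "d" => 86_400 * 1_000_000_000,      # assumption: 1 day = 24 hours
  "w" => 7 * 86_400 * 1_000_000_000
}.freeze

# Hypothetical helper: sum all "<count><unit>" parts of a duration string.
def duration_to_ns(text)
  parts = text.scan(/(\d+)(ns|us|ms|mo|[smhdwyi])/)
  if parts.empty? || parts.map(&:join).join != text
    raise ArgumentError, "cannot parse duration: #{text.inspect}"
  end

  parts.sum do |count, unit|
    factor = UNIT_NS.fetch(unit) do
      raise ArgumentError, "unit #{unit.inspect} has no fixed length"
    end
    count.to_i * factor
  end
end

puts duration_to_ns("3d12h4m25s") # => 302665000000000
```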

#join_where(other, *predicates, suffix: "_right") ⇒ LazyFrame

Note:

The row order of the input DataFrames is not preserved.

Note:

This functionality is experimental. It may be changed at any point without it being considered a breaking change.

Perform a join based on one or multiple (in)equality predicates.

This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.

Examples:

Join two LazyFrames together based on two predicates which get AND-ed together.

east = Polars::LazyFrame.new(
  {
    "id" => [100, 101, 102],
    "dur" => [120, 140, 160],
    "rev" => [12, 14, 16],
    "cores" => [2, 8, 4]
  }
)
west = Polars::LazyFrame.new(
  {
    "t_id" => [404, 498, 676, 742],
    "time" => [90, 130, 150, 170],
    "cost" => [9, 13, 15, 16],
    "cores" => [4, 2, 1, 4]
  }
)
east.join_where(
  west,
  Polars.col("dur") < Polars.col("time"),
  Polars.col("rev") < Polars.col("cost")
).collect
# =>
# shape: (5, 8)
# ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐
# │ id  ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │
# │ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---         │
# │ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  ┆ i64  ┆ i64         │
# ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 498  ┆ 130  ┆ 13   ┆ 2           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘

To OR them together, use a single expression and the | operator.

east.join_where(
  west,
  (Polars.col("dur") < Polars.col("time")) | (Polars.col("rev") < Polars.col("cost"))
).collect
# =>
# shape: (6, 8)
# ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐
# │ id  ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │
# │ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---         │
# │ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  ┆ i64  ┆ i64         │
# ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 498  ┆ 130  ┆ 13   ┆ 2           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 102 ┆ 160 ┆ 16  ┆ 4     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘

Parameters:

  • other (LazyFrame)

    LazyFrame to join with.

  • predicates (Object)

    (In)Equality condition to join the two tables on. When a column name occurs in both tables, the proper suffix must be applied in the predicate.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.
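
Conceptually, the result equals a cross product of both tables filtered by the predicates. The plain-Ruby sketch below models that on arrays of hashes, using the id, dur, rev, t_id, time and cost values from the examples above (the cores columns are left out). join_where_model is a hypothetical helper, and Polars does not necessarily evaluate the join this way.

```ruby
EAST = [
  {id: 100, dur: 120, rev: 12},
  {id: 101, dur: 140, rev: 14},
  {id: 102, dur: 160, rev: 16}
].freeze

WEST = [
  {t_id: 404, time: 90, cost: 9},
  {t_id: 498, time: 130, cost: 13},
  {t_id: 676, time: 150, cost: 15},
  {t_id: 742, time: 170, cost: 16}
].freeze

# Hypothetical helper: keep every (left, right) pair for which the block is true.
def join_where_model(left, right)
  left.product(right)
      .select { |l, r| yield(l, r) }
      .map { |l, r| l.merge(r) }
end

# Two predicates AND-ed together.
both = join_where_model(EAST, WEST) { |l, r| l[:dur] < r[:time] && l[:rev] < r[:cost] }
# The same predicates OR-ed together.
either = join_where_model(EAST, WEST) { |l, r| l[:dur] < r[:time] || l[:rev] < r[:cost] }

puts both.length   # => 5
puts either.length # => 6
```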

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3466

def join_where(
  other,
  *predicates,
  suffix: "_right"
)
  Utils.require_same_type(self, other)

  rbexprs = Utils.parse_into_list_of_expressions(*predicates)

  _from_rbldf(
    _ldf.join_where(
      other._ldf,
      rbexprs,
      suffix
    )
  )
end

#last ⇒ LazyFrame

Get the last row of the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 5, 3],
    "b" => [2, 4, 6]
  }
)
lf.last.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 3   ┆ 6   │
# └─────┴─────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4014

def last
  tail(1)
end

#lazy ⇒ LazyFrame

Return lazy representation, i.e. itself.

Useful for writing code that expects either a DataFrame or LazyFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil]
  }
)
df.lazy

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1908

def lazy
  self
end

#limit(n = 5) ⇒ LazyFrame

Get the first n rows.

Alias for #head.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.limit.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# └─────┴─────┘
lf.limit(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3899

def limit(n = 5)
  head(n)
end

#map_batches(predicate_pushdown: false, projection_pushdown: false, slice_pushdown: false, no_optimizations: nil, schema: nil, validate_output_schema: true, streamable: false, &function) ⇒ LazyFrame

Note:

The schema of a LazyFrame must always be correct. It is up to the caller of this function to ensure that this invariant is upheld.

Note:

It is important that the optimization flags are correct. If the custom function for instance does an aggregation of a column, predicate_pushdown should not be allowed, as this prunes rows and will influence your aggregation results.

Note:

A UDF passed to map_batches must be pure, meaning that it cannot modify or depend on state other than its arguments.

Apply a custom function.

It is important that the function returns a Polars DataFrame.

Examples:

lf = (
  Polars::LazyFrame.new(
    {
      "a": Polars.int_range(-100_000, 0, eager: true),
      "b": Polars.int_range(0, 100_000, eager: true)
    }
  )
  .map_batches(streamable: true) { |x| x * 2 }
  .collect(engine: "streaming")
)
# =>
# shape: (100_000, 2)
# ┌─────────┬────────┐
# │ a       ┆ b      │
# │ ---     ┆ ---    │
# │ i64     ┆ i64    │
# ╞═════════╪════════╡
# │ -200000 ┆ 0      │
# │ -199998 ┆ 2      │
# │ -199996 ┆ 4      │
# │ -199994 ┆ 6      │
# │ -199992 ┆ 8      │
# │ …       ┆ …      │
# │ -10     ┆ 199990 │
# │ -8      ┆ 199992 │
# │ -6      ┆ 199994 │
# │ -4      ┆ 199996 │
# │ -2      ┆ 199998 │
# └─────────┴────────┘

Parameters:

  • predicate_pushdown (Boolean) (defaults to: false)

    Allow predicate pushdown optimization to pass this node.

  • projection_pushdown (Boolean) (defaults to: false)

    Allow projection pushdown optimization to pass this node.

  • slice_pushdown (Boolean) (defaults to: false)

    Allow slice pushdown optimization to pass this node.

  • no_optimizations (Boolean) (defaults to: nil)

    Turn off all optimizations past this point.

  • schema (Object) (defaults to: nil)

    Output schema of the function, if set to nil we assume that the schema will remain unchanged by the applied function.

  • validate_output_schema (Boolean) (defaults to: true)

    It is paramount that Polars' schema is correct. This flag ensures that the output schema of this function is checked against the expected schema. Setting it to false skips the check, which may lead to hard-to-debug bugs.

  • streamable (Boolean) (defaults to: false)

    Whether the given function is eligible to run with the streaming engine. That means the function must produce the same result whether it is executed in batches or on the full dataset.
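
The streamable requirement can be checked informally on plain arrays: an element-wise function gives the same result per batch as on the whole input, while an aggregation does not. The sketch below is plain Ruby and independent of Polars; same_in_batches? is a hypothetical helper.

```ruby
# Hypothetical helper: does applying fn per batch equal applying it to all rows?
def same_in_batches?(rows, batch_size, fn)
  whole = fn.call(rows)
  batched = rows.each_slice(batch_size).flat_map { |batch| fn.call(batch) }
  whole == batched
end

rows = (1..10).to_a
double = ->(batch) { batch.map { |x| x * 2 } } # element-wise
total = ->(batch) { [batch.sum] }              # aggregation

p same_in_batches?(rows, 3, double) # => true
p same_in_batches?(rows, 3, total)  # => false
```

A function like total would therefore not qualify for streamable: true.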

Returns:

  • (LazyFrame)

Raises:

  • (Todo)


# File 'lib/polars/lazy_frame.rb', line 4986

def map_batches(
  predicate_pushdown: false,
  projection_pushdown: false,
  slice_pushdown: false,
  no_optimizations: nil,
  schema: nil,
  validate_output_schema: true,
  streamable: false,
  &function
)
  raise Todo if !schema.nil?

  if no_optimizations
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
  end

  _from_rbldf(
    _ldf.map_batches(
      function,
      predicate_pushdown,
      projection_pushdown,
      slice_pushdown,
      streamable,
      schema,
      validate_output_schema,
    )
  )
end

#match_to_schema(schema, missing_columns: "raise", missing_struct_fields: "raise", extra_columns: "raise", extra_struct_fields: "raise", integer_cast: "forbid", float_cast: "forbid") ⇒ LazyFrame

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Match or evolve the schema of a LazyFrame into a specific schema.

By default, match_to_schema returns an error if the input schema does not exactly match the target schema. It also allows columns to be freely reordered, with additional coercion rules available through optional parameters.

Examples:

Ensuring the schema matches

lf = Polars::LazyFrame.new({"a" => [1, 2, 3], "b" => ["A", "B", "C"]})
lf.match_to_schema({"a" => Polars::Int64, "b" => Polars::String}).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ A   │
# │ 2   ┆ B   │
# │ 3   ┆ C   │
# └─────┴─────┘

Adding missing columns

Polars::LazyFrame.new({"a" => [1, 2, 3]})
.match_to_schema(
  {"a" => Polars::Int64, "b" => Polars::String},
  missing_columns: "insert"
)
.collect
# =>
# shape: (3, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ str  │
# ╞═════╪══════╡
# │ 1   ┆ null │
# │ 2   ┆ null │
# │ 3   ┆ null │
# └─────┴──────┘
Polars::LazyFrame.new({"a" => [1, 2, 3]})
.match_to_schema(
  {"a" => Polars::Int64, "b" => Polars::String},
  missing_columns: {"b" => Polars.col("a").cast(Polars::String)}
)
.collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ 1   │
# │ 2   ┆ 2   │
# │ 3   ┆ 3   │
# └─────┴─────┘

Removing extra columns

Polars::LazyFrame.new({"a" => [1, 2, 3], "b" => ["A", "B", "C"]})
.match_to_schema(
  {"a" => Polars::Int64},
  extra_columns: "ignore"
)
.collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘

Upcasting integers and floats

Polars::LazyFrame.new(
  {"a" => [1, 2, 3], "b" => [1.0, 2.0, 3.0]},
  schema: {"a" => Polars::Int32, "b" => Polars::Float32}
)
.match_to_schema(
  {"a" => Polars::Int64, "b" => Polars::Float64},
  integer_cast: "upcast",
  float_cast: "upcast"
)
.collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ f64 │
# ╞═════╪═════╡
# │ 1   ┆ 1.0 │
# │ 2   ┆ 2.0 │
# │ 3   ┆ 3.0 │
# └─────┴─────┘

Parameters:

  • schema (Object)

    Target schema to match or evolve to.

  • missing_columns (Object) (defaults to: "raise")

    Raise or insert missing columns from the input with respect to the schema.

    This can also be an expression per column with what to insert if it is missing.

  • missing_struct_fields (Object) (defaults to: "raise")

    Raise or insert missing struct fields from the input with respect to the schema.

  • extra_columns (Object) (defaults to: "raise")

    Raise or ignore extra columns from the input with respect to the schema.

  • extra_struct_fields (Object) (defaults to: "raise")

    Raise or ignore extra struct fields from the input with respect to the schema.

  • integer_cast (Object) (defaults to: "forbid")

    Forbid or upcast integer columns from the input to the respective column in schema.

  • float_cast (Object) (defaults to: "forbid")

    Forbid or upcast float columns from the input to the respective column in schema.
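
The interplay of missing_columns and extra_columns can be pictured with schemas modeled as plain hashes of column name to dtype name. The sketch below is not the Polars implementation: match_schema_model is a hypothetical helper, struct fields and casts are ignored, and returning the columns in target order is an assumption.

```ruby
# Hypothetical helper: compare an input schema against a target schema.
def match_schema_model(input, target, missing_columns: "raise", extra_columns: "raise")
  missing = target.keys - input.keys
  extra = input.keys - target.keys

  if missing.any? && missing_columns != "insert"
    raise ArgumentError, "missing columns: #{missing.join(', ')}"
  end
  if extra.any? && extra_columns != "ignore"
    raise ArgumentError, "extra columns: #{extra.join(', ')}"
  end

  # Columns present on both sides must agree on dtype (no casting in this model).
  target.each do |name, dtype|
    next unless input.key?(name)
    raise ArgumentError, "dtype mismatch for #{name}" if input[name] != dtype
  end

  target.dup
end

target = {"a" => "i64", "b" => "str"}
# Reordered input columns are accepted.
puts match_schema_model({"b" => "str", "a" => "i64"}, target).keys.join(",") # => a,b
# A missing column is only accepted with missing_columns: "insert".
puts match_schema_model({"a" => "i64"}, target, missing_columns: "insert").keys.join(",") # => a,b
```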

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 5585

def match_to_schema(
  schema,
  missing_columns: "raise",
  missing_struct_fields: "raise",
  extra_columns: "raise",
  extra_struct_fields: "raise",
  integer_cast: "forbid",
  float_cast: "forbid"
)
  prepare_missing_columns = lambda do |value|
    if value.is_a?(Expr)
      value._rbexpr
    else
      value
    end
  end

  if schema.is_a?(Hash)
    schema_prep = Schema.new(schema)
  else
    schema_prep = schema
  end

  if missing_columns.is_a?(Hash)
    missing_columns_rbexpr =
      missing_columns.to_h do |key, value|
        [key.to_s, prepare_missing_columns.(value)]
      end
  elsif missing_columns.is_a?(Expr)
    missing_columns_rbexpr = prepare_missing_columns.(missing_columns)
  else
    missing_columns_rbexpr = missing_columns
  end

  LazyFrame._from_rbldf(
    _ldf.match_to_schema(
      schema_prep,
      missing_columns_rbexpr,
      missing_struct_fields,
      extra_columns,
      extra_struct_fields,
      integer_cast,
      float_cast
    )
  )
end

#max ⇒ LazyFrame

Aggregate the columns in the DataFrame to their maximum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.max.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 4   ┆ 2   │
# └─────┴─────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4336

def max
  _from_rbldf(_ldf.max)
end

#mean ⇒ LazyFrame

Aggregate the columns in the DataFrame to their mean value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.mean.collect
# =>
# shape: (1, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ f64 ┆ f64  │
# ╞═════╪══════╡
# │ 2.5 ┆ 1.25 │
# └─────┴──────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4396

def mean
  _from_rbldf(_ldf.mean)
end

#median ⇒ LazyFrame

Aggregate the columns in the DataFrame to their median value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.median.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ f64 ┆ f64 │
# ╞═════╪═════╡
# │ 2.5 ┆ 1.0 │
# └─────┴─────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4416

def median
  _from_rbldf(_ldf.median)
end

#merge_sorted(other, key, maintain_order: false) ⇒ LazyFrame

Take two sorted DataFrames and merge them by the sorted key.

The output of this operation will also be sorted. It is the caller's responsibility to ensure that both frames are sorted by that key; otherwise the output will not make sense.

The schemas of both LazyFrames must be equal.

Examples:

df0 = Polars::LazyFrame.new(
  {"name" => ["steve", "elise", "bob"], "age" => [42, 44, 18]}
).sort("age")
df1 = Polars::LazyFrame.new(
  {"name" => ["anna", "megan", "steve", "thomas"], "age" => [21, 33, 42, 20]}
).sort("age")
df0.merge_sorted(df1, "age").collect
# =>
# shape: (7, 2)
# ┌────────┬─────┐
# │ name   ┆ age │
# │ ---    ┆ --- │
# │ str    ┆ i64 │
# ╞════════╪═════╡
# │ bob    ┆ 18  │
# │ thomas ┆ 20  │
# │ anna   ┆ 21  │
# │ megan  ┆ 33  │
# │ steve  ┆ 42  │
# │ steve  ┆ 42  │
# │ elise  ┆ 44  │
# └────────┴─────┘

Parameters:

  • other (LazyFrame)

    Other LazyFrame that must be merged.

  • key (String)

    Key that is sorted.

  • maintain_order (Boolean) (defaults to: false)

    If true, the output is guaranteed to have left-biased ordering for equal keys: rows from the left frame appear before rows from the right frame when their keys are equal.
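
The left-biased ordering described for maintain_order corresponds to a classic two-pointer merge that takes from the left input on ties. The plain-Ruby sketch below models that on arrays of hashes, reusing the names and ages from the example above; merge_sorted_model is a hypothetical helper, not the Polars implementation.

```ruby
# Hypothetical helper: merge two arrays that are already sorted by key.
def merge_sorted_model(left, right, key)
  out = []
  i = 0
  j = 0
  while i < left.length && j < right.length
    # "<=" takes the left row first when keys are equal (left-biased).
    if left[i][key] <= right[j][key]
      out << left[i]
      i += 1
    else
      out << right[j]
      j += 1
    end
  end
  # Append whatever remains on either side.
  out.concat(left[i..]).concat(right[j..])
end

left = [{name: "bob", age: 18}, {name: "steve", age: 42}, {name: "elise", age: 44}]
right = [{name: "thomas", age: 20}, {name: "anna", age: 21}, {name: "megan", age: 33}, {name: "steve", age: 42}]

puts merge_sorted_model(left, right, :age).map { |row| row[:name] }.join(" ")
# => bob thomas anna megan steve steve elise
```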

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 5151

def merge_sorted(other, key, maintain_order: false)
  _from_rbldf(_ldf.merge_sorted(other._ldf, key, maintain_order))
end

#min ⇒ LazyFrame

Aggregate the columns in the DataFrame to their minimum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.min.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 1   │
# └─────┴─────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4356

def min
  _from_rbldf(_ldf.min)
end

#null_count ⇒ LazyFrame

Aggregate the columns in the LazyFrame as the sum of their null value count.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, nil, 3],
    "bar" => [6, 7, nil],
    "ham" => ["a", "b", "c"]
  }
)
lf.null_count.collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 1   ┆ 0   │
# └─────┴─────┴─────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4442

def null_count
  _from_rbldf(_ldf.null_count)
end

#pipe(function, *args, **kwargs, &block) ⇒ LazyFrame

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Examples:

cast_str_to_int = lambda do |data, col_name:|
  data.with_columns(Polars.col(col_name).cast(Polars::Int64))
end

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => ["10", "20", "30", "40"]}).lazy
df.pipe(cast_str_to_int, col_name: "b").collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 10  │
# │ 2   ┆ 20  │
# │ 3   ┆ 30  │
# │ 4   ┆ 40  │
# └─────┴─────┘

Parameters:

  • function (Object)

    Callable; will receive the frame as the first parameter, followed by any given args/kwargs.

  • args (Object)

    Arguments to pass to the UDF.

  • kwargs (Object)

    Keyword arguments to pass to the UDF.
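
Because each function receives the frame first and returns a frame, several pipe calls chain naturally. The sketch below shows this with a stand-in class so that it runs without Polars: TinyFrame, add and scale are hypothetical, and only the body of pipe mirrors the one-line method in the source listing for this method.

```ruby
# Hypothetical stand-in for a frame; it only wraps an array of numbers.
class TinyFrame
  attr_reader :rows

  def initialize(rows)
    @rows = rows
  end

  # Same shape as LazyFrame#pipe: call the function with self first.
  def pipe(function, *args, **kwargs, &block)
    function.(self, *args, **kwargs, &block)
  end
end

add = ->(frame, n) { TinyFrame.new(frame.rows.map { |x| x + n }) }
scale = ->(frame, by:) { TinyFrame.new(frame.rows.map { |x| x * by }) }

result = TinyFrame.new([1, 2, 3]).pipe(add, 1).pipe(scale, by: 10)
p result.rows # => [20, 30, 40]
```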

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 261

def pipe(function, *args, **kwargs, &block)
  function.(self, *args, **kwargs, &block)
end

#pipe_with_schema(function) ⇒ LazyFrame

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Alter the lazy frame during the plan stage using the resolved schema.

In contrast to pipe, this method does not execute function immediately but only during the plan stage. This makes it possible to use the resolved schema of the input to alter the lazy frame dynamically. It also means that any exceptions raised by function are only emitted during the plan stage.

Examples:

cast_to_float_if_necessary = lambda do |lf, schema|
  required_casts =
    schema.filter_map do |name, dtype|
      Polars.col(name).cast(Polars::Float64) if !dtype.float?
    end
  lf.with_columns(required_casts)
end
lf = Polars::LazyFrame.new(
  {"a" => [1.0, 2.0], "b" => ["1.0", "2.5"], "c" => [2.0, 3.0]},
  schema: {"a" => Polars::Float64, "b" => Polars::String, "c" => Polars::Float32}
)
lf.pipe_with_schema(cast_to_float_if_necessary).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ f64 ┆ f64 ┆ f32 │
# ╞═════╪═════╪═════╡
# │ 1.0 ┆ 1.0 ┆ 2.0 │
# │ 2.0 ┆ 2.5 ┆ 3.0 │
# └─────┴─────┴─────┘

Parameters:

  • function (Object)

    Callable; will receive the frame as the first parameter and the resolved schema as the second parameter.
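
The difference between immediate and deferred execution can be modeled without Polars. In the sketch below, TinyPlan is a hypothetical stand-in whose resolve method plays the role of the plan stage; the stored function receives only a schema hash, which is a simplification of the real callable that receives both the frame and the schema.

```ruby
# Hypothetical stand-in for a lazy plan that only tracks a schema.
class TinyPlan
  def initialize(schema, pending = [])
    @schema = schema
    @pending = pending
  end

  # Deferred: the function is stored, not called.
  def pipe_with_schema(function)
    TinyPlan.new(@schema, @pending + [function])
  end

  # Stand-in for the plan stage: run the stored functions on the schema.
  def resolve
    @pending.reduce(@schema) { |schema, function| function.(schema) }
  end
end

log = []
to_float = lambda do |schema|
  log << :ran
  schema.transform_values { :f64 }
end

plan = TinyPlan.new({"a" => :f64, "b" => :str}).pipe_with_schema(to_float)
puts log.empty?                # => true (nothing has run yet)
puts plan.resolve.values.uniq.inspect # => [:f64]
puts log.length                # => 1
```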

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 305

def pipe_with_schema(function)
  wrapper = lambda do |lf_and_schema|
    # The last index is because we return a list for multiple inputs
    # to make `pipe_with_schemas` (plural) work, but we don't use that
    function.(
      _from_rbldf(lf_and_schema[0][0]),
      lf_and_schema[1][0]
    )._ldf
  end

  _from_rbldf(_ldf.pipe_with_schema(wrapper))
end

#pivot(on, on_columns:, index: nil, values: nil, aggregate_function: nil, maintain_order: false, separator: "_", column_naming: "auto") ⇒ LazyFrame

Note:

In some other frameworks, you might know this operation as pivot_wider.

Create a spreadsheet-style pivot table as a DataFrame.

Examples:

Using pivot, we can reshape so we have one row per student, with different subjects as columns, and their test_1 scores as values:

df = Polars::DataFrame.new(
  {
    "name" => ["Cady", "Cady", "Karen", "Karen"],
    "subject" => ["maths", "physics", "maths", "physics"],
    "test_1" => [98, 99, 61, 58],
    "test_2" => [100, 100, 60, 60]
  }
)
df.lazy.pivot(
  "subject",
  on_columns: ["maths", "physics"],
  index: "name",
  values: "test_1",
  maintain_order: true
).collect
# =>
# shape: (2, 3)
# ┌───────┬───────┬─────────┐
# │ name  ┆ maths ┆ physics │
# │ ---   ┆ ---   ┆ ---     │
# │ str   ┆ i64   ┆ i64     │
# ╞═══════╪═══════╪═════════╡
# │ Cady  ┆ 98    ┆ 99      │
# │ Karen ┆ 61    ┆ 58      │
# └───────┴───────┴─────────┘

Parameters:

  • on (Object)

    The column(s) whose values will be used as the new columns of the output DataFrame.

  • on_columns (Object)

    What value combinations will be considered for the output table.

  • index (Object) (defaults to: nil)

    The column(s) that remain from the input to the output. The output DataFrame will have one row for each unique combination of the index's values. If nil, all remaining columns not specified on on and values will be used. At least one of index and values must be specified.

  • values (Object) (defaults to: nil)

    The existing column(s) of values which will be moved under the new columns from index. If an aggregation is specified, these are the values on which the aggregation will be computed. If nil, all remaining columns not specified on on and index will be used. At least one of index and values must be specified.

  • aggregate_function (Object) (defaults to: nil)

    Choose from:

    • nil: no aggregation takes place, will raise error if multiple values are in group.
    • A predefined aggregate function string, one of {'min', 'max', 'first', 'last', 'sum', 'mean', 'median', 'len', 'item'}.
    • An expression to do the aggregation. The expression can only access data from the respective 'values' columns as generated by pivot, through Polars.element.

  • maintain_order (Boolean) (defaults to: false)

    Ensure the values of index are sorted by discovery order.

  • separator (String) (defaults to: "_")

    Used as separator/delimiter in generated column names in case of multiple values columns.

  • column_naming ('auto', 'combine') (defaults to: "auto")

    How resulting column names will be constructed.
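
The roles of on, index and values can be illustrated with a plain-Ruby model on an array of hashes, using the student data from the example above. pivot_model is a hypothetical helper; it handles a single on, index and values column and raises when a cell would receive more than one value, which mimics the documented behavior for aggregate_function: nil.

```ruby
# Hypothetical helper: one output row per index value, one column per `on` value.
def pivot_model(rows, on:, index:, values:)
  rows.group_by { |row| row[index] }.map do |index_value, group|
    cells = {}
    group.each do |row|
      if cells.key?(row[on])
        raise ArgumentError, "multiple values for #{index_value}/#{row[on]}"
      end
      cells[row[on]] = row[values]
    end
    {index => index_value}.merge(cells)
  end
end

rows = [
  {"name" => "Cady", "subject" => "maths", "test_1" => 98},
  {"name" => "Cady", "subject" => "physics", "test_1" => 99},
  {"name" => "Karen", "subject" => "maths", "test_1" => 61},
  {"name" => "Karen", "subject" => "physics", "test_1" => 58}
]

pivot_model(rows, on: "subject", index: "name", values: "test_1").each do |row|
  puts row.values.join(" ")
end
# prints:
# Cady 98 99
# Karen 61 58
```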

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4754

def pivot(
  on,
  on_columns:,
  index: nil,
  values: nil,
  aggregate_function: nil,
  maintain_order: false,
  separator: "_",
  column_naming: "auto"
)
  if index.nil? && values.nil?
    msg = "`pivot` needs either `index or `values` needs to be specified"
    raise InvalidOperationError, msg
  end

  on_selector = Utils.parse_list_into_selector(on)
  if !values.nil?
    values_selector = Utils.parse_list_into_selector(values)
  end
  if !index.nil?
    index_selector = Utils.parse_list_into_selector(index)
  end

  if values.nil?
    values_selector = Polars.cs.all - on_selector - index_selector
  end
  if index.nil?
    index_selector = Polars.cs.all - on_selector - values_selector
  end

  agg = F.element
  if aggregate_function.is_a?(::String)
    if aggregate_function == "first"
      agg = agg.first
    elsif aggregate_function == "item"
      agg = agg.item
    elsif aggregate_function == "sum"
      agg = agg.sum
    elsif aggregate_function == "max"
      agg = agg.max
    elsif aggregate_function == "min"
      agg = agg.min
    elsif aggregate_function == "mean"
      agg = agg.mean
    elsif aggregate_function == "median"
      agg = agg.median
    elsif aggregate_function == "last"
      agg = agg.last
    elsif aggregate_function == "len"
      agg = agg.len
    elsif aggregate_function == "count"
      Utils.issue_deprecation_warning(
        "`aggregate_function='count'` input for `pivot` is deprecated." +
        " Please use `aggregate_function='len'`."
      )
      agg = agg.len
    else
      msg = "invalid input for `aggregate_function` argument: #{aggregate_function.inspect}"
      raise ArgumentError, msg
    end
  elsif aggregate_function.nil?
    agg = agg.item(allow_empty: true)
  else
    agg = aggregate_function
  end

  if on_columns.is_a?(DataFrame)
    on_cols = on_columns
  elsif on_columns.is_a?(Series)
    on_cols = on_columns.to_frame
  else
    on_cols = Series.new(on_columns).to_frame
  end

  _from_rbldf(
    _ldf.pivot(
      on_selector._rbselector,
      on_cols._df,
      index_selector._rbselector,
      values_selector._rbselector,
      agg._rbexpr,
      maintain_order,
      separator,
      column_naming
    )
  )
end

#profile(engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Array

Profile a LazyFrame.

This will run the query and return a two-element array containing the materialized DataFrame and a DataFrame with profiling information for each node that is executed.

The units of the timings are microseconds.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
lf.group_by("a", maintain_order: true).agg(Polars.all.sum).sort(
  "a"
).profile
# =>
# [shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ a   ┆ 4   ┆ 10  │
# │ b   ┆ 11  ┆ 10  │
# │ c   ┆ 6   ┆ 1   │
# └─────┴─────┴─────┘,
# shape: (3, 3)
# ┌──────────────┬───────┬─────┐
# │ node         ┆ start ┆ end │
# │ ---          ┆ ---   ┆ --- │
# │ str          ┆ u64   ┆ u64 │
# ╞══════════════╪═══════╪═════╡
# │ optimization ┆ 0     ┆ 67  │
# │ sort(a)      ┆ 67    ┆ 79  │
# └──────────────┴───────┴─────┘]

Parameters:

  • engine (String) (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine.

  • optimizations (Object) (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

Returns:

  • (Array)



# File 'lib/polars/lazy_frame.rb', line 964

def profile(
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  ldf = _ldf.with_optimizations(optimizations._rboptflags)

  df_rb, timings_rb = ldf.profile
  [Utils.wrap_df(df_rb), Utils.wrap_df(timings_rb)]
end

#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame

Aggregate the columns in the DataFrame to their quantile value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.quantile(0.7).collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ f64 ┆ f64 │
# ╞═════╪═════╡
# │ 3.0 ┆ 1.0 │
# └─────┴─────┘

Parameters:

  • quantile (Float)

    Quantile between 0.0 and 1.0.

  • interpolation ("nearest", "higher", "lower", "midpoint", "linear") (defaults to: "nearest")

    Interpolation method.
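
To make the interpolation options concrete, here is a plain-Ruby sketch of the textbook definitions for a single column. It is an assumption-laden model, not the Polars implementation: quantile_of is a hypothetical helper, and Polars may differ in tie-breaking and null handling.

```ruby
# Illustrative model only; Polars computes quantiles natively and may
# break ties differently.
def quantile_of(values, q, interpolation: "nearest")
  raise ArgumentError, "q must be between 0.0 and 1.0" unless (0.0..1.0).cover?(q)
  return nil if values.empty?

  sorted = values.sort
  pos = q * (sorted.length - 1) # fractional index into the sorted values
  lo = sorted[pos.floor]
  hi = sorted[pos.ceil]

  case interpolation
  when "lower" then lo.to_f
  when "higher" then hi.to_f
  when "nearest" then sorted[pos.round].to_f
  when "midpoint" then (lo + hi) / 2.0
  when "linear" then lo + (hi - lo) * (pos - pos.floor)
  else raise ArgumentError, "unknown interpolation #{interpolation.inspect}"
  end
end

puts quantile_of([1, 2, 3, 4], 0.7)                            # 3.0
puts quantile_of([1, 2, 1, 1], 0.7)                            # 1.0
puts quantile_of([1, 2, 3, 4], 0.7, interpolation: "midpoint") # 3.5
```

The first two calls reproduce the values shown in the example for columns a and b.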

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4467

def quantile(quantile, interpolation: "nearest")
  quantile = Utils.parse_into_expression(quantile, str_as_lit: false)
  _from_rbldf(_ldf.quantile(quantile, interpolation))
end

#remove(*predicates, **constraints) ⇒ LazyFrame

Remove rows, dropping those that match the given predicate expression(s).

The original order of the remaining rows is preserved.

Rows where the filter predicate does not evaluate to true are retained (this includes rows where the predicate evaluates as null).
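
The null rule above can be modelled in plain Ruby, assuming a predicate that returns true, false, or nil (unknown). This sketch does not use Polars; remove_rows is a hypothetical helper that drops a row only when the predicate is exactly true.

```ruby
# Plain-Ruby model of the rule above; it does not use Polars.
# A row is dropped only when the predicate result is exactly true,
# so false and nil (unknown) both keep the row.
def remove_rows(rows)
  rows.reject { |row| yield(row) == true }
end

rows = [
  { foo: 2, bar: 5, ham: "a" },
  { foo: 3, bar: 6, ham: "b" },
  { foo: nil, bar: nil, ham: nil },
  { foo: 4, bar: nil, ham: "c" },
  { foo: 0, bar: 0, ham: "d" }
]

# Comparing a missing value yields nil rather than false.
kept = remove_rows(rows) { |row| row[:bar].nil? ? nil : row[:bar] >= 5 }
kept.each { |row| p row }
```

The three surviving rows match the first example below, where the condition is bar >= 5.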

Examples:

Remove rows matching a condition:

lf = Polars::LazyFrame.new(
  {
    "foo" => [2, 3, nil, 4, 0],
    "bar" => [5, 6, nil, nil, 0],
    "ham" => ["a", "b", nil, "c", "d"]
  }
)
lf.remove(
  Polars.col("bar") >= 5
).collect
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# │ 0    ┆ 0    ┆ d    │
# └──────┴──────┴──────┘

Discard rows based on multiple conditions, combined with and/or operators:

lf.remove(
  (Polars.col("foo") >= 0) & (Polars.col("bar") >= 0)
).collect
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘
lf.remove(
  (Polars.col("foo") >= 0) | (Polars.col("bar") >= 0)
).collect
# =>
# shape: (1, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Provide multiple constraints using *args syntax:

lf.remove(
  Polars.col("ham").is_not_null,
  Polars.col("bar") >= 0
).collect
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘

Provide constraint(s) using **kwargs syntax:

lf.remove(foo: 0, bar: 0).collect
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 2    ┆ 5    ┆ a    │
# │ 3    ┆ 6    ┆ b    │
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘

Remove rows by comparing two columns against each other; in this case, we remove rows where the two columns are not equal (using ne_missing to ensure that null values compare equal):

lf.remove(
  Polars.col("foo").ne_missing(Polars.col("bar"))
).collect
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 0    ┆ 0    ┆ d    │
# └──────┴──────┴──────┘

Parameters:

  • predicates (Array)

    Expression(s) that evaluate to a boolean Series.

  • constraints (Hash)

    Column filters; use name: value to filter columns using the supplied value. Each constraint behaves the same as Polars.col(name).eq(value), and is implicitly joined with the other filter conditions using &.
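
As a rough illustration of how keyword constraints combine, the sketch below treats each constraint as an equality test and requires all of them to hold before a row is removed. It is plain Ruby over an array of hashes, not the Polars implementation, and remove_by_constraints is a hypothetical name.

```ruby
# Illustrative model, not the Polars implementation: each keyword
# constraint is an equality test, and every test must hold (logical AND)
# for a row to be removed. A nil value never matches.
def remove_by_constraints(rows, **constraints)
  rows.reject do |row|
    constraints.all? { |name, value| !row[name].nil? && row[name] == value }
  end
end

rows = [
  { foo: 2, bar: 5, ham: "a" },
  { foo: 3, bar: 6, ham: "b" },
  { foo: nil, bar: nil, ham: nil },
  { foo: 4, bar: nil, ham: "c" },
  { foo: 0, bar: 0, ham: "d" }
]

remaining = remove_by_constraints(rows, foo: 0, bar: 0)
remaining.each { |row| p row }
```

Only the row where both foo and bar equal 0 is dropped, which matches the **kwargs example above.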

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2270

def remove(
  *predicates,
  **constraints
)
  if constraints.empty?
    # early-exit conditions (exclude/include all rows)
    if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass))
      return clear
    end
    if predicates.length == 1 && predicates[0].is_a?(FalseClass)
      return dup
    end
  end

  _filter(
    predicates: predicates,
    constraints: constraints,
    invert: true
  )
end

#rename(mapping, strict: true) ⇒ LazyFrame

Rename column names.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
lf.rename({"foo" => "apple"}).collect
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 1     ┆ 6   ┆ a   │
# │ 2     ┆ 7   ┆ b   │
# │ 3     ┆ 8   ┆ c   │
# └───────┴─────┴─────┘

Parameters:

  • mapping (Hash)

    Key value pairs that map from old name to new name.

  • strict (Boolean) (defaults to: true)

    Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).
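
The interaction between mapping and strict can be sketched in plain Ruby over a list of column names. This is a hypothetical stand-in (rename_columns and the KeyError it raises are assumptions made for illustration); the real method operates on the lazy query plan and raises its own error type.

```ruby
# Hypothetical stand-in for the renaming rules; it operates on a plain
# array of column names rather than a LazyFrame.
def rename_columns(columns, mapping, strict: true)
  # A callable is applied to every column name; strict plays no role here.
  return columns.map { |name| mapping.call(name) } if mapping.respond_to?(:call)

  missing = mapping.keys - columns
  if strict && !missing.empty?
    raise KeyError, "columns not found: #{missing.inspect}"
  end
  columns.map { |name| mapping.fetch(name, name) }
end

columns = ["foo", "bar", "ham"]
p rename_columns(columns, { "foo" => "apple" })              # ["apple", "bar", "ham"]
p rename_columns(columns, ->(name) { name.upcase })          # ["FOO", "BAR", "HAM"]
p rename_columns(columns, { "nope" => "x" }, strict: false)  # ["foo", "bar", "ham"]
```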

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3733

def rename(mapping, strict: true)
  if mapping.respond_to?(:call)
    select(F.all.name.map(&mapping))
  else
    existing = mapping.keys
    _new = mapping.values
    _from_rbldf(_ldf.rename(existing, _new, strict))
  end
end

#reverse ⇒ LazyFrame

Reverse the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "key" => ["a", "b", "c"],
    "val" => [1, 2, 3]
  }
)
lf.reverse.collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ key ┆ val │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 1   │
# └─────┴─────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3766

def reverse
  _from_rbldf(_ldf.reverse)
end

#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ LazyFrame

Create rolling groups based on a time column.

Unlike group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use group_by_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a rolling operation on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10
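
To show how a combined string such as "3d12h4m25s" breaks down into count and unit pairs, here is a small plain-Ruby sketch. The parse_duration helper is hypothetical; Polars parses these strings natively, and this sketch makes no claim to match it in every detail (for example, it does not handle a leading minus sign).

```ruby
# Hypothetical parser for illustration only.
DURATION_UNITS = %w[ns us ms s m h d w mo y i].freeze

def parse_duration(text)
  # List longer unit names first so "mo" and "ms" are not read as "m".
  units = DURATION_UNITS.sort_by { |u| -u.length }.join("|")
  parts = text.scan(/(\d+)(#{units})/)
  # Every character must belong to some count/unit pair.
  if parts.empty? || parts.map(&:join).join != text
    raise ArgumentError, "invalid duration string: #{text.inspect}"
  end
  parts.map { |count, unit| [count.to_i, unit] }
end

p parse_duration("3d12h4m25s") # [[3, "d"], [12, "h"], [4, "m"], [25, "s"]]
p parse_duration("1mo")        # [[1, "mo"]]
p parse_duration("10i")        # [[10, "i"]]
```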

Examples:

dates = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
]
df = Polars::LazyFrame.new({"dt" => dates, "a" => [3, 7, 5, 9, 2, 1]}).with_columns(
  Polars.col("dt").str.strptime(Polars::Datetime).set_sorted
)
df.rolling(index_column: "dt", period: "2d").agg(
  [
    Polars.sum("a").alias("sum_a"),
    Polars.min("a").alias("min_a"),
    Polars.max("a").alias("max_a")
  ]
).collect
# =>
# shape: (6, 4)
# ┌─────────────────────┬───────┬───────┬───────┐
# │ dt                  ┆ sum_a ┆ min_a ┆ max_a │
# │ ---                 ┆ ---   ┆ ---   ┆ ---   │
# │ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
# ╞═════════════════════╪═══════╪═══════╪═══════╡
# │ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
# │ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
# │ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
# │ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
# │ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
# │ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
# └─────────────────────┴───────┴───────┴───────┘

Parameters:

  • index_column (Object)

    Column used to group based on the time window, often of type Date/Datetime. This column must be sorted in ascending order; otherwise the output will not make sense.

    In case of a rolling group by on indices, the dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.

  • period (Object)

    Length of the window.

  • offset (Object) (defaults to: nil)

    Offset of the window. Default is -period.

  • closed ("right", "left", "both", "none") (defaults to: "right")

    Define whether the temporal window interval is closed or not.

  • group_by (Object) (defaults to: nil)

    Also group by this column/these columns.
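
With the defaults (closed: "right" and an offset of -period), the window for a row at time t covers the interval (t - period, t]. The plain-Ruby sketch below models that rule on the data from the example above and reproduces its sum_a, min_a and max_a columns. It does not use Polars, and rolling_agg is a hypothetical helper.

```ruby
require "time"

# Plain-Ruby model of a rolling window with closed: "right" and the
# default offset of -period: the window for a row at time t is
# (t - period, t]. Illustrative only.
def rolling_agg(rows, period_seconds)
  rows.map do |t, _value|
    window = rows.select { |u, _| u > t - period_seconds && u <= t }
                 .map { |_, v| v }
    { dt: t, sum_a: window.sum, min_a: window.min, max_a: window.max }
  end
end

dates = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
]
values = [3, 7, 5, 9, 2, 1]
rows = dates.map { |d| Time.parse("#{d} UTC") }.zip(values)

two_days = 2 * 24 * 60 * 60 # "2d" expressed in seconds
result = rolling_agg(rows, two_days)
result.each { |r| puts r.values_at(:sum_a, :min_a, :max_a).join(" ") }
```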

Returns:



2531
2532
2533
2534
2535
2536
2537
2538
2539
2540
2541
2542
2543
2544
2545
2546
2547
2548
2549
2550
2551
# File 'lib/polars/lazy_frame.rb', line 2531

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  group_by: nil
)
  index_column = Utils.parse_into_expression(index_column)
  if offset.nil?
    offset = Utils.negate_duration_string(Utils.parse_as_duration_string(period))
  end

  rbexprs_by = (
    !group_by.nil? ? Utils.parse_into_list_of_expressions(group_by) : []
  )
  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)

  lgb = _ldf.rolling(index_column, period, offset, closed, rbexprs_by)
  LazyGroupBy.new(lgb)
end

#schema ⇒ Hash

Get the schema.

Examples:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.schema
# => {"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String}

Returns:

  • (Hash)


# File 'lib/polars/lazy_frame.rb', line 148

def schema
  _ldf.collect_schema
end

#select(*exprs, **named_exprs) ⇒ LazyFrame

Select columns from this LazyFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"],
  }
).lazy
df.select("foo").collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘
df.select(["foo", "bar"]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 6   │
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# └─────┴─────┘
df.select(Polars.col("foo") + 1).collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 2   │
# │ 3   │
# │ 4   │
# └─────┘
df.select([Polars.col("foo") + 1, Polars.col("bar") + 1]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# │ 4   ┆ 9   │
# └─────┴─────┘
df.select(Polars.when(Polars.col("foo") > 2).then(10).otherwise(0)).collect
# =>
# shape: (3, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ i32     │
# ╞═════════╡
# │ 0       │
# │ 0       │
# │ 10      │
# └─────────┘

Parameters:

  • exprs (Array)

    Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2379

def select(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.select(rbexprs))
end

#select_seq(*exprs, **named_exprs) ⇒ LazyFrame

Select columns from this LazyFrame.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

  • exprs (Array)

    Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2402

def select_seq(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.select_seq(rbexprs))
end

#serialize(file = nil, format: "binary") ⇒ Object

Note:

Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.

Serialize the logical plan of this LazyFrame to a file or string.

Examples:

Serialize the logical plan into a binary representation.

lf = Polars::LazyFrame.new({"a" => [1, 2, 3]}).sum
bytes = lf.serialize
Polars::LazyFrame.deserialize(StringIO.new(bytes)).collect
# =>
# shape: (1, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 6   │
# └─────┘

Parameters:

  • file (Object) (defaults to: nil)

    File path to which the result should be written. If set to nil (default), the output is returned as a string instead.

  • format ('binary', 'json') (defaults to: "binary")

    The format in which to serialize. Options:

    • "binary": Serialize to binary format (bytes). This is the default.
    • "json": Serialize to JSON format (string) (deprecated).

Returns:

  • (Object)



# File 'lib/polars/lazy_frame.rb', line 214

def serialize(file = nil, format: "binary")
  if format == "binary"
    raise Todo unless _ldf.respond_to?(:serialize_binary)
    serializer = _ldf.method(:serialize_binary)
  elsif format == "json"
    msg = "'json' serialization format of LazyFrame is deprecated"
    warn msg
    serializer = _ldf.method(:serialize_json)
  else
    msg = "`format` must be one of {{'binary', 'json'}}, got #{format.inspect}"
    raise ArgumentError, msg
  end

  Utils.serialize_polars_object(serializer, file)
end

#set_sorted(column, *more_columns, descending: false, nulls_last: false) ⇒ LazyFrame

Note:

This can lead to incorrect results if the data is NOT sorted! Use with care!

Flag a column as sorted.

This can speed up future operations.

Parameters:

  • column (Object)

    Column that is sorted.

  • more_columns (Array)

    Columns that are sorted over after column.

  • descending (Boolean) (defaults to: false)

    Whether the column is sorted in descending order.

  • nulls_last (Boolean) (defaults to: false)

    Whether the nulls are at the end.

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 5172

def set_sorted(
  column,
  *more_columns,
  descending: false,
  nulls_last: false
)
  if !Utils.strlike?(column)
    msg = "expected a 'str' for argument 'column' in 'set_sorted'"
    raise TypeError, msg
  end

  if Utils.bool?(descending)
    ds = [descending]
  else
    ds = descending
  end
  if Utils.bool?(nulls_last)
    nl = [nulls_last]
  else
    nl = nulls_last
  end

  _from_rbldf(
    _ldf.hint_sorted(
      [column] + more_columns, ds, nl
    )
  )
end

#shift(n = 1, fill_value: nil) ⇒ LazyFrame

Shift the values by a given period.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.shift(1).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ i64  ┆ i64  │
# ╞══════╪══════╡
# │ null ┆ null │
# │ 1    ┆ 2    │
# │ 3    ┆ 4    │
# └──────┴──────┘
df.shift(-1).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ i64  ┆ i64  │
# ╞══════╪══════╡
# │ 3    ┆ 4    │
# │ 5    ┆ 6    │
# │ null ┆ null │
# └──────┴──────┘

Parameters:

  • n (Integer) (defaults to: 1)

    Number of places to shift (may be negative).

  • fill_value (Object) (defaults to: nil)

    Fill the resulting null values with this value.
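
The effect of n and fill_value on a single column can be modelled in plain Ruby as follows. This is an illustrative sketch rather than the Polars implementation, and shift_values is a hypothetical helper working on an ordinary array.

```ruby
# Illustrative model of shifting a single column.
def shift_values(values, n = 1, fill_value: nil)
  return values.dup if n.zero?

  # Never pad with more slots than the column has rows.
  padding = Array.new([n.abs, values.length].min, fill_value)
  if n.positive?
    padding + values[0, values.length - padding.length]
  else
    values[padding.length..] + padding
  end
end

p shift_values([1, 3, 5], 1)                 # [nil, 1, 3]
p shift_values([1, 3, 5], -1)                # [3, 5, nil]
p shift_values([1, 3, 5], 1, fill_value: 0)  # [0, 1, 3]
```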

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3812

def shift(n = 1, fill_value: nil)
  if !fill_value.nil?
    fill_value = Utils.parse_into_expression(fill_value, str_as_lit: true)
  end
  n = Utils.parse_into_expression(n)
  _from_rbldf(_ldf.shift(n, fill_value))
end

#show_graph(optimized: true, show: true, output_path: nil, raw_output: false, figsize: [16.0, 12.0], engine: "auto", plan_stage: "ir", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object

Show a plot of the query plan.

Note that Graphviz must be installed to render the visualization (if not already present, you can download it here: https://graphviz.org/download).

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
lf.group_by("a", maintain_order: true).agg(Polars.all.sum).sort(
  "a"
).show_graph

Parameters:

  • optimized (Boolean) (defaults to: true)

    Optimize the query plan.

  • show (Boolean) (defaults to: true)

    Show the figure.

  • output_path (String) (defaults to: nil)

    Write the figure to disk.

  • raw_output (Boolean) (defaults to: false)

    Return dot syntax. This cannot be combined with show and/or output_path.

  • engine (String) (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine.

  • plan_stage ('ir', 'physical') (defaults to: "ir")

    Select the stage to display. Currently only the streaming engine has a separate physical stage, for the other engines both IR and physical are the same.

  • optimizations (Object) (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The set of the optimizations considered during query optimization.

Returns:

  • (Object)



# File 'lib/polars/lazy_frame.rb', line 612

def show_graph(
  optimized: true,
  show: true,
  output_path: nil,
  raw_output: false,
  figsize: [16.0, 12.0],
  engine: "auto",
  plan_stage: "ir",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if engine == "streaming"
    issue_unstable_warning("streaming mode is considered unstable.")
  end

  optimizations = optimizations.dup
  optimizations._rboptflags.streaming = engine == "streaming"
  _ldf = self._ldf.with_optimizations(optimizations._rboptflags)

  if plan_stage == "ir"
    dot = _ldf.to_dot(optimized)
  elsif plan_stage == "physical"
    if engine == "streaming"
      dot = _ldf.to_dot_streaming_phys(optimized)
    else
      dot = _ldf.to_dot(optimized)
    end
  else
    error_msg = "invalid plan stage '#{plan_stage}'"
    raise TypeError, error_msg
  end

  Utils.display_dot_graph(
    dot: dot,
    show: show,
    output_path: output_path,
    raw_output: raw_output
  )
end

#sink_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, &function) ⇒ Object

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Note:

This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.

Evaluate the query and call a user-defined function for every ready batch.

This allows streaming results that are larger than RAM in certain cases.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_batches { |df| p df }

Parameters:

  • chunk_size (Integer) (defaults to: nil)

    The number of rows that are buffered before the callback is called.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (String) (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (Object) (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.
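
The idea behind chunk_size, buffering rows and handing each full buffer to the block, can be pictured with the plain-Ruby sketch below. It is a simplified model under stated assumptions: each_batch is a hypothetical helper, and the real engine decides batch boundaries itself and may deliver batches of other sizes.

```ruby
# Hypothetical model of batch delivery: rows are buffered until
# chunk_size is reached, then handed to the block. The final batch
# may be smaller than chunk_size.
def each_batch(rows, chunk_size:)
  rows.each_slice(chunk_size) { |batch| yield batch }
  nil
end

sizes = []
each_batch((1..10).to_a, chunk_size: 4) { |batch| sizes << batch.length }
p sizes # [4, 4, 2]
```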

Returns:

  • (Object)



# File 'lib/polars/lazy_frame.rb', line 1813

def sink_batches(
  chunk_size: nil,
  maintain_order: true,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS,
  &function
)
  _wrap = lambda do |rbdf|
    df = Utils.wrap_df(rbdf)
    !!function.(df)
  end

  ldf = _ldf.sink_batches(
    _wrap,
    maintain_order,
    chunk_size
  )

  if !lazy
    ldf = ldf.with_optimizations(optimizations._rboptflags)
    lf = LazyFrame._from_rbldf(ldf)
    lf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf)
end

#sink_csv(path, include_bom: false, compression: "uncompressed", compression_level: nil, check_extension: true, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame

Evaluate the query in streaming mode and write to a CSV file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_csv("out.csv")

Parameters:

  • path (String)

    File path to which the file should be written.

  • include_bom (Boolean) (defaults to: false)

    Whether to include UTF-8 BOM in the CSV output.

  • include_header (Boolean) (defaults to: true)

    Whether to include header in the CSV output.

  • separator (String) (defaults to: ",")

    Separate CSV fields with this symbol.

  • line_terminator (String) (defaults to: "\n")

    String used to end each row.

  • quote_char (String) (defaults to: '"')

    Byte to use as quoting character.

  • batch_size (Integer) (defaults to: 1024)

    Number of rows that will be processed per thread.

  • datetime_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html). If no format is specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame's Datetime cols (if any).

  • date_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html).

  • time_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html).

  • float_scientific (Integer) (defaults to: nil)

    Whether to use scientific form always (true), never (false), or automatically (nil) for Float32 and Float64 datatypes.

  • float_precision (Integer) (defaults to: nil)

    Number of decimal places to write, applied to both Float32 and Float64 datatypes.

  • decimal_comma (Boolean) (defaults to: false)

    Use a comma as the decimal separator instead of a point. Floats will be encapsulated in quotes if necessary; set the field separator to override.

  • null_value (String) (defaults to: nil)

    A string representing null values (defaulting to the empty string).

  • quote_style ("necessary", "always", "non_numeric", "never") (defaults to: nil)

    Determines the quoting strategy used.

    • necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field). This is the default.
    • always: This puts quotes around every field. Always.
    • never: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
    • non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren't strictly necessary.
  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • storage_options (Object) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails.

  • sync_on_close ('data', 'all') (defaults to: nil)

    Sync to disk before closing a file.

    • nil does not sync.
    • data syncs the file contents.
    • all syncs the file contents and metadata.
  • mkdir (Boolean) (defaults to: false)

    Recursively create all the directories in the path.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.
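
The four quote_style strategies described above can be compared with a small plain-Ruby model of field quoting. This is only a sketch of the documented rules, not the Polars CSV writer: quote_field is a hypothetical helper, and details such as the handling of empty or null fields are deliberately left out.

```ruby
# Illustrative quoting rules only; Polars writes CSV natively.
def quote_field(field, style: "necessary", separator: ",", quote_char: '"')
  text = field.to_s
  # Embedded quote characters are escaped by doubling them.
  quoted = quote_char + text.gsub(quote_char, quote_char * 2) + quote_char
  needs_quotes = [separator, quote_char, "\n", "\r"].any? { |c| text.include?(c) }

  case style
  when "always" then quoted
  when "never" then text
  when "non_numeric"
    numeric = !Float(text, exception: false).nil?
    numeric && !needs_quotes ? text : quoted
  when "necessary" then needs_quotes ? quoted : text
  else raise ArgumentError, "unknown quote_style #{style.inspect}"
  end
end

puts quote_field("a,b")                        # "a,b"
puts quote_field("abc")                        # abc
puts quote_field("abc", style: "non_numeric")  # "abc"
puts quote_field(1.5, style: "non_numeric")    # 1.5
puts quote_field("a,b", style: "never")        # a,b
```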

Returns:



# File 'lib/polars/lazy_frame.rb', line 1580

def sink_csv(
  path,
  include_bom: false,
  compression: "uncompressed",
  compression_level: nil,
  check_extension: true,
  include_header: true,
  separator: ",",
  line_terminator: "\n",
  quote_char: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_scientific: nil,
  float_precision: nil,
  decimal_comma: false,
  null_value: nil,
  quote_style: nil,
  maintain_order: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  sync_on_close: nil,
  mkdir: false,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, false)
  engine = _select_engine(engine)

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  credential_provider_builder = _init_credential_provider_builder.(
    credential_provider, path, storage_options, "sink_csv"
  )

  target = _to_sink_target(path)

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  sink_options = SinkOptions.new(
    mkdir: mkdir,
    maintain_order: maintain_order,
    sync_on_close: sync_on_close,
    storage_options: storage_options,
    credential_provider: credential_provider_builder
  )

  ldf_rb = _ldf.sink_csv(
    target,
    sink_options,
    include_bom,
    compression,
    compression_level,
    check_extension,
    include_header,
    separator.ord,
    line_terminator,
    quote_char.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_scientific,
    float_precision,
    decimal_comma,
    null_value,
    quote_style
  )

  if !lazy
    ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags)
    ldf = LazyFrame._from_rbldf(ldf_rb)
    ldf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf_rb)
end

#sink_delta(target, mode: "error", storage_options: nil, credential_provider: "auto", delta_write_options: nil, delta_merge_options: nil, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Sink DataFrame as delta table.

Examples:

Sink a larger-than-memory dataset to a Delta Lake table.

lf = Polars.scan_parquet(
  "/path/to/my_larger_than_ram_file.parquet"
)
table_path = "/path/to/delta-table/"
lf.sink_delta(table_path)

Parameters:

  • target (Object)

    URI of a table or a DeltaTable object.

  • mode ('error', 'append', 'overwrite', 'ignore', 'merge') (defaults to: "error")

    How to handle existing data.

    • If 'error', throw an error if the table already exists (default).
    • If 'append', will add new data.
    • If 'overwrite', will replace table with new data.
    • If 'ignore', will not write anything if table already exists.
    • If 'merge', return a TableMerger object to merge data from the DataFrame with the existing data.
  • storage_options (Object) (defaults to: nil)

    Extra options for the storage backends supported by deltalake. For cloud storages, this may include configurations for authentication etc.

    • See a list of supported storage options for S3 here.
    • See a list of supported storage options for GCS here.
    • See a list of supported storage options for Azure here.
  • credential_provider (Object) (defaults to: "auto")

    Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

  • delta_write_options (Hash) (defaults to: nil)

    Additional keyword arguments while writing a Delta lake Table. See a list of supported write options here.

  • delta_merge_options (Hash) (defaults to: nil)

    Keyword arguments which are required to MERGE a Delta lake Table. See a list of supported merge options here.

  • optimizations (Object) (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

Returns:

  • (Object)



# File 'lib/polars/lazy_frame.rb', line 1278

def sink_delta(
  target,
  mode: "error",
  storage_options: nil,
  credential_provider: "auto",
  delta_write_options: nil,
  delta_merge_options: nil,
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  Polars.send(:_check_if_delta_available)

  # TODO
  # _check_for_unsupported_types(collect_schema.dtypes)

  if Utils.pathlike?(target)
    target = Polars.send(:_resolve_delta_lake_uri, target.to_s, strict: false)
  end

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  if !target.is_a?(DeltaLake::Table)
    credential_provider_builder = _init_credential_provider_builder.(
      credential_provider, target, storage_options, "sink_delta"
    )
  elsif !credential_provider.nil? && credential_provider != "auto"
    msg = "cannot use credential_provider when passing a DeltaTable object"
    raise ArgumentError, msg
  else
    credential_provider_builder = nil
  end

  credential_provider_creds = {}

  if credential_provider_builder
    raise Todo
  end

  # We aren't calling into polars-native write functions so we just update
  # the storage_options here.
  storage_options =
    if !storage_options.nil? || !credential_provider_builder.nil?
      (storage_options || {}).merge(credential_provider_creds)
    else
      nil
    end

  stream = collect_batches(
    engine: "streaming",
    maintain_order: true,
    chunk_size: nil,
    lazy: true,
    optimizations: optimizations
  )

  if mode == "merge"
    if delta_merge_options.nil?
      msg = "you need to pass delta_merge_options with at least a given predicate for `MERGE` to work."
      raise ArgumentError, msg
    end
    if target.is_a?(::String)
      dt = DeltaLake::Table.new(target, storage_options: storage_options)
    else
      dt = target
    end

    dt.merge(stream, **delta_merge_options)
  else
    if delta_write_options.nil?
      delta_write_options = {}
    end

    DeltaLake.write(
      target,
      stream,
      mode: mode,
      storage_options: storage_options,
      **delta_write_options
    )
    nil
  end
end
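
The storage_options branch near the end of sink_delta can be read in isolation. The following is a minimal plain-Ruby sketch of that branch only; resolve_storage_options is a hypothetical helper name (not part of the library), and the keys in the sample hashes are made up for illustration.

```ruby
# Hypothetical helper mirroring the storage_options branch in sink_delta.
# Returns nil when neither explicit options nor a credential provider
# builder are present; otherwise merges the provider credentials over the
# user-supplied options.
def resolve_storage_options(storage_options, credential_provider_builder, credential_provider_creds = {})
  if !storage_options.nil? || !credential_provider_builder.nil?
    (storage_options || {}).merge(credential_provider_creds)
  else
    nil
  end
end

# Illustrative inputs only; these keys are not real deltalake options.
no_options = resolve_storage_options(nil, nil)
explicit   = resolve_storage_options({"example_key" => "value"}, nil)
with_creds = resolve_storage_options({"example_key" => "value"}, :builder, {"example_token" => "secret"})
```

Because Hash#merge gives precedence to its argument, provider credentials would override a user-supplied key of the same name.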

#sink_ipc(path, compression: "uncompressed", compat_level: nil, record_batch_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _record_batch_statistics: false) ⇒ DataFrame

Evaluate the query in streaming mode and write to an IPC file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_ipc("out.arrow")

Parameters:

  • path (String)

    File path to which the file should be written.

  • compression ("lz4", "zstd") (defaults to: "uncompressed")

    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • storage_options (Hash) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

    The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

    • aws
    • gcp
    • azure
    • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

    If storage_options is not provided, Polars will try to infer the information from environment variables.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails.

  • sync_on_close ('data', 'all') (defaults to: nil)

    Sync to disk before closing a file.

    • nil does not sync.
    • data syncs the file contents.
    • all syncs the file contents and metadata.
  • mkdir (Boolean) (defaults to: false)

    Recursively create all the directories in the path.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.

Returns:



# File 'lib/polars/lazy_frame.rb', line 1415

def sink_ipc(
  path,
  compression: "uncompressed",
  compat_level: nil,
  record_batch_size: nil,
  maintain_order: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  sync_on_close: nil,
  mkdir: false,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS,
  _record_batch_statistics: false
)
  engine = _select_engine(engine)

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  credential_provider_builder = _init_credential_provider_builder.(
    credential_provider, path, storage_options, "sink_ipc"
  )

  target = _to_sink_target(path)

  compat_level_rb = nil
  if compat_level.nil?
    compat_level_rb = true
  else
    raise Todo
  end

  if compression.nil?
    compression = "uncompressed"
  end

  sink_options = SinkOptions.new(
    mkdir: mkdir,
    maintain_order: maintain_order,
    sync_on_close: sync_on_close,
    storage_options: storage_options,
    credential_provider: credential_provider_builder
  )

  ldf_rb = _ldf.sink_ipc(
    target,
    sink_options,
    compression,
    compat_level_rb,
    record_batch_size,
    _record_batch_statistics
  )

  if !lazy
    ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags)
    ldf = LazyFrame._from_rbldf(ldf_rb)
    ldf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf_rb)
end
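
The deprecated retries parameter is folded into storage_options["max_retries"] before the sink is built. The sketch below reproduces that shim in plain Ruby; fold_retries is a hypothetical name, and unlike the method above it copies the hash instead of mutating the caller's object.

```ruby
# Hypothetical helper mirroring the deprecation shim in sink_ipc:
# a non-nil `retries` is folded into storage_options["max_retries"].
def fold_retries(storage_options, retries)
  return storage_options if retries.nil?

  warn "the `retries` parameter is deprecated; specify 'max_retries' in `storage_options` instead."
  options = (storage_options || {}).dup # copy, so the caller's hash is untouched
  options["max_retries"] = retries
  options
end

migrated  = fold_retries(nil, 3)
untouched = fold_retries({"max_retries" => 5}, nil)
```

New code can avoid the warning by passing storage_options: {"max_retries" => 3} directly, as the deprecation message suggests.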

#sink_ndjson(path, compression: "uncompressed", compression_level: nil, check_extension: true, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame

Evaluate the query in streaming mode and write to an NDJSON file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_ndjson("out.ndjson")

Parameters:

  • path (String)

    File path to which the file should be written.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • storage_options (Hash) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

    The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

    • aws
    • gcp
    • azure
    • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

    If storage_options is not provided, Polars will try to infer the information from environment variables.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails.

  • sync_on_close ('data', 'all') (defaults to: nil)

    Sync to disk before closing a file.

    • nil does not sync.
    • data syncs the file contents.
    • all syncs the file contents and metadata.
  • mkdir (Boolean) (defaults to: false)

    Recursively create all the directories in the path.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.

Returns:



# File 'lib/polars/lazy_frame.rb', line 1719

def sink_ndjson(
  path,
  compression: "uncompressed",
  compression_level: nil,
  check_extension: true,
  maintain_order: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  sync_on_close: nil,
  mkdir: false,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  credential_provider_builder = _init_credential_provider_builder.(
    credential_provider, path, storage_options, "sink_ndjson"
  )

  target = _to_sink_target(path)

  sink_options = SinkOptions.new(
    mkdir: mkdir,
    maintain_order: maintain_order,
    sync_on_close: sync_on_close,
    storage_options: storage_options,
    credential_provider: credential_provider_builder
  )

  ldf_rb = _ldf.sink_ndjson(
    target,
    compression,
    compression_level,
    check_extension,
    sink_options
  )

  if !lazy
    ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags)
    ldf = LazyFrame._from_rbldf(ldf_rb)
    ldf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf_rb)
end
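
For readers unfamiliar with the output format: NDJSON stores one JSON object per line. The sketch below uses only Ruby's standard library with made-up rows to show the shape of such a file; it does not call Polars, and the exact key order and number formatting Polars emits are not shown here.

```ruby
require "json"
require "stringio"

# Illustrative rows; a real sink streams batches from the query instead.
rows = [
  {"a" => 1, "b" => "x"},
  {"a" => 2, "b" => nil}
]

# NDJSON: one JSON object per line, each terminated by a newline.
buffer = StringIO.new
rows.each { |row| buffer.puts(JSON.generate(row)) }
ndjson = buffer.string

# Reading it back is a line-by-line parse.
parsed = ndjson.each_line.map { |line| JSON.parse(line) }
```

Because every line is an independent document, a writer can append rows without holding the whole dataset in memory, which is what makes the format a natural fit for a streaming sink.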

#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, metadata: nil, mkdir: false, lazy: false, arrow_schema: nil, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _sinked_paths_callback: nil) ⇒ DataFrame

Persists a LazyFrame at the provided path.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_parquet("out.parquet")

Parameters:

  • path (String)

    File path to which the file should be written.

  • compression ("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd") (defaults to: "zstd")

    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression. Choose "snappy" for more backwards compatibility guarantees when you deal with older parquet readers.

  • compression_level (Integer) (defaults to: nil)

    The level of compression to use. Higher compression means smaller files on disk.

    • "gzip" : min-level: 0, max-level: 10.
    • "brotli" : min-level: 0, max-level: 11.
    • "zstd" : min-level: 1, max-level: 22.
  • statistics (Boolean) (defaults to: true)

    Write statistics to the parquet headers. This requires extra compute.

  • row_group_size (Integer) (defaults to: nil)

    Size of the row groups in number of rows. If nil (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.

  • data_page_size (Integer) (defaults to: nil)

    Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • storage_options (Hash) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

    The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

    • aws
    • gcp
    • azure
    • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

    If storage_options is not provided, Polars will try to infer the information from environment variables.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails.

  • sync_on_close ('data', 'all') (defaults to: nil)

    Sync to disk before closing a file.

    • nil does not sync.
    • data syncs the file contents.
    • all syncs the file contents and metadata.
  • mkdir (Boolean) (defaults to: false)

    Recursively create all the directories in the path.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.

Returns:



# File 'lib/polars/lazy_frame.rb', line 1148

def sink_parquet(
  path,
  compression: "zstd",
  compression_level: nil,
  statistics: true,
  row_group_size: nil,
  data_page_size: nil,
  maintain_order: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  sync_on_close: nil,
  metadata: nil,
  mkdir: false,
  lazy: false,
  arrow_schema: nil,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS,
  _sinked_paths_callback: nil
)
  engine = _select_engine(engine)

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  credential_provider_builder = _init_credential_provider_builder.(
    credential_provider, path, storage_options, "sink_parquet"
  )

  target = _to_sink_target(path)

  sink_options = SinkOptions.new(
    mkdir: mkdir,
    maintain_order: maintain_order,
    sync_on_close: sync_on_close,
    storage_options: storage_options,
    credential_provider: credential_provider_builder,
    sinked_paths_callback: _sinked_paths_callback
  )

  ldf_rb = _ldf.sink_parquet(
    target,
    sink_options,
    compression,
    compression_level,
    statistics,
    row_group_size,
    data_page_size,
    metadata,
    arrow_schema
  )

  if !lazy
    ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags)
    ldf = LazyFrame._from_rbldf(ldf_rb)
    ldf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf_rb)
end
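
The statistics argument accepts true, false, or "full" and is expanded into a hash of flags before being handed to the native writer. A minimal plain-Ruby sketch of that expansion follows; expand_statistics is a hypothetical helper name, and the pass-through branch assumes any other value is already a hash of flags.

```ruby
# Hypothetical helper mirroring how sink_parquet expands `statistics`.
def expand_statistics(statistics)
  case statistics
  when true
    {min: true, max: true, distinct_count: false, null_count: true}
  when false
    {}
  when "full"
    {min: true, max: true, distinct_count: true, null_count: true}
  else
    statistics # assumed to already be a hash of flags; passed through as-is
  end
end

default_stats = expand_statistics(true)
full_stats    = expand_statistics("full")
```

The only difference between true and "full" is distinct_count, which the default leaves disabled.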

#slice(offset, length = nil) ⇒ LazyFrame

Get a slice of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
).lazy
df.slice(1, 2).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ y   ┆ 3   ┆ 4   │
# │ z   ┆ 5   ┆ 6   │
# └─────┴─────┴─────┘

Parameters:

  • offset (Integer)

    Start index. Negative indexing is supported.

  • length (Integer) (defaults to: nil)

    Length of the slice. If set to nil, all rows starting at the offset will be selected.

Returns:



# File 'lib/polars/lazy_frame.rb', line 3849

def slice(offset, length = nil)
  if length && length < 0
    raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame"
  end
  _from_rbldf(_ldf.slice(offset, length))
end
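
The offset and length semantics can be sketched on a plain Ruby array. slice_rows below is a hypothetical helper, not library code; it copies the negative-length check from the method above, and its handling of out-of-range offsets (returning an empty array) is an assumption that may differ from Polars.

```ruby
# Hypothetical array-based analogue of LazyFrame#slice.
def slice_rows(rows, offset, length = nil)
  if length && length < 0
    raise ArgumentError, "Negative slice lengths (#{length}) are invalid"
  end
  # A nil length selects everything from the offset to the end.
  selected = length.nil? ? rows[offset..] : rows[offset, length]
  selected || [] # Array#[] returns nil for an out-of-range start
end

middle = slice_rows(%w[x y z], 1, 2)
tail   = slice_rows(%w[x y z], -2)
```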

#sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame

Sort the DataFrame.

Sorting can be done by:

  • A single column name
  • An expression
  • Multiple expressions

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
df.sort("foo", descending: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

Parameters:

  • by (Object)

    Column (expressions) to sort by.

  • more_by (Array)

    Additional columns to sort by, specified as positional arguments.

  • descending (Boolean) (defaults to: false)

    Sort in descending order.

  • nulls_last (Boolean) (defaults to: false)

    Place null values last. Can only be used if sorted by a single column.

  • maintain_order (Boolean) (defaults to: false)

    Whether the order should be maintained if elements are equal. Note that if true, streaming is not possible and performance might be worse, since this requires a stable sort.

  • multithreaded (Boolean) (defaults to: true)

    Sort using multiple threads.

Returns:



# File 'lib/polars/lazy_frame.rb', line 698

def sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true)
  if by.is_a?(::String) && more_by.empty?
    return _from_rbldf(
      _ldf.sort(
        by, descending, nulls_last, maintain_order, multithreaded
      )
    )
  end

  by = Utils.parse_into_list_of_expressions(by, *more_by)
  descending = Utils.extend_bool(descending, by.length, "descending", "by")
  nulls_last = Utils.extend_bool(nulls_last, by.length, "nulls_last", "by")
  _from_rbldf(
    _ldf.sort_by_exprs(
      by, descending, nulls_last, maintain_order, multithreaded
    )
  )
end
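
When several sort keys are given, a single descending or nulls_last flag is broadcast to one flag per key. The sketch below is a plain-Ruby illustration of that idea; extend_bool here is a hypothetical stand-in for Utils.extend_bool (the real implementation and error message may differ), and sort_rows is an illustrative helper that does not handle nil values.

```ruby
# Hypothetical stand-in for Utils.extend_bool: broadcast a single flag,
# or validate the length of an explicit list of flags.
def extend_bool(value, n_match, value_name, match_name)
  values = value.is_a?(Array) ? value : [value] * n_match
  if values.length != n_match
    raise ArgumentError,
      "the length of `#{value_name}` (#{values.length}) does not match the length of `#{match_name}` (#{n_match})"
  end
  values
end

# Plain-Ruby multi-key sort over hashes, applying one flag per key.
def sort_rows(rows, by, descending: false)
  flags = extend_bool(descending, by.length, "descending", "by")
  rows.sort do |left, right|
    by.each_with_index.reduce(0) do |cmp, (key, i)|
      next cmp unless cmp.zero? # an earlier key already decided the order
      result = left[key] <=> right[key]
      flags[i] ? -result : result
    end
  end
end

rows = [
  {"foo" => 1, "ham" => "b"},
  {"foo" => 2, "ham" => "a"},
  {"foo" => 1, "ham" => "a"}
]
sorted = sort_rows(rows, ["foo", "ham"], descending: [true, false])
```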

#sql(query, table_name: "self") ⇒ LazyFrame

Note:

This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.

Note:
  • The calling frame is automatically registered as a table in the SQL context under the name "self". If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level Polars.sql.
  • More control over registration and execution behaviour is available by using the SQLContext object.

Execute a SQL query against the LazyFrame.

Examples:

Query the LazyFrame using SQL:

lf1 = Polars::LazyFrame.new({"a" => [1, 2, 3], "b" => [6, 7, 8], "c" => ["z", "y", "x"]})
lf2 = Polars::LazyFrame.new({"a" => [3, 2, 1], "d" => [125, -654, 888]})
lf1.sql("SELECT c, b FROM self WHERE a > 1").collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ c   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ y   ┆ 7   │
# │ x   ┆ 8   │
# └─────┴─────┘

Apply SQL transforms (aliasing "self" to "frame") then filter natively (you can freely mix SQL and native operations):

lf1.sql(
  "
    SELECT
        a,
        (a % 2 == 0) AS a_is_even,
        (b::float4 / 2) AS \"b/2\",
        CONCAT_WS(':', c, c, c) AS c_c_c
    FROM frame
    ORDER BY a
  ",
  table_name: "frame",
).filter(~Polars.col("c_c_c").str.starts_with("x")).collect
# =>
# shape: (2, 4)
# ┌─────┬───────────┬─────┬───────┐
# │ a   ┆ a_is_even ┆ b/2 ┆ c_c_c │
# │ --- ┆ ---       ┆ --- ┆ ---   │
# │ i64 ┆ bool      ┆ f32 ┆ str   │
# ╞═════╪═══════════╪═════╪═══════╡
# │ 1   ┆ false     ┆ 3.0 ┆ z:z:z │
# │ 2   ┆ true      ┆ 3.5 ┆ y:y:y │
# └─────┴───────────┴─────┴───────┘

Parameters:

  • query (String)

    SQL query to execute.

  • table_name (String) (defaults to: "self")

    Optionally provide an explicit name for the table that represents the calling frame (defaults to "self").

Returns:



# File 'lib/polars/lazy_frame.rb', line 777

def sql(query, table_name: "self")
  ctx = Polars::SQLContext.new
  name = table_name || "self"
  ctx.register(name, self)
  ctx.execute(query)
end

#std(ddof: 1) ⇒ LazyFrame

Aggregate the columns in the DataFrame to their standard deviation value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.std.collect
# =>
# shape: (1, 2)
# ┌──────────┬─────┐
# │ a        ┆ b   │
# │ ---      ┆ --- │
# │ f64      ┆ f64 │
# ╞══════════╪═════╡
# │ 1.290994 ┆ 0.5 │
# └──────────┴─────┘
df.std(ddof: 0).collect
# =>
# shape: (1, 2)
# ┌──────────┬──────────┐
# │ a        ┆ b        │
# │ ---      ┆ ---      │
# │ f64      ┆ f64      │
# ╞══════════╪══════════╡
# │ 1.118034 ┆ 0.433013 │
# └──────────┴──────────┘

Returns:



# File 'lib/polars/lazy_frame.rb', line 4284

def std(ddof: 1)
  _from_rbldf(_ldf.std(ddof))
end
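
The ddof (delta degrees of freedom) argument sets the divisor of the variance to n - ddof. The plain-Ruby sketch below, which does not use Polars, shows the arithmetic behind the two example outputs for column a; the top-level std method is defined here only for illustration.

```ruby
# Standard deviation with a configurable divisor of n - ddof.
def std(values, ddof: 1)
  n = values.length
  mean = values.sum.to_f / n
  sum_of_squares = values.sum { |v| (v - mean)**2 }
  Math.sqrt(sum_of_squares / (n - ddof))
end

sample     = std([1, 2, 3, 4])          # divides by n - 1 = 3
population = std([1, 2, 3, 4], ddof: 0) # divides by n = 4
```

With ddof: 1 the sum of squared deviations (5.0) is divided by 3, giving roughly 1.290994; with ddof: 0 it is divided by 4, giving roughly 1.118034.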

#sum ⇒ LazyFrame

Aggregate the columns in the DataFrame to their sum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.sum.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 5   │
# └─────┴─────┘

Returns:



# File 'lib/polars/lazy_frame.rb', line 4376

def sum
  _from_rbldf(_ldf.sum)
end

#tail(n = 5) ⇒ LazyFrame

Get the last n rows.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.tail.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# │ 6   ┆ 12  │
# └─────┴─────┘
lf.tail(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 5   ┆ 11  │
# │ 6   ┆ 12  │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows.

Returns:



# File 'lib/polars/lazy_frame.rb', line 3989

def tail(n = 5)
  _from_rbldf(_ldf.tail(n))
end

#to_s ⇒ String

Returns a string representing the LazyFrame.

Returns:



# File 'lib/polars/lazy_frame.rb', line 176

def to_s
  <<~EOS
    naive plan: (run LazyFrame#explain(optimized: true) to see the optimized plan)

    #{explain(optimized: false)}
  EOS
end

#top_k(k, by:, reverse: false) ⇒ LazyFrame

Return the k largest rows.

Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you wish the output to be sorted.

Examples:

Get the rows which contain the 4 largest values in column b.

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [2, 1, 1, 3, 2, 1]
  }
)
lf.top_k(4, by: "b").collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 3   │
# │ a   ┆ 2   │
# │ b   ┆ 2   │
# │ b   ┆ 1   │
# └─────┴─────┘

Get the rows which contain the 4 largest values when sorting on column b and a.

lf.top_k(4, by: ["b", "a"]).collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 2   │
# │ c   ┆ 1   │
# └─────┴─────┘

Parameters:

  • k (Integer)

    Number of rows to return.

  • by (Object)

    Column(s) used to determine the top rows. Accepts expression input. Strings are parsed as column names.

  • reverse (Object) (defaults to: false)

    Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing an array of booleans.

Returns:



# File 'lib/polars/lazy_frame.rb', line 838

def top_k(
  k,
  by:,
  reverse: false
)
  by = Utils.parse_into_list_of_expressions(by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  _from_rbldf(_ldf.top_k(k, by, reverse))
end
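
The selection rule can be illustrated on plain Ruby hashes with Enumerable#max_by and #min_by, using the same data as the examples above. This is only an analogy, not how Polars computes the result, and it ignores null handling.

```ruby
rows = [
  {"a" => "a", "b" => 2},
  {"a" => "b", "b" => 1},
  {"a" => "a", "b" => 1},
  {"a" => "b", "b" => 3},
  {"a" => "b", "b" => 2},
  {"a" => "c", "b" => 1}
]

# k largest by a single column; which b == 1 row is kept is not specified.
top_by_b = rows.max_by(4) { |row| row["b"] }

# Several columns: compare [b, a] pairs, so ties on b are broken by a.
top_by_b_then_a = rows.max_by(4) { |row| [row["b"], row["a"]] }

# reverse: true corresponds to taking the k smallest instead.
bottom_by_b = rows.min_by(2) { |row| row["b"] }
```

The tie among the three rows with b == 1 is why the single-column result may contain any one of them, while adding a as a second key makes the selection deterministic.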

#unique(maintain_order: false, subset: nil, keep: "any") ⇒ LazyFrame

Drop duplicate rows from this DataFrame.

Note that this fails if there is a column of type List in the DataFrame or subset.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3, 1],
    "bar" => ["a", "a", "a", "a"],
    "ham" => ["b", "b", "b", "b"]
  }
)
lf.unique(maintain_order: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ a   ┆ b   │
# │ 2   ┆ a   ┆ b   │
# │ 3   ┆ a   ┆ b   │
# └─────┴─────┴─────┘
lf.unique(subset: ["bar", "ham"], maintain_order: true).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ a   ┆ b   │
# └─────┴─────┴─────┘
lf.unique(keep: "last", maintain_order: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 2   ┆ a   ┆ b   │
# │ 3   ┆ a   ┆ b   │
# │ 1   ┆ a   ┆ b   │
# └─────┴─────┴─────┘

Parameters:

  • maintain_order (Boolean) (defaults to: false)

    Keep the same order as the original DataFrame. This requires more work to compute.

  • subset (Object) (defaults to: nil)

    Subset to use to compare rows.

  • keep ("first", "last") (defaults to: "any")

    Which of the duplicate rows to keep.

Returns:



# File 'lib/polars/lazy_frame.rb', line 4583

def unique(maintain_order: false, subset: nil, keep: "any")
  parsed_subset = nil
  if !subset.nil?
    parsed_subset = Utils.parse_into_list_of_expressions(subset, __require_selectors: true)
  end
  _from_rbldf(_ldf.unique(maintain_order, parsed_subset, keep))
end
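
The effect of subset and keep can be sketched with Array#uniq on plain hashes. unique_rows is a hypothetical helper that always preserves input order (as with maintain_order: true) and treats "any" like "first", which is one valid choice but not necessarily the row Polars would pick.

```ruby
# Hypothetical array-based analogue of LazyFrame#unique.
def unique_rows(rows, subset: nil, keep: "any")
  # Rows are duplicates when they agree on the subset (or on every column).
  key = ->(row) { subset ? row.values_at(*subset) : row }
  case keep
  when "any", "first"
    rows.uniq(&key)
  when "last"
    # Reverse so uniq keeps the last occurrence, then restore the order.
    rows.reverse.uniq(&key).reverse
  else
    raise ArgumentError, "unsupported keep: #{keep.inspect}"
  end
end

rows = [1, 2, 3, 1].map { |foo| {"foo" => foo, "bar" => "a", "ham" => "b"} }

first_kept  = unique_rows(rows).map { |r| r["foo"] }
last_kept   = unique_rows(rows, keep: "last").map { |r| r["foo"] }
subset_kept = unique_rows(rows, subset: ["bar", "ham"]).map { |r| r["foo"] }
```

Keeping the last occurrence moves foo == 1 to the end, which matches the row order in the keep: "last" example above.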

#unnest(columns = nil, *more_columns, separator: nil) ⇒ LazyFrame

Decompose a struct into its fields.

The fields will be inserted into the DataFrame on the location of the struct type.

Examples:

df = (
  Polars::DataFrame.new(
    {
      "before" => ["foo", "bar"],
      "t_a" => [1, 2],
      "t_b" => ["a", "b"],
      "t_c" => [true, nil],
      "t_d" => [[1, 2], [3]],
      "after" => ["baz", "womp"]
    }
  )
  .lazy
  .select(
    ["before", Polars.struct(Polars.col("^t_.$")).alias("t_struct"), "after"]
  )
)
df.collect
# =>
# shape: (2, 3)
# ┌────────┬─────────────────────┬───────┐
# │ before ┆ t_struct            ┆ after │
# │ ---    ┆ ---                 ┆ ---   │
# │ str    ┆ struct[4]           ┆ str   │
# ╞════════╪═════════════════════╪═══════╡
# │ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
# │ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
# └────────┴─────────────────────┴───────┘
df.unnest("t_struct").collect
# =>
# shape: (2, 6)
# ┌────────┬─────┬─────┬──────┬───────────┬───────┐
# │ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
# │ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
# │ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
# ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
# │ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
# │ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
# └────────┴─────┴─────┴──────┴───────────┴───────┘

Parameters:

  • columns (Object) (defaults to: nil)

    Names of the struct columns that will be decomposed into their fields.

  • more_columns (Array)

    Additional columns to unnest, specified as positional arguments.

  • separator (String) (defaults to: nil)

    Rename output column names as combination of the struct column name, name separator and field name.

Returns:



# File 'lib/polars/lazy_frame.rb', line 5102

def unnest(columns = nil, *more_columns, separator: nil)
  subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector(
    more_columns
  )
  _from_rbldf(_ldf.unnest(subset._rbselector, separator))
end
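
On a single row represented as a nested Ruby hash, unnesting amounts to splicing the struct's fields in at the struct's position. unnest_row is a hypothetical helper for illustration; the column naming used with separator is an assumption based on the parameter description (struct column name, separator, field name).

```ruby
# Hypothetical hash-based analogue of unnest for one row.
def unnest_row(row, column, separator: nil)
  row.each_with_object({}) do |(name, value), out|
    if name == column
      # Splice the struct fields in at the position of the struct column.
      value.each do |field, field_value|
        out_name = separator ? "#{column}#{separator}#{field}" : field
        out[out_name] = field_value
      end
    else
      out[name] = value
    end
  end
end

row = {"before" => "foo", "t_struct" => {"t_a" => 1, "t_b" => "a"}, "after" => "baz"}

flat     = unnest_row(row, "t_struct")
prefixed = unnest_row(row, "t_struct", separator: ".")
```

Ruby hashes preserve insertion order, so the fields end up between "before" and "after", matching the column order shown in the example above.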

#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
)
lf.unpivot(Polars.cs.numeric, index: "a").collect
# =>
# shape: (6, 3)
# ┌─────┬──────────┬───────┐
# │ a   ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ x   ┆ b        ┆ 1     │
# │ y   ┆ b        ┆ 3     │
# │ z   ┆ b        ┆ 5     │
# │ x   ┆ c        ┆ 2     │
# │ y   ┆ c        ┆ 4     │
# │ z   ┆ c        ┆ 6     │
# └─────┴──────────┴───────┘

Parameters:

  • on (Object) (defaults to: nil)

    Column(s) or selector(s) to use as values variables; if on is empty no columns will be used. If set to nil (default) all columns that are not in index will be used.

  • index (Object) (defaults to: nil)

    Column(s) or selector(s) to use as identifier variables.

  • variable_name (String) (defaults to: nil)

    Name to give to the variable column. Defaults to "variable"

  • value_name (String) (defaults to: nil)

    Name to give to the value column. Defaults to "value"

  • streamable (Boolean) (defaults to: true)

    Allow this node to run in the streaming engine. If this runs in streaming, the output of the unpivot operation will not have a stable ordering.

Returns:



# File 'lib/polars/lazy_frame.rb', line 4891

def unpivot(
  on = nil,
  index: nil,
  variable_name: nil,
  value_name: nil,
  streamable: true
)
  if !streamable
    warn "The `streamable` parameter for `LazyFrame.unpivot` is deprecated"
  end

  selector_on = on.nil? ? nil : Utils.parse_list_into_selector(on)._rbselector
  selector_index = index.nil? ? Selectors.empty : Utils.parse_list_into_selector(index)

  _from_rbldf(
    _ldf.unpivot(
      selector_on,
      selector_index._rbselector,
      value_name,
      variable_name
    )
  )
end

#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ LazyFrame

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Note:

This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls: false.

Update the values in this LazyFrame with the values in other.

Examples:

Update lf values with the non-null values in new_lf, by row index:

lf = Polars::LazyFrame.new(
  {
    "A" => [1, 2, 3, 4],
    "B" => [400, 500, 600, 700]
  }
)
new_lf = Polars::LazyFrame.new(
  {
    "B" => [-66, nil, -99],
    "C" => [5, 3, 1]
  }
)
lf.update(new_lf).collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -66 │
# │ 2   ┆ 500 │
# │ 3   ┆ -99 │
# │ 4   ┆ 700 │
# └─────┴─────┘

Update lf values with the non-null values in new_lf, by row index, keeping only the rows that are common to both frames:

lf.update(new_lf, how: "inner").collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -66 │
# │ 2   ┆ 500 │
# │ 3   ┆ -99 │
# └─────┴─────┘

Update lf values with the non-null values in new_lf, using a full outer join strategy that defines explicit join columns in each frame:

lf.update(new_lf, left_on: ["A"], right_on: ["C"], how: "full").collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -99 │
# │ 2   ┆ 500 │
# │ 3   ┆ 600 │
# │ 4   ┆ 700 │
# │ 5   ┆ -66 │
# └─────┴─────┘

Update lf values, including null values in new_lf, using a full outer join strategy that defines explicit join columns in each frame:

lf.update(
  new_lf, left_on: "A", right_on: "C", how: "full", include_nulls: true
).collect
# =>
# shape: (5, 2)
# ┌─────┬──────┐
# │ A   ┆ B    │
# │ --- ┆ ---  │
# │ i64 ┆ i64  │
# ╞═════╪══════╡
# │ 1   ┆ -99  │
# │ 2   ┆ 500  │
# │ 3   ┆ null │
# │ 4   ┆ 700  │
# │ 5   ┆ -66  │
# └─────┴──────┘

Parameters:

  • other (LazyFrame)

    LazyFrame whose values are used to update this frame.

  • on (Object) (defaults to: nil)

    Column names that will be joined on. If set to nil (default), the implicit row index of each frame is used as a join key.

  • how ('left', 'inner', 'full') (defaults to: "left")
    • 'left' will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row's key.
    • 'inner' keeps only those rows where the key exists in both frames.
    • 'full' will update existing rows where the key matches while also adding any new rows contained in the given frame.
  • left_on (Object) (defaults to: nil)

    Join column(s) of the left DataFrame.

  • right_on (Object) (defaults to: nil)

    Join column(s) of the right DataFrame.

  • include_nulls (Boolean) (defaults to: false)

    Overwrite values in the left frame with null values from the right frame. If set to false (default), null values in the right frame are ignored.

  • maintain_order ('none', 'left', 'right', 'left_right', 'right_left') (defaults to: "left")

    Which order of rows from the inputs to preserve. See LazyFrame.join for details. Unlike join this function preserves the left order by default.
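
The difference between the default coalescing behavior and include_nulls: true can be sketched in plain Ruby for a keyed left update. This is an illustration of the semantics described above under simplifying assumptions (unique keys, a single key column); `update_rows` is a hypothetical helper, not the Polars implementation.

```ruby
# Hypothetical helper (not part of Polars): left update of row hashes by key.
def update_rows(left, right, key:, include_nulls: false)
  right_by_key = right.to_h { |row| [row[key], row] }

  left.map do |row|
    match = right_by_key[row[key]]
    # No matching key in the right frame: keep the left row unchanged.
    next row.dup if match.nil?

    row.to_h do |name, value|
      if name == key || !match.key?(name)
        # Only columns present in both frames are updated.
        [name, value]
      elsif include_nulls
        # A matched right value always wins, even when it is nil.
        [name, match[name]]
      else
        # Coalesce: a nil right value falls back to the left value.
        [name, match[name].nil? ? value : match[name]]
      end
    end
  end
end

left = [{"A" => 1, "B" => 400}, {"A" => 2, "B" => 500}, {"A" => 3, "B" => 600}]
right = [{"A" => 1, "B" => -66}, {"A" => 2, "B" => nil}]

coalesced = update_rows(left, right, key: "A")
# => [{"A" => 1, "B" => -66}, {"A" => 2, "B" => 500}, {"A" => 3, "B" => 600}]
overwritten = update_rows(left, right, key: "A", include_nulls: true)
# => [{"A" => 1, "B" => -66}, {"A" => 2, "B" => nil}, {"A" => 3, "B" => 600}]
```

In both calls the row with key 3 keeps its left value because the right frame has no matching key; only a matched nil is treated differently.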

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 5311

def update(
  other,
  on: nil,
  how: "left",
  left_on: nil,
  right_on: nil,
  include_nulls: false,
  maintain_order: "left"
)
  Utils.require_same_type(self, other)
  if ["outer", "outer_coalesce"].include?(how)
    how = "full"
  end

  if !["left", "inner", "full"].include?(how)
    msg = "`how` must be one of {{'left', 'inner', 'full'}}; found #{how.inspect}"
    raise ArgumentError, msg
  end

  slf = self
  row_index_used = false
  if on.nil?
    if left_on.nil? && right_on.nil?
      # no keys provided--use row index
      row_index_used = true
      row_index_name = "__POLARS_ROW_INDEX"
      slf = slf.with_row_index(name: row_index_name)
      other = other.with_row_index(name: row_index_name)
      left_on = right_on = [row_index_name]
    else
      # one of left or right is missing, raise error
      if left_on.nil?
        msg = "missing join columns for left frame"
        raise ArgumentError, msg
      end
      if right_on.nil?
        msg = "missing join columns for right frame"
        raise ArgumentError, msg
      end
    end
  else
    # move on into left/right_on to simplify logic
    left_on = right_on = on
  end

  if left_on.is_a?(::String)
    left_on = [left_on]
  end
  if right_on.is_a?(::String)
    right_on = [right_on]
  end

  left_schema = slf.collect_schema
  left_on.each do |name|
    if !left_schema.include?(name)
      msg = "left join column #{name.inspect} not found"
      raise ArgumentError, msg
    end
  end
  right_schema = other.collect_schema
  right_on.each do |name|
    if !right_schema.include?(name)
      msg = "right join column #{name.inspect} not found"
      raise ArgumentError, msg
    end
  end

  # no need to join if *only* join columns are in other (inner/left update only)
  if how != "full" && right_schema.length == right_on.length
    if row_index_used
      return slf.drop(row_index_name)
    end
    return slf
  end

  # only use non-idx right columns present in left frame
  right_other = Set.new(right_schema.to_h.keys).intersection(left_schema.to_h.keys) - Set.new(right_on)

  # When include_nulls is true, we need to distinguish records after the join that
  # were originally null in the right frame, as opposed to records that were null
  # because the key was missing from the right frame.
  # Add a validity column to track whether row was matched or not.
  if include_nulls
    validity = ["__POLARS_VALIDITY"]
    other = other.with_columns(F.lit(true).alias(validity[0]))
  else
    validity = []
  end

  tmp_name = "__POLARS_RIGHT"
  drop_columns = right_other.map { |name| "#{name}#{tmp_name}" } + validity
  result = (
    slf.join(
      other.select(*right_on, *right_other, *validity),
      left_on: left_on,
      right_on: right_on,
      how: how,
      suffix: tmp_name,
      coalesce: true,
      maintain_order: maintain_order
    )
    .with_columns(
      right_other.map do |name|
        (
          if include_nulls
            # use left value only when right value failed to join
            F.when(F.col(validity).is_null)
            .then(F.col(name))
            .otherwise(F.col("#{name}#{tmp_name}"))
          else
            F.coalesce(["#{name}#{tmp_name}", F.col(name)])
          end
        ).alias(name)
      end
    )
    .drop(drop_columns)
  )
  if row_index_used
    result = result.drop(row_index_name)
  end

  _from_rbldf(result._ldf)
end

#var(ddof: 1) ⇒ LazyFrame

Aggregate the columns in the LazyFrame to their variance.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.var.collect
# =>
# shape: (1, 2)
# ┌──────────┬──────┐
# │ a        ┆ b    │
# │ ---      ┆ ---  │
# │ f64      ┆ f64  │
# ╞══════════╪══════╡
# │ 1.666667 ┆ 0.25 │
# └──────────┴──────┘
df.var(ddof: 0).collect
# =>
# shape: (1, 2)
# ┌──────┬────────┐
# │ a    ┆ b      │
# │ ---  ┆ ---    │
# │ f64  ┆ f64    │
# ╞══════╪════════╡
# │ 1.25 ┆ 0.1875 │
# └──────┴────────┘
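
The ddof parameter is not described on this page. Assuming it follows the usual "delta degrees of freedom" convention, where the sum of squared deviations is divided by N - ddof, the two outputs above can be reproduced with a plain-Ruby sketch. `variance` is a hypothetical helper for illustration only.

```ruby
# Hypothetical helper (not part of Polars): variance with divisor n - ddof.
def variance(values, ddof: 1)
  mean = values.sum.to_f / values.length
  squared_deviations = values.sum { |v| (v - mean)**2 }
  squared_deviations / (values.length - ddof)
end

sample = variance([1, 2, 3, 4])              # ddof: 1, divisor is n - 1
population = variance([1, 2, 3, 4], ddof: 0) # divisor is n

sample.round(6) # => 1.666667
population      # => 1.25
```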

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 4316

def var(ddof: 1)
  _from_rbldf(_ldf.var(ddof))
end

#width ⇒ Integer

Get the width of the LazyFrame.

Examples:

lf = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]}).lazy
lf.width
# => 2

Returns:

  • (Integer)


# File 'lib/polars/lazy_frame.rb', line 160

def width
  _ldf.collect_schema.length
end

#with_columns(*exprs, **named_exprs) ⇒ LazyFrame

Add or overwrite multiple columns in a LazyFrame.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
).lazy
ldf.with_columns(
  [
    (Polars.col("a") ** 2).alias("a^2"),
    (Polars.col("b") / 2).alias("b/2"),
    (Polars.col("c").is_not).alias("not c")
  ]
).collect
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘

Parameters:

  • exprs (Object)

    List of Expressions that evaluate to columns.

  • named_exprs (Hash)

    Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 3593

def with_columns(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(*exprs, **named_exprs, __structify: structify)

  _from_rbldf(_ldf.with_columns(rbexprs))
end

#with_columns_seq(*exprs, **named_exprs) ⇒ LazyFrame

Add columns to this LazyFrame.

Added columns will replace existing columns with the same name.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

  • exprs (Array)

    Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 3617

def with_columns_seq(
  *exprs,
  **named_exprs
)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.with_columns_seq(rbexprs))
end

#with_row_index(name: "index", offset: 0) ⇒ LazyFrame

Note:

This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.

Add a column at index 0 that counts the rows.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.with_row_index.collect
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ index ┆ a   ┆ b   │
# │ ---   ┆ --- ┆ --- │
# │ u32   ┆ i64 ┆ i64 │
# ╞═══════╪═════╪═════╡
# │ 0     ┆ 1   ┆ 2   │
# │ 1     ┆ 3   ┆ 4   │
# │ 2     ┆ 5   ┆ 6   │
# └───────┴─────┴─────┘

Parameters:

  • name (String) (defaults to: "index")

    Name of the column to add.

  • offset (Integer) (defaults to: 0)

    Start the row count at this offset.
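
The effect of name and offset can be sketched in plain Ruby on an array of row hashes. This only mirrors the documented behavior (a counter column placed first, starting at offset); `add_row_index` is a hypothetical helper, not part of Polars.

```ruby
# Hypothetical helper (not part of Polars): prepend a row counter to each row.
def add_row_index(rows, name: "index", offset: 0)
  rows.each_with_index.map do |row, i|
    # Building the hash with the counter first keeps it at column position 0.
    {name => i + offset}.merge(row)
  end
end

rows = [{"a" => 1, "b" => 2}, {"a" => 3, "b" => 4}, {"a" => 5, "b" => 6}]

indexed = add_row_index(rows, offset: 10)
indexed.map { |r| r["index"] } # => [10, 11, 12]
indexed.first.keys             # => ["index", "a", "b"]
```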

Returns:

  • (LazyFrame)


# File 'lib/polars/lazy_frame.rb', line 4075

def with_row_index(name: "index", offset: 0)
  _from_rbldf(_ldf.with_row_index(name, offset))
end