Class: Polars::LazyGroupBy

Inherits:
Object
Defined in:
lib/polars/lazy_group_by.rb

Overview

Created by df.lazy.group_by("foo").

Instance Method Details

#agg(*aggs, **named_aggs) ⇒ LazyFrame

Compute aggregations for each group of a group by operation.

Examples:

Compute the aggregation of the columns for each group.

ldf = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "c"],
    "b" => [1, 2, 1, 3, 3],
    "c" => [5, 4, 3, 2, 1]
  }
).lazy
ldf.group_by("a").agg(
  [Polars.col("b"), Polars.col("c")]
).collect
# =>
# shape: (3, 3)
# ┌─────┬───────────┬───────────┐
# │ a   ┆ b         ┆ c         │
# │ --- ┆ ---       ┆ ---       │
# │ str ┆ list[i64] ┆ list[i64] │
# ╞═════╪═══════════╪═══════════╡
# │ a   ┆ [1, 1]    ┆ [5, 3]    │
# │ b   ┆ [2, 3]    ┆ [4, 2]    │
# │ c   ┆ [3]       ┆ [1]       │
# └─────┴───────────┴───────────┘

Compute the sum of a column for each group.

ldf.group_by("a").agg(
  Polars.col("b").sum
).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 2   │
# │ b   ┆ 5   │
# │ c   ┆ 3   │
# └─────┴─────┘

Compute multiple aggregates at once by passing a list of expressions.

ldf.group_by("a").agg(
  [Polars.sum("b"), Polars.mean("c")]
).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ f64 │
# ╞═════╪═════╪═════╡
# │ c   ┆ 3   ┆ 1.0 │
# │ a   ┆ 2   ┆ 4.0 │
# │ b   ┆ 5   ┆ 3.0 │
# └─────┴─────┴─────┘

Or use positional arguments to compute multiple aggregations in the same way.

ldf.group_by("a").agg(
  Polars.sum("b").name.suffix("_sum"),
  (Polars.col("c") ** 2).mean.name.suffix("_mean_squared")
).collect
# =>
# shape: (3, 3)
# ┌─────┬───────┬────────────────┐
# │ a   ┆ b_sum ┆ c_mean_squared │
# │ --- ┆ ---   ┆ ---            │
# │ str ┆ i64   ┆ f64            │
# ╞═════╪═══════╪════════════════╡
# │ a   ┆ 2     ┆ 17.0           │
# │ c   ┆ 3     ┆ 1.0            │
# │ b   ┆ 5     ┆ 10.0           │
# └─────┴───────┴────────────────┘

Use keyword arguments to easily name your expression inputs.

ldf.group_by("a").agg(
  b_sum: Polars.sum("b"),
  c_mean_squared: (Polars.col("c") ** 2).mean
).collect
# =>
# shape: (3, 3)
# ┌─────┬───────┬────────────────┐
# │ a   ┆ b_sum ┆ c_mean_squared │
# │ --- ┆ ---   ┆ ---            │
# │ str ┆ i64   ┆ f64            │
# ╞═════╪═══════╪════════════════╡
# │ a   ┆ 2     ┆ 17.0           │
# │ c   ┆ 3     ┆ 1.0            │
# │ b   ┆ 5     ┆ 10.0           │
# └─────┴───────┴────────────────┘

Parameters:

  • aggs (Array)

    Aggregations to compute for each group of the group by operation, specified as positional arguments. Accepts expression input. Strings are parsed as column names.

  • named_aggs (Hash)

    Additional aggregations, specified as keyword arguments. The resulting columns will be renamed to the keyword used.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 148

def agg(*aggs, **named_aggs)
  rbexprs = Utils.parse_into_list_of_expressions(*aggs, **named_aggs)
  Utils.wrap_ldf(@lgb.agg(rbexprs))
end

#all ⇒ LazyFrame

Aggregate the groups into Series.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => ["one", "two", "one", "two"],
    "b" => [1, 2, 3, 4]
  }
).lazy
ldf.group_by("a", maintain_order: true).all.collect
# =>
# shape: (2, 2)
# ┌─────┬───────────┐
# │ a   ┆ b         │
# │ --- ┆ ---       │
# │ str ┆ list[i64] │
# ╞═════╪═══════════╡
# │ one ┆ [1, 3]    │
# │ two ┆ [2, 4]    │
# └─────┴───────────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 298

def all
  agg(F.all)
end

#first(ignore_nulls: false) ⇒ LazyFrame

Aggregate the first values in the group.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 4, 10, 13, 14],
    "c" => [true, true, true, false, false, true],
    "d" => ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).first.collect
# =>
# shape: (3, 4)
# ┌────────┬─────┬──────┬───────┐
# │ d      ┆ a   ┆ b    ┆ c     │
# │ ---    ┆ --- ┆ ---  ┆ ---   │
# │ str    ┆ i64 ┆ f64  ┆ bool  │
# ╞════════╪═════╪══════╪═══════╡
# │ Apple  ┆ 1   ┆ 0.5  ┆ true  │
# │ Orange ┆ 2   ┆ 0.5  ┆ true  │
# │ Banana ┆ 4   ┆ 13.0 ┆ false │
# └────────┴─────┴──────┴───────┘

Parameters:

  • ignore_nulls (Boolean) (defaults to: false)

    Ignore null values (default false). If true, the first non-null value in each group is returned; if a group has no non-null value, nil is returned.
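
The effect of ignore_nulls can be modeled in plain Ruby. The sketch below is only an illustration of the documented behavior, not how Polars implements it; the helper first_per_group and the sample rows are hypothetical.

```ruby
# Plain-Ruby model of a per-group "first" aggregation.
# Assumption: rows are hashes and nil stands in for a null value.
def first_per_group(rows, key, column, ignore_nulls: false)
  rows.group_by { |row| row[key] }.transform_values do |group|
    values = group.map { |row| row[column] }
    # With ignore_nulls, nils are skipped; an all-nil group still yields nil.
    ignore_nulls ? values.compact.first : values.first
  end
end

rows = [
  { "d" => "Apple",  "a" => nil },
  { "d" => "Apple",  "a" => 2 },
  { "d" => "Banana", "a" => nil }
]

first_per_group(rows, "d", "a")
# => {"Apple" => nil, "Banana" => nil}
first_per_group(rows, "d", "a", ignore_nulls: true)
# => {"Apple" => 2, "Banana" => nil}
```

The same reasoning applies to #last, reading each group from the end instead of the start.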

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 373

def first(ignore_nulls: false)
  agg(F.all.first(ignore_nulls: ignore_nulls))
end

#having(*predicates) ⇒ LazyGroupBy

Filter groups with a list of predicates after aggregation.

Using this method is equivalent to adding the predicates to the aggregation and filtering afterwards.

This method can be chained and all conditions will be combined using &.
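
As a rough mental model, filtering groups this way behaves like the plain-Ruby sketch below: group the rows, evaluate every predicate once per group, and keep only the groups for which all predicates hold. The helper having_model and the lambdas are hypothetical and stand in for Polars expressions; this is not the library's implementation.

```ruby
# Plain-Ruby model of "filter groups by predicates".
# Each predicate receives all rows of one group and returns true or false.
def having_model(rows, key, *predicates)
  rows.group_by { |row| row[key] }
      .select { |_, group| predicates.all? { |pred| pred.call(group) } }
      .keys
end

rows = %w[a b a b c].map { |value| { "a" => value } }

more_than_one = ->(group) { group.length > 1 }
not_b         = ->(group) { group.first["a"] != "b" }

having_model(rows, "a", more_than_one)        # => ["a", "b"]
having_model(rows, "a", more_than_one, not_b) # => ["a"]
```

Passing several predicates here corresponds to combining conditions with &. Unlike this model, Polars does not guarantee the order of the resulting groups unless maintain_order: true is passed to group_by.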

Examples:

Only keep groups that contain more than one element.

ldf = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "c"]
  }
).lazy
ldf.group_by("a").having(
  Polars.len > 1
).agg.collect
# =>
# shape: (2, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ str │
# ╞═════╡
# │ b   │
# │ a   │
# └─────┘

Parameters:

  • predicates (Array)

    Expressions that evaluate to a boolean value for each group. Typically, this requires the use of an aggregation function. Multiple predicates are combined using &.

Returns:

  • (LazyGroupBy)

# File 'lib/polars/lazy_group_by.rb', line 42

def having(*predicates)
  rbexprs = Utils.parse_into_list_of_expressions(*predicates)
  @lgb = @lgb.having(rbexprs)
  self
end

#head(n = 5) ⇒ LazyFrame

Get the first n rows of each group.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["c", "c", "a", "c", "a", "b"],
    "nrs" => [1, 2, 3, 4, 5, 6]
  }
)
df.group_by("letters").head(2).sort("letters")
# =>
# shape: (5, 2)
# ┌─────────┬─────┐
# │ letters ┆ nrs │
# │ ---     ┆ --- │
# │ str     ┆ i64 │
# ╞═════════╪═════╡
# │ a       ┆ 3   │
# │ a       ┆ 5   │
# │ b       ┆ 6   │
# │ c       ┆ 1   │
# │ c       ┆ 2   │
# └─────────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 240

def head(n = 5)
  Utils.wrap_ldf(@lgb.head(n))
end

#last(ignore_nulls: false) ⇒ LazyFrame

Aggregate the last values in the group.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 4, 10, 14, 13],
    "c" => [true, true, true, false, false, true],
    "d" => ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).last.collect
# =>
# shape: (3, 4)
# ┌────────┬─────┬──────┬───────┐
# │ d      ┆ a   ┆ b    ┆ c     │
# │ ---    ┆ --- ┆ ---  ┆ ---   │
# │ str    ┆ i64 ┆ f64  ┆ bool  │
# ╞════════╪═════╪══════╪═══════╡
# │ Apple  ┆ 3   ┆ 10.0 ┆ false │
# │ Orange ┆ 2   ┆ 0.5  ┆ true  │
# │ Banana ┆ 5   ┆ 13.0 ┆ true  │
# └────────┴─────┴──────┴───────┘

Parameters:

  • ignore_nulls (Boolean) (defaults to: false)

    Ignore null values (default false). If true, the last non-null value in each group is returned; if a group has no non-null value, nil is returned.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 408

def last(ignore_nulls: false)
  agg(F.all.last(ignore_nulls: ignore_nulls))
end

#len(name: nil) ⇒ LazyFrame

Return the number of rows in each group.

Examples:

lf = Polars::LazyFrame.new({"a" => ["Apple", "Apple", "Orange"], "b" => [1, nil, 2]})
lf.group_by("a").len.collect
# =>
# shape: (2, 2)
# ┌────────┬─────┐
# │ a      ┆ len │
# │ ---    ┆ --- │
# │ str    ┆ u32 │
# ╞════════╪═════╡
# │ Apple  ┆ 2   │
# │ Orange ┆ 1   │
# └────────┴─────┘
lf.group_by("a").len(name: "n").collect
# =>
# shape: (2, 2)
# ┌────────┬─────┐
# │ a      ┆ n   │
# │ ---    ┆ --- │
# │ str    ┆ u32 │
# ╞════════╪═════╡
# │ Apple  ┆ 2   │
# │ Orange ┆ 1   │
# └────────┴─────┘

Parameters:

  • name (String) (defaults to: nil)

    Assign a name to the resulting column; if unset, defaults to "len".

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 335

def len(name: nil)
  len_expr = F.len
  if !name.nil?
    len_expr = len_expr.alias(name)
  end
  agg(len_expr)
end

#map_groups(schema, &function) ⇒ LazyFrame

Note:

This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.

Apply a custom/user-defined function (UDF) over the groups as a new DataFrame.

Using this is considered an anti-pattern because it is very slow:

  • It forces the engine to materialize the whole DataFrame for each group.
  • It is not parallelized.
  • It blocks optimizations because the passed Ruby block is opaque to the optimizer.

The idiomatic way to apply custom functions over multiple columns is using:

Polars.struct([my_columns]).apply { |struct_series| ... }
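
To see why the costs listed above arise, it can help to picture the operation in plain Ruby. The sketch below is a simplified model under stated assumptions (rows as hashes, a hypothetical helper map_groups_model), not the Polars implementation: every group is built in full before the block runs, the block is called once per group in sequence, and nothing about the block's body is visible to the caller.

```ruby
# Plain-Ruby model of applying a user block to each group.
def map_groups_model(rows, key)
  rows.group_by { |row| row[key] }          # each group is fully materialized
      .flat_map { |_, group| yield(group) } # one opaque, sequential call per group
end

rows = [
  { "id" => 0, "color" => "red" },
  { "id" => 1, "color" => "green" },
  { "id" => 2, "color" => "green" },
  { "id" => 3, "color" => "red" },
  { "id" => 4, "color" => "red" }
]

# Keep the first two rows of every group.
result = map_groups_model(rows, "color") { |group| group.first(2) }
result.map { |row| row["id"] } # => [0, 3, 1, 2]
```

A native expression, by contrast, describes the computation to the engine, which can then reorder, prune, and parallelize it.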

Examples:

df = Polars::DataFrame.new(
  {
    "id" => [0, 1, 2, 3, 4],
    "color" => ["red", "green", "green", "red", "red"],
    "shape" => ["square", "triangle", "square", "triangle", "square"]
  }
)
(
  df.lazy
    .group_by("color")
    .map_groups(nil) { |group_df| group_df.sample(n: 2) }
    .collect
)
# =>
# shape: (4, 3)
# ┌─────┬───────┬──────────┐
# │ id  ┆ color ┆ shape    │
# │ --- ┆ ---   ┆ ---      │
# │ i64 ┆ str   ┆ str      │
# ╞═════╪═══════╪══════════╡
# │ 1   ┆ green ┆ triangle │
# │ 2   ┆ green ┆ square   │
# │ 4   ┆ red   ┆ square   │
# │ 3   ┆ red   ┆ triangle │
# └─────┴───────┴──────────┘

Parameters:

  • schema (Object)

    Schema of the function's output. This has to be known statically. If the given schema is incorrect, this is a bug in the caller's query and may lead to errors. If set to nil, Polars assumes the schema is unchanged.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 203

def map_groups(
  schema,
  &function
)
  Utils.wrap_ldf(
    @lgb.map_groups(->(df) { function.(Utils.wrap_df(df))._df }, schema)
  )
end

#max ⇒ LazyFrame

Reduce the groups to the maximal value.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 4, 10, 13, 14],
    "c" => [true, true, true, false, false, true],
    "d" => ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).max.collect
# =>
# shape: (3, 4)
# ┌────────┬─────┬──────┬──────┐
# │ d      ┆ a   ┆ b    ┆ c    │
# │ ---    ┆ --- ┆ ---  ┆ ---  │
# │ str    ┆ i64 ┆ f64  ┆ bool │
# ╞════════╪═════╪══════╪══════╡
# │ Apple  ┆ 3   ┆ 10.0 ┆ true │
# │ Orange ┆ 2   ┆ 0.5  ┆ true │
# │ Banana ┆ 5   ┆ 14.0 ┆ true │
# └────────┴─────┴──────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 437

def max
  agg(F.all.max)
end

#mean ⇒ LazyFrame

Reduce the groups to the mean values.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 4, 10, 13, 14],
    "c" => [true, true, true, false, false, true],
    "d" => ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).mean.collect
# =>
# shape: (3, 4)
# ┌────────┬─────┬──────────┬──────────┐
# │ d      ┆ a   ┆ b        ┆ c        │
# │ ---    ┆ --- ┆ ---      ┆ ---      │
# │ str    ┆ f64 ┆ f64      ┆ f64      │
# ╞════════╪═════╪══════════╪══════════╡
# │ Apple  ┆ 2.0 ┆ 4.833333 ┆ 0.666667 │
# │ Orange ┆ 2.0 ┆ 0.5      ┆ 1.0      │
# │ Banana ┆ 4.5 ┆ 13.5     ┆ 0.5      │
# └────────┴─────┴──────────┴──────────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 466

def mean
  agg(F.all.mean)
end

#median ⇒ LazyFrame

Return the median per group.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 4, 10, 13, 14],
    "d" => ["Apple", "Banana", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).median.collect
# =>
# shape: (2, 3)
# ┌────────┬─────┬──────┐
# │ d      ┆ a   ┆ b    │
# │ ---    ┆ --- ┆ ---  │
# │ str    ┆ f64 ┆ f64  │
# ╞════════╪═════╪══════╡
# │ Apple  ┆ 2.0 ┆ 4.0  │
# │ Banana ┆ 4.0 ┆ 13.0 │
# └────────┴─────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 493

def median
  agg(F.all.median)
end

#min ⇒ LazyFrame

Reduce the groups to the minimal value.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 4, 10, 13, 14],
    "c" => [true, true, true, false, false, true],
    "d" => ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).min.collect
# =>
# shape: (3, 4)
# ┌────────┬─────┬──────┬───────┐
# │ d      ┆ a   ┆ b    ┆ c     │
# │ ---    ┆ --- ┆ ---  ┆ ---   │
# │ str    ┆ i64 ┆ f64  ┆ bool  │
# ╞════════╪═════╪══════╪═══════╡
# │ Apple  ┆ 1   ┆ 0.5  ┆ false │
# │ Orange ┆ 2   ┆ 0.5  ┆ true  │
# │ Banana ┆ 4   ┆ 13.0 ┆ false │
# └────────┴─────┴──────┴───────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 522

def min
  agg(F.all.min)
end

#n_unique ⇒ LazyFrame

Count the unique values per group.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 1, 3, 4, 5],
    "b" => [0.5, 0.5, 0.5, 10, 13, 14],
    "d" => ["Apple", "Banana", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).n_unique.collect
# =>
# shape: (2, 3)
# ┌────────┬─────┬─────┐
# │ d      ┆ a   ┆ b   │
# │ ---    ┆ --- ┆ --- │
# │ str    ┆ u32 ┆ u32 │
# ╞════════╪═════╪═════╡
# │ Apple  ┆ 2   ┆ 2   │
# │ Banana ┆ 3   ┆ 3   │
# └────────┴─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 549

def n_unique
  agg(F.all.n_unique)
end

#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame

Compute the quantile per group.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 4, 10, 13, 14],
    "d" => ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).quantile(1).collect
# =>
# shape: (3, 3)
# ┌────────┬─────┬──────┐
# │ d      ┆ a   ┆ b    │
# │ ---    ┆ --- ┆ ---  │
# │ str    ┆ f64 ┆ f64  │
# ╞════════╪═════╪══════╡
# │ Apple  ┆ 3.0 ┆ 10.0 │
# │ Orange ┆ 2.0 ┆ 0.5  │
# │ Banana ┆ 5.0 ┆ 14.0 │
# └────────┴─────┴──────┘

Parameters:

  • quantile (Float)

    Quantile between 0.0 and 1.0.

  • interpolation ('nearest', 'higher', 'lower', 'midpoint', 'linear', 'equiprobable') (defaults to: "nearest")

    Interpolation method.
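
    The interpolation names can be read as different ways of turning a fractional position in the sorted data into a value. The plain-Ruby sketch below uses the common textbook definitions with position q * (n - 1); it is an assumption that these match Polars in every detail (tie-breaking for "nearest" in particular may differ), and "equiprobable" is deliberately left out. The helper quantile_model is hypothetical.

```ruby
# Plain-Ruby model of quantile interpolation on one group's values.
def quantile_model(values, q, interpolation: "nearest")
  sorted = values.compact.sort
  return nil if sorted.empty?

  pos = q * (sorted.length - 1) # fractional index into the sorted values
  lo = sorted[pos.floor]        # neighbor at or below the position
  hi = sorted[pos.ceil]         # neighbor at or above the position

  case interpolation
  when "lower"    then lo.to_f
  when "higher"   then hi.to_f
  when "nearest"  then sorted[pos.round].to_f
  when "midpoint" then (lo + hi) / 2.0
  when "linear"   then lo + (hi - lo) * (pos - pos.floor).to_f
  else raise ArgumentError, "unsupported interpolation: #{interpolation}"
  end
end

quantile_model([1, 2, 3], 1)                                # => 3.0
quantile_model([1, 2, 3, 4], 0.5, interpolation: "lower")   # => 2.0
quantile_model([1, 2, 3, 4], 0.5, interpolation: "higher")  # => 3.0
quantile_model([1, 2, 3, 4], 0.5, interpolation: "linear")  # => 2.5
```

    With a quantile of 1 every method selects the group maximum, which is why the example above returns the largest value of each column.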

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 582

def quantile(quantile, interpolation: "nearest")
  agg(F.all.quantile(quantile, interpolation: interpolation))
end

#sum ⇒ LazyFrame

Reduce the groups to the sum.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 4, 10, 13, 14],
    "c" => [true, true, true, false, false, true],
    "d" => ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"]
  }
).lazy
ldf.group_by("d", maintain_order: true).sum.collect
# =>
# shape: (3, 4)
# ┌────────┬─────┬──────┬─────┐
# │ d      ┆ a   ┆ b    ┆ c   │
# │ ---    ┆ --- ┆ ---  ┆ --- │
# │ str    ┆ i64 ┆ f64  ┆ u32 │
# ╞════════╪═════╪══════╪═════╡
# │ Apple  ┆ 6   ┆ 14.5 ┆ 2   │
# │ Orange ┆ 2   ┆ 0.5  ┆ 1   │
# │ Banana ┆ 9   ┆ 27.0 ┆ 1   │
# └────────┴─────┴──────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 611

def sum
  agg(F.all.sum)
end

#tail(n = 5) ⇒ LazyFrame

Get the last n rows of each group.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["c", "c", "a", "c", "a", "b"],
    "nrs" => [1, 2, 3, 4, 5, 6]
  }
)
df.group_by("letters").tail(2).sort("letters")
# =>
# shape: (5, 2)
# ┌─────────┬─────┐
# │ letters ┆ nrs │
# │ ---     ┆ --- │
# │ str     ┆ i64 │
# ╞═════════╪═════╡
# │ a       ┆ 3   │
# │ a       ┆ 5   │
# │ b       ┆ 6   │
# │ c       ┆ 2   │
# │ c       ┆ 4   │
# └─────────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_group_by.rb', line 272

def tail(n = 5)
  Utils.wrap_ldf(@lgb.tail(n))
end