Class: Polars::LazyFrame
- Inherits:
-
Object
- Object
- Polars::LazyFrame
- Defined in:
- lib/polars/lazy_frame.rb
Overview
Representation of a Lazy computation graph/query against a DataFrame.
Class Method Summary collapse
-
.deserialize(source, format: "binary") ⇒ LazyFrame
Read a logical plan from a file to construct a LazyFrame.
Instance Method Summary collapse
-
#bottom_k(k, by:, reverse: false) ⇒ LazyFrame
Return the
ksmallest rows. -
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
-
#cast(dtypes, strict: true) ⇒ LazyFrame
Cast LazyFrame column(s) to the specified dtype(s).
-
#clear(n = 0) ⇒ LazyFrame
Create an empty copy of the current LazyFrame.
-
#collect(engine: "auto", background: false, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Materialize this LazyFrame into a DataFrame.
-
#collect_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
Evaluate the query in streaming mode and get a generator that returns chunks.
-
#collect_schema ⇒ Schema
Resolve the schema of this LazyFrame.
-
#columns ⇒ Array
Get or set column names.
-
#count ⇒ LazyFrame
Return the number of non-null elements for each column.
-
#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame
Creates a summary of statistics for a LazyFrame, returning a DataFrame.
-
#drop(*columns, strict: true) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
-
#drop_nans(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more NaN values.
-
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more null values.
-
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
-
#explain(format: "plain", optimized: true, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ String
Create a string representation of the query plan.
-
#explode(columns, *more_columns, empty_as_null: true, keep_nulls: true) ⇒ LazyFrame
Explode lists to long format.
-
#fill_nan(value) ⇒ LazyFrame
Fill floating point NaN values.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ LazyFrame
Fill null values using the specified value or strategy.
-
#filter(*predicates, **constraints) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#first ⇒ LazyFrame
Get the first row of the DataFrame.
-
#gather(indices, null_on_oob: false) ⇒ LazyFrame
Selects rows from this LazyFrame at the given indices.
-
#gather_every(n, offset: 0) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
-
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ DataFrame
Group based on a time value (or index value of type Int32, Int64).
-
#head(n = 5) ⇒ LazyFrame
Get the first
nrows. -
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false, height: nil) ⇒ LazyFrame
constructor
Create a new LazyFrame.
-
#interpolate ⇒ LazyFrame
Interpolate intermediate values.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil) ⇒ LazyFrame
Add a join operation to the Logical Plan.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ LazyFrame
Perform an asof join.
-
#join_where(other, *predicates, suffix: "_right") ⇒ LazyFrame
Perform a join based on one or multiple (in)equality predicates.
-
#last ⇒ LazyFrame
Get the last row of the DataFrame.
-
#lazy ⇒ LazyFrame
Return lazy representation, i.e.
-
#limit(n = 5) ⇒ LazyFrame
Get the first
nrows. -
#map_batches(predicate_pushdown: false, projection_pushdown: false, slice_pushdown: false, no_optimizations: nil, schema: nil, validate_output_schema: true, streamable: false, &function) ⇒ LazyFrame
Apply a custom function.
-
#match_to_schema(schema, missing_columns: "raise", missing_struct_fields: "raise", extra_columns: "raise", extra_struct_fields: "raise", integer_cast: "forbid", float_cast: "forbid") ⇒ LazyFrame
Match or evolve the schema of a LazyFrame into a specific schema.
-
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
-
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
-
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
-
#merge_sorted(other, key, maintain_order: false) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
-
#null_count ⇒ LazyFrame
Aggregate the columns in the LazyFrame as the sum of their null value count.
-
#pipe(function, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#pipe_with_schema(function) ⇒ LazyFrame
Allows to alter the lazy frame during the plan stage with the resolved schema.
-
#pivot(on, on_columns:, index: nil, values: nil, aggregate_function: nil, maintain_order: false, separator: "_", column_naming: "auto") ⇒ LazyFrame
Create a spreadsheet-style pivot table as a DataFrame.
-
#profile(engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Array
Profile a LazyFrame.
-
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
-
#remove(*predicates, **constraints) ⇒ LazyFrame
Remove rows, dropping those that match the given predicate expression(s).
-
#rename(mapping, strict: true) ⇒ LazyFrame
Rename column names.
-
#reverse ⇒ LazyFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ LazyFrame
Create rolling groups based on a time column.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
-
#select_seq(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this LazyFrame.
-
#serialize(file = nil, format: "binary") ⇒ Object
Serialize the logical plan of this LazyFrame to a file or string.
-
#set_sorted(column, *more_columns, descending: false, nulls_last: false) ⇒ LazyFrame
Flag a column as sorted.
-
#shift(n = 1, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
-
#show_graph(optimized: true, show: true, output_path: nil, raw_output: false, figsize: [16.0, 12.0], engine: "auto", plan_stage: "ir", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
Show a plot of the query plan.
-
#sink_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, &function) ⇒ Object
Evaluate the query and call a user-defined function for every ready batch.
-
#sink_csv(path, include_bom: false, compression: "uncompressed", compression_level: nil, check_extension: true, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
-
#sink_delta(target, mode: "error", storage_options: nil, credential_provider: "auto", delta_write_options: nil, delta_merge_options: nil, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
Sink DataFrame as delta table.
-
#sink_ipc(path, compression: "uncompressed", compat_level: nil, record_batch_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _record_batch_statistics: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
-
#sink_ndjson(path, compression: "uncompressed", compression_level: nil, check_extension: true, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
-
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, metadata: nil, mkdir: false, lazy: false, arrow_schema: nil, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _sinked_paths_callback: nil) ⇒ DataFrame
Persists a LazyFrame at the provided path.
-
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
-
#sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
-
#sql(query, table_name: "self") ⇒ Expr
Execute a SQL query against the LazyFrame.
-
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
-
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
-
#tail(n = 5) ⇒ LazyFrame
Get the last
nrows. -
#to_s ⇒ String
Returns a string representing the LazyFrame.
-
#top_k(k, by:, reverse: false) ⇒ LazyFrame
Return the
klargest rows. -
#unique(maintain_order: false, subset: nil, keep: "any") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
-
#unnest(columns = nil, *more_columns, separator: nil) ⇒ LazyFrame
Decompose a struct into its fields.
-
#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
Unpivot a DataFrame from wide to long format.
-
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ LazyFrame
Update the values in this
LazyFramewith the values inother. -
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
-
#width ⇒ Integer
Get the width of the LazyFrame.
-
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
-
#with_columns_seq(*exprs, **named_exprs) ⇒ LazyFrame
Add columns to this LazyFrame.
-
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame
Add a column at index 0 that counts the rows.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false, height: nil) ⇒ LazyFrame
Create a new LazyFrame.
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
# File 'lib/polars/lazy_frame.rb', line 8 def initialize( data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false, height: nil ) self._ldf = ( DataFrame.new( data, schema: schema, schema_overrides: schema_overrides, strict: strict, orient: orient, infer_schema_length: infer_schema_length, nan_to_null: nan_to_null, height: height ) .lazy ._ldf ) end |
Class Method Details
.deserialize(source, format: "binary") ⇒ LazyFrame
This function uses marshaling if the logical plan contains Ruby UDFs, and as such inherits the security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.
Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.
Read a logical plan from a file to construct a LazyFrame.
76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 |
# File 'lib/polars/lazy_frame.rb', line 76 def self.deserialize(source, format: "binary") if Utils.pathlike?(source) source = Utils.normalize_filepath(source) end if format == "binary" raise Todo unless RbLazyFrame.respond_to?(:deserialize_binary) deserializer = RbLazyFrame.method(:deserialize_binary) elsif format == "json" deserializer = RbLazyFrame.method(:deserialize_json) else msg = "`format` must be one of {{'binary', 'json'}}, got #{format.inspect}" raise ArgumentError, msg end _from_rbldf(deserializer.(source)) end |
Instance Method Details
#bottom_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of
the value of reverse. The output is not guaranteed to be in any
particular order, call :func:sort after this function if you wish the
output to be sorted.
902 903 904 905 906 907 908 909 910 |
# File 'lib/polars/lazy_frame.rb', line 902 def bottom_k( k, by:, reverse: false ) by = Utils.parse_into_list_of_expressions(by) reverse = Utils.extend_bool(reverse, by.length, "reverse", "by") _from_rbldf(_ldf.bottom_k(k, by, reverse)) end |
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
1915 1916 1917 |
# File 'lib/polars/lazy_frame.rb', line 1915 def cache _from_rbldf(_ldf.cache) end |
#cast(dtypes, strict: true) ⇒ LazyFrame
Cast LazyFrame column(s) to the specified dtype(s).
1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 |
# File 'lib/polars/lazy_frame.rb', line 1968 def cast(dtypes, strict: true) if !dtypes.is_a?(Hash) return _from_rbldf(_ldf.cast_all(dtypes, strict)) end cast_map = {} dtypes.each do |c, dtype| dtype = Utils.parse_into_dtype(dtype) cast_map.merge!( c.is_a?(::String) ? {c => dtype} : Utils.(self, c).to_h { |x| [x, dtype] } ) end _from_rbldf(_ldf.cast(cast_map, strict)) end |
#clear(n = 0) ⇒ LazyFrame
Create an empty copy of the current LazyFrame.
The copy has an identical schema but no data.
2020 2021 2022 |
# File 'lib/polars/lazy_frame.rb', line 2020 def clear(n = 0) DataFrame.new(schema: schema).clear(n).lazy end |
#collect(engine: "auto", background: false, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Materialize this LazyFrame into a DataFrame.
By default, all query optimizations are enabled. Individual optimizations may
be disabled by setting the corresponding parameter to false.
1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 |
# File 'lib/polars/lazy_frame.rb', line 1019 def collect( engine: "auto", background: false, optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) if engine == "streaming" Utils.issue_unstable_warning("streaming mode is considered unstable.") end ldf = _ldf.with_optimizations(optimizations._rboptflags) if background Utils.issue_unstable_warning("background mode is considered unstable.") return InProcessQuery.new(ldf.collect_concurrently) end Utils.wrap_df(ldf.collect(engine)) end |
#collect_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.
Evaluate the query in streaming mode and get a generator that returns chunks.
This allows streaming results that are larger than RAM to be written to disk.
The query will always be fully executed unless stop is called, so you should
call next until all chunks have been seen.
1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 |
# File 'lib/polars/lazy_frame.rb', line 1875 def collect_batches( chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) ldf = _ldf.with_optimizations(optimizations._rboptflags) inner = ldf.collect_batches( engine, maintain_order, chunk_size, lazy ) CollectBatches.new(inner) end |
#collect_schema ⇒ Schema
Resolve the schema of this LazyFrame.
1070 1071 1072 |
# File 'lib/polars/lazy_frame.rb', line 1070 def collect_schema Schema.new(_ldf.collect_schema, check_dtypes: false) end |
#columns ⇒ Array
Get or set column names.
112 113 114 |
# File 'lib/polars/lazy_frame.rb', line 112 def columns _ldf.collect_schema.keys end |
#count ⇒ LazyFrame
Return the number of non-null elements for each column.
5453 5454 5455 |
# File 'lib/polars/lazy_frame.rb', line 5453 def count _from_rbldf(_ldf.count) end |
#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame
The median is included by default as the 50% percentile.
This method does not maintain the laziness of the frame, and will collect
the final result. This could potentially be an expensive operation.
We do not guarantee the output of describe to be stable. It will show
statistics that we deem informative, and may be updated in the future.
Using describe programmatically (versus interactive exploration) is
not recommended for this reason.
Creates a summary of statistics for a LazyFrame, returning a DataFrame.
394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 |
# File 'lib/polars/lazy_frame.rb', line 394 def describe( percentiles: [0.25, 0.5, 0.75], interpolation: "nearest" ) schema = collect_schema.to_h if schema.empty? msg = "cannot describe a LazyFrame that has no columns" raise TypeError, msg end # create list of metrics metrics = ["count", "null_count", "mean", "std", "min"] if (quantiles = Utils.parse_percentiles(percentiles)).any? metrics.concat(quantiles.map { |q| "%g%%" % [q * 100] }) end metrics.append("max") skip_minmax = lambda do |dt| dt.nested? || [Categorical, Enum, Null, Object, Unknown].include?(dt) end # determine which columns will produce std/mean/percentile/etc # statistics in a single pass over the frame schema has_numeric_result, sort_cols = Set.new, Set.new metric_exprs = [] null = F.lit(nil) schema.each do |c, dtype| is_numeric = dtype.numeric? is_temporal = !is_numeric && dtype.temporal? # counts count_exprs = [ F.col(c).count.name.prefix("count:"), F.col(c).null_count.name.prefix("null_count:") ] # mean mean_expr = if is_temporal || is_numeric || dtype == Boolean F.col(c).mean else null end # standard deviation, min, max expr_std = is_numeric ? F.col(c).std : null min_expr = !skip_minmax.(dtype) ? F.col(c).min : null max_expr = !skip_minmax.(dtype) ? F.col(c).max : null # percentiles pct_exprs = [] quantiles.each do |p| if is_numeric || is_temporal pct_expr = if is_temporal F.col(c).to_physical.quantile(p, interpolation: interpolation).cast(dtype) else F.col(c).quantile(p, interpolation: interpolation) end sort_cols.add(c) else pct_expr = null end pct_exprs << pct_expr.alias("#{p}:#{c}") end if is_numeric || dtype.nested? || [Null, Boolean].include?(dtype) has_numeric_result.add(c) end # add column expressions (in end-state 'metrics' list order) metric_exprs.concat( [ *count_exprs, mean_expr.alias("mean:#{c}"), expr_std.alias("std:#{c}"), min_expr.alias("min:#{c}"), *pct_exprs, max_expr.alias("max:#{c}") ] ) end # calculate requested metrics in parallel, then collect the result df_metrics = ( ( # if more than one quantile, sort the relevant columns to make them O(1) # TODO: drop sort once we have efficient retrieval of multiple quantiles sort_cols ? with_columns(sort_cols.map { |c| F.col(c).sort }) : self ) .select(*metric_exprs) .collect ) # reshape wide result n_metrics = metrics.length column_metrics = schema.length.times.map do |n| df_metrics.row(0)[(n * n_metrics)...((n + 1) * n_metrics)] end summary = schema.keys.zip(column_metrics).to_h # cast by column type (numeric/bool -> float), (other -> string) schema.each_key do |c| summary[c] = summary[c].map do |v| if v.nil? || v.is_a?(Hash) nil else if has_numeric_result.include?(c) if v == true 1.0 elsif v == false 0.0 else v.to_f end else "#{v}" end end end end # return results as a DataFrame df_summary = Polars.from_hash(summary) df_summary.insert_column(0, Polars::Series.new("statistic", metrics)) df_summary end |
#drop(*columns, strict: true) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 |
# File 'lib/polars/lazy_frame.rb', line 3688 def drop(*columns, strict: true) selectors = [] columns.each do |c| if c.is_a?(Enumerable) selectors += c else selectors += [c] end end drop_cols = Utils.parse_list_into_selector(selectors, strict: strict) _from_rbldf(_ldf.drop(drop_cols._rbselector)) end |
#drop_nans(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
4633 4634 4635 4636 4637 4638 4639 |
# File 'lib/polars/lazy_frame.rb', line 4633 def drop_nans(subset: nil) selector_subset = nil if !subset.nil? selector_subset = Utils.parse_list_into_selector(subset)._rbselector end _from_rbldf(_ldf.drop_nans(selector_subset)) end |
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more null values.
The original order of the remaining rows is preserved.
4682 4683 4684 4685 4686 4687 4688 |
# File 'lib/polars/lazy_frame.rb', line 4682 def drop_nulls(subset: nil) selector_subset = nil if !subset.nil? selector_subset = Utils.parse_list_into_selector(subset)._rbselector end _from_rbldf(_ldf.drop_nulls(selector_subset)) end |
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
130 131 132 |
# File 'lib/polars/lazy_frame.rb', line 130 def dtypes _ldf.collect_schema.values end |
#explain(format: "plain", optimized: true, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ String
Create a string representation of the query plan.
Different optimizations can be turned on or off.
543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 |
# File 'lib/polars/lazy_frame.rb', line 543 def explain( format: "plain", optimized: true, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) if engine == "streaming" Utils.issue_unstable_warning("streaming mode is considered unstable.") end if optimized ldf = _ldf.with_optimizations(optimizations._rboptflags) if format == "tree" return ldf.describe_optimized_plan_tree else return ldf.describe_optimized_plan end end if format == "tree" _ldf.describe_plan_tree else _ldf.describe_plan end end |
#explode(columns, *more_columns, empty_as_null: true, keep_nulls: true) ⇒ LazyFrame
Explode lists to long format.
4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 |
# File 'lib/polars/lazy_frame.rb', line 4510 def explode( columns, *more_columns, empty_as_null: true, keep_nulls: true ) subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector( more_columns ) _from_rbldf(_ldf.explode(subset._rbselector, empty_as_null, keep_nulls)) end |
#fill_nan(value) ⇒ LazyFrame
Note that floating point NaN (Not a Number) are not missing values!
To replace missing values, use fill_null instead.
Fill floating point NaN values.
4249 4250 4251 4252 4253 4254 |
# File 'lib/polars/lazy_frame.rb', line 4249 def fill_nan(value) if !value.is_a?(Expr) value = F.lit(value) end _from_rbldf(_ldf.fill_nan(value._rbexpr)) end |
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ LazyFrame
Fill null values using the specified value or strategy.
4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 |
# File 'lib/polars/lazy_frame.rb', line 4174 def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) if !value.nil? if value.is_a?(Expr) dtypes = nil elsif value.is_a?(TrueClass) || value.is_a?(FalseClass) dtypes = [Boolean] elsif matches_supertype && (value.is_a?(Integer) || value.is_a?(Float)) dtypes = [ Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, Float32, Float64, Decimal.new ] elsif value.is_a?(Integer) dtypes = [Int64] elsif value.is_a?(Float) dtypes = [Float64] elsif value.is_a?(::Date) dtypes = [Date] elsif value.is_a?(::String) dtypes = [String, Categorical] else # fallback; anything not explicitly handled above dtypes = nil end if dtypes return with_columns( F.col(dtypes).fill_null(value, strategy: strategy, limit: limit) ) end end select(Polars.all.fill_null(value, strategy: strategy, limit: limit)) end |
#filter(*predicates, **constraints) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 |
# File 'lib/polars/lazy_frame.rb', line 2139 def filter(*predicates, **constraints) if constraints.empty? # early-exit conditions (exclude/include all rows) if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass)) return dup end if predicates.length == 1 && predicates[0].is_a?(FalseClass) return clear end end _filter( predicates: predicates, constraints: constraints, invert: false ) end |
#first ⇒ LazyFrame
Get the first row of the DataFrame.
4039 4040 4041 |
# File 'lib/polars/lazy_frame.rb', line 4039 def first slice(0, 1) end |
#gather(indices, null_on_oob: false) ⇒ LazyFrame
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Selects rows from this LazyFrame at the given indices.
3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 |
# File 'lib/polars/lazy_frame.rb', line 3543 def gather(indices, null_on_oob: false) if !indices.is_a?(LazyFrame) if indices.is_a?(::Array) indices_expr = F.lit(Series.new("", indices, dtype: Int64)) else indices_expr = wrap_expr(Utils.parse_into_expression(indices)) end indices = select(indices_expr) end _from_rbldf(_ldf.gather(indices._ldf, null_on_oob)) end |
#gather_every(n, offset: 0) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
4101 4102 4103 |
# File 'lib/polars/lazy_frame.rb', line 4101 def gather_every(n, offset: 0) select(F.col("*").gather_every(n, offset)) end |
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy
Start a group by operation.
2443 2444 2445 2446 2447 |
# File 'lib/polars/lazy_frame.rb', line 2443 def group_by(*by, maintain_order: false, **named_by) exprs = Utils.parse_into_list_of_expressions(*by, **named_by) lgb = _ldf.group_by(exprs, maintain_order) LazyGroupBy.new(lgb) end |
#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ DataFrame
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Different from a normal group by is that a row can be member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
The every, period and offset arguments are created with
the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 |
# File 'lib/polars/lazy_frame.rb', line 2806 def group_by_dynamic( index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window" ) index_column = Utils.parse_into_expression(index_column, str_as_lit: false) if offset.nil? offset = period.nil? ? "-#{every}" : "0ns" end if period.nil? period = every end period = Utils.parse_as_duration_string(period) offset = Utils.parse_as_duration_string(offset) every = Utils.parse_as_duration_string(every) rbexprs_by = group_by.nil? ? [] : Utils.parse_into_list_of_expressions(group_by) lgb = _ldf.group_by_dynamic( index_column, every, period, offset, label, include_boundaries, closed, rbexprs_by, start_by ) LazyGroupBy.new(lgb) end |
#head(n = 5) ⇒ LazyFrame
Get the first n rows.
3944 3945 3946 |
# File 'lib/polars/lazy_frame.rb', line 3944 def head(n = 5) slice(0, n) end |
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
167 168 169 |
# File 'lib/polars/lazy_frame.rb', line 167 def include?(key) columns.include?(key) end |
#interpolate ⇒ LazyFrame
Interpolate intermediate values. The interpolation method is linear.
5042 5043 5044 |
# File 'lib/polars/lazy_frame.rb', line 5042 def interpolate select(F.col("*").interpolate) end |
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil) ⇒ LazyFrame
Add a join operation to the Logical Plan.
3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 |
# File 'lib/polars/lazy_frame.rb', line 3317 def join( other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil ) if !other.is_a?(LazyFrame) raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}" end if maintain_order.nil? maintain_order = "none" end if how == "outer" how = "full" elsif how == "cross" return _from_rbldf( _ldf.join( other._ldf, [], [], allow_parallel, nulls_equal, force_parallel, how, suffix, validate, maintain_order, coalesce ) ) end if !on.nil? rbexprs = Utils.parse_into_list_of_expressions(on) rbexprs_left = rbexprs rbexprs_right = rbexprs elsif !left_on.nil? && !right_on.nil? rbexprs_left = Utils.parse_into_list_of_expressions(left_on) rbexprs_right = Utils.parse_into_list_of_expressions(right_on) else raise ArgumentError, "must specify `on` OR `left_on` and `right_on`" end _from_rbldf( self._ldf.join( other._ldf, rbexprs_left, rbexprs_right, allow_parallel, force_parallel, nulls_equal, how, suffix, validate, maintain_order, coalesce ) ) end |
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ LazyFrame
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the join_asof key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 |
# File 'lib/polars/lazy_frame.rb', line 3104 def join_asof( other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true ) if !other.is_a?(LazyFrame) raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}" end if on.is_a?(::String) left_on = on right_on = on end if left_on.nil? || right_on.nil? raise ArgumentError, "You should pass the column to join on as an argument." end if by_left.is_a?(::String) || by_left.is_a?(Expr) by_left_ = [by_left] else by_left_ = by_left end if by_right.is_a?(::String) || by_right.is_a?(Expr) by_right_ = [by_right] else by_right_ = by_right end if by.is_a?(::String) by_left_ = [by] by_right_ = [by] elsif by.is_a?(::Array) by_left_ = by by_right_ = by end tolerance_str = nil tolerance_num = nil if tolerance.is_a?(::String) tolerance_str = tolerance else tolerance_num = tolerance end _from_rbldf( _ldf.join_asof( other._ldf, Polars.col(left_on)._rbexpr, Polars.col(right_on)._rbexpr, by_left_, by_right_, allow_parallel, force_parallel, suffix, strategy, tolerance_num, tolerance_str, coalesce, allow_exact_matches, check_sortedness ) ) end |
#join_where(other, *predicates, suffix: "_right") ⇒ LazyFrame
The row order of the input DataFrames is not preserved.
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 |
# File 'lib/polars/lazy_frame.rb', line 3466 def join_where( other, *predicates, suffix: "_right" ) Utils.require_same_type(self, other) rbexprs = Utils.parse_into_list_of_expressions(*predicates) _from_rbldf( _ldf.join_where( other._ldf, rbexprs, suffix ) ) end |
#last ⇒ LazyFrame
Get the last row of the DataFrame.
4014 4015 4016 |
# File 'lib/polars/lazy_frame.rb', line 4014 def last tail(1) end |
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
Useful for writing code that expects either a DataFrame or
LazyFrame.
1908 1909 1910 |
# File 'lib/polars/lazy_frame.rb', line 1908 def lazy self end |
#limit(n = 5) ⇒ LazyFrame
Get the first n rows.
Alias for #head.
3899 3900 3901 |
# File 'lib/polars/lazy_frame.rb', line 3899 def limit(n = 5) head(n) end |
#map_batches(predicate_pushdown: false, projection_pushdown: false, slice_pushdown: false, no_optimizations: nil, schema: nil, validate_output_schema: true, streamable: false, &function) ⇒ LazyFrame
The schema of a LazyFrame must always be correct. It is up to the caller
of this function to ensure that this invariant is upheld.
It is important that the optimization flags are correct. If the custom function
for instance does an aggregation of a column, predicate_pushdown should not
be allowed, as this prunes rows and will influence your aggregation results.
A UDF passed to map_batches must be pure, meaning that it cannot modify or
depend on state other than its arguments.
Apply a custom function.
It is important that the function returns a Polars DataFrame.
4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 |
# File 'lib/polars/lazy_frame.rb', line 4986 def map_batches( predicate_pushdown: false, projection_pushdown: false, slice_pushdown: false, no_optimizations: nil, schema: nil, validate_output_schema: true, streamable: false, &function ) raise Todo if !schema.nil? if no_optimizations predicate_pushdown = false projection_pushdown = false slice_pushdown = false end _from_rbldf( _ldf.map_batches( function, predicate_pushdown, projection_pushdown, slice_pushdown, streamable, schema, validate_output_schema, ) ) end |
#match_to_schema(schema, missing_columns: "raise", missing_struct_fields: "raise", extra_columns: "raise", extra_struct_fields: "raise", integer_cast: "forbid", float_cast: "forbid") ⇒ LazyFrame
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Match or evolve the schema of a LazyFrame into a specific schema.
By default, match_to_schema returns an error if the input schema does not exactly match the target schema. It also allows columns to be freely reordered, with additional coercion rules available through optional parameters.
5585 5586 5587 5588 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 5602 5603 5604 5605 5606 5607 5608 5609 5610 5611 5612 5613 5614 5615 5616 5617 5618 5619 5620 5621 5622 5623 5624 5625 5626 5627 5628 5629 5630 |
# File 'lib/polars/lazy_frame.rb', line 5585 def match_to_schema( schema, missing_columns: "raise", missing_struct_fields: "raise", extra_columns: "raise", extra_struct_fields: "raise", integer_cast: "forbid", float_cast: "forbid" ) prepare_missing_columns = lambda do |value| if value.is_a?(Expr) value._rbexpr else value end end if schema.is_a?(Hash) schema_prep = Schema.new(schema) else schema_prep = schema end if missing_columns.is_a?(Hash) missing_columns_rbexpr = missing_columns.to_h do |key, value| [key.to_s, prepare_missing_columns.(value)] end elsif missing_columns.is_a?(Expr) missing_columns_rbexpr = prepare_missing_columns.(missing_columns) else missing_columns_rbexpr = missing_columns end LazyFrame._from_rbldf( _ldf.match_to_schema( schema_prep, missing_columns_rbexpr, missing_struct_fields, extra_columns, extra_struct_fields, integer_cast, float_cast ) ) end |
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
4336 4337 4338 |
# File 'lib/polars/lazy_frame.rb', line 4336 def max _from_rbldf(_ldf.max) end |
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
4396 4397 4398 |
# File 'lib/polars/lazy_frame.rb', line 4396 def mean _from_rbldf(_ldf.mean) end |
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
4416 4417 4418 |
# File 'lib/polars/lazy_frame.rb', line 4416 def median _from_rbldf(_ldf.median) end |
#merge_sorted(other, key, maintain_order: false) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the callers responsibility that the frames are sorted by that key otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
5151 5152 5153 |
# File 'lib/polars/lazy_frame.rb', line 5151 def merge_sorted(other, key, maintain_order: false) _from_rbldf(_ldf.merge_sorted(other._ldf, key, maintain_order)) end |
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
4356 4357 4358 |
# File 'lib/polars/lazy_frame.rb', line 4356 def min _from_rbldf(_ldf.min) end |
#null_count ⇒ LazyFrame
Aggregate the columns in the LazyFrame as the sum of their null value count.
4442 4443 4444 |
# File 'lib/polars/lazy_frame.rb', line 4442 def null_count _from_rbldf(_ldf.null_count) end |
#pipe(function, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
261 262 263 |
# File 'lib/polars/lazy_frame.rb', line 261 def pipe(function, *args, **kwargs, &block) function.(self, *args, **kwargs, &block) end |
#pipe_with_schema(function) ⇒ LazyFrame
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Allows to alter the lazy frame during the plan stage with the resolved schema.
In contrast to pipe, this method does not execute function immediately but
only during the plan stage. This allows to use the resolved schema of the input
to dynamically alter the lazy frame. This also means that any exceptions raised
by function will only be emitted during the plan stage.
305 306 307 308 309 310 311 312 313 314 315 316 |
# File 'lib/polars/lazy_frame.rb', line 305 def pipe_with_schema(function) wrapper = lambda do |lf_and_schema| # The last index is because we return a list for multiple inputs # to make `pipe_with_schemas` (plural) work, but we don't use that function.( _from_rbldf(lf_and_schema[0][0]), lf_and_schema[1][0] )._ldf end _from_rbldf(_ldf.pipe_with_schema(wrapper)) end |
#pivot(on, on_columns:, index: nil, values: nil, aggregate_function: nil, maintain_order: false, separator: "_", column_naming: "auto") ⇒ LazyFrame
In some other frameworks, you might know this operation as pivot_wider.
Create a spreadsheet-style pivot table as a DataFrame.
4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 |
# File 'lib/polars/lazy_frame.rb', line 4754 def pivot( on, on_columns:, index: nil, values: nil, aggregate_function: nil, maintain_order: false, separator: "_", column_naming: "auto" ) if index.nil? && values.nil? msg = "`pivot` needs either `index or `values` needs to be specified" raise InvalidOperationError, msg end on_selector = Utils.parse_list_into_selector(on) if !values.nil? values_selector = Utils.parse_list_into_selector(values) end if !index.nil? index_selector = Utils.parse_list_into_selector(index) end if values.nil? values_selector = Polars.cs.all - on_selector - index_selector end if index.nil? index_selector = Polars.cs.all - on_selector - values_selector end agg = F.element if aggregate_function.is_a?(::String) if aggregate_function == "first" agg = agg.first elsif aggregate_function == "item" agg = agg.item elsif aggregate_function == "sum" agg = agg.sum elsif aggregate_function == "max" agg = agg.max elsif aggregate_function == "min" agg = agg.min elsif aggregate_function == "mean" agg = agg.mean elsif aggregate_function == "median" agg = agg.median elsif aggregate_function == "last" agg = agg.last elsif aggregate_function == "len" agg = agg.len elsif aggregate_function == "count" Utils.issue_deprecation_warning( "`aggregate_function='count'` input for `pivot` is deprecated." + " Please use `aggregate_function='len'`." ) agg = agg.len else msg = "invalid input for `aggregate_function` argument: #{aggregate_function.inspect}" raise ArgumentError, msg end elsif aggregate_function.nil? agg = agg.item(allow_empty: true) else agg = aggregate_function end if on_columns.is_a?(DataFrame) on_cols = on_columns elsif on_columns.is_a?(Series) on_cols = on_columns.to_frame else on_cols = Series.new(on_columns).to_frame end _from_rbldf( _ldf.pivot( on_selector._rbselector, on_cols._df, index_selector._rbselector, values_selector._rbselector, agg._rbexpr, maintain_order, separator, column_naming ) ) end |
#profile(engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Array
Profile a LazyFrame.
This will run the query and return a tuple containing the materialized DataFrame and a DataFrame that contains profiling information of each node that is executed.
The units of the timings are microseconds.
964 965 966 967 968 969 970 971 972 973 974 |
# File 'lib/polars/lazy_frame.rb', line 964 def profile( engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) ldf = _ldf.with_optimizations(optimizations._rboptflags) df_rb, timings_rb = ldf.profile [Utils.wrap_df(df_rb), Utils.wrap_df(timings_rb)] end |
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
4467 4468 4469 4470 |
# File 'lib/polars/lazy_frame.rb', line 4467 def quantile(quantile, interpolation: "nearest") quantile = Utils.parse_into_expression(quantile, str_as_lit: false) _from_rbldf(_ldf.quantile(quantile, interpolation)) end |
#remove(*predicates, **constraints) ⇒ LazyFrame
Remove rows, dropping those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Rows where the filter predicate does not evaluate to true are retained
(this includes rows where the predicate evaluates as null).
2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 |
# File 'lib/polars/lazy_frame.rb', line 2270 def remove( *predicates, **constraints ) if constraints.empty? # early-exit conditions (exclude/include all rows) if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass)) return clear end if predicates.length == 1 && predicates[0].is_a?(FalseClass) return dup end end _filter( predicates: predicates, constraints: constraints, invert: true ) end |
#rename(mapping, strict: true) ⇒ LazyFrame
Rename column names.
3733 3734 3735 3736 3737 3738 3739 3740 3741 |
# File 'lib/polars/lazy_frame.rb', line 3733 def rename(mapping, strict: true) if mapping.respond_to?(:call) select(F.all.name.map(&mapping)) else existing = mapping.keys _new = mapping.values _from_rbldf(_ldf.rename(existing, _new, strict)) end end |
#reverse ⇒ LazyFrame
Reverse the DataFrame.
3766 3767 3768 |
# File 'lib/polars/lazy_frame.rb', line 3766 def reverse _from_rbldf(_ldf.reverse) end |
#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ LazyFrame
Create rolling groups based on a time column.
Different from a dynamic_group_by the windows are now determined by the
individual values and are not of constant intervals. For constant intervals
use group_by_dynamic.
The period and offset arguments are created either from a timedelta, or
by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_rolling on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 |
# File 'lib/polars/lazy_frame.rb', line 2531 def rolling( index_column:, period:, offset: nil, closed: "right", group_by: nil ) index_column = Utils.parse_into_expression(index_column) if offset.nil? offset = Utils.negate_duration_string(Utils.parse_as_duration_string(period)) end rbexprs_by = ( !group_by.nil? ? Utils.parse_into_list_of_expressions(group_by) : [] ) period = Utils.parse_as_duration_string(period) offset = Utils.parse_as_duration_string(offset) lgb = _ldf.rolling(index_column, period, offset, closed, rbexprs_by) LazyGroupBy.new(lgb) end |
#schema ⇒ Hash
Get the schema.
148 149 150 |
# File 'lib/polars/lazy_frame.rb', line 148 def schema _ldf.collect_schema end |
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
2379 2380 2381 2382 2383 2384 2385 2386 |
# File 'lib/polars/lazy_frame.rb', line 2379 def select(*exprs, **named_exprs) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0" rbexprs = Utils.parse_into_list_of_expressions( *exprs, **named_exprs, __structify: structify ) _from_rbldf(_ldf.select(rbexprs)) end |
#select_seq(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this LazyFrame.
This will run all expression sequentially instead of in parallel. Use this when the work per expression is cheap.
2402 2403 2404 2405 2406 2407 2408 2409 |
# File 'lib/polars/lazy_frame.rb', line 2402 def select_seq(*exprs, **named_exprs) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0 rbexprs = Utils.parse_into_list_of_expressions( *exprs, **named_exprs, __structify: structify ) _from_rbldf(_ldf.select_seq(rbexprs)) end |
#serialize(file = nil, format: "binary") ⇒ Object
Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.
Serialize the logical plan of this LazyFrame to a file or string.
214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 |
# File 'lib/polars/lazy_frame.rb', line 214 def serialize(file = nil, format: "binary") if format == "binary" raise Todo unless _ldf.respond_to?(:serialize_binary) serializer = _ldf.method(:serialize_binary) elsif format == "json" msg = "'json' serialization format of LazyFrame is deprecated" warn msg serializer = _ldf.method(:serialize_json) else msg = "`format` must be one of {{'binary', 'json'}}, got #{format.inspect}" raise ArgumentError, msg end Utils.serialize_polars_object(serializer, file) end |
#set_sorted(column, *more_columns, descending: false, nulls_last: false) ⇒ LazyFrame
This can lead to incorrect results if the data is NOT sorted! Use with care!
Flag a column as sorted.
This can speed up future operations.
5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197 5198 5199 |
# File 'lib/polars/lazy_frame.rb', line 5172 def set_sorted( column, *more_columns, descending: false, nulls_last: false ) if !Utils.strlike?(column) msg = "expected a 'str' for argument 'column' in 'set_sorted'" raise TypeError, msg end if Utils.bool?(descending) ds = [descending] else ds = descending end if Utils.bool?(nulls_last) nl = [nulls_last] else nl = nulls_last end _from_rbldf( _ldf.hint_sorted( [column] + more_columns, ds, nl ) ) end |
#shift(n = 1, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
3812 3813 3814 3815 3816 3817 3818 |
# File 'lib/polars/lazy_frame.rb', line 3812 def shift(n = 1, fill_value: nil) if !fill_value.nil? fill_value = Utils.parse_into_expression(fill_value, str_as_lit: true) end n = Utils.parse_into_expression(n) _from_rbldf(_ldf.shift(n, fill_value)) end |
#show_graph(optimized: true, show: true, output_path: nil, raw_output: false, figsize: [16.0, 12.0], engine: "auto", plan_stage: "ir", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
Show a plot of the query plan.
Note that Graphviz must be installed to render the visualization (if not already present, you can download it here: https://graphviz.org/download.
612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 |
# File 'lib/polars/lazy_frame.rb', line 612 def show_graph( optimized: true, show: true, output_path: nil, raw_output: false, figsize: [16.0, 12.0], engine: "auto", plan_stage: "ir", optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) if engine == "streaming" issue_unstable_warning("streaming mode is considered unstable.") end optimizations = optimizations.dup optimizations._rboptflags.streaming = engine == "streaming" _ldf = self._ldf.with_optimizations(optimizations._rboptflags) if plan_stage == "ir" dot = _ldf.to_dot(optimized) elsif plan_stage == "physical" if engine == "streaming" dot = _ldf.to_dot_streaming_phys(optimized) else dot = _ldf.to_dot(optimized) end else error_msg = "invalid plan stage '#{plan_stage}'" raise TypeError, error_msg end Utils.display_dot_graph( dot: dot, show: show, output_path: output_path, raw_output: raw_output ) end |
#sink_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, &function) ⇒ Object
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.
Evaluate the query and call a user-defined function for every ready batch.
This allows streaming results that are larger than RAM in certain cases.
1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 |
# File 'lib/polars/lazy_frame.rb', line 1813 def sink_batches( chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, &function ) _wrap = lambda do |rbdf| df = Utils.wrap_df(rbdf) !!function.(df) end ldf = _ldf.sink_batches( _wrap, maintain_order, chunk_size ) if !lazy ldf = ldf.with_optimizations(optimizations._rboptflags) lf = LazyFrame._from_rbldf(ldf) lf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf) end |
#sink_csv(path, include_bom: false, compression: "uncompressed", compression_level: nil, check_extension: true, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
This allows streaming results that are larger than RAM to be written to disk.
1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 |
# File 'lib/polars/lazy_frame.rb', line 1580 def sink_csv( path, include_bom: false, compression: "uncompressed", compression_level: nil, check_extension: true, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) Utils._check_arg_is_1byte("separator", separator, false) Utils._check_arg_is_1byte("quote_char", quote_char, false) engine = _select_engine(engine) _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) credential_provider_builder = _init_credential_provider_builder.( credential_provider, path, , "sink_csv" ) target = _to_sink_target(path) if !retries.nil? msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead." Utils.issue_deprecation_warning(msg) = || {} ["max_retries"] = retries end = SinkOptions.new( mkdir: mkdir, maintain_order: maintain_order, sync_on_close: sync_on_close, storage_options: , credential_provider: credential_provider_builder ) ldf_rb = _ldf.sink_csv( target, , include_bom, compression, compression_level, check_extension, include_header, separator.ord, line_terminator, quote_char.ord, batch_size, datetime_format, date_format, time_format, float_scientific, float_precision, decimal_comma, null_value, quote_style ) if !lazy ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags) ldf = LazyFrame._from_rbldf(ldf_rb) ldf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf_rb) end |
#sink_delta(target, mode: "error", storage_options: nil, credential_provider: "auto", delta_write_options: nil, delta_merge_options: nil, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Sink DataFrame as delta table.
1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 |
# File 'lib/polars/lazy_frame.rb', line 1278 def sink_delta( target, mode: "error", storage_options: nil, credential_provider: "auto", delta_write_options: nil, delta_merge_options: nil, optimizations: DEFAULT_QUERY_OPT_FLAGS ) Polars.send(:_check_if_delta_available) # TODO # _check_for_unsupported_types(collect_schema.dtypes) if Utils.pathlike?(target) target = Polars.send(:_resolve_delta_lake_uri, target.to_s, strict: false) end _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) if !target.is_a?(DeltaLake::Table) credential_provider_builder = _init_credential_provider_builder.( credential_provider, target, , "sink_delta" ) elsif !credential_provider.nil? && credential_provider != "auto" msg = "cannot use credential_provider when passing a DeltaTable object" raise ArgumentError, msg else credential_provider_builder = nil end credential_provider_creds = {} if credential_provider_builder raise Todo end # We aren't calling into polars-native write functions so we just update # the storage_options here. = if !.nil? || !credential_provider_builder.nil? ( || {}).merge(credential_provider_creds) else nil end stream = collect_batches( engine: "streaming", maintain_order: true, chunk_size: nil, lazy: true, optimizations: optimizations ) if mode == "merge" if .nil? msg = "you need to pass delta_merge_options with at least a given predicate for `MERGE` to work." raise ArgumentError, msg end if target.is_a?(::String) dt = DeltaLake::Table.new(target, storage_options: ) else dt = target end dt.merge(stream, **) else if .nil? = {} end DeltaLake.write( target, stream, mode: mode, storage_options: , ** ) nil end end |
#sink_ipc(path, compression: "uncompressed", compat_level: nil, record_batch_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _record_batch_statistics: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
This allows streaming results that are larger than RAM to be written to disk.
1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 |
# File 'lib/polars/lazy_frame.rb', line 1415 def sink_ipc( path, compression: "uncompressed", compat_level: nil, record_batch_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _record_batch_statistics: false ) engine = _select_engine(engine) _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) if !retries.nil? msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead." Utils.issue_deprecation_warning(msg) = || {} ["max_retries"] = retries end credential_provider_builder = _init_credential_provider_builder.( credential_provider, path, , "sink_ipc" ) target = _to_sink_target(path) compat_level_rb = nil if compat_level.nil? compat_level_rb = true else raise Todo end if compression.nil? compression = "uncompressed" end = SinkOptions.new( mkdir: mkdir, maintain_order: maintain_order, sync_on_close: sync_on_close, storage_options: , credential_provider: credential_provider_builder ) ldf_rb = _ldf.sink_ipc( target, , compression, compat_level_rb, record_batch_size, _record_batch_statistics ) if !lazy ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags) ldf = LazyFrame._from_rbldf(ldf_rb) ldf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf_rb) end |
#sink_ndjson(path, compression: "uncompressed", compression_level: nil, check_extension: true, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
This allows streaming results that are larger than RAM to be written to disk.
1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 |
# File 'lib/polars/lazy_frame.rb', line 1719 def sink_ndjson( path, compression: "uncompressed", compression_level: nil, check_extension: true, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS ) engine = _select_engine(engine) if !retries.nil? msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead." Utils.issue_deprecation_warning(msg) = || {} ["max_retries"] = retries end _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) credential_provider_builder = _init_credential_provider_builder.( credential_provider, path, , "sink_ndjson" ) target = _to_sink_target(path) = SinkOptions.new( mkdir: mkdir, maintain_order: maintain_order, sync_on_close: sync_on_close, storage_options: , credential_provider: credential_provider_builder ) ldf_rb = _ldf.sink_ndjson( target, compression, compression_level, check_extension, ) if !lazy ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags) ldf = LazyFrame._from_rbldf(ldf_rb) ldf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf_rb) end |
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, metadata: nil, mkdir: false, lazy: false, arrow_schema: nil, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _sinked_paths_callback: nil) ⇒ DataFrame
Persists a LazyFrame at the provided path.
This allows streaming results that are larger than RAM to be written to disk.
1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 |
# File 'lib/polars/lazy_frame.rb', line 1148 def sink_parquet( path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, metadata: nil, mkdir: false, lazy: false, arrow_schema: nil, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _sinked_paths_callback: nil ) engine = _select_engine(engine) if statistics == true statistics = { min: true, max: true, distinct_count: false, null_count: true } elsif statistics == false statistics = {} elsif statistics == "full" statistics = { min: true, max: true, distinct_count: true, null_count: true } end _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder) if !retries.nil? msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead." Utils.issue_deprecation_warning(msg) = || {} ["max_retries"] = retries end credential_provider_builder = _init_credential_provider_builder.( credential_provider, path, , "sink_parquet" ) target = _to_sink_target(path) = SinkOptions.new( mkdir: mkdir, maintain_order: maintain_order, sync_on_close: sync_on_close, storage_options: , credential_provider: credential_provider_builder, sinked_paths_callback: _sinked_paths_callback ) ldf_rb = _ldf.sink_parquet( target, , compression, compression_level, statistics, row_group_size, data_page_size, , arrow_schema ) if !lazy ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags) ldf = LazyFrame._from_rbldf(ldf_rb) ldf.collect(engine: engine) return nil end LazyFrame._from_rbldf(ldf_rb) end |
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
3849 3850 3851 3852 3853 3854 |
# File 'lib/polars/lazy_frame.rb', line 3849 def slice(offset, length = nil) if length && length < 0 raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame" end _from_rbldf(_ldf.slice(offset, length)) end |
#sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
Sorting can be done by:
- A single column name
- An expression
- Multiple expressions
698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 |
# File 'lib/polars/lazy_frame.rb', line 698 def sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true) if by.is_a?(::String) && more_by.empty? return _from_rbldf( _ldf.sort( by, descending, nulls_last, maintain_order, multithreaded ) ) end by = Utils.parse_into_list_of_expressions(by, *more_by) descending = Utils.extend_bool(descending, by.length, "descending", "by") nulls_last = Utils.extend_bool(nulls_last, by.length, "nulls_last", "by") _from_rbldf( _ldf.sort_by_exprs( by, descending, nulls_last, maintain_order, multithreaded ) ) end |
#sql(query, table_name: "self") ⇒ Expr
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- The calling frame is automatically registered as a table in the SQL context
under the name "self". If you want access to the DataFrames and LazyFrames
found in the current globals, use the top-level
Polars.sql. - More control over registration and execution behaviour is available by
using the
SQLContextobject.
Execute a SQL query against the LazyFrame.
777 778 779 780 781 782 |
# File 'lib/polars/lazy_frame.rb', line 777 def sql(query, table_name: "self") ctx = Polars::SQLContext.new name = table_name || "self" ctx.register(name, self) ctx.execute(query) end |
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
4284 4285 4286 |
# File 'lib/polars/lazy_frame.rb', line 4284 def std(ddof: 1) _from_rbldf(_ldf.std(ddof)) end |
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
4376 4377 4378 |
# File 'lib/polars/lazy_frame.rb', line 4376 def sum _from_rbldf(_ldf.sum) end |
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
3989 3990 3991 |
# File 'lib/polars/lazy_frame.rb', line 3989 def tail(n = 5) _from_rbldf(_ldf.tail(n)) end |
#to_s ⇒ String
Returns a string representing the LazyFrame.
176 177 178 179 180 181 182 |
# File 'lib/polars/lazy_frame.rb', line 176 def to_s <<~EOS naive plan: (run LazyFrame#explain(optimized: true) to see the optimized plan) #{explain(optimized: false)} EOS end |
#top_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of
the value of reverse. The output is not guaranteed to be in any
particular order, call :func:sort after this function if you wish the
output to be sorted.
838 839 840 841 842 843 844 845 846 |
# File 'lib/polars/lazy_frame.rb', line 838 def top_k( k, by:, reverse: false ) by = Utils.parse_into_list_of_expressions(by) reverse = Utils.extend_bool(reverse, by.length, "reverse", "by") _from_rbldf(_ldf.top_k(k, by, reverse)) end |
#unique(maintain_order: false, subset: nil, keep: "any") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List in the DataFrame or
subset.
4583 4584 4585 4586 4587 4588 4589 |
# File 'lib/polars/lazy_frame.rb', line 4583 def unique(maintain_order: false, subset: nil, keep: "any") parsed_subset = nil if !subset.nil? parsed_subset = Utils.parse_into_list_of_expressions(subset, __require_selectors: true) end _from_rbldf(_ldf.unique(maintain_order, parsed_subset, keep)) end |
#unnest(columns = nil, *more_columns, separator: nil) ⇒ LazyFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame on the location of the
struct type.
5102 5103 5104 5105 5106 5107 |
# File 'lib/polars/lazy_frame.rb', line 5102 def unnest(columns = nil, *more_columns, separator: nil) subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector( more_columns ) _from_rbldf(_ldf.unnest(subset._rbselector, separator)) end |
#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 |
# File 'lib/polars/lazy_frame.rb', line 4891 def unpivot( on = nil, index: nil, variable_name: nil, value_name: nil, streamable: true ) if !streamable warn "The `streamable` parameter for `LazyFrame.unpivot` is deprecated" end selector_on = on.nil? ? nil : Utils.parse_list_into_selector(on)._rbselector selector_index = index.nil? ? Selectors.empty : Utils.parse_list_into_selector(index) _from_rbldf( _ldf.unpivot( selector_on, selector_index._rbselector, value_name, variable_name ) ) end |
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ LazyFrame
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This is syntactic sugar for a left/inner join that preserves the order
of the left DataFrame by default, with an optional coalesce when
include_nulls: false.
Update the values in this LazyFrame with the values in other.
5311 5312 5313 5314 5315 5316 5317 5318 5319 5320 5321 5322 5323 5324 5325 5326 5327 5328 5329 5330 5331 5332 5333 5334 5335 5336 5337 5338 5339 5340 5341 5342 5343 5344 5345 5346 5347 5348 5349 5350 5351 5352 5353 5354 5355 5356 5357 5358 5359 5360 5361 5362 5363 5364 5365 5366 5367 5368 5369 5370 5371 5372 5373 5374 5375 5376 5377 5378 5379 5380 5381 5382 5383 5384 5385 5386 5387 5388 5389 5390 5391 5392 5393 5394 5395 5396 5397 5398 5399 5400 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414 5415 5416 5417 5418 5419 5420 5421 5422 5423 5424 5425 5426 5427 5428 5429 5430 5431 5432 5433 |
# File 'lib/polars/lazy_frame.rb', line 5311 def update( other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left" ) Utils.require_same_type(self, other) if ["outer", "outer_coalesce"].include?(how) how = "full" end if !["left", "inner", "full"].include?(how) msg = "`how` must be one of {{'left', 'inner', 'full'}}; found #{how.inspect}" raise ArgumentError, msg end slf = self row_index_used = false if on.nil? if left_on.nil? && right_on.nil? # no keys provided--use row index row_index_used = true row_index_name = "__POLARS_ROW_INDEX" slf = slf.with_row_index(name: row_index_name) other = other.with_row_index(name: row_index_name) left_on = right_on = [row_index_name] else # one of left or right is missing, raise error if left_on.nil? msg = "missing join columns for left frame" raise ArgumentError, msg end if right_on.nil? msg = "missing join columns for right frame" raise ArgumentError, msg end end else # move on into left/right_on to simplify logic left_on = right_on = on end if left_on.is_a?(::String) left_on = [left_on] end if right_on.is_a?(::String) right_on = [right_on] end left_schema = slf.collect_schema left_on.each do |name| if !left_schema.include?(name) msg = "left join column #{name.inspect} not found" raise ArgumentError, msg end end right_schema = other.collect_schema right_on.each do |name| if !right_schema.include?(name) msg = "right join column #{name.inspect} not found" raise ArgumentError, msg end end # no need to join if *only* join columns are in other (inner/left update only) if how != "full" && right_schema.length == right_on.length if row_index_used return slf.drop(row_index_name) end return slf end # only use non-idx right columns present in left frame right_other = Set.new(right_schema.to_h.keys).intersection(left_schema.to_h.keys) - Set.new(right_on) # When include_nulls is true, we need to distinguish records after the join that # were originally null in the right frame, as opposed to records that were null # because the key was missing from the right frame. # Add a validity column to track whether row was matched or not. if include_nulls validity = ["__POLARS_VALIDITY"] other = other.with_columns(F.lit(true).alias(validity[0])) else validity = [] end tmp_name = "__POLARS_RIGHT" drop_columns = right_other.map { |name| "#{name}#{tmp_name}" } + validity result = ( slf.join( other.select(*right_on, *right_other, *validity), left_on: left_on, right_on: right_on, how: how, suffix: tmp_name, coalesce: true, maintain_order: maintain_order ) .with_columns( right_other.map do |name| ( if include_nulls # use left value only when right value failed to join F.when(F.col(validity).is_null) .then(F.col(name)) .otherwise(F.col("#{name}#{tmp_name}")) else F.coalesce(["#{name}#{tmp_name}", F.col(name)]) end ).alias(name) end ) .drop(drop_columns) ) if row_index_used result = result.drop(row_index_name) end _from_rbldf(result._ldf) end |
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
4316 4317 4318 |
# File 'lib/polars/lazy_frame.rb', line 4316 def var(ddof: 1) _from_rbldf(_ldf.var(ddof)) end |
#width ⇒ Integer
Get the width of the LazyFrame.
160 161 162 |
# File 'lib/polars/lazy_frame.rb', line 160 def width _ldf.collect_schema.length end |
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
3593 3594 3595 3596 3597 3598 3599 |
# File 'lib/polars/lazy_frame.rb', line 3593 def with_columns(*exprs, **named_exprs) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0" rbexprs = Utils.parse_into_list_of_expressions(*exprs, **named_exprs, __structify: structify) _from_rbldf(_ldf.with_columns(rbexprs)) end |
#with_columns_seq(*exprs, **named_exprs) ⇒ LazyFrame
Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
This will run all expression sequentially instead of in parallel. Use this when the work per expression is cheap.
3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 |
# File 'lib/polars/lazy_frame.rb', line 3617 def with_columns_seq( *exprs, **named_exprs ) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0 rbexprs = Utils.parse_into_list_of_expressions( *exprs, **named_exprs, __structify: structify ) _from_rbldf(_ldf.with_columns_seq(rbexprs)) end |
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame
This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.
Add a column at index 0 that counts the rows.
4075 4076 4077 |
# File 'lib/polars/lazy_frame.rb', line 4075 def with_row_index(name: "index", offset: 0) _from_rbldf(_ldf.with_row_index(name, offset)) end |