Class: SparkConnect::DataFrameNaFunctions

Inherits:
Object
Defined in:
lib/spark_connect/na_functions.rb

Overview

Missing-data helpers, returned by SparkConnect::DataFrame#na. Mirrors PySpark's `DataFrame.na` (`DataFrameNaFunctions`).

Examples:

df.na.drop(how: :any)
df.na.fill(0)
df.na.fill({ "name" => "unknown", "age" => 0 })
df.na.replace("UNKNOWN", nil, subset: ["name"])

Constant Summary

Proto =
SparkConnect::Proto

Instance Method Summary

Constructor Details

#initialize(df) ⇒ DataFrameNaFunctions

Returns a new instance of DataFrameNaFunctions.

Parameters:

  • df (DataFrame)

    the DataFrame these helpers operate on.

# File 'lib/spark_connect/na_functions.rb', line 16

def initialize(df)
  @df = df
end

Instance Method Details

#drop(how: :any, thresh: nil, subset: nil) ⇒ DataFrame

Drop rows containing null values.

Parameters:

  • how (Symbol) (defaults to: :any)

    `:any` (drop a row if any considered column is null) or `:all` (drop a row only if every considered column is null).

  • thresh (Integer, nil) (defaults to: nil)

    keep rows with at least this many non-null values (overrides `how` when given).

  • subset (Array<String>, nil) (defaults to: nil)

    only consider these columns.

Returns:

  • (DataFrame)

# File 'lib/spark_connect/na_functions.rb', line 27

def drop(how: :any, thresh: nil, subset: nil)
  cols = Array(subset).map(&:to_s)
  min_non_nulls = thresh || (if how.to_sym == :all
                               1
                             else
                               (cols.empty? ? nil : cols.size)
                             end)
  nd = Proto::NADrop.new(input: @df.relation, cols: cols)
  nd.min_non_nulls = min_non_nulls if min_non_nulls
  @df.build(drop_na: nd)
end
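
The way `how`, `thresh`, and `subset` combine is easy to misread in the nested conditional above. The following is a minimal standalone sketch, not part of the library: `resolve_min_non_nulls` and `keep_row?` are hypothetical helpers that reproduce the resolution in plain Ruby and apply it to rows modeled as Hashes. It assumes the server treats an unset `min_non_nulls` as "every considered column must be non-null" and an empty column list as "all columns".

```ruby
# Hypothetical helper: mirrors how #drop turns its keyword arguments into
# the minimum number of non-null values a row needs in order to be kept.
def resolve_min_non_nulls(how: :any, thresh: nil, subset: nil)
  cols = Array(subset).map(&:to_s)
  return thresh if thresh          # thresh overrides how
  return 1 if how.to_sym == :all   # a single non-null value keeps the row
  cols.empty? ? nil : cols.size    # nil leaves the field unset
end

# Hypothetical local model of the row filter. Assumption: an unset threshold
# means every considered column must be non-null.
def keep_row?(row, min_non_nulls, subset: nil)
  cols = Array(subset).map(&:to_s)
  cols = row.keys if cols.empty?
  non_nulls = cols.count { |c| !row[c].nil? }
  non_nulls >= (min_non_nulls || cols.size)
end

rows = [
  { "name" => "Ada", "age" => 36 },
  { "name" => nil,   "age" => 41 },
  { "name" => nil,   "age" => nil }
]

# how: :any keeps only fully populated rows.
any_kept = rows.select { |r| keep_row?(r, resolve_min_non_nulls(how: :any)) }
# how: :all drops only the row where every column is null.
all_kept = rows.select { |r| keep_row?(r, resolve_min_non_nulls(how: :all)) }
# Restricting to "age" ignores nulls in "name".
age_kept = rows.select do |r|
  keep_row?(r, resolve_min_non_nulls(subset: ["age"]), subset: ["age"])
end
```

In this model, `how: :any` keeps one row, `how: :all` keeps two, and `subset: ["age"]` keeps the two rows with a non-null age. Passing `thresh` makes `how` irrelevant, as in the listing.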

#fill(value, subset: nil) ⇒ DataFrame
#fill(value_map) ⇒ DataFrame

Replace null values.

Overloads:

  • #fill(value, subset: nil) ⇒ DataFrame

    Parameters:

    • value (Object)

      a scalar used to fill all (or `subset`) columns.

  • #fill(value_map) ⇒ DataFrame

    Parameters:

    • value_map (Hash{String=>Object})

      per-column fill values.

Returns:

  • (DataFrame)

# File 'lib/spark_connect/na_functions.rb', line 46

def fill(value, subset: nil)
  cols, values =
    if value.is_a?(Hash)
      [value.keys.map(&:to_s), value.values]
    else
      [Array(subset).map(&:to_s), Array(subset).empty? ? [value] : Array(subset).map { value }]
    end
  nf = Proto::NAFill.new(
    input: @df.relation, cols: cols, values: values.map { |v| na_literal(v) }
  )
  @df.build(fill_na: nf)
end
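
To make the two overloads concrete, here is a small standalone sketch of the argument normalization in the listing above. `normalize_fill` is a hypothetical helper written for illustration only; it returns the column and value lists before they would be wrapped in literals and sent to the server.

```ruby
# Hypothetical helper: reproduces the argument normalization in #fill and
# returns the [cols, values] pair.
def normalize_fill(value, subset: nil)
  if value.is_a?(Hash)
    # Per-column form: keys become column names, values are used as given.
    [value.keys.map(&:to_s), value.values]
  else
    cols = Array(subset).map(&:to_s)
    # Scalar form: one value for all columns, or one copy per subset column.
    [cols, cols.empty? ? [value] : cols.map { value }]
  end
end

scalar_all    = normalize_fill(0)
scalar_subset = normalize_fill(0, subset: [:age, :score])
per_column    = normalize_fill({ name: "unknown", "age" => 0 })
hash_subset   = normalize_fill({ "age" => 0 }, subset: ["name"])
```

Note that in the Hash form `subset:` has no effect: the columns come entirely from the Hash keys, so `hash_subset` targets only `"age"`.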

#replace(to_replace, value = nil, subset: nil) ⇒ DataFrame

Replace specific values with others.

Parameters:

  • to_replace (Object, Array, Hash)

    value(s) to replace, or a `{ old => new }` mapping.

  • value (Object, Array, nil) (defaults to: nil)

    replacement value(s) when `to_replace` is not a Hash; paired with `to_replace` by position.

  • subset (Array<String>, nil) (defaults to: nil)

    only consider these columns.

Returns:

  • (DataFrame)

# File 'lib/spark_connect/na_functions.rb', line 67

def replace(to_replace, value = nil, subset: nil)
  mapping =
    if to_replace.is_a?(Hash)
      to_replace
    else
      Array(to_replace).zip(Array(value)).to_h
    end
  replacements = mapping.map do |old, new_value|
    Proto::NAReplace::Replacement.new(
      old_value: na_literal(old), new_value: na_literal(new_value)
    )
  end
  nr = Proto::NAReplace.new(
    input: @df.relation, cols: Array(subset).map(&:to_s), replacements: replacements
  )
  @df.build(replace: nr)
end
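
The pairing of `to_replace` and `value` is positional, which has a consequence worth seeing in isolation. The sketch below is a standalone illustration, not library code: `replacement_mapping` is a hypothetical helper that repeats the mapping step from the listing above in plain Ruby.

```ruby
# Hypothetical helper: reproduces how #replace builds its old => new mapping.
def replacement_mapping(to_replace, value = nil)
  return to_replace if to_replace.is_a?(Hash)

  # Array() wraps scalars, and zip pairs elements by position, padding
  # with nil when the value list is shorter than the to_replace list.
  Array(to_replace).zip(Array(value)).to_h
end

single   = replacement_mapping("UNKNOWN", nil)
pairwise = replacement_mapping(["N/A", "?"], ["unknown", "missing"])
by_hash  = replacement_mapping({ "N/A" => "unknown" })
short    = replacement_mapping(["N/A", "?"], "unknown")
```

Here `single` maps `"UNKNOWN"` to `nil`, matching the example in the Overview. In `short`, the scalar `"unknown"` is paired only with `"N/A"`, and `"?"` is mapped to `nil` rather than to `"unknown"`. To send several old values to the same replacement, pass a Hash or an equally long `value` array.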