Class: DaruLite::DataFrame

Inherits:
Object
Extended by:
Gem::Deprecate
Includes:
Aggregatable, Calculatable, Convertible, Duplicatable, Fetchable, Filterable, IOAble, Indexable, Iterable, Joinable, Missable, Pivotable, Queryable, Setable, Sortable, Maths::Arithmetic::DataFrame, Maths::Statistics::DataFrame
Defined in:
lib/daru_lite/dataframe.rb,
lib/daru_lite/data_frame/setable.rb,
lib/daru_lite/data_frame/i_o_able.rb,
lib/daru_lite/data_frame/iterable.rb,
lib/daru_lite/data_frame/joinable.rb,
lib/daru_lite/data_frame/missable.rb,
lib/daru_lite/data_frame/sortable.rb,
lib/daru_lite/data_frame/fetchable.rb,
lib/daru_lite/data_frame/indexable.rb,
lib/daru_lite/data_frame/pivotable.rb,
lib/daru_lite/data_frame/queryable.rb,
lib/daru_lite/extensions/which_dsl.rb,
lib/daru_lite/data_frame/filterable.rb,
lib/daru_lite/data_frame/convertible.rb,
lib/daru_lite/data_frame/aggregatable.rb,
lib/daru_lite/data_frame/calculatable.rb,
lib/daru_lite/data_frame/duplicatable.rb


Defined Under Namespace

Modules: Aggregatable, Calculatable, Convertible, Duplicatable, Fetchable, Filterable, IOAble, Indexable, Iterable, Joinable, Missable, Pivotable, Queryable, Setable, Sortable

Constant Summary

AXES =
%i[row vector].freeze

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Maths::Statistics::DataFrame

#acf, #correlation, #count, #covariance, #cumsum, #describe, #ema, #max, #mean, #median, #min, #mode, #percent_change, #product, #range, #rolling_count, #rolling_max, #rolling_mean, #rolling_median, #rolling_min, #rolling_std, #rolling_variance, #standardize, #std, #sum, #variance_sample

Methods included from Maths::Arithmetic::DataFrame

#%, #*, #**, #+, #-, #/, #exp, #round, #sqrt

Methods included from Queryable

#all?, #any?, #has_vector?, #include_values?

Methods included from Sortable

#order=, #rotate_vectors, #sort, #sort!

Methods included from Setable

#[]=, #add_row, #add_vector, #insert_vector, #set_at, #set_row_at

Methods included from Pivotable

#pivot_table

Methods included from Missable

#has_missing_data?, #missing_values_rows, #rolling_fillna, #rolling_fillna!

Methods included from Joinable

#concat, #join, #merge, #one_to_many, #union

Methods included from IOAble

#_dump, included, #save, #write_csv, #write_excel, #write_sql

Methods included from Iterable

#apply_method, #collect, #collect_matrix, #collect_row_with_index, #collect_rows, #collect_vector_with_index, #collect_vectors, #each, #each_index, #each_row, #each_row_with_index, #each_vector, #each_vector_with_index, #map, #map!, #map_rows, #map_rows!, #map_rows_with_index, #map_vectors, #map_vectors!, #map_vectors_with_index, #recode, #recode_rows, #recode_vectors, #replace_values, #verify

Methods included from Indexable

#index=, #reindex, #reindex_vectors, #reset_index, #set_index, #vectors=

Methods included from Filterable

#filter, #filter_rows, #filter_vector, #filter_vectors, #keep_row_if, #keep_vector_if, #reject_values, #uniq, #where

Methods included from Fetchable

#[], #access_row_tuples_by_indexs, #at, #get_sub_dataframe, #get_vector_anyways, #head, #numeric_vector_names, #numeric_vectors, #only_numerics, #row_at, #split_by_category, #tail

Methods included from Duplicatable

#clone, #clone_only_valid, #clone_structure, #dup, #dup_only_valid

Methods included from Convertible

#create_sql, #to_a, #to_df, #to_h, #to_html, #to_html_tbody, #to_html_thead, #to_json, #to_matrix, #to_s

Methods included from Calculatable

#compute, #summary, #vector_by_calculation, #vector_count_characters, #vector_mean, #vector_sum

Methods included from Aggregatable

#aggregate, #group_by, #group_by_and_aggregate

Constructor Details

#initialize(source = {}, opts = {}) ⇒ DataFrame

A DataFrame is essentially an Array of Vector objects. Rows are addressed through the index Index object and columns through the vectors Index object.

Arguments

  • source - Source from which the DataFrame is to be initialized. Can be a Hash of names and vectors (Array or DaruLite::Vector), an Array of Arrays, or an Array of DaruLite::Vectors.

Options

:order - An Array/DaruLite::Index/DaruLite::MultiIndex containing the order in which Vectors should appear in the DataFrame.

:index - An Array/DaruLite::Index/DaruLite::MultiIndex containing the labels used to name the rows of the DataFrame, in order.

:name - A name for the DataFrame.

:clone - Specify as true or false. When set to false, and Vector objects are passed as the source, the Vector objects will not be duplicated when creating the DataFrame. Has no effect if an Array is passed as the source, or if the passed DaruLite::Vectors have different indexes. Defaults to true.

Usage

df = DaruLite::DataFrame.new
# =>
# <DaruLite::DataFrame(0x0)>
# Creates an empty DataFrame with no rows or columns.

df = DaruLite::DataFrame.new({}, order: [:a, :b])
# =>
# #<DaruLite::DataFrame(0x2)>
#   a   b
# Creates a DataFrame with no rows and columns :a and :b

df = DaruLite::DataFrame.new({a: [1,2,3,4], b: [6,7,8,9]}, order: [:b, :a],
  index: [:a, :b, :c, :d], name: :spider_man)

# =>
# #<DaruLite::DataFrame: spider_man (4x2)>
#             b          a
#  a          6          1
#  b          7          2
#  c          8          3
#  d          9          4

df = DaruLite::DataFrame.new([[1,2,3,4],[6,7,8,9]], name: :bat_man)

# =>
# #<DaruLite::DataFrame: bat_man (4x2)>
#             0          1
#  0          1          6
#  1          2          7
#  2          3          8
#  3          4          9

# Dataframe having Index name

df = DaruLite::DataFrame.new({a: [1,2,3,4], b: [6,7,8,9]}, order: [:b, :a],
  index: DaruLite::Index.new([:a, :b, :c, :d], name: 'idx_name'),
  name: :spider_man)

# =>
# #<DaruLite::DataFrame: spider_man (4x2)>
# idx_name            b          a
#        a          6          1
#        b          7          2
#        c          8          3
#        d          9          4

idx = DaruLite::Index.new [100, 99, 101, 1, 2], name: "s1"
# => #<DaruLite::Index(5): s1 {100, 99, 101, 1, 2}>

df = DaruLite::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5],
  c: [11,22,33,44,55]},
  order: [:a, :b, :c],
  index: idx)
 # =>
 #<DaruLite::DataFrame(5x3)>
 #   s1   a   b   c
 #  100   1  11  11
 #   99   2  12  22
 #  101   3  13  33
 #    1   4  14  44
 #    2   5  15  55


# File 'lib/daru_lite/dataframe.rb', line 237

def initialize(source = {}, opts = {})
  vectors = opts[:order]
  index = opts[:index] # FIXME: just keyword args after Ruby 2.1
  @data = []
  @name = opts[:name]

  case source
  when [], {}
    create_empty_vectors(vectors, index)
  when Array
    initialize_from_array source, vectors, index, opts
  when Hash
    initialize_from_hash source, vectors, index, opts
  end

  set_size
  validate
  update
end
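
The branch order in the constructor matters: `when [], {}` compares the source with `==`, so an empty Array or Hash is caught before the class checks. A minimal plain-Ruby sketch of that dispatch follows; `classify_source` is a hypothetical helper used only for illustration, not part of DaruLite.

```ruby
# Hypothetical helper mirroring the branch order of DataFrame#initialize.
def classify_source(source)
  case source
  when [], {} then :empty       # [] === x and {} === x fall back to ==
  when Array  then :array
  when Hash   then :hash
  else :unsupported             # illustration only; initialize has no else branch
  end
end

classify_source({})               # :empty
classify_source([[1, 2], [3, 4]]) # :array
classify_source({ a: [1, 2] })    # :hash
```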

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(name, *args) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 492

def method_missing(name, *args, &)
  stringified_name = name.to_s

  if /^([^=]+)=/.match?(stringified_name)
    name = stringified_name[/^([^=]+)=/].delete('=')
    name = name.to_sym unless has_vector?(name)
    insert_or_modify_vector [name], args[0]
  elsif has_vector?(name)
    self[name]
  elsif has_vector?(stringified_name)
    self[stringified_name]
  else
    super
  end
end
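
The effect of this hook is that vector names can be used as reader and writer methods. The sketch below reproduces the same lookup order on a Hash-backed stand-in, assuming nothing about DaruLite internals; `TinyFrame` is a hypothetical class invented for this example.

```ruby
# TinyFrame is a hypothetical stand-in that stores columns in a Hash.
class TinyFrame
  def initialize(columns)
    @columns = columns
  end

  def has_vector?(name)
    @columns.key?(name)
  end

  def method_missing(name, *args, &block)
    stringified_name = name.to_s

    if stringified_name.end_with?('=')
      # Setter: reuse an existing String-named column, otherwise use a Symbol.
      key = stringified_name.delete_suffix('=')
      key = key.to_sym unless has_vector?(key)
      @columns[key] = args[0]
    elsif has_vector?(name)
      @columns[name]
    elsif has_vector?(stringified_name)
      @columns[stringified_name]
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.end_with?('=') || has_vector?(name) || super
  end
end

tf = TinyFrame.new({ a: [1, 2], 'b' => [3, 4] })
tf.a            # reads the Symbol-named column :a
tf.b            # falls back to the String-named column 'b'
tf.c = [5, 6]   # creates a new column :c
```

Defining respond_to_missing? alongside method_missing keeps respond_to? consistent with what the object actually answers.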

Instance Attribute Details

#data ⇒ Object (readonly)

TOREMOVE



# File 'lib/daru_lite/dataframe.rb', line 137

def data
  @data
end

#index ⇒ Object (readonly)

The index of the rows of the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 140

def index
  @index
end

#name ⇒ Object (readonly)

The name of the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 143

def name
  @name
end

#size ⇒ Object (readonly)

The number of rows present in the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 146

def size
  @size
end

#vectors ⇒ Object (readonly)

The vectors (columns) index of the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 135

def vectors
  @vectors
end

Class Method Details

.crosstab_by_assignation(rows, columns, values) ⇒ Object

Generates a new dataset, using three vectors

  • Rows

  • Columns

  • Values

For example, you have these values

x   y   v
a   a   0
a   b   1
b   a   1
b   b   0

You obtain

id  a   b
 a  0   1
 b  1   0

Useful to process outputs from databases



# File 'lib/daru_lite/dataframe.rb', line 84

def crosstab_by_assignation(rows, columns, values)
  raise 'Three vectors should be equal size' if
    rows.size != columns.size || rows.size != values.size

  row_index = rows.uniq.to_a
  data = Hash.new do |h, col|
    h[col] = row_index.map { |r| [r, nil] }.to_h
  end
  validate_no_duplicate_pairs(rows, columns)
  columns.zip(rows, values).each { |c, r, v| data[c][r] = v }

  # FIXME: in fact, WITHOUT this line you'll obtain more "right"
  # data: with vectors having "rows" as an index...
  data = data.transform_values(&:values)
  data[:_id] = row_index

  DataFrame.new(data)
end
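
To make the pivot concrete, here is a plain-Ruby sketch of the same steps applied to the example table above. `crosstab` is a hypothetical helper that returns a Hash of columns instead of a DataFrame, and it omits the duplicate-pair validation.

```ruby
# Plain-Ruby sketch; crosstab is a hypothetical helper returning a Hash of columns.
def crosstab(rows, columns, values)
  raise ArgumentError, 'Three arrays should be equal size' unless
    rows.size == columns.size && rows.size == values.size

  row_index = rows.uniq
  # Every new column starts with a nil cell for each row label.
  data = Hash.new { |h, col| h[col] = row_index.to_h { |r| [r, nil] } }
  columns.zip(rows, values).each { |c, r, v| data[c][r] = v }

  # Keep only the cell values, then add the row labels as an :_id column.
  table = data.transform_values(&:values)
  table[:_id] = row_index
  table
end

table = crosstab(%w[a a b b], %w[a b a b], [0, 1, 1, 0])
# column 'a' holds [0, 1], column 'b' holds [1, 0], :_id holds ['a', 'b']
```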

.rows(source, opts = {}) ⇒ Object

Create DataFrame by specifying rows as an Array of Arrays or Array of DaruLite::Vector objects.

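Raises:

  • (SizeError) if the rows do not all have the same length

  • (ArgumentError) if source is neither an Array of Arrays nor an Array of DaruLite::Vector objects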



# File 'lib/daru_lite/dataframe.rb', line 50

def rows(source, opts = {})
  raise SizeError, 'All vectors must have same length' \
    unless source.all? { |v| v.size == source.first.size }

  opts[:order] ||= guess_order(source)

  if ArrayHelper.array_of?(source, Array) || source.empty?
    DataFrame.new(source.transpose, opts)
  elsif ArrayHelper.array_of?(source, Vector)
    from_vector_rows(source, opts)
  else
    raise ArgumentError, "Can't create DataFrame from #{source}"
  end
end
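
For the Array-of-Arrays case, the method checks that every row has the same length and then transposes rows into columns. A small plain-Ruby sketch of that idea, where `columns_from_rows` is a hypothetical helper and the result is a Hash rather than a DataFrame:

```ruby
# Hypothetical helper: turn an Array of row Arrays into a Hash of columns.
def columns_from_rows(rows, order)
  raise ArgumentError, 'All rows must have same length' unless
    rows.all? { |r| r.size == rows.first.size }

  # transpose flips rows into columns; zip pairs each column with its name.
  order.zip(rows.transpose).to_h
end

cols = columns_from_rows([[1, 6], [2, 7], [3, 8]], %i[a b])
# column :a holds [1, 2, 3] and column :b holds [6, 7, 8]
```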

Instance Method Details

#==(other) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 467

def ==(other)
  self.class == other.class   &&
    @size    == other.size    &&
    @index   == other.index   &&
    @vectors == other.vectors &&
    @vectors.to_a.all? { |v| self[v] == other[v] }
end

#add_level_to_vectors(top_level_label) ⇒ Object

Converts the vectors to a DaruLite::MultiIndex. The argument passed is used as the MultiIndex's top level.



# File 'lib/daru_lite/dataframe.rb', line 408

def add_level_to_vectors(top_level_label)
  tuples = vectors.map { |label| [top_level_label, *label] }
  self.vectors = DaruLite::MultiIndex.from_tuples(tuples)
end
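
The splat in `[top_level_label, *label]` is what lets the method handle both plain labels and labels that are already tuples. A short illustration in plain Ruby, with made-up label names:

```ruby
# A plain label is wrapped as-is; an Array label is expanded in place.
flat   = %i[a b].map { |label| [:top, *label] }
nested = [%i[x a], %i[x b]].map { |label| [:top, *label] }

# flat   is [[:top, :a], [:top, :b]]
# nested is [[:top, :x, :a], [:top, :x, :b]]
```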

#add_vectors_by_split(name, join = '-', sep = DaruLite::SPLIT_TOKEN) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 345

def add_vectors_by_split(name, join = '-', sep = DaruLite::SPLIT_TOKEN)
  self[name]
    .split_by_separator(sep)
    .each { |k, v| self[:"#{name}#{join}#{k}"] = v }
end

#add_vectors_by_split_recode(nm, join = '-', sep = DaruLite::SPLIT_TOKEN) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 413

def add_vectors_by_split_recode(nm, join = '-', sep = DaruLite::SPLIT_TOKEN)
  self[nm]
    .split_by_separator(sep)
    .each_with_index do |(k, v), i|
      v.rename "#{nm}:#{k}"
      self[:"#{nm}#{join}#{i + 1}"] = v
    end
end

#bootstrap(n = nil) ⇒ DaruLite::DataFrame

Creates a DataFrame of n rows drawn at random from this DataFrame. If n is not given, the original number of rows is used.

Returns:

  • (DaruLite::DataFrame)



# File 'lib/daru_lite/dataframe.rb', line 313

def bootstrap(n = nil)
  n ||= nrows
  DaruLite::DataFrame.new({}, order: @vectors).tap do |df_boot|
    n.times do
      df_boot.add_row(row[rand(n)])
    end
    df_boot.update
  end
end
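
Bootstrap resampling draws rows independently, so the same row may appear several times in the result. The sketch below shows the idea on an Array of row Hashes. `bootstrap_rows` is a hypothetical helper, and it draws positions below the number of source rows, which is an assumption about the intent (the listing above calls `rand(n)`).

```ruby
# Plain-Ruby sketch over an Array of row Hashes; bootstrap_rows is hypothetical.
def bootstrap_rows(rows, n = nil, rng: Random.new)
  n ||= rows.size
  # Draw n positions independently, so a row can appear more than once.
  Array.new(n) { rows[rng.rand(rows.size)] }
end

rows = [{ a: 1 }, { a: 2 }, { a: 3 }]
sample = bootstrap_rows(rows, 5, rng: Random.new(42))
sample.size # 5, even though only 3 distinct rows exist
```

Passing a seeded Random makes a resample reproducible, which is convenient in tests.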

#delete_at_position(position) ⇒ Object

Delete a row based on its position. More robust than #delete_row when working with a CategoricalIndex or when the Index includes integers.

Raises:

  • (IndexError)


# File 'lib/daru_lite/dataframe.rb', line 300

def delete_at_position(position)
  raise IndexError, "Position #{position} does not exist." unless position < size

  @index = @index.delete_at(position)
  each_vector { |vector| vector.delete_at_position(position) }

  set_size
end

#delete_row(index) ⇒ Object

Delete a row

Raises:

  • (IndexError)


# File 'lib/daru_lite/dataframe.rb', line 284

def delete_row(index)
  idx = named_index_for index

  raise IndexError, "Index #{index} does not exist." unless @index.include? idx

  @index = DaruLite::Index.new(@index.to_a - [idx])
  each_vector do |vector|
    vector.delete_at idx
  end

  set_size
end
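
The difference between deleting by label (#delete_row) and by position (#delete_at_position) shows up as soon as the row labels are integers that do not match their positions. A plain-Ruby illustration with made-up data:

```ruby
labels = [2, 0, 1]          # integer row labels, deliberately out of order
values = %w[x y z]

# By label: find where the label 0 lives, then drop that slot.
by_label = values.dup
by_label.delete_at(labels.index(0))   # removes 'y'

# By position: drop slot 0 regardless of its label.
by_position = values.dup
by_position.delete_at(0)              # removes 'x'
```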

#delete_vector(vector) ⇒ Object

Delete a vector

Raises:

  • (IndexError)


# File 'lib/daru_lite/dataframe.rb', line 267

def delete_vector(vector)
  raise IndexError, "Vector #{vector} does not exist." unless @vectors.include?(vector)

  @data.delete_at @vectors[vector]
  @vectors = DaruLite::Index.new @vectors.to_a - [vector]

  self
end

#delete_vectors(*vectors) ⇒ Object

Deletes a list of vectors



# File 'lib/daru_lite/dataframe.rb', line 277

def delete_vectors(*vectors)
  Array(vectors).each { |vec| delete_vector vec }

  self
end

#inspect(spacing = DaruLite.spacing, threshold = DaruLite.max_rows) ⇒ Object

Pretty print in a nice table format for the command line (irb/pry/iruby)



# File 'lib/daru_lite/dataframe.rb', line 450

def inspect(spacing = DaruLite.spacing, threshold = DaruLite.max_rows)
  name_part = @name ? ": #{@name} " : ''
  spacing = [
    headers.to_a.map { |header| header.try(:length) || header.to_s.length }.max,
    spacing
  ].max

  "#<#{self.class}#{name_part}(#{nrows}x#{ncols})>#{$INPUT_RECORD_SEPARATOR}" +
    Formatters::Table.format(
      each_row.lazy,
      row_headers: row_headers,
      headers: headers,
      threshold: threshold,
      spacing: spacing
    )
end

#interact_code(vector_names, full) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 512

def interact_code(vector_names, full)
  dfs = vector_names.zip(full).map do |vec_name, f|
    self[vec_name].contrast_code(full: f).each.to_a
  end

  all_vectors = recursive_product(dfs)
  DaruLite::DataFrame.new all_vectors,
                          order: all_vectors.map(&:name)
end

#ncols ⇒ Object

The number of vectors



# File 'lib/daru_lite/dataframe.rb', line 362

def ncols
  @vectors.size
end

#nest(*tree_keys, &block) ⇒ Object

Returns a nested hash keyed by the values of the given vectors. Each leaf holds an array of hashes containing the remaining values of the matching rows. If a block is given, it provides the leaf value instead; it receives the row, the innermost hash reached so far, and the key under which the value will be stored.



# File 'lib/daru_lite/dataframe.rb', line 327

def nest(*tree_keys, &block)
  tree_keys = tree_keys[0] if tree_keys[0].is_a? Array

  each_row.with_object({}) do |row, current|
    # Create tree
    *keys, last = tree_keys
    current = keys.inject(current) { |c, f| c[row[f]] ||= {} }
    name = row[last]

    if block
      current[name] = yield(row, current, name)
    else
      current[name] ||= []
      current[name].push(row.to_h.delete_if { |key, _value| tree_keys.include? key })
    end
  end
end
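
The block-less behaviour is easier to see on plain data. The sketch below applies the same walk-and-create logic to an Array of row Hashes; `nest_rows` is a hypothetical helper and the sample rows are made up.

```ruby
# nest_rows is a hypothetical helper that works on an Array of row Hashes.
def nest_rows(rows, *tree_keys)
  rows.each_with_object({}) do |row, tree|
    *keys, last = tree_keys
    # Walk (and create) one Hash level per leading key.
    leaf = keys.inject(tree) { |node, key| node[row[key]] ||= {} }
    # Store the remaining fields of the row under the last key's value.
    (leaf[row[last]] ||= []) << row.reject { |k, _| tree_keys.include?(k) }
  end
end

rows = [
  { country: 'FR', city: 'Paris', year: 2020, sales: 10 },
  { country: 'FR', city: 'Paris', year: 2021, sales: 12 },
  { country: 'FR', city: 'Lyon',  year: 2020, sales: 4 }
]
nested = nest_rows(rows, :country, :city)
nested['FR']['Lyon'] # an Array with one Hash: { year: 2020, sales: 4 }
```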

#nrows ⇒ Object

The number of rows



# File 'lib/daru_lite/dataframe.rb', line 357

def nrows
  @index.size
end

#rename(new_name) ⇒ Object Also known as: name=

Rename the DataFrame.



# File 'lib/daru_lite/dataframe.rb', line 432

def rename(new_name)
  @name = new_name
  self
end

#rename_vectors(name_map) ⇒ Object

Renames the vectors

Arguments

  • name_map - A hash where the keys are the existing vector names and the values are the new names. If a vector is renamed to a vector name that is already in use, the existing one is overwritten.

Usage

df = DaruLite::DataFrame.new({ a: [1,2,3,4], b: [:a,:b,:c,:d], c: [11,22,33,44] })
df.rename_vectors :a => :alpha, :c => :gamma
df.vectors.to_a #=> [:alpha, :b, :gamma]


# File 'lib/daru_lite/dataframe.rb', line 380

def rename_vectors(name_map)
  existing_targets = name_map.reject { |k, v| k == v }.values & vectors.to_a
  delete_vectors(*existing_targets)

  new_names = vectors.to_a.map { |v| name_map[v] || v }
  self.vectors = DaruLite::Index.new new_names
end
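
The overwrite rule described above (a rename onto a name already in use discards the old vector) can be sketched on a plain Hash of columns. `rename_columns` is a hypothetical helper and returns a new Hash instead of mutating anything.

```ruby
# rename_columns is a hypothetical helper operating on a Hash of columns.
def rename_columns(columns, name_map)
  # Targets that already exist (and are not no-op renames) are dropped first.
  clashes = name_map.reject { |k, v| k == v }.values & columns.keys
  kept = columns.reject { |name, _| clashes.include?(name) }
  kept.transform_keys { |name| name_map[name] || name }
end

cols = { a: [1, 2], b: [3, 4], c: [5, 6] }
plain = rename_columns(cols, { a: :alpha })  # keys become :alpha, :b, :c
clash = rename_columns(cols, { a: :b })      # old :b is discarded; keys become :b, :c
```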

#rename_vectors!(name_map) ⇒ Object

Renames the vectors and returns itself

Arguments

  • name_map - A hash where the keys are the existing vector names and the values are the new names. If a vector is renamed to a vector name that is already in use, the existing one is overwritten.

Usage

df = DaruLite::DataFrame.new({ a: [1,2,3,4], b: [:a,:b,:c,:d], c: [11,22,33,44] })
df.rename_vectors! :a => :alpha, :c => :gamma # df


# File 'lib/daru_lite/dataframe.rb', line 401

def rename_vectors!(name_map)
  rename_vectors(name_map)
  self
end

#respond_to_missing?(name, include_private = false) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/daru_lite/dataframe.rb', line 508

def respond_to_missing?(name, include_private = false)
  name.to_s.end_with?('=') || has_vector?(name) || super
end

#row ⇒ Object

Access a row or set/create a row. Refer to the #[] and #[]= docs for details.

Usage

df.row[:a] # access row named ':a'
df.row[:b] = [1,2,3] # set row ':b' to [1,2,3]


# File 'lib/daru_lite/dataframe.rb', line 262

def row
  DaruLite::Accessors::DataFrameByRow.new(self)
end

#shape ⇒ Object

Return the number of rows and columns of the DataFrame in an Array.



# File 'lib/daru_lite/dataframe.rb', line 352

def shape
  [nrows, ncols]
end

#to_category(*names) ⇒ DaruLite::DataFrame

Converts the specified non-category vectors to category type vectors.

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: ['a', 'a', 'b']
})
df.to_category :b
df[:b].type
# => :category

Parameters:

  • names (Array)

    of non category type vectors to be converted

Returns:

  • (DaruLite::DataFrame)

    data frame in which specified vectors have been converted to category type



# File 'lib/daru_lite/dataframe.rb', line 487

def to_category(*names)
  names.each { |n| self[n] = self[n].to_category }
  self
end

#transpose ⇒ Object

Transpose a DataFrame, transposing elements and swapping row and column indexing.



# File 'lib/daru_lite/dataframe.rb', line 439

def transpose
  DaruLite::DataFrame.new(
    each_vector.map(&:to_a).transpose,
    index: @vectors,
    order: @index,
    dtype: @dtype,
    name: @name
  )
end
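
In other words, each old row becomes a column named after its row label, and the old vector names become the new row labels. A plain-Ruby sketch with made-up data:

```ruby
columns = { a: [1, 2, 3], b: [4, 5, 6] }   # vector name => values
index   = %i[x y z]                         # row labels

# Each old row becomes a column named after its row label.
transposed = index.zip(columns.values.transpose).to_h
new_index  = columns.keys                   # old vector names label the rows
```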

#update ⇒ Object

Updates the metadata (i.e. missing value positions) of the DataFrame's vectors after assignment, deletion and similar operations are complete. This is provided so that time is not wasted recreating the metadata each time an element is assigned or deleted. Updating data this way is called lazy loading. To set or unset lazy loading, see the DaruLite.lazy_update= method.



# File 'lib/daru_lite/dataframe.rb', line 427

def update
  @data.each(&:update) if DaruLite.lazy_update
end
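
The trade-off behind lazy updating is that cached metadata goes stale until update is called. The sketch below illustrates this with a hypothetical `LazyColumn` class; it is not how DaruLite::Vector is implemented, only a model of the idea.

```ruby
# LazyColumn is a hypothetical stand-in for a vector with cached metadata.
class LazyColumn
  attr_reader :missing_positions

  def initialize(values)
    @values = values
    update
  end

  # Assignment deliberately skips the metadata refresh.
  def []=(position, value)
    @values[position] = value
  end

  # Recompute the cached positions of missing (nil) values in one pass.
  def update
    @missing_positions = @values.each_index.select { |i| @values[i].nil? }
  end
end

col = LazyColumn.new([1, nil, 3])
col[1] = 2
col[2] = nil
stale = col.missing_positions   # still [1], cached before the writes
col.update
fresh = col.missing_positions   # now [2]
```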

#which ⇒ Object

A simple query DSL for accessing where(), inspired by the gem "squeel". For example, df.which{ `FamilySize` == `FamilySize`.max } equals df.where( df['FamilySize'].eq( df['FamilySize'].max ) )

Likewise, df.which{ (`NameTitle` == 'Dr') & (`Sex` == 'female') } equals df.where( df['NameTitle'].eq('Dr') & df['Sex'].eq('female') )



# File 'lib/daru_lite/extensions/which_dsl.rb', line 15

def which(&)
  WhichQuery.new(self, &).exec
end
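
The backtick syntax works because Ruby dispatches a backtick literal to a method named ` on self, and the block is evaluated against a query object that defines that method. The following is a hypothetical miniature of the technique; `MiniColumn` and `MiniWhich` are invented for this sketch and only support equality against a scalar.

```ruby
# Hypothetical miniature of the idea; none of these classes are DaruLite's.
class MiniColumn
  def initialize(values)
    @values = values
  end

  def max
    @values.max
  end

  # Comparing a column with a scalar yields one boolean per row.
  def ==(other)
    @values.map { |v| v == other }
  end
end

class MiniWhich
  def initialize(rows, &block)
    @rows = rows
    @block = block
  end

  # Inside the block, a backtick literal calls this method
  # instead of running a shell command.
  def `(name)
    MiniColumn.new(@rows.map { |row| row[name] })
  end

  def exec
    # instance_eval makes this object self while the block runs.
    mask = instance_eval(&@block)
    @rows.select.with_index { |_row, i| mask[i] }
  end
end

rows = [
  { 'Name' => 'Ann', 'FamilySize' => 3 },
  { 'Name' => 'Bob', 'FamilySize' => 5 }
]
largest = MiniWhich.new(rows) { `FamilySize` == `FamilySize`.max }.exec
# largest contains only Bob's row
```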