Class: Exwiw::MongodbParallelPlan

Inherits:
Object
  • Object
Defined in:
lib/exwiw/mongodb_parallel_plan.rb

Overview

Classifies a MongoDB dump’s collections into the three dependency groups the inter-collection fork schedule needs, plus the derived adjacency that schedule consumes. See docs/mongodb-dump-parallelism-2x-notes.md for the why; this class is the static, config-derived half of that plan.

It is a pure function of the loaded configs and the dump target — no DB access — so it can be computed once up front and unit-tested without a live MongoDB. The fork orchestration (worker pools, LPT bin-packing on output-size weights, @state Marshal sidecars, the Phase-2 cascade loop) lives elsewhere and consumes the structures produced here.

Input contract: `configs` are MongodbCollectionConfig objects already passed through `#reject_ignored_members!` (exactly as Runner#load_table_config produces them), so every surviving belongs_to has a non-nil `table_name`. ignore:true collections are still present in `configs` (they contribute to the schema and to the file-index ordering, but their data extraction is skipped) and are therefore excluded from the three processing groups.

The three groups partition the extractable collections exactly:

  • genuine: can reach the dump target by following belongs_to edges (the scoped DAG). Includes the target itself.

  • leaf: has no belongs_to at all. Reference/master data dumped in full, with no input dependencies (embarrassingly parallel).

  • ref_bt: has belongs_to but CANNOT reach the target. Reference data scoped by the adapter's strict-AND fallback. Its internal edges form shallow components.

`reachable` mirrors MongodbAdapter#genuine_scope_set exactly (a fixpoint over all non-embedded configs, including ignore:true ones), so the genuine set here matches the adapter's runtime scoping classification.
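The classification described above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the code in lib/exwiw/mongodb_parallel_plan.rb: `Cfg` is a hypothetical stand-in for MongodbCollectionConfig that carries just a name, the belongs_to parent names, and the ignore flag, and `reachable_set` and `classify` are hypothetical helper names.

```ruby
require "set"

# Hypothetical stand-in for MongodbCollectionConfig (assumption: only these
# three fields matter for the classification).
Cfg = Struct.new(:name, :parents, :ignore, keyword_init: true)

# Fixpoint: start from the target and keep adding every collection that has a
# belongs_to edge into the set, until a pass adds nothing new.
def reachable_set(configs, target)
  reachable = Set[target]
  loop do
    added = configs.select do |c|
      !reachable.include?(c.name) && c.parents.any? { |p| reachable.include?(p) }
    end
    break if added.empty?
    added.each { |c| reachable << c.name }
  end
  reachable
end

# Partition the extractable (non-ignored) collections into the three groups.
def classify(configs, target)
  reachable = reachable_set(configs, target) # ignore:true configs take part here
  groups = { genuine: [], leaf: [], ref_bt: [] }
  configs.reject(&:ignore).each do |c|
    key = if reachable.include?(c.name) then :genuine
          elsif c.parents.empty? then :leaf
          else :ref_bt
          end
    groups[key] << c.name
  end
  groups
end

configs = [
  Cfg.new(name: "shops", parents: [], ignore: false),
  Cfg.new(name: "orders", parents: ["shops"], ignore: false),
  Cfg.new(name: "items", parents: ["orders", "currencies"], ignore: false),
  Cfg.new(name: "currencies", parents: [], ignore: false),
  Cfg.new(name: "countries", parents: [], ignore: false),
  Cfg.new(name: "cities", parents: ["countries"], ignore: false),
]
p classify(configs, "shops")
```

With "shops" as the target, shops, orders and items come out as genuine, currencies and countries as leaves, and cities as ref_bt. The reachability check runs before the leaf check, which is why the target counts as genuine even though it has no belongs_to of its own.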

Constructor Details

#initialize(configs:, target_table_name:, logger: nil) ⇒ MongodbParallelPlan

Returns a new instance of MongodbParallelPlan.

Parameters:

  • configs (Array<MongodbCollectionConfig>)

    already passed through #reject_ignored_members!

  • target_table_name (String)

    the dump target collection

  • logger (Logger, nil) (defaults to: nil)

    forwarded to DetermineTableProcessingOrder



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 44

def initialize(configs:, target_table_name:, logger: nil)
  @by = configs.each_with_object({}) { |c, h| h[c.name] = c }
  @target_table_name = target_table_name

  dumpable = configs.reject(&:embedded?)
  # The file index (insert-NNN-) is taken over the FULL processing order,
  # including ignore:true collections, so the orchestrated run's filenames
  # are byte-identical to the serial Runner's (which numbers files the same
  # way). Data extraction, however, skips ignore:true — see #extractable.
  @ordered_all = DetermineTableProcessingOrder.run(dumpable, logger: logger).freeze
  @index_of = @ordered_all.each_with_index.to_h.freeze
  @extractable = @ordered_all.reject { |n| @by[n].ignore }.freeze

  @reachable = compute_reachable
  classify
  derive_consumed_leaves
  derive_cascade_adjacency
  @reference_components = compute_reference_components.freeze
end

Instance Attribute Details

#consumed_leaves ⇒ Object (readonly)

Leaf collections referenced (via belongs_to) by some non-leaf extractable collection (genuine OR ref_bt). These are the only leaves whose captured @state must be handed back (e.g. as a Marshal sidecar). Set<String>.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 97

def consumed_leaves
  @consumed_leaves
end

#direct_leaf_genuine ⇒ Object (readonly)

genuine collections that directly reference a leaf — the only genuine collections whose output can change once leaf @state is present (and only at runtime, when their genuine anchor turns out empty and they fall back to the leaf clause). These seed the Phase-2 cascade reprocess.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 103

def direct_leaf_genuine
  @direct_leaf_genuine
end
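Both #consumed_leaves and #direct_leaf_genuine come from the same question: which non-leaf collections point at a leaf? A minimal sketch, assuming the belongs_to edges are available as a plain `name => parent names` Hash; `leaf_links` and its keyword arguments are hypothetical names for illustration, not part of the class.

```ruby
require "set"

# parents_of: name => belongs_to parent names (hypothetical plain-Hash input).
def leaf_links(parents_of, genuine:, leaves:, ref_bt:)
  leaf_set = leaves.to_set
  genuine_set = genuine.to_set
  consumed = Set.new # leaves referenced by any non-leaf extractable collection
  direct = Set.new   # genuine collections that reference a leaf directly
  (genuine + ref_bt).each do |name|
    used = parents_of.fetch(name, []).select { |p| leaf_set.include?(p) }
    next if used.empty?
    consumed.merge(used)
    direct << name if genuine_set.include?(name)
  end
  { consumed_leaves: consumed, direct_leaf_genuine: direct }
end

links = leaf_links(
  { "orders" => ["shops"], "items" => ["orders", "currencies"], "cities" => ["countries"] },
  genuine: %w[shops orders items],
  leaves: %w[currencies countries timezones],
  ref_bt: %w[cities]
)
p links[:consumed_leaves].to_a     # prints ["currencies", "countries"]
p links[:direct_leaf_genuine].to_a # prints ["items"]
```

Here "timezones" is a leaf that nothing references, so it is not consumed and its state never needs to be handed back.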

#extractable ⇒ Object (readonly)

#ordered_all minus ignore:true collections — the collections whose data is actually extracted. Union of the three groups below.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 73

def extractable
  @extractable
end

#genuine ⇒ Object (readonly)

genuine — reachable to the dump target (includes the target).



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 78

def genuine
  @genuine
end

#genuine_children ⇒ Object (readonly)

name => genuine children (genuine collections that belongs_to it), keyed only by reachable parents. Drives the Phase-2 cascade: when a reprocessed collection’s row count changes, its genuine children are re-enqueued.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 108

def genuine_children
  @genuine_children
end
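The adjacency and the cascade it drives can be sketched as a small worklist loop. This is an illustration only: the real Phase-2 loop lives in the fork orchestration, and `build_genuine_children`, `cascade`, and the `reprocess` block (assumed to return true when a collection's row count changed) are hypothetical names. The sketch also assumes the genuine belongs_to graph is acyclic, which is what guarantees termination.

```ruby
require "set"

# parent name => genuine collections that belongs_to it, keyed only by
# reachable parents (leaf parents are not reachable, so they get no key).
def build_genuine_children(parents_of, genuine, reachable)
  children = {}
  genuine.each do |child|
    parents_of.fetch(child, []).each do |parent|
      (children[parent] ||= []) << child if reachable.include?(parent)
    end
  end
  children
end

# Worklist cascade: reprocess each queued collection and, when its row count
# changed, re-enqueue its genuine children.
def cascade(seeds, genuine_children)
  queue = seeds.dup
  processed = []
  until queue.empty?
    name = queue.shift
    processed << name
    next unless yield(name) # hypothetical: true means "row count changed"
    genuine_children.fetch(name, []).each { |c| queue << c unless queue.include?(c) }
  end
  processed
end

parents_of = { "orders" => ["shops"], "items" => ["orders", "currencies"], "refunds" => ["items"] }
genuine = %w[shops orders items refunds]
children = build_genuine_children(parents_of, genuine, genuine.to_set)
p cascade(["items"], children) { |name| name == "items" } # prints ["items", "refunds"]
```

Seeding the queue with #direct_leaf_genuine keeps the cascade limited to collections whose output can actually differ once leaf @state is present.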

#index_of ⇒ Object (readonly)

name => 0-based position in #ordered_all (the file index is position + 1).



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 69

def index_of
  @index_of
end
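As a small illustration of "position + 1", the snippet below turns a 0-based #index_of lookup into an insert-NNN- prefix. The three-digit zero padding is inferred from the NNN placeholder and is an assumption, the rest of the filename is not specified on this page, and `file_prefix` is a hypothetical helper.

```ruby
# Hypothetical processing order; ignore:true collections keep their slot, so
# skipping their data never shifts the numbers of the collections after them.
ordered_all = %w[shops currencies orders items].freeze
index_of = ordered_all.each_with_index.to_h

# File index = 0-based position + 1, zero-padded to three digits (assumed width).
def file_prefix(index_of, name)
  format("insert-%03d-", index_of.fetch(name) + 1)
end

puts file_prefix(index_of, "orders") # prints insert-003-
```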

#leaves ⇒ Object (readonly)

leaf — no belongs_to; reference/master data with no input dependencies.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 81

def leaves
  @leaves
end

#ordered_all ⇒ Object (readonly)

Full processing order, INCLUDING ignore:true collections — the sequence the file index (insert-NNN-) is numbered over.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 66

def ordered_all
  @ordered_all
end

#reachable ⇒ Object (readonly)

The set of collection names genuinely scoped by the target (the target plus everything that can reach it through belongs_to). Exposed for inspection.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 112

def reachable
  @reachable
end

#ref_bt ⇒ Object (readonly)

ref_bt — has belongs_to but not reachable to the target.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 84

def ref_bt
  @ref_bt
end

#reference_components ⇒ Object (readonly)

ref_bt collections as dependency-closed weakly-connected components over intra-ref_bt belongs_to edges, each returned in a valid topological order (a parent before its child). A whole component can be processed serially by one worker with no cross-worker @state IPC and no level barriers, seeded only with the leaf @state its members reference.



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 91

def reference_components
  @reference_components
end
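One way to build such components is to group ref_bt collections by weak connectivity over the intra-ref_bt edges and then order each group with Ruby's standard TSort. The sketch below uses that approach for illustration; it is not necessarily how compute_reference_components works, and the plain-Hash `parents_of` input and the method name are assumptions.

```ruby
require "set"
require "tsort"

def reference_components_sketch(ref_bt, parents_of)
  members = ref_bt.to_set
  # Keep only belongs_to edges whose both ends are ref_bt collections.
  intra = ref_bt.to_h { |n| [n, parents_of.fetch(n, []).select { |p| members.include?(p) }] }

  # Undirected adjacency, used only to find weakly-connected components.
  neighbours = Hash.new { |h, k| h[k] = [] }
  intra.each do |child, parents|
    parents.each do |parent|
      neighbours[child] << parent
      neighbours[parent] << child
    end
  end

  seen = Set.new
  ref_bt.each_with_object([]) do |start, components|
    next if seen.include?(start)
    seen << start
    component = []
    stack = [start]
    until stack.empty?
      node = stack.pop
      component << node
      neighbours[node].each { |m| stack << m if seen.add?(m) }
    end
    # TSort emits a node after its "children"; feeding it the parents as
    # children therefore yields parent-before-child order.
    each_node = ->(&b) { component.each(&b) }
    each_child = ->(n, &b) { intra[n].each(&b) }
    components << TSort.tsort(each_node, each_child)
  end
end

parents_of = { "cities" => ["countries"], "districts" => ["cities"], "tag_links" => ["tags"] }
p reference_components_sketch(%w[districts cities tag_links], parents_of)
# prints [["cities", "districts"], ["tag_links"]]
```

"countries" and "tags" are leaves in this example, so the edges into them are dropped and only the cities/districts edge links two ref_bt collections. TSort.tsort raises TSort::Cyclic on a cycle, so the sketch assumes the belongs_to graph is acyclic.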

Instance Method Details

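#summary ⇒ Object

Returns a Hash of group sizes: counts for extractable, genuine, leaves, ref_bt, consumed_leaves and direct_leaf_genuine, plus the reference component sizes sorted largest first.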



# File 'lib/exwiw/mongodb_parallel_plan.rb', line 114

def summary
  {
    extractable: @extractable.size,
    genuine: @genuine.size,
    leaves: @leaves.size,
    ref_bt: @ref_bt.size,
    consumed_leaves: @consumed_leaves.size,
    direct_leaf_genuine: @direct_leaf_genuine.size,
    reference_components: @reference_components.map(&:size).sort.reverse,
  }
end