Class: Ucode::Coordinator

Inherits:
Object
Defined in:
lib/ucode/coordinator.rb,
lib/ucode/coordinator/indices.rb

Overview

Orchestrates the UCD + Unihan parsers and produces per-codepoint CodePoint records for a downstream sink (a writer, an aggregator, a database builder).

**Streaming architecture**:

1. Indices pass — load every range/point file into memory, keyed
   by codepoint (hash) or sorted by `range_first` (bsearch).
   Peak memory is about 10 MB of indices, not ~160 k CodePoint objects held at once.

2. Stream pass — `UnicodeData.each_record` drives the main loop.
   For each yielded CodePoint, the Coordinator merges in data from
   the indices, then yields to the sink. CodePoints are GC'd
   after the sink processes them.
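
The two lookup styles named in the indices pass can be sketched as follows. This is a minimal illustration, not the library's actual `Indices` implementation: `RangeEntry`, `POINT_INDEX`, `RANGE_INDEX`, and `lookup_range` are hypothetical names, and the sample data is invented for the example.

```ruby
# Hypothetical stand-ins; the real Coordinator::Indices may be organised differently.
RangeEntry = Struct.new(:range_first, :range_last, :value)

# Point index: a plain hash keyed by codepoint gives O(1) lookups.
POINT_INDEX = {
  0x0041 => "LATIN CAPITAL LETTER A",
}.freeze

# Range index: entries sorted by range_first and non-overlapping,
# so Array#bsearch can locate the covering range in O(log n).
RANGE_INDEX = [
  RangeEntry.new(0x0000, 0x007F, "Basic Latin"),
  RangeEntry.new(0x0080, 0x00FF, "Latin-1 Supplement"),
  RangeEntry.new(0x0400, 0x04FF, "Cyrillic"),
].freeze

# Returns the entry whose range covers `codepoint`, or nil if none does.
def lookup_range(index, codepoint)
  # Find-minimum mode: the first entry whose range_last reaches the codepoint.
  entry = index.bsearch { |e| e.range_last >= codepoint }
  # The candidate may start after the codepoint (a gap between ranges).
  entry if entry && entry.range_first <= codepoint
end

block = lookup_range(RANGE_INDEX, 0x0041)
name  = POINT_INDEX[0x0041]
gap   = lookup_range(RANGE_INDEX, 0x0100) # falls between Latin-1 and Cyrillic
```

In the stream pass, a lookup like this would run once per yielded CodePoint, so only the indices stay resident while individual records can be collected after the sink is done with them.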

Every data file is OPTIONAL — if a file is missing (partial fetch, incremental run), the corresponding indices stay empty and the matching CodePoint fields stay at their defaults. This makes the Coordinator resilient against partial fixtures and lets users run subsets.
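
One way to get the "missing file means empty index" behaviour is to guard the loader itself. The sketch below assumes a semicolon-separated, `#`-commented line format similar to UCD data files; `load_point_index` is a hypothetical helper, not a method of this class.

```ruby
# Hypothetical loader: returns an empty index when the file is absent,
# so downstream lookups simply miss and fields keep their defaults.
def load_point_index(path)
  return {} unless File.exist?(path)

  File.foreach(path).each_with_object({}) do |line, index|
    # Drop trailing comments and skip blank lines.
    line = line.sub(/#.*/, "").strip
    next if line.empty?

    codepoint, value = line.split(";").map(&:strip)
    index[codepoint.to_i(16)] = value
  end
end
```

Because the caller always receives a hash, the merge step needs no special case for partial fetches or subset runs.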

Defined Under Namespace

Classes: Indices

Constructor Details

#initialize(config = Ucode.configuration) ⇒ Coordinator

Returns a new instance of Coordinator.



# File 'lib/ucode/coordinator.rb', line 36

def initialize(config = Ucode.configuration)
  @config = config
end

Instance Attribute Details

#config ⇒ Object (readonly)

Returns the value of attribute config.



# File 'lib/ucode/coordinator.rb', line 34

def config
  @config
end

Instance Method Details

#build(ucd_dir:, unihan_dir:, &block) ⇒ Object

Stream-driven build. Calls `block` once per assigned codepoint.



# File 'lib/ucode/coordinator.rb', line 41

def build(ucd_dir:, unihan_dir:, &block)
  each_codepoint(ucd_dir: ucd_dir, unihan_dir: unihan_dir, &block)
end
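
As a rough illustration of the sink contract, the sketch below passes a sink object's method as the block. `FakeCoordinator`, `CategoryCounter`, and the hash-shaped records are all hypothetical stand-ins so the example runs without the library; only the `build(ucd_dir:, unihan_dir:, &block)` signature is taken from the listing above.

```ruby
# Hypothetical stand-in for Ucode::Coordinator, reduced to the build contract:
# call the block once per record.
class FakeCoordinator
  def initialize(records)
    @records = records
  end

  def build(ucd_dir:, unihan_dir:, &block)
    @records.each(&block)
    nil
  end
end

# Hypothetical sink that counts records per general category.
class CategoryCounter
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def write(record)
    @counts[record[:category]] += 1
  end
end

records = [
  { codepoint: 0x41, category: "Lu" },
  { codepoint: 0x61, category: "Ll" },
  { codepoint: 0x42, category: "Lu" },
]

sink = CategoryCounter.new
coordinator = FakeCoordinator.new(records)
# Method#to_proc lets the sink's method serve as the per-record block.
coordinator.build(ucd_dir: "ucd", unihan_dir: "unihan", &sink.method(:write))
```

The sink keeps only its running totals, which fits the streaming design: no record has to outlive its own callback.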

#each_codepoint(ucd_dir:, unihan_dir:) ⇒ Object

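Yields one enriched CodePoint per assigned codepoint. Returns an Enumerator (a plain one from `enum_for`, not an `Enumerator::Lazy`) when called without a block; the indices are built only once iteration starts.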



# File 'lib/ucode/coordinator.rb', line 47

def each_codepoint(ucd_dir:, unihan_dir:)
  return enum_for(:each_codepoint, ucd_dir: ucd_dir, unihan_dir: unihan_dir) unless block_given?

  indices = build_indices(ucd_dir, unihan_dir)
  each_with_indices(ucd_dir: ucd_dir, unihan_dir: unihan_dir, indices: indices) do |cp|
    yield cp
  end

  nil
end
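
The `enum_for` guard used above defers all work until the Enumerator is actually iterated. The toy class below is a sketch of that pattern only; `NumberSource`, `each_item`, and the `setup_calls` counter are hypothetical, with the counter standing in for the cost of building indices.

```ruby
class NumberSource
  attr_reader :setup_calls

  def initialize
    @setup_calls = 0
  end

  def each_item(limit:)
    # Without a block, hand back an Enumerator that re-enters this method later.
    return enum_for(:each_item, limit: limit) unless block_given?

    @setup_calls += 1 # stands in for an expensive setup step
    1.upto(limit) { |n| yield n }
    nil
  end
end

source = NumberSource.new
enum = source.each_item(limit: 5)
calls_before = source.setup_calls # setup has not run yet

first_two = enum.first(2)         # iteration triggers setup, then stops early
calls_after = source.setup_calls
```

Note that each fresh iteration of the Enumerator re-enters the method, so the setup step would run again on every pass.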

#each_codepoint_with_indices(ucd_dir:, unihan_dir:) ⇒ Object

Like #each_codepoint but yields `(indices, cp)` so callers that need the indices for a post-stream flush (e.g. ParseCommand) can reuse them instead of re-building. Returns an Enumerator when no block is given.



# File 'lib/ucode/coordinator.rb', line 62

def each_codepoint_with_indices(ucd_dir:, unihan_dir:)
  unless block_given?
    return enum_for(:each_codepoint_with_indices, ucd_dir: ucd_dir, unihan_dir: unihan_dir)
  end

  indices = build_indices(ucd_dir, unihan_dir)
  each_with_indices(ucd_dir: ucd_dir, unihan_dir: unihan_dir, indices: indices) do |cp|
    yield indices, cp
  end

  nil
end
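
A caller that needs the indices after the stream ends can capture them from the block arguments. The sketch below assumes a stub with the same method signature; `StubCoordinator`, the hash-shaped indices, and the `summary` structure are hypothetical and only illustrate the `(indices, cp)` yield order.

```ruby
# Hypothetical stub mirroring the (indices, cp) yield contract.
class StubCoordinator
  def initialize(indices, codepoints)
    @indices = indices
    @codepoints = codepoints
  end

  def each_codepoint_with_indices(ucd_dir:, unihan_dir:)
    unless block_given?
      return enum_for(:each_codepoint_with_indices, ucd_dir: ucd_dir, unihan_dir: unihan_dir)
    end

    # The same indices object is passed alongside every record.
    @codepoints.each { |cp| yield @indices, cp }
    nil
  end
end

indices = { blocks: ["Basic Latin"] }
coordinator = StubCoordinator.new(indices, [0x41, 0x42])

seen = []
captured = nil
coordinator.each_codepoint_with_indices(ucd_dir: "ucd", unihan_dir: "unihan") do |idx, cp|
  captured = idx # keep a reference for the post-stream step
  seen << cp
end

# Post-stream flush: reuse the captured indices instead of rebuilding them.
summary = { codepoints: seen.size, blocks: captured[:blocks] }
```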

#indices_for(ucd_dir:, unihan_dir:) ⇒ Object

Build (and return) the Coordinator::Indices for the given UCD + Unihan dirs. Useful when the caller needs the indices separately from the streaming pass (e.g. AggregateWriter#flush).



# File 'lib/ucode/coordinator.rb', line 78

def indices_for(ucd_dir:, unihan_dir:)
  build_indices(ucd_dir, unihan_dir)
end