Class: Rpdfium::Util::WordMerger
- Inherits: Object
  - Object
    - Rpdfium::Util::WordMerger
- Defined in: lib/rpdfium/util/word_merger.rb
Overview
Merges adjacent words on the same row into a single word with an aggregated bbox and concatenated text.
Three strategies are available as separate methods:
- `merge_by_proximity`: merges all adjacent words that satisfy the proximity criterion. This is the base strategy.
- `merge_by_label`: merges only words that share the same "label" (an external key computed by the caller). Useful for preserving semantics when different labels fall on the same row (e.g. flags in adjacent columns).
- `merge_unlabeled`: merges only "orphan" words (those whose label is nil) and leaves labeled words intact. The counterpart of `merge_by_label`.
All three return a new list of words; each merged word is a hash of the form `{ text:, x0:, x1:, top:, bottom: }`.
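The private helper that performs the merge (`merge_groups`) is not shown on this page, so the following is only a minimal sketch of how a proximity merge of this kind could work, not the library's implementation. It assumes that input words are hashes with the same keys as the merged hash, that they arrive in reading order, and that texts are joined with a single space. The names `merge_sketch`, `adjacent?`, and `merge_pair` are hypothetical.

```ruby
# Hypothetical stand-in for the library's private merge loop.
X_GAP = 20.0 # mirrors DEFAULT_X_GAP
Y_TOL = 3.0  # mirrors DEFAULT_Y_TOL

# True when b is on the same row as a (tops within Y_TOL) and starts
# no more than X_GAP to the right of a's right edge.
def adjacent?(a, b)
  (a[:top] - b[:top]).abs <= Y_TOL && (b[:x0] - a[:x1]) <= X_GAP
end

# Concatenate the texts (assumed separator: one space) and take the
# union of the two bounding boxes.
def merge_pair(a, b)
  {
    text: "#{a[:text]} #{b[:text]}",
    x0: [a[:x0], b[:x0]].min,
    x1: [a[:x1], b[:x1]].max,
    top: [a[:top], b[:top]].min,
    bottom: [a[:bottom], b[:bottom]].max
  }
end

# Walk the words in order; fold each word into the previous one when the
# pair is adjacent and the caller-supplied predicate allows it.
def merge_sketch(words, &allow)
  words.each_with_object([]) do |word, out|
    if out.any? && adjacent?(out.last, word) && allow.call(out.last, word)
      out[-1] = merge_pair(out.last, word)
    else
      out << word
    end
  end
end

words = [
  { text: "Total", x0: 10.0,  x1: 40.0,  top: 100.0, bottom: 110.0 },
  { text: "due",   x0: 45.0,  x1: 60.0,  top: 101.0, bottom: 111.0 },
  { text: "EUR",   x0: 200.0, x1: 220.0, top: 100.0, bottom: 110.0 },
  { text: "Page",  x0: 10.0,  x1: 35.0,  top: 130.0, bottom: 140.0 }
]

# An always-true predicate corresponds to the proximity-only strategy.
merged = merge_sketch(words) { |_a, _b| true }
puts merged.map { |w| w[:text] }.inspect # ["Total due", "EUR", "Page"]
```

"Total" and "due" merge because their tops differ by 1.0 and the gap between them is 5.0. "EUR" stays separate because its gap is 140.0, and "Page" stays separate because it sits on a different row.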
Constant Summary
- DEFAULT_X_GAP = 20.0
- DEFAULT_Y_TOL = 3.0
Instance Method Summary
- #initialize(x_gap: DEFAULT_X_GAP, y_tol: DEFAULT_Y_TOL) ⇒ WordMerger (constructor)
  A new instance of WordMerger.
- #merge_by_label(words, labels_by_word) ⇒ Object
  Merges only words with the same label.
- #merge_by_proximity(words) ⇒ Object
  Merges all adjacent words (same row + horizontal gap ≤ x_gap).
- #merge_unlabeled(words, labels_by_word) ⇒ Object
  Merges only words with a nil label (orphans).
Constructor Details
#initialize(x_gap: DEFAULT_X_GAP, y_tol: DEFAULT_Y_TOL) ⇒ WordMerger
Returns a new instance of WordMerger.
    # File 'lib/rpdfium/util/word_merger.rb', line 35
    def initialize(x_gap: DEFAULT_X_GAP, y_tol: DEFAULT_Y_TOL)
      @x_gap = x_gap
      @y_tol = y_tol
    end
Instance Method Details
#merge_by_label(words, labels_by_word) ⇒ Object
Merges only words with the same label.
    # File 'lib/rpdfium/util/word_merger.rb', line 49
    def merge_by_label(words, labels_by_word)
      merge_groups(words) do |a, b|
        labels_by_word[a] == labels_by_word[b]
      end
    end
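The predicate above looks words up in `labels_by_word`, so the caller has to build that mapping first. The sketch below illustrates one plausible shape, assuming words are plain hashes and `labels_by_word` is a Ruby Hash keyed by those word hashes. The sample words and labels are hypothetical, and the lambda only mirrors the block in the source; it does not call the library.

```ruby
# Hypothetical words on one row; the field names follow the merged-word hash.
amount = { text: "120", x0: 10.0, x1: 30.0, top: 50.0, bottom: 60.0 }
unit   = { text: "EUR", x0: 35.0, x1: 55.0, top: 50.0, bottom: 60.0 }
flag   = { text: "X",   x0: 60.0, x1: 66.0, top: 50.0, bottom: 60.0 }
note   = { text: "n/a", x0: 70.0, x1: 85.0, top: 50.0, bottom: 60.0 }
extra  = { text: "(*)", x0: 88.0, x1: 98.0, top: 50.0, bottom: 60.0 }

# Caller-computed labels; words without an entry look up as nil.
labels_by_word = { amount => :price, unit => :price, flag => :paid }

# Same comparison as the block passed to merge_groups in the source.
same_label = ->(a, b) { labels_by_word[a] == labels_by_word[b] }

puts same_label.call(amount, unit) # true: both are :price
puts same_label.call(unit, flag)   # false: :price vs :paid
puts same_label.call(note, extra)  # true: nil == nil
```

Two consequences follow from ordinary Ruby semantics. First, two words that both lack a label compare equal (nil == nil), so this predicate also lets unlabeled neighbours merge with each other. Second, a Hash used as a key is hashed by content, so mutating a word after it has been inserted into `labels_by_word` can make its lookup fail; building the mapping from the final word hashes, or using `compare_by_identity`, avoids that.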
#merge_by_proximity(words) ⇒ Object
Merges all adjacent words (same row + horizontal gap ≤ x_gap).
    # File 'lib/rpdfium/util/word_merger.rb', line 41
    def merge_by_proximity(words)
      merge_groups(words) { |a, b| true }
    end
#merge_unlabeled(words, labels_by_word) ⇒ Object
Merges only words with a nil label (orphans).
    # File 'lib/rpdfium/util/word_merger.rb', line 56
    def merge_unlabeled(words, labels_by_word)
      merge_groups(words) do |a, b|
        labels_by_word[a].nil? && labels_by_word[b].nil?
      end
    end
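To see which pairs this predicate lets through, the sketch below evaluates the same expression on a few hypothetical words. It assumes `labels_by_word` is a Hash keyed by word hashes and does not call the library; the sample data and the lambda name `both_unlabeled` are illustrative only.

```ruby
# Hypothetical words: two labeled flags and two orphans.
paid   = { text: "PAID", x0: 10.0, x1: 40.0,  top: 20.0, bottom: 30.0 }
sent   = { text: "SENT", x0: 45.0, x1: 75.0,  top: 20.0, bottom: 30.0 }
see    = { text: "see",  x0: 80.0, x1: 95.0,  top: 20.0, bottom: 30.0 }
notes  = { text: "note", x0: 98.0, x1: 118.0, top: 20.0, bottom: 30.0 }

labels_by_word = { paid => :flag, sent => :flag }

# Same condition as the block passed to merge_groups in the source.
both_unlabeled = ->(a, b) { labels_by_word[a].nil? && labels_by_word[b].nil? }

puts both_unlabeled.call(see, notes) # true: both are orphans
puts both_unlabeled.call(sent, see)  # false: one side is labeled
puts both_unlabeled.call(paid, sent) # false: labeled words stay apart
```

Even words that carry the same label are kept separate here, which is what leaves labeled words intact while orphan fragments are stitched together.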