Class: Rpdfium::Util::WordMerger

Inherits:
Object
Defined in:
lib/rpdfium/util/word_merger.rb

Overview

Merges adjacent words on the same row into a single word with an aggregated bbox and concatenated text.

Three strategies are available as separate methods:

  • `merge_by_proximity`: merges all adjacent words that satisfy the proximity criterion. This is the base strategy.

  • `merge_by_label`: merges only words that share the same “label” (an external key computed by the caller). Useful for preserving semantics when words with different labels fall on the same row (e.g. flags in adjacent columns).

  • `merge_unlabeled`: merges only “orphan” words (those whose label is nil), leaving labeled words intact. The counterpart of `merge_by_label`.

All three return a new list of words; each merged word is a hash of the form `{ text:, x0:, x1:, top:, bottom: }`.
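
To make the "aggregated bbox and concatenated text" concrete, here is a minimal sketch of combining two words. `merge_pair` is a hypothetical helper, not part of Rpdfium; it assumes input words are hashes with the same keys as the merged result and that texts are joined with a single space, neither of which this page confirms.

```ruby
# Hypothetical helper, not part of Rpdfium: combines two word hashes into one.
# Assumes words carry :text, :x0, :x1, :top and :bottom keys, and that the
# merged text is joined with a single space.
def merge_pair(a, b)
  {
    text:   "#{a[:text]} #{b[:text]}",
    x0:     [a[:x0], b[:x0]].min,         # leftmost edge
    x1:     [a[:x1], b[:x1]].max,         # rightmost edge
    top:    [a[:top], b[:top]].min,       # uppermost edge
    bottom: [a[:bottom], b[:bottom]].max  # lowermost edge
  }
end

left   = { text: "Total", x0: 10.0, x1: 40.0, top: 100.0, bottom: 110.0 }
right  = { text: "due",   x0: 45.0, x1: 60.0, top: 101.0, bottom: 111.0 }
merged = merge_pair(left, right)
```

The resulting box is the smallest rectangle that contains both inputs.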

Examples:

Merge by proximity:

merger = Rpdfium::Util::WordMerger.new(x_gap: 20.0, y_tol: 3.0)
merged = merger.merge_by_proximity(words)

Merge by label, with the label provided by the caller:

labels_by_word = words.each_with_object({}) { |w, h| h[w] = compute_label(w) }
merged = merger.merge_by_label(words, labels_by_word)

Constant Summary

DEFAULT_X_GAP = 20.0
DEFAULT_Y_TOL = 3.0

Instance Method Summary

Constructor Details

#initialize(x_gap: DEFAULT_X_GAP, y_tol: DEFAULT_Y_TOL) ⇒ WordMerger

Returns a new instance of WordMerger.



# File 'lib/rpdfium/util/word_merger.rb', line 35

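def initialize(x_gap: DEFAULT_X_GAP, y_tol: DEFAULT_Y_TOL)
  @x_gap = x_gap # maximum horizontal gap between words that may be merged
  @y_tol = y_tol # vertical tolerance for treating words as the same row (inferred from the name)
end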

Instance Method Details

#merge_by_label(words, labels_by_word) ⇒ Object

Merges only words with the same label.

Parameters:

  • labels_by_word (Hash)

    mapping word → label (any type). Words with the same label are merged; words with different labels are not.



# File 'lib/rpdfium/util/word_merger.rb', line 49

def merge_by_label(words, labels_by_word)
  merge_groups(words) do |a, b|
    labels_by_word[a] == labels_by_word[b]
  end
end
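
The label comparison above can be exercised on its own. The sketch below reproduces only the predicate from the block; the sample words and labels are made up for illustration. One consequence worth noting: a word missing from `labels_by_word` looks up as nil, and nil == nil is true, so two unlabeled words also satisfy this predicate. Whether they are actually merged still depends on the proximity check in `merge_groups`, which this page does not show.

```ruby
# Sample words and labels are illustrative only.
a = { text: "Y",    x0: 10.0, x1: 15.0, top: 50.0, bottom: 60.0 }
b = { text: "N",    x0: 20.0, x1: 25.0, top: 50.0, bottom: 60.0 }
c = { text: "see",  x0: 30.0, x1: 45.0, top: 50.0, bottom: 60.0 }
d = { text: "note", x0: 50.0, x1: 70.0, top: 50.0, bottom: 60.0 }

# c and d have no entry, so their lookup returns nil.
labels_by_word = { a => :flag_a, b => :flag_b }

# Same comparison as the block passed to merge_groups in merge_by_label.
same_label = ->(x, y) { labels_by_word[x] == labels_by_word[y] }

a_b = same_label.call(a, b) # different labels
c_d = same_label.call(c, d) # both nil
a_c = same_label.call(a, c) # labeled vs. unlabeled
```

Because the hash is keyed by the word itself, two words with identical content map to the same entry; callers that need to tell such words apart would have to key the hash differently.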

#merge_by_proximity(words) ⇒ Object

Merges all adjacent words (same row + horizontal gap ≤ x_gap).



# File 'lib/rpdfium/util/word_merger.rb', line 41

def merge_by_proximity(words)
  # No extra condition: proximity alone decides whether two words merge.
  merge_groups(words) { |_a, _b| true }
end
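
The proximity criterion (same row, horizontal gap ≤ x_gap) lives in the private `merge_groups` helper, which is not reproduced on this page. As a rough sketch of what such a check might look like, the hypothetical `adjacent?` below assumes that "same row" means the `top` values differ by at most `y_tol`, and that the gap is measured from the right edge of the first word to the left edge of the second. The actual implementation may differ.

```ruby
# Hypothetical predicate, not taken from the library source.
# Assumes row membership is decided by comparing :top values and that the
# gap runs from a's right edge (:x1) to b's left edge (:x0).
def adjacent?(a, b, x_gap: 20.0, y_tol: 3.0)
  same_row = (a[:top] - b[:top]).abs <= y_tol
  gap      = b[:x0] - a[:x1]
  same_row && gap <= x_gap
end

a = { text: "Invoice", x0: 10.0,  x1: 50.0,  top: 100.0, bottom: 110.0 }
b = { text: "number",  x0: 55.0,  x1: 90.0,  top: 101.5, bottom: 111.5 }
c = { text: "Page",    x0: 300.0, x1: 330.0, top: 100.0, bottom: 110.0 }
d = { text: "Date",    x0: 55.0,  x1: 80.0,  top: 120.0, bottom: 130.0 }

near      = adjacent?(a, b) # gap 5.0, tops differ by 1.5
far_right = adjacent?(a, c) # gap 250.0 exceeds x_gap
next_row  = adjacent?(a, d) # tops differ by 20.0, exceeds y_tol
```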

#merge_unlabeled(words, labels_by_word) ⇒ Object

Merges only words with a nil label (orphans).



# File 'lib/rpdfium/util/word_merger.rb', line 56

def merge_unlabeled(words, labels_by_word)
  merge_groups(words) do |a, b|
    labels_by_word[a].nil? && labels_by_word[b].nil?
  end
end
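
To show the orphan-only behaviour end to end, here is a self-contained stand-in. `SketchMerger` is hypothetical: its `merge_unlabeled` copies the method above, but `merge_groups`, `adjacent?` and `combine` are guesses at the private helpers, which the source does not show. It assumes words are hashes already sorted left to right, that rows are matched on `top`, and that merged text is joined with a space.

```ruby
# Hypothetical stand-in for Rpdfium::Util::WordMerger. Only merge_unlabeled
# mirrors the documented source; every private helper is an assumption.
class SketchMerger
  def initialize(x_gap: 20.0, y_tol: 3.0)
    @x_gap = x_gap
    @y_tol = y_tol
  end

  def merge_unlabeled(words, labels_by_word)
    merge_groups(words) do |a, b|
      labels_by_word[a].nil? && labels_by_word[b].nil?
    end
  end

  private

  # Walks the words in order, extending the current group while the next
  # word is close enough and the caller's block approves the pair.
  def merge_groups(words)
    groups = []
    words.each do |word|
      last = groups.last&.last
      if last && adjacent?(last, word) && yield(last, word)
        groups.last << word
      else
        groups << [word]
      end
    end
    groups.map { |group| group.size == 1 ? group.first : combine(group) }
  end

  def adjacent?(a, b)
    (a[:top] - b[:top]).abs <= @y_tol && (b[:x0] - a[:x1]) <= @x_gap
  end

  def combine(group)
    {
      text:   group.map { |w| w[:text] }.join(" "),
      x0:     group.map { |w| w[:x0] }.min,
      x1:     group.map { |w| w[:x1] }.max,
      top:    group.map { |w| w[:top] }.min,
      bottom: group.map { |w| w[:bottom] }.max
    }
  end
end

words = [
  { text: "Status:", x0: 10.0, x1: 50.0,  top: 100.0, bottom: 110.0 },
  { text: "see",     x0: 60.0, x1: 75.0,  top: 100.0, bottom: 110.0 },
  { text: "note",    x0: 80.0, x1: 100.0, top: 100.0, bottom: 110.0 }
]
labels_by_word = { words[0] => :header } # the other two are orphans

merged = SketchMerger.new.merge_unlabeled(words, labels_by_word)
texts  = merged.map { |w| w[:text] }
```

In this sketch the labeled word passes through untouched even though it sits within `x_gap` of its neighbour, while the two orphans collapse into one word.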