Class: Rpdfium::Util::WordExtractor
- Inherits: Object
  - Object
    - Rpdfium::Util::WordExtractor
- Defined in: lib/rpdfium/util/word_extractor.rb
Overview
Extracts “words” from a list of chars, staying faithful to pdfplumber.WordExtractor.
Algorithm:
1. Sort the chars by (top, x0): rows top-to-bottom, chars
left-to-right within each row.
2. Cluster by top with `y_tolerance` → "logical rows" of chars.
3. Within each row, cluster by horizontal gap: two chars belong to
the same word if `next.x0 - prev.x1 <= x_tolerance`. A whitespace
char also separates the word (unless `keep_blank_chars`).
4. For each cluster of chars, emit a word: concatenated text, bbox.
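The four steps above can be illustrated with a stand-alone sketch. It is not the library's implementation: `sketch_extract_words` and `sketch_blank?` are hypothetical names, rows are formed by a simple "split when `top` jumps by more than the tolerance" rule rather than by `Cluster.cluster_objects`, and each char is assumed to be a Hash with `:text`, `:x0`, `:x1`, `:top` and `:bottom` keys.

```ruby
# Hypothetical helper (not part of Rpdfium): true for whitespace-only chars.
def sketch_blank?(char)
  char[:text].strip.empty?
end

# Hypothetical, simplified version of the four steps described above.
def sketch_extract_words(chars, x_tolerance: 3.0, y_tolerance: 3.0)
  # Step 1: sort top-to-bottom, then left-to-right.
  sorted = chars.sort_by { |c| [c[:top], c[:x0]] }

  # Step 2: start a new row whenever `top` jumps by more than y_tolerance
  # (a simplified stand-in for real clustering).
  rows = sorted.slice_when { |a, b| b[:top] - a[:top] > y_tolerance }

  rows.flat_map do |row|
    # Step 3: order the row by x0, then split on wide gaps and around
    # whitespace chars, so each whitespace char ends up in its own group.
    by_x = row.sort_by { |c| c[:x0] }
    groups = by_x.slice_when do |prev, nxt|
      sketch_blank?(prev) || sketch_blank?(nxt) ||
        nxt[:x0] - prev[:x1] > x_tolerance
    end

    # Step 4: drop the whitespace-only groups and emit one word per group.
    groups.reject { |g| sketch_blank?(g.first) }.map do |g|
      {
        text:   g.map { |c| c[:text] }.join,
        x0:     g.map { |c| c[:x0] }.min,
        x1:     g.map { |c| c[:x1] }.max,
        top:    g.map { |c| c[:top] }.min,
        bottom: g.map { |c| c[:bottom] }.max
      }
    end
  end
end

# Invented sample data: "Hi yo" on one row, "A" and a distant "B" on the next.
demo_chars = [
  { text: "H", x0: 0.0,  x1: 5.0,  top: 10.0, bottom: 20.0 },
  { text: "i", x0: 5.0,  x1: 8.0,  top: 10.0, bottom: 20.0 },
  { text: " ", x0: 8.0,  x1: 10.0, top: 10.0, bottom: 20.0 },
  { text: "y", x0: 10.0, x1: 15.0, top: 10.0, bottom: 20.0 },
  { text: "o", x0: 15.0, x1: 20.0, top: 10.0, bottom: 20.0 },
  { text: "A", x0: 0.0,  x1: 5.0,  top: 30.0, bottom: 40.0 },
  { text: "B", x0: 30.0, x1: 35.0, top: 30.0, bottom: 40.0 }
]
p sketch_extract_words(demo_chars).map { |w| w[:text] }
# prints ["Hi", "yo", "A", "B"]
```

The whitespace char splits "Hi" from "yo" and is discarded, while "A" and "B" are separated only because their horizontal gap (25.0) exceeds the tolerance.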
Differences from pdfplumber (simplifications acceptable for our use):
- We do not handle rotated `line_dir`/`char_dir` (text rotated away
from horizontal ltr): not relevant for current use cases.
- We do not handle `use_text_flow` (ordering based on the content
stream): our chars already arrive from PDFium in geometric order
via `chars` (top, x0).
- We do not handle `expand_ligatures`: PDFium usually expands the
codepoints correctly already at the char level.
These differences are documented; if ever needed they can be added as feature toggles without changing the default path.
Constant Summary
- DEFAULT_X_TOLERANCE = 3.0
- DEFAULT_Y_TOLERANCE = 3.0
Instance Attribute Summary
- #keep_blank_chars ⇒ Object (readonly)
  Returns the value of attribute keep_blank_chars.
- #x_tolerance ⇒ Object (readonly)
  Returns the value of attribute x_tolerance.
- #y_tolerance ⇒ Object (readonly)
  Returns the value of attribute y_tolerance.
Instance Method Summary
- #extract_words(chars) ⇒ Object
  Returns an Array of Hash: { text:, x0:, x1:, top:, bottom:, chars: }.
- #initialize(x_tolerance: DEFAULT_X_TOLERANCE, y_tolerance: DEFAULT_Y_TOLERANCE, keep_blank_chars: false, extra_attrs: nil) ⇒ WordExtractor (constructor)
  A new instance of WordExtractor.
Constructor Details
#initialize(x_tolerance: DEFAULT_X_TOLERANCE, y_tolerance: DEFAULT_Y_TOLERANCE, keep_blank_chars: false, extra_attrs: nil) ⇒ WordExtractor
Returns a new instance of WordExtractor.
    # File 'lib/rpdfium/util/word_extractor.rb', line 34
    def initialize(x_tolerance: DEFAULT_X_TOLERANCE,
                   y_tolerance: DEFAULT_Y_TOLERANCE,
                   keep_blank_chars: false,
                   extra_attrs: nil)
      @x_tolerance = x_tolerance.to_f
      @y_tolerance = y_tolerance.to_f
      @keep_blank_chars = keep_blank_chars
      @extra_attrs = extra_attrs || []
    end
Instance Attribute Details
#keep_blank_chars ⇒ Object (readonly)
Returns the value of attribute keep_blank_chars.
    # File 'lib/rpdfium/util/word_extractor.rb', line 32
    def keep_blank_chars
      @keep_blank_chars
    end
#x_tolerance ⇒ Object (readonly)
Returns the value of attribute x_tolerance.
    # File 'lib/rpdfium/util/word_extractor.rb', line 32
    def x_tolerance
      @x_tolerance
    end
#y_tolerance ⇒ Object (readonly)
Returns the value of attribute y_tolerance.
    # File 'lib/rpdfium/util/word_extractor.rb', line 32
    def y_tolerance
      @y_tolerance
    end
Instance Method Details
#extract_words(chars) ⇒ Object
Returns an Array of Hash: { text:, x0:, x1:, top:, bottom:, chars: }. If `extra_attrs` is non-empty, a word is also split wherever any of these attributes changes between adjacent chars (e.g. different fontname/size → different words).
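The `extra_attrs` splitting rule can be pictured with a small stand-alone sketch. The helper that actually makes this decision is not shown on this page, so the code below is only an assumption about its behaviour: `sketch_attrs_changed?` and `sketch_split_on_attrs` are hypothetical names, and chars are assumed to be Hashes that carry the listed attributes as keys.

```ruby
# Hypothetical helper: true when any attribute listed in extra_attrs
# differs between two adjacent chars.
def sketch_attrs_changed?(prev, curr, extra_attrs)
  extra_attrs.any? { |attr| prev[attr] != curr[attr] }
end

# Hypothetical helper: splits a run of chars wherever a listed attribute
# changes, and returns the text of each resulting group.
def sketch_split_on_attrs(chars, extra_attrs)
  chars
    .slice_when { |a, b| sketch_attrs_changed?(a, b, extra_attrs) }
    .map { |group| group.map { |c| c[:text] }.join }
end

# Invented sample data: four adjacent chars, the last two in another font.
font_chars = [
  { text: "a", fontname: "Helvetica" },
  { text: "b", fontname: "Helvetica" },
  { text: "c", fontname: "Helvetica-Bold" },
  { text: "d", fontname: "Helvetica-Bold" }
]
p sketch_split_on_attrs(font_chars, [:fontname]) # prints ["ab", "cd"]
p sketch_split_on_attrs(font_chars, [])          # prints ["abcd"]
```

With an empty `extra_attrs` list the check never fires, which matches the constructor default of `extra_attrs: nil` being normalised to `[]`.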
    # File 'lib/rpdfium/util/word_extractor.rb', line 47
    def extract_words(chars)
      return [] if chars.empty?

      # Fast path: a single char → 1 trivial word (if not whitespace).
      if chars.size == 1
        c = chars.first
        return [] if blank?(c) && !@keep_blank_chars
        return [build_word([c])]
      end

      # 1. Sort by (top, x0). Top-down, left-to-right.
      sorted = chars.sort_by { |c| [c[:top], c[:x0]] }

      # 2. Cluster into rows by `top`.
      #    `presorted: true`: sorted is already ordered by [top, x0], hence
      #    implicitly also by top, so cluster_objects skips its own internal
      #    sort.
      rows = Cluster.cluster_objects(sorted, :top, tolerance: @y_tolerance, presorted: true)

      words = []
      rows.each do |row|
        # Re-sort by x0 within each clustered row.
        #
        # NOTE: in principle the input `sorted` is already ordered by
        # [top, x0], so the top clusters should already be in x0 order.
        # BUT the global sort `[top, x0]` strictly respects the order by
        # top: if two chars of the same visual row have different tops
        # within tolerance (e.g. the lowercase "i" often has a top higher
        # by 0.008pt than the other letters because of how PDFium computes
        # the bbox), the global sort interleaves them. cluster_objects by
        # :top does not internally reorder the chars, so a char with a
        # slightly lower top ends up AHEAD of all the other letters of the
        # word.
        #
        # Real example: "Categoria" where "i" has top=414.9789 and the
        # others 414.9869 → output `iCategora` instead of `Categoria`.
        # The fix is simply to re-sort by x0 within the row.
        row_sorted = row.sort_by { |c| c[:x0] }

        word_chars = []
        row_sorted.each do |c|
          if char_begins_new_word?(word_chars.last, c)
            words << build_word(word_chars) unless word_chars.empty?
            word_chars = []
          end
          # Whitespace: by default we use it as a separator (we discard it).
          # With keep_blank_chars=true we include it in the current word.
          if blank?(c) && !@keep_blank_chars
            words << build_word(word_chars) unless word_chars.empty?
            word_chars = []
          else
            word_chars << c
          end
        end
        words << build_word(word_chars) unless word_chars.empty?
      end
      words
    end
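The "Categoria" ordering problem described in the comment can be reproduced in isolation. In this sketch only the two `top` values come from the comment. The x positions and the helper names `sketch_categoria_chars` and `sketch_text_of` are invented for illustration.

```ruby
# Hypothetical data: one char per letter, 6.0pt apart, with the "i"
# sitting 0.008pt higher than the other letters.
def sketch_categoria_chars
  "Categoria".each_char.with_index.map do |ch, i|
    { text: ch, x0: i * 6.0, top: ch == "i" ? 414.9789 : 414.9869 }
  end
end

# Hypothetical helper: concatenates the text of a list of chars.
def sketch_text_of(chars)
  chars.map { |c| c[:text] }.join
end

# Global sort by [top, x0]: the "i" has the smallest top, so it comes first.
globally_sorted = sketch_categoria_chars.sort_by { |c| [c[:top], c[:x0]] }
puts sketch_text_of(globally_sorted) # prints iCategora

# Re-sorting the row by x0 alone restores the reading order.
row_sorted = globally_sorted.sort_by { |c| c[:x0] }
puts sketch_text_of(row_sorted)      # prints Categoria
```

Sorting each row by `x0` after clustering is cheap, since rows are short, and it removes the dependence on sub-point differences in `top`.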