Module: Rpdfium::Util::TextExtraction

Defined in: lib/rpdfium/util/text_extraction.rb

Overview

“Linear” text extraction from a collection of chars, without layout preservation (the equivalent of layout=False in pdfplumber). Corresponds to pdfplumber.utils.text.chars_to_textmap in the variant that does not preserve the graphic layout of the page.

Algorithm:

1. Extract words with WordExtractor (same tolerances).
2. Cluster words by `top` with y_tolerance → logical lines.
3. For each line, sort by x0 and join with a single space.
4. Join the lines with "\n".
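
Steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the module's actual code: words are assumed to be plain hashes with :text, :x0 and :top keys (the real output of WordExtractor may differ), cluster_lines and linear_text are hypothetical helper names, and the default tolerance of 3 is borrowed from pdfplumber as an assumption.

```ruby
# Illustrative sketch only: words are plain hashes with :text, :x0 and :top,
# standing in for whatever WordExtractor actually returns.

# Step 2: cluster words into logical lines by their :top coordinate.
def cluster_lines(words, y_tolerance)
  lines = []
  words.sort_by { |w| w[:top] }.each do |word|
    current = lines.last
    # A word joins the current line when it sits at most y_tolerance
    # below the first word of that line; otherwise it opens a new line.
    if current && (word[:top] - current.first[:top]) <= y_tolerance
      current << word
    else
      lines << [word]
    end
  end
  lines
end

# Steps 3 and 4: sort each line by :x0, join words with a single space,
# then join the lines with "\n".
def linear_text(words, y_tolerance: 3) # 3 is an assumed default
  cluster_lines(words, y_tolerance)
    .map { |line| line.sort_by { |w| w[:x0] }.map { |w| w[:text] }.join(" ") }
    .join("\n")
end

words = [
  { text: "world",  x0: 40.0, top: 10.4 },
  { text: "Hello",  x0: 10.0, top: 10.0 },
  { text: "Second", x0: 10.0, top: 24.0 },
  { text: "line",   x0: 52.0, top: 24.2 }
]

puts linear_text(words)
# Hello world
# Second line
```

Comparing each word against the first word of its line, rather than the previous word, keeps a slowly drifting baseline from merging distant lines; whether the module makes the same choice is not stated here.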

NOTE: pdfplumber allows x_tolerance to differ from y_tolerance, both for word extraction and for line clustering. This module replicates that flexibility.
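
To see why independent tolerances matter, the sketch below groups chars into words with separate horizontal and vertical thresholds. It is an assumption-laden illustration: group_words is a hypothetical helper, not the WordExtractor API, and chars are assumed to be hashes with :text, :x0, :x1 and :top keys.

```ruby
# Illustrative sketch only: group_words is a hypothetical helper and the
# char hash layout (:text, :x0, :x1, :top) is an assumption.
def group_words(chars, x_tolerance:, y_tolerance:)
  words = []
  chars.sort_by { |c| [c[:top], c[:x0]] }.each do |char|
    current = words.last
    # Vertical check: the char must sit within y_tolerance of the previous char.
    same_line = current && (char[:top] - current.last[:top]).abs <= y_tolerance
    # Horizontal check: the gap to the previous char must not exceed x_tolerance.
    close = current && (char[:x0] - current.last[:x1]) <= x_tolerance
    if same_line && close
      current << char
    else
      words << [char]
    end
  end
  words.map { |word| word.map { |c| c[:text] }.join }
end

chars = [
  { text: "a", x0: 0,    x1: 5,  top: 10 },
  { text: "b", x0: 7,    x1: 12, top: 10 }, # gap of 2 after "a"
  { text: "c", x0: 12.5, x1: 17, top: 10 }, # gap of 0.5 after "b"
  { text: "d", x0: 18,   x1: 23, top: 12 }  # gap of 1, shifted down by 2
]

p group_words(chars, x_tolerance: 3, y_tolerance: 3) # ["abcd"]
p group_words(chars, x_tolerance: 1, y_tolerance: 3) # ["a", "bcd"]
p group_words(chars, x_tolerance: 3, y_tolerance: 1) # ["abc", "d"]
```

Tightening only x_tolerance splits on the wider horizontal gap, while tightening only y_tolerance splits on the vertical shift, so a single shared tolerance could not express both behaviours.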

Constant Summary

DEFAULT_X_TOLERANCE = WordExtractor::DEFAULT_X_TOLERANCE

DEFAULT_Y_TOLERANCE = WordExtractor::DEFAULT_Y_TOLERANCE

Class Method Summary

.extract_text(chars, x_tolerance: DEFAULT_X_TOLERANCE, y_tolerance: DEFAULT_Y_TOLERANCE, keep_blank_chars: false) ⇒ Object

Class Method Details

.extract_text(chars, x_tolerance: DEFAULT_X_TOLERANCE, y_tolerance: DEFAULT_Y_TOLERANCE, keep_blank_chars: false) ⇒ `Object`