Class: Rpdfium::Util::WordExtractor

Inherits:

Object

Object
Rpdfium::Util::WordExtractor


Defined in: lib/rpdfium/util/word_extractor.rb

Overview

Extracts "words" from a list of chars, closely following pdfplumber's WordExtractor.

Algorithm:

1. Sort the chars by (top, x0): rows top-to-bottom, chars left-to-right
   within each row.
2. Cluster by top with `y_tolerance` → "logical rows" of chars.
3. Within each row, cluster by horizontal gap: two chars belong to the
   same word if `next.x0 - prev.x1 <= x_tolerance`. A whitespace char
   also splits the word (unless `keep_blank_chars` is set).
4. For each cluster of chars, emit one word: the concatenated text and the bbox.
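
Steps 3 and 4 can be sketched in a few lines. This is an illustrative sketch only, not the Rpdfium implementation: `sketch_split_row` and `sketch_build_word` are hypothetical names, the chars are assumed to be Hashes with `:text`, `:x0`, `:x1`, `:top` and `:bottom` keys, and the bbox is assumed to be the union of the char boxes.

```ruby
# Hypothetical helpers for illustration; not part of Rpdfium.

# Step 4: merge a cluster of chars into one word hash.
# Assumption: the word bbox is the union of the char bboxes.
def sketch_build_word(chars)
  {
    text: chars.map { |c| c[:text] }.join,
    x0: chars.map { |c| c[:x0] }.min,
    x1: chars.map { |c| c[:x1] }.max,
    top: chars.map { |c| c[:top] }.min,
    bottom: chars.map { |c| c[:bottom] }.max,
    chars: chars
  }
end

# Step 3: split one row into words by horizontal gap and whitespace.
def sketch_split_row(row, x_tolerance: 3.0, keep_blank_chars: false)
  groups = []
  current = []
  row.sort_by { |c| c[:x0] }.each do |c|
    prev = current.last
    # A gap wider than x_tolerance closes the current word.
    if prev && (c[:x0] - prev[:x1]) > x_tolerance
      groups << current
      current = []
    end
    if c[:text].strip.empty? && !keep_blank_chars
      # Whitespace acts as a separator and is dropped.
      groups << current unless current.empty?
      current = []
    else
      current << c
    end
  end
  groups << current unless current.empty?
  groups.map { |g| sketch_build_word(g) }
end

def sketch_row
  [
    { text: "H", x0: 10.0, x1: 16.0, top: 50.0, bottom: 60.0 },
    { text: "i", x0: 16.5, x1: 18.5, top: 49.9, bottom: 60.0 }, # gap 0.5: same word
    { text: " ", x0: 18.5, x1: 21.0, top: 50.0, bottom: 60.0 }, # blank: separator
    { text: "y", x0: 21.0, x1: 26.0, top: 50.0, bottom: 62.0 },
    { text: "o", x0: 32.0, x1: 37.0, top: 50.0, bottom: 60.0 }  # gap 6.0: new word
  ]
end

puts sketch_split_row(sketch_row).map { |w| w[:text] }.inspect
# → ["Hi", "y", "o"]
```

With `keep_blank_chars: true` the space stays inside the word, so the same row yields "Hi y" and "o".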

Differences from pdfplumber (simplifications acceptable for our use):

- We do not handle rotated `line_dir`/`char_dir` (rotated text that is
  not horizontal ltr): not relevant to the current use cases.
- We do not handle `use_text_flow` (ordering based on the content stream):
  our chars already arrive from PDFium in geometric order via
  `chars` (top, x0).
- We do not handle `expand_ligatures`: PDFium usually already expands
  the codepoints correctly at the char level.

These differences are documented; if they are ever needed, they can be added as feature toggles without changing the default path.

Constant Summary

DEFAULT_X_TOLERANCE = 3.0

DEFAULT_Y_TOLERANCE = 3.0
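
To make the effect of the y tolerance concrete, the sketch below groups chars into rows by `top`. It is a minimal stand-in, not the real `Cluster.cluster_objects`: `sketch_cluster_by_top` is a hypothetical name, and it assumes chained clustering, where a char joins the current row if its `top` is within the tolerance of the previous char's `top` (the approach pdfplumber uses). The actual helper may behave differently.

```ruby
# Hypothetical stand-in for Cluster.cluster_objects; the real helper may differ.
# Assumption: chained clustering, each top is compared with the previous top.
def sketch_cluster_by_top(chars, tolerance: 3.0)
  rows = []
  last_top = nil
  chars.sort_by { |c| c[:top] }.each do |c|
    # Open a new row when the jump from the previous top exceeds the tolerance.
    rows << [] if last_top.nil? || c[:top] > last_top + tolerance
    rows.last << c
    last_top = c[:top]
  end
  rows
end

tops = [100.0, 101.5, 103.9, 120.0].map { |t| { top: t } }
puts sketch_cluster_by_top(tops).map { |row| row.map { |c| c[:top] } }.inspect
# → [[100.0, 101.5, 103.9], [120.0]]
```

Note that under this assumption 103.9 lands in the first row even though it is more than 3.0 away from 100.0, because each char is only compared with its immediate predecessor.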

Instance Attribute Summary

#keep_blank_chars ⇒ Object readonly

Returns the value of attribute keep_blank_chars.
#x_tolerance ⇒ Object readonly

Returns the value of attribute x_tolerance.
#y_tolerance ⇒ Object readonly

Returns the value of attribute y_tolerance.

Instance Method Summary

#extract_words(chars) ⇒ Object

Returns an Array of Hashes: { text:, x0:, x1:, top:, bottom:, chars: }.
#initialize(x_tolerance: DEFAULT_X_TOLERANCE, y_tolerance: DEFAULT_Y_TOLERANCE, keep_blank_chars: false, extra_attrs: nil) ⇒ WordExtractor constructor

A new instance of WordExtractor.

Constructor Details

#initialize(x_tolerance: DEFAULT_X_TOLERANCE, y_tolerance: DEFAULT_Y_TOLERANCE, keep_blank_chars: false, extra_attrs: nil) ⇒ `WordExtractor`

Returns a new instance of WordExtractor.

# File 'lib/rpdfium/util/word_extractor.rb', line 33

def initialize(x_tolerance: DEFAULT_X_TOLERANCE,
               y_tolerance: DEFAULT_Y_TOLERANCE,
               keep_blank_chars: false,
               extra_attrs: nil)
  @x_tolerance = x_tolerance.to_f
  @y_tolerance = y_tolerance.to_f
  @keep_blank_chars = keep_blank_chars
  @extra_attrs = extra_attrs || []
end

Instance Attribute Details

#keep_blank_chars ⇒ `Object` (readonly)

Returns the value of attribute keep_blank_chars.




# File 'lib/rpdfium/util/word_extractor.rb', line 31

def keep_blank_chars
  @keep_blank_chars
end

#x_tolerance ⇒ `Object` (readonly)

Returns the value of attribute x_tolerance.




# File 'lib/rpdfium/util/word_extractor.rb', line 31

def x_tolerance
  @x_tolerance
end

#y_tolerance ⇒ `Object` (readonly)

Returns the value of attribute y_tolerance.




# File 'lib/rpdfium/util/word_extractor.rb', line 31

def y_tolerance
  @y_tolerance
end

Instance Method Details

#extract_words(chars) ⇒ `Object`

Returns an Array of Hashes: { text:, x0:, x1:, top:, bottom:, chars: }. If `extra_attrs` is non-empty, each word also splits whenever one of these attributes changes (e.g. a different fontname/size → different words).
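
The attribute-based split can be pictured as follows. This is a hedged sketch rather than the actual logic of `char_begins_new_word?`, whose source is not shown here: `sketch_split_on_attrs` is a hypothetical name, and the `:fontname` and `:size` keys are assumed to be present on each char Hash.

```ruby
# Hypothetical illustration of splitting on attribute changes; not Rpdfium code.
# Assumption: chars are Hashes that carry the keys listed in extra_attrs.
def sketch_split_on_attrs(chars, extra_attrs)
  chars
    .slice_when { |prev, curr| extra_attrs.any? { |key| prev[key] != curr[key] } }
    .map { |group| group.map { |c| c[:text] }.join }
end

chars = [
  { text: "A", fontname: "Bold",    size: 12.0 },
  { text: "B", fontname: "Bold",    size: 12.0 },
  { text: "c", fontname: "Regular", size: 12.0 } # fontname changes here
]

puts sketch_split_on_attrs(chars, [:fontname, :size]).inspect
# → ["AB", "c"]
puts sketch_split_on_attrs(chars, []).inspect
# → ["ABc"]
```

With an empty `extra_attrs` list no attribute is compared, so adjacent chars are split only by the gap and whitespace rules.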

# File 'lib/rpdfium/util/word_extractor.rb', line 46

def extract_words(chars)
  return [] if chars.empty?

  # Fast path: a single char yields one trivial word (or none if it is
  # whitespace and keep_blank_chars is false).
  if chars.size == 1
    c = chars.first
    return [] if blank?(c) && !@keep_blank_chars

    return [build_word([c])]
  end

  # 1. Sort by (top, x0): top-down, left-to-right.
  sorted = chars.sort_by { |c| [c[:top], c[:x0]] }

  # 2. Cluster into rows by `top`.
  # `presorted: true`: `sorted` is already ordered by [top, x0], and
  # therefore implicitly by top, so cluster_objects skips its own
  # internal sort.
  rows = Cluster.cluster_objects(sorted, :top,
                                  tolerance: @y_tolerance,
                                  presorted: true)

  words = []
  rows.each do |row|
    # Re-sort by x0 within each clustered row.
    #
    # NOTE: in principle the `sorted` input is already ordered by
    # [top, x0], so the top clusters should already be in x0 order.
    # BUT the global `[top, x0]` sort orders strictly by top first: if
    # two chars on the same visual line have different tops within
    # tolerance (e.g. a lowercase "i" often sits about 0.008pt higher,
    # i.e. has a smaller top, than the other letters because of how
    # PDFium computes the bbox), the global sort interleaves them.
    # cluster_objects on :top does not reorder the chars internally,
    # so a char with a slightly smaller top ends up IN FRONT of all
    # the other letters of the word.
    #
    # Real example: "Categoria", where "i" has top=414.9789 and the
    # other letters 414.9869 → output `iCategora` instead of `Categoria`.
    # The fix is simply to re-sort by x0 within the row.
    row_sorted = row.sort_by { |c| c[:x0] }

    word_chars = []
    row_sorted.each do |c|
      if char_begins_new_word?(word_chars.last, c)
        words << build_word(word_chars) unless word_chars.empty?
        word_chars = []
      end
      # Whitespace: by default we treat it as a separator (and discard it).
      # With keep_blank_chars=true we include it in the current word.
      if blank?(c) && !@keep_blank_chars
        words << build_word(word_chars) unless word_chars.empty?
        word_chars = []
      else
        word_chars << c
      end
    end
    words << build_word(word_chars) unless word_chars.empty?
  end

  words
end
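
The ordering problem described in the comments of `extract_words` can be reproduced without PDFium. The sketch below is illustrative only: `sketch_categoria_chars` and `sketch_text_of` are hypothetical helpers, the `top` values are the ones quoted in the comment, and the x0 spacing of 6.0 is an arbitrary assumption.

```ruby
# Hypothetical reproduction of the "iCategora" ordering issue; not Rpdfium code.

# Builds chars for "Categoria" where only the "i" has a slightly smaller top.
# Assumption: x0 advances by an arbitrary 6.0 per char.
def sketch_categoria_chars
  "Categoria".each_char.with_index.map do |ch, i|
    { text: ch, x0: 100.0 + 6.0 * i, top: ch == "i" ? 414.9789 : 414.9869 }
  end
end

def sketch_text_of(chars)
  chars.map { |c| c[:text] }.join
end

# The global sort puts the "i" first because its top is the smallest.
globally_sorted = sketch_categoria_chars.sort_by { |c| [c[:top], c[:x0]] }
puts sketch_text_of(globally_sorted)
# → iCategora

# Re-sorting the row by x0 restores the reading order.
row_sorted = globally_sorted.sort_by { |c| c[:x0] }
puts sketch_text_of(row_sorted)
# → Categoria
```

Since all nine chars fall within the same row under a y tolerance of 3.0, sorting by x0 alone is enough to recover the reading order inside the row.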