Class: Rpdfium::Table::Extractor

Inherits:
Object
Defined in:
lib/rpdfium/table/extractor.rb

Overview

Finds tables on a page, faithful to `pdfplumber.TableFinder`.

Pipeline:

1. collect candidate edges for each axis, according to the strategy
   (`:lines` / `:lines_strict` / `:text` / `:explicit`)
2. merge_edges (snap collinear edges + join contiguous ones)
3. filter by minimum length
4. edges_to_intersections with tolerance
5. intersections_to_cells (smallest cell for each point)
6. cells_to_tables (grouping by shared corners)
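
Step 4 can be pictured with a small, self-contained sketch. It is not Rpdfium's code: the edge representation (plain hashes with `:orientation`, `:x0`, `:x1`, `:top`, `:bottom`, borrowed from pdfplumber's naming) and the helper name `sketch_intersections` are assumptions made for illustration only.

```ruby
# Illustrative only: edges are plain hashes, not Rpdfium objects.
def sketch_intersections(edges, x_tolerance: 3.0, y_tolerance: 3.0)
  vertical   = edges.select { |e| e[:orientation] == :v }
  horizontal = edges.select { |e| e[:orientation] == :h }
  result = Hash.new { |hash, key| hash[key] = { v: [], h: [] } }

  vertical.each do |v|
    horizontal.each do |h|
      # The vertical edge must span the horizontal edge's y, and its x must
      # fall within the horizontal edge's extent, both within tolerance.
      crosses =
        v[:top]    <= h[:top] + y_tolerance &&
        v[:bottom] >= h[:top] - y_tolerance &&
        v[:x0]     >= h[:x0] - x_tolerance &&
        v[:x0]     <= h[:x1] + x_tolerance
      next unless crosses

      point = [v[:x0], h[:top]]
      result[point][:v] << v
      result[point][:h] << h
    end
  end
  result
end

edges = [
  { orientation: :v, x0: 10.0, x1: 10.0, top: 0.0,  bottom: 50.0 },
  { orientation: :h, x0: 8.5,  x1: 60.0, top: 50.0, bottom: 50.0 },
  { orientation: :h, x0: 20.0, x1: 60.0, top: 25.0, bottom: 25.0 }
]
points = sketch_intersections(edges)
```

With the default tolerance of 3.0, only the first horizontal edge meets the vertical one. The second starts 10 points to its right, so it contributes no intersection.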

Public API:

ext = Rpdfium::Table::Extractor.new(page, **opts)
ext.tables           # => [Table, ...]   (Rpdfium::Table::Table objects)
ext.extract          # => [[[String]]]   (Array of tables; each table
                     #                    is an Array of rows, each row
                     #                    an Array of strings)
ext.find             # alias of .tables (backward compatibility with 0.2.x)
ext.edges            # refined edges
ext.intersections    # Hash {[x,y] => {v:[],h:[]}}
ext.cells            # Array<bbox>
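
Because `extract` returns nested Arrays of strings, post-processing needs nothing beyond core Ruby. The sketch below uses a literal standing in for a real `ext.extract` result (the sample data is invented) and treats the first row of each table as a header, which is an assumption about the document rather than something the extractor does.

```ruby
# Stand-in for the value returned by ext.extract: one table of string cells.
extracted = [
  [["Name", "Qty"], ["Bolt", "12"], ["Nut", "30"]]
]

# Turn each table into an Array of Hashes keyed by its first row.
records = extracted.map do |rows|
  header, *body = rows
  body.map { |row| header.zip(row).to_h }
end
```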

Constant Summary

DEFAULTS =
{
  vertical_strategy:   :lines,
  horizontal_strategy: :lines,
  explicit_vertical_lines:   [],
  explicit_horizontal_lines: [],

  # Tolerances. The `_x_` / `_y_` variants inherit from the unsuffixed value.
  snap_tolerance:           3.0,
  snap_x_tolerance:         nil,
  snap_y_tolerance:         nil,
  join_tolerance:           3.0,
  join_x_tolerance:         nil,
  join_y_tolerance:         nil,

  edge_min_length:           3.0,
  edge_min_length_prefilter: 1.0,

  min_words_vertical:   Edges::DEFAULT_MIN_WORDS_VERTICAL,
  min_words_horizontal: Edges::DEFAULT_MIN_WORDS_HORIZONTAL,

  intersection_tolerance:   3.0,
  intersection_x_tolerance: nil,
  intersection_y_tolerance: nil,

  # Text settings (passed to TextExtraction when .extract is called).
  # The 3.0 defaults match pdfplumber's.
  text_x_tolerance: Util::WordExtractor::DEFAULT_X_TOLERANCE,
  text_y_tolerance: Util::WordExtractor::DEFAULT_Y_TOLERANCE,
  text_keep_blank_chars: false,

  # Auto-fallback: if :lines produces no edges, retry with :text.
  # The flag is kept (it already existed in 0.2.x) but ONLY as a fallback,
  # never as a "fix" for pathological layouts. This is consistent with
  # pdfplumber, which has no such flag (its users are expected to pick the strategy).
  auto_fallback: true
}.freeze
VALID_STRATEGIES =
%i[lines lines_strict text explicit].freeze
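
The comment in `DEFAULTS` says the `_x_` / `_y_` tolerances inherit from the unsuffixed value. The body of `resolve_settings` is not shown on this page, so the following is only a guess at what that inheritance could look like; `inherit_tolerances` is a hypothetical name.

```ruby
# Hypothetical helper: fills nil per-axis tolerances from the base value.
def inherit_tolerances(settings)
  resolved = settings.dup
  %i[snap join intersection].each do |prefix|
    base = resolved[:"#{prefix}_tolerance"]
    %i[x y].each do |axis|
      key = :"#{prefix}_#{axis}_tolerance"
      resolved[key] = base if resolved[key].nil?
    end
  end
  resolved
end

settings = inherit_tolerances({
  snap_tolerance: 3.0, snap_x_tolerance: nil, snap_y_tolerance: 1.5,
  join_tolerance: 3.0, join_x_tolerance: nil, join_y_tolerance: nil,
  intersection_tolerance: 2.0,
  intersection_x_tolerance: nil, intersection_y_tolerance: nil
})
```

An explicit per-axis value (here `snap_y_tolerance: 1.5`) is left untouched. Only `nil` entries take the base value.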

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(page, **opts) ⇒ Extractor

Returns a new instance of Extractor.



# File 'lib/rpdfium/table/extractor.rb', line 68

def initialize(page, **opts)
  @page = page
  @settings = resolve_settings(DEFAULTS.merge(opts))
  validate_strategies!
end
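
The constructor calls `validate_strategies!`, whose body is not part of this page. A plausible check against `VALID_STRATEGIES` might look like the sketch below. The method name `check_strategies!`, the `valid:` parameter, and the choice of `ArgumentError` are all assumptions, not the library's documented behavior.

```ruby
# Hypothetical validation: both axis strategies must be in the allowed list.
def check_strategies!(settings, valid: %i[lines lines_strict text explicit])
  %i[vertical_strategy horizontal_strategy].each do |key|
    value = settings[key]
    next if valid.include?(value)

    raise ArgumentError,
          "#{key} must be one of #{valid.inspect}, got #{value.inspect}"
  end
  settings
end

checked = check_strategies!({ vertical_strategy: :lines, horizontal_strategy: :text })
```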

Instance Attribute Details

#page ⇒ Object (readonly)

Returns the value of attribute page.



# File 'lib/rpdfium/table/extractor.rb', line 66

def page
  @page
end

#settings ⇒ Object (readonly)

Returns the value of attribute settings.



# File 'lib/rpdfium/table/extractor.rb', line 66

def settings
  @settings
end

Instance Method Details

#cells ⇒ Object



# File 'lib/rpdfium/table/extractor.rb', line 97

def cells
  @cells ||= Cells.intersections_to_cells(intersections)
end
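
The "smallest cell for each point" idea behind `intersections_to_cells` can be sketched on bare points. This is a simplification and not the library's implementation: it assumes every pair of aligned points is joined by an edge, whereas a faithful version would also check edge connectivity. `sketch_smallest_cells` is a hypothetical name.

```ruby
# Simplified sketch: for each point taken as a top-left corner, find the
# nearest point below and the nearest point to the right whose implied
# bottom-right corner is also an intersection. Edge connectivity is ignored.
def sketch_smallest_cells(points)
  sorted = points.sort
  known = sorted.to_h { |pt| [pt, true] }

  sorted.each_with_index.filter_map do |(x, y), i|
    rest  = sorted[(i + 1)..]
    below = rest.select { |px, _| px == x }
    right = rest.select { |_, py| py == y }
    match = below.product(right).find { |b, r| known.key?([r[0], b[1]]) }
    match && [x, y, match[1][0], match[0][1]] # bbox: x0, top, x1, bottom
  end
end

grid = [0, 10, 20].product([0, 5, 10]) # a 3x3 lattice of intersections
cells = sketch_smallest_cells(grid)
```

On a 3x3 lattice this yields the four inner cells. Points on the right column or bottom row have no cell of their own because they cannot be a top-left corner.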

#edges ⇒ Object

Runs the full pipeline and builds the refined edges.



# File 'lib/rpdfium/table/extractor.rb', line 75

def edges
  @edges ||= build_edges(@settings[:vertical_strategy],
                         @settings[:horizontal_strategy]).then do |built|
    if built.empty? && @settings[:auto_fallback] &&
       (@settings[:vertical_strategy] != :text ||
        @settings[:horizontal_strategy] != :text)
      # Fallback: auto-fallback is deliberately loose; retry everything with :text.
      build_edges(:text, :text)
    else
      built
    end
  end
end
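
The fallback rule in `edges` can be exercised in isolation by replacing `build_edges` with a stub. In the sketch below, `edges_with_fallback` and `fake_build` are hypothetical names and the stub's return values are invented; only the branching condition mirrors the method above.

```ruby
# build is any callable taking (vertical, horizontal) and returning an Array.
def edges_with_fallback(vertical, horizontal, auto_fallback:, &build)
  built = build.call(vertical, horizontal)
  already_text = vertical == :text && horizontal == :text
  return built unless built.empty? && auto_fallback && !already_text

  # Retry both axes with :text, regardless of which axis came up empty.
  build.call(:text, :text)
end

calls = []
fake_build = lambda do |v, h|
  calls << [v, h]
  v == :text && h == :text ? [:text_edge] : []
end

result = edges_with_fallback(:lines, :lines, auto_fallback: true, &fake_build)
```

Note that the retry replaces both strategies, even when only one of them was not `:text`.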

#extract(**text_opts) ⇒ Object

Extracts the data of all tables: Array<Array<Array<String>>>.



# File 'lib/rpdfium/table/extractor.rb', line 107

def extract(**text_opts)
  merged = {
    x_tolerance: @settings[:text_x_tolerance],
    y_tolerance: @settings[:text_y_tolerance],
    keep_blank_chars: @settings[:text_keep_blank_chars]
  }.merge(text_opts)

  tables.map { |t| t.extract(**merged) }
end
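
Since `extract` merges `text_opts` last, per-call options take precedence over the values stored in the settings. The standalone snippet below shows that `Hash#merge` behavior with made-up values.

```ruby
# Values as they would come from the extractor settings (sample numbers).
from_settings = { x_tolerance: 3.0, y_tolerance: 3.0, keep_blank_chars: false }

# Options passed to a single extract call.
per_call = { x_tolerance: 1.0 }

# The right-hand side of merge wins on key collisions.
merged = from_settings.merge(per_call)
# => { x_tolerance: 1.0, y_tolerance: 3.0, keep_blank_chars: false }
```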

#intersections ⇒ Object



# File 'lib/rpdfium/table/extractor.rb', line 89

def intersections
  @intersections ||= Edges.edges_to_intersections(
    edges,
    x_tolerance: @settings[:intersection_x_tolerance],
    y_tolerance: @settings[:intersection_y_tolerance]
  )
end

#tables ⇒ Object Also known as: find



# File 'lib/rpdfium/table/extractor.rb', line 101

def tables
  @tables ||= Cells.cells_to_tables(cells).map { |group| Table.new(@page, group) }
end
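
`Cells.cells_to_tables` groups cells by shared corners, and its body is not shown here. One way to picture that grouping is the sketch below, which is not the library's code. It assumes cells are `[x0, top, x1, bottom]` arrays, and `cell_corners` and `sketch_group_cells` are hypothetical names. Unlike pdfplumber, it keeps single-cell groups.

```ruby
require "set"

# The four corner points of a bbox given as [x0, top, x1, bottom].
def cell_corners(bbox)
  x0, top, x1, bottom = bbox
  [[x0, top], [x0, bottom], [x1, top], [x1, bottom]]
end

# Group cells transitively: a cell joins a group when it shares at least
# one corner with a cell already in that group.
def sketch_group_cells(cells)
  remaining = cells.dup
  groups = []
  until remaining.empty?
    group = [remaining.shift]
    seen = Set.new(cell_corners(group.first))
    loop do
      touching, remaining = remaining.partition do |cell|
        cell_corners(cell).any? { |corner| seen.include?(corner) }
      end
      break if touching.empty?

      group.concat(touching)
      touching.each { |cell| seen.merge(cell_corners(cell)) }
    end
    groups << group
  end
  groups
end

cells = [[0, 0, 10, 5], [10, 0, 20, 5], [100, 100, 110, 105]]
groups = sketch_group_cells(cells)
```

The first two cells share the corners `[10, 0]` and `[10, 5]`, so they end up in one group. The third cell is isolated and forms its own group.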