Class: Rpdfium::Table::Extractor
- Inherits:
-
Object
- Object
- Rpdfium::Table::Extractor
- Defined in:
- lib/rpdfium/table/extractor.rb
Overview
Finds tables on a page, staying faithful to `pdfplumber.TableFinder`.
Pipeline:
1. collect candidate edges for each axis, according to the strategy
(`:lines` / `:lines_strict` / `:text` / `:explicit`)
2. merge_edges (snap collinear edges + join contiguous ones)
3. filter by minimum length
4. edges_to_intersections with tolerance
5. intersections_to_cells (smallest cell for each point)
6. cells_to_tables (grouping by shared corners)
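As an illustration of step 4, the following is a minimal sketch of how intersections could be computed from edges with a tolerance. It is modeled on the general approach of pdfplumber's `edges_to_intersections`, not taken from Rpdfium's source. The method name `sketch_edges_to_intersections` and the edge Hash layout (`orientation`, `x0`, `x1`, `top`, `bottom`) are assumptions made for the example.

```ruby
# Sketch only: the edge layout below is an assumption, not Rpdfium's internal format.
#   { orientation: :v or :h, x0:, x1:, top:, bottom: }
def sketch_edges_to_intersections(edges, x_tolerance: 3.0, y_tolerance: 3.0)
  v_edges, h_edges = edges.partition { |e| e[:orientation] == :v }
  intersections = {}

  v_edges.sort_by { |e| [e[:x0], e[:top]] }.each do |v|
    h_edges.sort_by { |e| [e[:top], e[:x0]] }.each do |h|
      # The vertical edge must span the horizontal edge's y position, and its
      # x position must fall within the horizontal edge's extent (with tolerance).
      crosses = v[:top] <= h[:top] + y_tolerance &&
                v[:bottom] >= h[:top] - y_tolerance &&
                v[:x0] >= h[:x0] - x_tolerance &&
                v[:x0] <= h[:x1] + x_tolerance
      next unless crosses

      point = [v[:x0], h[:top]]
      entry = (intersections[point] ||= { v: [], h: [] })
      entry[:v] << v
      entry[:h] << h
    end
  end

  intersections
end

edges = [
  { orientation: :v, x0: 10.0, x1: 10.0, top: 0.0, bottom: 50.0 },
  { orientation: :h, x0: 9.0, x1: 100.0, top: 0.0, bottom: 0.0 },
  { orientation: :h, x0: 12.0, x1: 100.0, top: 50.0, bottom: 50.0 },
  { orientation: :h, x0: 40.0, x1: 100.0, top: 25.0, bottom: 25.0 } # too far right to cross
]

result = sketch_edges_to_intersections(edges)
p result.keys # => [[10.0, 0.0], [10.0, 50.0]]
```

The returned Hash has the same shape as the one documented for `ext.intersections`: each key is an `[x, y]` point and each value lists the vertical and horizontal edges that meet there.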
Public API:
ext = Rpdfium::Table::Extractor.new(page, **opts)
ext.tables        # => [Table, ...] (Rpdfium::Table::Table objects)
ext.extract       # => [[[String]]] (Array of tables; each table is an
                  #    Array of rows; each row is an Array of strings)
ext.find          # alias of .tables (backward compatibility with 0.2.x)
ext.edges         # refined edges
ext.intersections # Hash {[x,y] => {v:[],h:[]}}
ext.cells         # Array<bbox>
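To make the nested return shape of `extract` concrete, here is a small sketch that walks a value of that shape. The `extracted` literal is hypothetical sample data standing in for the result of `ext.extract`; no Rpdfium call is made.

```ruby
# Hypothetical sample data with the documented shape: tables -> rows -> strings.
extracted = [
  [["Name", "Qty"], ["Bolt", "12"]],
  [["Total"], ["12"]]
]

# Build one summary line per table, followed by one line per row.
lines = extracted.each_with_index.flat_map do |table, index|
  ["table #{index}: #{table.length} rows"] + table.map { |row| row.join(" | ") }
end

puts lines
```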
Constant Summary
- DEFAULTS =
{
  vertical_strategy: :lines,
  horizontal_strategy: :lines,
  explicit_vertical_lines: [],
  explicit_horizontal_lines: [],

  # Tolerances. The `_x_` / `_y_` variants inherit from the unsuffixed value.
  snap_tolerance: 3.0,
  snap_x_tolerance: nil,
  snap_y_tolerance: nil,
  join_tolerance: 3.0,
  join_x_tolerance: nil,
  join_y_tolerance: nil,
  edge_min_length: 3.0,
  edge_min_length_prefilter: 1.0,
  min_words_vertical: Edges::DEFAULT_MIN_WORDS_VERTICAL,
  min_words_horizontal: Edges::DEFAULT_MIN_WORDS_HORIZONTAL,
  intersection_tolerance: 3.0,
  intersection_x_tolerance: nil,
  intersection_y_tolerance: nil,

  # Text settings (passed to TextExtraction when .extract is called).
  # The 3.0 defaults are pdfplumber's.
  text_x_tolerance: Util::WordExtractor::DEFAULT_X_TOLERANCE,
  text_y_tolerance: Util::WordExtractor::DEFAULT_Y_TOLERANCE,
  text_keep_blank_chars: false,

  # Auto-fallback: if :lines produces no edges, retry with :text.
  # We keep the flag (it already existed in 0.2.x) but ONLY as a fallback,
  # never as a "fix" for pathological layouts. This is consistent with
  # pdfplumber, which does not have it (pdfplumber users know they must
  # choose the strategy themselves).
  auto_fallback: true
}.freeze
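The comment on the tolerance settings says the axis-specific variants inherit from the unsuffixed value. One way that inheritance could be resolved is sketched below. The helper name `resolve_tolerances` is hypothetical, and the constructor source is not shown on this page, so this should not be read as the library's actual implementation.

```ruby
# Hypothetical helper: fill nil axis-specific tolerances from the base value.
def resolve_tolerances(settings)
  resolved = settings.dup
  %i[snap join intersection].each do |prefix|
    base = resolved[:"#{prefix}_tolerance"]
    %i[x y].each do |axis|
      key = :"#{prefix}_#{axis}_tolerance"
      # An explicit value wins; only nil falls back to the base tolerance.
      resolved[key] = base if resolved[key].nil?
    end
  end
  resolved
end

settings = {
  snap_tolerance: 3.0, snap_x_tolerance: nil, snap_y_tolerance: 1.5,
  join_tolerance: 3.0, join_x_tolerance: nil, join_y_tolerance: nil,
  intersection_tolerance: 2.0, intersection_x_tolerance: nil, intersection_y_tolerance: nil
}

resolved = resolve_tolerances(settings)
p resolved[:snap_x_tolerance]         # => 3.0 (inherited)
p resolved[:snap_y_tolerance]         # => 1.5 (explicit value kept)
p resolved[:intersection_y_tolerance] # => 2.0 (inherited)
```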
- VALID_STRATEGIES =
%i[lines lines_strict text explicit].freeze
Instance Attribute Summary
-
#page ⇒ Object
readonly
Returns the value of attribute page.
-
#settings ⇒ Object
readonly
Returns the value of attribute settings.
Instance Method Summary
- #cells ⇒ Object
-
#edges ⇒ Object
Runs the full pipeline and builds the refined edges.
-
#extract(**text_opts) ⇒ Object
Extracts the data of all tables: Array<Array<Array<String>>>.
-
#initialize(page, **opts) ⇒ Extractor
constructor
A new instance of Extractor.
- #intersections ⇒ Object
- #tables ⇒ Object (also: #find)
Constructor Details
Instance Attribute Details
#page ⇒ Object (readonly)
Returns the value of attribute page.
# File 'lib/rpdfium/table/extractor.rb', line 66
def page
  @page
end
#settings ⇒ Object (readonly)
Returns the value of attribute settings.
# File 'lib/rpdfium/table/extractor.rb', line 66
def settings
  @settings
end
Instance Method Details
#cells ⇒ Object
# File 'lib/rpdfium/table/extractor.rb', line 97
def cells
  @cells ||= Cells.intersections_to_cells(intersections)
end
#edges ⇒ Object
Runs the full pipeline and builds the refined edges.
# File 'lib/rpdfium/table/extractor.rb', line 75
def edges
  @edges ||= build_edges(@settings[:vertical_strategy],
                         @settings[:horizontal_strategy]).then do |built|
    if built.empty? && @settings[:auto_fallback] &&
       (@settings[:vertical_strategy] != :text ||
        @settings[:horizontal_strategy] != :text)
      # Fallback: the auto-fallback is LOOSE; retry everything with :text.
      build_edges(:text, :text)
    else
      built
    end
  end
end
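The fallback decision in `#edges` can be exercised in isolation. The sketch below mirrors the condition shown above, using a lambda as a hypothetical stand-in for the private `build_edges` method (whose body is not shown on this page). `edges_with_fallback` is an illustrative name, not part of the library.

```ruby
# Sketch of the auto-fallback decision. `builder` is a hypothetical stand-in
# for build_edges: it takes (vertical_strategy, horizontal_strategy) and
# returns an Array of edges.
def edges_with_fallback(settings, builder)
  built = builder.call(settings[:vertical_strategy], settings[:horizontal_strategy])
  already_text = settings[:vertical_strategy] == :text &&
                 settings[:horizontal_strategy] == :text

  # Retry only when nothing was found, the flag is on, and we did not
  # already run with :text on both axes.
  return built unless built.empty? && settings[:auto_fallback] && !already_text

  builder.call(:text, :text)
end

calls = []
builder = lambda do |vertical, horizontal|
  calls << [vertical, horizontal]
  vertical == :text && horizontal == :text ? [:text_edge] : []
end

with_fallback = edges_with_fallback(
  { vertical_strategy: :lines, horizontal_strategy: :lines, auto_fallback: true }, builder
)
without_fallback = edges_with_fallback(
  { vertical_strategy: :lines, horizontal_strategy: :lines, auto_fallback: false }, builder
)

p with_fallback    # => [:text_edge]
p without_fallback # => []
```

Note that the retry replaces both strategies with `:text`, even if only one axis originally used a line-based strategy.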
#extract(**text_opts) ⇒ Object
Extracts the data of all tables: Array<Array<Array<String>>>.
# File 'lib/rpdfium/table/extractor.rb', line 107
def extract(**text_opts)
  merged = {
    x_tolerance: @settings[:text_x_tolerance],
    y_tolerance: @settings[:text_y_tolerance],
    keep_blank_chars: @settings[:text_keep_blank_chars]
  }.merge(text_opts)
  tables.map { |t| t.extract(**merged) }
end
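Because `text_opts` is merged last, options passed directly to `#extract` take precedence over the `text_*` settings given to the constructor. A minimal sketch of that precedence, using a plain Hash in place of `@settings` and a hypothetical helper name `merged_text_opts`:

```ruby
# Sketch: per-call options override constructor-level text settings.
# `settings` stands in for the extractor's @settings Hash.
def merged_text_opts(settings, **text_opts)
  {
    x_tolerance: settings[:text_x_tolerance],
    y_tolerance: settings[:text_y_tolerance],
    keep_blank_chars: settings[:text_keep_blank_chars]
  }.merge(text_opts)
end

settings = { text_x_tolerance: 3.0, text_y_tolerance: 3.0, text_keep_blank_chars: false }

defaults_only = merged_text_opts(settings)
overridden    = merged_text_opts(settings, x_tolerance: 1.0)

p defaults_only # => {x_tolerance: 3.0, y_tolerance: 3.0, keep_blank_chars: false}
p overridden[:x_tolerance] # => 1.0
```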
#intersections ⇒ Object
# File 'lib/rpdfium/table/extractor.rb', line 89
def intersections
  @intersections ||= Edges.edges_to_intersections(
    edges,
    x_tolerance: @settings[:intersection_x_tolerance],
    y_tolerance: @settings[:intersection_y_tolerance]
  )
end
#tables ⇒ Object Also known as: find
# File 'lib/rpdfium/table/extractor.rb', line 101
def tables
  @tables ||= Cells.cells_to_tables(cells).map { |group| Table.new(@page, group) }
end