Class: Rpdfium::Table::Table
- Inherits: Object
  - Object
  - Rpdfium::Table::Table
- Defined in: lib/rpdfium/table/table.rb
Overview
Represents a table found on a page. Exposes cells, rows, columns, bbox, and the `extract` method, which returns the textual data.
Each cell is a bbox `[x0, top, x1, bottom]` (top-down). A “row” is the group of cells sharing the same `top`. A “column” is the group sharing the same `x0`.
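To make these definitions concrete, here is a minimal sketch that groups plain bbox arrays the way the overview describes. The cell coordinates are invented for illustration, and `group_by` stands in for the library's own grouping helper, whose implementation is not shown on this page.

```ruby
# Hypothetical cells for a 2x2 grid, each as [x0, top, x1, bottom] (top-down).
cells = [
  [10.0, 20.0, 60.0, 35.0],
  [60.0, 20.0, 110.0, 35.0],
  [10.0, 35.0, 60.0, 50.0],
  [60.0, 35.0, 110.0, 50.0]
]

# A "row" groups the cells that share the same top (index 1).
rows_by_top = cells.group_by { |c| c[1] }

# A "column" groups the cells that share the same x0 (index 0).
columns_by_x0 = cells.group_by { |c| c[0] }

puts rows_by_top.keys.inspect    # the distinct top values, one per row
puts columns_by_x0.keys.inspect  # the distinct x0 values, one per column
```

Grouping by exact equality works here because the sample coordinates are clean; whether the library applies any tolerance when comparing `top` or `x0` is not stated in this documentation.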
Instance Attribute Summary
- #cells ⇒ Object (readonly): Returns the value of attribute cells.
- #page ⇒ Object (readonly): Returns the value of attribute page.
Instance Method Summary
- #bbox ⇒ Object
- #columns ⇒ Object
- #extract(x_tolerance: Util::WordExtractor::DEFAULT_X_TOLERANCE, y_tolerance: Util::WordExtractor::DEFAULT_Y_TOLERANCE, keep_blank_chars: false, cell_padding: 0.0) ⇒ Object: Extract data: Array<Array<String>>.
- #initialize(page, cells) ⇒ Table (constructor): A new instance of Table.
- #rows ⇒ Object: Returns the rows as Array<Array<bbox|nil>>.
Constructor Details
#initialize(page, cells) ⇒ Table
Returns a new instance of Table.
    # File 'lib/rpdfium/table/table.rb', line 14
    def initialize(page, cells)
      @page = page
      @cells = cells
    end
Instance Attribute Details
#cells ⇒ Object (readonly)
Returns the value of attribute cells.
    # File 'lib/rpdfium/table/table.rb', line 12
    def cells
      @cells
    end
#page ⇒ Object (readonly)
Returns the value of attribute page.
    # File 'lib/rpdfium/table/table.rb', line 12
    def page
      @page
    end
Instance Method Details
#bbox ⇒ Object
    # File 'lib/rpdfium/table/table.rb', line 19
    def bbox
      @cells.each_with_object(
        [Float::INFINITY, Float::INFINITY, -Float::INFINITY, -Float::INFINITY]
      ) do |c, acc|
        acc[0] = c[0] if c[0] < acc[0]
        acc[1] = c[1] if c[1] < acc[1]
        acc[2] = c[2] if c[2] > acc[2]
        acc[3] = c[3] if c[3] > acc[3]
      end
    end
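The method has no description, so a short illustration may help: it folds all cells into the smallest bbox that contains them. The sketch below copies the same folding logic into a standalone function so it runs without a page; `union_bbox` is a hypothetical name and the cells are invented.

```ruby
# Same fold as #bbox, taking the cells as an argument instead of @cells.
# `union_bbox` is a hypothetical name used only for this illustration.
def union_bbox(cells)
  cells.each_with_object(
    [Float::INFINITY, Float::INFINITY, -Float::INFINITY, -Float::INFINITY]
  ) do |c, acc|
    acc[0] = c[0] if c[0] < acc[0] # smallest x0
    acc[1] = c[1] if c[1] < acc[1] # smallest top
    acc[2] = c[2] if c[2] > acc[2] # largest x1
    acc[3] = c[3] if c[3] > acc[3] # largest bottom
  end
end

cells = [[10.0, 20.0, 60.0, 35.0], [60.0, 35.0, 110.0, 50.0]]
p union_bbox(cells) # => [10.0, 20.0, 110.0, 50.0]
```

One consequence of this fold is that an empty cell list yields the untouched seed, `[Infinity, Infinity, -Infinity, -Infinity]`, so callers may want to check for empty tables before using the result.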
#columns ⇒ Object
    # File 'lib/rpdfium/table/table.rb', line 37
    def columns
      rows_or_columns(:col)
    end
#extract(x_tolerance: Util::WordExtractor::DEFAULT_X_TOLERANCE, y_tolerance: Util::WordExtractor::DEFAULT_Y_TOLERANCE, keep_blank_chars: false, cell_padding: 0.0) ⇒ Object
Extracts the table data as Array<Array<String>>; a missing (nil) cell yields nil instead of a String. For each cell of each row, the method keeps the page chars whose MIDPOINT lies within the cell’s bbox, then reconstructs the text via Util::TextExtraction (which in turn goes through WordExtractor).
This is the pdfplumber.Table.extract path: for each row it first filters the row’s chars (an optimization: nearly all chars from the other rows are discarded immediately), then for each cell it filters again within the cell’s sub-bbox.
Optimization over the naïve path: the chars are sorted by their vertical midpoint only once; for each row, bsearch then finds the candidate chars in O(log n) instead of scanning the whole array in O(n).
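The sort-once, search-per-row idea can be sketched in isolation. The chars below are invented hashes carrying only the fields the window needs, and `row_window` is a hypothetical helper name; the two `bsearch_index` calls mirror the ones in the method source.

```ruby
# Invented chars: only :top and :bottom matter for the row window.
chars = [
  { text: "c", top: 40.0, bottom: 50.0 },
  { text: "a", top: 10.0, bottom: 20.0 },
  { text: "b", top: 22.0, bottom: 32.0 },
  { text: "d", top: 41.0, bottom: 51.0 }
]

# One-time O(n log n) sort by vertical midpoint, plus a parallel array of
# midpoints that the binary searches run over.
sorted = chars.sort_by { |c| (c[:top] + c[:bottom]) / 2.0 }
vmids  = sorted.map { |c| (c[:top] + c[:bottom]) / 2.0 }

# Chars whose vertical midpoint falls in [top, bottom), located with two
# binary searches instead of a full scan. Hypothetical helper name.
def row_window(sorted, vmids, top, bottom)
  lo = vmids.bsearch_index { |v| v >= top } || sorted.size
  hi = vmids.bsearch_index { |v| v >= bottom } || sorted.size
  sorted[lo...hi]
end

p row_window(sorted, vmids, 38.0, 52.0).map { |c| c[:text] } # => ["c", "d"]
```

The `|| sorted.size` fallback matters: `bsearch_index` returns nil when no midpoint satisfies the block, which is exactly the case for the last row of a page.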
NOTE on the :text strategy: `words_to_edges_h` emits, by design, TWO edges per row (the top and bottom of the cluster bbox). A table detected by the text strategy will therefore have “real” rows interleaved with “empty” rows between the bottom edge of row N and the top edge of row N+1. This is identical to pdfplumber’s behavior. The caller may drop the empty rows via `result.reject { |row| row.all?(&:empty?) }`.
`cell_padding`: extends each cell’s bbox toward the left and toward the top by N points. The default, 0, is identical to pdfplumber’s behavior. It is useful for PDFs where chars protrude slightly past the cell border. For example, the uppercase “I” of the “Intermediario” cell in a CR Banca d’Italia form has x0=24.0 but the cell border is at x=25.6, so the midpoint filter discards it and the output is “ntermediario:”. With `cell_padding: 2.0` the cell becomes [23.6, …, 100, …] and the “I” is captured.
Padding is applied only to the “inner-left” and “inner-top” borders to avoid duplicating chars shared between adjacent cells: a char on the boundary between cell A and cell B would end up in both if both were padded on all sides.
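The effect of left/top padding on the midpoint filter can be sketched as follows. Both helpers are hypothetical stand-ins for the private `pad_cell_bbox` and `char_in_bbox?` methods, whose bodies are not shown on this page; the half-open comparisons are an assumption modeled on the description above. The cell's x0 of 25.6 and the char's x0 of 24.0 come from the example in the text, while the char's x1 and all vertical coordinates are invented.

```ruby
# Hypothetical stand-in for pad_cell_bbox: grow the bbox left and up only.
def pad_left_top(bbox, padding)
  x0, top, x1, bottom = bbox
  [x0 - padding, top - padding, x1, bottom]
end

# Hypothetical stand-in for char_in_bbox?: true when the char's midpoint
# lies inside the bbox (assumed half-open on the right and bottom edges).
def midpoint_inside?(char, bbox)
  x0, top, x1, bottom = bbox
  h_mid = (char[:x0] + char[:x1]) / 2.0
  v_mid = (char[:top] + char[:bottom]) / 2.0
  h_mid >= x0 && h_mid < x1 && v_mid >= top && v_mid < bottom
end

cell     = [25.6, 100.0, 100.0, 112.0]
letter_i = { text: "I", x0: 24.0, x1: 26.8, top: 101.0, bottom: 111.0 }

puts midpoint_inside?(letter_i, cell)                    # false
puts midpoint_inside?(letter_i, pad_left_top(cell, 2.0)) # true
```

Because the right and bottom edges stay where they were, a char near the shared border of two adjacent cells can still match at most one of them horizontally, which is the point of padding only two sides.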
    # File 'lib/rpdfium/table/table.rb', line 76
    def extract(x_tolerance: Util::WordExtractor::DEFAULT_X_TOLERANCE,
                y_tolerance: Util::WordExtractor::DEFAULT_Y_TOLERANCE,
                keep_blank_chars: false,
                cell_padding: 0.0)
      # `geometry: true`: the strongest lean mode; on top of skipping
      # font/weight/angle/hyphen/unicode-error it also drops the per-char
      # origin read and emits a minimal hash. It keeps only the fields the
      # table/word pipeline reads, cutting both FFI roundtrips and hash
      # allocation. On tables with thousands of chars this is the dominant
      # cost of extract_tables. See Page#chars.
      chars = @page.chars(lean: true, geometry: true)

      # Sort by vertical midpoint once; build a parallel array of vmid
      # for bsearch. Cost: O(n log n) one-time.
      sorted_chars = chars.sort_by { |c| (c[:top] + c[:bottom]) / 2.0 }
      vmids = sorted_chars.map { |c| (c[:top] + c[:bottom]) / 2.0 }

      # Instantiate WordExtractor ONCE and reuse it for all cells
      # (a table may have dozens of cells; avoid allocations).
      word_extractor = Util::WordExtractor.new(
        x_tolerance: x_tolerance,
        y_tolerance: y_tolerance,
        keep_blank_chars: keep_blank_chars
      )

      all_rows = rows
      all_rows.map do |row|
        row_bbox = row_bounding_box(row)
        lo = vmids.bsearch_index { |v| v >= row_bbox[1] - cell_padding } || sorted_chars.size
        hi = vmids.bsearch_index { |v| v >= row_bbox[3] } || sorted_chars.size
        row_chars = sorted_chars[lo...hi]

        row.map do |cell|
          next nil if cell.nil?

          padded = cell_padding.zero? ? cell : pad_cell_bbox(cell, cell_padding)
          cell_chars = row_chars.select { |c| char_in_bbox?(c, padded) }
          if cell_chars.empty?
            ""
          else
            extract_text_with(cell_chars, word_extractor, y_tolerance)
          end
        end
      end
    end
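As a follow-up to the NOTE on the :text strategy, the sketch below drops the interleaved empty rows from a result. The `data` array is invented to resemble what `extract` returns and is not real output. Since the source maps a nil cell to nil, the filter here is a nil-safe variant of the one quoted in the note: calling `empty?` on nil raises NoMethodError.

```ruby
# Invented data shaped like an #extract result for a text-strategy table:
# real rows interleaved with empty ones, plus one nil for a missing cell.
data = [
  ["Name", "Amount"],
  ["", ""],
  ["Alice", "10"],
  ["", nil],
  ["Bob", ""]
]

# Treat both nil and "" as blank, and drop rows made only of blanks.
filled = data.reject { |row| row.all? { |cell| cell.nil? || cell.empty? } }

p filled # => [["Name", "Amount"], ["Alice", "10"], ["Bob", ""]]
```

Rows that are only partly blank, such as the last one, are kept, so column positions stay aligned across the remaining rows.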
#rows ⇒ Object
Returns the rows as Array<Array<bbox|nil>>. The “missing” cells in a row (e.g. because the table has an irregular topology) are represented as nil, consistent with pdfplumber.
    # File 'lib/rpdfium/table/table.rb', line 33
    def rows
      rows_or_columns(:row)
    end
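The body of `rows_or_columns` is not shown on this page, so the following is only a sketch of how nil placeholders could arise, under the assumption that each row is aligned against the set of all distinct `x0` values in the table. `rows_with_gaps` is a hypothetical name and the cells are invented.

```ruby
# Hypothetical reconstruction: align every row against all distinct x0
# values, inserting nil where a row has no cell starting at that x0.
def rows_with_gaps(cells)
  xs = cells.map { |c| c[0] }.uniq.sort
  grouped = cells.group_by { |c| c[1] }          # top => cells in that row
  grouped.sort_by { |top, _| top }.map do |_top, row_cells|
    by_x0 = row_cells.to_h { |c| [c[0], c] }     # x0 => cell
    xs.map { |x| by_x0[x] }                      # nil when the cell is missing
  end
end

# Invented irregular table: the second row lacks its right-hand cell.
cells = [
  [10.0, 20.0, 60.0, 35.0],
  [60.0, 20.0, 110.0, 35.0],
  [10.0, 35.0, 60.0, 50.0]
]

rows_with_gaps(cells).each { |row| p row }
```

With this shape every row has the same length, which is what lets `extract` return a rectangular array even for irregular tables.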