Class: Rpdfium::Table::Table

Inherits:
Object
Defined in:
lib/rpdfium/table/table.rb

Overview

Represents a table found on a page. Exposes cells, rows, columns, bbox, and the `extract` method that returns the textual data.

Each cell is a bbox `[x0, top, x1, bottom]` (top-down). A “row” is the group of cells sharing the same `top`. A “column” is the group sharing the same `x0`.
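The grouping rule above can be sketched in plain Ruby. This is only an illustration on hypothetical cell coordinates; it does not reproduce the library's private `rows_or_columns` helper, whose implementation is not shown here.

```ruby
# Hypothetical cells as [x0, top, x1, bottom] in top-down coordinates.
cells = [
  [0.0, 0.0, 50.0, 10.0], [50.0, 0.0, 100.0, 10.0],
  [0.0, 10.0, 50.0, 20.0], [50.0, 10.0, 100.0, 20.0]
]

# A row is keyed by `top` (index 1); a column is keyed by `x0` (index 0).
rows_by_top = cells.group_by { |c| c[1] }
cols_by_x0  = cells.group_by { |c| c[0] }

# Sorted keys give the reading order: top to bottom, left to right.
row_tops  = rows_by_top.keys.sort
col_lefts = cols_by_x0.keys.sort
```

Here each of the two rows and each of the two columns ends up holding two cells.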

Constructor Details

#initialize(page, cells) ⇒ Table

Returns a new instance of Table.



# File 'lib/rpdfium/table/table.rb', line 14

def initialize(page, cells)
  @page = page
  @cells = cells
end

Instance Attribute Details

#cells ⇒ Object (readonly)

Returns the value of attribute cells.



# File 'lib/rpdfium/table/table.rb', line 12

def cells
  @cells
end

#page ⇒ Object (readonly)

Returns the value of attribute page.



# File 'lib/rpdfium/table/table.rb', line 12

def page
  @page
end

Instance Method Details

#bbox ⇒ Object



# File 'lib/rpdfium/table/table.rb', line 19

def bbox
  @cells.each_with_object(
    [Float::INFINITY, Float::INFINITY, -Float::INFINITY, -Float::INFINITY]
  ) do |c, acc|
    acc[0] = c[0] if c[0] < acc[0]
    acc[1] = c[1] if c[1] < acc[1]
    acc[2] = c[2] if c[2] > acc[2]
    acc[3] = c[3] if c[3] > acc[3]
  end
end
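
As a worked example, the same fold can be run on hypothetical cells outside the class. `union_bbox` is an illustrative name, not part of the library; the body mirrors the listing above.

```ruby
# Standalone copy of the fold: smallest bbox enclosing every cell.
def union_bbox(cells)
  cells.each_with_object(
    [Float::INFINITY, Float::INFINITY, -Float::INFINITY, -Float::INFINITY]
  ) do |c, acc|
    acc[0] = c[0] if c[0] < acc[0] # leftmost x0
    acc[1] = c[1] if c[1] < acc[1] # smallest top
    acc[2] = c[2] if c[2] > acc[2] # rightmost x1
    acc[3] = c[3] if c[3] > acc[3] # largest bottom
  end
end

# Hypothetical cells as [x0, top, x1, bottom].
cells = [
  [10.0, 20.0, 50.0, 30.0],
  [50.0, 20.0, 90.0, 30.0],
  [10.0, 30.0, 50.0, 45.0]
]

table_bbox = union_bbox(cells)
empty_bbox = union_bbox([])
```

With no cells the accumulator is returned untouched, so the result is the infinite sentinel bbox rather than nil; a caller that may hold an empty table would need to check for that.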

#columns ⇒ Object



# File 'lib/rpdfium/table/table.rb', line 37

def columns
  rows_or_columns(:col)
end

#extract(x_tolerance: Util::WordExtractor::DEFAULT_X_TOLERANCE, y_tolerance: Util::WordExtractor::DEFAULT_Y_TOLERANCE, keep_blank_chars: false, cell_padding: 0.0) ⇒ Object

Extracts the table data as Array<Array<String>> (a missing cell yields nil instead of a String). For each row and each cell, it filters the page chars whose MIDPOINT lies within the cell’s bbox, then reconstructs the text via Util::TextExtraction (which in turn goes through WordExtractor).

This is the pdfplumber.Table.extract path — for each row it first filters the row’s chars (optimization: nearly all chars from the other rows are discarded immediately), then for each cell filters again within the sub-bbox.

Optimization over the naïve path: the chars are sorted by their vertical midpoint only once; for each row bsearch is used to find the candidate chars in O(log n) instead of scanning the whole array O(n) for every row.
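
The sort-once, bsearch-per-row idea can be sketched on hypothetical data. The char hashes, the `row_window` helper, and the coordinates below are assumptions for illustration; only `:top` and `:bottom` matter for the vertical window.

```ruby
# Hypothetical chars; only :top and :bottom are used here.
chars = [
  { text: "c", top: 22.0, bottom: 30.0 },
  { text: "a", top: 2.0,  bottom: 10.0 },
  { text: "b", top: 12.0, bottom: 20.0 }
]

vmid = ->(c) { (c[:top] + c[:bottom]) / 2.0 }

# One-time O(n log n) sort, plus a parallel array of midpoints.
sorted = chars.sort_by(&vmid)
vmids  = sorted.map(&vmid)

# Half-open index window of chars whose midpoint falls in
# [row_top, row_bottom); two O(log n) searches per row.
def row_window(vmids, row_top, row_bottom)
  lo = vmids.bsearch_index { |v| v >= row_top } || vmids.size
  hi = vmids.bsearch_index { |v| v >= row_bottom } || vmids.size
  lo...hi
end

row_chars  = sorted[row_window(vmids, 10.0, 20.0)]
below_page = sorted[row_window(vmids, 30.0, 40.0)]
```

`bsearch_index` returns nil when no element satisfies the block, which is why both bounds fall back to the array size; a row below every char then yields an empty slice.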

NOTE on the :text strategy: `words_to_edges_h` emits, by design, TWO edges per row (the top and bottom of the cluster bbox). A table detected by the text strategy will therefore have “real” rows interleaved with “empty” rows spanning the gap between the bottom edge of row N and the top edge of row N+1. This is identical to pdfplumber’s behavior. The caller may filter via `result.reject { |row| row.all?(&:empty?) }` if it wants to drop them.

`cell_padding`: extends each cell’s bbox toward the left and toward the top by N points. The default of 0 reproduces pdfplumber’s behavior exactly. It is useful for PDFs where chars protrude slightly past the cell border. For example, the uppercase “I” of the “Intermediario” cell in a CR Banca d’Italia form has x0=24.0 while the cell border is at x=25.6, so the midpoint filter discards it and the output is “ntermediario:”. With `cell_padding: 2.0` the cell becomes [23.6, …, 100, …] and the “I” is captured.

Padding is applied only to the “inner-left” and “inner-top” borders to avoid duplicating chars shared between adjacent cells: a char lying between cell A and cell B would end up in both if both were padded on all sides.
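
A minimal sketch of left/top padding combined with a midpoint test follows. Both helpers are hypothetical stand-ins for the private `pad_cell_bbox` and `char_in_bbox?` methods, whose exact boundary handling is not shown here. Only x0=24.0 and the border at 25.6 come from the example above; the char's x1 and all vertical coordinates are assumed.

```ruby
# Hypothetical helper: extend the bbox toward the left and top only.
def pad_left_top(bbox, padding)
  x0, top, x1, bottom = bbox
  [x0 - padding, top - padding, x1, bottom]
end

# Hypothetical midpoint test (boundary inclusivity is an assumption).
def midpoint_inside?(char, bbox)
  h_mid = (char[:x0] + char[:x1]) / 2.0
  v_mid = (char[:top] + char[:bottom]) / 2.0
  h_mid >= bbox[0] && h_mid < bbox[2] &&
    v_mid >= bbox[1] && v_mid < bbox[3]
end

cell   = [25.6, 10.0, 100.0, 20.0]
char_i = { text: "I", x0: 24.0, x1: 26.0, top: 11.0, bottom: 19.0 }

# Horizontal midpoint is 25.0: left of the 25.6 border, so it is dropped.
without_padding = midpoint_inside?(char_i, cell)

# After padding by 2.0 the left border moves to about 23.6 and the char fits.
with_padding = midpoint_inside?(char_i, pad_left_top(cell, 2.0))
```

Because the right and bottom borders stay where they are, a char straddling two adjacent cells can still match at most one of them under this scheme.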



# File 'lib/rpdfium/table/table.rb', line 76

def extract(x_tolerance: Util::WordExtractor::DEFAULT_X_TOLERANCE,
            y_tolerance: Util::WordExtractor::DEFAULT_Y_TOLERANCE,
            keep_blank_chars: false,
            cell_padding: 0.0)
  # `geometry: true`: the strongest lean mode — on top of skipping
  # font/weight/angle/hyphen/unicode-error it also drops the per-char
  # origin read and emits a minimal hash. It keeps only the fields the
  # table/word pipeline reads, cutting both FFI roundtrips and hash
  # allocation. On tables with thousands of chars this is the dominant
  # cost of extract_tables. See Page#chars.
  chars = @page.chars(lean: true, geometry: true)

  # Sort by vertical midpoint once; build a parallel array of vmid
  # for bsearch. Cost: O(n log n) one-time.
  sorted_chars = chars.sort_by { |c| (c[:top] + c[:bottom]) / 2.0 }
  vmids = sorted_chars.map { |c| (c[:top] + c[:bottom]) / 2.0 }

  # Instantiate WordExtractor ONCE and reuse it for all cells
  # (a table may have dozens of cells; avoid allocations).
  word_extractor = Util::WordExtractor.new(
    x_tolerance: x_tolerance,
    y_tolerance: y_tolerance,
    keep_blank_chars: keep_blank_chars
  )

  all_rows = rows
  all_rows.map do |row|
    row_bbox = row_bounding_box(row)
    lo = vmids.bsearch_index { |v| v >= row_bbox[1] - cell_padding } || sorted_chars.size
    hi = vmids.bsearch_index { |v| v >= row_bbox[3] } || sorted_chars.size
    row_chars = sorted_chars[lo...hi]

    row.map do |cell|
      next nil if cell.nil?

      padded = cell_padding.zero? ? cell : pad_cell_bbox(cell, cell_padding)
      cell_chars = row_chars.select { |c| char_in_bbox?(c, padded) }
      if cell_chars.empty?
        ""
      else
        extract_text_with(cell_chars, word_extractor, y_tolerance)
      end
    end
  end
end
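
Since the listing above returns nil for a missing cell, a caller that drops the interleaved empty rows of the :text strategy may prefer a nil-safe filter, because calling `empty?` on nil raises NoMethodError. A sketch on a hypothetical `result` array:

```ruby
# Hypothetical output of Table#extract: strings, with nil for missing cells.
result = [
  ["Name", "Amount"],
  ["", ""],
  ["Alice", nil],
  ["", nil]
]

# Drop rows in which every cell is either missing or an empty string.
non_empty = result.reject do |row|
  row.all? { |cell| cell.nil? || cell.empty? }
end
```

The row `["Alice", nil]` is kept because at least one of its cells carries text.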

#rows ⇒ Object

Returns the rows as Array<Array<bbox|nil>>. The “missing” cells in a row (e.g. because the table has an irregular topology) are represented as nil — consistent with pdfplumber.



# File 'lib/rpdfium/table/table.rb', line 33

def rows
  rows_or_columns(:row)
end