Module: Rpdfium::Table::Cells
- Defined in:
- lib/rpdfium/table/cells.rb
Overview
Builds cells from intersections and tables from cells. Algorithms 1:1 with pdfplumber.intersections_to_cells and pdfplumber.cells_to_tables.
Class Method Summary collapse
-
.cells_to_tables(cells) ⇒ Object
Groups cells into tables based on shared corners.
-
.intersections_to_cells(intersections) ⇒ Object
“Smallest cell” search for each intersection: given a point ‘pt = (x, y)`, find the minimal rectangle whose 4 corners are intersections and whose 4 sides have edges connecting them.
Class Method Details
.cells_to_tables(cells) ⇒ Object
Groups cells into tables based on shared corners.
Algorithm: Union-Find (disjoint set) on the corners — O(n α(n)) instead of pdfplumber’s greedy fixed-point O(n²). The result is identical: two cells end up in the same group if they share at least one corner.
Final filter: discard tables with a SINGLE cell (noise).
104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 |
# File 'lib/rpdfium/table/cells.rb', line 104 def cells_to_tables(cells) return [] if cells.empty? n = cells.size parent = Array.new(n) { |i| i } find = lambda do |i| i = parent[i] = parent[parent[i]] while parent[i] != i i end union = ->(a, b) { parent[find.call(a)] = find.call(b) } # For each corner, collect the indices of the cells that share it # and union them into the same component. corner_to_cells = Hash.new { |h, k| h[k] = [] } cells.each_with_index do |cell, idx| x0, top, x1, bottom = cell [[x0, top], [x0, bottom], [x1, top], [x1, bottom]].each do |corner| corner_to_cells[corner] << idx end end corner_to_cells.each_value do |idxs| idxs.each_cons(2) { |a, b| union.call(a, b) } end # Group by the Union-Find root groups = Hash.new { |h, k| h[k] = [] } cells.each_with_index { |cell, i| groups[find.call(i)] << cell } # Sort top-to-bottom, left-to-right; filter out single-cell. groups.values .sort_by { |t| t.map { |c| [c[1], c[0]] }.min } .reject { |t| t.size <= 1 } end |
.intersections_to_cells(intersections) ⇒ Object
“Smallest cell” search for each intersection: given a point ‘pt = (x, y)`, find the minimal rectangle whose 4 corners are intersections and whose 4 sides have edges connecting them.
The “edge connect” constraint is crucial: two intersections with the same x are not enough — they must SHARE at least one vertical edge (i.e. belong to the same continuous segment). Likewise horizontally. This avoids false positives such as “two distant columns accidentally aligned”.
‘intersections` is the Hash produced by Edges.edges_to_intersections, with keys `[x, y]` and values `{ v: [edges…], h: [edges…] }`.
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 |
# File 'lib/rpdfium/table/cells.rb', line 24 def intersections_to_cells(intersections) return [] if intersections.empty? # Adjacency indices: for each edge (a Hash object, ruby # identity), which intersection points does it contain? # Pdfplumber does this by comparing the bbox of the edges — we # have direct access to the edge objects inside # `intersections[pt]`, so it suffices to use identity. For "same # edge" we use `equal?` (object identity). edge_ids = intersections.transform_values do |val| { v: val[:v].map(&:object_id).to_set, h: val[:h].map(&:object_id).to_set } end edge_connects = lambda do |p1, p2| if p1[0] == p2[0] return !(edge_ids[p1][:v] & edge_ids[p2][:v]).empty? end if p1[1] == p2[1] return !(edge_ids[p1][:h] & edge_ids[p2][:h]).empty? end false end points = intersections.keys.sort npoints = points.size # Spatial indices: precompute points by column (same x) and by # row (same y), already ordered because `points` is sorted. # Allows O(log n) lookup via bsearch instead of O(n) via select. by_x = Hash.new { |h, k| h[k] = [] } by_y = Hash.new { |h, k| h[k] = [] } points.each { |p| by_x[p[0]] << p; by_y[p[1]] << p } cells = [] points.each_with_index do |pt, i| next if i == npoints - 1 # Points directly below `pt` (same x, greater y) col = by_x[pt[0]] below_start = col.bsearch_index { |q| q[1] > pt[1] } || col.size below = col[below_start..] # Points directly to the right of `pt` (same y, greater x) row_pts = by_y[pt[1]] right_start = row_pts.bsearch_index { |q| q[0] > pt[0] } || row_pts.size right = row_pts[right_start..] # Find the FIRST (== smallest, due to ordering) bottom-right # whose 4 corners are present and whose edges connect. found = nil below.each do |b| next unless edge_connects.call(pt, b) right.each do |r| next unless edge_connects.call(pt, r) br = [r[0], b[1]] next unless intersections.key?(br) next unless edge_connects.call(br, r) next unless edge_connects.call(br, b) found = [pt[0], pt[1], br[0], br[1]] break end break if found end cells << found if found end cells end |