Module: Rpdfium::Table::Edges
- Defined in:
- lib/rpdfium/table/edges.rb
Overview
Operations on edges (horizontal/vertical segments) used by the TableFinder. A direct mapping onto ‘pdfplumber/table.py`.
Internal conventions (aligned with pdfplumber):
- Each edge is a Hash with :orientation ("v" | "h"),
:x0, :x1, :top, :bottom (in top-down coordinates).
- Horizontal edge: top == bottom, x0 < x1.
- Vertical edge: x0 == x1, top < bottom.
The edges can come from:
- vector lines of the PDF (path segments)
- rectangles (decomposed into 4 sides)
- "implicit" lines inferred from word alignment (:text strategy)
- lines specified by the user (:explicit strategy)
Constant Summary collapse
- DEFAULT_MIN_WORDS_VERTICAL =
words → edges (strategia :text)
3- DEFAULT_MIN_WORDS_HORIZONTAL =
1
Class Method Summary collapse
-
.edges_to_intersections(edges, x_tolerance: 1.0, y_tolerance: 1.0) ⇒ Object
For each (h, v) pair that intersects within tolerance, it records an intersection ‘(v.x0, h.top)` with pointers to the source edges.
-
.filter_edges(edges, orientation: nil, min_length: 1.0) ⇒ Object
Filters out edges that are too short.
-
.join_edge_group(edges, orientation, tolerance: 3.0) ⇒ Object
Join: given a group of edges on the same infinite line (same top for horizontals, same x0 for verticals), merges those whose endpoints are within ‘tolerance`.
-
.merge_edges(edges, snap_x_tolerance: 3.0, snap_y_tolerance: 3.0, join_x_tolerance: 3.0, join_y_tolerance: 3.0) ⇒ Object
Complete pipeline: snap + join.
- .move_to_avg(cluster, orientation) ⇒ Object
-
.snap_edges(edges, x_tolerance: 3.0, y_tolerance: 3.0) ⇒ Object
Snap: cluster of near-collinear edges → common average coordinate.
-
.words_to_edges_h(words, word_threshold: DEFAULT_MIN_WORDS_HORIZONTAL) ⇒ Object
For each cluster of words aligned “at the top” (same top, within tol=1) with at least ‘word_threshold` members, it emits TWO horizontal edges (top and bottom of that cluster’s bbox).
-
.words_to_edges_v(words, word_threshold: DEFAULT_MIN_WORDS_VERTICAL) ⇒ Object
Three clusters of words by x: x0, x1, centerpoint.
Class Method Details
.edges_to_intersections(edges, x_tolerance: 1.0, y_tolerance: 1.0) ⇒ Object
For each (h, v) pair that intersects within tolerance, it records an intersection ‘(v.x0, h.top)` with pointers to the source edges. The value in `intersections[(x, y)] = { v: […], h: […] }` then allows the cell-builder to verify “edge connect”.
Optimization over the naïve O(|v|×|h|) loop: sorted_h is ordered by top; for each vertical edge a bsearch is used to find the first candidate h and the loop exits as soon as h exceeds v + y_tolerance, reducing the iterations to only the vertically relevant subset.
196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 |
# File 'lib/rpdfium/table/edges.rb', line 196 def edges_to_intersections(edges, x_tolerance: 1.0, y_tolerance: 1.0) v_edges, h_edges = edges.partition { |e| e[:orientation] == "v" } intersections = {} sorted_v = v_edges.sort_by { |v| [v[:x0], v[:top]] } sorted_h = h_edges.sort_by { |h| [h[:top], h[:x0]] } h_tops = sorted_h.map { |h| h[:top] } sorted_v.each do |v| v_top_min = v[:top] - y_tolerance v_top_max = v[:bottom] + y_tolerance # Skip all the h whose top is still below the vertical window. start_idx = h_tops.bsearch_index { |t| t >= v_top_min } || sorted_h.size sorted_h[start_idx..].each do |h| # The remaining h are beyond the window: exit immediately. break if h[:top] > v_top_max next unless v[:x0] >= h[:x0] - x_tolerance next unless v[:x0] <= h[:x1] + x_tolerance key = [v[:x0], h[:top]] entry = intersections[key] ||= { v: [], h: [] } entry[:v] << v entry[:h] << h end end intersections end |
.filter_edges(edges, orientation: nil, min_length: 1.0) ⇒ Object
Filters out edges that are too short.
94 95 96 97 98 99 100 101 102 103 104 105 |
# File 'lib/rpdfium/table/edges.rb', line 94 def filter_edges(edges, orientation: nil, min_length: 1.0) edges.reject do |e| next true if orientation && e[:orientation] != orientation length = if e[:orientation] == "h" e[:x1] - e[:x0] else e[:bottom] - e[:top] end length < min_length end end |
.join_edge_group(edges, orientation, tolerance: 3.0) ⇒ Object
Join: given a group of edges on the same infinite line (same top for horizontals, same x0 for verticals), merges those whose endpoints are within ‘tolerance`.
Exact match of pdfplumber.join_edge_group behavior: iterates sorted by minprop, extends the “current” if overlap/contiguity is within tolerance, otherwise opens a new current.
53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 |
# File 'lib/rpdfium/table/edges.rb', line 53 def join_edge_group(edges, orientation, tolerance: 3.0) return [] if edges.empty? min_prop, max_prop = orientation == "h" ? [:x0, :x1] : [:top, :bottom] sorted = edges.sort_by { |e| e[min_prop] } joined = [sorted.first.dup] sorted[1..].each do |e| last = joined.last if e[min_prop] <= last[max_prop] + tolerance last[max_prop] = e[max_prop] if e[max_prop] > last[max_prop] else joined << e.dup end end joined end |
.merge_edges(edges, snap_x_tolerance: 3.0, snap_y_tolerance: 3.0, join_x_tolerance: 3.0, join_y_tolerance: 3.0) ⇒ Object
Complete pipeline: snap + join. Faithful to pdfplumber.merge_edges.
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 |
# File 'lib/rpdfium/table/edges.rb', line 73 def merge_edges(edges, snap_x_tolerance: 3.0, snap_y_tolerance: 3.0, join_x_tolerance: 3.0, join_y_tolerance: 3.0) if snap_x_tolerance.positive? || snap_y_tolerance.positive? edges = snap_edges(edges, x_tolerance: snap_x_tolerance, y_tolerance: snap_y_tolerance) end # Group by (orientation, "line value") # h → top, v → x0 groups = edges.group_by do |e| e[:orientation] == "h" ? ["h", e[:top]] : ["v", e[:x0]] end groups.flat_map do |(orient, _key), group| tol = orient == "h" ? join_x_tolerance : join_y_tolerance join_edge_group(group, orient, tolerance: tol) end end |
.move_to_avg(cluster, orientation) ⇒ Object
35 36 37 38 39 40 41 42 43 44 |
# File 'lib/rpdfium/table/edges.rb', line 35 def move_to_avg(cluster, orientation) case orientation when "h" mean = cluster.sum { |e| e[:top] } / cluster.size.to_f cluster.map { |e| e.merge(top: mean, bottom: mean) } when "v" mean = cluster.sum { |e| e[:x0] } / cluster.size.to_f cluster.map { |e| e.merge(x0: mean, x1: mean) } end end |
.snap_edges(edges, x_tolerance: 3.0, y_tolerance: 3.0) ⇒ Object
Snap: cluster of near-collinear edges → common average coordinate. For horizontals it snaps the ‘top` (== `bottom`); for verticals the `x0`.
25 26 27 28 29 30 31 32 33 |
# File 'lib/rpdfium/table/edges.rb', line 25 def snap_edges(edges, x_tolerance: 3.0, y_tolerance: 3.0) v_edges, h_edges = edges.partition { |e| e[:orientation] == "v" } snapped_v = Util::Cluster.cluster_objects(v_edges, :x0, tolerance: x_tolerance) .flat_map { |g| move_to_avg(g, "v") } snapped_h = Util::Cluster.cluster_objects(h_edges, :top, tolerance: y_tolerance) .flat_map { |g| move_to_avg(g, "h") } snapped_v + snapped_h end |
.words_to_edges_h(words, word_threshold: DEFAULT_MIN_WORDS_HORIZONTAL) ⇒ Object
For each cluster of words aligned “at the top” (same top, within tol=1) with at least ‘word_threshold` members, it emits TWO horizontal edges (top and bottom of that cluster’s bbox). Having the bottom in addition to the top is critical: it guarantees that the last row of each table has a closing horizontal edge.
119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 |
# File 'lib/rpdfium/table/edges.rb', line 119 def words_to_edges_h(words, word_threshold: DEFAULT_MIN_WORDS_HORIZONTAL) by_top = Util::Cluster.cluster_objects(words, :top, tolerance: 1.0) large = by_top.select { |g| g.size >= word_threshold } rects = large.map { |g| Util::Cluster.objects_to_rect(g) } return [] if rects.empty? min_x0 = rects.map { |r| r[:x0] }.min max_x1 = rects.map { |r| r[:x1] }.max rects.flat_map do |r| [ { x0: min_x0, x1: max_x1, top: r[:top], bottom: r[:top], orientation: "h" }, { x0: min_x0, x1: max_x1, top: r[:bottom], bottom: r[:bottom], orientation: "h" } ] end end |
.words_to_edges_v(words, word_threshold: DEFAULT_MIN_WORDS_VERTICAL) ⇒ Object
Three clusters of words by x: x0, x1, centerpoint. Clusters with at least ‘word_threshold` members are column candidates. The bboxes of each cluster are “condensed”: if a bbox overlaps another already selected (more populated) one, it is discarded.
For each condensed bbox I emit a vertical edge at its x0 (left of the column). In addition, I emit a final “right” edge at the max x1 of all the bboxes: it visually closes the table on the right.
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 |
# File 'lib/rpdfium/table/edges.rb', line 144 def words_to_edges_v(words, word_threshold: DEFAULT_MIN_WORDS_VERTICAL) by_x0 = Util::Cluster.cluster_objects(words, :x0, tolerance: 1.0) by_x1 = Util::Cluster.cluster_objects(words, :x1, tolerance: 1.0) center_fn = ->(w) { (w[:x0] + w[:x1]) / 2.0 } by_center = Util::Cluster.cluster_objects(words, center_fn, tolerance: 1.0) clusters = by_x0 + by_x1 + by_center # More populated first sorted = clusters.sort_by { |c| -c.size } large = sorted.select { |c| c.size >= word_threshold } bboxes = large.map { |c| Util::Cluster.objects_to_bbox(c) } condensed_bboxes = bboxes.each_with_object([]) do |b, acc| acc << b unless acc.any? { |c| Util::Cluster.bbox_overlaps?(b, c) } end return [] if condensed_bboxes.empty? # Sort left-to-right to emit edges in geometric order. condensed_rects = condensed_bboxes.map do |b| { x0: b[0], top: b[1], x1: b[2], bottom: b[3] } end.sort_by { |r| r[:x0] } max_x1, min_top, max_bottom = condensed_rects.each_with_object( [-Float::INFINITY, Float::INFINITY, -Float::INFINITY] ) do |r, acc| acc[0] = r[:x1] if r[:x1] > acc[0] acc[1] = r[:top] if r[:top] < acc[1] acc[2] = r[:bottom] if r[:bottom] > acc[2] end # "left" edge of each column + a final "right" edge. left_edges = condensed_rects.map do |r| { x0: r[:x0], x1: r[:x0], top: min_top, bottom: max_bottom, orientation: "v" } end right_edge = { x0: max_x1, x1: max_x1, top: min_top, bottom: max_bottom, orientation: "v" } left_edges + [right_edge] end |