Module: Rpdfium::Util::Cluster
- Defined in:
- lib/rpdfium/util/cluster.rb
Overview
1D clustering primitives used throughout the table pipeline. Direct mapping onto ‘pdfplumber.utils.clustering` (cluster_list, cluster_objects, make_cluster_dict).
KEY PROPERTY: these clusters are “1D agglomerative single-linkage”: two values end up in the same cluster if they are within ‘tolerance` of any value in the cluster. NOT only of the center/mean. As a result, chains of close values can extend the cluster well beyond `tolerance` (this is exactly pdfplumber’s behavior, on which its edge/intersection heuristics rely).
Class Method Summary collapse
-
.bbox_overlap(a, b) ⇒ Object
Overlapping bbox.
-
.bbox_overlaps?(a, b) ⇒ Boolean
True if two bbox overlap (even just at a point is no; there must be positive area).
-
.cluster_list(values, tolerance: 0) ⇒ Object
Groups scalar values into clusters.
-
.cluster_objects(objects, key_fn, tolerance: 0, presorted: false) ⇒ Object
Groups objects (Hash) into clusters based on an extraction function ‘key_fn` (or a Hash key symbol) and a tolerance.
-
.objects_to_bbox(objects) ⇒ Object
bbox = [x0, top, x1, bottom] (top-down).
-
.objects_to_rect(objects) ⇒ Object
Variant that returns a Hash instead of a tuple — handy in the edge context where we need to mix bbox+orientation.
Class Method Details
.bbox_overlap(a, b) ⇒ Object
Overlapping bbox. No overlap => nil. Matches pdfplumber’s get_bbox_overlap: returns the intersection bbox, or nil.
124 125 126 127 128 129 130 131 132 133 134 135 136 |
# File 'lib/rpdfium/util/cluster.rb', line 124 def bbox_overlap(a, b) ax0, atop, ax1, abot = a bx0, btop, bx1, bbot = b x0 = [ax0, bx0].max x1 = [ax1, bx1].min return nil if x0 >= x1 top = [atop, btop].max bot = [abot, bbot].min return nil if top >= bot [x0, top, x1, bot] end |
.bbox_overlaps?(a, b) ⇒ Boolean
True if two bbox overlap (even just at a point is no; there must be positive area).
140 141 142 |
# File 'lib/rpdfium/util/cluster.rb', line 140 def bbox_overlaps?(a, b) !bbox_overlap(a, b).nil? end |
.cluster_list(values, tolerance: 0) ⇒ Object
Groups scalar values into clusters. The values within the same cluster are within ‘tolerance` of at least one other value of the cluster.
Example:
cluster_list([1.0, 1.5, 2.0, 5.0], tolerance: 1.0)
#=> [[1.0, 1.5, 2.0], [5.0]]
NOTE: “Stepping stone” chains: [1, 2, 3, 4] with tol=1 form a SINGLE cluster, even though 1 and 4 are 3 apart. This is pdfplumber’s behavior, documented in its issues as potentially surprising but intentional. We keep it identical.
30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
# File 'lib/rpdfium/util/cluster.rb', line 30 def cluster_list(values, tolerance: 0) return [] if values.empty? sorted = values.sort clusters = [[sorted.first]] sorted[1..].each do |v| if (v - clusters.last.last).abs <= tolerance clusters.last << v else clusters << [v] end end clusters end |
.cluster_objects(objects, key_fn, tolerance: 0, presorted: false) ⇒ Object
Groups objects (Hash) into clusters based on an extraction function ‘key_fn` (or a Hash key symbol) and a tolerance.
Example:
cluster_objects(words, ->(w) { w[:top] }, tolerance: 1)
cluster_objects(words, :top, tolerance: 1) # syntactic sugar
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 |
# File 'lib/rpdfium/util/cluster.rb', line 51 def cluster_objects(objects, key_fn, tolerance: 0, presorted: false) return [] if objects.empty? # Fast path for the most common Symbol case (:top, :x0, :bottom): # direct Hash[symbol] access is ~2x faster than the lambda call. if key_fn.is_a?(Symbol) # If the caller guarantees that the input is already sorted by # key_fn (e.g. because it comes from a lexicographic sort # [key_fn, ...]) the internal sort can be skipped. A significant # saving when cluster_objects is called in a loop over many # small rows. sorted = presorted ? objects : objects.sort_by { |o| o[key_fn] } first = sorted.first last_key = first[key_fn] clusters = [[first]] tol = tolerance.to_f i = 1 n = sorted.size while i < n obj = sorted[i] curr_key = obj[key_fn] if (curr_key - last_key).abs <= tol clusters.last << obj else clusters << [obj] end last_key = curr_key i += 1 end return clusters end # Generic path with a callable accessor accessor = key_fn sorted = presorted ? objects : objects.sort_by { |o| accessor.call(o) } last_key = accessor.call(sorted.first) clusters = [[sorted.first]] sorted[1..].each do |obj| curr_key = accessor.call(obj) if (curr_key - last_key).abs <= tolerance clusters.last << obj else clusters << [obj] end last_key = curr_key end clusters end |
.objects_to_bbox(objects) ⇒ Object
bbox = [x0, top, x1, bottom] (top-down). Returns the bbox that encloses all the passed objects. Uses min/max of x0/top/x1/bottom.
103 104 105 106 107 108 109 110 111 112 |
# File 'lib/rpdfium/util/cluster.rb', line 103 def objects_to_bbox(objects) objects.each_with_object( [Float::INFINITY, Float::INFINITY, -Float::INFINITY, -Float::INFINITY] ) do |o, acc| acc[0] = o[:x0] if o[:x0] < acc[0] acc[1] = o[:top] if o[:top] < acc[1] acc[2] = o[:x1] if o[:x1] > acc[2] acc[3] = o[:bottom] if o[:bottom] > acc[3] end end |
.objects_to_rect(objects) ⇒ Object
Variant that returns a Hash instead of a tuple — handy in the edge context where we need to mix bbox+orientation.
116 117 118 119 120 |
# File 'lib/rpdfium/util/cluster.rb', line 116 def objects_to_rect(objects) x0, top, x1, bottom = objects_to_bbox(objects) { x0: x0, top: top, x1: x1, bottom: bottom, width: x1 - x0, height: bottom - top } end |