Class: Rpdfium::Util::LabelMatcher

Inherits:
Object
  • Object
Defined in:
lib/rpdfium/util/label_matcher.rb

Overview

Associates semantic labels with the values entered on PDFs of filled-in forms (F24, VAT communications, 770 forms), where template and data coexist as graphic text in different fonts.

Basic strategy:

  1. Cluster the template words into "coherent labels": geometrically close words form a single label.

  2. **For each value**, look for:

    • `:col`: the label ABOVE it in the same column

    • `:row`: the label TO ITS LEFT in the same row

  3. (Optional) **Column-based reassignment**: use `ColumnInference` to identify repetitive columns (e.g. ST2..ST13 in the 770 form's Quadro ST) and propagate the canonical header to every value in the column, overcoming the `col_max_dy` limit.
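The lookups in step 2 can be sketched as follows. This is only an illustration: the private helpers that `#match` calls (`find_col_label`, `find_row_label`) are not shown on this page, so the method names `sketch_col_label` and `sketch_row_label`, the left-edge alignment test, and the "nearest candidate wins" rule are all assumptions. Words and labels are assumed to be hashes with `:text`, `:x0`, `:x1`, `:top`, `:bottom` keys, with `top` growing downwards; the keyword defaults mirror the `DEFAULT_*` constants.

```ruby
# Hypothetical helpers, NOT the library's implementation.

# Nearest label that ends above the value and whose left edge is
# within x_tolerance of the value's left edge (assumed alignment rule).
def sketch_col_label(value, labels, max_dy: 80.0, x_tolerance: 10.0)
  candidates = labels.select do |l|
    dy = value[:top] - l[:bottom]
    dy >= 0 && dy <= max_dy && (l[:x0] - value[:x0]).abs <= x_tolerance
  end
  candidates.min_by { |l| value[:top] - l[:bottom] }
end

# Nearest label that ends to the left of the value on (almost) the same row.
def sketch_row_label(value, labels, max_dx: 200.0, y_tolerance: 2.0)
  candidates = labels.select do |l|
    dx = value[:x0] - l[:x1]
    dx >= 0 && dx <= max_dx && (l[:top] - value[:top]).abs <= y_tolerance
  end
  candidates.min_by { |l| value[:x0] - l[:x1] }
end

# Made-up sample geometry for illustration.
labels = [
  { text: "Codice tributo", x0: 100.0, x1: 160.0, top: 50.0, bottom: 58.0 },
  { text: "Erario",         x0: 20.0,  x1: 50.0,  top: 70.0, bottom: 78.0 }
]
value = { text: "1001", x0: 102.0, x1: 125.0, top: 70.5, bottom: 78.5 }

col = sketch_col_label(value, labels)
row = sketch_row_label(value, labels)
```

With this sample geometry the value sits 12.5 units below "Codice tributo" and 52 units to the right of "Erario", so both lookups find a label. A value further than `max_dy` below its header would get no `:col` label, which is the gap the optional column reassignment in step 3 is meant to close.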

Examples:

Basic usage

matcher = Rpdfium::Util::LabelMatcher.new
matcher.match(value_words, anchor_words)

With repetitive tables (header at the top of the column)

matcher = Rpdfium::Util::LabelMatcher.new(
  column_inference: Rpdfium::Util::ColumnInference.new
)
matcher.match(value_words, anchor_words)

Constant Summary

DEFAULT_COL_MAX_DY = 80.0
DEFAULT_ROW_MAX_DX = 200.0
DEFAULT_COL_X_TOLERANCE = 10.0
DEFAULT_ROW_Y_TOLERANCE = 2.0
DEFAULT_CLUSTER_SAME_ROW_DY = 4.0
DEFAULT_CLUSTER_SAME_ROW_DX = 12.0
DEFAULT_CLUSTER_ADJ_ROW_DY = 4.0
DEFAULT_IGNORE_LABEL_PATTERN = /\A\d{1,3}\z|\A[IVX]{1,5}\z/.freeze
WIDE_VALUE_THRESHOLD = 60.0
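As a quick check of what `DEFAULT_IGNORE_LABEL_PATTERN` filters out, the snippet below applies the same regular expression to a few sample strings. The samples are made up, and the reading that the pattern targets bare field numbers and Roman-numeral section markers is an assumption drawn from the regex itself.

```ruby
# Same regular expression as DEFAULT_IGNORE_LABEL_PATTERN.
pattern = /\A\d{1,3}\z|\A[IVX]{1,5}\z/

# Made-up label texts for illustration.
samples = ["12", "1234", "XIV", "Totale", "ST2", "MIX"]

# Labels whose whole text is 1-3 digits, or 1-5 characters drawn from I, V, X.
ignored = samples.select { |s| s.match?(pattern) }
kept = samples - ignored
```

Here `ignored` holds "12" and "XIV", while "1234" (four digits), "ST2", "MIX" and "Totale" are kept. The second alternative only checks the character set, so a string such as "VVVV" would also be dropped even though it is not a valid Roman numeral.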

Instance Method Summary

Constructor Details

#initialize(col_max_dy: DEFAULT_COL_MAX_DY, row_max_dx: DEFAULT_ROW_MAX_DX, col_x_tolerance: DEFAULT_COL_X_TOLERANCE, row_y_tolerance: DEFAULT_ROW_Y_TOLERANCE, cluster_same_row_dy: DEFAULT_CLUSTER_SAME_ROW_DY, cluster_same_row_dx: DEFAULT_CLUSTER_SAME_ROW_DX, cluster_adj_row_dy: DEFAULT_CLUSTER_ADJ_ROW_DY, ignore_label_pattern: DEFAULT_IGNORE_LABEL_PATTERN, column_inference: nil) ⇒ LabelMatcher

Returns a new instance of LabelMatcher.



# File 'lib/rpdfium/util/label_matcher.rb', line 43

def initialize(col_max_dy: DEFAULT_COL_MAX_DY,
               row_max_dx: DEFAULT_ROW_MAX_DX,
               col_x_tolerance: DEFAULT_COL_X_TOLERANCE,
               row_y_tolerance: DEFAULT_ROW_Y_TOLERANCE,
               cluster_same_row_dy: DEFAULT_CLUSTER_SAME_ROW_DY,
               cluster_same_row_dx: DEFAULT_CLUSTER_SAME_ROW_DX,
               cluster_adj_row_dy: DEFAULT_CLUSTER_ADJ_ROW_DY,
               ignore_label_pattern: DEFAULT_IGNORE_LABEL_PATTERN,
               column_inference: nil)
  @col_max_dy = col_max_dy
  @row_max_dx = row_max_dx
  @col_x_tolerance = col_x_tolerance
  @row_y_tolerance = row_y_tolerance
  @cluster_same_row_dy = cluster_same_row_dy
  @cluster_same_row_dx = cluster_same_row_dx
  @cluster_adj_row_dy = cluster_adj_row_dy
  @ignore_label_pattern = ignore_label_pattern
  @column_inference = column_inference
end

Instance Method Details

#cluster_anchors(anchor_words) ⇒ Object

Rebuilds the labels by clustering the template words. Exposed publicly for inspection/debugging.



# File 'lib/rpdfium/util/label_matcher.rb', line 97

def cluster_anchors(anchor_words)
  remaining = anchor_words.dup
  groups = []
  until remaining.empty?
    # Seed a new group, then keep absorbing nearby words until it stops growing.
    seed = remaining.shift
    group = [seed]
    grew = true
    while grew
      grew = false
      remaining.dup.each do |w|
        close = group.any? do |g|
          # Horizontal gap between the two boxes (negative when they overlap).
          dx_horiz = [w[:x0] - g[:x1], g[:x0] - w[:x1]].max
          same_row = (w[:top] - g[:top]).abs < @cluster_same_row_dy &&
                     dx_horiz < @cluster_same_row_dx
          # Vertical gap between w and g, whichever of the two is on top.
          dy_above = (g[:top] - w[:bottom]).abs
          dy_below = (w[:top] - g[:bottom]).abs
          vertical_adjacent = [dy_above, dy_below].min < @cluster_adj_row_dy
          # Adjacent rows merge only when the x ranges overlap (3-unit slack).
          x_overlap = !(w[:x1] < g[:x0] - 3 || w[:x0] > g[:x1] + 3)
          adj_row = vertical_adjacent && x_overlap
          same_row || adj_row
        end
        if close
          group << w
          remaining.delete(w)
          grew = true
        end
      end
    end
    groups << group
  end
  labels = groups.map { |g| group_to_label(g) }
  # Drop labels matching the ignore pattern, unless it was set to nil.
  if @ignore_label_pattern
    labels = labels.reject { |l| l[:text].match?(@ignore_label_pattern) }
  end
  labels
end
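To see which pairs of words the clustering treats as "close", the proximity test from the method above can be pulled out into a standalone function. The name `close_words?` is hypothetical and the sample words are made up; the thresholds default to the `DEFAULT_CLUSTER_*` values, and the logic is copied from `#cluster_anchors`.

```ruby
# Standalone copy of the proximity test used inside #cluster_anchors.
# close_words? is a hypothetical name used only for this sketch.
def close_words?(w, g, same_row_dy: 4.0, same_row_dx: 12.0, adj_row_dy: 4.0)
  # Horizontal gap between the boxes (negative when they overlap).
  dx_horiz = [w[:x0] - g[:x1], g[:x0] - w[:x1]].max
  same_row = (w[:top] - g[:top]).abs < same_row_dy && dx_horiz < same_row_dx
  dy_above = (g[:top] - w[:bottom]).abs
  dy_below = (w[:top] - g[:bottom]).abs
  vertical_adjacent = [dy_above, dy_below].min < adj_row_dy
  x_overlap = !(w[:x1] < g[:x0] - 3 || w[:x0] > g[:x1] + 3)
  same_row || (vertical_adjacent && x_overlap)
end

# Made-up template words for illustration.
codice  = { text: "Codice",  x0: 100.0, x1: 130.0, top: 50.0, bottom: 58.0 }
tributo = { text: "tributo", x0: 134.0, x1: 165.0, top: 50.0, bottom: 58.0 }
anno    = { text: "anno",    x0: 102.0, x1: 125.0, top: 60.0, bottom: 68.0 }
importo = { text: "Importo", x0: 300.0, x1: 340.0, top: 50.0, bottom: 58.0 }

same_line  = close_words?(tributo, codice)  # 4-unit gap on the same row
next_line  = close_words?(anno, codice)     # 2 units below, x ranges overlap
far_column = close_words?(importo, codice)  # same row, but 170 units away
```

In this sample, "tributo" and "anno" would both join the group seeded by "Codice", while "Importo" would start a label of its own.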

#match(values, anchors) ⇒ Array<Hash>

Computes the label → value associations.

Parameters:

  • values (Array<Hash>)

    words from the "data" layer

  • anchors (Array<Hash>)

    words from the "template" layer

Returns:

  • (Array<Hash>)

    one per value: { value:, labels: { col:, row: }, geometry: }



# File 'lib/rpdfium/util/label_matcher.rb', line 68

def match(values, anchors)
  labels = cluster_anchors(anchors)

  prelim = values.map do |v|
    col = find_col_label(v, labels)
    row = find_row_label(v, labels)
    { value: v, col: col, row: row }
  end

  # Optional reassignment for repetitive columns
  prelim = reassign_by_columns(prelim, labels, values) if @column_inference

  prelim.map do |entry|
    v = entry[:value]
    {
      value: v[:text],
      labels: {
        col: entry[:col]&.dig(:text),
        row: entry[:row]&.dig(:text)
      },
      geometry: {
        x0: v[:x0], x1: v[:x1], top: v[:top], bottom: v[:bottom]
      }
    }
  end
end
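A caller will usually want to index the returned array by label. The sketch below works on a hand-written array in the documented result shape, so it runs without the library; the sample values and the fallback from `:col` to `:row` are assumptions, not behaviour of `#match` itself. Either label can be `nil`, because the method uses `&.dig(:text)` on lookups that may find nothing.

```ruby
# Hand-written sample in the shape returned by #match (values are made up).
results = [
  { value: "1001", labels: { col: "Codice tributo", row: "Erario" },
    geometry: { x0: 102.0, x1: 125.0, top: 70.5, bottom: 78.5 } },
  { value: "2024", labels: { col: nil, row: "Anno" },
    geometry: { x0: 200.0, x1: 222.0, top: 70.5, bottom: 78.5 } },
  { value: "??",   labels: { col: nil, row: nil },
    geometry: { x0: 400.0, x1: 410.0, top: 300.0, bottom: 308.0 } }
]

# Group values by their column label, falling back to the row label.
by_label = results.each_with_object({}) do |r, acc|
  key = r[:labels][:col] || r[:labels][:row]
  next unless key
  (acc[key] ||= []) << r[:value]
end

# Values that matched no label at all are worth inspecting separately.
unlabeled = results.select { |r| r[:labels].values.all?(&:nil?) }
```

Keeping the unlabeled entries in a separate list, rather than discarding them, makes it easier to see when a tolerance such as `col_max_dy` or `row_max_dx` needs adjusting for a given form.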