Module: AcroForge::Labels
- Defined in:
- lib/acroforge/labels.rb
Overview
Cleans up human-readable labels extracted from PDFs.
PDF text extraction often produces broken word fragments (e.g. the ligature “fi” gets split out, producing “Tax Identi fi cation No.”), and vendors render labels in different casing conventions (ALL UPPER, mixed, sentence case). This module normalizes both: it repairs the fragments using Constants::TYPO_PHRASE_REPLACEMENTS and converts the result to consistent title case. It is used by Engine, Schema, and Relabeler so that the same corrections appear in the verbose log, schema variations, and mapping meta.
Constant Summary
- TITLE_CASE_CONNECTORS =
Words that conventionally stay lowercase inside a title (except when they’re the first or last word).
%w[ a an the and but or nor for so yet of at by in on to up from with as vs ].to_set
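As a rough illustration of how a connector set like this is typically consulted, the sketch below strips punctuation and downcases a word before the lookup. The `connector?` helper is hypothetical and is not part of AcroForge; only the word list is copied from the constant above.

```ruby
require "set"

# Word list copied from TITLE_CASE_CONNECTORS above.
TITLE_CASE_CONNECTORS = %w[
  a an the and but or nor for so yet of at by in on to up from with as vs
].to_set

# Hypothetical helper (an assumption for illustration only): normalize the
# word the same way title_case does before checking set membership.
def connector?(word)
  TITLE_CASE_CONNECTORS.include?(word.gsub(/[[:punct:]]/, "").downcase)
end

connector?("of")     # => true
connector?("Of,")    # => true  (punctuation and case are ignored)
connector?("Office") # => false
```

A Set gives constant-time membership checks, which is why the word list is converted with `to_set` rather than left as an Array.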
Class Method Summary
- .acronym?(word) ⇒ Boolean
Treat as an acronym if it’s all uppercase AND short (<= 3 chars), OR if it has 4+ all-upper letters AND looks like a format/code rather than a word (e.g., “DDMMYYYY”).
- .capitalize_first(word) ⇒ Object
- .fix_typos(label) ⇒ Object
Fix snake_case typo patterns from Constants::TYPO_PHRASE_REPLACEMENTS in the human-readable label too.
- .humanize(label) ⇒ Object
- .strip_punct(word) ⇒ Object
- .title_case(text) ⇒ Object
Convert a label to standard title case:
  - First and last words always capitalized
  - Conventional connectors (of, the, to, …) lowercased mid-label
  - All other words capitalized on the first letter
  - Short all-uppercase tokens (<= 3 chars) preserved as acronyms (GHC, DOB, PDF, ID stay as-is; “DDMMYYYY” also preserved as a format)
  - A word immediately following an opening “(”, “[”, or quotation mark is treated as starting a fresh title, so its first letter capitalizes even if it would otherwise be a connector (“(For Disbursement)” not “(for …)”)
Class Method Details
.acronym?(word) ⇒ Boolean
Treat a word as an acronym if it is all uppercase and short (<= 3 chars), or if it has 4+ characters, is all uppercase, and looks like a format or code rather than a word: it contains a digit (e.g., “ID2024”) or starts with a repeated letter (e.g., “DDMMYYYY”).
# File 'lib/acroforge/labels.rb', line 86

def acronym?(word)
  core = strip_punct(word)
  return false if core.empty?
  return true if core.length <= 3 && core == core.upcase && core.match?(/[A-Z]/)
  # Longer all-upper tokens: keep as acronym only if they contain a digit
  # (e.g. "ID2024") or start with a repeated letter, which suggests a
  # format code (e.g. "DDMMYYYY").
  return true if core.length >= 4 && core == core.upcase && core.match?(/\d|^([A-Z])\1+/)
  false
end
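The rules above are easiest to check against a few inputs. The sketch below copies `strip_punct` and `acronym?` from this page into top-level methods (an assumption made so the snippet runs without the rest of the library) and lists what each input returns.

```ruby
# Copied from the listings on this page, flattened to top-level methods
# so the example is self-contained.
def strip_punct(word)
  word.gsub(/[[:punct:]]/, "")
end

def acronym?(word)
  core = strip_punct(word)
  return false if core.empty?
  return true if core.length <= 3 && core == core.upcase && core.match?(/[A-Z]/)
  return true if core.length >= 4 && core == core.upcase && core.match?(/\d|^([A-Z])\1+/)

  false
end

acronym?("DOB")      # => true   short and all uppercase
acronym?("(ID)")     # => true   punctuation is stripped first
acronym?("DDMMYYYY") # => true   starts with a repeated letter
acronym?("ID2024")   # => true   all uppercase and contains a digit
acronym?("NAME")     # => false  long all-uppercase word, no digit or repeat
acronym?("Tax")      # => false  not all uppercase
acronym?("123")      # => false  no letters at all
acronym?("THE")      # => true   the check is shape-based, not dictionary-based
```

The last case is worth keeping in mind: because the test only looks at length and casing, any short all-uppercase word passes, including ordinary words from an ALL UPPER label.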
.capitalize_first(word) ⇒ Object
# File 'lib/acroforge/labels.rb', line 100

def capitalize_first(word)
  return word if word.empty?
  # Preserve trailing/embedded punctuation; only fix the casing of the
  # alphabetic part.
  word.sub(/^([[:punct:]]*)([A-Za-z])(.*)$/) do
    prefix = ::Regexp.last_match(1)
    first = ::Regexp.last_match(2).upcase
    rest = ::Regexp.last_match(3).downcase
    "#{prefix}#{first}#{rest}"
  end
end
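A short sketch of how this method behaves on typical tokens, with the method copied from the listing above into a top-level definition (an assumption, so it can run standalone). Note that everything after the first letter is downcased, and that a word whose first non-punctuation character is not an ASCII letter is returned unchanged.

```ruby
# Copied from the listing above, flattened to a top-level method.
def capitalize_first(word)
  return word if word.empty?

  word.sub(/^([[:punct:]]*)([A-Za-z])(.*)$/) do
    prefix = Regexp.last_match(1)
    first = Regexp.last_match(2).upcase
    rest = Regexp.last_match(3).downcase
    "#{prefix}#{first}#{rest}"
  end
end

capitalize_first("number")   # => "Number"
capitalize_first("NUMBER:")  # => "Number:"   trailing punctuation kept
capitalize_first("(for")     # => "(For"      leading punctuation kept
capitalize_first("mcDonald") # => "Mcdonald"  the rest is downcased
capitalize_first("2nd")      # => "2nd"       no leading letter, unchanged
```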
.fix_typos(label) ⇒ Object
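Apply the snake_case typo patterns from Constants::TYPO_PHRASE_REPLACEMENTS to the human-readable label as well. Each pattern's underscore-separated parts are matched as whitespace-separated words, case-insensitively, and the capitalization of the first matched character is carried over to the replacement: “Identi fi cation” -> “Identification”.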
# File 'lib/acroforge/labels.rb', line 35

def fix_typos(label)
  result = label.dup
  Constants::TYPO_PHRASE_REPLACEMENTS.each do |bad, good|
    parts = bad.split("_").reject(&:empty?).map { |p| Regexp.escape(p) }
    next if parts.empty?
    pattern = /\b#{parts.join('\s+')}\b/i
    result = result.gsub(pattern) do |match|
      replacement = good.tr("_", " ")
      if match[0] == match[0].upcase
        replacement[0].upcase + (replacement[1..] || "")
      else
        replacement
      end
    end
  end
  result
end
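The following sketch runs the same logic against a small stand-in table. The contents of the real Constants::TYPO_PHRASE_REPLACEMENTS are not shown on this page, so the two entries below are hypothetical and chosen only to mirror the “Identi fi cation” example; the top-level constant and method are likewise assumptions made to keep the snippet self-contained.

```ruby
# Hypothetical stand-in for Constants::TYPO_PHRASE_REPLACEMENTS.
# Keys and values are snake_case, as the method above expects.
TYPO_PHRASE_REPLACEMENTS = {
  "identi_fi_cation" => "identification",
  "certi_fi_cate" => "certificate"
}.freeze

# Same logic as the listing above, reading from the stand-in table.
def fix_typos(label)
  result = label.dup
  TYPO_PHRASE_REPLACEMENTS.each do |bad, good|
    # "identi_fi_cation" becomes /\bidenti\s+fi\s+cation\b/i
    parts = bad.split("_").reject(&:empty?).map { |p| Regexp.escape(p) }
    next if parts.empty?

    pattern = /\b#{parts.join('\s+')}\b/i
    result = result.gsub(pattern) do |match|
      replacement = good.tr("_", " ")
      # Carry over the capitalization of the first matched character.
      if match[0] == match[0].upcase
        replacement[0].upcase + (replacement[1..] || "")
      else
        replacement
      end
    end
  end
  result
end

fix_typos("Tax Identi fi cation No.") # => "Tax Identification No."
fix_typos("birth certi  fi cate")     # => "birth certificate"
fix_typos("IDENTI FI CATION")         # => "Identification"
```

Only the first character's case survives the replacement, so an ALL UPPER fragment comes back capitalized rather than fully uppercase; the later title-casing step would normalize it anyway.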
.humanize(label) ⇒ Object
# File 'lib/acroforge/labels.rb', line 25

def humanize(label)
  return label unless label.is_a?(String) && !label.empty?
  result = fix_typos(label)
  result = title_case(result)
  result.gsub(/\s+/, " ").strip
end
.strip_punct(word) ⇒ Object
# File 'lib/acroforge/labels.rb', line 96

def strip_punct(word)
  word.gsub(/[[:punct:]]/, "")
end
.title_case(text) ⇒ Object
Convert a label to standard title case:
- First and last words always capitalized
- Conventional connectors (of, the, to, ...) lowercased mid-label
- All other words capitalized on the first letter
- Short all-uppercase tokens (<= 3 chars) preserved as acronyms
  (GHC, DOB, PDF, ID stay as-is; "DDMMYYYY" also preserved as a format)
- A word immediately following an opening "(", "[", or quotation mark is
  treated as starting a fresh title, so its first letter capitalizes even
  if it would otherwise be a connector ("(For Disbursement)" not "(for ...)")
# File 'lib/acroforge/labels.rb', line 62

def title_case(text)
  words = text.split(/(\s+)/) # preserve whitespace between words
  content_indices = words.each_index.select { |i| words[i].match?(/\S/) }
  first_idx = content_indices.first
  last_idx = content_indices.last
  words.each_with_index.map do |word, i|
    next word if word.match?(/^\s*$/)
    if acronym?(word)
      word
    elsif i == first_idx || i == last_idx || word.start_with?("(", "[", '"', "'")
      capitalize_first(word)
    elsif TITLE_CASE_CONNECTORS.include?(strip_punct(word).downcase)
      word.downcase
    else
      capitalize_first(word)
    end
  end.join
end
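To see the rules interact, the sketch below assembles `title_case` with the helpers and connector list documented on this page. Flattening them into top-level methods and a top-level constant is an assumption made so the snippet runs on its own; the real module may wire them together differently.

```ruby
require "set"

TITLE_CASE_CONNECTORS = %w[
  a an the and but or nor for so yet of at by in on to up from with as vs
].to_set

def strip_punct(word)
  word.gsub(/[[:punct:]]/, "")
end

def acronym?(word)
  core = strip_punct(word)
  return false if core.empty?
  return true if core.length <= 3 && core == core.upcase && core.match?(/[A-Z]/)
  return true if core.length >= 4 && core == core.upcase && core.match?(/\d|^([A-Z])\1+/)

  false
end

def capitalize_first(word)
  return word if word.empty?

  word.sub(/^([[:punct:]]*)([A-Za-z])(.*)$/) do
    "#{Regexp.last_match(1)}#{Regexp.last_match(2).upcase}#{Regexp.last_match(3).downcase}"
  end
end

def title_case(text)
  words = text.split(/(\s+)/) # the capture group keeps whitespace tokens
  content_indices = words.each_index.select { |i| words[i].match?(/\S/) }
  first_idx = content_indices.first
  last_idx = content_indices.last
  words.each_with_index.map do |word, i|
    next word if word.match?(/^\s*$/)

    if acronym?(word)
      word # acronyms win over every other rule
    elsif i == first_idx || i == last_idx || word.start_with?("(", "[", '"', "'")
      capitalize_first(word)
    elsif TITLE_CASE_CONNECTORS.include?(strip_punct(word).downcase)
      word.downcase
    else
      capitalize_first(word)
    end
  end.join
end

title_case("date of birth (for disbursement)") # => "Date of Birth (For Disbursement)"
title_case("what to look for")                 # => "What to Look For"
title_case("EMPLOYEE DOB DDMMYYYY")            # => "Employee DOB DDMMYYYY"
title_case("PLACE OF BIRTH")                   # => "Place OF Birth"
```

The last call shows an ordering effect in this sketch: the acronym check runs before the connector check, so a short all-uppercase connector in an ALL UPPER label is kept as-is rather than lowercased. Whether that is the intended outcome is worth verifying against the real module's tests.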