Module: AcroForge::Constants
- Defined in: lib/acroforge/constants.rb
Constant Summary
- TYPO_PHRASE_REPLACEMENTS =
{
  "identi_fi_cation" => "identification",
  "identi_cation" => "identification",
  "ide_ntity" => "identity",
  "contribu_on" => "contribution",
  "con_rmed" => "confirmed",
  "na_onal" => "national",
  "ocial" => "official",
  "modeof" => "mode_of",
  "modeofr" => "mode_of_r",
  "nameof" => "name_of"
}.freeze
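As a rough illustration of how a table like this could be applied, the sketch below replaces every known typo phrase in one pass. The names `TYPO_FIXES`, `TYPO_PATTERN`, and `fix_typo_phrases` are hypothetical and not part of AcroForge; this page does not show how the library itself consumes the constant. Keys are tried longest first because "modeof" is a prefix of "modeofr", and a shorter key matched first would hide the longer one.

```ruby
# Illustrative subset of TYPO_PHRASE_REPLACEMENTS (hypothetical local copy).
TYPO_FIXES = {
  "identi_cation" => "identification",
  "con_rmed"      => "confirmed",
  "modeof"        => "mode_of",
  "modeofr"       => "mode_of_r"
}.freeze

# Longest keys first, so "modeofr" is tried before its prefix "modeof".
# Regexp.union escapes each string, so keys are matched literally.
TYPO_PATTERN = Regexp.union(TYPO_FIXES.keys.sort_by { |key| -key.length })

# Hypothetical helper: String#gsub with a Hash looks up each match as a key
# and substitutes the corresponding value.
def fix_typo_phrases(name)
  name.gsub(TYPO_PATTERN, TYPO_FIXES)
end

puts fix_typo_phrases("identi_cation_con_rmed")
puts fix_typo_phrases("modeofr")
```

Note that this sketch matches plain substrings, so a short key such as "ocial" would also match inside a longer word such as "social". Whether that matters depends on where the table is applied, which this page does not state.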
- UNICODE_REPLACEMENTS =
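PDF text extraction returns Unicode quirks: ligatures (the single codepoint ﬁ, U+FB01, instead of f + i), fullwidth letters, curly quotes, and so on. AllTextProcessor normalizes via Unicode NFKC first, which handles most of the “compatibility” subset automatically (ligatures, fullwidth forms, superscript digits, ...). NFKC leaves most of the characters below untouched, because they are distinct codepoints with no compatibility decomposition, so we substitute them manually. Two entries are exceptions: NFKC already maps U+2026 (ellipsis) to "..." and U+2011 (non-breaking hyphen) to U+2010 (hyphen), so those two mappings only take effect on text that has not been normalized.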
{
  "\u{2018}" => "'",   # left single quote
  "\u{2019}" => "'",   # right single quote
  "\u{201A}" => "'",   # single low-9 quote
  "\u{201C}" => '"',   # left double quote
  "\u{201D}" => '"',   # right double quote
  "\u{201E}" => '"',   # double low-9 quote
  "\u{2013}" => "-",   # en dash
  "\u{2014}" => "-",   # em dash
  "\u{2010}" => "-",   # hyphen
  "\u{2011}" => "-",   # non-breaking hyphen
  "\u{2212}" => "-",   # minus sign
  "\u{00AD}" => "",    # soft hyphen (often invisible artifact)
  "\u{200B}" => "",    # zero-width space
  "\u{200C}" => "",    # zero-width non-joiner
  "\u{200D}" => "",    # zero-width joiner
  "\u{FEFF}" => "",    # zero-width no-break space (BOM)
  "\u{2026}" => "...", # ellipsis
  "\u{2022}" => "*",   # bullet
  "\u{00B7}" => "*"    # middle dot used as bullet
}.freeze
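The two-stage cleanup described above (NFKC first, then manual substitution) might look roughly like the following sketch. `PUNCTUATION_FIXES`, `PUNCTUATION_PATTERN`, and `normalize_pdf_text` are hypothetical names chosen for illustration; the actual AllTextProcessor implementation is not shown on this page and may differ. Only Ruby core methods (`String#unicode_normalize`, `String#gsub`, `Regexp.union`) are used.

```ruby
# Illustrative subset of UNICODE_REPLACEMENTS (hypothetical local copy).
PUNCTUATION_FIXES = {
  "\u{2019}" => "'",  # right single quote
  "\u{201C}" => '"',  # left double quote
  "\u{201D}" => '"',  # right double quote
  "\u{2014}" => "-",  # em dash
  "\u{00AD}" => "",   # soft hyphen
  "\u{200B}" => ""    # zero-width space
}.freeze

# One alternation over all keys, so the text is scanned once.
PUNCTUATION_PATTERN = Regexp.union(PUNCTUATION_FIXES.keys)

# Hypothetical helper: NFKC folds compatibility characters (ligatures,
# fullwidth forms), then the table handles what NFKC leaves alone.
def normalize_pdf_text(text)
  text.unicode_normalize(:nfkc).gsub(PUNCTUATION_PATTERN, PUNCTUATION_FIXES)
end

puts normalize_pdf_text("con\u{FB01}rmed")             # ligature folded by NFKC
puts normalize_pdf_text("\u{201C}A\u{2014}B\u{201D}")  # quotes and dash from the table
```

Running NFKC before the table lookup keeps the table small: it only needs entries for characters that have no compatibility decomposition.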