Module: NEU::MODS::TextNormalizer
- Defined in:
- lib/neu/mods/canonicalize.rb
Overview
Normalises curator-authored freetext on the way into the JSON access copy (and Solr); the XML preservation copy stays untouched. Ported from Atlas’s TextNormalizer (which carries DRS v1 prior art) so the gem reproduces Atlas’s projection byte-for-byte.
IMPORTANT: every character-class regex is built programmatically from codepoint lists via `format('\\u%04X', cp)`, so this source file stays pure ASCII: no literal smart quotes, dashes, or (critically) raw control bytes land on disk. Keep it that way.
Pipeline: force UTF-8 + scrub invalid bytes; NFC; map Unicode dashes to ‘-’ (swung-dash to ‘~’); transliterate the General Punctuation block (smart quotes, ellipsis, etc.) to ASCII; strip C0/C1 controls (keeping tab/newline); collapse horizontal-whitespace runs to one space; for paragraph fields, collapse 2+ newlines to exactly two; strip.
.normalize(str) -- single-line fields (newlines -> spaces)
.normalize_paragraphs(str) -- fields that may carry paragraph breaks
(abstract, accessCondition)
Constant Summary
- DASH_CODEPOINTS =
NOTE: U+2053 (swung dash) is intentionally excluded from dashes – it is named “dash” but conventionally maps to ASCII ‘~’, not ‘-’ (V1 prior art).
[ 0x002D, 0x00AD, 0x058A, 0x05BE, 0x1400, 0x1806, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2015, 0x2043, 0x207B, 0x208B, 0x2212, 0x2E17, 0x2E1A, 0x2E3A, 0x2E3B, 0x2E40, 0x301C, 0x3030, 0x30A0, 0xFE31, 0xFE32, 0xFE58, 0xFE63, 0xFF0D ].freeze
- DASH_RE =
char_class(DASH_CODEPOINTS).freeze
- SWUNG_DASH_RE =
Regexp.new(format('\\u%04X', 0x2053)).freeze
- CONTROL_CODEPOINTS =
C0 controls (U+0000..U+0008, U+000B..U+001F), DEL (U+007F), and C1 controls (U+0080..U+009F). U+0009 (tab) and U+000A (newline) are preserved; U+000D (carriage return) falls inside the stripped range.
((0x0000..0x0008).to_a + (0x000B..0x001F).to_a + (0x007F..0x009F).to_a).freeze
- CONTROL_RE =
char_class(CONTROL_CODEPOINTS).freeze
- HORIZONTAL_WS_CODEPOINTS =
[ 0x0009, 0x00A0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F, 0x3000 ].freeze
- HORIZONTAL_WS_RE =
A literal space is included in the class (the " " prefix); the trailing `+` makes a run of horizontal whitespace collapse to a single space.
Regexp.new("#{char_class(HORIZONTAL_WS_CODEPOINTS, prefix: " ").source}+").freeze
- PARAGRAPH_RUN_RE =
/\n{2,}/
- GENERAL_PUNCTUATION =
General Punctuation block (U+2000..U+206F). Codepoints not listed pass through unchanged. Empty-string values deliberately drop invisible/bidi/format marks so they cannot leak into the access copy.
{ 0x2000 => " ", 0x2001 => " ", 0x2002 => " ", 0x2003 => " ", 0x2004 => " ", 0x2005 => " ", 0x2006 => " ", 0x2007 => " ", 0x2008 => " ", 0x2009 => " ", 0x200A => " ", 0x200B => "", 0x200C => "", 0x200D => "", 0x200E => "", 0x200F => "", 0x2018 => "'", 0x2019 => "'", 0x201A => ",", 0x201B => "'", 0x201C => '"', 0x201D => '"', 0x201E => '"', 0x201F => '"', 0x2020 => "+", 0x2021 => "+", 0x2022 => "*", 0x2023 => "*", 0x2024 => ".", 0x2025 => "..", 0x2026 => "...", 0x2028 => "\n", 0x2029 => "\n\n", 0x202A => "", 0x202B => "", 0x202C => "", 0x202D => "", 0x202E => "", 0x202F => " ", 0x2030 => "%", 0x2032 => "'", 0x2033 => '"', 0x2035 => "'", 0x2036 => '"', 0x2039 => "<", 0x203A => ">", 0x203C => "!!", 0x203D => "?", 0x2044 => "/", 0x2052 => "%", 0x205F => " ", 0x2060 => "", 0x2061 => "", 0x2062 => "", 0x2063 => "", 0x2064 => "", 0x206A => "", 0x206B => "", 0x206C => "", 0x206D => "", 0x206E => "", 0x206F => "" }.transform_keys { |cp| [cp].pack("U") }.freeze
- GENERAL_PUNCTUATION_RE =
Regexp.new("[#{format('\\u%04X-\\u%04X', 0x2000, 0x206F)}]").freeze
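The way these tables combine can be illustrated with a minimal sketch. It assumes only a small subset of the codepoints listed above; the names SAMPLE_DASHES, SAMPLE_PUNCT, and transliterate are hypothetical and not part of the module.

```ruby
# Hypothetical subset of DASH_CODEPOINTS: hyphen-minus, en dash, em dash, minus sign.
SAMPLE_DASHES = [0x002D, 0x2013, 0x2014, 0x2212].freeze
SAMPLE_DASH_RE = Regexp.new("[#{SAMPLE_DASHES.map { |cp| format('\\u%04X', cp) }.join}]")
SAMPLE_SWUNG_RE = Regexp.new(format('\\u%04X', 0x2053))

# Hypothetical subset of GENERAL_PUNCTUATION. Keys start as integer codepoints
# and are re-keyed to one-character UTF-8 strings, as in the real table.
SAMPLE_PUNCT = {
  0x200B => "",    # zero-width space: dropped
  0x2019 => "'",   # right single quotation mark
  0x201C => '"',   # left double quotation mark
  0x201D => '"',   # right double quotation mark
  0x2026 => "..."  # horizontal ellipsis
}.transform_keys { |cp| [cp].pack("U") }.freeze
SAMPLE_PUNCT_RE = Regexp.new("[#{format('\\u%04X-\\u%04X', 0x2000, 0x206F)}]")

def transliterate(str)
  s = str.gsub(SAMPLE_DASH_RE, "-")
  s = s.gsub(SAMPLE_SWUNG_RE, "~")
  # Codepoints in the block that the table does not list fall through
  # unchanged, because fetch returns its second argument as the default.
  s.gsub(SAMPLE_PUNCT_RE) { |c| SAMPLE_PUNCT.fetch(c, c) }
end

input = "\u201CA\u2013B\u201D \u2053 it\u2019s\u2026\u200B"
puts transliterate(input)  # prints "A-B" ~ it's...
```

Because dashes are replaced before the General Punctuation pass, the dash codepoints inside U+2000..U+206F never reach the table lookup.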
Class Method Summary
- .base_normalize(str) ⇒ Object
- .char_class(codepoints, prefix: "") ⇒ Object
Build a character-class Regexp from an array of integer codepoints, as \uXXXX escapes (keeps this source ASCII).
- .normalize(str) ⇒ Object
- .normalize_paragraphs(str) ⇒ Object
Class Method Details
.base_normalize(str) ⇒ Object
# File 'lib/neu/mods/canonicalize.rb', line 132
def base_normalize(str)
  s = str.dup.force_encoding("UTF-8")
  s = s.scrub("")
  s = s.unicode_normalize(:nfc)
  s = s.gsub(DASH_RE, "-")
  s = s.gsub(SWUNG_DASH_RE, "~")
  s = s.gsub(GENERAL_PUNCTUATION_RE) { |c| GENERAL_PUNCTUATION.fetch(c, c) }
  s.gsub(CONTROL_RE, "")
end
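The encoding, NFC, and control-stripping steps can be tried in isolation. The following is a sketch only: it copies the control ranges from CONTROL_CODEPOINTS, skips the dash and punctuation passes, and uses the hypothetical names SKETCH_CONTROLS, SKETCH_CONTROL_RE, and sketch_base_normalize.

```ruby
# Same ranges as CONTROL_CODEPOINTS; the constant names here are stand-ins.
SKETCH_CONTROLS = ((0x0000..0x0008).to_a + (0x000B..0x001F).to_a + (0x007F..0x009F).to_a).freeze
SKETCH_CONTROL_RE = Regexp.new("[#{SKETCH_CONTROLS.map { |cp| format('\\u%04X', cp) }.join}]")

def sketch_base_normalize(bytes)
  s = bytes.dup.force_encoding("UTF-8") # reinterpret the bytes, no transcoding
  s = s.scrub("")                       # drop byte sequences that are not valid UTF-8
  s = s.unicode_normalize(:nfc)         # compose "e" + combining acute into one codepoint
  s.gsub(SKETCH_CONTROL_RE, "")         # strip controls; tab and newline are not in the class
end

# Binary input: decomposed e-acute, one invalid byte (0xFF), a BEL control, a tab.
raw = "Cafe\u0301".b + "\xFF".b + "\a\tbar".b
result = sketch_base_normalize(raw)
p result
```

Here the invalid byte and the BEL character disappear, the tab survives, and the decomposed sequence is composed to U+00E9.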
.char_class(codepoints, prefix: "") ⇒ Object
Build a character-class Regexp from an array of integer codepoints, as \uXXXX escapes (keeps this source ASCII).
# File 'lib/neu/mods/canonicalize.rb', line 52
def self.char_class(codepoints, prefix: "")
  Regexp.new("[#{prefix}#{codepoints.map { |cp| format('\\u%04X', cp) }.join}]")
end
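A short sketch of how the helper might be exercised, including the prefix option. The method is redefined at top level here so the snippet runs on its own, and the two-codepoint list is an arbitrary illustration rather than one of the module's constants.

```ruby
# Standalone copy of the helper, defined at top level for this sketch only.
def char_class(codepoints, prefix: "")
  Regexp.new("[#{prefix}#{codepoints.map { |cp| format('\\u%04X', cp) }.join}]")
end

# Tab and no-break space, with a literal ASCII space prepended to the class.
ws_class = char_class([0x0009, 0x00A0], prefix: " ")
puts ws_class.source             # the pattern text contains only ASCII escapes
puts ws_class.source.ascii_only?

# Reuse the class source to build a "one or more" pattern, as HORIZONTAL_WS_RE does.
ws_run = Regexp.new("#{ws_class.source}+")
collapsed = "a \u00A0\t b".gsub(ws_run, " ")
p collapsed
```

Working from the pattern source rather than the compiled Regexp is what lets the quantifier be appended without re-listing the codepoints.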
.normalize(str) ⇒ Object
# File 'lib/neu/mods/canonicalize.rb', line 114
def normalize(str)
  return "" if str.nil?

  s = base_normalize(str.to_s)
  s = s.tr("\n", " ")
  s.gsub(HORIZONTAL_WS_RE, " ").strip
end
.normalize_paragraphs(str) ⇒ Object
# File 'lib/neu/mods/canonicalize.rb', line 122
def normalize_paragraphs(str)
  return "" if str.nil?

  s = base_normalize(str.to_s)
  s = s.gsub(HORIZONTAL_WS_RE, " ")
  s = s.gsub(/ *\n */, "\n")
  s.split(PARAGRAPH_RUN_RE).map { |p| p.tr("\n", " ").strip }
   .reject(&:empty?).join("\n\n")
end
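The difference between the single-line and paragraph variants shows up only in the whitespace stage, so it can be sketched without base_normalize. This is an illustration under assumptions: WS_RUN stands in for HORIZONTAL_WS_RE with just space, tab, and no-break space, and single_line and paragraphs are hypothetical names.

```ruby
# Hypothetical stand-ins for HORIZONTAL_WS_RE and PARAGRAPH_RUN_RE.
WS_RUN = /[ \t\u00A0]+/
PARA_RUN = /\n{2,}/

# Single-line fields: every newline becomes a space before runs are collapsed.
def single_line(text)
  text.tr("\n", " ").gsub(WS_RUN, " ").strip
end

# Paragraph fields: runs of 2+ newlines separate paragraphs; a lone newline
# inside a paragraph is treated as a soft wrap and becomes a space.
def paragraphs(text)
  s = text.gsub(WS_RUN, " ")
  s = s.gsub(/ *\n */, "\n") # trim spaces hugging each newline
  s.split(PARA_RUN)
   .map { |para| para.tr("\n", " ").strip }
   .reject(&:empty?)
   .join("\n\n")
end

text = "First  line\nwraps here. \n\n\n\n  Second\u00A0paragraph.\n"
p single_line(text) # => "First line wraps here. Second paragraph."
p paragraphs(text)  # => "First line wraps here.\n\nSecond paragraph."
```

Splitting and rejoining, rather than substituting on the newline run directly, also discards empty paragraphs at the start or end of the field.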