Class: Kotoshu::Language::Normalizer::Base
- Inherits: Object
  - Object
  - Kotoshu::Language::Normalizer::Base
- Defined in: lib/kotoshu/language/normalizer/base.rb
Overview
Abstract base class for text normalizers.
Normalizers transform text to a standard form for comparison. Different languages use different normalization strategies.
Examples of normalization:
- Accent removal (café -> cafe)
- Case folding (Hello -> hello)
- Whitespace normalization
- Punctuation normalization
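As an illustration of a language-specific strategy, the following is a minimal sketch of accent removal combined with case folding and whitespace cleanup. The class name AccentFoldingSketch is hypothetical and not part of Kotoshu; the sketch assumes that decomposing to NFD and dropping combining marks is an acceptable way to remove accents, which may not hold for every language.

```ruby
# Hypothetical stand-in for illustration only; not part of Kotoshu.
class AccentFoldingSketch
  # Decompose to NFD, drop combining marks, then case-fold and tidy whitespace.
  def normalize(text)
    return "" if text.nil?

    text.unicode_normalize(:nfd)   # "é" becomes "e" followed by U+0301
        .gsub(/\p{Mn}/, "")        # remove the combining marks
        .downcase
        .strip
        .gsub(/\s+/, " ")
  end
end

puts AccentFoldingSketch.new.normalize("  Café   Crème ")  # prints "cafe creme"
```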
Instance Method Summary
- #normalize(text, options = {}) ⇒ String
  Normalize text.
- #normalize_word(word) ⇒ String
  Normalize a word.
- #normalized_eql?(str1, str2) ⇒ Boolean
  Check if two normalized strings are equal.
Instance Method Details
#normalize(text, options = {}) ⇒ String
Normalize text.
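Default implementation:
- Return an empty string if text is nil
- Strip leading/trailing whitespace (always applied)
- Collapse runs of whitespace to a single space (option collapse_ws, on by default)
- Downcase (option downcase, on by default)
- Strip punctuation (option strip_punct, off by default)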
# File 'lib/kotoshu/language/normalizer/base.rb', line 37
def normalize(text, options = {})
  return "" if text.nil?

  defaults = { downcase: true, strip_punct: false, collapse_ws: true }
  opts = defaults.merge(options)
  result = text.dup

  # Strip whitespace
  result = result.strip

  # Collapse multiple whitespace
  result = result.gsub(/\s+/, " ") if opts[:collapse_ws]

  # Downcase
  result = result.downcase if opts[:downcase]

  # Strip punctuation
  result = strip_punctuation(result) if opts[:strip_punct]

  result
end
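To show how the option keys interact, here is a small runnable sketch. NormalizerSketch is a hypothetical stand-in that mirrors the listing above so the example works without Kotoshu installed. The strip_punctuation helper is not shown in this section, so the version here (removing POSIX punctuation characters) is an assumption and the real helper may behave differently.

```ruby
# Hypothetical stand-in mirroring the listing above; not the Kotoshu class itself.
class NormalizerSketch
  DEFAULTS = { downcase: true, strip_punct: false, collapse_ws: true }.freeze

  def normalize(text, options = {})
    return "" if text.nil?

    opts = DEFAULTS.merge(options)   # caller-supplied keys win over defaults
    result = text.strip              # stripping is not optional
    result = result.gsub(/\s+/, " ") if opts[:collapse_ws]
    result = result.downcase if opts[:downcase]
    result = strip_punctuation(result) if opts[:strip_punct]
    result
  end

  private

  # Assumption: the real helper is not shown in this section.
  def strip_punctuation(text)
    text.gsub(/[[:punct:]]/, "")
  end
end

n = NormalizerSketch.new
p n.normalize("  Hello,   World!  ")                    # => "hello, world!"
p n.normalize("  Hello,   World!  ", downcase: false)   # => "Hello, World!"
p n.normalize("Hello, World!", strip_punct: true)       # => "hello world"
p n.normalize(nil)                                      # => ""
```

Because the defaults are merged with the caller's hash, passing a single key such as downcase: false leaves the other defaults in place.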
#normalize_word(word) ⇒ String
Normalize a word.
# File 'lib/kotoshu/language/normalizer/base.rb', line 68
def normalize_word(word)
  normalize(word)
end
#normalized_eql?(str1, str2) ⇒ Boolean
Check if two normalized strings are equal.
# File 'lib/kotoshu/language/normalizer/base.rb', line 77
def normalized_eql?(str1, str2)
  normalize(str1) == normalize(str2)
end
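Since normalized_eql? delegates to normalize, a subclass that overrides normalize changes the comparison as well. The sketch below illustrates this with two hypothetical classes, BaseSketch and CaseSensitiveSketch, which are not part of Kotoshu; BaseSketch is a simplified stand-in that only handles whitespace and the downcase option.

```ruby
# Hypothetical, simplified stand-in for the base class; not part of Kotoshu.
class BaseSketch
  def normalize(text, options = {})
    return "" if text.nil?

    result = text.strip.gsub(/\s+/, " ")
    options.fetch(:downcase, true) ? result.downcase : result
  end

  # Same shape as the listing above: compare the normalized forms.
  def normalized_eql?(str1, str2)
    normalize(str1) == normalize(str2)
  end
end

# Hypothetical subclass that keeps case significant.
class CaseSensitiveSketch < BaseSketch
  def normalize(text, options = {})
    super(text, options.merge(downcase: false))
  end
end

p BaseSketch.new.normalized_eql?("Hello  World", " hello world ")           # => true
p CaseSensitiveSketch.new.normalized_eql?("Hello  World", " hello world ")  # => false
p BaseSketch.new.normalized_eql?(nil, "   ")                                # => true
```

The last call shows a consequence of the nil guard: nil normalizes to an empty string, so it compares equal to a whitespace-only string.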