Class: Kotoshu::Language::Normalizer::Base

Inherits:
Object
Defined in:
lib/kotoshu/language/normalizer/base.rb

Overview

Abstract base class for text normalizers.

Normalizers transform text to a standard form for comparison. Different languages use different normalization strategies.

Examples of normalization:

  • Accent removal (café -> cafe)

  • Case folding (Hello -> hello)

  • Whitespace normalization

  • Punctuation normalization

Examples:

Implement a normalizer

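class MyNormalizer < Normalizer::Base
  # Keep the base signature so callers can still pass an options hash.
  def normalize(text, options = {})
    # Bare `super` forwards both arguments to the base implementation.
    # Downcasing first means the character class only needs lowercase forms.
    super.downcase.gsub(/[áàâä]/, 'a')
  end
end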
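The example above lists accented characters by hand. As an alternative, here is a minimal sketch that folds accents through Unicode decomposition. The class names `SketchBase` and `AccentFoldingNormalizer` are hypothetical. `SketchBase` is only a stand-in so the snippet runs without the library, and it reproduces just the strip, collapse, and downcase steps. `String#unicode_normalize` is part of the Ruby standard library.

```ruby
# Stand-in for Kotoshu::Language::Normalizer::Base (assumption: only the
# default strip / collapse / downcase behavior is reproduced here).
class SketchBase
  def normalize(text, options = {})
    return "" if text.nil?

    opts = { downcase: true, collapse_ws: true }.merge(options)
    result = text.strip
    result = result.gsub(/\s+/, " ") if opts[:collapse_ws]
    result = result.downcase if opts[:downcase]
    result
  end
end

# Hypothetical subclass: folds accents after the base normalization.
class AccentFoldingNormalizer < SketchBase
  def normalize(text, options = {})
    # NFD splits "é" into "e" plus a combining acute accent (U+0301);
    # \p{Mn} then matches the combining marks so gsub can remove them.
    super.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
  end
end

normalizer = AccentFoldingNormalizer.new
puts normalizer.normalize("  Café   Noël ")
```

Decomposition handles any accented Latin letter without enumerating them, but it also strips marks that are meaningful in some languages, so a language-specific normalizer may prefer an explicit mapping.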

Instance Method Details

#normalize(text, options = {}) ⇒ String

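Normalize text.

Returns an empty string when text is nil. Otherwise the default implementation applies these steps in order:

  • Strip leading/trailing whitespace (always)

  • Collapse each run of whitespace to a single space (unless :collapse_ws is false)

  • Downcase (unless :downcase is false)

  • Strip punctuation (only when :strip_punct is true)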

Parameters:

  • text (String, nil)

    Text to normalize

  • options (Hash) (defaults to: {})

    Normalization options

Options Hash (options):

  • :downcase (Boolean) — default: true

    Convert to lowercase

  • :strip_punct (Boolean) — default: false

    Remove punctuation

  • :collapse_ws (Boolean) — default: true

    Collapse whitespace

Returns:

  • (String)

    Normalized text



# File 'lib/kotoshu/language/normalizer/base.rb', line 37

def normalize(text, options = {})
  return "" if text.nil?

  defaults = {
    downcase: true,
    strip_punct: false,
    collapse_ws: true
  }
  opts = defaults.merge(options)

  result = text.dup

  # Strip whitespace
  result = result.strip

  # Collapse multiple whitespace
  result = result.gsub(/\s+/, " ") if opts[:collapse_ws]

  # Downcase
  result = result.downcase if opts[:downcase]

  # Strip punctuation
  result = strip_punctuation(result) if opts[:strip_punct]

  result
end
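To illustrate how each option changes the result, here is a small runnable sketch. `SketchNormalizer` is a hypothetical stand-in that mirrors the listing above so the snippet works without the library. The body of `strip_punctuation` does not appear on this page, so the version below is an assumption for illustration only.

```ruby
# Stand-in mirroring the documented default implementation.
class SketchNormalizer
  DEFAULTS = { downcase: true, strip_punct: false, collapse_ws: true }.freeze

  def normalize(text, options = {})
    return "" if text.nil?

    opts = DEFAULTS.merge(options)
    result = text.strip
    result = result.gsub(/\s+/, " ") if opts[:collapse_ws]
    result = result.downcase if opts[:downcase]
    result = strip_punctuation(result) if opts[:strip_punct]
    result
  end

  private

  # Assumed behavior: drop every character in the POSIX punctuation class.
  def strip_punctuation(text)
    text.gsub(/[[:punct:]]/, "")
  end
end

n = SketchNormalizer.new
p n.normalize("  Hello,   World!  ")                     # defaults
p n.normalize("  Hello,   World!  ", downcase: false)    # keep original case
p n.normalize("  Hello,   World!  ", strip_punct: true)  # also drop punctuation
p n.normalize("a \t\n b", collapse_ws: false)            # inner whitespace kept
p n.normalize(nil)                                       # nil becomes ""
```

Because the options are merged over the defaults, a caller only needs to pass the keys it wants to change.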

#normalize_word(word) ⇒ String

Normalize a single word. The default implementation delegates to #normalize with the default options.

Parameters:

  • word (String)

    Word to normalize

Returns:

  • (String)

    Normalized word



# File 'lib/kotoshu/language/normalizer/base.rb', line 68

def normalize_word(word)
  normalize(word)
end

#normalized_eql?(str1, str2) ⇒ Boolean

Check whether two strings are equal after both have been normalized with the default options.

Parameters:

  • str1 (String)

    First string

  • str2 (String)

    Second string

Returns:

  • (Boolean)

    True if equal after normalization



# File 'lib/kotoshu/language/normalizer/base.rb', line 77

def normalized_eql?(str1, str2)
  normalize(str1) == normalize(str2)
end
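A short sketch of what this comparison treats as equal. `ComparisonSketch` is a hypothetical stand-in that reproduces only the default strip, collapse, and downcase steps, so it runs without the library.

```ruby
# Minimal stand-in (assumption) covering the default steps only.
class ComparisonSketch
  def normalize(text)
    return "" if text.nil?

    text.strip.gsub(/\s+/, " ").downcase
  end

  def normalized_eql?(str1, str2)
    normalize(str1) == normalize(str2)
  end
end

c = ComparisonSketch.new
p c.normalized_eql?("Hello  World", " hello world ")  # case and spacing ignored
p c.normalized_eql?("hello", "hello!")                # punctuation still counts
p c.normalized_eql?(nil, "   ")                       # both normalize to ""
```

The last call is worth noting: since nil and whitespace-only strings both normalize to an empty string, they compare as equal.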