Class: Kotoshu::Algorithms::Capitalization::Casing

Inherits:
Object
  • Object
show all
Defined in:
lib/kotoshu/algorithms/capitalization.rb

Overview

Base class for casing-related algorithms specific for dictionary’s language.

This is a class (not a set of functions) because it needs to have subclasses for specific language casing, which have only some aspects different from generic one.

Direct Known Subclasses

GermanCasing, TurkicCasing

Instance Method Summary collapse

Instance Method Details

#capitalize(word) ⇒ Enumerator<String>

Capitalize (convert word to all lowercase and first letter uppercase). Returns a list of results for same reasons as lower.

Parameters:

  • word (String)

    The word to capitalize

Returns:

  • (Enumerator<String>)

    Enum of capitalized variants



85
86
87
88
89
90
91
92
93
94
95
96
# File 'lib/kotoshu/algorithms/capitalization.rb', line 85

def capitalize(word)
  return enum_for(:capitalize, word) unless block_given?

  if word.length == 1
    yield upper(word[0])
  else
    upper_first = upper(word[0])
    lower(word[1..]).each do |lowered|
      yield upper_first + lowered
    end
  end
end

#coerce(word, cap) ⇒ String

Used by suggest: by known (valid) suggestion, and initial word’s capitalization, produce proper suggestion capitalization.

Example: If misspelling was “Kiten” (INIT capitalization), found suggestion “kitten”, then this method makes it “Kitten”.

Parameters:

  • word (String)

    The valid suggestion word

  • cap (Symbol)

    Original word’s capitalization type

Returns:

  • (String)

    Properly capitalized suggestion



174
175
176
177
178
179
180
181
182
183
# File 'lib/kotoshu/algorithms/capitalization.rb', line 174

def coerce(word, cap)
  case cap
  when Type::INIT, Type::HUHINIT
    upper(word[0]) + word[1..]
  when Type::ALL
    upper(word)
  else
    word
  end
end

#corrections(word) ⇒ Array<Symbol, Array<String>>

Returns hypotheses of how the word might have been cased if it is a misspelling.

Example: “DiCtionary” (HUHINIT capitalization) produces hypotheses “DiCtionary”, “diCtionary”, “dictionary”, “Dictionary”, and all of them are checked by Suggest.

Parameters:

  • word (String)

    The word to analyze

Returns:

  • (Array<Symbol, Array<String>>)

    Pair of [captype, variants]



146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
# File 'lib/kotoshu/algorithms/capitalization.rb', line 146

def corrections(word)
  captype = guess(word)

  result = case captype
           when Type::NO
             [word]
           when Type::INIT
             [word, *lower(word)]
           when Type::HUHINIT
             [word, *lowerfirst(word).to_a, *lower(word), *capitalize(word).to_a]
           when Type::HUH
             [word, *lower(word)]
           when Type::ALL
             [word, *lower(word), *capitalize(word).to_a]
           end

  [captype, result]
end

#guess(word) ⇒ Symbol

Guess word’s capitalization. Redefined in GermanCasing.

Parameters:

  • word (String)

    The word to analyze

Returns:

  • (Symbol)

    One of the Type constants



37
38
39
40
41
42
43
44
45
46
47
# File 'lib/kotoshu/algorithms/capitalization.rb', line 37

def guess(word)
  return Type::NO if word.downcase == word
  return Type::ALL if word.upcase == word
  return Type::INIT if word[0].upcase == word[0] && word[1..].downcase == word[1..]

  if word[0].upcase == word[0]
    Type::HUHINIT
  else
    Type::HUH
  end
end

#lower(word) ⇒ Array<String>

Lowercases the word. Returns list of possible lowercasings for all casing classes to behave consistently.

In GermanCasing (and only there), lowercasing word like “STRASSE” produces two possibilities: “strasse” and “ße” (ß is most of the time upcased to SS, so we can’t decide which of downcased words is “right” and need to check both).

Also redefined in TurkicCasing, because in Turkic languages lowercase “i” is uppercased as “İ”, and uppercase “I” is downcased as “ı”.

Parameters:

  • word (String)

    The word to lowercase

Returns:

  • (Array<String>)

    List of possible lowercasings



62
63
64
65
66
67
68
# File 'lib/kotoshu/algorithms/capitalization.rb', line 62

def lower(word)
  # Can't be properly lowercased in non-Turkic collation
  return [] if word.nil? || word.empty? || word[0] == 'İ'

  # Turkic "lowercase dot i" to latinic "i", just in case
  [word.downcase.gsub('', 'i')]
end

#lowerfirst(word) ⇒ Enumerator<String>

Just change the case of the first letter to lower. Returns a list of results for same reasons as lower.

Parameters:

  • word (String)

    The word to process

Returns:

  • (Enumerator<String>)

    Enum of variants with lowercased first letter



103
104
105
106
107
108
109
# File 'lib/kotoshu/algorithms/capitalization.rb', line 103

def lowerfirst(word)
  return enum_for(:lowerfirst, word) unless block_given?

  lower(word[0]).each do |lowered|
    yield lowered + word[1..]
  end
end

#upper(word) ⇒ String

Uppercase the word. Redefined in TurkicCasing, because in Turkic languages lowercase “i” is uppercased as “İ”, and uppercase “I” is downcased as “ı”.

Parameters:

  • word (String)

    The word to uppercase

Returns:

  • (String)

    Uppercased word



76
77
78
# File 'lib/kotoshu/algorithms/capitalization.rb', line 76

def upper(word)
  word.upcase
end

#variants(word) ⇒ Array<Symbol, Array<String>>

Returns hypotheses of how the word might have been cased (in dictionary), if we consider it is spelled correctly.

Example: If word is “Kitten”, hypotheses are “kitten”, “Kitten”.

Parameters:

  • word (String)

    The word to analyze

Returns:

  • (Array<Symbol, Array<String>>)

    Pair of [captype, variants]



118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
# File 'lib/kotoshu/algorithms/capitalization.rb', line 118

def variants(word)
  captype = guess(word)

  result = case captype
           when Type::NO
             [word]
           when Type::INIT
             [word, *lower(word)]
           when Type::HUHINIT
             [word, *lowerfirst(word).to_a]
           when Type::HUH
             [word]
           when Type::ALL
             [word, *lower(word), *capitalize(word).to_a]
           end

  [captype, result]
end