Class: Kotoshu::Algorithms::Lookup::Lookuper

Inherits:
Object
Defined in:
lib/kotoshu/algorithms/lookup.rb

Overview

Main word correctness lookup class.

Typically, you would not use this directly.

Example:

dictionary = Kotoshu::Dictionary.load('en_US')
lookuper = dictionary.lookuper

lookuper.call('spylls')  # => false
lookuper.call('spells')  # => true

lookuper.good_forms('spells') do |form|
  puts form
end
# AffixForm(spells = spells)
# AffixForm(spells = spell + Suffix(s: S×, on [[^sxzhy]]$))

Constructor Details

#initialize(aff, dic) ⇒ Lookuper

Returns a new instance of Lookuper.



# File 'lib/kotoshu/algorithms/lookup.rb', line 190

def initialize(aff, dic)
  @aff = aff
  @dic = dic
end

Instance Attribute Details

#aff ⇒ Hash (readonly)

Returns Aff data structure (from aff file).

Returns:

  • (Hash)

    Aff data structure (from aff file)



# File 'lib/kotoshu/algorithms/lookup.rb', line 185

def aff
  @aff
end

#dic ⇒ Hash (readonly)

Returns Dic data structure (from dic file).

Returns:

  • (Hash)

    Dic data structure (from dic file)



# File 'lib/kotoshu/algorithms/lookup.rb', line 188

def dic
  @dic
end

Instance Method Details

#break_word(text, depth = 0) {|Array<String>| ... } ⇒ Enumerator

Recursively produces every possible way to split the text at break patterns (such as dashes).

Example: “pre-processed-meat” would produce:

["pre-processed-meat"]
["pre", "processed-meat"]
["pre", "processed", "meat"]
["pre-processed", "meat"]

This is necessary because the dictionary might contain “pre-processed” as a separate entry.

Parameters:

  • text (String)

    Text to break

  • depth (Integer) (defaults to: 0)

    Current recursion depth

Yields:

  • (Array<String>)

    Each possible breaking

Returns:

  • (Enumerator)

    If no block given



# File 'lib/kotoshu/algorithms/lookup.rb', line 254

def break_word(text, depth = 0)
  return enum_for(:break_word, text, depth) unless block_given?
  return if depth > 10

  # Return whole text as first option
  yield [text]

  break_patterns = @aff[:BREAK] || []
  break_patterns.each do |pattern|
    str = text.to_s
    pos = 0

    while (match_data = pattern[:matcher].match(str, pos))
      start = str[0...match_data.begin(1)]
      rest = str[match_data.end(1)..]

      break_word(rest, depth + 1) do |breaking|
        yield [start, *breaking]
      end

      pos = match_data.end(0)
      break if pos >= str.length
    end
  end
end
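
The recursion above can be tried in isolation. The following is a minimal sketch, not the library's code: `BreakSketch` and its `PATTERNS` constant are hypothetical stand-ins for the lookuper and for `@aff[:BREAK]`, and the single dash pattern is an assumption about what a compiled BREAK entry might look like (capture group 1 marks the break characters).

```ruby
# Hypothetical stand-in for a Lookuper whose aff data has one BREAK pattern.
module BreakSketch
  # Assumption: each pattern is a Hash with a :matcher Regexp whose
  # first capture group covers the characters to split on.
  PATTERNS = [{ matcher: /(-)/ }].freeze

  def self.break_word(text, depth = 0)
    return enum_for(:break_word, text, depth) unless block_given?
    return if depth > 10 # same recursion guard as the method above

    # The unbroken text is always the first candidate.
    yield [text]

    PATTERNS.each do |pattern|
      pos = 0
      while (m = pattern[:matcher].match(text, pos))
        start = text[0...m.begin(1)]
        rest = text[m.end(1)..]

        # Prepend the left part to every breaking of the remainder.
        break_word(rest, depth + 1) { |breaking| yield [start, *breaking] }

        pos = m.end(0)
        break if pos >= text.length
      end
    end
  end
end

breakings = BreakSketch.break_word('pre-processed-meat').to_a
breakings.each { |parts| p parts }
```

With the single dash pattern assumed here, the breakings come out in the same order as the list in the method description.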

#call(word, capitalization: true, allow_nosuggest: true) ⇒ Boolean

The outermost word correctness check.

Prepares the word for checking (applies the ICONV conversion and removes IGNORE characters), then checks whether good_forms produces at least one good word form. If it produces none, the method also tries splitting the word at break points and checking each part.

Parameters:

  • word (String)

    Word to check

  • capitalization (Boolean) (defaults to: true)

    If false, check only exact capitalization

  • allow_nosuggest (Boolean) (defaults to: true)

    If false, don’t consider NOSUGGEST words as correct

Returns:

  • (Boolean)

    Whether word is correct



# File 'lib/kotoshu/algorithms/lookup.rb', line 205

def call(word, capitalization: true, allow_nosuggest: true)
  # Check if word is correct
  is_correct = ->(w) do
    good_forms(w, capitalization: capitalization, allow_nosuggest: allow_nosuggest).any?
  end

  # If all entries matching the word have FORBIDDENWORD flag, word can't be correct
  if @aff[:FORBIDDENWORD] && @dic[:has_flag]&.call(word, @aff[:FORBIDDENWORD], for_all: true)
    return false
  end

  # Convert word with ICONV table
  word_to_check = @aff[:ICONV] ? @aff[:ICONV].call(word) : word

  # Remove ignored characters
  if @aff[:IGNORE]
    ignore_chars = @aff[:IGNORE]
    word_to_check = word_to_check.chars.reject { |c| ignore_chars.include?(c) }.join
  end

  # Numbers are always good
  return true if NUMBER_REGEXP.match?(word_to_check)

  # Try breaking word by break patterns
  break_word(word_to_check).each do |parts|
    if parts.all? { |part| part.empty? || is_correct.call(part) }
      return true
    end
  end

  false
end
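
The preparation steps in the method above (ICONV conversion, IGNORE removal, the number shortcut) can be sketched on their own. Everything named here is an assumption made for illustration: `prepare_word` is a hypothetical helper, the `iconv` lambda stands in for whatever callable `@aff[:ICONV]` holds, and `NUMBER_SKETCH` is a guess at a number pattern, since the real `NUMBER_REGEXP` is not shown on this page.

```ruby
# Hypothetical number pattern; the library's NUMBER_REGEXP may differ.
NUMBER_SKETCH = /\A\d+([.,]\d+)*\z/

# Hypothetical helper mirroring the two preparation steps of #call.
def prepare_word(word, iconv: nil, ignore: nil)
  # Step 1: convert input characters if a conversion callable is given.
  prepared = iconv ? iconv.call(word) : word
  # Step 2: drop every character listed in the ignore set.
  prepared = prepared.chars.reject { |c| ignore.include?(c) }.join if ignore
  prepared
end

# Assumed conversion: typographic apostrophe to ASCII apostrophe.
iconv = ->(w) { w.gsub('’', "'") }

converted = prepare_word('don’t', iconv: iconv)
stripped  = prepare_word('spe_lls', ignore: '_')

puts converted
puts stripped
puts NUMBER_SKETCH.match?(prepare_word('3.14'))
```

Only after these steps does the real method hand the prepared word to break_word and good_forms.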

#correct?(word, capitalization: true, allow_nosuggest: true, affix_forms: true, compound_forms: true) ⇒ Boolean Also known as: is_correct?

Check if the word is correct without yielding forms.

Convenience method for simple correctness checks.

Parameters:

  • word (String)

    Word to check

  • capitalization (Boolean) (defaults to: true)

    Check capitalization variants

  • allow_nosuggest (Boolean) (defaults to: true)

    Include NOSUGGEST words

  • affix_forms (Boolean) (defaults to: true)

    Check affix forms

  • compound_forms (Boolean) (defaults to: true)

    Check compound forms

Returns:

  • (Boolean)

    Whether word is correct



# File 'lib/kotoshu/algorithms/lookup.rb', line 354

def correct?(word,
             capitalization: true,
             allow_nosuggest: true,
             affix_forms: true,
             compound_forms: true)
  good_forms(word,
             capitalization: capitalization,
             allow_nosuggest: allow_nosuggest,
             affix_forms: affix_forms,
             compound_forms: compound_forms).any?
end
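
Because good_forms returns an Enumerator when called without a block, the `.any?` call can stop at the first form instead of enumerating all of them. A minimal sketch of that pattern follows; `ToyLookuper` and its lookup table are hypothetical and only imitate the shape of the real methods.

```ruby
# Hypothetical class that imitates the good_forms / correct? pairing.
class ToyLookuper
  attr_reader :yielded

  def initialize(forms_by_word)
    @forms_by_word = forms_by_word
    @yielded = 0 # counts how many forms were actually produced
  end

  def good_forms(word)
    return enum_for(:good_forms, word) unless block_given?

    @forms_by_word.fetch(word, []).each do |form|
      @yielded += 1
      yield form
    end
  end

  def correct?(word)
    # any? breaks out of the enumeration after the first truthy form.
    good_forms(word).any?
  end
end

toy = ToyLookuper.new('building' => ['building', 'build + ing'])
known = toy.correct?('building')
unknown = toy.correct?('bilding')
puts known, unknown, toy.yielded
```

In this toy version only one of the two forms of "building" is ever produced, which is why a plain correctness check can be cheaper than collecting every form.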

#good_forms(word, capitalization: true, allow_nosuggest: true, affix_forms: true, compound_forms: true) {|AffixForm, CompoundForm| ... } ⇒ Object

The main producer of correct word forms.

Produces every way the given string can correspond to dictionary entries and affixes. If there is at least one, the word is correctly spelled.

Example:

lookuper.good_forms('building') do |form|
  puts form
end
# AffixForm(building = building)                              # noun
# AffixForm(building = build + Suffix(ing: G×, on [[^e]]$))   # verb

Parameters:

  • word (String)

    Word to check

  • capitalization (Boolean) (defaults to: true)

    If false, use only exact capitalization

  • allow_nosuggest (Boolean) (defaults to: true)

    If false, exclude NOSUGGEST words

  • affix_forms (Boolean) (defaults to: true)

    If false, only return compound forms

  • compound_forms (Boolean) (defaults to: true)

    If false, only return affix forms

Yields:

  • (AffixForm, CompoundForm)

    Each good form of the word



# File 'lib/kotoshu/algorithms/lookup.rb', line 298

def good_forms(word,
               capitalization: true,
               allow_nosuggest: true,
               affix_forms: true,
               compound_forms: true)
  return enum_for(:good_forms, word,
                  capitalization: capitalization,
                  allow_nosuggest: allow_nosuggest,
                  affix_forms: affix_forms,
                  compound_forms: compound_forms) unless block_given?

  # Get capitalization variants
  if capitalization
    captype, variants = @aff[:casing].variants(word)
  else
    captype = @aff[:casing].guess(word)
    variants = [word]
  end

  # Check each variant
  variants.each do |variant|
    if affix_forms
      affix_forms_internal(variant, captype: captype, allow_nosuggest: allow_nosuggest) do |form|
        # Special German ß handling
        if @aff[:CHECKSHARPS] && @aff[:KEEPCASE]
          stem = form.in_dictionary ? form.in_dictionary[:stem] : form.stem
          if stem.include?('ß') &&
             captype == Capitalization::Type::ALL &&
             word.include?('ß') &&
             form.flags.include?(@aff[:KEEPCASE])
            next
          end
        end

        yield form
      end
    end

    if compound_forms
      compound_forms_internal(variant, captype: captype, allow_nosuggest: allow_nosuggest) do |form|
        yield form
      end
    end
  end
end