Class: Kotoshu::Language::Tokenizer::GermanTokenizer

Inherits: Base

Object
  Base
    Kotoshu::Language::Tokenizer::GermanTokenizer

Defined in: lib/kotoshu/language/tokenizer/german_tokenizer.rb

Overview

Tokenizer for German text.

Ported from LanguageTool’s GermanWordTokenizer.

Handles:

Underscore as word character (not a separator)
Single low quote (‚) as word character (not a separator)
Umlauts (ä, ö, ü) and Eszett (ß)

The LanguageTool implementation treats two additional characters as word characters: underscore (_) and the single low-9 quotation mark (‚, U+201A).

Direct Known Subclasses

Kotoshu::Languages::German::Tokenizer

Constant Summary

WORD_SEPARATORS = German-specific word separators (underscore and single low quote are excluded, so they stay inside words)

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-·]/.freeze
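To see what this character class does in practice, the following minimal sketch copies the pattern above and applies plain `String#split` to a sample string. The constant name `SEPARATORS_SKETCH` and the sample text are illustrative assumptions, not part of Kotoshu.

```ruby
# Illustrative copy of the documented pattern; the constant name is hypothetical.
SEPARATORS_SKETCH = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-·]/.freeze

sample = "snake_case ‚zitat Straße-Nord"

# Underscore and U+201A are absent from the class, so they stay inside tokens.
# Whitespace and the hyphen are in the class, so the string splits there.
pieces = sample.split(SEPARATORS_SKETCH)
p pieces
```

Here `snake_case` and `‚zitat` survive as single pieces, while `Straße-Nord` is split at the hyphen.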

Instance Method Summary

#tokenize(text) ⇒ Object

Methods inherited from Base

#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?

Instance Method Details

#tokenize(text) ⇒ Object

# File 'lib/kotoshu/language/tokenizer/german_tokenizer.rb', line 21

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Split on word boundaries
  raw_tokens = text.split(WORD_SEPARATORS)

  # Filter and normalize
  raw_tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
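The split step can produce empty strings when two separators are adjacent (for example a comma followed by a space), which is presumably why the method filters afterwards. The sketch below reproduces the same three steps outside the class. `normalize_sketch` and `skip_token_sketch?` are hypothetical stand-ins for the inherited `Base#normalize` and `Base#skip_token?`, whose real behavior is not shown on this page; here they only strip whitespace and drop empty strings.

```ruby
# Illustrative copy of the documented pattern; the constant name is hypothetical.
PIPELINE_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-·]/.freeze

# Hypothetical stand-in for Base#normalize (assumption: trims whitespace only).
def normalize_sketch(token)
  token.strip
end

# Hypothetical stand-in for Base#skip_token? (assumption: skips empty strings only).
def skip_token_sketch?(token)
  token.empty?
end

# Same shape as #tokenize: guard, split, normalize, filter.
def tokenize_sketch(text)
  return [] if text.nil? || text.strip.empty?

  text.split(PIPELINE_SEPARATORS)
      .map { |token| normalize_sketch(token) }
      .reject { |token| skip_token_sketch?(token) }
end

raw   = "Hallo, Welt!".split(PIPELINE_SEPARATORS) # still contains an empty string
clean = tokenize_sketch("Hallo, Welt!")           # empty string filtered out
blank = tokenize_sketch("   ")                    # guard clause returns []
```

Under these assumptions, `raw` contains an empty string between `"Hallo"` and `"Welt"`, while `clean` holds only the two words. Nil and whitespace-only input both return an empty array from the guard clause.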