Class: Kotoshu::Language::Tokenizer::GermanTokenizer

Inherits:
Base
  • Object
Defined in:
lib/kotoshu/language/tokenizer/german_tokenizer.rb

Overview

Tokenizer for German text.

Ported from LanguageTool’s GermanWordTokenizer.

Handles:

  • Underscore as word character (not a separator)

  • Single low quote (‚) as word character (not a separator)

  • Umlauts (ä, ö, ü) and Eszett (ß)

The LanguageTool implementation treats two additional characters as word characters: underscore (_) and single low quote (‚, U+201A).

Direct Known Subclasses

Kotoshu::Languages::German::Tokenizer

Constant Summary

German-specific word separators (exclude underscore and single low quote)

WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-·]/.freeze
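The effect of this separator set can be checked in isolation. The following is a minimal sketch that copies the regular expression from the constant and splits a few sample strings with it; the sample strings and local variable names are illustrative assumptions, not part of Kotoshu.

```ruby
# Separator set copied from GermanTokenizer::WORD_SEPARATORS.
WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-·]/.freeze

# Underscore is not in the set, so the identifier stays in one piece.
underscore_kept = "foo_bar baz".split(WORD_SEPARATORS)

# Single low quote (U+201A) is not in the set, so it stays attached to the word.
low_quote_kept = "‚Haus und Hof".split(WORD_SEPARATORS)

# Umlauts and ß are ordinary word characters here.
umlauts_kept = "Größe Äpfel".split(WORD_SEPARATORS)

p underscore_kept # ["foo_bar", "baz"]
p low_quote_kept  # ["‚Haus", "und", "Hof"]
p umlauts_kept    # ["Größe", "Äpfel"]
```

Note that the hyphen is part of the set, so a hyphenated compound would be split into its parts by this expression.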

Instance Method Summary

Methods inherited from Base

#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?

Instance Method Details

#tokenize(text) ⇒ Object



# File 'lib/kotoshu/language/tokenizer/german_tokenizer.rb', line 21

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Split on every separator character; adjacent separators yield empty strings
  raw_tokens = text.split(WORD_SEPARATORS)

  # Normalize each token, then drop the ones that should be skipped
  raw_tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
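The pipeline above (split, normalize, reject) can be sketched as a standalone module. The behavior of `Base#normalize` and `Base#skip_token?` is not shown on this page, so the versions below are hypothetical stand-ins that assume normalization trims whitespace and that empty tokens are skipped; the module name `TokenizerSketch` is also an assumption made for illustration.

```ruby
module TokenizerSketch
  # Separator set copied from GermanTokenizer::WORD_SEPARATORS.
  WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-·]/.freeze

  module_function

  # Hypothetical stand-in for Base#normalize (assumed: trims whitespace).
  def normalize(token)
    token.strip
  end

  # Hypothetical stand-in for Base#skip_token? (assumed: skips empty tokens).
  def skip_token?(token)
    token.empty?
  end

  # Same three steps as GermanTokenizer#tokenize.
  def tokenize(text)
    return [] if text.nil? || text.strip.empty?

    text.split(WORD_SEPARATORS)
        .map { |token| normalize(token) }
        .reject { |token| skip_token?(token) }
  end
end

# ", " produces an empty string between the comma and the space,
# which the reject step removes.
p TokenizerSketch.tokenize("Größe, foo_bar!") # ["Größe", "foo_bar"]
p TokenizerSketch.tokenize(nil)               # []
p TokenizerSketch.tokenize("   ")             # []
```

Because `WORD_SEPARATORS` matches one character at a time, runs of punctuation and whitespace produce empty strings during the split; under the assumptions above, the reject step is what removes them.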