Class: Kotoshu::Language::Tokenizer::GermanTokenizer
- Defined in: lib/kotoshu/language/tokenizer/german_tokenizer.rb
Overview
Tokenizer for German text.
Ported from LanguageTool’s GermanWordTokenizer.
Handles:
- Underscore as word character (not a separator)
- Single low quote (‚) as word character (not a separator)
- Umlauts (ä, ö, ü, ß)
The LanguageTool implementation adds two characters to the set of word characters: the underscore (_) and the single low quote (‚, U+201A, SINGLE LOW-9 QUOTATION MARK), which looks like a comma but is a distinct character.
Direct Known Subclasses
Constant Summary
- WORD_SEPARATORS =
  German-specific word separators (exclude underscore and single low quote)
  /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-·]/.freeze
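A minimal sketch of how this separator set behaves when passed to String#split. The constant below is a local copy of the documented pattern (with `#` escaped, which leaves the character set unchanged); the sample strings are made up for illustration.

```ruby
# Local copy of the documented pattern, for illustration only.
# `#` is escaped so the literal cannot be read as interpolation.
WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@\#$%^&*+\-·]/.freeze

# Underscore is not in the set, so it stays inside the token.
underscore = "snake_case Wort".split(WORD_SEPARATORS)

# The single low quote (U+201A) is not in the set either.
low_quote = "\u201AHallo Welt".split(WORD_SEPARATORS)

# The ASCII hyphen is a separator, so hyphenated compounds are split.
hyphen = "Donau-Dampfschiff".split(WORD_SEPARATORS)

# Adjacent separators (", ") leave an empty string between them.
umlauts = "Größe, Übermut".split(WORD_SEPARATORS)

p underscore # ["snake_case", "Wort"]
p low_quote  # ["‚Hallo", "Welt"]
p hyphen     # ["Donau", "Dampfschiff"]
p umlauts    # ["Größe", "", "Übermut"]
```

The empty string in the last result is why a raw split is not enough on its own: empty fragments have to be filtered out afterwards.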
Instance Method Summary
Methods inherited from Base
#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?
Instance Method Details
#tokenize(text) ⇒ Object
    # File 'lib/kotoshu/language/tokenizer/german_tokenizer.rb', line 21

    def tokenize(text)
      return [] if text.nil? || text.strip.empty?

      # Split on word boundaries
      raw_tokens = text.split(WORD_SEPARATORS)

      # Filter and normalize
      raw_tokens
        .map { |token| normalize(token) }
        .reject { |token| skip_token?(token) }
    end
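The following self-contained sketch shows the split, normalize, reject pipeline end to end. The `TokenizerSketch` module and its `Base` class are hypothetical stand-ins: the real `#normalize` and `#skip_token?` are inherited from Kotoshu's Base tokenizer and are not shown on this page, so here they are assumed to strip whitespace and to skip empty strings.

```ruby
module TokenizerSketch
  # Hypothetical stand-in for the inherited Base tokenizer.
  # Assumption: the real methods may do more than this.
  class Base
    def normalize(token)
      token.strip
    end

    def skip_token?(token)
      token.empty?
    end
  end

  class GermanTokenizer < Base
    # Local copy of the documented pattern (`#` escaped, same character set).
    WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@\#$%^&*+\-·]/.freeze

    def tokenize(text)
      # Nil, empty, and whitespace-only input all yield no tokens.
      return [] if text.nil? || text.strip.empty?

      text.split(WORD_SEPARATORS)
          .map { |token| normalize(token) }
          .reject { |token| skip_token?(token) }
    end
  end
end

tokenizer = TokenizerSketch::GermanTokenizer.new
tokens = tokenizer.tokenize("Die Größe von foo_bar, bitte!")
p tokens                    # ["Die", "Größe", "von", "foo_bar", "bitte"]
p tokenizer.tokenize(nil)   # []
p tokenizer.tokenize("   ") # []
```

With these stand-ins, the empty fragment produced by the adjacent comma and space is dropped by the reject step, and `foo_bar` survives as a single token because the underscore is not a separator.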