Class: Kotoshu::Language::Tokenizer::RussianTokenizer
- Defined in:
- lib/kotoshu/language/tokenizer/russian_tokenizer.rb
Overview
Tokenizer for Russian text.
Ported from LanguageTool’s RussianWordTokenizer.
Handles:
- Apostrophe as word character
- Dot as word character (for abbreviations)
- Special abbreviations: б/у (second-hand), б/н (without number)
- Spaced dots: " .. ", " . "
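The first three points can be seen by splitting directly on the separator class. The following is a minimal sketch, not the tokenizer itself: the pattern is the one listed under WORD_SEPARATORS on this page, and `naive_split` is a hypothetical helper introduced only for illustration.

```ruby
# Separator class as listed under WORD_SEPARATORS on this page.
# (`#` is escaped here only to rule out any interpolation ambiguity;
# the set of matched characters is the same.)
WORD_SEPARATORS = /[\s"()\[\]{}<>,;:!?\\\/|`~@\#$%^&*+\-·]/.freeze

# Hypothetical helper: split on separators and drop the empty strings
# that appear between two adjacent separators.
def naive_split(text)
  text.split(WORD_SEPARATORS).reject(&:empty?)
end

# Apostrophe and dot are not separators, so they stay inside the tokens.
p naive_split("т.е. д'Артаньян")

# "/" is a separator, so without special handling б/у is cut in two.
p naive_split("телефон б/у")
```

The second call is the reason the abbreviations need to be protected before the split.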
Direct Known Subclasses
Constant Summary
- WORD_SEPARATORS =
Russian-specific word separators (exclude apostrophe and dot)
/[\s"()\[\]{}<>,;:!?\\\/|`~@#$%^&*+\-·]/.freeze
- ABBREVIATION_PLACEHOLDERS =
Special abbreviations that should not be split. Non-printing characters are used as placeholders.
{
  "б/у" => "\u0001\u0001SOCR_BU\u0001\u0001",
  "б/н" => "\u0001\u0001SOCR_BN\u0001\u0001"
}.freeze
- PLACEHOLDER_RESTORE =
Reverse placeholders for restoration
{
  "\u0001\u0001SOCR_BU\u0001\u0001" => "б/у",
  "\u0001\u0001SOCR_BN\u0001\u0001" => "б/н",
  "\u0001\u0001SP_DDOT_SP\u0001\u0001" => " .. ",
  "\u0001\u0001SP_DOT_SP\u0001\u0001" => " . ",
  "\u0001\u0001SP_DOT\u0001\u0001" => "."
}.freeze
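The two abbreviation entries of PLACEHOLDER_RESTORE are the inverse of ABBREVIATION_PLACEHOLDERS; only the three dot entries are new. As a sketch of that relationship, the restore table could be derived rather than written out. `DOT_RESTORE`, `DERIVED_RESTORE` and `restore_placeholders` are hypothetical names, and nothing here is claimed to match how the class builds or applies its tables.

```ruby
# Copied from the Constant Summary.
ABBREVIATION_PLACEHOLDERS = {
  "б/у" => "\u0001\u0001SOCR_BU\u0001\u0001",
  "б/н" => "\u0001\u0001SOCR_BN\u0001\u0001"
}.freeze

# Dot-related entries, copied from PLACEHOLDER_RESTORE.
DOT_RESTORE = {
  "\u0001\u0001SP_DDOT_SP\u0001\u0001" => " .. ",
  "\u0001\u0001SP_DOT_SP\u0001\u0001" => " . ",
  "\u0001\u0001SP_DOT\u0001\u0001" => "."
}.freeze

# Assumption: derive the restore table by inverting the forward map
# and merging in the dot entries.
DERIVED_RESTORE = ABBREVIATION_PLACEHOLDERS.invert.merge(DOT_RESTORE).freeze

# Hypothetical helper: swap every known placeholder in a token back
# to its original text. Tokens without placeholders pass through unchanged.
def restore_placeholders(token, table = DERIVED_RESTORE)
  table.reduce(token) do |acc, (placeholder, original)|
    acc.gsub(placeholder, original)
  end
end
```

Deriving the table keeps the forward and reverse maps from drifting apart, at the cost of making the full set of placeholders less visible at a glance.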
Instance Method Summary
Methods inherited from Base
#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?
Instance Method Details
#tokenize(text) ⇒ Object
# File 'lib/kotoshu/language/tokenizer/russian_tokenizer.rb', line 35

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Replace abbreviations with placeholders
  text = replace_abbreviations(text)

  # Split on word boundaries
  raw_tokens = text.split(WORD_SEPARATORS)

  # Restore abbreviations and filter
  raw_tokens
    .map { |token| restore_abbreviations(token) }
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
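The method calls `replace_abbreviations` and `restore_abbreviations`, which are not shown on this page, and `normalize` and `skip_token?`, which are inherited from Base. To make the protect, split, restore, filter sequence concrete, here is a self-contained sketch in which all four helper bodies are assumptions. `SketchRussianTokenizer` is a hypothetical stand-in, not the real class, and it handles only the two abbreviations, not the spaced-dot placeholders.

```ruby
# Hypothetical stand-in for the documented tokenizer. Only the shape of
# #tokenize follows the source; every helper body below is an assumption.
class SketchRussianTokenizer
  # Copied from the Constant Summary (`#` escaped, same characters).
  WORD_SEPARATORS = /[\s"()\[\]{}<>,;:!?\\\/|`~@\#$%^&*+\-·]/.freeze

  ABBREVIATION_PLACEHOLDERS = {
    "б/у" => "\u0001\u0001SOCR_BU\u0001\u0001",
    "б/н" => "\u0001\u0001SOCR_BN\u0001\u0001"
  }.freeze

  def tokenize(text)
    return [] if text.nil? || text.strip.empty?

    replace_abbreviations(text)
      .split(WORD_SEPARATORS)
      .map { |token| restore_abbreviations(token) }
      .map { |token| normalize(token) }
      .reject { |token| skip_token?(token) }
  end

  private

  # Assumed: plain substring replacement of each abbreviation.
  def replace_abbreviations(text)
    ABBREVIATION_PLACEHOLDERS.reduce(text) { |acc, (abbr, ph)| acc.gsub(abbr, ph) }
  end

  # Assumed: the inverse substitution, applied per token.
  def restore_abbreviations(token)
    ABBREVIATION_PLACEHOLDERS.reduce(token) { |acc, (abbr, ph)| acc.gsub(ph, abbr) }
  end

  # Assumed stand-in for Base#normalize: trim surrounding whitespace.
  def normalize(token)
    token.strip
  end

  # Assumed stand-in for Base#skip_token?: drop empty strings left by the split.
  def skip_token?(token)
    token.empty?
  end
end

p SketchRussianTokenizer.new.tokenize("Продам б/у телефон, т.е. д'Артаньян!")
```

The placeholders survive the split because neither the control character U+0001 nor the letters and underscore in names such as SOCR_BU belong to WORD_SEPARATORS.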