Class: Kotoshu::Components::Tokenizer Abstract
- Inherits: Object
  - Object
  - Kotoshu::Components::Tokenizer
- Defined in: lib/kotoshu/components/tokenizer.rb
Overview
This class is abstract. Subclasses must implement #tokenize.
Base class for tokenizers.
Tokenizers split text into individual tokens (words, punctuation). Different languages use different tokenization strategies:
- Latin scripts: Whitespace + punctuation
- CJK: Morphological analysis
- German: Compound word splitting
- RTL: Right-to-left text handling
Direct Known Subclasses
Instance Method Summary
- #tokenize(text) ⇒ Array<Hash> (abstract): Split text into tokens.
- #tokenize_to_strings(text) ⇒ Array<String>: Tokenize and return just the token strings.
Instance Method Details
#tokenize(text) ⇒ Array<Hash>
This method is abstract. Subclasses must implement it.
Split text into tokens.
Each token is a hash with:
- :token (String) - The token text
- :position (Integer) - Character position in original text
- :length (Integer) - Token length in characters
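As an illustration of the required keys, here is a minimal sketch of a subclass. SimpleTokenizer and its regular expression are hypothetical and not part of Kotoshu; the base class is re-declared as a stand-in so the snippet runs without the gem. Positions are assumed to be zero-based character offsets, which the source does not state explicitly.

```ruby
# Stand-in for the real base class so this sketch runs on its own.
module Kotoshu
  module Components
    class Tokenizer
      def tokenize(text)
        raise NotImplementedError, "#{self.class} must implement #tokenize"
      end
    end
  end
end

# Hypothetical subclass: emits runs of word characters and single
# punctuation marks, skipping whitespace.
class SimpleTokenizer < Kotoshu::Components::Tokenizer
  def tokenize(text)
    tokens = []
    text.scan(/\w+|[^\w\s]/) do |match|
      # MatchData#begin returns a character offset, not a byte offset.
      start = Regexp.last_match.begin(0)
      tokens << { token: match, position: start, length: match.length }
    end
    tokens
  end
end

tokens = SimpleTokenizer.new.tokenize("Hello, world!")
tokens.each { |t| puts t.inspect }
```

Each element of tokens carries exactly the three required keys, so "Hello" is reported at position 0 with length 5 and the comma at position 5 with length 1.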
Additional keys may be added by subclasses:
- :pos_tag (String) - Part of speech tag
- :lemma (String) - Base form / lemma
- :compound_part (Boolean) - Whether this is a compound word part
- :script (Symbol) - Script type for multilingual text
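A subclass that adds an optional key might look like the sketch below. ScriptTaggingTokenizer, its detect_script helper, and the symbol values :han, :kana, :latin and :other are all assumptions made for illustration; the source only says that :script is a Symbol. The base class is again a stand-in so the snippet is self-contained.

```ruby
# Stand-in for the real base class so this sketch runs on its own.
module Kotoshu
  module Components
    class Tokenizer
      def tokenize(text)
        raise NotImplementedError, "#{self.class} must implement #tokenize"
      end
    end
  end
end

# Hypothetical subclass: splits on whitespace and tags each token
# with an assumed :script value.
class ScriptTaggingTokenizer < Kotoshu::Components::Tokenizer
  def tokenize(text)
    tokens = []
    text.scan(/\S+/) do |match|
      tokens << {
        token: match,
        position: Regexp.last_match.begin(0),
        length: match.length,
        script: detect_script(match) # optional key on top of the required three
      }
    end
    tokens
  end

  private

  # The symbol names returned here are illustrative, not Kotoshu's.
  def detect_script(token)
    case token
    when /\p{Han}/ then :han
    when /\p{Hiragana}|\p{Katakana}/ then :kana
    when /\p{Latin}/ then :latin
    else :other
    end
  end
end

# "\u6771\u4EAC" is the two-character Han string for Tokyo.
tagged = ScriptTaggingTokenizer.new.tokenize("Tokyo \u6771\u4EAC 123")
tagged.each { |t| puts t.inspect }
```

Because the required keys are still present, callers that only read :token, :position and :length keep working when a subclass adds extra keys.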
# File 'lib/kotoshu/components/tokenizer.rb', line 43
def tokenize(text)
  raise NotImplementedError, "#{self.class} must implement #tokenize"
end
#tokenize_to_strings(text) ⇒ Array<String>
Tokenize and return just the token strings.
Convenience method for when you only need the text content.
# File 'lib/kotoshu/components/tokenizer.rb', line 53
def tokenize_to_strings(text)
  tokenize(text).map { |t| t[:token] }
end
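The following sketch shows how the convenience method behaves on a subclass and on the abstract base. The two base-class methods are copied from the listings on this page as a stand-in so the snippet runs without the gem; WordTokenizer is a hypothetical subclass.

```ruby
# Stand-in mirroring the two methods documented on this page.
module Kotoshu
  module Components
    class Tokenizer
      def tokenize(text)
        raise NotImplementedError, "#{self.class} must implement #tokenize"
      end

      def tokenize_to_strings(text)
        tokenize(text).map { |t| t[:token] }
      end
    end
  end
end

# Hypothetical subclass that only emits runs of word characters.
class WordTokenizer < Kotoshu::Components::Tokenizer
  def tokenize(text)
    tokens = []
    text.scan(/\w+/) do |match|
      tokens << { token: match, position: Regexp.last_match.begin(0), length: match.length }
    end
    tokens
  end
end

# Only the :token values are returned; positions and lengths are dropped.
strings = WordTokenizer.new.tokenize_to_strings("split this text")
puts strings.inspect

# Calling through the abstract base delegates to #tokenize, which raises.
# NotImplementedError is a ScriptError, so a bare `rescue` would not catch it.
base_error =
  begin
    Kotoshu::Components::Tokenizer.new.tokenize_to_strings("anything")
    nil
  rescue NotImplementedError => e
    e.message
  end
puts base_error
```

Since tokenize_to_strings is defined once on the base class in terms of #tokenize, a subclass gets it for free by implementing #tokenize alone.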