Class: Kotoshu::Components::WhitespaceTokenizer
- Defined in: lib/kotoshu/components/whitespace_tokenizer.rb
Overview
Whitespace-based tokenizer for Latin-script languages.
Splits text on whitespace and emits each punctuation character as a separate token, except that apostrophes stay attached to the word they appear in. Suitable for languages with space-separated words (English, French, German, etc.).
This is a simple tokenizer that works well for most Latin-script languages. For more advanced tokenization (contractions, compounds), use language-specific tokenizers.
Direct Known Subclasses
Constant Summary
- TOKEN_PATTERN = /[\w']+|[^\w\s]/.freeze
  Regex pattern for matching tokens (words or punctuation).
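As a quick illustration of what this pattern matches, here is a minimal sketch that applies the same regex with plain `String#scan`, without loading the library. The constant name `SKETCH_TOKEN_PATTERN`, the result name `SAMPLE_TOKENS`, and the sample sentence are assumptions made for the example.

```ruby
# Same regex as the documented TOKEN_PATTERN, copied here so the sketch
# runs on its own (SKETCH_TOKEN_PATTERN is a name invented for this example).
SKETCH_TOKEN_PATTERN = /[\w']+|[^\w\s]/

# First alternative: a run of word characters and apostrophes.
# Second alternative: one character that is neither word character nor whitespace.
SAMPLE_TOKENS = "Don't stop, world!".scan(SKETCH_TOKEN_PATTERN)

p SAMPLE_TOKENS
# prints ["Don't", "stop", ",", "world", "!"]
```

The apostrophe in "Don't" stays inside the word because the first alternative is tried before the punctuation alternative.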
Instance Method Summary
- #initialize(pattern: TOKEN_PATTERN) ⇒ WhitespaceTokenizer (constructor)
  Create a new whitespace tokenizer.
- #pattern ⇒ Regexp
  Get the token pattern used by this tokenizer.
- #punctuation?(char) ⇒ Boolean
  Check if a character is punctuation.
- #tokenize(text) ⇒ Array<Hash>
  Split text into tokens.
- #word_char?(char) ⇒ Boolean
  Check if a character is a word character.
Methods inherited from Tokenizer
Constructor Details
#initialize(pattern: TOKEN_PATTERN) ⇒ WhitespaceTokenizer
Create a new whitespace tokenizer.
    # File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 36
    def initialize(pattern: TOKEN_PATTERN)
      @pattern = pattern
    end
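The `pattern:` keyword accepts any Regexp, which matters for accented text: in Ruby, `\w` matches only ASCII letters, digits, and underscore, so the default pattern splits a word such as "café" at the accented letter. The sketch below compares the default regex with a hypothetical Unicode-aware alternative using plain `String#scan`. The name `UNICODE_TOKEN_PATTERN` and the regex itself are assumptions for illustration, not part of the library.

```ruby
# Copy of the documented default pattern, so the sketch is self-contained.
DEFAULT_TOKEN_PATTERN = /[\w']+|[^\w\s]/

# Hypothetical replacement: \p{L} and \p{N} match Unicode letters and digits.
UNICODE_TOKEN_PATTERN = /[\p{L}\p{N}_']+|[^\p{L}\p{N}_\s]/

DEFAULT_SPLIT = "café".scan(DEFAULT_TOKEN_PATTERN)
UNICODE_SPLIT = "café".scan(UNICODE_TOKEN_PATTERN)

p DEFAULT_SPLIT  # the accented letter is split off as its own token
p UNICODE_SPLIT  # the word stays in one piece
```

Assuming the constructor stores the pattern as shown in the listing above, such a regex would be passed as `WhitespaceTokenizer.new(pattern: UNICODE_TOKEN_PATTERN)`.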
Instance Method Details
#pattern ⇒ Regexp
Get the token pattern used by this tokenizer.
    # File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 75
    def pattern
      @pattern
    end
#punctuation?(char) ⇒ Boolean
Check if a character is punctuation.
    # File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 91
    def punctuation?(char)
      char.match?(/[^\w\s]/)
    end
#tokenize(text) ⇒ Array<Hash>
Split text into tokens.
Each token is a hash with:
- :token (String) - The token text
- :position (Integer) - Character position in original text
- :length (Integer) - Token length in characters
    # File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 49
    def tokenize(text)
      return [] if text.nil? || text.empty?

      tokens = []
      position = 0

      # Find all matches
      text.scan(@pattern) do |match|
        match_str = match.is_a?(Array) ? match.first : match
        start_pos = text.index(match_str, position)

        tokens << {
          token: match_str,
          position: start_pos,
          length: match_str.length
        }

        position = start_pos + match_str.length
      end

      tokens
    end
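To make the shape of the returned hashes concrete, here is a standalone sketch that produces the same three fields for a short string. It is not the library's code: the method name `tokenize_with_offsets` is invented for the example, and it reads offsets from `Regexp.last_match` rather than searching with `String#index`, a swapped-in technique that should give the same positions for the default pattern.

```ruby
# Standalone sketch (hypothetical helper, not part of Kotoshu): returns
# an array of { token:, position:, length: } hashes for the given text.
def tokenize_with_offsets(text, pattern = /[\w']+|[^\w\s]/)
  return [] if text.nil? || text.empty?

  tokens = []
  text.scan(pattern) do
    match = Regexp.last_match      # MatchData for the current hit
    tokens << {
      token: match[0],             # full matched text
      position: match.begin(0),    # character offset in the original string
      length: match[0].length
    }
  end
  tokens
end

SKETCH_TOKENS = tokenize_with_offsets("Hi, hi!")
SKETCH_TOKENS.each { |t| puts t.inspect }
```

For "Hi, hi!" this yields four tokens at positions 0, 2, 4, and 6; the space at position 3 produces no token, which is why positions are reported explicitly instead of being inferred from token order.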
#word_char?(char) ⇒ Boolean
Check if a character is a word character.
    # File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 83
    def word_char?(char)
      char.match?(/[\w]/)
    end
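The two character predicates can be compared side by side with a small sketch. The helpers below are standalone stand-ins that reuse the regexes from the listings; the names `sketch_word_char?` and `sketch_punctuation?` and the sample characters are assumptions for illustration.

```ruby
# Stand-ins using the same regexes as #word_char? and #punctuation?.
def sketch_word_char?(char)
  char.match?(/[\w]/)
end

def sketch_punctuation?(char)
  char.match?(/[^\w\s]/)
end

# Each row: [character, word character?, punctuation?]
CLASSIFIED = ["a", "_", "7", " ", "!", "'"].map do |ch|
  [ch, sketch_word_char?(ch), sketch_punctuation?(ch)]
end

CLASSIFIED.each { |row| p row }
```

Two details stand out: the underscore counts as a word character, and the apostrophe counts as punctuation here even though TOKEN_PATTERN keeps it inside word tokens. Whitespace is neither.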