Class: Kotoshu::Language::Tokenizer::Base
- Inherits: Object
  - Object
    - Kotoshu::Language::Tokenizer::Base
- Defined in: lib/kotoshu/language/tokenizer/base.rb
Overview
Abstract base class for tokenizers.
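Uses the Strategy pattern so that each language can supply its own tokenization approach.
Subclasses must implement #tokenize. #word_boundary_regex also raises NotImplementedError in the base class, so subclasses that rely on #word_char? must override it as well.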
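A minimal sketch of the Strategy arrangement described above. `SketchTokenizerBase`, `WhitespaceTokenizer`, and `count_tokens` are hypothetical names used only so the example runs without the gem; the real subclasses are presumably more elaborate.

```ruby
# Stand-in for Kotoshu::Language::Tokenizer::Base (assumption: only the
# abstract #tokenize contract is reproduced here).
class SketchTokenizerBase
  def tokenize(text)
    raise NotImplementedError, "#{self.class} must implement #tokenize"
  end
end

# Hypothetical concrete strategy: split on runs of whitespace.
class WhitespaceTokenizer < SketchTokenizerBase
  def tokenize(text)
    text.to_s.split
  end
end

# The caller depends only on the #tokenize interface, so any tokenizer
# subclass can be passed in.
def count_tokens(tokenizer, text)
  tokenizer.tokenize(text).length
end

puts count_tokens(WhitespaceTokenizer.new, "le chat noir") # prints 3
```

Calling `SketchTokenizerBase.new.tokenize("x")` directly raises NotImplementedError, which mirrors the abstract behavior of the base class.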
Direct Known Subclasses
FrenchTokenizer, GermanTokenizer, JapaneseTokenizer, LatinTokenizer, PortugueseTokenizer, RussianTokenizer, SpanishTokenizer
Instance Method Summary
- #normalize(token) ⇒ String
  Normalize a token.
- #skip_token?(token) ⇒ Boolean
  Check if a token should be skipped.
- #tokenize(text) ⇒ Array<String>
  Tokenize text into words.
- #tokenize_with_positions(text) ⇒ Array<Hash>
  Tokenize text with positions.
- #word_boundary_regex ⇒ Regexp
  Get word boundary regex for this tokenizer.
- #word_char?(char) ⇒ Boolean
  Check if a character is a word character.
Instance Method Details
#normalize(token) ⇒ String
Normalize a token.
The base implementation returns the token unchanged. Subclasses can override this for language-specific normalization.
    # File 'lib/kotoshu/language/tokenizer/base.rb', line 114
    def normalize(token)
      token
    end
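As an illustration of a language-specific override, the sketch below normalizes to Unicode NFC and lowercases. `NormalizeSketchBase` and `LowercasingTokenizer` are hypothetical names, and the choice of NFC plus lowercasing is an assumption, not something the library is known to do.

```ruby
# Stand-in for the base class: identity normalization, as in the listing.
class NormalizeSketchBase
  def normalize(token)
    token
  end
end

# Hypothetical override: compose combining marks (NFC), then lowercase.
class LowercasingTokenizer < NormalizeSketchBase
  def normalize(token)
    token.unicode_normalize(:nfc).downcase
  end
end

tokenizer = LowercasingTokenizer.new
# "E" followed by a combining acute accent becomes the single character "é".
puts tokenizer.normalize("E\u0301COLE") # prints école
```

Normalizing before lookup means that visually identical spellings compare equal regardless of how the input encoded its accents.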
#skip_token?(token) ⇒ Boolean
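Check if a token should be skipped.
The base implementation skips empty tokens, tokens made up only of digits, and single characters that are not letters. Subclasses can override this for language-specific filtering.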
    # File 'lib/kotoshu/language/tokenizer/base.rb', line 124
    def skip_token?(token)
      return true if token.empty?
      return true if token.match?(/^\d+$/) # Pure numbers
      return true if token.length < 2 && token.match?(/^[^\p{L}]$/)
      false
    end
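To make the three rules concrete, the sketch below copies the logic into a stand-in class (`SkipSketch` is a hypothetical name) and runs it on a few sample tokens.

```ruby
# Stand-in class carrying a copy of the base-class #skip_token? logic.
class SkipSketch
  def skip_token?(token)
    return true if token.empty?
    return true if token.match?(/^\d+$/) # Pure numbers
    return true if token.length < 2 && token.match?(/^[^\p{L}]$/)
    false
  end
end

checker = SkipSketch.new
checker.skip_token?("")     # true:  empty token
checker.skip_token?("2024") # true:  digits only
checker.skip_token?("-")    # true:  single non-letter character
checker.skip_token?("a")    # false: single letter
checker.skip_token?("3d")   # false: digits mixed with a letter
checker.skip_token?("chat") # false: ordinary word

["", "2024", "-", "a", "3d", "chat"].each do |t|
  puts "#{t.inspect} => #{checker.skip_token?(t)}"
end
```

Note that in Ruby `^` and `$` anchor at line boundaries rather than string boundaries, so a token such as "12\nabc" would also match the digits rule; `\A` and `\z` would be the stricter anchors if tokens could ever contain newlines.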
#tokenize(text) ⇒ Array<String>
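Tokenize text into words.
Abstract: the base implementation raises NotImplementedError, so every subclass must provide its own.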
    # File 'lib/kotoshu/language/tokenizer/base.rb', line 25
    def tokenize(text)
      raise NotImplementedError, "#{self.class} must implement #tokenize"
    end
#tokenize_with_positions(text) ⇒ Array<Hash>
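Tokenize text with positions.
Returns one Hash per token with the keys :token, :start, :end, :line, and :column. :start and :end are zero-based character offsets into the text (:end is exclusive), while :line and :column are one-based. Returns an empty Array for nil or empty input.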
    # File 'lib/kotoshu/language/tokenizer/base.rb', line 35
    def tokenize_with_positions(text)
      return [] if text.nil?
      return [] if text.empty?

      tokens = []
      line = 1
      column = 1
      position = 0

      while position < text.length
        # Skip whitespace
        while position < text.length && text[position].match?(/\s/)
          if text[position] == "\n"
            line += 1
            column = 1
          else
            column += 1
          end
          position += 1
        end

        break if position >= text.length

        # Find token
        start_pos = position
        start_line = line
        start_column = column
        token_text = extract_next_token(text, position)

        if token_text
          tokens << {
            token: token_text,
            start: start_pos,
            end: start_pos + token_text.length,
            line: start_line,
            column: start_column
          }
          token_text.each_char do |char|
            column += 1
            position += 1
            if char == "\n"
              line += 1
              column = 1
            end
          end
        else
          position += 1
          column += 1
        end
      end

      tokens
    end
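The sketch below shows what the returned hashes look like for a two-line input. It copies the loop into a stand-in class (`PositionSketchTokenizer`, a hypothetical name). The `extract_next_token` helper is not shown on this page, so the version here, which takes the next run of non-whitespace characters, is purely an assumption.

```ruby
class PositionSketchTokenizer
  # Copy of the scanning loop from the listing above.
  def tokenize_with_positions(text)
    return [] if text.nil? || text.empty?

    tokens = []
    line = 1
    column = 1
    position = 0

    while position < text.length
      # Skip whitespace, tracking line and column as we go.
      while position < text.length && text[position].match?(/\s/)
        if text[position] == "\n"
          line += 1
          column = 1
        else
          column += 1
        end
        position += 1
      end
      break if position >= text.length

      start_pos = position
      start_line = line
      start_column = column
      token_text = extract_next_token(text, position)

      if token_text
        tokens << { token: token_text, start: start_pos,
                    end: start_pos + token_text.length,
                    line: start_line, column: start_column }
        # Advance past the token one character at a time.
        token_text.each_char do |char|
          column += 1
          position += 1
          if char == "\n"
            line += 1
            column = 1
          end
        end
      else
        position += 1
        column += 1
      end
    end

    tokens
  end

  private

  # Hypothetical helper (assumption): the next run of non-whitespace.
  def extract_next_token(text, position)
    text[position..-1][/\A\S+/]
  end
end

PositionSketchTokenizer.new.tokenize_with_positions("Hi there\nyou").each do |t|
  puts "#{t[:token]} start=#{t[:start]} end=#{t[:end]} line=#{t[:line]} column=#{t[:column]}"
end
# Hi start=0 end=2 line=1 column=1
# there start=3 end=8 line=1 column=4
# you start=9 end=12 line=2 column=1
```

Because :end is exclusive, `text[t[:start]...t[:end]]` recovers each token from the original string.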
#word_boundary_regex ⇒ Regexp
Get word boundary regex for this tokenizer.
Subclasses must override this to define word boundaries; the base implementation raises NotImplementedError.
    # File 'lib/kotoshu/language/tokenizer/base.rb', line 104
    def word_boundary_regex
      raise NotImplementedError, "#{self.class} must implement #word_boundary_regex"
    end
#word_char?(char) ⇒ Boolean
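Check if a character is a word character.
Tests the character against #word_boundary_regex, so it raises NotImplementedError unless the subclass overrides that method.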
    # File 'lib/kotoshu/language/tokenizer/base.rb', line 95
    def word_char?(char)
      match?(word_boundary_regex, char)
    end
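A minimal sketch of how #word_boundary_regex and #word_char? might fit together in a subclass. `BoundarySketchBase` and `ApostropheAwareTokenizer` are hypothetical names. The two-argument `match?` helper used by the real method is not shown on this page, so the sketch substitutes `String#match?`, and it assumes the regex matches characters that belong to a word.

```ruby
class BoundarySketchBase
  def word_boundary_regex
    raise NotImplementedError, "#{self.class} must implement #word_boundary_regex"
  end

  # Assumption: stands in for the library's own match? helper.
  def word_char?(char)
    char.match?(word_boundary_regex)
  end
end

# Hypothetical subclass: letters, combining marks, and the apostrophe
# count as word characters (useful for forms such as "l'eau").
class ApostropheAwareTokenizer < BoundarySketchBase
  def word_boundary_regex
    /[\p{L}\p{M}']/
  end
end

tokenizer = ApostropheAwareTokenizer.new
puts tokenizer.word_char?("\u00e9") # true  (é is a letter)
puts tokenizer.word_char?("'")      # true  (apostrophe allowed)
puts tokenizer.word_char?(" ")      # false (whitespace)
```

Keeping the character class in one overridable method lets each language decide which punctuation stays inside a word without touching the scanning loop.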