Class: Kotoshu::Language::Tokenizer::LatinTokenizer
Defined in: lib/kotoshu/language/tokenizer/latin_tokenizer.rb
Overview
Tokenizer for Latin-script languages.
Base tokenizer for English, French, German, Spanish, Portuguese, and other European languages using Latin script.
Handles:
- Standard word boundaries (whitespace, punctuation)
- Apostrophes within words (contractions, elisions)
- Hyphenated words
- Numbers with units
Subclasses can override for language-specific handling.
Constant Summary
- WORD_CHARS =
  Latin word characters including accented characters
    "a-zA-Zà-ÿ0-9'"
- WORD_SEPARATORS =
  Punctuation that separates words
    /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-=_]/
- CONTRACTIONS =
  Contractions that should stay together
    %w[
      I'm I'd I've I'll you're you'd you've you'll
      he's he'd he'll she's she'd she'll it's it'd
      we're we'd we've we'll they're they'd they've they'll
      that's that'd that'll who's who'd who'll what's what'd what'll
      where's where'd when's when'd why's why'd how's how'd
      can't won't don't shouldn't couldn't wouldn't didn't
      isn't aren't wasn't weren't hasn't haven't hadn't
      doesn't do doesn't didn't mightn't mustn't shan't shouldn't wouldn't
    ].freeze
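The effect of the separator pattern can be sketched with plain String#split. This is an illustration only: SEPARATORS is a stand-in constant defined for the example (not part of Kotoshu), and the sample sentence is made up. Because the apostrophe is not in the pattern, contractions and elisions survive the split on their own, while the hyphen is listed as a separator, so a hyphenated word comes back as its parts.

```ruby
# Stand-in copy of the documented separator pattern. '#' and '$' are escaped
# here only to rule out any interpolation ambiguity in the regex literal.
SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@\#\$%^&*+\-=_]/

text = "Don't split well-known words, l'heure!"

# Adjacent separators (", " here) leave an empty string between them;
# String#split drops only trailing empty strings.
raw = text.split(SEPARATORS)

p raw
```

The empty string in the middle of the result is why a later filtering step is needed before the tokens are usable.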
Instance Method Summary
- #normalize(token) ⇒ String
  Normalize token.
- #skip_token?(token) ⇒ Boolean
  Check if token should be skipped.
- #tokenize(text) ⇒ Array<String>
  Tokenize text into words.
- #word_boundary_regex ⇒ Regexp
  Get word boundary regex.
Methods inherited from Base
#tokenize_with_positions, #word_char?
Instance Method Details
#normalize(token) ⇒ String
Normalize token.
Subclasses can override for language-specific normalization.
    # File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 65
    def normalize(token)
      token.strip
    end
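As a rough illustration of the override point mentioned above, the sketch below uses two hypothetical classes, StandInTokenizer and FrenchStandIn, neither of which exists in Kotoshu. The first copies only the documented #normalize; the second shows one way a subclass might layer language-specific cleanup on top of it. The choice of cleanup (folding a typographic apostrophe and lowercasing) is an assumption made for the example.

```ruby
# Hypothetical stand-in for the documented class, reduced to #normalize.
class StandInTokenizer
  def normalize(token)
    token.strip
  end
end

# Hypothetical subclass: reuses the inherited strip, then folds the
# typographic apostrophe to ASCII and lowercases the token.
class FrenchStandIn < StandInTokenizer
  def normalize(token)
    super.tr("’", "'").downcase
  end
end

result = FrenchStandIn.new.normalize("  L’Heure ")
p result
```

Calling super first keeps the base behavior intact, so the subclass only adds steps rather than replacing them.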
#skip_token?(token) ⇒ Boolean
Check if token should be skipped.
    # File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 73
    def skip_token?(token)
      return true if super

      # Skip pure numbers
      return true if token.match?(/^\d+$/)

      # Skip single characters (unless a word)
      return true if token.length == 1 && token.match?(/[^a-zA-Zà-ÿ]/)

      # Skip empty tokens
      return true if token.empty?

      # Skip tokens with no letters
      return true unless token.match?(/[a-zA-Zà-ÿ]/)

      false
    end
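To see which tokens these checks reject, here is a stand-alone sketch. The method name skip_token_sketch? and the sample tokens are made up for the example, and the call to super is left out because Base#skip_token? is not shown on this page, so the real method may skip more than this.

```ruby
# Hypothetical stand-alone copy of the documented checks, without `super`.
def skip_token_sketch?(token)
  return true if token.match?(/^\d+$/)                              # pure numbers
  return true if token.length == 1 && token.match?(/[^a-zA-Zà-ÿ]/)  # lone non-letter
  return true if token.empty?                                       # empty tokens
  return true unless token.match?(/[a-zA-Zà-ÿ]/)                    # no letters at all
  false
end

samples = ["2024", "'", "", "a", "5kg", "'99", "café"]

# Partition the samples into skipped and kept tokens.
skipped, kept = samples.partition { |t| skip_token_sketch?(t) }

p skipped
p kept
```

Note that a number fused with a unit, such as "5kg", contains letters and is therefore kept, whereas a bare number is dropped.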
#tokenize(text) ⇒ Array<String>
Tokenize text into words.
    # File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 40
    def tokenize(text)
      return [] if text.nil? || text.strip.empty?

      # Split on word boundaries
      raw_tokens = text.split(WORD_SEPARATORS)

      # Filter and normalize
      raw_tokens
        .map { |token| normalize(token) }
        .reject { |token| skip_token?(token) }
    end
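The split, normalize, and reject steps can be tried end to end with a self-contained approximation. PipelineSketch is a hypothetical class written for this example, not the Kotoshu class itself: it omits the Base superclass, and its skip check is condensed to the final "no letters" rule, which for whitespace-free tokens also covers the numeric, single-character, and empty cases.

```ruby
# Hypothetical, self-contained approximation of the documented pipeline.
class PipelineSketch
  # '#' and '$' are escaped only to avoid interpolation ambiguity.
  SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@\#\$%^&*+\-=_]/
  LETTER = /[a-zA-Zà-ÿ]/

  def tokenize(text)
    return [] if text.nil? || text.strip.empty?

    text.split(SEPARATORS)
        .map { |token| token.strip }        # same as the documented #normalize
        .reject { |token| skip?(token) }
  end

  private

  # Condensed stand-in for #skip_token?: drop anything without a letter.
  def skip?(token)
    !token.match?(LETTER)
  end
end

tokens = PipelineSketch.new.tokenize("It's 5 kg, isn't it? Well-known: 42!")
p tokens
```

In this sketch the bare numbers and the empty strings left between adjacent separators are filtered out, the contractions stay whole, and "Well-known" is returned as two tokens because the hyphen is a separator.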
#word_boundary_regex ⇒ Regexp
Get word boundary regex.
    # File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 55
    def word_boundary_regex
      /[#{WORD_CHARS}]/
    end
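As written, the returned pattern is a character class built from WORD_CHARS, so it matches a single word character rather than the gap between words. The sketch below rebuilds the same pattern from a local copy of the constant; the variable names and the sample string are assumptions for illustration.

```ruby
# Local copy of the documented constant, used to rebuild the pattern.
WORD_CHARS = "a-zA-Zà-ÿ0-9'"

word_char = /[#{WORD_CHARS}]/    # same construction as #word_boundary_regex
word_run  = /[#{WORD_CHARS}]+/   # hypothetical variant: one or more word chars

matches_letter = "é".match?(word_char)
matches_hyphen = "-".match?(word_char)

# Scanning for runs of word characters extracts word-like spans,
# including digits and in-word apostrophes.
runs = "l'été 2024!".scan(word_run)

p matches_letter, matches_hyphen, runs
```

Unlike #tokenize, this pattern counts digits as word characters, so a scan based on it keeps bare numbers that the tokenizer would drop.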