Class: Kotoshu::Language::Tokenizer::LatinTokenizer

Inherits: Base

Inheritance: Object → Base → Kotoshu::Language::Tokenizer::LatinTokenizer

Defined in: lib/kotoshu/language/tokenizer/latin_tokenizer.rb

Overview

Tokenizer for Latin-script languages.

Base tokenizer for English, French, German, Spanish, Portuguese, and other European languages using Latin script.

Handles:

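Standard word boundaries (whitespace, punctuation)
Apostrophes within words (contractions, elisions), which stay inside the token
Hyphenated words, which are split at the hyphen because `-` is listed in WORD_SEPARATORS
Numbers with units: a fused form such as 10kg stays one token, while a purely numeric token is skipped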

Subclasses can override for language-specific handling.

Constant Summary

WORD_CHARS = Latin word characters including accented characters

"a-zA-Zà-ÿ0-9'"

WORD_SEPARATORS = Punctuation that separates words

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-=_]/

CONTRACTIONS = Contractions that should stay together

%w[
  I'm I'd I've I'll you're you'd you've you'll he's he'd he'll
  she's she'd she'll it's it'd we're we'd we've we'll they're
  they'd they've they'll that's that'd that'll who's who'd who'll
  what's what'd what'll where's where'd when's when'd why's why'd
  how's how'd can't won't don't shouldn't couldn't wouldn't didn't
  isn't aren't wasn't weren't hasn't haven't hadn't doesn't
  mightn't mustn't shan't
].freeze

Instance Method Summary

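#normalize(token) ⇒ String

Normalizes a token by stripping leading and trailing whitespace.

#skip_token?(token) ⇒ Boolean

Checks whether a token should be skipped (empty, purely numeric, or containing no letters).

#tokenize(text) ⇒ Array<String>

Splits text into word tokens.

#word_boundary_regex ⇒ Regexp

Returns a regex built from WORD_CHARS.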

Methods inherited from Base

#tokenize_with_positions, #word_char?

Instance Method Details

#normalize(token) ⇒ `String`

Normalizes a token by stripping leading and trailing whitespace.

Subclasses can override for language-specific normalization.

Parameters:

token (String) — Token to normalize

Returns:

(String) — Normalized token




# File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 65

def normalize(token)
  token.strip
end
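
As a rough illustration of the override hook, the sketch below uses a stand-in class instead of the real LatinTokenizer so that it runs without the Kotoshu gem. The names `SketchLatinTokenizer` and `SketchFrenchTokenizer`, and the lowercasing and apostrophe folding, are hypothetical and not part of the library.

```ruby
# Stand-in for LatinTokenizer, reduced to the documented #normalize.
class SketchLatinTokenizer
  def normalize(token)
    token.strip
  end
end

# Hypothetical subclass: reuses the parent's stripping via super, then
# lowercases and folds the typographic apostrophe into the ASCII one.
class SketchFrenchTokenizer < SketchLatinTokenizer
  def normalize(token)
    super.downcase.tr("’", "'")
  end
end

puts SketchFrenchTokenizer.new.normalize("  L’École ")
```

Calling `super` first keeps the base behavior (whitespace stripping) and layers the language-specific step on top of it.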

#skip_token?(token) ⇒ `Boolean`

Check if token should be skipped.

Parameters:

token (String) — Token to check

Returns:

(Boolean) — True if the token should be skipped

# File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 73

def skip_token?(token)
  return true if super

  # Skip pure numbers
  return true if token.match?(/^\d+$/)

  # Skip single characters that are not letters
  return true if token.length == 1 && token.match?(/[^a-zA-Zà-ÿ]/)

  # Skip empty tokens
  return true if token.empty?

  # Skip tokens with no letters
  return true unless token.match?(/[a-zA-Zà-ÿ]/)

  false
end
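
To see which tokens these checks reject, the following sketch copies them into a standalone module. It is an approximation: the `super` call to Base#skip_token? is left out because Base is not shown on this page, and `SkipSketch` is a hypothetical name.

```ruby
# Standalone copy of the documented checks, for experimentation only.
# Assumption: Base#skip_token? (reached through `super` in the original) is omitted.
module SkipSketch
  def self.skip_token?(token)
    return true if token.match?(/^\d+$/)                              # pure numbers
    return true if token.length == 1 && token.match?(/[^a-zA-Zà-ÿ]/)  # lone non-letter
    return true if token.empty?                                       # empty tokens
    return true unless token.match?(/[a-zA-Zà-ÿ]/)                    # no letters at all
    false
  end
end

samples = ["2024", "'", "", "'90", "a", "10kg", "naïve", "À"]
samples.each do |token|
  puts format("%-8s skip=%s", token.inspect, SkipSketch.skip_token?(token))
end
```

With these checks, "2024", "'", "" and "'90" are skipped, while "a", "10kg" and "naïve" are kept. A standalone uppercase accented letter such as "À" is also skipped, because the character class covers only the lowercase range à-ÿ; a subclass for a language where that matters may want to widen it.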

#tokenize(text) ⇒ `Array<String>`

Tokenize text into words.

Parameters:

text (String) — Text to tokenize

Returns:

(Array<String>) — Array of tokens

# File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 40

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Split on word boundaries
  raw_tokens = text.split(WORD_SEPARATORS)

  # Filter and normalize
  raw_tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
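
The split, normalize, and filter pipeline above can be tried out with a self-contained sketch. `TokenizeSketch` is a hypothetical stand-in, not the Kotoshu class: it copies WORD_SEPARATORS from this page, omits whatever Base#skip_token? does, and collapses the skip checks into a single "contains a letter" test, which should give the same result for the tokens produced by this split.

```ruby
# Standalone approximation of LatinTokenizer#tokenize.
module TokenizeSketch
  # Copied from the Constant Summary on this page.
  WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-=_]/
  LETTER = /[a-zA-Zà-ÿ]/

  def self.normalize(token)
    token.strip
  end

  # Assumption: Base#skip_token? is not reproduced here.
  def self.skip_token?(token)
    !token.match?(LETTER)
  end

  def self.tokenize(text)
    return [] if text.nil? || text.strip.empty?

    text.split(WORD_SEPARATORS)
        .map { |token| normalize(token) }
        .reject { |token| skip_token?(token) }
  end
end

p TokenizeSketch.tokenize("Don't panic: it's a well-known café, 42 km away!")
```

In this sketch the apostrophes survive inside "Don't" and "it's" because `'` is not a separator, "well-known" comes back as two tokens because `-` is, and "42" is dropped as a purely numeric token.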

#word_boundary_regex ⇒ `Regexp`

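Returns a regex built from WORD_CHARS. Despite the name, the pattern matches a single word character rather than the boundary between words.

Returns:

(Regexp) — Regex matching one character from WORD_CHARS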




# File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 55

def word_boundary_regex
  /[#{WORD_CHARS}]/
end
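
A short sketch of what the interpolated pattern matches, using a copy of WORD_CHARS from this page. `WordCharSketch`, `WORD_CHAR`, and `WORD_RUN` are hypothetical names; `WORD_RUN` in particular is an illustrative extension and not something the library is known to define.

```ruby
module WordCharSketch
  WORD_CHARS = "a-zA-Zà-ÿ0-9'"         # copied from the Constant Summary
  WORD_CHAR  = /[#{WORD_CHARS}]/       # same construction as #word_boundary_regex
  WORD_RUN   = /[#{WORD_CHARS}]+/      # hypothetical: one or more word characters
end

# Each match is a single character, so separators simply drop out.
puts "it's 42!".scan(WordCharSketch::WORD_CHAR).join

# Adding a quantifier turns the same class into a word extractor.
p "well-known café".scan(WordCharSketch::WORD_RUN)
```

Because the class includes digits and the apostrophe, "it's" and "42" both count as word characters here, while the space, hyphen, and exclamation mark do not.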
