Class: Kotoshu::Language::Tokenizer::LatinTokenizer

Inherits:
Base
  • Object
Defined in:
lib/kotoshu/language/tokenizer/latin_tokenizer.rb

Overview

Tokenizer for Latin-script languages.

Base tokenizer for English, French, German, Spanish, Portuguese, and other European languages using Latin script.

Handles:

  • Standard word boundaries (whitespace, punctuation)

  • Apostrophes within words (contractions, elisions)

  • Hyphenated words

  • Numbers with units

Subclasses can override for language-specific handling.

Constant Summary

WORD_CHARS =

Latin word characters including accented characters

"a-zA-Zà-ÿ0-9'"

WORD_SEPARATORS =

Punctuation that separates words

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*+\-=_]/

CONTRACTIONS =

Contractions that should stay together

%w[
  I'm I'd I've I'll you're you'd you've you'll he's he'd he'll
  she's she'd she'll it's it'd we're we'd we've we'll they're
  they'd they've they'll that's that'd that'll who's who'd who'll
  what's what'd what'll where's where'd when's when'd why's why'd
  how's how'd can't won't don't shouldn't couldn't wouldn't didn't
  isn't aren't wasn't weren't hasn't haven't hadn't doesn't
  mightn't mustn't shan't
].freeze
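
To see what these constants imply for splitting, here is a minimal sketch that applies the separator set to a short phrase. It is standalone and does not load Kotoshu; the sample string is made up for illustration. The apostrophe is not a separator, so contractions survive, while the hyphen is, so a hyphenated word comes out as two pieces.

```ruby
# Same character set as the WORD_SEPARATORS constant listed above.
# `#` is escaped here only to rule out accidental interpolation.
WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@\#$%^&*+\-=_]/

# Hypothetical sample text, not taken from the library.
pieces = "don't stop, well-known".split(WORD_SEPARATORS)

# The comma followed by a space leaves an empty string between them.
p pieces
```

The empty piece produced by adjacent separators is the kind of token that a later filtering step has to discard.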

Instance Method Summary

Methods inherited from Base

#tokenize_with_positions, #word_char?

Instance Method Details

#normalize(token) ⇒ String

Normalizes a token by stripping leading and trailing whitespace.

Subclasses can override for language-specific normalization.

Parameters:

  • token (String)

    Token to normalize

Returns:

  • (String)

    Normalized token



# File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 65

def normalize(token)
  token.strip
end
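
As an illustration of the override point mentioned above, the following sketch subclasses a stand-in class rather than the real LatinTokenizer, so it runs without Kotoshu installed. Both class names are hypothetical, and the extra normalization steps (case folding, mapping the typographic apostrophe to the ASCII one) are assumptions chosen for the example, not behavior the library documents.

```ruby
# Stand-in for the documented class, reduced to #normalize.
class SketchLatinTokenizer
  def normalize(token)
    token.strip
  end
end

# Hypothetical subclass; not part of Kotoshu.
class SketchFoldingTokenizer < SketchLatinTokenizer
  def normalize(token)
    # Reuse the parent's whitespace stripping, then fold case and
    # map the typographic apostrophe (U+2019) to the ASCII one.
    super.downcase.tr("\u2019", "'")
  end
end

normalized = SketchFoldingTokenizer.new.normalize("  Don\u2019t ")
puts normalized
```

Calling `super` keeps the base behavior in one place, so a change to the parent's stripping rule carries over to the subclass.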

#skip_token?(token) ⇒ Boolean

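Checks whether a token should be skipped. Beyond whatever Base#skip_token? rejects, a token is skipped when it is empty, consists only of digits, is a single non-letter character, or contains no Latin letter.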

Parameters:

  • token (String)

    Token to check

Returns:

  • (Boolean)

    true if the token should be skipped



# File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 73

def skip_token?(token)
  return true if super

  # Skip pure numbers
  return true if token.match?(/^\d+$/)

  # Skip single characters that are not letters
  return true if token.length == 1 && token.match?(/[^a-zA-Zà-ÿ]/)

  # Skip empty tokens
  return true if token.empty?

  # Skip tokens with no letters
  return true unless token.match?(/[a-zA-Zà-ÿ]/)

  false
end
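
The checks above can be tried on a few sample tokens with the standalone sketch below. It mirrors the listed conditions but leaves out the call to `super`, because Base's behavior is not shown on this page; the method name and the sample tokens are hypothetical.

```ruby
# Mirrors the documented checks, minus the call to Base (`super`).
def sketch_skip_token?(token)
  return true if token.match?(/^\d+$/)                            # pure numbers
  return true if token.length == 1 && token.match?(/[^a-zA-Zà-ÿ]/) # lone non-letter
  return true if token.empty?                                     # empty tokens
  return true unless token.match?(/[a-zA-Zà-ÿ]/)                  # no letters at all
  false
end

# Hypothetical sample tokens.
samples = ["", "42", "7", "'", "a", "3rd", "don't"]
results = samples.to_h { |token| [token, sketch_skip_token?(token)] }

results.each { |token, skipped| puts "#{token.inspect} skipped=#{skipped}" }
```

In this sketch the empty string, the digit-only tokens, and the lone apostrophe are skipped, while "a", "3rd", and "don't" are kept because each contains at least one letter from the class.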

#tokenize(text) ⇒ Array<String>

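Tokenizes text into words: splits on WORD_SEPARATORS, normalizes each piece with #normalize, and drops pieces rejected by #skip_token?. Returns an empty array for nil or blank text.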

Parameters:

  • text (String)

    Text to tokenize

Returns:

  • (Array<String>)

    Array of tokens



# File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 40

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Split on word boundaries
  raw_tokens = text.split(WORD_SEPARATORS)

  # Filter and normalize
  raw_tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
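
The split, normalize, and reject pipeline can be sketched end to end as follows. This is a simplified stand-in, not the library class: the name SketchTokenizer is hypothetical, and the skip step is reduced to "contains no Latin letter", which is an assumption standing in for the full #skip_token? chain including Base.

```ruby
# Simplified stand-in for the documented pipeline; not the Kotoshu class.
class SketchTokenizer
  # Same character set as the documented WORD_SEPARATORS (`#` escaped).
  WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@\#$%^&*+\-=_]/

  def tokenize(text)
    return [] if text.nil? || text.strip.empty?

    text.split(WORD_SEPARATORS)
        .map(&:strip)                                      # normalize step
        .reject { |token| !token.match?(/[a-zA-Zà-ÿ]/) }   # assumed skip rule
  end
end

tokenizer = SketchTokenizer.new
tokens = tokenizer.tokenize("I can't believe it's 2024, a well-known year!")
p tokens
```

Under these assumptions the number and the empty piece after the comma are dropped, the contractions stay whole, and "well-known" is returned as two tokens because `-` is in the separator set.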

#word_boundary_regex ⇒ Regexp

Returns a regex that matches a single word character, built by interpolating WORD_CHARS into a character class.

Returns:

  • (Regexp)

    Character class matching one character from WORD_CHARS



# File 'lib/kotoshu/language/tokenizer/latin_tokenizer.rb', line 55

def word_boundary_regex
  /[#{WORD_CHARS}]/
end
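
A short standalone sketch of what the interpolation produces, assuming the WORD_CHARS value listed in the constant summary. The sample strings are hypothetical, and the `+` quantifier in the scan is added only for the demonstration; the documented method returns the bare character class.

```ruby
# Value copied from the constant summary.
WORD_CHARS = "a-zA-Zà-ÿ0-9'"

# Equivalent to the body of word_boundary_regex.
word_char = /[#{WORD_CHARS}]/
puts word_char.source

# Runs of word characters, using the same class with a quantifier.
runs = "café, 2go; don't!".scan(/[#{WORD_CHARS}]+/)
p runs
```

Because the class includes digits and the apostrophe, "2go" and "don't" each come out as a single run, while the comma, semicolon, space, and exclamation mark fall outside it.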