Class: Kotoshu::Language::Tokenizer::SpanishTokenizer

Inherits: Base

Object
  Base
    Kotoshu::Language::Tokenizer::SpanishTokenizer

Defined in: lib/kotoshu/language/tokenizer/spanish_tokenizer.rb

Overview

Tokenizer for Spanish text.

Ported from LanguageTool’s SpanishWordTokenizer.

Handles:

Decimal point between digits (3.14)
Decimal comma between digits (3,14)
Ordinals (1.º, 2.ª, 1.er, 1.os, 1.as)
Hyphenated words (a do-not-split list is used because no tagger is available)
Soft hyphens
Inverted punctuation (¡, ¿)

Direct Known Subclasses

Kotoshu::Languages::Spanish::Tokenizer

Constant Summary

WORD_SEPARATORS = Spanish word separators: most punctuation and whitespace. Note: decimals need special handling, so they are protected before the text is split.

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze

DECIMAL_POINT = Decimal point between digits: 3.14

/(\d)\.(\d)/

DECIMAL_COMMA = Decimal comma between digits: 3,14

/(\d),(\d)/

ORDINAL = Ordinal patterns: 1.º, 2.ª, 1.er, 1.os, 1.as

/\b(\d+)\.(º|ª|o|a|er|os|as)\b/

DECIMAL_POINT_PLACEHOLDER = Placeholders for special patterns (Unicode Private Use Area code points)

"\uE101"

DECIMAL_COMMA_PLACEHOLDER =

"\uE102"

ORDINAL_PLACEHOLDER =

"\uE103"
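
The patterns and placeholders above suggest a protect, split, restore round trip. The following is a minimal sketch of that idea, not the library's implementation: the module name `PlaceholderSketch`, its helper methods, and the simplified `SEPARATORS` set are hypothetical, and restoring the ordinal placeholder to a plain "." is an assumption, since `restore_placeholders` is not shown on this page.

```ruby
# Hypothetical, self-contained sketch of placeholder protection.
module PlaceholderSketch
  # Patterns and placeholders copied from the constant summary.
  DECIMAL_POINT = /(\d)\.(\d)/
  DECIMAL_COMMA = /(\d),(\d)/
  ORDINAL = /\b(\d+)\.(º|ª|o|a|er|os|as)\b/
  DECIMAL_POINT_PLACEHOLDER = "\uE101"
  DECIMAL_COMMA_PLACEHOLDER = "\uE102"
  ORDINAL_PLACEHOLDER = "\uE103"

  # Assumption: a reduced separator set, enough for the example.
  SEPARATORS = /[\s,.;:!?]/

  module_function

  # Swap the "." or "," inside protected patterns for a placeholder.
  def protect(text)
    text
      .gsub(DECIMAL_POINT, "\\1#{DECIMAL_POINT_PLACEHOLDER}\\2")
      .gsub(DECIMAL_COMMA, "\\1#{DECIMAL_COMMA_PLACEHOLDER}\\2")
      .gsub(ORDINAL, "\\1#{ORDINAL_PLACEHOLDER}\\2")
  end

  # Assumption: each placeholder maps back to the character it replaced.
  def restore(token)
    token
      .gsub(DECIMAL_POINT_PLACEHOLDER, ".")
      .gsub(DECIMAL_COMMA_PLACEHOLDER, ",")
      .gsub(ORDINAL_PLACEHOLDER, ".")
  end

  # Protect, split on separators, drop empty pieces, then restore.
  def split_protected(text)
    protect(text).split(SEPARATORS).reject(&:empty?).map { |t| restore(t) }
  end
end

p PlaceholderSketch.split_protected("El 1.er piso mide 3,5 m y 2.75 kg.")
```

Without the protection step, splitting on "." and "," would break `3,5` and `2.75` into separate digit tokens.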

SOFT_HYPHEN = Soft hyphen

"\u00AD"

DO_NOT_SPLIT = Do-not-split list (from LanguageTool)

%w[
  mers-cov mcgraw-hill sars-cov-2 sars-cov
  ph-metre ph-metres
].freeze
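
A do-not-split list like this could be consulted before a hyphenated token is broken apart. The sketch below is only an illustration under stated assumptions: `HyphenSketch.split_hyphenated` is a hypothetical helper, the case-insensitive lookup is an assumption, and the actual `words_to_add` logic is not shown on this page and may differ.

```ruby
# Hypothetical sketch of consulting a do-not-split list.
module HyphenSketch
  # List copied from the constant summary.
  DO_NOT_SPLIT = %w[
    mers-cov mcgraw-hill sars-cov-2 sars-cov
    ph-metre ph-metres
  ].freeze

  module_function

  # Returns the token unchanged when it has no hyphen or is listed;
  # otherwise returns its non-empty hyphen-separated parts.
  def split_hyphenated(token)
    return [token] unless token.include?("-")
    # Assumption: the list is matched case-insensitively.
    return [token] if DO_NOT_SPLIT.include?(token.downcase)

    token.split("-").reject(&:empty?)
  end
end

p HyphenSketch.split_hyphenated("SARS-CoV-2")
p HyphenSketch.split_hyphenated("franco-alemán")
```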

Instance Method Summary

#tokenize(text) ⇒ Object

Methods inherited from Base

#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?

Instance Method Details

#tokenize(text) ⇒ Object

# File 'lib/kotoshu/language/tokenizer/spanish_tokenizer.rb', line 45

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Replace hyphen variants
  text = text.gsub("\u2010", "\u002d")  # hyphen to hyphen-minus
  text = text.gsub("\u2011", "\u002d")  # non-breaking hyphen to hyphen-minus

  # Protect decimal points
  text = text.gsub(DECIMAL_POINT, "\\1#{DECIMAL_POINT_PLACEHOLDER}\\2")

  # Protect decimal commas
  text = text.gsub(DECIMAL_COMMA, "\\1#{DECIMAL_COMMA_PLACEHOLDER}\\2")

  # Protect ordinals
  text = text.gsub(ORDINAL, "\\1#{ORDINAL_PLACEHOLDER}\\2")

  # Split on word separators (placeholders keep protected patterns intact)
  raw_tokens = text.split(WORD_SEPARATORS)

  # Process each token
  tokens = []
  raw_tokens.each do |token|
    next if token.empty?

    # Restore placeholders
    token = restore_placeholders(token)

    # Handle hyphenated words
    parts = words_to_add(token)
    tokens.concat(parts)
  end

  # Filter and normalize
  tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
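
The first step of the method above folds two Unicode hyphen variants into the ASCII hyphen-minus. As a small illustration, the same mapping can be written with a single `String#tr` call; the module and method names here are hypothetical and are not part of Kotoshu.

```ruby
# Hypothetical sketch of the hyphen-variant normalization step.
module HyphenNormalizationSketch
  module_function

  # Map U+2010 (hyphen) and U+2011 (non-breaking hyphen) to U+002D.
  # String#tr pads the shorter replacement string with its last
  # character, so both source characters map to "-".
  def normalize_hyphens(text)
    text.tr("\u2010\u2011", "-")
  end
end

p HyphenNormalizationSketch.normalize_hyphens("sars\u2010cov\u20112")
```

Normalizing first means any later lookup, such as a do-not-split list, only has to deal with one hyphen character.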