Class: Kotoshu::Language::Tokenizer::SpanishTokenizer

Inherits:
Base
  • Object
Defined in:
lib/kotoshu/language/tokenizer/spanish_tokenizer.rb

Overview

Tokenizer for Spanish text.

Ported from LanguageTool’s SpanishWordTokenizer.

Handles:

  • Decimal point between digits (3.14)

  • Decimal comma between digits (3,14)

  • Ordinals (1.º, 2.ª, 1.er, 1.os, 1.as)

  • Hyphenated words (uses a do-not-split list, since no tagger is available)

  • Soft hyphens

  • Inverted punctuation (¡, ¿)

Direct Known Subclasses

Kotoshu::Languages::Spanish::Tokenizer

Constant Summary

WORD_SEPARATORS =

Spanish word separators: most punctuation and whitespace.

Note: decimals need special handling, so they are protected with placeholders before the text is split.

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze
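To illustrate what a plain split on this pattern does, here is a minimal sketch that copies the regex from above and applies it to a short question. It shows that ¿ (like ¡) is not in the character class, so the split alone leaves it attached to the following word. How #tokenize deals with that afterwards is not visible on this page (possibly in normalize or skip_token?, which is an assumption).

```ruby
# Copied from the constant above.
WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze

# String#split drops the separators and any trailing empty strings.
raw_tokens = "¿Qué hora es?".split(WORD_SEPARATORS)

# The opening question mark stays glued to the first word;
# the closing one is a separator and disappears.
puts raw_tokens.inspect
```

Because adjacent separators produce empty strings in the middle of the result, #tokenize skips empty tokens explicitly.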
DECIMAL_POINT =

Decimal point between digits: 3.14

/(\d)\.(\d)/
DECIMAL_COMMA =

Decimal comma between digits: 3,14

/(\d),(\d)/
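The two decimal patterns can be tried in isolation. The following is a sketch only: the helper names protect_decimals and restore_decimals are hypothetical, the separator regex is a simplified stand-in for WORD_SEPARATORS, and restoring each placeholder to its original character is an assumption about what the private restore_placeholders method does.

```ruby
DECIMAL_POINT = /(\d)\.(\d)/
DECIMAL_COMMA = /(\d),(\d)/
DECIMAL_POINT_PLACEHOLDER = "\uE101"
DECIMAL_COMMA_PLACEHOLDER = "\uE102"

# Hypothetical helper: replace "." and "," between digits with placeholders.
def protect_decimals(text)
  text
    .gsub(DECIMAL_POINT, "\\1#{DECIMAL_POINT_PLACEHOLDER}\\2")
    .gsub(DECIMAL_COMMA, "\\1#{DECIMAL_COMMA_PLACEHOLDER}\\2")
end

# Hypothetical helper: put the original characters back (assumed behavior).
def restore_decimals(token)
  token.tr("#{DECIMAL_POINT_PLACEHOLDER}#{DECIMAL_COMMA_PLACEHOLDER}", ".,")
end

protected_text = protect_decimals("Pi vale 3,14, no 3.15.")

# Simplified separator set for this sketch: whitespace, comma, period.
tokens = protected_text
  .split(/[\s,.]/)
  .reject(&:empty?)
  .map { |token| restore_decimals(token) }

puts tokens.inspect # → ["Pi", "vale", "3,14", "no", "3.15"]
```

The comma after "3,14" and the final period are still treated as separators because neither sits between two digits.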
ORDINAL =

Ordinal patterns: 1.º, 2.ª, 1.er, 1.os, 1.as

/\b(\d+)\.(º|ª|o|a|er|os|as)\b/
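As a rough illustration of the ordinal pattern, the sketch below protects the period in "1.er" so that a later split on periods does not break the ordinal apart. The helper name protect_ordinals is hypothetical, the separator regex is simplified, and mapping the placeholder back to "." is an assumption about the private restore step.

```ruby
ORDINAL = /\b(\d+)\.(º|ª|o|a|er|os|as)\b/
ORDINAL_PLACEHOLDER = "\uE103"

# Hypothetical helper mirroring the "Protect ordinals" step in #tokenize.
def protect_ordinals(text)
  text.gsub(ORDINAL, "\\1#{ORDINAL_PLACEHOLDER}\\2")
end

protected_text = protect_ordinals("Vive en el 1.er piso.")

# Simplified separator set for this sketch: whitespace and period.
pieces = protected_text.split(/[\s.]/).reject(&:empty?)

# Assumed restore step: the placeholder becomes a period again.
restored = pieces.map { |piece| piece.tr(ORDINAL_PLACEHOLDER, ".") }

puts restored.inspect # → ["Vive", "en", "el", "1.er", "piso"]
```

The sentence-final period is not followed by one of the listed suffixes, so it remains an ordinary separator.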
DECIMAL_POINT_PLACEHOLDER =

Placeholders for special patterns (characters from the Unicode Private Use Area)

"\uE101"
DECIMAL_COMMA_PLACEHOLDER =
"\uE102"
ORDINAL_PLACEHOLDER =
"\uE103"
SOFT_HYPHEN =

Soft hyphen

"\u00AD"
DO_NOT_SPLIT =

Do-not-split list (from LanguageTool)

%w[
  mers-cov mcgraw-hill sars-cov-2 sars-cov
  ph-metre ph-metres
].freeze
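One way such a list could be consulted is sketched below. The private words_to_add method is not shown on this page, so split_hyphenated is a hypothetical stand-in: the case-insensitive lookup and the choice to drop the hyphens when splitting are both assumptions, and the real method may behave differently.

```ruby
DO_NOT_SPLIT = %w[
  mers-cov mcgraw-hill sars-cov-2 sars-cov
  ph-metre ph-metres
].freeze

# Hypothetical stand-in for words_to_add.
def split_hyphenated(token)
  # Tokens without a hyphen pass through unchanged.
  return [token] unless token.include?("-")
  # Listed compounds are kept whole (assumed case-insensitive lookup).
  return [token] if DO_NOT_SPLIT.include?(token.downcase)

  # Everything else is split at the hyphens (assumption: hyphens are dropped).
  token.split("-").reject(&:empty?)
end

kept  = split_hyphenated("SARS-CoV-2")
split = split_hyphenated("franco-español")

puts kept.inspect  # → ["SARS-CoV-2"]
puts split.inspect # → ["franco", "español"]
```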

Instance Method Summary

Methods inherited from Base

#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?

Instance Method Details

#tokenize(text) ⇒ Object



# File 'lib/kotoshu/language/tokenizer/spanish_tokenizer.rb', line 45

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Replace hyphen variants
  text = text.gsub("\u2010", "\u002d")  # hyphen to hyphen-minus
  text = text.gsub("\u2011", "\u002d")  # non-breaking hyphen to hyphen-minus

  # Protect decimal points
  text = text.gsub(DECIMAL_POINT, "\\1#{DECIMAL_POINT_PLACEHOLDER}\\2")

  # Protect decimal commas
  text = text.gsub(DECIMAL_COMMA, "\\1#{DECIMAL_COMMA_PLACEHOLDER}\\2")

  # Protect ordinals
  text = text.gsub(ORDINAL, "\\1#{ORDINAL_PLACEHOLDER}\\2")

  # Split on separator characters (punctuation and whitespace)
  raw_tokens = text.split(WORD_SEPARATORS)

  # Process each token
  tokens = []
  raw_tokens.each do |token|
    next if token.empty?

    # Restore placeholders
    token = restore_placeholders(token)

    # Handle hyphenated words
    parts = words_to_add(token)
    tokens.concat(parts)
  end

  # Filter and normalize
  tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
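The first two steps of #tokenize, the blank-input guard and the hyphen normalization, can be sketched on their own. The helper names below are hypothetical and not part of Kotoshu. String#tr is used here as a compact equivalent of the two gsub calls, and the idea that normalization exists so later hyphen handling only has to deal with U+002D is an inference rather than something the source states.

```ruby
# Hypothetical helper: same condition as the guard at the top of #tokenize.
def blank_input?(text)
  text.nil? || text.strip.empty?
end

# Hypothetical helper: U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN
# both become U+002D HYPHEN-MINUS.
def normalize_hyphens(text)
  text.tr("\u2010\u2011", "-")
end

normalized = normalize_hyphens("SARS\u2011CoV\u20102")

puts normalized           # → SARS-CoV-2
puts blank_input?("  \n") # → true
```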