Class: Kotoshu::Language::Tokenizer::SpanishTokenizer
- Defined in: lib/kotoshu/language/tokenizer/spanish_tokenizer.rb
Overview
Tokenizer for Spanish text.
Ported from LanguageTool’s SpanishWordTokenizer.
Handles:
- Decimal point between digits (3.14)
- Decimal comma between digits (3,14)
- Ordinals (1.º, 2.ª, 1.er, 1.os, 1.as)
- Hyphens (using a do-not-split list, since no tagger is available)
- Soft hyphens
- Inverted punctuation (¡, ¿)
Direct Known Subclasses
Constant Summary
- WORD_SEPARATORS =
Spanish word separators: most punctuation and whitespace. Note: we need to handle decimals specially, so we protect them first.
/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze
- DECIMAL_POINT =
Decimal point between digits: 3.14
/(\d)\.(\d)/
- DECIMAL_COMMA =
Decimal comma between digits: 3,14
/(\d),(\d)/
- ORDINAL =
Ordinal patterns: 1.º, 2.ª, 1.er, 1.os, 1.as
/\b(\d+)\.(º|ª|o|a|er|os|as)\b/
- DECIMAL_POINT_PLACEHOLDER =
Placeholders for special patterns
"\uE101"
- DECIMAL_COMMA_PLACEHOLDER =
"\uE102"
- ORDINAL_PLACEHOLDER =
"\uE103"
- SOFT_HYPHEN =
Soft hyphen
"\u00AD"
- DO_NOT_SPLIT =
Do-not-split list (from LanguageTool)
%w[ mers-cov mcgraw-hill sars-cov-2 sars-cov ph-metre ph-metres ].freeze
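The pattern and placeholder constants suggest a protect, split, restore sequence: dots and commas that belong to a number or an ordinal are swapped for private-use characters before splitting, then swapped back. The following is a minimal standalone sketch of that idea, not the class's actual code. The helper names `protect`, `restore`, and `split_protected` are hypothetical, and `SEPARATORS` is a simplified subset of WORD_SEPARATORS chosen for the demonstration.

```ruby
# Standalone sketch: constants are copied from the summary so the snippet
# runs without the library loaded.
DECIMAL_POINT = /(\d)\.(\d)/
DECIMAL_COMMA = /(\d),(\d)/
ORDINAL       = /\b(\d+)\.(º|ª|o|a|er|os|as)\b/

DECIMAL_POINT_PLACEHOLDER = "\uE101"
DECIMAL_COMMA_PLACEHOLDER = "\uE102"
ORDINAL_PLACEHOLDER       = "\uE103"

# Assumption: a reduced separator set, not the full WORD_SEPARATORS.
SEPARATORS = /[\s,.;:!?()]/

# Hypothetical helper: hide the dots and commas that must survive splitting.
def protect(text)
  text
    .gsub(DECIMAL_POINT, "\\1#{DECIMAL_POINT_PLACEHOLDER}\\2")
    .gsub(DECIMAL_COMMA, "\\1#{DECIMAL_COMMA_PLACEHOLDER}\\2")
    .gsub(ORDINAL, "\\1#{ORDINAL_PLACEHOLDER}\\2")
end

# Hypothetical helper: put the original characters back into one token.
def restore(token)
  token
    .gsub(DECIMAL_POINT_PLACEHOLDER, ".")
    .gsub(DECIMAL_COMMA_PLACEHOLDER, ",")
    .gsub(ORDINAL_PLACEHOLDER, ".")
end

# Hypothetical helper: protect, split, drop empty pieces, restore.
def split_protected(text)
  protect(text).split(SEPARATORS).reject(&:empty?).map { |t| restore(t) }
end

p split_protected("Pi vale 3.14, o 3,14; llegó en 1.er lugar.")
```

Without the protection step, splitting on `.` and `,` would break `3.14` into `3` and `14`. Private-use code points are a reasonable choice for placeholders because they are unlikely to occur in ordinary text.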
Instance Method Summary
Methods inherited from Base
#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?
Instance Method Details
#tokenize(text) ⇒ Object
# File 'lib/kotoshu/language/tokenizer/spanish_tokenizer.rb', line 45

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Replace hyphen variants
  text = text.gsub("\u2010", "\u002d") # hyphen to hyphen-minus
  text = text.gsub("\u2011", "\u002d") # non-breaking hyphen to hyphen-minus

  # Protect decimal points
  text = text.gsub(DECIMAL_POINT, "\\1#{DECIMAL_POINT_PLACEHOLDER}\\2")

  # Protect decimal commas
  text = text.gsub(DECIMAL_COMMA, "\\1#{DECIMAL_COMMA_PLACEHOLDER}\\2")

  # Protect ordinals
  text = text.gsub(ORDINAL, "\\1#{ORDINAL_PLACEHOLDER}\\2")

  # Split on word boundaries
  raw_tokens = text.split(WORD_SEPARATORS)

  # Process each token
  tokens = []
  raw_tokens.each do |token|
    next if token.empty?

    # Restore placeholders
    token = restore_placeholders(token)

    # Handle hyphenated words
    parts = words_to_add(token)
    tokens.concat(parts)
  end

  # Filter and normalize
  tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
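The body of `words_to_add` is not shown on this page, so the following is only a guess at how the hyphen step might use DO_NOT_SPLIT and SOFT_HYPHEN. The method name `words_to_add_sketch` is hypothetical. The sketch assumes that soft hyphens are stripped, that the do-not-split lookup is case-insensitive, and that any other hyphenated token is split at each hyphen. The real method may differ on all three points.

```ruby
# Standalone sketch: constants are copied from the summary so the snippet
# runs without the library loaded.
SOFT_HYPHEN  = "\u00AD"
DO_NOT_SPLIT = %w[mers-cov mcgraw-hill sars-cov-2 sars-cov ph-metre ph-metres].freeze

# Hypothetical stand-in for words_to_add; the behavior is assumed, not
# taken from the library.
def words_to_add_sketch(token)
  # Assumption: soft hyphens are invisible hints and can be dropped.
  cleaned = token.delete(SOFT_HYPHEN)
  return [] if cleaned.empty?

  # Tokens without a hyphen pass through unchanged.
  return [cleaned] unless cleaned.include?("-")

  # Assumption: the list is matched case-insensitively.
  return [cleaned] if DO_NOT_SPLIT.include?(cleaned.downcase)

  # Assumption: everything else is split at each hyphen.
  cleaned.split("-").reject(&:empty?)
end

p words_to_add_sketch("SARS-CoV-2")
p words_to_add_sketch("franco-alemán")
```

Because `tokenize` rewrites U+2010 and U+2011 to hyphen-minus before splitting, a helper of this kind only has to look for the single ASCII hyphen.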