Class: Kotoshu::Language::Tokenizer::FrenchTokenizer

Inherits:
Base
  • Object
Defined in:
lib/kotoshu/language/tokenizer/french_tokenizer.rb

Overview

Tokenizer for French text.

Ported from LanguageTool’s FrenchWordTokenizer.

Handles:

  • Apostrophes (l’, d’, qu’, c’est, j’ai, etc.)

  • Hyphens (c’est-à-dire, rendez-vous, etc.)

  • Decimal points/commas

  • Multiple contraction patterns (the 13 listed in CONTRACTION_PATTERNS)

Direct Known Subclasses

Kotoshu::Languages::French::Tokenizer

Constant Summary

WORD_SEPARATORS =

French word separators: whitespace and most punctuation. Note: the apostrophe (') is NOT a separator in French, because it marks contractions. The hyphen is not a separator either.

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze
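To illustrate what this separator set does in practice, here is a minimal sketch that splits one French sentence with a copy of the regex. The constant name SEPARATORS_SKETCH and the sample sentence are assumptions made for the example; only the character class itself comes from the constant above.

```ruby
# Copy of the WORD_SEPARATORS character class; "#" is escaped here only to
# rule out any chance of string interpolation inside the regex literal.
SEPARATORS_SKETCH = /[\s"()\[\]{}<>,.;:!?\\\/|`~@\#$%^&*·]/.freeze

sentence = "L'eau est froide, n'est-ce pas ?"

# Adjacent separators (", " here) leave empty strings in the raw result.
raw = sentence.split(SEPARATORS_SKETCH)

# Dropping the empty strings keeps only the word-like chunks.
words = raw.reject(&:empty?)

puts words.inspect
```

Apostrophes and hyphens survive this first pass, so chunks such as "L'eau" and "n'est-ce" are left intact for a later contraction and hyphen step. The empty strings produced by adjacent separators are the reason a caller has to skip empty tokens.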
DO_NOT_SPLIT =

Do-not-split list (from LanguageTool): hyphenated or apostrophized words that are kept as single tokens

%w[
  mers-cov mcgraw-hill sars-cov-2 sars-cov
  ph-metre ph-metres anti-ivg anti-uv anti-vih al-qaïda
  c'est-à-dire add-on add-ons rendez-vous garde-à-vous
  chez-eux chez-moi chez-nous chez-soi chez-toi chez-vous
  m'as-tu-vu
].freeze
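How the list is consulted is not shown on this page. As a rough sketch, assuming a case-insensitive lookup performed before any hyphen split, a guard could look like the following. The names DO_NOT_SPLIT_SKETCH and split_hyphens_sketch are hypothetical, and the naive split on "-" stands in for whatever the real split_french_word does.

```ruby
require "set"

# A few entries borrowed from DO_NOT_SPLIT, stored in a Set for fast lookup.
DO_NOT_SPLIT_SKETCH = %w[c'est-à-dire rendez-vous chez-moi].to_set.freeze

# Hypothetical helper: return the token untouched when it is on the list,
# otherwise split it naively on hyphens.
def split_hyphens_sketch(token)
  return [token] if DO_NOT_SPLIT_SKETCH.include?(token.downcase)

  token.split("-")
end

puts split_hyphens_sketch("Rendez-vous").inspect
puts split_hyphens_sketch("dis-moi").inspect
```

Lowercasing before the lookup lets a sentence-initial "Rendez-vous" match the lowercase entry; whether the library does the same is an assumption.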
CONTRACTION_PATTERNS =

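Contraction patterns (from LanguageTool). French contractions are complex: l’, d’, qu’, c’est, j’ai, n’a, etc. Each pattern is anchored at both ends, so it matches only the elided prefix itself (for example l’ or qu’), not the full contracted word.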

[
  # c' followed by word: c'est, c'était, etc.
  /^(c[''])$/i,
  # j' (je): j'ai, j'aime, etc.
  /^(j[''])$/i,
  # n' (ne): n'a, n'est, etc.
  /^(n[''])$/i,
  # m' (me): m'a, m'appelle, etc.
  /^(m[''])$/i,
  # t' (te): t'a, t'asseoir, etc.
  /^(t[''])$/i,
  # s' (se): s'est, s'appelle, etc.
  /^(s[''])$/i,
  # l' (le/la): l'a, l'homme, l'eau, etc.
  /^(l[''])$/i,
  # d' (de): d'un, d'une, d'abord, etc.
  /^(d[''])$/i,
  # qu' (que): qu'un, qu'une, qu'est, etc.
  /^(qu[''])$/i,
  # jusqu'à, jusqu'aux, etc.
  /^(jusqu[''])$/i,
  # puisque, puisqu'il, etc.
  /^(puisqu[''])$/i,
  # quoique, quoiqu'il, etc.
  /^(quoiqu[''])$/i,
  # lorsque, lorsqu'il, etc.
  /^(lorsqu[''])$/i,
].freeze
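Because the patterns match the prefix alone, a caller has to cut the token at the apostrophe before testing it. The sketch below shows one way that could work. It assumes apostrophes have already been normalized to the straight form, uses a simplified three-pattern subset, and the names PATTERNS_SKETCH and split_contraction_sketch are hypothetical, not part of the library.

```ruby
# Simplified subset of the patterns above, straight apostrophe only.
PATTERNS_SKETCH = [/^(l')$/i, /^(qu')$/i, /^(j')$/i].freeze

# Hypothetical helper: split "l'homme" into ["l'", "homme"] when the part up
# to and including the first apostrophe matches a known elided prefix.
def split_contraction_sketch(token)
  idx = token.index("'")
  return [token] if idx.nil?

  prefix = token[0..idx]
  rest = token[(idx + 1)..-1]
  return [token] if rest.empty?

  PATTERNS_SKETCH.any? { |re| re.match?(prefix) } ? [prefix, rest] : [token]
end

puts split_contraction_sketch("l'homme").inspect
puts split_contraction_sketch("aujourd'hui").inspect
```

A word such as aujourd'hui stays whole because "aujourd'" is not one of the listed prefixes, which is the point of anchoring each pattern with ^ and $.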

Instance Method Summary

Methods inherited from Base

#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?

Instance Method Details

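#tokenize(text) ⇒ Object

Splits text into word tokens. Returns an empty array when text is nil or blank. Otherwise it normalizes hyphen variants and apostrophes, splits on WORD_SEPARATORS, splits contractions and hyphenated words, then normalizes the tokens and drops those rejected by skip_token?.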



# File 'lib/kotoshu/language/tokenizer/french_tokenizer.rb', line 60

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Replace hyphen variants
  text = text.gsub("\u2010", "\u002d")
  text = text.gsub("\u2011", "\u002d")

  # Normalize apostrophes
  text = normalize_apostrophes(text)

  # Split on word boundaries
  raw_tokens = text.split(WORD_SEPARATORS)

  # Process each token
  tokens = []
  raw_tokens.each do |token|
    next if token.empty?

    # Try to split contractions and hyphenated words
    parts = split_french_word(token)
    tokens.concat(parts)
  end

  # Filter and normalize
  tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
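The first two steps of the method rewrite punctuation before any splitting happens. The sketch below isolates that step. The hyphen replacements (U+2010 and U+2011 to U+002D) are taken from the method body; the apostrophe mapping is an assumption, since normalize_apostrophes is not shown on this page, and the name normalize_punctuation_sketch is hypothetical.

```ruby
# Hypothetical stand-in for the normalization done at the top of #tokenize.
def normalize_punctuation_sketch(text)
  text
    .gsub(/[\u2010\u2011]/, "-") # hyphen and non-breaking hyphen, as in #tokenize
    .gsub(/[\u2018\u2019]/, "'") # curly single quotes; assumed mapping
end

puts normalize_punctuation_sketch("c\u2019est\u2010\u00e0\u2011dire")
```

Normalizing first means later steps, such as the DO_NOT_SPLIT lookup and the contraction patterns, only have to deal with one hyphen and one apostrophe character.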