Class: Kotoshu::Language::Tokenizer::FrenchTokenizer

Inherits: Base

Object
  Base
    Kotoshu::Language::Tokenizer::FrenchTokenizer

Defined in: lib/kotoshu/language/tokenizer/french_tokenizer.rb

Overview

Tokenizer for French text.

Ported from LanguageTool’s FrenchWordTokenizer.

Handles:

Apostrophes (l’, d’, qu’, c’est, j’ai, etc.)
Hyphens (c’est-à-dire, rendez-vous, etc.)
Decimal points/commas
Multiple contraction patterns (13 listed in CONTRACTION_PATTERNS)

Direct Known Subclasses

Kotoshu::Languages::French::Tokenizer

Constant Summary

WORD_SEPARATORS = French word separators: most punctuation and whitespace. Note: the apostrophe (') is NOT a separator in French, because it is used for contractions.

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze
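To illustrate what splitting on this separator class produces, here is a small standalone sketch. It copies the regex shown above into a local constant so it runs without the library. The sample sentence is made up for illustration, and empty strings are dropped by hand, much as the `next if token.empty?` guard in `#tokenize` does.

```ruby
# Local copy of the separator class documented above, so this runs standalone.
WORD_SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze

# Illustrative input (not taken from the library's tests).
sample = "L'homme arrive; c'est-à-dire, demain!"

# Adjacent separators (";" followed by a space) yield empty strings,
# which are discarded here.
pieces = sample.split(WORD_SEPARATORS).reject(&:empty?)

p pieces
# => ["L'homme", "arrive", "c'est-à-dire", "demain"]
```

Neither the apostrophe nor the hyphen is in the class, so `L'homme` and `c'est-à-dire` survive this first pass intact and are left for the later contraction and hyphen handling.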

DO_NOT_SPLIT = Do-not-split list (from LanguageTool)

%w[
  mers-cov mcgraw-hill sars-cov-2 sars-cov
  ph-metre ph-metres anti-ivg anti-uv anti-vih al-qaïda
  c'est-à-dire add-on add-ons rendez-vous garde-à-vous
  chez-eux chez-moi chez-nous chez-soi chez-toi chez-vous
  m'as-tu-vu
].freeze
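This page does not show how the list is consulted. One plausible use, sketched below under that assumption, is a case-insensitive membership check before a token is split on hyphens. The helper name `split_on_hyphens` and the shortened list `DO_NOT_SPLIT_SAMPLE` are hypothetical and are not part of the library.

```ruby
require "set"

# Hypothetical, shortened copy of the list above; a Set gives O(1) lookups.
DO_NOT_SPLIT_SAMPLE = %w[rendez-vous c'est-à-dire chez-moi sars-cov-2].to_set.freeze

# Hypothetical helper: keep protected compounds whole, split other tokens
# on the ASCII hyphen. The real splitting rules may differ.
def split_on_hyphens(token)
  return [token] if DO_NOT_SPLIT_SAMPLE.include?(token.downcase)

  token.split("-")
end

p split_on_hyphens("Rendez-vous")   # => ["Rendez-vous"]
p split_on_hyphens("porte-monnaie") # => ["porte", "monnaie"]
```

Lowercasing before the lookup matters because every entry in the list is lowercase, while sentence-initial tokens such as `Rendez-vous` are not.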

CONTRACTION_PATTERNS = Contraction patterns (from LanguageTool). French contractions are complex: l’, d’, qu’, c’est, j’ai, n’a, etc.

[
  # c' followed by word: c'est, c'était, etc.
  /^(c[''])$/i,
  # j' (je): j'ai, j'aime, etc.
  /^(j[''])$/i,
  # n' (ne): n'a, n'est, etc.
  /^(n[''])$/i,
  # m' (me): m'a, m'appelle, etc.
  /^(m[''])$/i,
  # t' (te): t'a, t'asseoir, etc.
  /^(t[''])$/i,
  # s' (se): s'a, s'appelle, etc.
  /^(s[''])$/i,
  # l' (le/la): l'a, l'homme, l'eau, etc.
  /^(l[''])$/i,
  # d' (de): d'un, d'une, d'abord, etc.
  /^(d[''])$/i,
  # qu' (que): qu'un, qu'une, qu'est, etc.
  /^(qu[''])$/i,
  # jusqu'à, jusqu'aux, etc.
  /^(jusqu[''])$/i,
  # puisque, puisqu'il, etc.
  /^(puisqu[''])$/i,
  # quoique, quoiqu'il, etc.
  /^(quoiqu[''])$/i,
  # lorsque, lorsqu'il, etc.
  /^(lorsqu[''])$/i,
].freeze
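Because each pattern is anchored with `^` and `$`, it matches only a bare elided prefix such as `qu'`, not a full token such as `qu'il`. A minimal sketch of how such patterns could be applied, assuming the token is first cut after its first apostrophe, follows. The helper `split_contraction` and the shortened list `PATTERNS_SAMPLE` are hypothetical. The library's private `split_french_word` is not shown on this page and may work differently.

```ruby
# Hypothetical, shortened copy of the pattern list above.
PATTERNS_SAMPLE = [
  /^(c')$/i,
  /^(l')$/i,
  /^(qu')$/i,
  /^(jusqu')$/i
].freeze

# Hypothetical helper: split "Qu'il" into ["Qu'", "il"] when the part up to
# and including the first apostrophe matches a known elided prefix.
def split_contraction(token)
  idx = token.index("'")
  return [token] if idx.nil?

  prefix = token[0..idx]
  rest = token[(idx + 1)..]
  return [token] if rest.empty?
  return [token] unless PATTERNS_SAMPLE.any? { |pattern| pattern.match?(prefix) }

  [prefix, rest]
end

p split_contraction("Qu'il")       # => ["Qu'", "il"]
p split_contraction("aujourd'hui") # => ["aujourd'hui"]
```

Words like `aujourd'hui` stay whole in this sketch because `aujourd'` is not one of the listed prefixes.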

Instance Method Summary

#tokenize(text) ⇒ Object

Methods inherited from Base

#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?

Instance Method Details

#tokenize(text) ⇒ Object

# File 'lib/kotoshu/language/tokenizer/french_tokenizer.rb', line 60

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Replace hyphen variants
  text = text.gsub("\u2010", "\u002d")
  text = text.gsub("\u2011", "\u002d")

  # Normalize apostrophes
  text = normalize_apostrophes(text)

  # Split on word boundaries
  raw_tokens = text.split(WORD_SEPARATORS)

  # Process each token
  tokens = []
  raw_tokens.each do |token|
    next if token.empty?

    # Try to split contractions and hyphenated words
    parts = split_french_word(token)
    tokens.concat(parts)
  end

  # Filter and normalize
  tokens
    .map { |token| normalize(token) }
    .reject { |token| skip_token?(token) }
end
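The listing calls `normalize_apostrophes`, whose body is not shown on this page. The sketch below shows what the two normalization steps might look like. It assumes that typographic apostrophes are mapped to the ASCII apostrophe, which is what the straight-quote patterns in CONTRACTION_PATTERNS would need. Both method names and the chosen set of code points are assumptions, not the library's actual implementation.

```ruby
# Hypothetical stand-in for the private helper: map common typographic
# apostrophes (U+2018, U+2019, U+02BC) to the ASCII apostrophe.
def normalize_apostrophes_sketch(text)
  text.gsub(/[\u2018\u2019\u02BC]/, "'")
end

# Same effect as the two gsub calls in #tokenize: U+2010 (hyphen) and
# U+2011 (non-breaking hyphen) both become the ASCII hyphen-minus.
def normalize_hyphens_sketch(text)
  text.tr("\u2010\u2011", "-")
end

p normalize_apostrophes_sketch("l\u2019eau")    # => "l'eau"
p normalize_hyphens_sketch("rendez\u2011vous")  # => "rendez-vous"
```

Running these steps before splitting means the later lookups (DO_NOT_SPLIT, CONTRACTION_PATTERNS) only have to deal with one apostrophe and one hyphen character.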