Class: Kotoshu::Components::WhitespaceTokenizer

Inherits:
Tokenizer
  • Object
Defined in:
lib/kotoshu/components/whitespace_tokenizer.rb

Overview

Whitespace-based tokenizer for Latin-script languages.

Splits text on whitespace and emits each punctuation character as its own token; apostrophes stay attached to the word they appear in. Suitable for languages with space-separated words (English, French, German, etc.).

This is a simple tokenizer that works well for most Latin-script languages. For more advanced tokenization (contractions, compounds), use language-specific tokenizers.

Examples:

Basic tokenization

tokenizer = WhitespaceTokenizer.new
tokens = tokenizer.tokenize("Hello, world!")
# => [
#      { token: "Hello", position: 0, length: 5 },
#      { token: ",", position: 5, length: 1 },
#      { token: "world", position: 7, length: 5 },
#      { token: "!", position: 12, length: 1 }
#    ]

Tokenizing to strings

tokenizer.tokenize_to_strings("Hello, world!")
# => ["Hello", ",", "world", "!"]

Direct Known Subclasses

Languages::English::Tokenizer

Constant Summary

TOKEN_PATTERN = /[\w']+|[^\w\s]/.freeze

Regular expression matching a single token: either a run of word characters and apostrophes, or one punctuation character.
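As a quick illustration of what the two alternatives in the pattern match, the sketch below applies the same regular expression directly with String#scan. The variable names are only for illustration and are not part of the library.

```ruby
# Same regular expression as the documented TOKEN_PATTERN constant,
# held in a local variable so the snippet runs without the library.
token_pattern = /[\w']+|[^\w\s]/

# The apostrophe belongs to the first alternative, so a contraction
# stays in one piece.
contraction = "Don't stop!".scan(token_pattern)
# => ["Don't", "stop", "!"]

# The second alternative matches exactly one character, so a run of
# punctuation yields one token per character.
ellipsis = "Wait...".scan(token_pattern)
# => ["Wait", ".", ".", "."]
```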

Instance Method Summary

Methods inherited from Tokenizer

#tokenize_to_strings

Constructor Details

#initialize(pattern: TOKEN_PATTERN) ⇒ WhitespaceTokenizer

Create a new whitespace tokenizer.

Parameters:

  • pattern (Regexp) (defaults to: TOKEN_PATTERN)

    Optional custom token pattern



# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 36

def initialize(pattern: TOKEN_PATTERN)
  @pattern = pattern
end
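One situation where a custom pattern may help: in Ruby, \w matches only ASCII letters, digits, and underscore, so with the default pattern an accented letter is treated as punctuation. The sketch below passes a Unicode-aware pattern through the pattern: keyword. The class is a stand-in that copies the documented constructor and #tokenize so the snippet runs without the library, and unicode_pattern is a hypothetical pattern, not something the library ships.

```ruby
# Stand-in mirroring the documented constructor and #tokenize so this
# snippet is self-contained; in a real project, load Kotoshu instead.
class WhitespaceTokenizer
  TOKEN_PATTERN = /[\w']+|[^\w\s]/.freeze

  def initialize(pattern: TOKEN_PATTERN)
    @pattern = pattern
  end

  def tokenize(text)
    return [] if text.nil? || text.empty?

    tokens = []
    position = 0
    text.scan(@pattern) do |match|
      match_str = match.is_a?(Array) ? match.first : match
      start_pos = text.index(match_str, position)
      tokens << { token: match_str, position: start_pos, length: match_str.length }
      position = start_pos + match_str.length
    end
    tokens
  end
end

# Hypothetical pattern (an assumption, not part of the library):
# \p{L} and \p{N} match letters and digits in any script.
unicode_pattern = /[\p{L}\p{N}_']+|[^\p{L}\p{N}_\s]/

word = "caf\u00e9" # "café" with a precomposed e-acute

default_tokens = WhitespaceTokenizer.new.tokenize(word).map { |t| t[:token] }
# => ["caf", "é"]

unicode_tokens = WhitespaceTokenizer.new(pattern: unicode_pattern)
                                    .tokenize(word).map { |t| t[:token] }
# => ["café"]
```

Whether a Unicode-aware pattern is the right choice depends on the input; a language-specific tokenizer may already cover this case.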

Instance Method Details

#pattern ⇒ Regexp

Get the token pattern used by this tokenizer.

Returns:

  • (Regexp)

    The token pattern



# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 75

def pattern
  @pattern
end

#punctuation?(char) ⇒ Boolean

Check if a character is punctuation.

Parameters:

  • char (String)

    Single character

Returns:

  • (Boolean)

    True if punctuation



# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 91

def punctuation?(char)
  char.match?(/[^\w\s]/)
end

#tokenize(text) ⇒ Array<Hash>

Split text into tokens.

Each token is a hash with:

  • :token (String) - The token text

  • :position (Integer) - Character position in original text

  • :length (Integer) - Token length in characters

Parameters:

  • text (String)

    The input text

Returns:

  • (Array<Hash>)

    Array of token hashes



# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 49

def tokenize(text)
  return [] if text.nil? || text.empty?

  tokens = []
  position = 0

  # Find all matches
  text.scan(@pattern) do |match|
    match_str = match.is_a?(Array) ? match.first : match
    start_pos = text.index(match_str, position)

    tokens << {
      token: match_str,
      position: start_pos,
      length: match_str.length
    }

    position = start_pos + match_str.length
  end

  tokens
end
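Because :position and :length are character offsets into the original string, each token can be mapped back to its source span. A minimal sketch, using the token hashes from the "Basic tokenization" example as literal data so it runs without the library:

```ruby
text = "Hello, world!"

# Token hashes as shown in the "Basic tokenization" example.
tokens = [
  { token: "Hello", position: 0,  length: 5 },
  { token: ",",     position: 5,  length: 1 },
  { token: "world", position: 7,  length: 5 },
  { token: "!",     position: 12, length: 1 }
]

# String#[] with (start, length) recovers each token from the text.
slices = tokens.map { |t| text[t[:position], t[:length]] }
# => ["Hello", ",", "world", "!"]

# The text between consecutive tokens is whatever the tokenizer skipped
# (here, a single space between "," and "world").
gaps = tokens.each_cons(2).map do |a, b|
  text[(a[:position] + a[:length])...b[:position]]
end
# => ["", " ", ""]
```

This kind of round trip is useful when a downstream step needs to highlight or replace a token in place.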

#word_char?(char) ⇒ Boolean

Check if a character is a word character.

Parameters:

  • char (String)

    Single character

Returns:

  • (Boolean)

    True if word character



# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 83

def word_char?(char)
  char.match?(/[\w]/)
end
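The two predicates can be combined to classify characters. The sketch below reuses the regular expressions from the #word_char? and #punctuation? bodies as lambdas so it runs without the library; the lambda names and the :other label are illustrative only.

```ruby
# Same checks as the documented method bodies, written as lambdas.
word_char   = ->(char) { char.match?(/[\w]/) }
punctuation = ->(char) { char.match?(/[^\w\s]/) }

# Label a single character using the two predicates.
classify = lambda do |char|
  if word_char.call(char)
    :word
  elsif punctuation.call(char)
    :punctuation
  else
    :other # whitespace matches neither predicate
  end
end

classes = "a_1 ,'".each_char.map(&classify)
# => [:word, :word, :word, :other, :punctuation, :punctuation]
```

Note that the apostrophe is classified as punctuation here, even though TOKEN_PATTERN keeps it inside word tokens.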