Class: Kotoshu::Components::WhitespaceTokenizer

Inherits:

Tokenizer

Object
Tokenizer
Kotoshu::Components::WhitespaceTokenizer

show all

Defined in:: lib/kotoshu/components/whitespace_tokenizer.rb

Overview

Whitespace-based tokenizer for Latin-script languages.

Splits text on whitespace and separates punctuation. Suitable for languages with space-separated words (English, French, German, etc.).

This is a simple tokenizer that works well for most Latin-script languages. For more advanced tokenization (contractions, compounds), use language-specific tokenizers.

Examples:

Basic tokenization

tokenizer = WhitespaceTokenizer.new
tokens = tokenizer.tokenize("Hello, world!")
# => [
#      { token: "Hello", position: 0, length: 5 },
#      { token: ",", position: 5, length: 1 },
#      { token: "world", position: 7, length: 5 },
#      { token: "!", position: 12, length: 1 }
#    ]

Tokenizing to strings

tokenizer.tokenize_to_strings("Hello, world!")
# => ["Hello", ",", "world", "!"]

Direct Known Subclasses

Languages::English::Tokenizer

Constant Summary collapse

TOKEN_PATTERN = Regex pattern for matching tokens (words or punctuation).

/[\w']+|[^\w\s]/.freeze

Instance Method Summary collapse

#initialize(pattern: TOKEN_PATTERN) ⇒ WhitespaceTokenizer constructor

Create a new whitespace tokenizer.
#pattern ⇒ Regexp

Get the token pattern used by this tokenizer.
#punctuation?(char) ⇒ Boolean

Check if a character is punctuation.
#tokenize(text) ⇒ Array<Hash>

Split text into tokens.
#word_char?(char) ⇒ Boolean

Check if a character is a word character.

Methods inherited from Tokenizer

#tokenize_to_strings

Constructor Details

#initialize(pattern: TOKEN_PATTERN) ⇒ `WhitespaceTokenizer`

Create a new whitespace tokenizer.

Parameters:

pattern (Regexp) (defaults to: TOKEN_PATTERN) —

Optional custom token pattern



36
37
38

# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 36

def initialize(pattern: TOKEN_PATTERN)
  @pattern = pattern
end

Instance Method Details

#pattern ⇒ `Regexp`

Get the token pattern used by this tokenizer.

Returns:

(Regexp) —

The token pattern



75
76
77

# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 75

def pattern
  @pattern
end

#punctuation?(char) ⇒ `Boolean`

Check if a character is punctuation.

Parameters:

char (String) —

Single character

Returns:

(Boolean) —

True if punctuation



91
92
93

# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 91

def punctuation?(char)
  char.match?(/[^\w\s]/)
end

#tokenize(text) ⇒ `Array<Hash>`

Split text into tokens.

Each token is a hash with:

:token (String) - The token text
:position (Integer) - Character position in original text
:length (Integer) - Token length in characters

Parameters:

text (String) —

The input text

Returns:

(Array<Hash>) —

Array of token hashes

# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 49

def tokenize(text)
  return [] if text.nil? || text.empty?

  tokens = []
  position = 0

  # Find all matches
  text.scan(@pattern) do |match|
    match_str = match.is_a?(Array) ? match.first : match
    start_pos = text.index(match_str, position)

    tokens << {
      token: match_str,
      position: start_pos,
      length: match_str.length
    }

    position = start_pos + match_str.length
  end

  tokens
end

#word_char?(char) ⇒ `Boolean`

Check if a character is a word character.

Parameters:

char (String) —

Single character

Returns:

(Boolean) —

True if word character



83
84
85

# File 'lib/kotoshu/components/whitespace_tokenizer.rb', line 83

def word_char?(char)
  char.match?(/[\w]/)
end

Class: Kotoshu::Components::WhitespaceTokenizer

Overview

Examples:

Basic tokenization

Tokenizing to strings

Direct Known Subclasses

Constant Summary collapse

Instance Method Summary collapse

Methods inherited from Tokenizer

Constructor Details

#initialize(pattern: TOKEN_PATTERN) ⇒ WhitespaceTokenizer

Instance Method Details

#pattern ⇒ Regexp

#punctuation?(char) ⇒ Boolean

#tokenize(text) ⇒ Array<Hash>

#word_char?(char) ⇒ Boolean

#initialize(pattern: TOKEN_PATTERN) ⇒ `WhitespaceTokenizer`

#pattern ⇒ `Regexp`

#punctuation?(char) ⇒ `Boolean`

#tokenize(text) ⇒ `Array<Hash>`

#word_char?(char) ⇒ `Boolean`