Class: Kotoshu::Language::Tokenizer::Base

Inherits:

Object

Object
Kotoshu::Language::Tokenizer::Base

show all

Defined in:: lib/kotoshu/language/tokenizer/base.rb

Overview

Abstract base class for tokenizers.

Uses Strategy pattern to allow different tokenization approaches for different languages.

Subclasses must implement the tokenize method.

Examples:

Implement a tokenizer

class MyTokenizer < Tokenizer::Base
  def tokenize(text)
    text.split(/ /)
  end
end

Direct Known Subclasses

FrenchTokenizer, GermanTokenizer, JapaneseTokenizer, LatinTokenizer, PortugueseTokenizer, RussianTokenizer, SpanishTokenizer

Instance Method Summary collapse

#normalize(token) ⇒ String

Normalize a token.
#skip_token?(token) ⇒ Boolean

Check if a token should be skipped.
#tokenize(text) ⇒ Array<String>

Tokenize text into words.
#tokenize_with_positions(text) ⇒ Array<Hash>

Tokenize text with positions.
#word_boundary_regex ⇒ Regexp

Get word boundary regex for this tokenizer.
#word_char?(char) ⇒ Boolean

Check if a character is a word character.

Instance Method Details

#normalize(token) ⇒ `String`

Normalize a token.

Subclasses can override this for language-specific normalization.

Parameters:

token (String) —

Token to normalize

Returns:

(String) —

Normalized token



114
115
116

# File 'lib/kotoshu/language/tokenizer/base.rb', line 114

def normalize(token)
  token
end

#skip_token?(token) ⇒ `Boolean`

Check if a token should be skipped.

Subclasses can override this for language-specific filtering.

Parameters:

token (String) —

Token to check

Returns:

(Boolean) —

True if token should be skipped

# File 'lib/kotoshu/language/tokenizer/base.rb', line 124

def skip_token?(token)
  return true if token.empty?
  return true if token.match?(/^\d+$/) # Pure numbers
  return true if token.length < 2 && token.match?(/^[^\p{L}]$/)

  false
end

#tokenize(text) ⇒ `Array<String>`

Tokenize text into words.

Parameters:

text (String) —

Text to tokenize

Returns:

(Array<String>) —

Array of tokens

Raises:

(NotImplementedError) —

Must be implemented by subclass



25
26
27

# File 'lib/kotoshu/language/tokenizer/base.rb', line 25

def tokenize(text)
  raise NotImplementedError, "#{self.class} must implement #tokenize"
end

#tokenize_with_positions(text) ⇒ `Array<Hash>`

Tokenize text with positions.

Returns tokens along with their position information.

Parameters:

text (String) —

Text to tokenize

Returns:

(Array<Hash>) —

Array of start:, end:, line:, column:

# File 'lib/kotoshu/language/tokenizer/base.rb', line 35

def tokenize_with_positions(text)
  return [] if text.nil?
  return [] if text.empty?

  tokens = []
  line = 1
  column = 1
  position = 0

  while position < text.length
    # Skip whitespace
    while position < text.length && text[position].match?(/\s/)
      if text[position] == "\n"
        line += 1
        column = 1
      else
        column += 1
      end
      position += 1
    end

    break if position >= text.length

    # Find token
    start_pos = position
    start_line = line
    start_column = column

    token_text = extract_next_token(text, position)

    if token_text
      tokens << {
        token: token_text,
        start: start_pos,
        end: start_pos + token_text.length,
        line: start_line,
        column: start_column
      }

      token_text.each_char do |char|
        column += 1
        position += 1
        if char == "\n"
          line += 1
          column = 1
        end
      end
    else
      position += 1
      column += 1
    end
  end

  tokens
end

#word_boundary_regex ⇒ `Regexp`

Get word boundary regex for this tokenizer.

Subclasses should override this to define word boundaries.

Returns:

(Regexp) —

Word boundary regex

Raises:

(NotImplementedError)



104
105
106

# File 'lib/kotoshu/language/tokenizer/base.rb', line 104

def word_boundary_regex
  raise NotImplementedError, "#{self.class} must implement #word_boundary_regex"
end

#word_char?(char) ⇒ `Boolean`

Check if a character is a word character.

Parameters:

char (String) —

Single character

Returns:

(Boolean) —

True if word character



95
96
97

# File 'lib/kotoshu/language/tokenizer/base.rb', line 95

def word_char?(char)
  match?(word_boundary_regex, char)
end

Class: Kotoshu::Language::Tokenizer::Base

Overview

Examples:

Implement a tokenizer

Direct Known Subclasses

Instance Method Summary collapse

Instance Method Details

#normalize(token) ⇒ String

#skip_token?(token) ⇒ Boolean

#tokenize(text) ⇒ Array<String>

#tokenize_with_positions(text) ⇒ Array<Hash>

#word_boundary_regex ⇒ Regexp

#word_char?(char) ⇒ Boolean

#normalize(token) ⇒ `String`

#skip_token?(token) ⇒ `Boolean`

#tokenize(text) ⇒ `Array<String>`

#tokenize_with_positions(text) ⇒ `Array<Hash>`

#word_boundary_regex ⇒ `Regexp`

#word_char?(char) ⇒ `Boolean`