Class: Kotoshu::Language::Tokenizer::Base

Inherits:
Object
Defined in:
lib/kotoshu/language/tokenizer/base.rb

Overview

Abstract base class for tokenizers.

Uses the Strategy pattern so that each language can supply its own tokenization approach.

Subclasses must implement #tokenize. #word_boundary_regex also raises NotImplementedError until a subclass overrides it.

Examples:

Implement a tokenizer

class MyTokenizer < Tokenizer::Base
  def tokenize(text)
    text.split(/ /)
  end
end
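
To illustrate the Strategy pattern mentioned in the overview, here is a minimal sketch of a caller that accepts any tokenizer. SketchTokenizerBase is a stand-in for Tokenizer::Base so that the snippet runs without the gem; WhitespaceTokenizer, CharacterTokenizer, and token_count are hypothetical names that are not part of Kotoshu.

```ruby
# Stand-in for Kotoshu::Language::Tokenizer::Base (assumption: only #tokenize matters here).
class SketchTokenizerBase
  def tokenize(text)
    raise NotImplementedError, "#{self.class} must implement #tokenize"
  end
end

# Hypothetical strategy for space-delimited languages.
class WhitespaceTokenizer < SketchTokenizerBase
  def tokenize(text)
    text.split(/\s+/).reject(&:empty?)
  end
end

# Hypothetical strategy for scripts written without spaces: one token per character.
class CharacterTokenizer < SketchTokenizerBase
  def tokenize(text)
    text.gsub(/\s+/, "").chars
  end
end

# The caller depends only on #tokenize, so the strategies are interchangeable.
def token_count(tokenizer, text)
  tokenizer.tokenize(text).size
end

token_count(WhitespaceTokenizer.new, "good morning") # 2 tokens
token_count(CharacterTokenizer.new, "こんにちは")      # 5 tokens
```

Because the caller never checks the tokenizer's class, adding support for another language should only require another subclass.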

Instance Method Details

#normalize(token) ⇒ String

Normalize a token.

Subclasses can override this for language-specific normalization.

Parameters:

  • token (String)

    Token to normalize

Returns:

  • (String)

    Normalized token



# File 'lib/kotoshu/language/tokenizer/base.rb', line 114

def normalize(token)
  token
end
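
Since the default implementation returns the token unchanged, language-specific behavior has to come from an override. The following is a sketch of one possible override, not Kotoshu's own code: SketchNormalizerBase stands in for Tokenizer::Base, and FoldingTokenizer is a hypothetical subclass.

```ruby
# Stand-in for Tokenizer::Base with the identity #normalize shown above.
class SketchNormalizerBase
  def normalize(token)
    token
  end
end

# Hypothetical subclass: lowercases the token and strips one surrounding apostrophe
# on each side, so that "'Hello'" and "hello" compare equal.
class FoldingTokenizer < SketchNormalizerBase
  def normalize(token)
    token.downcase.delete_prefix("'").delete_suffix("'")
  end
end

SketchNormalizerBase.new.normalize("'Hello'") # unchanged: "'Hello'"
FoldingTokenizer.new.normalize("'Hello'")     # "hello"
```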

#skip_token?(token) ⇒ Boolean

Check if a token should be skipped.

Subclasses can override this for language-specific filtering.

Parameters:

  • token (String)

    Token to check

Returns:

  • (Boolean)

    True if token should be skipped



# File 'lib/kotoshu/language/tokenizer/base.rb', line 124

def skip_token?(token)
  return true if token.empty?
  return true if token.match?(/^\d+$/) # Pure numbers
  return true if token.length < 2 && token.match?(/^[^\p{L}]$/)

  false
end
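
To make the three rules concrete, the sketch below copies the method body from the listing above into a stand-in class (SketchFilter, a hypothetical name) and runs it over a few sample tokens.

```ruby
# Stand-in holder for the default filter; the method body is copied from the listing above.
class SketchFilter
  def skip_token?(token)
    return true if token.empty?
    return true if token.match?(/^\d+$/) # Pure numbers
    return true if token.length < 2 && token.match?(/^[^\p{L}]$/)

    false
  end
end

filter = SketchFilter.new
samples = ["", "2024", "-", "a", "--", "42nd", "word"]

# Keep only the tokens that the default filter does not skip.
kept = samples.reject { |token| filter.skip_token?(token) }
# kept == ["a", "--", "42nd", "word"]
```

Note that the third rule only applies to single-character tokens, so a multi-character punctuation run such as "--" is not skipped by default. A subclass that wants to drop such tokens would need its own override.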

#tokenize(text) ⇒ Array<String>

Tokenize text into words.

Parameters:

  • text (String)

    Text to tokenize

Returns:

  • (Array<String>)

    Array of tokens

Raises:

  • (NotImplementedError)

    Must be implemented by subclass



# File 'lib/kotoshu/language/tokenizer/base.rb', line 25

def tokenize(text)
  raise NotImplementedError, "#{self.class} must implement #tokenize"
end
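
One practical consequence of raising NotImplementedError: in Ruby it descends from ScriptError rather than StandardError, so a bare rescue clause does not catch it. The sketch below uses stand-in classes (SketchAbstractTokenizer and IncompleteTokenizer are hypothetical names) to show the error message and an explicit rescue.

```ruby
# Stand-in for Tokenizer::Base with the abstract #tokenize shown above.
class SketchAbstractTokenizer
  def tokenize(text)
    raise NotImplementedError, "#{self.class} must implement #tokenize"
  end
end

# Hypothetical subclass that forgot to implement #tokenize.
class IncompleteTokenizer < SketchAbstractTokenizer; end

message =
  begin
    IncompleteTokenizer.new.tokenize("some text")
  rescue NotImplementedError => e
    # The class must be named explicitly; `rescue => e` only covers StandardError.
    e.message
  end
# message == "IncompleteTokenizer must implement #tokenize"

NotImplementedError.ancestors.include?(StandardError) # false
```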

#tokenize_with_positions(text) ⇒ Array<Hash>

Tokenize text with positions.

Returns each token together with its position. start and end are zero-based character offsets (end is exclusive); line and column are one-based.

Parameters:

  • text (String)

    Text to tokenize

Returns:

  • (Array<Hash>)

    Array of hashes with the keys token:, start:, end:, line:, column:



# File 'lib/kotoshu/language/tokenizer/base.rb', line 35

def tokenize_with_positions(text)
  return [] if text.nil?
  return [] if text.empty?

  tokens = []
  line = 1
  column = 1
  position = 0

  while position < text.length
    # Skip whitespace
    while position < text.length && text[position].match?(/\s/)
      if text[position] == "\n"
        line += 1
        column = 1
      else
        column += 1
      end
      position += 1
    end

    break if position >= text.length

    # Find token
    start_pos = position
    start_line = line
    start_column = column

    token_text = extract_next_token(text, position)

    if token_text
      tokens << {
        token: token_text,
        start: start_pos,
        end: start_pos + token_text.length,
        line: start_line,
        column: start_column
      }

      token_text.each_char do |char|
        column += 1
        position += 1
        if char == "\n"
          line += 1
          column = 1
        end
      end
    else
      position += 1
      column += 1
    end
  end

  tokens
end
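
The method relies on extract_next_token, which is not documented on this page. The sketch below therefore makes an assumption: extract_next_token returns the run of non-whitespace characters starting at the given position. PositionSketch is a hypothetical stand-in class, and its scanning loop is a condensed copy of the listing above with the same bookkeeping.

```ruby
class PositionSketch
  def tokenize_with_positions(text)
    return [] if text.nil? || text.empty?

    tokens = []
    line = 1
    column = 1
    position = 0

    while position < text.length
      # Skip whitespace, tracking line and column as in the original.
      while position < text.length && text[position].match?(/\s/)
        if text[position] == "\n"
          line += 1
          column = 1
        else
          column += 1
        end
        position += 1
      end
      break if position >= text.length

      token_text = extract_next_token(text, position)
      if token_text
        tokens << { token: token_text, start: position,
                    end: position + token_text.length,
                    line: line, column: column }
        position += token_text.length
        column += token_text.length # tokens contain no newlines under the assumption below
      else
        position += 1
        column += 1
      end
    end

    tokens
  end

  private

  # Assumption, not Kotoshu's implementation: a token is a run of non-whitespace characters.
  def extract_next_token(text, position)
    text[position..-1][/\A\S+/]
  end
end

PositionSketch.new.tokenize_with_positions("hi there\nyou")
# [{ token: "hi",    start: 0, end: 2,  line: 1, column: 1 },
#  { token: "there", start: 3, end: 8,  line: 1, column: 4 },
#  { token: "you",   start: 9, end: 12, line: 2, column: 1 }]
```

With exclusive end offsets, text[start...end] recovers the token text, which is convenient when highlighting a token in the original string.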

#word_boundary_regex ⇒ Regexp

Get word boundary regex for this tokenizer.

Subclasses should override this to define word boundaries.

Returns:

  • (Regexp)

    Word boundary regex

Raises:

  • (NotImplementedError)

    Must be implemented by subclass


# File 'lib/kotoshu/language/tokenizer/base.rb', line 104

def word_boundary_regex
  raise NotImplementedError, "#{self.class} must implement #word_boundary_regex"
end

#word_char?(char) ⇒ Boolean

Check if a character is a word character.

Parameters:

  • char (String)

    Single character

Returns:

  • (Boolean)

    True if word character



# File 'lib/kotoshu/language/tokenizer/base.rb', line 95

def word_char?(char)
  match?(word_boundary_regex, char)
end
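
The two-argument match? helper called above is not documented on this page. The sketch below assumes it behaves like char.match?(regex), and shows how a subclass's word_boundary_regex would drive word_char?. SketchWordBase and LatinWordTokenizer are hypothetical names, and the regex is only an example.

```ruby
# Stand-in for Tokenizer::Base.
class SketchWordBase
  def word_boundary_regex
    raise NotImplementedError, "#{self.class} must implement #word_boundary_regex"
  end

  # Assumption: equivalent to the match?(word_boundary_regex, char) call in the listing.
  def word_char?(char)
    char.match?(word_boundary_regex)
  end
end

# Hypothetical subclass: letters, combining marks, apostrophes, and hyphens count as word characters.
class LatinWordTokenizer < SketchWordBase
  def word_boundary_regex
    /[\p{L}\p{M}'-]/
  end
end

tokenizer = LatinWordTokenizer.new
tokenizer.word_char?("é") # true
tokenizer.word_char?("'") # true
tokenizer.word_char?("!") # false
```

Because word_char? delegates to word_boundary_regex, calling it on a subclass that has not overridden the regex raises NotImplementedError as well.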