Class: Kotoshu::Components::Tokenizer Abstract

Inherits:
Object
Defined in:
lib/kotoshu/components/tokenizer.rb

Overview

This class is abstract.

Subclasses must implement #tokenize

Base class for tokenizers.

Tokenizers split text into individual tokens (words, punctuation). Different languages use different tokenization strategies:

  • Latin scripts: Whitespace + punctuation

  • CJK: Morphological analysis

  • German: Compound word splitting

  • RTL: Right-to-left text handling

Examples:

Tokenizing English text

tokenizer = WhitespaceTokenizer.new
tokens = tokenizer.tokenize("Hello, world!")
# => [
#      { token: "Hello", position: 0, length: 5 },
#      { token: ",", position: 5, length: 1 },
#      { token: "world", position: 7, length: 5 },
#      { token: "!", position: 12, length: 1 }
#    ]

Direct Known Subclasses

WhitespaceTokenizer

Instance Method Details

#tokenize(text) ⇒ Array<Hash>

This method is abstract.

Subclasses must implement this method.

Split text into tokens.

Each token is a hash with:

  • :token (String) - The token text

  • :position (Integer) - Zero-based character offset in the original text

  • :length (Integer) - Token length in characters

Additional keys may be added by subclasses:

  • :pos_tag (String) - Part of speech tag

  • :lemma (String) - Base form / lemma

  • :compound_part (Boolean) - Whether this is a compound word part

  • :script (Symbol) - Script type for multilingual text
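The required keys can be illustrated with a minimal sketch of a subclass. The `Tokenizer` class below is a stand-in for Kotoshu::Components::Tokenizer so the snippet runs on its own, and `SimpleRegexTokenizer` and `TOKEN_PATTERN` are hypothetical names that are not part of the library. It emits only the three required keys.

```ruby
# Stand-in for Kotoshu::Components::Tokenizer (assumption: same contract).
class Tokenizer
  def tokenize(text)
    raise NotImplementedError, "#{self.class} must implement #tokenize"
  end
end

# Hypothetical subclass: runs of letters/digits, or one punctuation character.
class SimpleRegexTokenizer < Tokenizer
  TOKEN_PATTERN = /\p{Alnum}+|[^\p{Alnum}\s]/

  def tokenize(text)
    tokens = []
    text.scan(TOKEN_PATTERN) do
      match = Regexp.last_match
      tokens << {
        token: match[0],
        position: match.begin(0), # character offset, not byte offset
        length: match[0].length   # length in characters
      }
    end
    tokens
  end
end

tokens = SimpleRegexTokenizer.new.tokenize("Hello, world!")
```

For "Hello, world!" this sketch yields the same four hashes as the class-level example. A real subclass could merge keys such as :lemma or :pos_tag into each hash.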

Parameters:

  • text (String)

    The input text

Returns:

  • (Array<Hash>)

    Array of token hashes

Raises:

  • (NotImplementedError)

    if not implemented by subclass



# File 'lib/kotoshu/components/tokenizer.rb', line 43

def tokenize(text)
  raise NotImplementedError, "#{self.class} must implement #tokenize"
end
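
A caller that may receive the abstract base class can guard against the error, as in the sketch below. The `Tokenizer` class is a stand-in that copies the method body shown above, and `try_tokenize` is a hypothetical helper. In Ruby, NotImplementedError descends from ScriptError rather than StandardError, so a bare `rescue` does not catch it.

```ruby
# Stand-in mirroring the base class source shown above (not the real gem).
class Tokenizer
  def tokenize(text)
    raise NotImplementedError, "#{self.class} must implement #tokenize"
  end
end

# Hypothetical helper: fall back to an empty token list when the
# tokenizer does not implement #tokenize.
def try_tokenize(tokenizer, text)
  tokenizer.tokenize(text)
rescue NotImplementedError => e
  # Named explicitly: a bare `rescue` only covers StandardError.
  warn "tokenizer not usable: #{e.message}"
  []
end

result = try_tokenize(Tokenizer.new, "Hello")
```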

#tokenize_to_strings(text) ⇒ Array<String>

Tokenize and return just the token strings.

Convenience method for when you only need the text content.

Parameters:

  • text (String)

    The input text

Returns:

  • (Array<String>)

    Array of token strings



# File 'lib/kotoshu/components/tokenizer.rb', line 53

def tokenize_to_strings(text)
  tokenize(text).map { |t| t[:token] }
end
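
Because #tokenize_to_strings is defined on the base class in terms of #tokenize, a subclass gets it by implementing #tokenize alone. The sketch below shows this with a stand-in `Tokenizer` that copies the two method bodies above and a hypothetical `SpaceSplitTokenizer`. Neither name is taken from the library.

```ruby
# Stand-in for Kotoshu::Components::Tokenizer, copying the methods above.
class Tokenizer
  def tokenize(text)
    raise NotImplementedError, "#{self.class} must implement #tokenize"
  end

  def tokenize_to_strings(text)
    tokenize(text).map { |t| t[:token] }
  end
end

# Hypothetical subclass: every run of non-whitespace is one token.
class SpaceSplitTokenizer < Tokenizer
  def tokenize(text)
    tokens = []
    text.scan(/\S+/) do
      match = Regexp.last_match
      tokens << { token: match[0], position: match.begin(0), length: match[0].length }
    end
    tokens
  end
end

strings = SpaceSplitTokenizer.new.tokenize_to_strings("split on spaces")
# => ["split", "on", "spaces"]
```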