Class: Kotoshu::Components::Tokenizer Abstract

Inherits:

Object

Object
Kotoshu::Components::Tokenizer

Defined in: lib/kotoshu/components/tokenizer.rb

Overview

This class is abstract.

Subclasses must implement #tokenize

Base class for tokenizers.

Tokenizers split text into individual tokens (words, punctuation). Different languages use different tokenization strategies:

Latin scripts: Whitespace + punctuation
CJK: Morphological analysis
German: Compound word splitting
RTL: Right-to-left text handling

Examples:

Tokenizing English text

tokenizer = WhitespaceTokenizer.new
tokens = tokenizer.tokenize("Hello, world!")
# => [
#      { token: "Hello", position: 0, length: 5 },
#      { token: ",", position: 5, length: 1 },
#      { token: "world", position: 7, length: 5 },
#      { token: "!", position: 12, length: 1 }
#    ]

Direct Known Subclasses

WhitespaceTokenizer

Instance Method Summary

#tokenize(text) ⇒ Array<Hash> abstract

Split text into tokens.

#tokenize_to_strings(text) ⇒ Array<String>

Tokenize and return just the token strings.

Instance Method Details

#tokenize(text) ⇒ `Array<Hash>`

This method is abstract.

Subclasses must implement this method.

Split text into tokens.

Each token is a hash with:

:token (String) - The token text
:position (Integer) - Zero-based character offset in the original text
:length (Integer) - Token length in characters

Additional keys may be added by subclasses:

:pos_tag (String) - Part of speech tag
:lemma (String) - Base form / lemma
:compound_part (Boolean) - Whether this is a compound word part
:script (Symbol) - Script type for multilingual text
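
As an illustration of the optional keys above, the following sketch adds a :script key to token hashes that already carry the three required keys. The helper names `detect_script` and `annotate_script` and the symbol values (:han, :kana, :latin, :other) are assumptions made for this example; the documentation does not say which values Kotoshu itself uses.

```ruby
# Hypothetical helper: guess a script label from the characters in a token.
# The returned symbols are illustrative, not values defined by Kotoshu.
def detect_script(token)
  case token
  when /\p{Han}/ then :han
  when /\p{Hiragana}|\p{Katakana}/ then :kana
  when /\p{Latin}/ then :latin
  else :other
  end
end

# Hypothetical helper: return new token hashes with a :script key merged in.
# The required :token, :position, and :length keys are left untouched.
def annotate_script(tokens)
  tokens.map { |t| t.merge(script: detect_script(t[:token])) }
end

tokens = [
  { token: "Hello", position: 0, length: 5 },
  { token: "日本", position: 6, length: 2 },
  { token: "!", position: 8, length: 1 }
]
annotated = annotate_script(tokens)
```

Because the extra key is merged into a copy, callers that only read :token, :position, and :length keep working unchanged.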

Parameters:

text (String) —

The input text

Returns:

(Array<Hash>) —

Array of token hashes

Raises:

(NotImplementedError) —

if not implemented by subclass



# File 'lib/kotoshu/components/tokenizer.rb', line 43

def tokenize(text)
  raise NotImplementedError, "#{self.class} must implement #tokenize"
end
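
A subclass satisfies this contract by overriding #tokenize and returning hashes with the three required keys. The sketch below is a minimal example, not Kotoshu's WhitespaceTokenizer: `SketchTokenizerBase` stands in for Kotoshu::Components::Tokenizer so the code runs without the gem, and `SimplePunctuationTokenizer` and its regular expression are assumptions chosen for illustration.

```ruby
# Stand-in for Kotoshu::Components::Tokenizer (hypothetical name) so the
# sketch is self-contained.
class SketchTokenizerBase
  def tokenize(text)
    raise NotImplementedError, "#{self.class} must implement #tokenize"
  end
end

# Hypothetical subclass: emits runs of word characters and single
# punctuation marks, and skips whitespace.
class SimplePunctuationTokenizer < SketchTokenizerBase
  # \w is ASCII-only in Ruby by default, so this pattern suits Latin text only.
  TOKEN_PATTERN = /\w+|[^\w\s]/

  def tokenize(text)
    tokens = []
    text.scan(TOKEN_PATTERN) do |match|
      # Regexp.last_match refers to the current match inside the scan block;
      # begin(0) is its zero-based character offset.
      start = Regexp.last_match.begin(0)
      tokens << { token: match, position: start, length: match.length }
    end
    tokens
  end
end

result = SimplePunctuationTokenizer.new.tokenize("Hello, world!")
# => [
#      { token: "Hello", position: 0, length: 5 },
#      { token: ",", position: 5, length: 1 },
#      { token: "world", position: 7, length: 5 },
#      { token: "!", position: 12, length: 1 }
#    ]
```

Calling #tokenize on the base class directly raises NotImplementedError. That error descends from ScriptError rather than StandardError, so a bare `rescue` will not catch it.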

#tokenize_to_strings(text) ⇒ `Array<String>`

Tokenize and return just the token strings.

Convenience method for when you only need the text content.
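
The source of this method is not shown here. One plausible implementation, given the description, maps over the result of #tokenize and keeps only the :token value. The sketch below assumes that behavior; `SketchTokenizer` and `SpaceSplitTokenizer` are hypothetical stand-ins, not Kotoshu classes.

```ruby
# Hypothetical base class mirroring the documented interface.
class SketchTokenizer
  def tokenize(text)
    raise NotImplementedError, "#{self.class} must implement #tokenize"
  end

  # Assumed implementation: delegate to #tokenize, then drop the
  # position and length metadata.
  def tokenize_to_strings(text)
    tokenize(text).map { |t| t[:token] }
  end
end

# Hypothetical subclass that splits on whitespace only.
class SpaceSplitTokenizer < SketchTokenizer
  def tokenize(text)
    tokens = []
    text.scan(/\S+/) do |match|
      tokens << { token: match,
                  position: Regexp.last_match.begin(0),
                  length: match.length }
    end
    tokens
  end
end

strings = SpaceSplitTokenizer.new.tokenize_to_strings("Hello, world!")
# => ["Hello,", "world!"]
```

With this design, a subclass only has to implement #tokenize to get #tokenize_to_strings as well.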