Class: TokenKit::Tokenizer

Inherits:
Object
Defined in:
lib/tokenkit.rb

Overview

Instance-based tokenizer that carries its own configuration, for thread-safe tokenization.

Examples:

Create a tokenizer with custom config

tokenizer = TokenKit::Tokenizer.new(
  strategy: :unicode,
  lowercase: true,
  preserve_patterns: [/\d+mg/i]
)
tokenizer.tokenize("Patient received 100mg")
# => ["patient", "received", "100mg"]

Constructor Details

#initialize(config = {}) ⇒ Tokenizer

Creates a new tokenizer instance with the specified configuration.

Examples:

With hash configuration

tokenizer = TokenKit::Tokenizer.new(strategy: :whitespace)

With existing configuration

config = TokenKit.config_hash
tokenizer = TokenKit::Tokenizer.new(config)

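Parameters:

  • config (Hash, Configuration, ConfigBuilder) (defaults to: {})

    A configuration hash, Configuration, or ConfigBuilder; any other value falls back to TokenKit.config_hash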

Options Hash (config):

  • :strategy (Symbol) — default: :unicode

    The tokenization strategy

  • :lowercase (Boolean) — default: true

    Whether to lowercase tokens

  • :remove_punctuation (Boolean) — default: false

    Whether to remove punctuation

  • :preserve_patterns (Array<Regexp>) — default: []

    Patterns to preserve



# File 'lib/tokenkit.rb', line 72

def initialize(config = {})
  @config = if config.is_a?(Configuration)
    config
  elsif config.is_a?(ConfigBuilder)
    config.build
  elsif config.is_a?(Hash)
    builder = TokenKit.config_hash.to_builder
    config.each do |key, value|
      builder.send("#{key}=", value) if builder.respond_to?("#{key}=")
    end
    builder.build
  else
    TokenKit.config_hash
  end
end
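
The Hash branch above copies each key onto a builder only when the builder responds to a matching setter, so unrecognized keys are skipped without raising. The following is a minimal, self-contained sketch of that pattern. `SketchBuilder` and `build_config` are hypothetical stand-ins, not part of TokenKit, and the defaults are the ones listed in the options above.

```ruby
# Hypothetical stand-in for TokenKit's builder; only the setter-guard pattern
# mirrors the constructor source shown above.
SketchBuilder = Struct.new(:strategy, :lowercase, :remove_punctuation, :preserve_patterns) do
  # Returns an immutable snapshot of the current settings.
  def build
    to_h.freeze
  end
end

# Starts from the defaults, then applies only keys that have a matching setter.
def build_config(overrides = {})
  builder = SketchBuilder.new(:unicode, true, false, [])
  overrides.each do |key, value|
    setter = "#{key}="
    builder.public_send(setter, value) if builder.respond_to?(setter)
  end
  builder.build
end

config = build_config(strategy: :whitespace, no_such_option: 1)
p config[:strategy]             # :whitespace
p config.key?(:no_such_option)  # false
```

One consequence of this guard is that a misspelled option name would go unnoticed, so it may be worth checking `tokenizer.config` after construction.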

Instance Attribute Details

#config ⇒ Configuration (readonly)

Returns the tokenizer's configuration.

Returns:

  • (Configuration)

    The tokenizer's configuration



# File 'lib/tokenkit.rb', line 55

def config
  @config
end

Instance Method Details

#tokenize(text) ⇒ Array<String>

Tokenizes the given text using this tokenizer's configuration.

Examples:

tokenizer = TokenKit::Tokenizer.new(strategy: :unicode)
tokenizer.tokenize("Hello world")
# => ["hello", "world"]

Parameters:

  • text (String)

    The text to tokenize

Returns:

  • (Array<String>)

    An array of tokens



# File 'lib/tokenkit.rb', line 98

def tokenize(text)
  TokenKit._tokenize_with_config(text, @config.to_rust_config)
end
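
As the source above shows, `#tokenize` only reads `@config`, which is why separate instances with different settings can be used side by side from several threads. The sketch below illustrates that usage shape only. `SketchTokenizer` is a hypothetical pure-Ruby stand-in (whitespace split plus optional lowercasing); the real method delegates to `TokenKit._tokenize_with_config`.

```ruby
# Hypothetical stand-in, not TokenKit's implementation: splits on whitespace
# and lowercases when configured to.
class SketchTokenizer
  attr_reader :config

  def initialize(config = {})
    @config = { lowercase: true }.merge(config).freeze
  end

  # Reads @config but never mutates it, so concurrent calls do not interfere.
  def tokenize(text)
    tokens = text.split
    config[:lowercase] ? tokens.map(&:downcase) : tokens
  end
end

lower = SketchTokenizer.new
exact = SketchTokenizer.new(lowercase: false)

# Even-numbered threads use `lower`, odd-numbered ones use `exact`;
# Thread#value waits for each thread and returns its result.
results = 4.times.map do |i|
  Thread.new { (i.even? ? lower : exact).tokenize("Hello World") }
end.map(&:value)

p results[0]  # ["hello", "world"]
p results[1]  # ["Hello", "World"]
```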