Module: TokenKit

Extended by:
TokenKit
Included in:
TokenKit
Defined in:
lib/tokenkit.rb,
lib/tokenkit/config.rb,
lib/tokenkit/version.rb,
lib/tokenkit/config_compat.rb,
lib/tokenkit/configuration.rb,
lib/tokenkit/config_builder.rb,
lib/tokenkit/regex_converter.rb

Overview

TokenKit provides fast, Rust-backed tokenization for Ruby with pattern preservation.

Examples:

Basic usage

TokenKit.tokenize("Hello, world!")
# => ["hello", "world"]

Configuration

TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
  config.preserve_patterns = [/\d+mg/i]
end

Instance-based tokenization

tokenizer = TokenKit::Tokenizer.new(strategy: :unicode)
tokenizer.tokenize("test text")

Defined Under Namespace

Modules: RegexConverter

Classes: Config, ConfigBuilder, Configuration, Error, Tokenizer

Constant Summary

VERSION = "0.1.0.pre.2"

Instance Method Details

#config ⇒ Config

Deprecated.

Use #config_hash for read-only access or #configure to modify the configuration.

Returns the global configuration object for backward compatibility.

Examples:

TokenKit.config.strategy = :unicode  # Deprecated

Returns:

  • (Config)

    The global configuration singleton



# File 'lib/tokenkit.rb', line 157

def config
  Config.instance
end

#config_hash ⇒ Configuration

Returns the current global configuration as an immutable object.

Examples:

Get current configuration

config = TokenKit.config_hash
config.strategy          # => :unicode
config.lowercase         # => true
config.preserve_patterns # => []

Check strategy type

config = TokenKit.config_hash
config.unicode?          # => true
config.edge_ngram?       # => false

Returns:

  • (Configuration)

    The current configuration with accessor methods



# File 'lib/tokenkit.rb', line 176

def config_hash
  @config_mutex.synchronize do
    @current_config ||= ConfigBuilder.new.build
  end
end
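
The listing above builds the default configuration lazily, inside a mutex, and returns the same object on later calls. A minimal pure-Ruby sketch of that pattern follows. The names LazyConfigSketch and Settings are hypothetical stand-ins, not part of TokenKit, and the frozen Struct only approximates what an immutable Configuration might look like.

```ruby
# Hypothetical stand-in for module-level state; this is not TokenKit's code.
module LazyConfigSketch
  # Value object playing the role of Configuration in this sketch.
  Settings = Struct.new(:strategy, :lowercase, :preserve_patterns, keyword_init: true)

  @mutex = Mutex.new
  @current = nil

  class << self
    # Build the default once, under the lock, and return the same frozen object afterwards.
    def current
      @mutex.synchronize do
        @current ||= Settings.new(
          strategy: :unicode,
          lowercase: true,
          preserve_patterns: [].freeze
        ).freeze
      end
    end
  end
end

first = LazyConfigSketch.current
second = LazyConfigSketch.current
puts first.equal?(second) # true: both calls return the one memoized object
puts first.frozen?        # true: setters on the struct would raise FrozenError
```

Holding the lock around `||=` means two threads that race on the first call cannot each build and publish a different default.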

#configure {|Config| ... } ⇒ Configuration

Configures the global tokenizer settings.

Examples:

Basic configuration

TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
end

With pattern preservation

TokenKit.configure do |config|
  config.strategy = :unicode
  config.preserve_patterns = [
    /\d+mg/i,            # Measurements
    /[A-Z]{2,}/,         # Acronyms
    /\w+@\w+\.\w+/      # Emails
  ]
end
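
To make the idea of pattern preservation concrete, here is a small pure-Ruby sketch: text that matches a preserve pattern is emitted as a single, unmodified token, while the text around it is lowercased and split into words. This is an illustration of the concept only. The helper tokenize_preserving is hypothetical, and the Rust-backed tokenizer may treat boundaries, overlaps, and casing differently.

```ruby
# Hypothetical helper, not part of TokenKit: keep regex matches intact and
# lowercase/split everything between them.
def tokenize_preserving(text, patterns)
  keep = Regexp.union(patterns) # an empty list yields a regexp that never matches
  tokens = []
  pos = 0

  while (m = keep.match(text, pos))
    # A zero-width match would never advance the cursor, so reject it.
    raise ArgumentError, "preserve patterns must not match empty text" if m[0].empty?

    tokens.concat(text[pos...m.begin(0)].downcase.scan(/\w+/)) # ordinary text before the match
    tokens << m[0]                                             # preserved span, casing untouched
    pos = m.end(0)
  end

  tokens.concat(text[pos..].downcase.scan(/\w+/)) # ordinary text after the last match
end

p tokenize_preserving("Take 500mg of ASA daily", [/\d+mg/i, /[A-Z]{2,}/])
# => ["take", "500mg", "of", "ASA", "daily"]
```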

Edge n-gram configuration

TokenKit.configure do |config|
  config.strategy = :edge_ngram
  config.min_gram = 2
  config.max_gram = 10
end
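
For readers unfamiliar with edge n-grams, the sketch below shows the usual meaning of min_gram and max_gram: every prefix of a word whose length falls between the two bounds. The helper edge_ngrams is hypothetical and written in plain Ruby. This page does not show TokenKit's actual output for the :edge_ngram strategy, so treat the results as an illustration of the general technique.

```ruby
# Hypothetical helper, not part of TokenKit: prefixes of `word` with lengths
# from min_gram up to max_gram, capped at the word's own length.
def edge_ngrams(word, min_gram: 2, max_gram: 10)
  upper = [max_gram, word.length].min
  (min_gram..upper).map { |n| word[0, n] }
end

p edge_ngrams("search")              # => ["se", "sea", "sear", "searc", "search"]
p edge_ngrams("search", max_gram: 4) # => ["se", "sea", "sear"]
p edge_ngrams("a")                   # => [] (shorter than min_gram)
```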

Yields:

  • (Config)

    Yields the configuration object for modification

Returns:

  • (Configuration)

    The newly applied configuration

Raises:

  • (ArgumentError)

    If invalid configuration is provided

  • (RegexpError)

    If invalid regex pattern is provided



# File 'lib/tokenkit.rb', line 213

def configure
  # Use the compatibility wrapper to support old API
  yield Config.instance if block_given?

  # Get the builder from the compatibility wrapper
  builder = Config.instance.build_config

  begin
    # Build and validate the new configuration
    new_config = builder.build

    # Apply to Rust tokenizer
    _configure(new_config.to_rust_config)

    # Store the new configuration
    @config_mutex.synchronize do
      @current_config = new_config
    end

    # Reset the compatibility wrapper
    Config.instance.reset_temp

    new_config
  rescue => e
    # Reset the compatibility wrapper on error
    Config.instance.reset_temp
    raise e
  end
end
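
The flow in the listing above (collect changes in a temporary object, validate, publish under a lock, and clean up the temporary state whether or not validation succeeds) can be sketched in plain Ruby as follows. Every name here (ConfigureSketch, Draft, Applied, KNOWN_STRATEGIES) is a hypothetical stand-in, and the validation rule is invented for the example. The sketch uses `ensure` for the cleanup. The listing instead repeats the reset call on both paths with `rescue => e`, which only intercepts StandardError and its subclasses.

```ruby
# Hypothetical stand-ins; none of these names come from TokenKit.
module ConfigureSketch
  KNOWN_STRATEGIES = %i[unicode edge_ngram char_group].freeze
  Draft = Struct.new(:strategy, :lowercase)   # mutable, yielded to the block
  Applied = Struct.new(:strategy, :lowercase) # frozen once published

  @mutex = Mutex.new
  @current = Applied.new(:unicode, true).freeze
  @draft = nil

  class << self
    attr_reader :draft

    def current
      @mutex.synchronize { @current }
    end

    def configure
      # Seed the draft from the applied settings so untouched fields carry over.
      @draft = Draft.new(current.strategy, current.lowercase)
      yield @draft if block_given?

      # Validate before touching shared state.
      unless KNOWN_STRATEGIES.include?(@draft.strategy)
        raise ArgumentError, "unknown strategy: #{@draft.strategy.inspect}"
      end

      candidate = Applied.new(@draft.strategy, @draft.lowercase).freeze
      @mutex.synchronize { @current = candidate }
      candidate
    ensure
      # Runs on success and on failure, so the draft never leaks into the next call.
      @draft = nil
    end
  end
end

ConfigureSketch.configure { |c| c.strategy = :edge_ngram }

begin
  ConfigureSketch.configure { |c| c.strategy = :bogus }
rescue ArgumentError => e
  puts e.message
end

puts ConfigureSketch.current.strategy # edge_ngram: the failed call changed nothing
```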

#reset ⇒ void

This method returns an undefined value.

Resets the tokenizer to default configuration.

Examples:

TokenKit.reset
# Configuration is now:
# - strategy: :unicode
# - lowercase: true
# - remove_punctuation: false
# - preserve_patterns: []


# File 'lib/tokenkit.rb', line 255

def reset
  # Create default configuration
  new_config = ConfigBuilder.new.build

  # Reset Rust tokenizer
  _reset
  _configure(new_config.to_rust_config)

  # Store the new configuration
  @config_mutex.synchronize do
    @current_config = new_config
  end

  # Reset the compatibility wrapper
  Config.instance.reset_temp

  # Reset Config singleton instance variables for backward compatibility
  Config.instance.instance_variable_set(:@strategy, :unicode)
  Config.instance.instance_variable_set(:@lowercase, true)
  Config.instance.instance_variable_set(:@remove_punctuation, false)
  Config.instance.instance_variable_set(:@preserve_patterns, [])
  Config.instance.instance_variable_set(:@grapheme_extended, true)
  Config.instance.instance_variable_set(:@min_gram, 2)
  Config.instance.instance_variable_set(:@max_gram, 10)
  Config.instance.instance_variable_set(:@delimiter, "/")
  Config.instance.instance_variable_set(:@split_on_chars, " \t\n\r")
end

#tokenize(text, **opts) ⇒ Array<String>

Tokenizes text using the global configuration or with temporary overrides.

Examples:

Basic tokenization

TokenKit.tokenize("Hello, world!")
# => ["hello", "world"]

With temporary overrides

TokenKit.tokenize("Hello World", lowercase: false)
# => ["Hello", "World"]

With strategy override

TokenKit.tokenize("test-case", strategy: :char_group, split_on_chars: "-")
# => ["test", "case"]

Parameters:

  • text (String)

    The text to tokenize

  • opts (Hash)

    Optional configuration overrides for this tokenization only

Options Hash (**opts):

  • :strategy (Symbol)

    The tokenization strategy to use

  • :lowercase (Boolean)

    Whether to lowercase tokens

  • :remove_punctuation (Boolean)

    Whether to remove punctuation

  • :preserve_patterns (Array<Regexp>)

    Patterns to preserve

  • :regex (String, Regexp)

    Pattern for :pattern strategy

  • :min_gram (Integer)

    Minimum n-gram size (for n-gram strategies)

  • :max_gram (Integer)

    Maximum n-gram size (for n-gram strategies)

  • :delimiter (String)

    Delimiter for :path_hierarchy strategy

  • :split_on_chars (String)

    Characters to split on for :char_group strategy

  • :extended (Boolean)

    Extended grapheme clusters for :grapheme strategy

Returns:

  • (Array<String>)

    An array of tokens



# File 'lib/tokenkit.rb', line 138

def tokenize(text, **opts)
  if opts.any?
    # Create a fresh tokenizer with merged config
    merged_config = build_merged_config(opts)
    _tokenize_with_config(text, merged_config)
  else
    # Use default config (creates fresh tokenizer internally)
    _tokenize(text)
  end
end
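
The key property of the override branch is that options passed to a single call are merged into a copy of the settings and never written back to the global configuration. The pure-Ruby sketch below illustrates that behaviour with a toy whitespace tokenizer. OverrideSketch and its methods are hypothetical, and the real merge happens in build_merged_config, which this page does not show.

```ruby
# Hypothetical stand-in; a toy whitespace tokenizer replaces the Rust call.
module OverrideSketch
  DEFAULTS = { strategy: :unicode, lowercase: true }.freeze

  # Merge per-call options over the defaults without mutating them.
  def self.merged(opts)
    unknown = opts.keys - DEFAULTS.keys
    raise ArgumentError, "unknown options: #{unknown.inspect}" unless unknown.empty?

    DEFAULTS.merge(opts).freeze
  end

  def self.tokenize(text, **opts)
    settings = opts.any? ? merged(opts) : DEFAULTS
    tokens = text.split
    settings[:lowercase] ? tokens.map(&:downcase) : tokens
  end
end

p OverrideSketch.tokenize("Hello World")                   # => ["hello", "world"]
p OverrideSketch.tokenize("Hello World", lowercase: false) # => ["Hello", "World"]
p OverrideSketch.tokenize("Hello World")                   # => ["hello", "world"] (override did not persist)
```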