Module: TokenKit
- Extended by: TokenKit
- Included in: TokenKit
- Defined in:
  lib/tokenkit.rb,
  lib/tokenkit/config.rb,
  lib/tokenkit/version.rb,
  lib/tokenkit/config_compat.rb,
  lib/tokenkit/configuration.rb,
  lib/tokenkit/config_builder.rb,
  lib/tokenkit/regex_converter.rb
Overview
TokenKit provides fast, Rust-backed tokenization for Ruby with pattern preservation.
Defined Under Namespace
Modules: RegexConverter
Classes: Config, ConfigBuilder, Configuration, Error, Tokenizer
Constant Summary
- VERSION = "0.1.0.pre.2"
Instance Method Summary
- #config ⇒ Config (deprecated)
  Deprecated. Use #config_hash for read-only access or #configure to modify.
- #config_hash ⇒ Configuration
  Returns the current global configuration as an immutable object.
- #configure {|Config| ... } ⇒ Configuration
  Configures the global tokenizer settings.
- #reset ⇒ void
  Resets the tokenizer to default configuration.
- #tokenize(text, **opts) ⇒ Array<String>
  Tokenizes text using the global configuration or with temporary overrides.
Instance Method Details
#config ⇒ Config
Deprecated. Use #config_hash for read-only access or #configure to modify.
Returns the global configuration object for backward compatibility.
    # File 'lib/tokenkit.rb', line 157
    def config
      Config.instance
    end
#config_hash ⇒ Configuration
Returns the current global configuration as an immutable object.
    # File 'lib/tokenkit.rb', line 176
    def config_hash
      @config_mutex.synchronize do
        @current_config ||= ConfigBuilder.new.build
      end
    end
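The #config_hash body combines a mutex with `||=` so that the configuration is built at most once, even when several threads ask for it at the same time. The following is a minimal sketch of that pattern in plain Ruby. `DemoSettings`, its `DEFAULTS` hash, and the `builds` counter are hypothetical names used only for illustration and are not part of TokenKit.

```ruby
# Hypothetical stand-in, not TokenKit's code: a lazily built, frozen
# configuration shared across threads.
module DemoSettings
  DEFAULTS = { strategy: :unicode, lowercase: true }.freeze

  @mutex = Mutex.new
  @current = nil
  @builds = 0

  class << self
    # Number of times the configuration has been built (for demonstration).
    attr_reader :builds

    # Returns the shared configuration, building it on first use only.
    def current
      @mutex.synchronize do
        @current ||= begin
          @builds += 1
          DEFAULTS.dup.freeze
        end
      end
    end
  end
end

# Eight threads request the configuration concurrently.
threads = 8.times.map { Thread.new { DemoSettings.current } }
configs = threads.map(&:value)
```

Every thread receives the same frozen object, which is why callers can treat the result as read-only.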
#configure {|Config| ... } ⇒ Configuration
Configures the global tokenizer settings.
    # File 'lib/tokenkit.rb', line 213
    def configure
      # Use the compatibility wrapper to support old API
      yield Config.instance if block_given?

      # Get the builder from the compatibility wrapper
      builder = Config.instance.build_config

      begin
        # Build and validate the new configuration
        new_config = builder.build

        # Apply to Rust tokenizer
        _configure(new_config.to_rust_config)

        # Store the new configuration
        @config_mutex.synchronize do
          @current_config = new_config
        end

        # Reset the compatibility wrapper
        Config.instance.reset_temp

        new_config
      rescue => e
        # Reset the compatibility wrapper on error
        Config.instance.reset_temp
        raise e
      end
    end
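The #configure source follows a build, validate, then publish flow: the stored configuration is replaced only after the new one has been built successfully, and an error leaves the previous configuration in place. A simplified sketch of that flow, under the assumption of a hash-based configuration, might look as follows. `DemoConfigurable` and its `validate!` helper are hypothetical and do not reflect TokenKit's actual builder or validation rules.

```ruby
# Hypothetical stand-in for the build / validate / publish flow; names are
# illustrative and not part of TokenKit.
module DemoConfigurable
  DEFAULTS = { min_gram: 2, max_gram: 10 }.freeze

  @mutex = Mutex.new
  @current = DEFAULTS

  class << self
    # Returns the currently published (frozen) configuration.
    def current
      @mutex.synchronize { @current }
    end

    # Yields a mutable draft, validates it, then publishes a frozen copy.
    # If validation raises, the published configuration is left untouched.
    def configure
      draft = current.dup
      yield draft if block_given?
      validate!(draft)
      published = draft.freeze
      @mutex.synchronize { @current = published }
      published
    end

    private

    # Assumed rule for the demo: min_gram may not exceed max_gram.
    def validate!(draft)
      return if draft[:min_gram] <= draft[:max_gram]

      raise ArgumentError, "min_gram must not exceed max_gram"
    end
  end
end

# A valid change is published.
DemoConfigurable.configure { |c| c[:max_gram] = 5 }

# An invalid change raises and does not alter the published configuration.
begin
  DemoConfigurable.configure { |c| c[:min_gram] = 9 }
rescue ArgumentError => e
  rejected = e.message
end
```

Because the draft is validated before the swap, a failed call cannot leave a half-applied configuration visible to other callers.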
#reset ⇒ void
This method returns an undefined value.
Resets the tokenizer to default configuration.
    # File 'lib/tokenkit.rb', line 255
    def reset
      # Create default configuration
      new_config = ConfigBuilder.new.build

      # Reset Rust tokenizer
      _reset
      _configure(new_config.to_rust_config)

      # Store the new configuration
      @config_mutex.synchronize do
        @current_config = new_config
      end

      # Reset the compatibility wrapper
      Config.instance.reset_temp

      # Reset Config singleton instance variables for backward compatibility
      Config.instance.instance_variable_set(:@strategy, :unicode)
      Config.instance.instance_variable_set(:@lowercase, true)
      Config.instance.instance_variable_set(:@remove_punctuation, false)
      Config.instance.instance_variable_set(:@preserve_patterns, [])
      Config.instance.instance_variable_set(:@grapheme_extended, true)
      Config.instance.instance_variable_set(:@min_gram, 2)
      Config.instance.instance_variable_set(:@max_gram, 10)
      Config.instance.instance_variable_set(:@delimiter, "/")
      Config.instance.instance_variable_set(:@split_on_chars, " \t\n\r")
    end
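The #reset source restores nine settings on the compatibility singleton one at a time. As an illustration of what those defaults are and how such a reset behaves, here is a small sketch that drives the same assignments from a single hash. `DemoConfig` is a hypothetical class written for this example; only the default values are taken from the #reset source, and the accessors are an assumption.

```ruby
# Hypothetical stand-in class; the default values are copied from the
# #reset source, everything else is illustrative.
class DemoConfig
  DEFAULTS = {
    strategy: :unicode,
    lowercase: true,
    remove_punctuation: false,
    preserve_patterns: [],
    grapheme_extended: true,
    min_gram: 2,
    max_gram: 10,
    delimiter: "/",
    split_on_chars: " \t\n\r"
  }.freeze

  attr_accessor(*DEFAULTS.keys)

  def initialize
    reset
  end

  # Restores every setting from DEFAULTS. Each value is duplicated so that
  # mutable defaults (the array and the strings) are never shared.
  def reset
    DEFAULTS.each do |name, value|
      instance_variable_set(:"@#{name}", value.dup)
    end
    self
  end
end

config = DemoConfig.new
config.lowercase = false
config.preserve_patterns << /\d+/
config.reset
```

After the reset, every setting is back at its default and the shared `DEFAULTS` hash has not been modified by the earlier changes.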
#tokenize(text, **opts) ⇒ Array<String>
Tokenizes text using the global configuration or with temporary overrides.
    # File 'lib/tokenkit.rb', line 138
    def tokenize(text, **opts)
      if opts.any?
        # Create a fresh tokenizer with merged config
        merged_config = build_merged_config(opts)
        _tokenize_with_config(text, merged_config)
      else
        # Use default config (creates fresh tokenizer internally)
        _tokenize(text)
      end
    end
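The #tokenize source shows that per-call options are merged into a separate configuration for that call only, so the global configuration is unaffected. The sketch below imitates that behavior in pure Ruby. `DemoTokenizer` and its `GLOBAL` hash are hypothetical, and the splitting logic is a deliberately simple stand-in for TokenKit's Rust-backed tokenizer; only the option names `lowercase` and `split_on_chars` come from the source.

```ruby
# Hypothetical pure-Ruby stand-in; TokenKit's real tokenizer is Rust-backed.
module DemoTokenizer
  GLOBAL = { lowercase: true, split_on_chars: " \t\n\r" }.freeze

  # Per-call opts are merged over GLOBAL into a new hash, so GLOBAL itself
  # is never modified.
  def self.tokenize(text, **opts)
    config = opts.any? ? GLOBAL.merge(opts) : GLOBAL
    separators = Regexp.union(config[:split_on_chars].chars)
    tokens = text.split(separators).reject(&:empty?)
    config[:lowercase] ? tokens.map(&:downcase) : tokens
  end
end

default_tokens  = DemoTokenizer.tokenize("Hello World")
override_tokens = DemoTokenizer.tokenize("Hello World", lowercase: false)
after_override  = DemoTokenizer.tokenize("Hello World")
```

The third call behaves like the first, which shows that the override applied to a single call only.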