Class: TokenKit::Configuration

Inherits:
Object
Defined in:
lib/tokenkit/configuration.rb,
lib/tokenkit/config_builder.rb

Overview

Immutable configuration object


Constructor Details

#initialize(config_hash, builder = nil) ⇒ Configuration

This method is part of a private API. You should avoid using this method if possible, as it may be removed or changed in the future.

Creates a new configuration from a hash.

Parameters:

  • config_hash (Hash)

    Configuration values from Rust



# File 'lib/tokenkit/configuration.rb', line 35

def initialize(config_hash)
  @strategy = config_hash["strategy"]&.to_sym || :unicode
  @lowercase = config_hash.fetch("lowercase", true)
  @remove_punctuation = config_hash.fetch("remove_punctuation", false)
  @preserve_patterns = config_hash.fetch("preserve_patterns", []).freeze
  @raw_hash = config_hash
end
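
The defaults applied by the constructor can be seen with a small stand-in. The class below is not the library's class: `SketchConfiguration` is a hypothetical name, and it only copies the assignments from the listing above so the snippet runs without the gem. The input hashes are made-up examples.

```ruby
# Stand-in mirroring the constructor listing above (illustration only,
# not the real TokenKit::Configuration).
class SketchConfiguration
  attr_reader :strategy, :lowercase, :remove_punctuation, :preserve_patterns

  def initialize(config_hash)
    @strategy = config_hash["strategy"]&.to_sym || :unicode
    @lowercase = config_hash.fetch("lowercase", true)
    @remove_punctuation = config_hash.fetch("remove_punctuation", false)
    @preserve_patterns = config_hash.fetch("preserve_patterns", []).freeze
    @raw_hash = config_hash
  end
end

defaults = SketchConfiguration.new({})
custom = SketchConfiguration.new({ "strategy" => "whitespace", "lowercase" => false })
symbol_keys = SketchConfiguration.new({ strategy: "whitespace" })

p defaults.strategy     # :unicode (no "strategy" key, so the default applies)
p defaults.lowercase    # true
p custom.strategy       # :whitespace
p custom.lowercase      # false
p symbol_keys.strategy  # :unicode (only string keys are read)
```

Since the lookups use string keys, a hash with symbol keys silently falls back to the defaults in this sketch.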

Instance Attribute Details

#delimiter ⇒ String? (readonly)

Returns the delimiter for the path hierarchy strategy.

Returns:

  • (String, nil)

    Delimiter for path hierarchy strategy



# File 'lib/tokenkit/configuration.rb', line 84

def delimiter
  @raw_hash["delimiter"]
end

#grapheme_extended ⇒ Object (readonly)

Returns the value of attribute grapheme_extended.



# File 'lib/tokenkit/config_builder.rb', line 120

def grapheme_extended
  @grapheme_extended
end

#lowercase ⇒ Boolean (readonly)

Returns whether to lowercase tokens.

Returns:

  • (Boolean)

    Whether to lowercase tokens



# File 'lib/tokenkit/configuration.rb', line 22

def lowercase
  @lowercase
end

#max_gram ⇒ Integer? (readonly)

Returns the maximum n-gram size for n-gram strategies.

Returns:

  • (Integer, nil)

    Maximum n-gram size for n-gram strategies



# File 'lib/tokenkit/configuration.rb', line 74

def max_gram
  @raw_hash["max_gram"]
end

#min_gram ⇒ Integer? (readonly)

Returns the minimum n-gram size for n-gram strategies.

Returns:

  • (Integer, nil)

    Minimum n-gram size for n-gram strategies



# File 'lib/tokenkit/configuration.rb', line 69

def min_gram
  @raw_hash["min_gram"]
end

#preserve_patterns ⇒ Array<Regexp> (readonly)

Returns the patterns to preserve from modification.

Returns:

  • (Array<Regexp>)

    Patterns to preserve from modification



# File 'lib/tokenkit/configuration.rb', line 28

def preserve_patterns
  @preserve_patterns
end
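
The constructor listing freezes this array, so callers cannot append to it afterwards. A minimal sketch of that behaviour follows. `SketchConfiguration` is a hypothetical stand-in that copies only the relevant assignment, and the patterns are invented for the example.

```ruby
# Stand-in with just the preserve_patterns assignment from the constructor
# listing (illustration only, not the real class).
class SketchConfiguration
  attr_reader :preserve_patterns

  def initialize(config_hash)
    @preserve_patterns = config_hash.fetch("preserve_patterns", []).freeze
  end
end

patterns = [/\d+/]
config = SketchConfiguration.new({ "preserve_patterns" => patterns })

begin
  config.preserve_patterns << /[A-Z]+/
rescue FrozenError => e
  puts "rejected: #{e.class}"
end

# Hash#fetch returns the caller's own array, so freezing it affects the
# caller's reference as well.
puts patterns.frozen?
```

In this sketch the freeze is shallow and applies to the very array object that was passed in. A caller who needs to keep a mutable list would pass a copy (for example `patterns.dup`).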

#regex ⇒ String? (readonly)

Returns the regex pattern for the pattern strategy.

Returns:

  • (String, nil)

    The regex pattern for pattern strategy



# File 'lib/tokenkit/configuration.rb', line 49

def regex
  @raw_hash["regex"]
end

#remove_punctuation ⇒ Boolean (readonly)

Returns whether to remove punctuation.

Returns:

  • (Boolean)

    Whether to remove punctuation



# File 'lib/tokenkit/configuration.rb', line 25

def remove_punctuation
  @remove_punctuation
end

#split_on_chars ⇒ String? (readonly)

Returns the characters to split on for the char_group strategy.

Returns:

  • (String, nil)

    Characters to split on for char_group strategy



# File 'lib/tokenkit/configuration.rb', line 99

def split_on_chars
  @raw_hash["split_on_chars"]
end

#strategy ⇒ Symbol (readonly)

Returns the tokenization strategy.

Returns:

  • (Symbol)

    The tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 19

def strategy
  @strategy
end

Instance Method Details

#==(other) ⇒ Object

Checks equality with another configuration.



# File 'lib/tokenkit/config_builder.rb', line 205

def ==(other)
  other.is_a?(Configuration) && to_h == other.to_h
end
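
Because the comparison goes through `to_h`, equality is decided by the underlying hashes. The sketch below assumes `to_h` returns a copy of the raw hash, as the `#to_h` listing on this page shows. `SketchConfiguration` is a hypothetical stand-in and the hashes are made-up inputs.

```ruby
# Stand-in combining the #== and #to_h listings from this page
# (illustration only, not the real class).
class SketchConfiguration
  def initialize(config_hash)
    @raw_hash = config_hash
  end

  def to_h
    @raw_hash.dup
  end

  def ==(other)
    other.is_a?(SketchConfiguration) && to_h == other.to_h
  end
end

a = SketchConfiguration.new({ "strategy" => "unicode", "lowercase" => true })
b = SketchConfiguration.new({ "lowercase" => true, "strategy" => "unicode" })
c = SketchConfiguration.new({ "strategy" => "unicode" })

puts a == b       # true: Hash equality ignores key order
puts a == c       # false: c's hash has no "lowercase" key
puts a == a.to_h  # false: a plain Hash is not a configuration
```

In this sketch, two configurations that behave identically after defaults are applied can still compare as unequal when one hash spells a default out and the other omits it.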

#char_group? ⇒ Boolean

Returns true if using character group tokenization strategy.

Returns:

  • (Boolean)

    true if using character group tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 94

def char_group?
  strategy == :char_group
end

#edge_ngram? ⇒ Boolean

Returns true if using edge n-gram tokenization strategy.

Returns:

  • (Boolean)

    true if using edge n-gram tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 64

def edge_ngram?
  strategy == :edge_ngram
end

#extended ⇒ Boolean?

Returns whether to use extended grapheme clusters.

Returns:

  • (Boolean, nil)

    Whether to use extended grapheme clusters



# File 'lib/tokenkit/configuration.rb', line 59

def extended
  @raw_hash["extended"]
end

#grapheme? ⇒ Boolean

Returns true if using grapheme tokenization strategy.

Returns:

  • (Boolean)

    true if using grapheme tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 54

def grapheme?
  strategy == :grapheme
end

#inspect ⇒ String

Returns a string representation of the configuration.

Returns:

  • (String)

    Human-readable configuration summary



# File 'lib/tokenkit/configuration.rb', line 154

def inspect
  "#<TokenKit::Configuration strategy=#{strategy} lowercase=#{lowercase} remove_punctuation=#{remove_punctuation}>"
end
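
To show what the summary looks like, the sketch below reuses the `inspect` body from the listing in a stand-in class. `SketchConfiguration` is a hypothetical name and the input hash is invented.

```ruby
# Stand-in with the three fields that #inspect reports (illustration only).
class SketchConfiguration
  attr_reader :strategy, :lowercase, :remove_punctuation

  def initialize(config_hash)
    @strategy = config_hash["strategy"]&.to_sym || :unicode
    @lowercase = config_hash.fetch("lowercase", true)
    @remove_punctuation = config_hash.fetch("remove_punctuation", false)
  end

  def inspect
    "#<TokenKit::Configuration strategy=#{strategy} lowercase=#{lowercase} remove_punctuation=#{remove_punctuation}>"
  end
end

config = SketchConfiguration.new({ "strategy" => "ngram", "min_gram" => 2 })
summary = config.inspect
puts summary
# prints: #<TokenKit::Configuration strategy=ngram lowercase=true remove_punctuation=false>
```

The summary covers only the three common settings. Strategy-specific values such as `min_gram` do not appear in it, so `to_h` is the better choice when the full configuration is needed.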

#keyword? ⇒ Boolean

Returns true if using keyword tokenization strategy.

Returns:

  • (Boolean)

    true if using keyword tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 129

def keyword?
  strategy == :keyword
end

#letter? ⇒ Boolean

Returns true if using letter tokenization strategy.

Returns:

  • (Boolean)

    true if using letter tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 104

def letter?
  strategy == :letter
end

#lowercase? ⇒ Boolean

Returns true if using lowercase tokenization strategy.

Returns:

  • (Boolean)

    true if using lowercase tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 109

def lowercase?
  strategy == :lowercase
end
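
The predicate `lowercase?` is easy to confuse with the `lowercase` attribute: the first asks whether the strategy is `:lowercase`, the second whether tokens are lowercased. A short sketch of the difference, using a hypothetical stand-in class built from the listings on this page:

```ruby
# Stand-in with only the attribute and the predicate (illustration only).
class SketchConfiguration
  attr_reader :strategy, :lowercase

  def initialize(config_hash)
    @strategy = config_hash["strategy"]&.to_sym || :unicode
    @lowercase = config_hash.fetch("lowercase", true)
  end

  def lowercase?
    strategy == :lowercase
  end
end

config = SketchConfiguration.new({ "strategy" => "whitespace", "lowercase" => true })

puts config.lowercase   # true: the lowercase option is on
puts config.lowercase?  # false: the strategy is :whitespace, not :lowercase
```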

#ngram? ⇒ Boolean

Returns true if using n-gram tokenization strategy.

Returns:

  • (Boolean)

    true if using n-gram tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 89

def ngram?
  strategy == :ngram
end

#path_hierarchy? ⇒ Boolean

Returns true if using path hierarchy tokenization strategy.

Returns:

  • (Boolean)

    true if using path hierarchy tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 79

def path_hierarchy?
  strategy == :path_hierarchy
end

#pattern? ⇒ Boolean

Returns true if using pattern tokenization strategy.

Returns:

  • (Boolean)

    true if using pattern tokenization strategy


# File 'lib/tokenkit/configuration.rb', line 44

def pattern?
  strategy == :pattern
end

#sentence? ⇒ Boolean

Returns true if using sentence tokenization strategy.

Returns:

  • (Boolean)

    true if using sentence tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 124

def sentence?
  strategy == :sentence
end

#to_builder ⇒ Object

Creates a new builder initialized with this configuration.



# File 'lib/tokenkit/configuration.rb', line 176

def to_builder
  builder = ConfigBuilder.new
  builder.strategy = strategy
  builder.lowercase = lowercase
  builder.remove_punctuation = remove_punctuation
  builder.preserve_patterns = preserve_patterns.dup

  # Copy strategy-specific settings
  builder.regex = regex if pattern?
  builder.extended = extended if grapheme?
  builder.min_gram = min_gram if edge_ngram? || ngram?
  builder.max_gram = max_gram if edge_ngram? || ngram?
  builder.delimiter = delimiter if path_hierarchy?
  builder.split_on_chars = split_on_chars if char_group?

  builder
end
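
The guards in `to_builder` mean a strategy-specific setting is copied only when its strategy is active. The sketch below illustrates that with a trimmed copy of the method. `SketchBuilder` is a hypothetical Struct standing in for `ConfigBuilder`, whose real interface is not shown on this page, and `SketchConfiguration` keeps only the fields the example needs.

```ruby
# Hypothetical stand-in for ConfigBuilder: a Struct with the setters used below.
SketchBuilder = Struct.new(:strategy, :lowercase, :regex, :min_gram, :max_gram)

# Trimmed stand-in for the configuration (illustration only).
class SketchConfiguration
  attr_reader :strategy, :lowercase

  def initialize(config_hash)
    @strategy = config_hash["strategy"]&.to_sym || :unicode
    @lowercase = config_hash.fetch("lowercase", true)
    @raw_hash = config_hash
  end

  def regex
    @raw_hash["regex"]
  end

  def min_gram
    @raw_hash["min_gram"]
  end

  def max_gram
    @raw_hash["max_gram"]
  end

  def pattern?
    strategy == :pattern
  end

  def ngram?
    strategy == :ngram
  end

  def edge_ngram?
    strategy == :edge_ngram
  end

  # Same guard structure as the listing above, limited to three settings.
  def to_builder
    builder = SketchBuilder.new
    builder.strategy = strategy
    builder.lowercase = lowercase
    builder.regex = regex if pattern?
    builder.min_gram = min_gram if edge_ngram? || ngram?
    builder.max_gram = max_gram if edge_ngram? || ngram?
    builder
  end
end

config = SketchConfiguration.new(
  { "strategy" => "ngram", "min_gram" => 2, "max_gram" => 4, "regex" => "\\w+" }
)
builder = config.to_builder

p builder.min_gram  # 2
p builder.max_gram  # 4
p builder.regex     # nil: the strategy is :ngram, so the regex is not copied
```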

#to_h ⇒ Hash

Converts configuration to a hash.

Examples:

config.to_h
# => {"strategy" => "unicode", "lowercase" => true, ...}

Returns:

  • (Hash)

    Configuration as a hash



# File 'lib/tokenkit/configuration.rb', line 146

def to_h
  @raw_hash.dup
end
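
Since `to_h` returns a `dup` of the stored hash, changes to the returned hash's top-level keys do not reach the configuration. A minimal sketch, with a hypothetical stand-in class and made-up values:

```ruby
# Stand-in with only the raw hash and #to_h (illustration only).
class SketchConfiguration
  def initialize(config_hash)
    @raw_hash = config_hash
  end

  def to_h
    @raw_hash.dup
  end
end

config = SketchConfiguration.new({ "strategy" => "ngram", "min_gram" => 2, "max_gram" => 3 })

copy = config.to_h
copy["min_gram"] = 5          # changes the copy only
copy.delete("max_gram")

p config.to_h["min_gram"]     # 2
p config.to_h.key?("max_gram") # true
```

`dup` is a shallow copy, so nested objects such as an array stored under a key are still shared between the copy and the configuration.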

#to_rust_config ⇒ Hash

This method is part of a private API. You should avoid using this method if possible, as it may be removed or changed in the future.

Converts configuration to format expected by Rust.

Returns:

  • (Hash)

    Configuration hash for Rust FFI



# File 'lib/tokenkit/configuration.rb', line 163

def to_rust_config
  @raw_hash
end

#unicode? ⇒ Boolean

Returns true if using unicode tokenization strategy.

Returns:

  • (Boolean)

    true if using unicode tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 114

def unicode?
  strategy == :unicode
end

#url_email? ⇒ Boolean

Returns true if using url_email tokenization strategy.

Returns:

  • (Boolean)

    true if using url_email tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 134

def url_email?
  strategy == :url_email
end

#whitespace? ⇒ Boolean

Returns true if using whitespace tokenization strategy.

Returns:

  • (Boolean)

    true if using whitespace tokenization strategy



# File 'lib/tokenkit/configuration.rb', line 119

def whitespace?
  strategy == :whitespace
end