Class: TokenKit::Configuration
- Inherits: Object (hierarchy: Object, TokenKit::Configuration)
- Defined in: lib/tokenkit/configuration.rb, lib/tokenkit/config_builder.rb
Overview
Immutable configuration object
Instance Attribute Summary
- #delimiter ⇒ String? (readonly): Delimiter for path hierarchy strategy.
- #grapheme_extended ⇒ Object (readonly): Returns the value of attribute grapheme_extended.
- #lowercase ⇒ Boolean (readonly): Whether to lowercase tokens.
- #max_gram ⇒ Integer? (readonly): Maximum n-gram size for n-gram strategies.
- #min_gram ⇒ Integer? (readonly): Minimum n-gram size for n-gram strategies.
- #preserve_patterns ⇒ Array<Regexp> (readonly): Patterns to preserve from modification.
- #regex ⇒ String? (readonly): The regex pattern for pattern strategy.
- #remove_punctuation ⇒ Boolean (readonly): Whether to remove punctuation.
- #split_on_chars ⇒ String? (readonly): Characters to split on for char_group strategy.
- #strategy ⇒ Symbol (readonly): The tokenization strategy.
Instance Method Summary
- #==(other) ⇒ Object: Check equality with another configuration.
- #char_group? ⇒ Boolean: True if using character group tokenization strategy.
- #edge_ngram? ⇒ Boolean: True if using edge n-gram tokenization strategy.
- #extended ⇒ Boolean?: Whether to use extended grapheme clusters.
- #grapheme? ⇒ Boolean: True if using grapheme tokenization strategy.
- #initialize(config_hash, builder = nil) ⇒ Configuration (constructor, private): Creates a new configuration from a hash.
- #inspect ⇒ String: Returns a string representation of the configuration.
- #keyword? ⇒ Boolean: True if using keyword tokenization strategy.
- #letter? ⇒ Boolean: True if using letter tokenization strategy.
- #lowercase? ⇒ Boolean: True if using lowercase tokenization strategy.
- #ngram? ⇒ Boolean: True if using n-gram tokenization strategy.
- #path_hierarchy? ⇒ Boolean: True if using path hierarchy tokenization strategy.
- #pattern? ⇒ Boolean: True if using pattern tokenization strategy.
- #sentence? ⇒ Boolean: True if using sentence tokenization strategy.
- #to_builder ⇒ Object: Create a new builder initialized with this configuration.
- #to_h ⇒ Hash: Converts configuration to a hash.
- #to_rust_config ⇒ Hash (private): Converts configuration to format expected by Rust.
- #unicode? ⇒ Boolean: True if using unicode tokenization strategy.
- #url_email? ⇒ Boolean: True if using url_email tokenization strategy.
- #whitespace? ⇒ Boolean: True if using whitespace tokenization strategy.
Constructor Details
#initialize(config_hash, builder = nil) ⇒ Configuration
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Creates a new configuration from a hash.
# File 'lib/tokenkit/configuration.rb', line 35
def initialize(config_hash)
  @strategy = config_hash["strategy"]&.to_sym || :unicode
  @lowercase = config_hash.fetch("lowercase", true)
  @remove_punctuation = config_hash.fetch("remove_punctuation", false)
  @preserve_patterns = config_hash.fetch("preserve_patterns", []).freeze
  @raw_hash = config_hash
end
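The defaults applied by the constructor can be illustrated with a small stand-in. The class name ConstructorSketch is hypothetical and only mirrors the listing above; it is not the real TokenKit::Configuration, whose constructor is private API.

```ruby
# Hypothetical stand-in that copies the constructor logic from the listing.
class ConstructorSketch
  attr_reader :strategy, :lowercase, :remove_punctuation, :preserve_patterns

  def initialize(config_hash)
    # A missing "strategy" key falls back to :unicode.
    @strategy = config_hash["strategy"]&.to_sym || :unicode
    @lowercase = config_hash.fetch("lowercase", true)
    @remove_punctuation = config_hash.fetch("remove_punctuation", false)
    # The pattern list is frozen, whether defaulted or supplied.
    @preserve_patterns = config_hash.fetch("preserve_patterns", []).freeze
    @raw_hash = config_hash
  end
end

defaults = ConstructorSketch.new({})
custom = ConstructorSketch.new({ "strategy" => "ngram", "lowercase" => false })
# Keys are looked up as strings, so a symbol key is not seen.
symbol_keys = ConstructorSketch.new({ strategy: "ngram" })
```

In this sketch, defaults.strategy and symbol_keys.strategy are both :unicode, while custom.strategy is :ngram. The lookups use string keys only, so a hash with symbol keys silently falls back to the defaults.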
Instance Attribute Details
#delimiter ⇒ String? (readonly)
Returns the delimiter for the path hierarchy strategy.
# File 'lib/tokenkit/configuration.rb', line 84
def delimiter
  @raw_hash["delimiter"]
end
#grapheme_extended ⇒ Object (readonly)
Returns the value of attribute grapheme_extended.
# File 'lib/tokenkit/config_builder.rb', line 120
def grapheme_extended
  @grapheme_extended
end
#lowercase ⇒ Boolean (readonly)
Returns whether to lowercase tokens.
# File 'lib/tokenkit/configuration.rb', line 22
def lowercase
  @lowercase
end
#max_gram ⇒ Integer? (readonly)
Returns the maximum n-gram size for n-gram strategies.
# File 'lib/tokenkit/configuration.rb', line 74
def max_gram
  @raw_hash["max_gram"]
end
#min_gram ⇒ Integer? (readonly)
Returns the minimum n-gram size for n-gram strategies.
# File 'lib/tokenkit/configuration.rb', line 69
def min_gram
  @raw_hash["min_gram"]
end
#preserve_patterns ⇒ Array<Regexp> (readonly)
Returns the patterns to preserve from modification.
# File 'lib/tokenkit/configuration.rb', line 28
def preserve_patterns
  @preserve_patterns
end
#regex ⇒ String? (readonly)
Returns the regex pattern for the pattern strategy.
# File 'lib/tokenkit/configuration.rb', line 49
def regex
  @raw_hash["regex"]
end
#remove_punctuation ⇒ Boolean (readonly)
Returns whether to remove punctuation.
# File 'lib/tokenkit/configuration.rb', line 25
def remove_punctuation
  @remove_punctuation
end
#split_on_chars ⇒ String? (readonly)
Returns the characters to split on for the char_group strategy.
# File 'lib/tokenkit/configuration.rb', line 99
def split_on_chars
  @raw_hash["split_on_chars"]
end
#strategy ⇒ Symbol (readonly)
Returns the tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 19
def strategy
  @strategy
end
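Several of the readers above (delimiter, max_gram, min_gram, regex, split_on_chars) read a string key straight from the raw hash, which is why their return types are nilable. The following sketch uses a hypothetical AccessorSketch class that copies only those lookups; it is an illustration, not the library class.

```ruby
# Hypothetical stand-in for the hash-backed readers in the listings above.
class AccessorSketch
  def initialize(config_hash)
    @raw_hash = config_hash
  end

  def min_gram
    @raw_hash["min_gram"]
  end

  def max_gram
    @raw_hash["max_gram"]
  end

  def delimiter
    @raw_hash["delimiter"]
  end
end

ngram = AccessorSketch.new({ "strategy" => "ngram", "min_gram" => 2, "max_gram" => 4 })
path = AccessorSketch.new({ "strategy" => "path_hierarchy" })

ngram.min_gram   # value supplied in the hash
ngram.delimiter  # nil: key not present for this configuration
path.delimiter   # nil: no default is applied by the reader itself
```

As written in the listings, the readers do not check the strategy and do not supply defaults, so callers of this sketch have to handle nil themselves.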
Instance Method Details
#==(other) ⇒ Object
Checks equality with another configuration. Two configurations are equal when other is a Configuration and both #to_h hashes match.
# File 'lib/tokenkit/config_builder.rb', line 205
def ==(other)
  other.is_a?(Configuration) && to_h == other.to_h
end
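A minimal sketch of this comparison follows. EqualitySketch is a hypothetical stand-in, and it assumes #to_h returns a copy of the raw hash as in the configuration.rb listing.

```ruby
# Hypothetical stand-in that pairs the == and to_h listings.
class EqualitySketch
  def initialize(config_hash)
    @raw_hash = config_hash
  end

  def to_h
    @raw_hash.dup
  end

  def ==(other)
    other.is_a?(EqualitySketch) && to_h == other.to_h
  end
end

a = EqualitySketch.new({ "strategy" => "ngram", "min_gram" => 2 })
b = EqualitySketch.new({ "min_gram" => 2, "strategy" => "ngram" })
c = EqualitySketch.new({ "strategy" => "ngram" })

a == b                     # true: Hash equality ignores key order
a == c                     # false: the hashes differ
a == { "strategy" => "x" } # false: the other object is not a configuration
```

Because the comparison is made on the hashes, two sketch objects built from {} and from { "strategy" => "unicode" } would compare as unequal under these assumptions, even though both resolve to the same strategy.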
#char_group? ⇒ Boolean
Returns true if using character group tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 94
def char_group?
  strategy == :char_group
end
#edge_ngram? ⇒ Boolean
Returns true if using edge n-gram tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 64
def edge_ngram?
  strategy == :edge_ngram
end
#extended ⇒ Boolean?
Returns whether to use extended grapheme clusters.
# File 'lib/tokenkit/configuration.rb', line 59
def extended
  @raw_hash["extended"]
end
#grapheme? ⇒ Boolean
Returns true if using grapheme tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 54
def grapheme?
  strategy == :grapheme
end
#inspect ⇒ String
Returns a string representation of the configuration.
# File 'lib/tokenkit/configuration.rb', line 154
def inspect
  "#<TokenKit::Configuration strategy=#{strategy} lowercase=#{lowercase} remove_punctuation=#{remove_punctuation}>"
end
#keyword? ⇒ Boolean
Returns true if using keyword tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 129
def keyword?
  strategy == :keyword
end
#letter? ⇒ Boolean
Returns true if using letter tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 104
def letter?
  strategy == :letter
end
#lowercase? ⇒ Boolean
Returns true if using lowercase tokenization strategy. This predicate tests the strategy; whether tokens are lowercased is reported by the #lowercase attribute.
# File 'lib/tokenkit/configuration.rb', line 109
def lowercase?
  strategy == :lowercase
end
#ngram? ⇒ Boolean
Returns true if using n-gram tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 89
def ngram?
  strategy == :ngram
end
#path_hierarchy? ⇒ Boolean
Returns true if using path hierarchy tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 79
def path_hierarchy?
  strategy == :path_hierarchy
end
#pattern? ⇒ Boolean
Returns true if using pattern tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 44
def pattern?
  strategy == :pattern
end
#sentence? ⇒ Boolean
Returns true if using sentence tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 124
def sentence?
  strategy == :sentence
end
#to_builder ⇒ Object
Create a new builder initialized with this configuration
# File 'lib/tokenkit/configuration.rb', line 176
def to_builder
  builder = ConfigBuilder.new
  builder.strategy = strategy
  builder.lowercase = lowercase
  builder.remove_punctuation = remove_punctuation
  builder.preserve_patterns = preserve_patterns.dup

  # Copy strategy-specific settings
  builder.regex = regex if pattern?
  builder.extended = extended if grapheme?
  builder.min_gram = min_gram if edge_ngram? || ngram?
  builder.max_gram = max_gram if edge_ngram? || ngram?
  builder.delimiter = delimiter if path_hierarchy?
  builder.split_on_chars = split_on_chars if char_group?

  builder
end
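The listing copies a strategy-specific setting only when the matching predicate is true. A reduced sketch of that rule is shown below; SketchBuilder and sketch_to_builder are hypothetical names standing in for ConfigBuilder and #to_builder, and only a few of the settings are modelled.

```ruby
# Hypothetical builder with a handful of the settings from the listing.
SketchBuilder = Struct.new(:strategy, :regex, :min_gram, :max_gram)

# Mirrors the conditional copying in the listing for a raw, string-keyed hash.
def sketch_to_builder(raw)
  strategy = raw["strategy"]&.to_sym || :unicode
  builder = SketchBuilder.new
  builder.strategy = strategy
  # Strategy-specific settings are copied only for the matching strategy.
  builder.regex = raw["regex"] if strategy == :pattern
  if %i[edge_ngram ngram].include?(strategy)
    builder.min_gram = raw["min_gram"]
    builder.max_gram = raw["max_gram"]
  end
  builder
end

builder = sketch_to_builder({
  "strategy" => "ngram",
  "min_gram" => 2,
  "max_gram" => 3,
  "regex" => "\\d+" # present in the hash, but not an ngram setting
})
```

Here builder.min_gram and builder.max_gram carry over, while builder.regex stays nil because the strategy is not :pattern. Under this reading, a round trip through the builder can drop keys that do not belong to the active strategy.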
#to_h ⇒ Hash
Converts configuration to a hash.
# File 'lib/tokenkit/configuration.rb', line 146
def to_h
  @raw_hash.dup
end
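Since #to_h returns a dup of the raw hash, changing the returned hash at the top level leaves the configuration untouched, but the copy is shallow. The sketch below uses a hypothetical ToHSketch class that combines the to_h listing with the preserve_patterns handling from the constructor.

```ruby
# Hypothetical stand-in combining the constructor's freeze with to_h.
class ToHSketch
  def initialize(config_hash)
    @preserve_patterns = config_hash.fetch("preserve_patterns", []).freeze
    @raw_hash = config_hash
  end

  def to_h
    @raw_hash.dup
  end
end

config = ToHSketch.new({ "strategy" => "pattern", "preserve_patterns" => [/\d+/] })

copy = config.to_h
copy["strategy"] = "ngram"          # only the copy changes
config.to_h["strategy"]             # still "pattern"
copy["preserve_patterns"].frozen?   # true: the nested array is shared and frozen
```

The nested array in the copy is the same object the constructor froze, so in this sketch it cannot be modified through the copy either.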
#to_rust_config ⇒ Hash
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Converts configuration to format expected by Rust.
# File 'lib/tokenkit/configuration.rb', line 163
def to_rust_config
  @raw_hash
end
#unicode? ⇒ Boolean
Returns true if using unicode tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 114
def unicode?
  strategy == :unicode
end
#url_email? ⇒ Boolean
Returns true if using url_email tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 134
def url_email?
  strategy == :url_email
end
#whitespace? ⇒ Boolean
Returns true if using whitespace tokenization strategy.
# File 'lib/tokenkit/configuration.rb', line 119
def whitespace?
  strategy == :whitespace
end