Class: Vocabulary
- Inherits: Object
- Includes: VocabularyProtocol
- Defined in: lib/kotoshu/embeddings/vocabulary.rb
Overview
Vocabulary - Word to index mapping
Provides efficient lookup from words to integer indices for embedding retrieval. Supports JSON file loading and saving.
Instance Attribute Summary
- #index_to_word ⇒ Array<String> (readonly): Index to word mapping (sparse array).
- #language_code ⇒ String (readonly): ISO 639-1 language code.
- #word_to_index ⇒ Hash{String => Integer} (readonly): Word to index mapping.
Class Method Summary
- .detect_language_from_path(path) ⇒ String: Detect language code from file path.
- .from_file(path, language_code: nil) ⇒ Vocabulary: Load vocabulary from JSON file.
- .from_words(words, language_code: 'en') ⇒ Vocabulary: Create vocabulary from Array of words.
Instance Method Summary
- #common_words(n: 10) ⇒ Array<String>: Get common/most frequent words.
- #empty? ⇒ Boolean: Check if vocabulary is empty.
- #get_word(index) ⇒ String?: Get word by index.
- #include?(word) ⇒ Boolean: Check if word exists in vocabulary.
- #initialize(language_code:, word_to_index:) ⇒ Vocabulary (constructor): Create a new vocabulary.
- #lookup(word) ⇒ Integer?: Look up word index.
- #sample(n: 10) ⇒ Array<String>: Get a sample of words.
- #save_to_file(path, format: :hash) ⇒ Object: Save vocabulary to JSON file.
- #size ⇒ Integer: Get vocabulary size.
- #sub_vocabulary(words) ⇒ Vocabulary: Create a sub-vocabulary containing only specified words.
- #to_h ⇒ Hash{String => Integer}: Convert to Hash.
- #to_s ⇒ String (also: #inspect): String representation.
- #valid_index?(index) ⇒ Boolean: Check if index is valid.
- #words ⇒ Enumerator<String>: Get all words as enumerator.
- #words_starting_with(prefix) ⇒ Array<String>: Find words starting with a prefix.
Methods included from Protocol
#assert_implemented_by!, #compliance_errors, #optional_methods, #required_methods
Constructor Details
#initialize(language_code:, word_to_index:) ⇒ Vocabulary
Create a new vocabulary. Raises ArgumentError if word_to_index is nil or empty. The mapping is duplicated and frozen, and a reverse (index to word) array is built from it.
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 39
def initialize(language_code:, word_to_index:)
  raise ArgumentError, 'word_to_index cannot be empty' if word_to_index.nil? || word_to_index.empty?

  @language_code = language_code
  @word_to_index = word_to_index.dup.freeze

  # Build reverse index (index -> word)
  @index_to_word = Array.new(@word_to_index.size)
  @word_to_index.each do |word, index|
    @index_to_word[index] = word if index < @index_to_word.size
  end
  @index_to_word.freeze
end
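The reverse-index step is easiest to follow on a small input. The following is a minimal sketch that repeats the loop from the listing on plain Ruby objects rather than on the class itself; the sample words and indices are made up for illustration. It suggests that an index at or beyond the number of words is left out of the reverse array, which is presumably why the attribute is described as sparse.

```ruby
# Stand-in for the constructor's reverse-index step (not the library class).
word_to_index = { 'cat' => 0, 'dog' => 1, 'emu' => 5 }.freeze

index_to_word = Array.new(word_to_index.size) # three slots, all nil
word_to_index.each do |word, index|
  # Same guard as in the listing: indices >= size are skipped.
  index_to_word[index] = word if index < index_to_word.size
end
index_to_word.freeze

p index_to_word # prints ["cat", "dog", nil]
```

Under this reading, 'emu' can still be looked up by word, but no index maps back to it.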
Instance Attribute Details
#index_to_word ⇒ Array<String> (readonly)
Returns Index to word mapping (sparse array).
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 30
def index_to_word
  @index_to_word
end
#language_code ⇒ String (readonly)
Returns ISO 639-1 language code.
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 24
def language_code
  @language_code
end
#word_to_index ⇒ Hash{String => Integer} (readonly)
Returns Word to index mapping.
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 27
def word_to_index
  @word_to_index
end
Class Method Details
.detect_language_from_path(path) ⇒ String
Detect language code from file path
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 244
def self.detect_language_from_path(path)
  basename = File.basename(path)

  if basename =~ /(\w+)\.vocab\.json\z/
    return $1
  end

  if basename =~ /\.(\w+)\.vocab\.json\z/
    return $1
  end

  'unknown'
end
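To see what the detection returns for typical file names, here is a small sketch. `language_from` is a hypothetical helper, not part of the library; it copies only the first pattern from the listing, because any basename the second pattern matches appears to be matched by the first one with the same capture. The file names are invented examples.

```ruby
# Hypothetical helper mirroring the first pattern of detect_language_from_path.
def language_from(path)
  basename = File.basename(path)
  basename =~ /(\w+)\.vocab\.json\z/ ? Regexp.last_match(1) : 'unknown'
end

p language_from('models/en.vocab.json')          # prints "en"
p language_from('models/fasttext.de.vocab.json') # prints "de"
p language_from('models/pt-BR.vocab.json')       # prints "BR" (\w stops at the hyphen)
p language_from('models/words.json')             # prints "unknown"
```

The captured token is whatever word characters precede `.vocab.json`, so it is not validated as an ISO 639-1 code. Passing `language_code:` explicitly to `from_file` avoids relying on the file name.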
.from_file(path, language_code: nil) ⇒ Vocabulary
Load vocabulary from a JSON file. The top-level JSON value may be a Hash of word to index, or an Array of words whose positions become the indices. Raises ArgumentError if the file does not exist or the top-level value is neither a Hash nor an Array. When language_code is nil, it is detected from the file name.
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 133
def self.from_file(path, language_code: nil)
  raise ArgumentError, "File not found: #{path}" unless File.exist?(path)

  language_code ||= detect_language_from_path(path)

  data = JSON.parse(File.read(path))

  case data
  when Hash
    word_to_index = data.transform_keys(&:freeze).freeze
  when Array
    word_to_index = {}
    data.each_with_index do |word, index|
      word_to_index[word.freeze] = index
    end
    word_to_index.freeze
  else
    raise ArgumentError, "Invalid vocabulary format: expected Hash or Array"
  end

  new(language_code: language_code, word_to_index: word_to_index)
end
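The two accepted JSON layouts can be compared side by side. This is a sketch under the assumption that only the `case` branch matters; `to_word_index` is a hypothetical helper that condenses the branch from the listing, and the JSON strings are made-up inputs parsed in memory rather than read from disk.

```ruby
require 'json'

# Hypothetical helper condensing the Hash/Array branch of from_file.
def to_word_index(data)
  case data
  when Hash  then data                      # already word => index
  when Array then data.each_with_index.to_h # position becomes the index
  else raise ArgumentError, 'Invalid vocabulary format: expected Hash or Array'
  end
end

hash_json  = '{"cat": 0, "dog": 1}'
array_json = '["cat", "dog"]'

p to_word_index(JSON.parse(hash_json))  # prints {"cat"=>0, "dog"=>1}
p to_word_index(JSON.parse(array_json)) # prints {"cat"=>0, "dog"=>1}
```

Both layouts yield the same mapping here; they differ only when the Hash form carries indices that are not simply 0, 1, 2, and so on.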
.from_words(words, language_code: 'en') ⇒ Vocabulary
Create vocabulary from Array of words
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 162
def self.from_words(words, language_code: 'en')
  word_to_index = {}
  words.each_with_index do |word, index|
    word_to_index[word.freeze] = index
  end
  word_to_index.freeze

  new(language_code: language_code, word_to_index: word_to_index)
end
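One consequence of assigning indices by position is worth checking on a list with repeats. The sketch below replays the loop from the listing on a made-up word list, outside the class: a repeated word keeps only its last position, so the resulting indices may have gaps.

```ruby
words = %w[a b a] # 'a' appears twice

# Same loop as from_words: a later occurrence overwrites the earlier index.
word_to_index = {}
words.each_with_index { |word, index| word_to_index[word] = index }
p word_to_index # prints {"a"=>2, "b"=>1}

# Deduplicating first keeps indices contiguous.
deduped = {}
words.uniq.each_with_index { |word, index| deduped[word] = index }
p deduped # prints {"a"=>0, "b"=>1}
```

If contiguous indices matter for embedding retrieval, calling `uniq` on the input before `from_words` is one possible precaution.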
Instance Method Details
#common_words(n: 10) ⇒ Array<String>
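Get the first n words in the vocabulary's stored order. These are the most frequent words only if the vocabulary was built in descending frequency order; no frequency data is kept.
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 102
def common_words(n: 10)
  return [] if @word_to_index.empty?

  @word_to_index.keys.first(n)
end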
#empty? ⇒ Boolean
Check if vocabulary is empty
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 195
def empty?
  @word_to_index.empty?
end
#get_word(index) ⇒ String?
Get word by index
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 67
def get_word(index)
  @index_to_word[index]
end
#include?(word) ⇒ Boolean
Check if word exists in vocabulary
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 76
def include?(word)
  @word_to_index.key?(word)
end
#lookup(word) ⇒ Integer?
Look up word index
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 58
def lookup(word)
  @word_to_index[word]
end
#sample(n: 10) ⇒ Array<String>
Get a sample of words
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 204
def sample(n: 10)
  @word_to_index.keys.sample(n)
end
#save_to_file(path, format: :hash) ⇒ Object
Save vocabulary to a JSON file, either as a word-to-index Hash (format: :hash, the default) or as an Array of words (format: :array). Raises ArgumentError for any other format.
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 177
def save_to_file(path, format: :hash)
  case format
  when :hash
    data = @word_to_index.dup
  when :array
    max_index = @index_to_word.compact.length
    data = @index_to_word.compact.first(max_index)
  else
    raise ArgumentError, "Unknown format: #{format}"
  end

  File.write(path, JSON.pretty_generate(data))
end
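The two output formats are not always interchangeable. The following sketch rebuilds both payloads with plain Ruby objects, using the same expressions as the listing; the vocabulary with a gap at index 1 is an invented example. It suggests that the :array format only round-trips when indices are contiguous from zero.

```ruby
require 'json'

word_to_index = { 'cat' => 0, 'emu' => 2 } # index 1 is unused

# Reverse index as the constructor builds it: ["cat", nil]
index_to_word = Array.new(word_to_index.size)
word_to_index.each { |w, i| index_to_word[i] = w if i < index_to_word.size }

hash_payload  = JSON.generate(word_to_index)         # what format: :hash stores
array_payload = JSON.generate(index_to_word.compact) # what format: :array stores

puts hash_payload  # prints {"cat":0,"emu":2}
puts array_payload # prints ["cat"]
```

In this example the :array payload loses 'emu' entirely, so :hash looks like the safer choice whenever indices may have gaps.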
#size ⇒ Integer
Get vocabulary size
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 84
def size
  @word_to_index.size
end
#sub_vocabulary(words) ⇒ Vocabulary
Create a sub-vocabulary containing only specified words
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 213
def sub_vocabulary(words)
  filtered = @word_to_index.select { |w, _| words.include?(w) }
  self.class.new(language_code: @language_code, word_to_index: filtered)
end
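Two details of the filter are easy to miss: the original indices are kept rather than renumbered, and a filter that matches nothing reaches the constructor's empty check. The sketch below uses `MiniVocabulary`, a hypothetical stand-in that copies only the constructor guard and `sub_vocabulary` from the listings (without `language_code`), so that it runs without the library.

```ruby
require 'set'

# Hypothetical stand-in for illustration; not the library class.
class MiniVocabulary
  attr_reader :word_to_index

  def initialize(word_to_index:)
    raise ArgumentError, 'word_to_index cannot be empty' if word_to_index.nil? || word_to_index.empty?

    @word_to_index = word_to_index.dup.freeze
  end

  def sub_vocabulary(words)
    filtered = @word_to_index.select { |w, _| words.include?(w) }
    self.class.new(word_to_index: filtered)
  end
end

vocab = MiniVocabulary.new(word_to_index: { 'cat' => 0, 'dog' => 1, 'emu' => 2 })

# Any collection responding to include? works; a Set keeps each check cheap.
subset = vocab.sub_vocabulary(Set['dog', 'emu'])
p subset.word_to_index # prints {"dog"=>1, "emu"=>2}

error_message = nil
begin
  vocab.sub_vocabulary(['yak']) # no match, so the filtered Hash is empty
rescue ArgumentError => e
  error_message = e.message
end
p error_message # prints "word_to_index cannot be empty"
```

Callers that may pass words outside the vocabulary would therefore need to rescue ArgumentError or check `include?` first.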
#to_h ⇒ Hash{String => Integer}
Convert to Hash
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 112
def to_h
  @word_to_index.dup
end
#to_s ⇒ String Also known as: inspect
String representation
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 232
def to_s
  "Vocabulary(language: #{@language_code}, size: #{@word_to_index.size})"
end
#valid_index?(index) ⇒ Boolean
Check if index is valid
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 93
def valid_index?(index)
  index.is_a?(Integer) && index >= 0 && index < @word_to_index.size
end
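The check above is a range test against the vocabulary size, not a test that a word is stored at that index. A short sketch with plain Ruby objects shows the difference; the lambdas repeat the expressions from the `valid_index?` and `get_word` listings, and the vocabulary with a gap is an invented example.

```ruby
word_to_index = { 'cat' => 0, 'emu' => 2 } # index 1 is unused

# Reverse index as the constructor builds it: ["cat", nil]
index_to_word = Array.new(word_to_index.size)
word_to_index.each { |w, i| index_to_word[i] = w if i < index_to_word.size }

valid_index = ->(i) { i.is_a?(Integer) && i >= 0 && i < word_to_index.size }
get_word    = ->(i) { index_to_word[i] }

checks = [0, 1, 2].map { |i| [i, valid_index.call(i), get_word.call(i)] }
p checks # prints [[0, true, "cat"], [1, true, nil], [2, false, nil]]
```

With contiguous indices the two notions coincide; with gaps, index 1 passes the check but has no word, while index 2 fails it although 'emu' maps to it.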
#words ⇒ Enumerator<String>
Get all words as enumerator
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 120
def words
  @word_to_index.each_key
end
#words_starting_with(prefix) ⇒ Array<String>
Find words starting with a prefix
# File 'lib/kotoshu/embeddings/vocabulary.rb', line 223
def words_starting_with(prefix)
  pattern = /^#{Regexp.escape(prefix)}/
  @word_to_index.keys.grep(pattern)
end
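Because the prefix is passed through `Regexp.escape`, characters such as `+` are matched literally. The sketch below applies the same pattern to a made-up word list and compares it with `String#start_with?`, which gives the same result for single-line words.

```ruby
words  = ['c', 'c++', 'c+x', 'cat', 'abc+']
prefix = 'c+'

# Same pattern as the listing: '+' is escaped, so it is not a quantifier.
pattern  = /^#{Regexp.escape(prefix)}/
by_regex = words.grep(pattern)
p by_regex # prints ["c++", "c+x"]

# Equivalent check without building a Regexp.
by_start_with = words.select { |w| w.start_with?(prefix) }
p by_start_with # prints ["c++", "c+x"]
```

Either form scans every key, so the cost grows linearly with vocabulary size.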