Class: Vocabulary

Inherits:
Object
Includes:
VocabularyProtocol
Defined in:
lib/kotoshu/embeddings/vocabulary.rb

Overview

Vocabulary - Word to index mapping

Provides efficient lookup from words to integer indices for embedding retrieval. Supports JSON file loading and saving.

Examples:

Creating a vocabulary

vocab = Kotoshu::Embeddings::Vocabulary.new(
  language_code: 'en',
  word_to_index: { 'hello' => 0, 'world' => 1 }
)

Loading from file

vocab = Kotoshu::Embeddings::Vocabulary.from_file('/path/to/vocab.json', language_code: 'en')

Methods included from Protocol

#assert_implemented_by!, #compliance_errors, #optional_methods, #required_methods

Constructor Details

#initialize(language_code:, word_to_index:) ⇒ Vocabulary

Create a new vocabulary

Parameters:

  • language_code (String)

    ISO 639-1 language code

  • word_to_index (Hash{String => Integer})

    Word to index mapping

Raises:

  • (ArgumentError)

    If word_to_index is nil or empty



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 39

def initialize(language_code:, word_to_index:)
  raise ArgumentError, 'word_to_index cannot be empty' if word_to_index.nil? || word_to_index.empty?

  @language_code = language_code
  @word_to_index = word_to_index.dup.freeze

  # Build reverse index (index -> word)
  @index_to_word = Array.new(@word_to_index.size)
  @word_to_index.each do |word, index|
    @index_to_word[index] = word if index < @index_to_word.size
  end
  @index_to_word.freeze
end
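
The reverse-index loop only fills slots whose index is smaller than the number of words, so a mapping with gaps in its indices behaves differently from a dense one. The following is a minimal standalone sketch of that loop, not the library's code path; `build_reverse_index` is a hypothetical helper introduced only for illustration.

```ruby
# Standalone sketch mirroring the reverse-index loop in #initialize.
# `build_reverse_index` is a hypothetical helper, not part of Kotoshu.
def build_reverse_index(word_to_index)
  index_to_word = Array.new(word_to_index.size)
  word_to_index.each do |word, index|
    # Indices at or beyond the word count are silently skipped.
    index_to_word[index] = word if index < index_to_word.size
  end
  index_to_word.freeze
end

dense  = build_reverse_index({ 'hello' => 0, 'world' => 1 })
sparse = build_reverse_index({ 'hello' => 0, 'world' => 5 })

p dense   # ["hello", "world"]
p sparse  # ["hello", nil]
```

In the sparse case the word 'world' can still be found by word, but no slot in the reverse array points back to it, so contiguous indices starting at 0 are the safest input.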

Instance Attribute Details

#index_to_word ⇒ Array<String> (readonly)

Returns the index-to-word mapping. Slots with no assigned word are nil.

Returns:

  • (Array<String>)

    Index-to-word mapping; slots with no assigned word are nil



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 30

def index_to_word
  @index_to_word
end

#language_code ⇒ String (readonly)

Returns the ISO 639-1 language code.

Returns:

  • (String)

    ISO 639-1 language code



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 24

def language_code
  @language_code
end

#word_to_index ⇒ Hash{String => Integer} (readonly)

Returns the word-to-index mapping.

Returns:

  • (Hash{String => Integer})

    Word to index mapping



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 27

def word_to_index
  @word_to_index
end

Class Method Details

.detect_language_from_path(path) ⇒ String

Detect the language code from a file name ending in <code>.vocab.json, for example en.vocab.json

Parameters:

  • path (String)

    File path

Returns:

  • (String)

    Detected language code, or 'unknown' if the file name does not match



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 244

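def self.detect_language_from_path(path)
  basename = File.basename(path)

  # The pattern is not anchored at the start, so it covers both
  # "en.vocab.json" and "model.en.vocab.json": the capture is the run of
  # word characters directly before ".vocab.json".
  return Regexp.last_match(1) if basename =~ /(\w+)\.vocab\.json\z/

  'unknown'
end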
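
To make the file-name convention concrete, here is a small standalone sketch of the same pattern applied to a few paths. `language_from` is a hypothetical helper that mirrors the first pattern in the listing, and the paths are made up for illustration.

```ruby
# Standalone sketch of the "<code>.vocab.json" convention.
# `language_from` is a hypothetical helper, not part of Kotoshu.
def language_from(path)
  match = File.basename(path).match(/(\w+)\.vocab\.json\z/)
  match ? match[1] : 'unknown'
end

p language_from('/models/en.vocab.json')          # "en"
p language_from('/models/fasttext.de.vocab.json') # "de"
p language_from('/models/vocab.json')             # "unknown"
# \w does not match "-", so only the part after the hyphen is captured.
p language_from('/models/pt-BR.vocab.json')       # "BR"
```

Because of the last case, passing language_code: explicitly is the safer choice for hyphenated codes.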

.from_file(path, language_code: nil) ⇒ Vocabulary

Load vocabulary from JSON file

Parameters:

  • path (String)

    Path to JSON file

  • language_code (String) (defaults to: nil)

    Language code (auto-detected from filename if nil)

Returns:

  • (Vocabulary)

    Vocabulary loaded from the file

Raises:

  • (ArgumentError)

    If the file doesn’t exist, or the parsed JSON is neither a Hash nor an Array

  • (JSON::ParserError)

    If file is not valid JSON



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 133

def self.from_file(path, language_code: nil)
  raise ArgumentError, "File not found: #{path}" unless File.exist?(path)

  language_code ||= detect_language_from_path(path)

  data = JSON.parse(File.read(path))

  case data
  when Hash
    word_to_index = data.transform_keys(&:freeze).freeze
  when Array
    word_to_index = {}
    data.each_with_index do |word, index|
      word_to_index[word.freeze] = index
    end
    word_to_index.freeze
  else
    raise ArgumentError, "Invalid vocabulary format: expected Hash or Array"
  end

  new(language_code: language_code, word_to_index: word_to_index)
end
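
The two accepted JSON shapes describe the same mapping: an object of word => index pairs, or an array whose positions are the indices. A minimal sketch of that normalization, working on JSON strings instead of files; `normalize_vocab` is a hypothetical helper that mirrors the case statement above.

```ruby
require 'json'

# Hypothetical helper mirroring the case statement in .from_file.
def normalize_vocab(data)
  case data
  when Hash  then data
  when Array then data.each_with_index.to_h # position becomes the index
  else raise ArgumentError, 'Invalid vocabulary format: expected Hash or Array'
  end
end

from_hash  = normalize_vocab(JSON.parse('{"hello": 0, "world": 1}'))
from_array = normalize_vocab(JSON.parse('["hello", "world"]'))

p from_hash == from_array # true
p from_array['world']     # 1
```

If a word appears twice in the array form, the later position wins, since it overwrites the earlier hash entry.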

.from_words(words, language_code: 'en') ⇒ Vocabulary

Create vocabulary from Array of words

Parameters:

  • words (Array<String>)

    Array of words

  • language_code (String) (defaults to: 'en')

    Language code

Returns:

  • (Vocabulary)

    Vocabulary in which each word's index is its position in the array



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 162

def self.from_words(words, language_code: 'en')
  word_to_index = {}
  words.each_with_index do |word, index|
    word_to_index[word.freeze] = index
  end
  word_to_index.freeze

  new(language_code: language_code, word_to_index: word_to_index)
end

Instance Method Details

#common_words(n: 10) ⇒ Array<String>

Get the first n words in insertion order. These are the most frequent words only if the vocabulary was built in frequency order.

Parameters:

  • n (Integer) (defaults to: 10)

    Number of words to return

Returns:

  • (Array<String>)

    The first n words of the vocabulary



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 102

def common_words(n: 10)
  return [] if @word_to_index.empty?

  @word_to_index.keys.first(n)
end

#empty? ⇒ Boolean

Check if vocabulary is empty

Returns:

  • (Boolean)

    True if empty



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 195

def empty?
  @word_to_index.empty?
end

#get_word(index) ⇒ String?

Get word by index

Parameters:

  • index (Integer)

    The index to look up

Returns:

  • (String, nil)

    Word at the index, or nil if not found



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 67

def get_word(index)
  @index_to_word[index]
end

#include?(word) ⇒ Boolean

Check if word exists in vocabulary

Parameters:

  • word (String)

    Word to check

Returns:

  • (Boolean)

    True if word exists



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 76

def include?(word)
  @word_to_index.key?(word)
end

#lookup(word) ⇒ Integer?

Look up word index

Parameters:

  • word (String)

    The word to look up

Returns:

  • (Integer, nil)

    Index of the word, or nil if not found



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 58

def lookup(word)
  @word_to_index[word]
end

#sample(n: 10) ⇒ Array<String>

Get a sample of words

Parameters:

  • n (Integer) (defaults to: 10)

    Number of words to sample

Returns:

  • (Array<String>)

    Sample of words



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 204

def sample(n: 10)
  @word_to_index.keys.sample(n)
end

#save_to_file(path, format: :hash) ⇒ Object

Save vocabulary to JSON file

Parameters:

  • path (String)

    Path to save file

  • format (Symbol) (defaults to: :hash)

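    Format: :hash (word => index object) or :array (list of words in index order)

Raises:

  • (ArgumentError)

    If format is neither :hash nor :array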



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 177

def save_to_file(path, format: :hash)
  case format
  when :hash
    data = @word_to_index.dup
  when :array
    # Drop unassigned (nil) slots before writing the word list.
    data = @index_to_word.compact
  else
    raise ArgumentError, "Unknown format: #{format}"
  end

  File.write(path, JSON.pretty_generate(data))
end
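
As a rough illustration of the two output formats, the sketch below writes the same small vocabulary both ways and reads each file back. It is standalone: a plain Hash and Array stand in for the object's internal state, and the file names are made up.

```ruby
require 'json'
require 'tmpdir'

# Standalone sketch: plain Hash/Array stand in for the vocabulary's
# internal @word_to_index and @index_to_word.
word_to_index = { 'hello' => 0, 'world' => 1 }
index_to_word = ['hello', 'world']

hash_back = nil
array_back = nil

Dir.mktmpdir do |dir|
  hash_path  = File.join(dir, 'hash.en.vocab.json')
  array_path = File.join(dir, 'array.en.vocab.json')

  # format: :hash writes the word => index object as-is.
  File.write(hash_path, JSON.pretty_generate(word_to_index))
  # format: :array writes the words in index order, nil slots removed.
  File.write(array_path, JSON.pretty_generate(index_to_word.compact))

  hash_back  = JSON.parse(File.read(hash_path))
  array_back = JSON.parse(File.read(array_path)).each_with_index.to_h
end

p hash_back == word_to_index   # true
p array_back == word_to_index  # true
```

Both formats round-trip for a dense vocabulary. Because the :array branch removes nil slots, a vocabulary with gaps in its indices would come back renumbered, so :hash is the safer choice in that case.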

#size ⇒ Integer

Get vocabulary size

Returns:

  • (Integer)

    Number of words in vocabulary



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 84

def size
  @word_to_index.size
end

#sub_vocabulary(words) ⇒ Vocabulary

Create a sub-vocabulary containing only specified words

Parameters:

  • words (Array<String>)

    Words to include

Returns:

  • (Vocabulary)

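    New vocabulary containing only the requested words that exist in this vocabulary; each keeps its original index

Raises:

  • (ArgumentError)

    If none of the given words are in the vocabulary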



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 213

def sub_vocabulary(words)
  filtered = @word_to_index.select { |w, _| words.include?(w) }
  self.class.new(language_code: @language_code, word_to_index: filtered)
end
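
The filter keeps each word's original index, which matters because the constructor only builds reverse-index slots below the new, smaller size. A standalone sketch of the filtering step follows, with a plain Hash standing in for @word_to_index. The Set and the renumbering step are assumptions added for illustration, not something the library does.

```ruby
require 'set'

# Standalone sketch: a plain Hash stands in for @word_to_index.
word_to_index = { 'the' => 0, 'cat' => 1, 'sat' => 2 }

# A Set gives constant-time membership checks for large word lists.
wanted = Set['cat', 'sat', 'missing']

# Same filter as #sub_vocabulary: original indices are kept.
filtered = word_to_index.select { |word, _| wanted.include?(word) }
p filtered.keys    # ["cat", "sat"]
p filtered.values  # [1, 2]

# Hypothetical follow-up step (not in the library): renumber from zero.
reindexed = filtered.keys.each_with_index.to_h
p reindexed.values # [0, 1]
```

With the original indices, 'sat' has index 2 in a vocabulary of size 2, so lookups by word still work but the reverse array has no slot for it. Renumbering avoids that, at the cost of no longer matching the rows of the original embedding matrix.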

#to_h ⇒ Hash{String => Integer}

Convert to Hash

Returns:

  • (Hash{String => Integer})

    Copy of word_to_index mapping



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 112

def to_h
  @word_to_index.dup
end

#to_s ⇒ String (also known as: inspect)

String representation

Returns:

  • (String)


# File 'lib/kotoshu/embeddings/vocabulary.rb', line 232

def to_s
  "Vocabulary(language: #{@language_code}, size: #{@word_to_index.size})"
end

#valid_index?(index) ⇒ Boolean

Check if index is valid

Parameters:

  • index (Integer)

    Index to check

Returns:

  • (Boolean)

    True if index is an Integer in the range 0...size



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 93

def valid_index?(index)
  index.is_a?(Integer) && index >= 0 && index < @word_to_index.size
end

#words ⇒ Enumerator<String>

Get all words as an enumerator

Returns:

  • (Enumerator<String>)

    Enumerator of all words



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 120

def words
  @word_to_index.each_key
end

#words_starting_with(prefix) ⇒ Array<String>

Find words starting with a prefix

Parameters:

  • prefix (String)

    Prefix to match

Returns:

  • (Array<String>)

    Matching words



# File 'lib/kotoshu/embeddings/vocabulary.rb', line 223

def words_starting_with(prefix)
  # \A anchors at the start of the string; ^ would also match after a newline.
  pattern = /\A#{Regexp.escape(prefix)}/
  @word_to_index.keys.grep(pattern)
end
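
In Ruby regular expressions, `^` matches at the start of every line while `\A` matches only at the start of the string, which changes the result for entries that contain a newline. A short standalone sketch comparing the anchors, with `String#start_with?` as a regex-free alternative; the word list is made up for illustration.

```ruby
# Standalone sketch: compare line-anchored and string-anchored prefix matching.
words  = ['hello', 'help', "note\nhelp", 'world']
prefix = 'hel'

line_anchored   = words.grep(/^#{Regexp.escape(prefix)}/)
string_anchored = words.grep(/\A#{Regexp.escape(prefix)}/)
via_start_with  = words.select { |word| word.start_with?(prefix) }

p line_anchored.size   # 3
p string_anchored      # ["hello", "help"]
p via_start_with       # ["hello", "help"]
```

For ordinary single-line tokens all three approaches agree; the difference only shows up when an entry spans multiple lines.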