Class: Kotoshu::Cache::ModelCache
Defined in: lib/kotoshu/cache/model_cache.rb
Overview
Manages embedding model downloads from the FastText CDN and GitHub.
Extends BaseCache to support FastText .vec files and ONNX models. FastText models are downloaded from Facebook's public CDN; ONNX models come from the kotoshu/models-fasttext-onnx GitHub repository.
Constant Summary
- AVAILABLE_MODELS =
Models available on the FastText CDN and in the models-fasttext-onnx repository.
    {
      # FastText crawl vectors (300D) from Facebook Research
      # https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/
      # Selected high-resource languages
      fasttext: {
        de: { file: "cc.de.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" },
        en: { file: "cc.en.300.vec.gz", size: 2_000_000, source: "FastText Common Crawl" },
        es: { file: "cc.es.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" },
        fr: { file: "cc.fr.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" },
        pt: { file: "cc.pt.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" },
        ru: { file: "cc.ru.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" }
      },
      # ONNX models (active set) from models-fasttext-onnx repository.
      # Sizes synced with manifest.json in kotoshu/models-fasttext-onnx.
      # The repo holds .onnx for 158 languages but only the 9 below are
      # tracked and exposed. To promote a language, see
      # models-fasttext-onnx/.gitignore and re-sync this constant.
      # https://github.com/kotoshu/models-fasttext-onnx
      onnx: {
        de: { file: "fasttext.de.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
        en: { file: "fasttext.en.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
        es: { file: "fasttext.es.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
        fr: { file: "fasttext.fr.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
        pt: { file: "fasttext.pt.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
        ru: { file: "fasttext.ru.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
        zh: { file: "fasttext.zh.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
        ja: { file: "fasttext.ja.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
        ko: { file: "fasttext.ko.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
      }
    }.freeze
Instance Attribute Summary
Attributes inherited from BaseCache
#cache_path, #cache_ttl, #github_url, #source_registry, #url_base
Instance Method Summary
- #all_available_models ⇒ Hash
  List all available models across all languages.
- #available_models_for(language_code) ⇒ Array<Symbol>
  Get available model types for a language.
- #cached_resources ⇒ Array<String>
  List all cached resources.
- #get_fasttext_model(language_code, force_download: false) ⇒ String?
  Get or download FastText model for a language.
- #get_onnx_model(language_code, force_download: false) ⇒ String?
  Get or download ONNX model for a language.
- #model_info(language_code, model_type) ⇒ Hash?
  Get model info for a language and type.
- #supports_resource?(resource_id) ⇒ Boolean
  Check if a resource type is supported.
Methods inherited from BaseCache
#available?, #clean, #clear, #clear_all, #download, #get, #initialize, #reset_stats, #stats
Constructor Details
This class inherits a constructor from Kotoshu::Cache::BaseCache
Instance Method Details
#all_available_models ⇒ Hash
List all available models across all languages.
    # File 'lib/kotoshu/cache/model_cache.rb', line 103
    def all_available_models
      AVAILABLE_MODELS
    end
#available_models_for(language_code) ⇒ Array<Symbol>
Get available model types for a language.
    # File 'lib/kotoshu/cache/model_cache.rb', line 83
    def available_models_for(language_code)
      lang = language_code.to_sym
      types = []
      types << :fasttext if AVAILABLE_MODELS[:fasttext][lang]
      types << :onnx if AVAILABLE_MODELS[:onnx][lang]
      types
    end
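The lookup above can be tried in isolation. The following is a minimal sketch, assuming a trimmed copy of the model table; `MODELS` and `model_types_for` are illustrative names, not part of Kotoshu.

```ruby
# Trimmed, illustrative copy of the model table (not the full constant).
MODELS = {
  fasttext: { en: { file: "cc.en.300.vec.gz" } },
  onnx: {
    en: { file: "fasttext.en.onnx" },
    ja: { file: "fasttext.ja.onnx" }
  }
}.freeze

# Same idea as available_models_for, written as a free function:
# keep every model type whose table has an entry for the language.
def model_types_for(language_code, models = MODELS)
  lang = language_code.to_sym
  models.keys.select { |type| models[type].key?(lang) }
end

EN_TYPES = model_types_for("en") # both tables list :en
JA_TYPES = model_types_for(:ja)  # only the ONNX table lists :ja
XX_TYPES = model_types_for("xx") # unknown language gives an empty array
```

Because the language code goes through `to_sym`, strings and symbols are interchangeable as arguments.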
#cached_resources ⇒ Array<String>
List all cached resources.
    # File 'lib/kotoshu/cache/model_cache.rb', line 122
    def cached_resources
      Dir.glob(File.join(@cache_path, "**", "metadata.json")).map do |path|
        relative = Pathname.new(path).relative_path_from(Pathname.new(@cache_path))
        parts = relative.to_s.split("/")
        "#{parts[0]}:#{parts[2]}" # language:model_type
      end.uniq
    end
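To see how a metadata path turns into a `language:model_type` identifier, here is a small sketch using `Pathname#relative_path_from` from the standard library. The on-disk layout is an assumption: the method reads path segments 0 and 2, so the sketch invents a middle directory named `models`. The real cache layout may differ, and `resource_ids` is a hypothetical helper.

```ruby
require "pathname"
require "tmpdir"
require "fileutils"

# Hypothetical layout: <cache>/<language>/models/<model_type>/metadata.json
def resource_ids(cache_path)
  base = Pathname.new(cache_path)
  Dir.glob(File.join(cache_path, "**", "metadata.json")).map do |path|
    parts = Pathname.new(path).relative_path_from(base).to_s.split("/")
    "#{parts[0]}:#{parts[2]}" # segment 0 is the language, segment 2 the type
  end.uniq.sort
end

# Build a throwaway cache directory and list what it contains.
IDS = Dir.mktmpdir do |dir|
  %w[en/models/onnx de/models/fasttext].each do |sub|
    FileUtils.mkdir_p(File.join(dir, sub))
    File.write(File.join(dir, sub, "metadata.json"), "{}")
  end
  resource_ids(dir)
end
```

The sketch sorts the result only to make the order predictable; the documented method returns the identifiers in glob order.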
#get_fasttext_model(language_code, force_download: false) ⇒ String?
Get or download FastText model for a language.
    # File 'lib/kotoshu/cache/model_cache.rb', line 60
    def get_fasttext_model(language_code, force_download: false)
      resource_id = "#{language_code}:fasttext"
      result = get(resource_id, force_download: force_download)
      result&.dig(:model_path)
    end
#get_onnx_model(language_code, force_download: false) ⇒ String?
Get or download ONNX model for a language.
    # File 'lib/kotoshu/cache/model_cache.rb', line 72
    def get_onnx_model(language_code, force_download: false)
      resource_id = "#{language_code}:onnx"
      result = get(resource_id, force_download: force_download)
      result&.dig(:model_path)
    end
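Both accessors build a `language:type` resource id, delegate to the inherited `#get`, and return `nil` when nothing comes back. The sketch below illustrates that nil-safe pattern with a stand-in class. `StubCache` is hypothetical, and the only thing assumed about the real `BaseCache#get` is that a successful result responds to `dig(:model_path)`, as the accessor code implies.

```ruby
# Stand-in for the cache; entries are supplied up front instead of downloaded.
class StubCache
  def initialize(entries)
    @entries = entries
  end

  # Mimics the shape of the inherited #get: a hash on success, nil otherwise.
  # force_download is accepted for signature parity and ignored here.
  def get(resource_id, force_download: false)
    @entries[resource_id]
  end

  def get_onnx_model(language_code, force_download: false)
    result = get("#{language_code}:onnx", force_download: force_download)
    result&.dig(:model_path) # safe navigation turns a missing result into nil
  end
end

CACHE = StubCache.new("en:onnx" => { model_path: "/tmp/fasttext.en.onnx" })
HIT = CACHE.get_onnx_model(:en)  # path from the stubbed entry
MISS = CACHE.get_onnx_model(:xx) # nil, no exception raised
```

Callers should therefore check the return value for `nil` before opening the model file.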
#model_info(language_code, model_type) ⇒ Hash?
Get model info for a language and type.
    # File 'lib/kotoshu/cache/model_cache.rb', line 96
    def model_info(language_code, model_type)
      AVAILABLE_MODELS.dig(model_type, language_code.to_sym)
    end
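Note that only `language_code` is converted with `to_sym`; `model_type` is passed to `Hash#dig` unchanged. A minimal sketch of the consequence, assuming a one-entry copy of the table (`MODEL_TABLE` and `lookup` are illustrative names):

```ruby
# One-entry illustrative copy of the model table.
MODEL_TABLE = {
  onnx: {
    en: { file: "fasttext.en.onnx", size: 120_000_415, source: "models-fasttext-onnx" }
  }
}.freeze

# Same call shape as model_info: the type is used as given,
# the language is converted to a symbol.
def lookup(language_code, model_type, table = MODEL_TABLE)
  table.dig(model_type, language_code.to_sym)
end

INFO = lookup("en", :onnx)         # the entry hash
NO_LANG = lookup("xx", :onnx)      # nil: language not listed
NO_TYPE = lookup("en", :gguf)      # nil: type not listed
STRING_TYPE = lookup("en", "onnx") # nil: a String key does not match :onnx
```

Passing the model type as a Symbol avoids the last case.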
#supports_resource?(resource_id) ⇒ Boolean
Check if a resource type is supported.
    # File 'lib/kotoshu/cache/model_cache.rb', line 111
    def supports_resource?(resource_id)
      parts = resource_id.split(":")
      return false unless parts.size == 2

      language, type = parts
      AVAILABLE_MODELS[type.to_sym]&.key?(language.to_sym)
    end
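As written, the safe-navigation call yields `nil` rather than `false` when the model type is not in the table, which is falsy but not strictly a Boolean. The sketch below is a variant, not the library's code, that always returns `true` or `false`; `SUPPORTED` and `supported?` are hypothetical names.

```ruby
# Illustrative table; entry contents are irrelevant to the check.
SUPPORTED = {
  fasttext: { en: {} },
  onnx: { en: {}, ja: {} }
}.freeze

# Accepts exactly "language:type" and always returns a strict Boolean.
def supported?(resource_id, table = SUPPORTED)
  language, type, *extra = resource_id.split(":")
  return false if type.nil? || !extra.empty?

  # fetch with an empty-hash default avoids the nil from a missing type
  table.fetch(type.to_sym, {}).key?(language.to_sym)
end

OK_ID = supported?("en:onnx")
WRONG_TYPE_FOR_LANG = supported?("ja:fasttext")
UNKNOWN_TYPE = supported?("en:gguf")
MALFORMED = supported?("en")
TOO_MANY = supported?("en:onnx:extra")
```

In conditionals the two versions behave the same; the difference only matters when the result is compared with `== false` or serialized.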