Class: Kotoshu::Cache::ModelCache

Inherits:
BaseCache
  • Object
Defined in:
lib/kotoshu/cache/model_cache.rb

Overview

Manages embedding model downloads from the FastText CDN and GitHub.

Extends BaseCache to support FastText .vec files and ONNX models. FastText models are downloaded from Facebook's public CDN; ONNX models come from the kotoshu/models-fasttext-onnx repository on GitHub.

Examples:

Downloading a FastText model

cache = ModelCache.new
vec_file = cache.get_fasttext_model('en')
model = FastTextModel.from_file(vec_file)

Downloading an ONNX model

onnx_file = cache.get_onnx_model('en')

Constant Summary

AVAILABLE_MODELS =

Models available from the FastText CDN and the models-fasttext-onnx repository, keyed by model type and then by language

{
  # FastText crawl vectors (300D) from Facebook Research
  # https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/
  # Selected high-resource languages
  fasttext: {
    de: { file: "cc.de.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" },
    en: { file: "cc.en.300.vec.gz", size: 2_000_000, source: "FastText Common Crawl" },
    es: { file: "cc.es.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" },
    fr: { file: "cc.fr.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" },
    pt: { file: "cc.pt.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" },
    ru: { file: "cc.ru.300.vec.gz", size: 1_000_000, source: "FastText Common Crawl" }
  },
  # ONNX models (active set) from models-fasttext-onnx repository.
  # Sizes synced with manifest.json in kotoshu/models-fasttext-onnx.
  # The repo holds .onnx for 158 languages but only the 9 below are
  # tracked and exposed — to promote a language, see
  # models-fasttext-onnx/.gitignore and re-sync this constant.
  # https://github.com/kotoshu/models-fasttext-onnx
  onnx: {
    de: { file: "fasttext.de.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
    en: { file: "fasttext.en.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
    es: { file: "fasttext.es.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
    fr: { file: "fasttext.fr.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
    pt: { file: "fasttext.pt.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
    ru: { file: "fasttext.ru.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
    zh: { file: "fasttext.zh.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
    ja: { file: "fasttext.ja.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
    ko: { file: "fasttext.ko.onnx", size: 120_000_415, source: "models-fasttext-onnx" },
  }
}.freeze
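
As a rough illustration of how a registry entry could map to a download location, the sketch below joins the FastText base URL quoted in the comment above with an entry's `file` value. The names `FASTTEXT_BASE_URL` and `fasttext_url` are hypothetical; the actual URL construction happens in BaseCache and is not shown on this page.

```ruby
require "uri"

# Base URL quoted in the AVAILABLE_MODELS comment.
FASTTEXT_BASE_URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/"

# Hypothetical helper (not part of ModelCache): build a download URL
# from a registry entry such as { file: "cc.en.300.vec.gz", ... }.
# ASSUMPTION: files sit directly under the base URL.
def fasttext_url(entry)
  URI.join(FASTTEXT_BASE_URL, entry.fetch(:file)).to_s
end

puts fasttext_url({ file: "cc.en.300.vec.gz", size: 2_000_000 })
```

The trailing slash on the base URL matters: without it, `URI.join` would replace the last path segment instead of appending to it.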

Instance Attribute Summary

Attributes inherited from BaseCache

#cache_path, #cache_ttl, #github_url, #source_registry, #url_base

Instance Method Summary

Methods inherited from BaseCache

#available?, #clean, #clear, #clear_all, #download, #get, #initialize, #reset_stats, #stats

Constructor Details

This class inherits a constructor from Kotoshu::Cache::BaseCache

Instance Method Details

#all_available_models ⇒ Hash

List all available models across all languages.

Returns:

  • (Hash)

    Mapping of model type (:fasttext, :onnx) to a per-language hash of model info



# File 'lib/kotoshu/cache/model_cache.rb', line 103

def all_available_models
  AVAILABLE_MODELS
end
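
Because the returned hash is keyed by model type first, a caller that wants a per-language view has to invert it. The following is a minimal sketch of that inversion over a trimmed copy of the registry; `SAMPLE_REGISTRY` and `types_by_language` are illustrative names, not part of the library.

```ruby
# Trimmed, illustrative copy of the registry shape
# (the real constant carries more languages and fields).
SAMPLE_REGISTRY = {
  fasttext: { en: { file: "cc.en.300.vec.gz" } },
  onnx: { en: { file: "fasttext.en.onnx" }, ja: { file: "fasttext.ja.onnx" } }
}.freeze

# Hypothetical helper: turn type => languages into language => [types].
def types_by_language(registry)
  registry.each_with_object({}) do |(type, languages), index|
    languages.each_key { |lang| (index[lang] ||= []) << type }
  end
end

types_by_language(SAMPLE_REGISTRY).each do |lang, types|
  puts "#{lang}: #{types.join(', ')}"
end
```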

#available_models_for(language_code) ⇒ Array<Symbol>

Get available model types for a language.

Parameters:

  • language_code (String)

    ISO 639-1 language code

Returns:

  • (Array<Symbol>)

    Available model types (:fasttext, :onnx)



# File 'lib/kotoshu/cache/model_cache.rb', line 83

def available_models_for(language_code)
  lang = language_code.to_sym
  types = []
  types << :fasttext if AVAILABLE_MODELS[:fasttext][lang]
  types << :onnx if AVAILABLE_MODELS[:onnx][lang]
  types
end
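
The lookup can be tried in isolation. This sketch runs the same logic against a trimmed sample registry (`SAMPLE_MODELS` and `sample_available_models_for` are stand-in names), showing a language with both model types, an ONNX-only language, and an unknown code.

```ruby
# Trimmed stand-in for AVAILABLE_MODELS.
SAMPLE_MODELS = {
  fasttext: { en: { file: "cc.en.300.vec.gz" } },
  onnx: { en: { file: "fasttext.en.onnx" }, ja: { file: "fasttext.ja.onnx" } }
}.freeze

# Same logic as the documented method, run against the sample registry.
def sample_available_models_for(language_code)
  lang = language_code.to_sym
  types = []
  types << :fasttext if SAMPLE_MODELS[:fasttext][lang]
  types << :onnx if SAMPLE_MODELS[:onnx][lang]
  types
end

%w[en ja xx].each do |code|
  puts "#{code}: #{sample_available_models_for(code).inspect}"
end
```

An unknown language code yields an empty array rather than nil, so callers can test the result with `empty?`.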

#cached_resourcesArray<String>

List all cached resources.

Returns:

  • (Array<String>)

    List of cached resource identifiers



# File 'lib/kotoshu/cache/model_cache.rb', line 122

def cached_resources
  Dir.glob(File.join(@cache_path, "**", "metadata.json")).map do |path|
    # Pathname#relative_path_from gives the path relative to the cache root
    relative = Pathname.new(path).relative_path_from(Pathname.new(@cache_path))
    parts = relative.to_s.split("/")
    "#{parts[0]}:#{parts[2]}" # language:model_type
  end.uniq
end
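
To make the path handling concrete, here is a sketch that builds a throwaway cache directory and derives identifiers from it. The directory layout `<language>/<middle>/<model_type>/metadata.json` is an assumption inferred from the `parts[0]` and `parts[2]` indices; the real layout and the name of the middle directory are not stated on this page. `resource_ids` and `demo_resource_ids` are hypothetical names.

```ruby
require "fileutils"
require "pathname"
require "tmpdir"

# Map every metadata.json under root to a "language:model_type" identifier.
# ASSUMPTION: layout is <language>/<middle>/<model_type>/metadata.json.
def resource_ids(root)
  Dir.glob(File.join(root, "**", "metadata.json")).map do |path|
    relative = Pathname.new(path).relative_path_from(Pathname.new(root))
    parts = relative.to_s.split("/")
    "#{parts[0]}:#{parts[2]}"
  end.uniq
end

# Build a temporary cache tree and list the identifiers found in it.
def demo_resource_ids
  Dir.mktmpdir do |root|
    %w[en/models/fasttext en/models/onnx ja/models/onnx].each do |dir|
      FileUtils.mkdir_p(File.join(root, dir))
      File.write(File.join(root, dir, "metadata.json"), "{}")
    end
    resource_ids(root).sort
  end
end

puts demo_resource_ids
```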

#get_fasttext_model(language_code, force_download: false) ⇒ String?

Get or download FastText model for a language.

Parameters:

  • language_code (String)

    ISO 639-1 language code

  • force_download (Boolean) (defaults to: false)

    Force re-download

Returns:

  • (String, nil)

    Path to the cached .vec file, or nil if the cache returned no result



# File 'lib/kotoshu/cache/model_cache.rb', line 60

def get_fasttext_model(language_code, force_download: false)
  resource_id = "#{language_code}:fasttext"
  result = get(resource_id, force_download: force_download)

  result&.dig(:model_path)
end
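
The nil-safe return can be exercised without touching the network. The sketch below uses a hypothetical `StubCache` whose `get` stands in for the inherited BaseCache#get; it assumes, as the source above implies, that `get` returns either a Hash containing `:model_path` or nil.

```ruby
# Hypothetical stand-in for ModelCache; no downloads take place.
class StubCache
  def initialize(entries)
    @entries = entries
  end

  # Stand-in for BaseCache#get: a Hash for known resources, nil otherwise.
  def get(resource_id, force_download: false)
    @entries[resource_id]
  end

  # Same body as the documented method.
  def get_fasttext_model(language_code, force_download: false)
    resource_id = "#{language_code}:fasttext"
    result = get(resource_id, force_download: force_download)

    result&.dig(:model_path)
  end
end

def build_stub_cache
  StubCache.new("en:fasttext" => { model_path: "/tmp/cache/cc.en.300.vec" })
end

cache = build_stub_cache
path = cache.get_fasttext_model("en")
puts(path ? "model at #{path}" : "model unavailable")
puts cache.get_fasttext_model("xx").inspect
```

Callers should check for nil before handing the path to a loader, since the safe-navigation operator passes a missing result straight through.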

#get_onnx_model(language_code, force_download: false) ⇒ String?

Get or download ONNX model for a language.

Parameters:

  • language_code (String)

    ISO 639-1 language code

  • force_download (Boolean) (defaults to: false)

    Force re-download

Returns:

  • (String, nil)

    Path to the cached .onnx file, or nil if the cache returned no result



# File 'lib/kotoshu/cache/model_cache.rb', line 72

def get_onnx_model(language_code, force_download: false)
  resource_id = "#{language_code}:onnx"
  result = get(resource_id, force_download: force_download)

  result&.dig(:model_path)
end

#model_info(language_code, model_type) ⇒ Hash?

Get model info for a language and type.

Parameters:

  • language_code (String)

    ISO 639-1 language code

  • model_type (Symbol)

    Model type (:fasttext, :onnx)

Returns:

  • (Hash, nil)

    Model info or nil if not available



# File 'lib/kotoshu/cache/model_cache.rb', line 96

def model_info(language_code, model_type)
  AVAILABLE_MODELS.dig(model_type, language_code.to_sym)
end
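
One detail worth seeing in practice: the language code is converted with `to_sym`, but the model type is passed to `dig` unchanged, so it has to be a Symbol already. This sketch reproduces the lookup against a trimmed sample registry (`INFO_REGISTRY` and `sample_model_info` are illustrative names).

```ruby
# Trimmed stand-in for AVAILABLE_MODELS.
INFO_REGISTRY = {
  onnx: {
    en: { file: "fasttext.en.onnx", size: 120_000_415, source: "models-fasttext-onnx" }
  }
}.freeze

# Same lookup as the documented method, against the sample registry.
def sample_model_info(language_code, model_type)
  INFO_REGISTRY.dig(model_type, language_code.to_sym)
end

puts sample_model_info("en", :onnx).inspect  # Symbol type: entry found
puts sample_model_info("en", "onnx").inspect # String type: no match, nil
puts sample_model_info("xx", :onnx).inspect  # unknown language: nil
```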

#supports_resource?(resource_id) ⇒ Boolean

Check if a resource type is supported.

Parameters:

  • resource_id (String)

    The resource identifier (e.g., “en:fasttext”)

Returns:

  • (Boolean)

    True if supported



# File 'lib/kotoshu/cache/model_cache.rb', line 111

def supports_resource?(resource_id)
  parts = resource_id.split(":")
  return false unless parts.size == 2

  language, type = parts
  # Coerce to a strict Boolean: an unknown model type would otherwise yield nil
  AVAILABLE_MODELS[type.to_sym]&.key?(language.to_sym) || false
end
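
A few edge cases of the identifier format, sketched against a trimmed sample registry. `SUPPORTED_SAMPLE` and `sample_supports_resource?` are stand-in names, and the sketch coerces the result to a strict Boolean.

```ruby
# Trimmed stand-in for AVAILABLE_MODELS; only the keys matter here.
SUPPORTED_SAMPLE = {
  fasttext: { en: {} },
  onnx: { en: {}, ja: {} }
}.freeze

# Accepts exactly "language:type"; anything else is unsupported.
def sample_supports_resource?(resource_id)
  parts = resource_id.split(":")
  return false unless parts.size == 2

  language, type = parts
  SUPPORTED_SAMPLE[type.to_sym]&.key?(language.to_sym) || false
end

["en:fasttext", "ja:fasttext", "en", "en:bogus", "en:onnx:extra"].each do |id|
  puts "#{id.inspect} => #{sample_supports_resource?(id)}"
end
```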