Class: Kotoshu::Data::CommonWordsLoader
- Inherits: Object
  - Object
    - Kotoshu::Data::CommonWordsLoader
- Defined in: lib/kotoshu/data/common_words_loader.rb
Overview
Loads and provides access to common words data for all supported languages.
This loader supports loading from:
- Local YAML files in lib/kotoshu/data/common_words/<language_code>.yml
- frequency.json files downloaded from GitHub (via LanguageCache)
Each language file contains:
- metadata: source information, word count, last updated
- tiers: top 50, top 200, and top 1000 most common words
Constant Summary
- DATA_DIR = File.expand_path('../common_words', __FILE__).freeze
  Default data directory (local YAML files)
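As a rough illustration of how such a path resolves, assuming the constant is built with `File.expand_path` relative to the loader's own file: the `/app/...` location below is a made-up example, not a real install path.

```ruby
# Hypothetical absolute location of the loader file, for illustration only.
loader_file = '/app/lib/kotoshu/data/common_words_loader.rb'

# expand_path treats its second argument as the starting directory, so the
# leading '..' steps from the file itself up to the directory containing it.
data_dir = File.expand_path('../common_words', loader_file)

# Equivalent form that names the containing directory explicitly.
same_dir = File.expand_path('common_words', File.dirname(loader_file))

puts data_dir
puts data_dir == same_dir
```

Both forms point at a `common_words` folder that sits next to the loader file, which matches the local YAML location described in the overview.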
Class Method Summary
- .available?(language_code) ⇒ Boolean
  Check if a language has local data.
- .available_languages ⇒ Array<String>
  Get list of languages with local YAML files.
- .load(language_code, tier: :top_1000) ⇒ Hash{Symbol => Set}
  Load common words for a language.
- .load_from_frequency_file(frequency_path) ⇒ Hash{Symbol => Set}
  Load from GitHub frequency.json (Phase 2 integration).
Class Method Details
.available?(language_code) ⇒ Boolean
Check if a language has local data.
    # File 'lib/kotoshu/data/common_words_loader.rb', line 110
    def available?(language_code)
      File.exist?(File.join(DATA_DIR, "#{language_code}.yml"))
    end
.available_languages ⇒ Array<String>
Get list of languages with local YAML files.
    # File 'lib/kotoshu/data/common_words_loader.rb', line 102
    def available_languages
      Dir.glob(File.join(DATA_DIR, '*.yml')).map { |f| File.basename(f, '.yml') }
    end
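The two discovery methods can be tried out in isolation with a small sketch. The helpers `available_in?` and `languages_in` are hypothetical stand-ins that take the directory as an argument (the real methods read `DATA_DIR`), so the example can run against a temporary folder.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in for .available?, with the directory passed in.
def available_in?(data_dir, language_code)
  File.exist?(File.join(data_dir, "#{language_code}.yml"))
end

# Hypothetical stand-in for .available_languages. The trailing sort is an
# addition so the output order does not depend on the Ruby version.
def languages_in(data_dir)
  Dir.glob(File.join(data_dir, '*.yml')).map { |f| File.basename(f, '.yml') }.sort
end

languages, has_en, has_fr = Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'en.yml'))
  FileUtils.touch(File.join(dir, 'de.yml'))
  FileUtils.touch(File.join(dir, 'notes.txt')) # not a .yml file, so ignored

  [languages_in(dir), available_in?(dir, 'en'), available_in?(dir, 'fr')]
end

p languages
p has_en
p has_fr
```

A language code counts as available purely because a file with that base name exists; neither method opens or validates the file.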
.load(language_code, tier: :top_1000) ⇒ Hash{Symbol => Set}
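Load common words for a language from its local YAML file in DATA_DIR. When no file exists for language_code, the method does not raise: it returns a hash with empty :tiers and :metadata whose source is 'none'.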
    # File 'lib/kotoshu/data/common_words_loader.rb', line 39
    def load(language_code, tier: :top_1000)
      yaml_file = File.join(DATA_DIR, "#{language_code}.yml")

      if File.exist?(yaml_file)
        load_from_yaml(yaml_file, tier)
      else
        {
          tiers: empty_tiers,
          metadata: { source: 'none', language: language_code }
        }
      end
    end
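The fallback branch can be sketched on its own. The body of `empty_tiers` is not shown on this page, so the version below (one empty Set per tier) is an assumption, and `fallback_for` is a hypothetical name used only for this example.

```ruby
require 'set'

# Assumed shape of the empty_tiers helper: one empty Set per tier.
def empty_tiers
  { top_50: Set.new, top_200: Set.new, top_1000: Set.new }
end

# Hypothetical helper mirroring the else-branch of .load for a language
# that has no local YAML file.
def fallback_for(language_code)
  { tiers: empty_tiers, metadata: { source: 'none', language: language_code } }
end

result = fallback_for('xx')

# Callers can detect the fallback from the metadata rather than a nil check.
if result[:metadata][:source] == 'none'
  puts "no common-words data for #{result[:metadata][:language]}"
end

# Membership tests still work; they simply return false.
puts result[:tiers][:top_50].include?('the')
```

Returning empty sets instead of nil lets calling code run the same membership checks whether or not data exists for the language.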
.load_from_frequency_file(frequency_path) ⇒ Hash{Symbol => Set}
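Load from a GitHub frequency.json file (Phase 2 integration). Also handles the Kelly frequency-list format from kotoshu/frequency-list-kelly, in which each tier nests its word list under a 'words' key. Each tier in the result also contains the words of the smaller tiers. A missing file yields empty tiers and empty metadata.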
    # File 'lib/kotoshu/data/common_words_loader.rb', line 58
    def load_from_frequency_file(frequency_path)
      return { tiers: empty_tiers, metadata: {} } unless File.exist?(frequency_path)

      data = JSON.parse(File.read(frequency_path, encoding: 'UTF-8'))

      # Handle Kelly format: tiers[tier_name]['words']
      # Check if format has nested 'words' key (Kelly format)
      has_words_key = data.dig('tiers', 'top_50', 'words')

      tiers = if has_words_key
                # Kelly format: data['tiers']['top_50']['words']
                {
                  top_50: Set.new(data.dig('tiers', 'top_50', 'words') || []),
                  top_200: Set.new(
                    (data.dig('tiers', 'top_50', 'words') || []) +
                    (data.dig('tiers', 'top_200', 'words') || [])
                  ),
                  top_1000: Set.new(
                    (data.dig('tiers', 'top_50', 'words') || []) +
                    (data.dig('tiers', 'top_200', 'words') || []) +
                    (data.dig('tiers', 'top_1000', 'words') || [])
                  )
                }
              else
                # Legacy format: data['tiers']['top_50'] is array
                {
                  top_50: Set.new(data.dig('tiers', 'top_50') || []),
                  top_200: Set.new((data.dig('tiers', 'top_50') || []) + (data.dig('tiers', 'top_200') || [])),
                  top_1000: Set.new(
                    (data.dig('tiers', 'top_50') || []) +
                    (data.dig('tiers', 'top_200') || []) +
                    (data.dig('tiers', 'top_1000') || [])
                  )
                }
              end

      metadata = data['metadata'] || {}

      { tiers: tiers, metadata: metadata }
    end
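The cumulative tiers can be illustrated with a compact sketch that accepts both JSON layouts. It is a simplified restatement, not the library's code: `cumulative_tiers` and `TIER_NAMES` are hypothetical names, the sample words are invented, and the layout is detected per tier with `is_a?(Hash)` rather than by probing `top_50` once. That choice also avoids calling `dig` with a String key through an Array, which raises a TypeError in Ruby.

```ruby
require 'json'
require 'set'

TIER_NAMES = %w[top_50 top_200 top_1000].freeze

# Builds cumulative tiers from an already-parsed frequency hash.
def cumulative_tiers(data)
  raw = data['tiers'] || {}
  seen = []
  TIER_NAMES.each_with_object({}) do |name, tiers|
    entry = raw[name]
    # Kelly layout nests the list under 'words'; legacy layout is a bare array.
    words = entry.is_a?(Hash) ? entry['words'] : entry
    seen += words || []
    # Each tier holds everything seen so far, so larger tiers include smaller ones.
    tiers[name.to_sym] = Set.new(seen)
  end
end

# Invented sample payloads showing the two layouts.
kelly_json  = '{"tiers":{"top_50":{"words":["the","of"]},' \
              '"top_200":{"words":["house"]},"top_1000":{"words":["garden"]}}}'
legacy_json = '{"tiers":{"top_50":["the","of"],"top_200":["house"],"top_1000":["garden"]}}'

kelly  = cumulative_tiers(JSON.parse(kelly_json))
legacy = cumulative_tiers(JSON.parse(legacy_json))

puts kelly[:top_200].to_a.inspect
puts kelly == legacy
```

With these inputs both layouts produce the same three sets, and a word listed only under `top_50` is still found when checking `top_1000`.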