Class: Kotoshu::Data::CommonWordsLoader

Inherits:
Object
Defined in:
lib/kotoshu/data/common_words_loader.rb

Overview

Loads and provides access to common words data for all supported languages.

This loader supports loading from:

  1. Local YAML files at lib/kotoshu/data/common_words/<language_code>.yml

  2. frequency.json files downloaded from GitHub (via LanguageCache)

Each language file contains:

  • metadata: Source information, word count, last updated

  • tiers: Top 50, top 200, and top 1000 most common words
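As an illustration of that layout, here is a minimal sketch of what such a file might look like and how it could be read. The key names `tiers`, `top_50`, `top_200`, `top_1000`, and `metadata` match those read by `load_from_frequency_file`; that the YAML files use the same keys is an assumption, and the metadata field names and all values are hypothetical.

```ruby
require 'yaml'
require 'set'

# Hypothetical file contents; the real files under DATA_DIR may differ.
SAMPLE_YAML = <<~YAML
  metadata:
    source: example-corpus
    word_count: 5
    last_updated: "2024-01-01"
  tiers:
    top_50:
      - the
      - of
    top_200:
      - house
    top_1000:
      - lantern
      - orchard
YAML

# safe_load only permits basic types, so the date is kept as a quoted string.
data = YAML.safe_load(SAMPLE_YAML)

# Build a Set for fast membership checks, defaulting to empty if the tier is absent.
top_50 = Set.new(data.dig('tiers', 'top_50') || [])

puts top_50.include?('the')
puts data.dig('metadata', 'source')
```

Storing each tier as a Set keeps `include?` lookups constant time, which matters when every token of a text is checked against the list.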

Examples:

Loading English common words

result = CommonWordsLoader.load('en')
result[:tiers][:top_50].include?('the')  # => true

Getting available languages

CommonWordsLoader.available_languages  # => ['de', 'en', 'es', 'fr', 'pt', 'ru']

Loading with tier specification

CommonWordsLoader.load('en', tier: :top_200)  # Combines top_50 + top_200

Constant Summary

Default data directory (local YAML files)

DATA_DIR = File.expand_path('../common_words', __FILE__).freeze

Class Method Details

.available?(language_code) ⇒ Boolean

Check if a language has local data.

Parameters:

  • language_code (String)

    ISO 639-1 language code

Returns:

  • (Boolean)

    True if data file exists



# File 'lib/kotoshu/data/common_words_loader.rb', line 110

def available?(language_code)
  File.exist?(File.join(DATA_DIR, "#{language_code}.yml"))
end

.available_languages ⇒ Array<String>

Get list of languages with local YAML files.

Returns:

  • (Array<String>)

    List of available language codes



# File 'lib/kotoshu/data/common_words_loader.rb', line 102

def available_languages
  Dir.glob(File.join(DATA_DIR, '*.yml')).map { |f| File.basename(f, '.yml') }
end
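
Both lookups depend only on which `.yml` files exist in the data directory. The sketch below reproduces that logic against a temporary directory so it can be run without the gem installed. The helper names `language_available?` and `available_language_codes` are hypothetical stand-ins that take the directory as an argument instead of reading `DATA_DIR`.

```ruby
require 'tmpdir'
require 'fileutils'

# Stand-in for .available?: true when <code>.yml exists in data_dir.
def language_available?(data_dir, language_code)
  File.exist?(File.join(data_dir, "#{language_code}.yml"))
end

# Stand-in for .available_languages: basenames of all .yml files.
# The explicit sort is an addition here, to make the order deterministic
# on Ruby versions where Dir.glob does not sort its results.
def available_language_codes(data_dir)
  Dir.glob(File.join(data_dir, '*.yml')).map { |f| File.basename(f, '.yml') }.sort
end

result = Dir.mktmpdir do |dir|
  # Two language files plus an unrelated file that should be ignored.
  %w[en de].each { |code| FileUtils.touch(File.join(dir, "#{code}.yml")) }
  FileUtils.touch(File.join(dir, 'notes.txt'))

  {
    codes: available_language_codes(dir),
    en: language_available?(dir, 'en'),
    ja: language_available?(dir, 'ja')
  }
end

p result
```

Only files ending in `.yml` are reported, so a file saved as `en.yaml` would not be picked up by either check.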

.load(language_code, tier: :top_1000) ⇒ Hash{Symbol => Set}

Load common words for a language.

Parameters:

  • language_code (String)

    ISO 639-1 language code (e.g., ‘en’, ‘de’)

  • tier (Symbol) (defaults to: :top_1000)

    Tier level: :top_50, :top_200, or :top_1000

Returns:

  • (Hash{Symbol => Set})

    Hash with :tiers (tier sets) and :metadata



# File 'lib/kotoshu/data/common_words_loader.rb', line 39

def load(language_code, tier: :top_1000)
  yaml_file = File.join(DATA_DIR, "#{language_code}.yml")

  if File.exist?(yaml_file)
    load_from_yaml(yaml_file, tier)
  else
    {
      tiers: empty_tiers,
      metadata: { source: 'none', language: language_code }
    }
  end
end
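
When no YAML file exists, `.load` does not raise; it returns empty tiers with a metadata source of 'none'. A caller that needs to tell "no data" apart from "word not common" can check that marker. The sketch below mirrors only the fallback branch. `fallback_result` is a hypothetical name, and the body of `empty_tiers` is an assumption, since that helper is not shown on this page.

```ruby
require 'set'

# Assumed shape of the private empty_tiers helper: one empty Set per tier.
def empty_tiers
  { top_50: Set.new, top_200: Set.new, top_1000: Set.new }
end

# Mirrors the else-branch of .load for a language without a data file.
def fallback_result(language_code)
  { tiers: empty_tiers, metadata: { source: 'none', language: language_code } }
end

result = fallback_result('xx')

# The 'none' source marks a language with no local data.
has_data = result[:metadata][:source] != 'none'

# Membership checks still work on the empty sets; they simply return false.
is_common = result[:tiers][:top_50].include?('the')

puts "data available: #{has_data}"
puts "'the' is common: #{is_common}"
```

Whether metadata loaded from a real YAML file uses symbol or string keys is not shown here, so a caller may need to check both forms.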

.load_from_frequency_file(frequency_path) ⇒ Hash{Symbol => Set}

Load from a GitHub frequency.json file (Phase 2 integration). Also handles the Kelly frequency-list format from kotoshu/frequency-list-kelly.

Parameters:

  • frequency_path (String)

    Path to frequency.json file

Returns:

  • (Hash{Symbol => Set})

    Hash with :tiers and :metadata



# File 'lib/kotoshu/data/common_words_loader.rb', line 58

def load_from_frequency_file(frequency_path)
  return { tiers: empty_tiers, metadata: {} } unless File.exist?(frequency_path)

  data = JSON.parse(File.read(frequency_path, encoding: 'UTF-8'))

  # Detect the Kelly format, where each tier is a hash with a nested 'words' key.
  # Testing the tier's class avoids calling dig(..., 'words') on a legacy array
  # tier, which would raise TypeError.
  has_words_key = data.dig('tiers', 'top_50').is_a?(Hash)

  tiers = if has_words_key
            # Kelly format: data['tiers']['top_50']['words']
            {
              top_50: Set.new(data.dig('tiers', 'top_50', 'words') || []),
              top_200: Set.new(
                (data.dig('tiers', 'top_50', 'words') || []) +
                (data.dig('tiers', 'top_200', 'words') || [])
              ),
              top_1000: Set.new(
                (data.dig('tiers', 'top_50', 'words') || []) +
                (data.dig('tiers', 'top_200', 'words') || []) +
                (data.dig('tiers', 'top_1000', 'words') || [])
              )
            }
          else
            # Legacy format: data['tiers']['top_50'] is array
            {
              top_50: Set.new(data.dig('tiers', 'top_50') || []),
              top_200: Set.new((data.dig('tiers', 'top_50') || []) + (data.dig('tiers', 'top_200') || [])),
              top_1000: Set.new(
                (data.dig('tiers', 'top_50') || []) +
                (data.dig('tiers', 'top_200') || []) +
                (data.dig('tiers', 'top_1000') || [])
              )
            }
          end

  metadata = data['metadata'] || {}

  { tiers: tiers, metadata: metadata }
end
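
The two branches above differ only in where each tier's word list lives, and each tier is the union of itself and all smaller tiers. A minimal sketch of that idea, assuming the same JSON key names as the method above, is shown below. `words_for` and `cumulative_tiers` are hypothetical helpers, not part of the library. Unlike the original, this sketch detects the format per tier by checking whether the entry is a Hash, since `Array#dig` raises TypeError when given a string key.

```ruby
require 'json'
require 'set'

TIER_NAMES = %w[top_50 top_200 top_1000].freeze

# Returns the word array for one tier, whichever format the file uses.
def words_for(data, tier_name)
  entry = data.dig('tiers', tier_name)
  entry = entry['words'] if entry.is_a?(Hash) # Kelly format: nested 'words' key
  entry || []                                 # missing tier: empty list
end

# Builds cumulative tiers: each tier also contains every smaller tier.
def cumulative_tiers(data)
  running = []
  TIER_NAMES.each_with_object({}) do |name, tiers|
    running += words_for(data, name)
    tiers[name.to_sym] = Set.new(running)
  end
end

# Illustrative inputs only; real frequency.json files are much larger.
kelly = JSON.parse('{"tiers":{"top_50":{"words":["the","of"]},"top_200":{"words":["house"]}}}')
legacy = JSON.parse('{"tiers":{"top_50":["the","of"],"top_200":["house"],"top_1000":["lantern"]}}')

kelly_tiers = cumulative_tiers(kelly)
legacy_tiers = cumulative_tiers(legacy)

puts kelly_tiers[:top_200].to_a.inspect
puts legacy_tiers[:top_1000].size
```

Accumulating into one running list removes the repeated `dig` calls, at the cost of handling files that mix both formats, which the original method does not attempt.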