Class: Kotoshu::Data::CommonWordsLoader

Inherits:

Object

Defined in: lib/kotoshu/data/common_words_loader.rb

Overview

Loads and provides access to common words data for all supported languages.

This loader supports loading from:

Local YAML files in lib/kotoshu/data/common_words/<language_code>.yml
frequency.json files downloaded from GitHub (via LanguageCache)

Each language file contains:

metadata: Source information, word count, last updated
tiers: Top 50, top 200, and top 1000 most common words
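
As a rough illustration of the file layout described above, the sketch below parses a small YAML document with `metadata` and `tiers` sections and turns each tier into a Set. The sample contents, the metadata key names (`source`, `word_count`, `last_updated`), and the helper `parse_common_words` are assumptions made for this example; the real files and the loader's internal YAML handling may differ.

```ruby
require 'yaml'
require 'set'

# Hypothetical file contents; the real key names and words may differ.
SAMPLE_YAML = <<~YAML
  metadata:
    source: example-corpus
    word_count: 5
    last_updated: "2024-01-01"
  tiers:
    top_50: [the, of, and]
    top_200: [house]
    top_1000: [lantern]
YAML

# Hypothetical helper: parse the text and convert each tier's list into a Set.
def parse_common_words(yaml_text)
  data = YAML.safe_load(yaml_text)
  tiers = data.fetch('tiers', {}).to_h do |name, words|
    [name.to_sym, Set.new(words || [])]
  end
  { tiers: tiers, metadata: data.fetch('metadata', {}) }
end

parsed = parse_common_words(SAMPLE_YAML)
puts parsed[:tiers][:top_50].include?('the') # prints true
puts parsed[:metadata]['source']             # prints example-corpus
```

Sets are used here because the documented return type is a hash of Sets, which keeps membership checks cheap.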

Examples:

Loading English common words

result = CommonWordsLoader.load('en')
result[:tiers][:top_50].include?('the')  # => true

Getting available languages

CommonWordsLoader.available_languages  # => ['de', 'en', 'es', 'fr', 'pt', 'ru']

Loading with tier specification

CommonWordsLoader.load('en', tier: :top_200)  # Combines top_50 + top_200

Constant Summary

DATA_DIR = File.expand_path('../common_words', __FILE__).freeze

Default data directory (local YAML files).

Class Method Summary

.available?(language_code) ⇒ Boolean

Check if a language has local data.

.available_languages ⇒ Array<String>

Get list of languages with local YAML files.

.load(language_code, tier: :top_1000) ⇒ Hash{Symbol => Set}

Load common words for a language.

.load_from_frequency_file(frequency_path) ⇒ Hash{Symbol => Set}

Load from GitHub frequency.json (Phase 2 integration).

Class Method Details

.available?(language_code) ⇒ Boolean

Check if a language has local data.

Parameters:

language_code (String) —

ISO 639-1 language code

Returns:

(Boolean) —

True if data file exists



# File 'lib/kotoshu/data/common_words_loader.rb', line 110

def available?(language_code)
  File.exist?(File.join(DATA_DIR, "#{language_code}.yml"))
end

.available_languages ⇒ Array<String>

Get list of languages with local YAML files.

Returns:

(Array<String>) —

List of available language codes



# File 'lib/kotoshu/data/common_words_loader.rb', line 102

def available_languages
  Dir.glob(File.join(DATA_DIR, '*.yml')).map { |f| File.basename(f, '.yml') }
end

.load(language_code, tier: :top_1000) ⇒ Hash{Symbol => Set}

Load common words for a language.

Parameters:

language_code (String) —

ISO 639-1 language code (e.g., 'en', 'de')
tier (Symbol) (defaults to: :top_1000) —

Tier level: :top_50, :top_200, or :top_1000

Returns:

(Hash{Symbol => Set}) —

Hash with :tiers (tier sets) and :metadata

# File 'lib/kotoshu/data/common_words_loader.rb', line 39

def load(language_code, tier: :top_1000)
  yaml_file = File.join(DATA_DIR, "#{language_code}.yml")

  if File.exist?(yaml_file)
    load_from_yaml(yaml_file, tier)
  else
    {
      tiers: empty_tiers,
      metadata: { source: 'none', language: language_code }
    }
  end
end
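
Note that the method above does not raise when a language file is missing: it returns empty tiers with `source: 'none'` in the metadata. The sketch below shows one way a caller might detect that case. It is a self-contained stand-in, not the library code: `empty_tiers` is not shown on this page, so its shape here (three empty Sets) is an assumption, and `load_or_fallback` and `missing_language?` are hypothetical names.

```ruby
require 'set'
require 'tmpdir'

# Assumed shape of the helper `empty_tiers` (not shown on this page).
def empty_tiers
  { top_50: Set.new, top_200: Set.new, top_1000: Set.new }
end

# Hypothetical stand-in that mirrors only the fallback branch of `load`.
def load_or_fallback(data_dir, language_code)
  yaml_file = File.join(data_dir, "#{language_code}.yml")
  unless File.exist?(yaml_file)
    return { tiers: empty_tiers, metadata: { source: 'none', language: language_code } }
  end

  raise NotImplementedError, 'YAML parsing is omitted in this sketch'
end

# Hypothetical helper: true when the result came from the fallback branch.
def missing_language?(result)
  result[:metadata][:source] == 'none'
end

Dir.mktmpdir do |dir|
  result = load_or_fallback(dir, 'xx')
  puts missing_language?(result)      # prints true
  puts result[:tiers][:top_50].empty? # prints true
end
```

Checking the metadata rather than rescuing an exception lets callers treat an unsupported language as "no common words known" and continue.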

.load_from_frequency_file(frequency_path) ⇒ Hash{Symbol => Set}

Load from GitHub frequency.json (Phase 2 integration). Also handles the Kelly frequency-list format from kotoshu/frequency-list-kelly.

Parameters:

frequency_path (String) —

Path to frequency.json file

Returns:

(Hash{Symbol => Set}) —

Hash with :tiers and :metadata

# File 'lib/kotoshu/data/common_words_loader.rb', line 58

def load_from_frequency_file(frequency_path)
  return { tiers: empty_tiers, metadata: {} } unless File.exist?(frequency_path)

  data = JSON.parse(File.read(frequency_path, encoding: 'UTF-8'))

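  # Kelly format nests each tier's word list under a 'words' key.
  # Check the type first: calling dig with a string key on a legacy array raises TypeError.
  top_50_entry = data.dig('tiers', 'top_50')
  has_words_key = top_50_entry.is_a?(Hash) && top_50_entry.key?('words')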

  tiers = if has_words_key
            # Kelly format: data['tiers']['top_50']['words']
            {
              top_50: Set.new(data.dig('tiers', 'top_50', 'words') || []),
              top_200: Set.new(
                (data.dig('tiers', 'top_50', 'words') || []) +
                (data.dig('tiers', 'top_200', 'words') || [])
              ),
              top_1000: Set.new(
                (data.dig('tiers', 'top_50', 'words') || []) +
                (data.dig('tiers', 'top_200', 'words') || []) +
                (data.dig('tiers', 'top_1000', 'words') || [])
              )
            }
          else
            # Legacy format: data['tiers']['top_50'] is array
            {
              top_50: Set.new(data.dig('tiers', 'top_50') || []),
              top_200: Set.new((data.dig('tiers', 'top_50') || []) + (data.dig('tiers', 'top_200') || [])),
              top_1000: Set.new(
                (data.dig('tiers', 'top_50') || []) +
                (data.dig('tiers', 'top_200') || []) +
                (data.dig('tiers', 'top_1000') || [])
              )
            }
          end

  metadata = data['metadata'] || {}

  { tiers: tiers, metadata: metadata }
end
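
The two input shapes handled above, and the cumulative tier construction, can be illustrated with a compact sketch. This is an alternative formulation for explanation only, not the library's implementation: the helper names `tier_words` and `cumulative_tiers`, the `TIER_ORDER` constant, and the sample JSON are assumptions. Unlike the method above, it detects the format per tier instead of inspecting only `top_50`.

```ruby
require 'json'
require 'set'

TIER_ORDER = %w[top_50 top_200 top_1000].freeze

# Return the word list for one tier, accepting either format.
def tier_words(data, name)
  entry = data.dig('tiers', name)
  entry = entry['words'] if entry.is_a?(Hash) # Kelly format nests the list
  entry || []
end

# Build cumulative tiers: each tier also contains every smaller tier.
def cumulative_tiers(data)
  words_so_far = []
  TIER_ORDER.to_h do |name|
    words_so_far += tier_words(data, name)
    [name.to_sym, Set.new(words_so_far)]
  end
end

# Hypothetical sample inputs in the two formats.
KELLY_JSON = '{"tiers":{"top_50":{"words":["the"]},"top_200":{"words":["house"]}}}'
LEGACY_JSON = '{"tiers":{"top_50":["the"],"top_200":["house"]}}'

kelly = cumulative_tiers(JSON.parse(KELLY_JSON))
legacy = cumulative_tiers(JSON.parse(LEGACY_JSON))

puts kelly == legacy              # prints true
puts kelly[:top_200].to_a.inspect # prints ["the", "house"]
```

Because the tiers are cumulative, a lookup against `:top_1000` also matches words from the two smaller tiers, which is consistent with the "Combines top_50 + top_200" behaviour noted in the overview examples.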
