Class: Kotoshu::Languages::Japanese::POSTagger

Inherits:
Components::PosTagger
Defined in:
lib/kotoshu/languages/ja/language.rb

Overview

Japanese POS tagger using morphological analysis.

Japanese POS tagging is integrated with tokenization via the Suika gem, which provides both segmentation and part-of-speech information.

Suika output format: surface<TAB>POS,subcat1,subcat2,subcat3,conj_type,conj_form,lemma,reading,pronunciation

Example: “すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ”
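As a minimal sketch of how a line in this format could be split into named fields: `parse_analysis_line` and its key names are hypothetical helpers for illustration, not part of Kotoshu or Suika, and the treatment of `*` as an empty field is an assumption based on the example line.

```ruby
# Hypothetical helper: split one "surface<TAB>features" line into named fields.
def parse_analysis_line(line)
  # Field names follow the format description; they are illustrative only.
  keys = %i[pos subcat1 subcat2 subcat3 conj_type conj_form lemma reading pronunciation]

  surface, features = line.chomp.split("\t", 2)
  values = features.to_s.split(",")

  # Assumption: "*" marks an unused field, so it is mapped to nil here.
  fields = keys.zip(values).to_h { |key, value| [key, value == "*" ? nil : value] }
  { surface: surface }.merge(fields)
end

parsed = parse_analysis_line("すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ")
puts parsed[:pos]      # main category (field 0)
puts parsed[:subcat1]  # first subcategory (field 1)
puts parsed[:lemma]
```

The first two feature fields (`pos` and `subcat1`) are the ones the tag mapping below is keyed on.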

POS tags use universal English categories for common types, and ROMAJI (Latin script) identifiers based on Japanese terminology only for language-specific categories without universal equivalents.

Constant Summary

FLAG_TO_POS =

Japanese POS tag mappings from Suika to standard identifiers.

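Strategy: Use universal English POS tags (NOUN, VERB, etc.) with English suffixes for subcategories, falling back to a romanized Japanese term only where no English equivalent exists (as in NOUN_SA_CONNECTION). All identifiers are ASCII.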

Main categories (field 0) - universal:

  • 名詞 → NOUN

  • 動詞 → VERB

  • 助詞 → PARTICLE

  • 助動詞 → AUX

Noun subcategories (field 1):

  • NOUN_COMMON: 一般 - common nouns

  • NOUN_PROPER: 固有名詞 - proper nouns

  • NOUN_PROPER_GEOGRAPHIC: 固有名詞,地域 - proper noun, geographic

  • NOUN_SUFFIX: 接尾 - suffixes

  • NOUN_DEPENDENT: 非自立 - dependent nouns (cannot stand alone)

  • NOUN_SA_CONNECTION: サ変接続 - sa-irregular connection nouns (nouns that form verbs with する)

Particle subcategories (field 1):

  • PARTICLE_GRAMMAR: 格助詞 - grammar/case particles (が, を, に, etc.)

  • PARTICLE_BINDING: 係助詞 - binding particles (は, も, etc.)

  • PARTICLE_ADNOMINAL: 連体化 - adnominal particles (の)

Verb subcategories (field 1):

  • VERB_INDEPENDENT: 自立 - independent verbs

{
  # Main categories - universal English
  '名詞' => 'NOUN',
  '動詞' => 'VERB',
  '助詞' => 'PARTICLE',
  '助動詞' => 'AUX',

  # Noun subcategories
  '名詞,一般' => 'NOUN_COMMON',
  '名詞,固有名詞' => 'NOUN_PROPER',
  '名詞,固有名詞,地域' => 'NOUN_PROPER_GEOGRAPHIC',
  '名詞,接尾' => 'NOUN_SUFFIX',
  '名詞,非自立' => 'NOUN_DEPENDENT',
  '名詞,サ変接続' => 'NOUN_SA_CONNECTION',

  # Particle subcategories
  '助詞,格助詞' => 'PARTICLE_GRAMMAR',
  '助詞,係助詞' => 'PARTICLE_BINDING',
  '助詞,連体化' => 'PARTICLE_ADNOMINAL',

  # Verb subcategories
  '動詞,自立' => 'VERB_INDEPENDENT',
}.freeze
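
The keys above are comma-joined prefixes of the feature fields, so one plausible way to resolve a tag is to try the longest prefix first and fall back to shorter ones. The sketch below assumes that strategy; the page does not show how the tagger actually performs the lookup, and `most_specific_tag` is a hypothetical helper, not part of Kotoshu.

```ruby
# Hypothetical helper: find the most specific tag for a list of feature
# fields by trying the longest comma-joined key first.
def most_specific_tag(fields, mapping)
  # Stop at the first "*" placeholder so "名詞,一般,*,*" is looked up as "名詞,一般".
  meaningful = fields.take_while { |field| field != "*" }

  meaningful.length.downto(1) do |length|
    tag = mapping[meaningful.first(length).join(",")]
    return tag if tag
  end
  nil
end

# A subset of the FLAG_TO_POS entries listed above.
mapping = {
  "名詞" => "NOUN",
  "名詞,固有名詞" => "NOUN_PROPER",
  "名詞,固有名詞,地域" => "NOUN_PROPER_GEOGRAPHIC",
  "助詞" => "PARTICLE",
  "助詞,格助詞" => "PARTICLE_GRAMMAR"
}

geographic = most_specific_tag(%w[名詞 固有名詞 地域 一般], mapping) # three-field key matches
proper     = most_specific_tag(%w[名詞 固有名詞 人名 姓], mapping)   # falls back to two fields
case_part  = most_specific_tag(%w[助詞 格助詞 一般 *], mapping)
unmapped   = most_specific_tag(%w[形容詞 自立 * *], mapping)         # no entry at any length
```

With this fallback, a category missing from the mapping yields `nil` rather than an error, which matches the `pos_tag: nil` case that `#tag` already allows for.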

Instance Method Summary

Methods inherited from Components::PosTagger

#tag_word

Constructor Details

#initialize(dictionary_path: nil, flag_mapping: FLAG_TO_POS) ⇒ POSTagger

Returns a new instance of POSTagger.



# File 'lib/kotoshu/languages/ja/language.rb', line 225

def initialize(dictionary_path: nil, flag_mapping: FLAG_TO_POS)
  @dictionary_path = dictionary_path
  @flag_mapping = flag_mapping
  @suika_tagger = nil
  @lookup_cache = {}
end

Instance Method Details

#clear_cache ⇒ Object



# File 'lib/kotoshu/languages/ja/language.rb', line 258

def clear_cache
  @lookup_cache.clear
end

#flag_mapping ⇒ Object



# File 'lib/kotoshu/languages/ja/language.rb', line 250

def flag_mapping
  @flag_mapping
end

#flag_mapping=(mapping) ⇒ Object



# File 'lib/kotoshu/languages/ja/language.rb', line 254

def flag_mapping=(mapping)
  @flag_mapping = mapping
end

#tag(tokens) ⇒ Object



# File 'lib/kotoshu/languages/ja/language.rb', line 232

def tag(tokens)
  return [] if tokens.nil? || tokens.empty?

  # Load Suika on first use and create the tagger once per instance
  require "suika" unless defined?(::Suika)
  @suika_tagger ||= ::Suika::Tagger.new

  tokens.map do |token|
    word = token[:token]
    if word.nil? || word.empty?
      token.merge(pos_tag: nil, lemma: nil)
    else
      lookup_result = lookup_with_pos(word)
      token.merge(pos_tag: lookup_result[:pos_tag], lemma: lookup_result[:lemma] || word)
    end
  end
end
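
To illustrate the shape of the input and output, here is a standalone sketch that repeats the merge logic of `#tag` but replaces the dictionary lookup with a plain Hash, so it runs without Suika. `tag_tokens`, the `lexicon` Hash, and the `:position` key are assumptions for illustration only, as is the behavior for words missing from the lexicon; the page does not show what `lookup_with_pos` returns in that case.

```ruby
# Stand-in for #tag: same merge logic, but the lookup is a plain Hash
# (hypothetical, used only so the sketch is self-contained).
def tag_tokens(tokens, lexicon)
  return [] if tokens.nil? || tokens.empty?

  tokens.map do |token|
    word = token[:token]
    if word.nil? || word.empty?
      # Tokens without text get explicit nil annotations.
      token.merge(pos_tag: nil, lemma: nil)
    else
      # Assumption: an unknown word yields no tag and no lemma.
      entry = lexicon.fetch(word, {})
      token.merge(pos_tag: entry[:pos_tag], lemma: entry[:lemma] || word)
    end
  end
end

lexicon = { "すもも" => { pos_tag: "NOUN_COMMON", lemma: "すもも" } }
tokens  = [{ token: "すもも", position: 0 }, { token: "も", position: 3 }, { token: "" }]

tagged = tag_tokens(tokens, lexicon)
tagged.each { |t| p t }
```

Each input Hash is returned with `:pos_tag` and `:lemma` added and its other keys untouched; when no lemma is found, the surface form itself is used as the lemma.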