Class: Kotoshu::Languages::Japanese::POSTagger
- Inherits: Components::PosTagger
- Hierarchy: Object → Components::PosTagger → Kotoshu::Languages::Japanese::POSTagger
- Defined in: lib/kotoshu/languages/ja/language.rb
Overview
Japanese POS tagger using morphological analysis.
Japanese POS tagging is integrated with tokenization via the Suika gem, which provides both segmentation and part-of-speech information.
Suika output format: surface<TAB>POS,subcat1,subcat2,subcat3,conj_type,conj_form,lemma,reading,pronunciation
Example: “すもも<TAB>名詞,一般,*,*,*,*,すもも,スモモ,スモモ”
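To make the field layout concrete, here is a minimal sketch that splits one line of the format described above into named fields. `SuikaFeature` and `parse_suika_line` are hypothetical names introduced for illustration only. They are not part of Kotoshu or Suika, and padding short feature lists with nil is an assumption.

```ruby
# Field names follow the format line above; the struct itself is hypothetical.
SuikaFeature = Struct.new(
  :surface, :pos, :subcat1, :subcat2, :subcat3,
  :conj_type, :conj_form, :lemma, :reading, :pronunciation
)

# Split one "surface<TAB>feature,feature,..." line into named fields.
def parse_suika_line(line)
  surface, features = line.chomp.split("\t", 2)
  fields = features.to_s.split(",", -1)
  # Assumption: some entries may carry fewer than nine features, so pad with nil.
  fields += [nil] * (9 - fields.size) if fields.size < 9
  SuikaFeature.new(surface, *fields.first(9))
end

entry = parse_suika_line("すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ")
puts entry.pos      # 名詞
puts entry.subcat1  # 一般
puts entry.lemma    # すもも
```

Fields 0 and 1 of the feature list (`pos` and `subcat1` here) are the ones the mapping table keys on.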
POS tags use universal English categories (NOUN, VERB, etc.) for common types. Romaji (Latin-script) identifiers based on Japanese terminology are used only for language-specific categories that have no universal equivalent.
Constant Summary
- FLAG_TO_POS =
Japanese POS tag mappings from Suika to standard identifiers.
Strategy: Use universal English POS tags (NOUN, VERB, etc.) with English suffixes for subcategories. All identifiers are ASCII.
Main categories (field 0) - universal:
- 名詞 → NOUN
- 動詞 → VERB
- 助詞 → PARTICLE
- 助動詞 → AUX
Noun subcategories (field 1):
- NOUN_COMMON: 一般 - common nouns
- NOUN_PROPER: 固有名詞 - proper nouns
- NOUN_PROPER_GEOGRAPHIC: 固有名詞,地域 - proper noun, geographic
- NOUN_SUFFIX: 接尾 - suffixes
- NOUN_DEPENDENT: 非自立 - dependent nouns (cannot stand alone)
- NOUN_SA_CONNECTION: サ変接続 - sa-variant connection nouns
Particle subcategories (field 1):
- PARTICLE_GRAMMAR: 格助詞 - grammar/case particles (が, を, に, etc.)
- PARTICLE_BINDING: 係助詞 - binding particles (は, も, etc.)
- PARTICLE_ADNOMINAL: 連体化 - adnominal particles (の)
Verb subcategories (field 1):
- VERB_INDEPENDENT: 自立 - independent verbs
{
  # Main categories - universal English
  '名詞' => 'NOUN',
  '動詞' => 'VERB',
  '助詞' => 'PARTICLE',
  '助動詞' => 'AUX',

  # Noun subcategories
  '名詞,一般' => 'NOUN_COMMON',
  '名詞,固有名詞' => 'NOUN_PROPER',
  '名詞,固有名詞,地域' => 'NOUN_PROPER_GEOGRAPHIC',
  '名詞,接尾' => 'NOUN_SUFFIX',
  '名詞,非自立' => 'NOUN_DEPENDENT',
  '名詞,サ変接続' => 'NOUN_SA_CONNECTION',

  # Particle subcategories
  '助詞,格助詞' => 'PARTICLE_GRAMMAR',
  '助詞,係助詞' => 'PARTICLE_BINDING',
  '助詞,連体化' => 'PARTICLE_ADNOMINAL',

  # Verb subcategories
  '動詞,自立' => 'VERB_INDEPENDENT',
}.freeze
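Because the keys have one, two, or three comma-joined fields, one plausible way to resolve a tag is to try the most specific key first and fall back to shorter prefixes. The sketch below illustrates that idea under stated assumptions. `resolve_pos` and the trimmed `MAPPING` copy are hypothetical, and the page does not show how the tagger's own lookup actually consults the table.

```ruby
# Trimmed copy of the mapping above so the sketch runs on its own.
MAPPING = {
  '名詞' => 'NOUN',
  '名詞,固有名詞' => 'NOUN_PROPER',
  '名詞,固有名詞,地域' => 'NOUN_PROPER_GEOGRAPHIC',
  '動詞' => 'VERB',
  '動詞,自立' => 'VERB_INDEPENDENT'
}.freeze

# Hypothetical helper: try the longest key first, then drop trailing fields.
def resolve_pos(features, mapping = MAPPING)
  # Stop at the first "*" placeholder, which marks an unused subcategory slot.
  fields = features.first(3).take_while { |f| f && f != '*' }
  fields.size.downto(1) do |n|
    tag = mapping[fields.first(n).join(',')]
    return tag if tag
  end
  nil # no entry for this main category
end

puts resolve_pos(%w[名詞 固有名詞 地域])     # NOUN_PROPER_GEOGRAPHIC
puts resolve_pos(%w[名詞 固有名詞 人名])     # NOUN_PROPER
puts resolve_pos(%w[動詞 自立 *])            # VERB_INDEPENDENT
puts resolve_pos(%w[形容詞 自立 *]).inspect  # nil
```

With this fallback, a subcategory that has no entry of its own (such as 名詞,固有名詞,人名) still receives the nearest broader tag rather than nil.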
Instance Method Summary
- #clear_cache ⇒ Object
- #flag_mapping ⇒ Object
- #flag_mapping=(mapping) ⇒ Object
- #initialize(dictionary_path: nil, flag_mapping: FLAG_TO_POS) ⇒ POSTagger (constructor): A new instance of POSTagger.
- #tag(tokens) ⇒ Object
Methods inherited from Components::PosTagger
Constructor Details
#initialize(dictionary_path: nil, flag_mapping: FLAG_TO_POS) ⇒ POSTagger
Returns a new instance of POSTagger.
# File 'lib/kotoshu/languages/ja/language.rb', line 225
def initialize(dictionary_path: nil, flag_mapping: FLAG_TO_POS)
  @dictionary_path = dictionary_path
  @flag_mapping = flag_mapping
  @suika_tagger = nil
  @lookup_cache = {}
end
Instance Method Details
#clear_cache ⇒ Object
# File 'lib/kotoshu/languages/ja/language.rb', line 258
def clear_cache
  @lookup_cache.clear
end
#flag_mapping ⇒ Object
# File 'lib/kotoshu/languages/ja/language.rb', line 250
def flag_mapping
  @flag_mapping
end
#flag_mapping=(mapping) ⇒ Object
# File 'lib/kotoshu/languages/ja/language.rb', line 254
def flag_mapping=(mapping)
  @flag_mapping = mapping
end
#tag(tokens) ⇒ Object
# File 'lib/kotoshu/languages/ja/language.rb', line 232
def tag(tokens)
  return [] if tokens.nil? || tokens.empty?

  # Initialize Suika tagger
  require "suika" unless defined?(::Suika)
  @suika_tagger ||= ::Suika::Tagger.new

  tokens.map do |token|
    word = token[:token]
    if word.nil? || word.empty?
      token.merge(pos_tag: nil, lemma: nil)
    else
      lookup_result = lookup_with_pos(word)
      token.merge(pos_tag: lookup_result[:pos_tag], lemma: lookup_result[:lemma] || word)
    end
  end
end
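The input and output shapes of #tag may be easier to see in isolation. The following sketch mirrors the merge logic above but swaps the Suika-backed lookup for a stub table, so it runs without the gem. `tag_tokens`, `STUB_LOOKUP`, and the `:position` key are hypothetical, and the stub's tag values are illustrative rather than real Suika output.

```ruby
# Hypothetical stand-in for the tagger's private lookup (assumed to query Suika).
STUB_LOOKUP = {
  'すもも' => { pos_tag: 'NOUN_COMMON', lemma: 'すもも' },
  'も'     => { pos_tag: 'PARTICLE_BINDING', lemma: nil }
}.freeze

# Mirrors the merge logic of #tag for token hashes keyed by :token.
def tag_tokens(tokens, lookup = STUB_LOOKUP)
  return [] if tokens.nil? || tokens.empty?

  tokens.map do |token|
    word = token[:token]
    if word.nil? || word.empty?
      # Blank tokens pass through with empty annotations.
      token.merge(pos_tag: nil, lemma: nil)
    else
      result = lookup.fetch(word, { pos_tag: nil, lemma: nil })
      # Fall back to the surface form when no lemma is available.
      token.merge(pos_tag: result[:pos_tag], lemma: result[:lemma] || word)
    end
  end
end

tagged = tag_tokens([
  { token: 'すもも', position: 0 },
  { token: 'も', position: 3 },
  { token: '' }
])
tagged.each { |t| p t }
```

Each input hash keeps its existing keys and gains `:pos_tag` and `:lemma`. Because `Hash#merge` returns a new hash, the caller's token hashes are left unmodified.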