Class: Kotoshu::Language::Tokenizer::JapaneseTokenizer

Inherits: Base

Object → Base → Kotoshu::Language::Tokenizer::JapaneseTokenizer

Defined in: lib/kotoshu/language/tokenizer/japanese_tokenizer.rb

Overview

Tokenizer for Japanese text.

Uses the Suika gem for morphological analysis.

Suika is a pure Ruby Japanese morphological analyzer with a built-in dictionary derived from mecab-ipadic. It segments text into tokens and reports part-of-speech features for each one.

Direct Known Subclasses

Kotoshu::Languages::Japanese::Tokenizer

Constant Summary

WORD_SEPARATORS =

Japanese word separators. Kept simple because Suika handles tokenization.

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze

@@tagger =

Class variable that holds the shared Suika tagger instance.

nil
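
The class variable starts as nil and is filled on first use, so one tagger is shared by every tokenizer instance. A minimal sketch of that lazy-initialization pattern follows; `FakeTagger` and `DemoTokenizer` are hypothetical stand-ins used only for illustration and are not part of Kotoshu or Suika.

```ruby
# Hypothetical stand-in for an object that is costly to build,
# such as a morphological analyzer that loads a dictionary.
class FakeTagger
  @built = 0

  class << self
    attr_accessor :built
  end

  def initialize
    # Count constructions so the sharing is observable.
    self.class.built += 1
  end
end

# Hypothetical tokenizer showing only the memoization step.
class DemoTokenizer
  @@tagger = nil

  def tagger
    # Build the tagger on first call, then reuse it for every instance.
    @@tagger ||= FakeTagger.new
  end
end

a = DemoTokenizer.new.tagger
b = DemoTokenizer.new.tagger

puts a.equal?(b)      # true: both tokenizers share one object
puts FakeTagger.built # 1: the constructor ran only once
```

Because `||=` is not atomic, two threads that reach it at the same moment could each build a tagger; whether that matters depends on how the tokenizer is used.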

Instance Method Summary

#tokenize(text) ⇒ Object

Methods inherited from Base

#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?

Instance Method Details

#tokenize(text) ⇒ Object

# File 'lib/kotoshu/language/tokenizer/japanese_tokenizer.rb', line 24

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Initialize tagger once (class variable for reuse)
  @@tagger ||= ::Suika::Tagger.new

  # Suika::Tagger#parse returns an array of "surface\tfeatures" strings
  tokens = []
  parsed = @@tagger.parse(text)

  parsed.each do |token|
    # Suika returns: "すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ"
    # The surface form is tab-separated from the POS features
    surface = token.split("\t").first
    tokens << surface if surface && !surface.strip.empty?
  end

  tokens
end
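
The surface-extraction step can be tried without installing Suika. The sketch below assumes the tagger output has the "surface\tfeatures" shape described in the comments above. The helper name `extract_surfaces` and the sample lines are hypothetical: the lines are written by hand to match that shape and are not captured Suika output.

```ruby
# Hypothetical helper mirroring the loop in #tokenize; not part of Kotoshu.
def extract_surfaces(parsed)
  parsed.filter_map do |line|
    # The surface form is everything before the first tab.
    surface = line.split("\t").first
    # Drop missing or whitespace-only surfaces, as #tokenize does.
    surface unless surface.nil? || surface.strip.empty?
  end
end

# Hand-written sample lines in the "surface\tfeatures" shape.
sample = [
  "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ",
  "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
  " \t記号,空白,*,*,*,*,*"
]

p extract_surfaces(sample) # ["すもも", "も"]
```

The third sample line has a whitespace-only surface, so it is skipped, which is the same filtering #tokenize applies before returning its array.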