Class: Kotoshu::Language::Tokenizer::JapaneseTokenizer

Inherits:
Base
  • Object
Defined in:
lib/kotoshu/language/tokenizer/japanese_tokenizer.rb

Overview

Tokenizer for Japanese text.

Uses the Suika gem for morphological analysis.

Suika is a pure Ruby Japanese morphological analyzer with a built-in dictionary derived from mecab-ipadic. It segments text into tokens and attaches part-of-speech information to each one.

Direct Known Subclasses

Kotoshu::Languages::Japanese::Tokenizer

Constant Summary

WORD_SEPARATORS =

Word separators for Japanese text. The set is kept deliberately simple because Suika handles the actual tokenization.

/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze
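To illustrate what this pattern does and does not match, here is a minimal sketch that copies the regular expression into a local variable. The variable names and sample strings are made up for illustration, and this page does not show how the tokenizer itself applies the constant.

```ruby
# Copy of the pattern above, held in a local variable so the snippet
# runs on its own (the name word_separators is illustrative).
word_separators = /[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/

# ASCII punctuation and whitespace act as split points.
ascii_parts = "foo,bar baz".split(word_separators)

# The middle dot (U+00B7) is part of the set.
has_middle_dot = "東京·大阪".match?(word_separators)

# The ideographic comma (U+3001) is not part of the set.
has_ideographic_comma = "東京、大阪".match?(word_separators)

p ascii_parts
p has_middle_dot
p has_ideographic_comma
```

The character class lists ASCII punctuation plus the middle dot, so full-width Japanese punctuation passes through this pattern untouched.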

@@tagger =

Class variable to hold the Suika tagger instance

nil
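The class variable starts as nil and is filled in on first use. As a rough sketch of that build-once pattern, the following uses a hypothetical `TaggerCache` holder (not part of Kotoshu) and adds a `Mutex`, which is an assumption for callers that share the tokenizer across threads rather than something the source does.

```ruby
# Hypothetical holder illustrating lazy, build-once caching.
# The documented class keeps its tagger in @@tagger instead.
class TaggerCache
  @lock = Mutex.new
  @instance = nil

  class << self
    # Returns the cached object, building it with the given block
    # the first time it is requested.
    def fetch
      @lock.synchronize { @instance ||= yield }
    end
  end
end

builds = 0
first = TaggerCache.fetch { builds += 1; Object.new }
second = TaggerCache.fetch { builds += 1; Object.new }

p builds
p first.equal?(second)
```

The builder block runs only on the first call; later calls return the same object, which is the behavior one would want for an expensive dictionary load.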

Instance Method Summary

Methods inherited from Base

#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?

Instance Method Details

#tokenize(text) ⇒ Object



# File 'lib/kotoshu/language/tokenizer/japanese_tokenizer.rb', line 24

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Initialize tagger once (class variable for reuse)
  @@tagger ||= ::Suika::Tagger.new

  # Suika::Tagger#parse returns an array of "surface\tfeatures" strings
  tokens = []
  parsed = @@tagger.parse(text)

  parsed.each do |token|
    # Suika returns: "すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ"
    # The surface form is tab-separated from the POS features
    surface = token.split("\t").first
    tokens << surface if surface && !surface.strip.empty?
  end

  tokens
end
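
The surface-extraction step above can be tried without installing the gem. The sketch below assumes a hypothetical `FakeTagger` that returns hand-written lines in the same "surface\tfeatures" shape; the helper `surface_and_pos` is also illustrative and not part of Kotoshu. It additionally keeps the first feature field, which holds the part of speech.

```ruby
# Hypothetical stand-in for Suika::Tagger so the sketch runs without the gem.
# The returned lines are hand-written samples in "surface\tfeatures" form.
class FakeTagger
  def parse(_text)
    [
      "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ",
      "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
      "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ"
    ]
  end
end

# Illustrative helper: returns [surface, part_of_speech] pairs.
def surface_and_pos(tagger, text)
  return [] if text.nil? || text.strip.empty?

  tagger.parse(text).filter_map do |line|
    # Split only on the first tab: surface on the left, features on the right.
    surface, features = line.split("\t", 2)
    next if surface.nil? || surface.strip.empty?

    [surface, features.to_s.split(",").first]
  end
end

pairs = surface_and_pos(FakeTagger.new, "すもももも")
surfaces = pairs.map(&:first)

p pairs
p surfaces
```

With the gem installed, passing `Suika::Tagger.new` in place of `FakeTagger.new` should follow the same path, since `#tokenize` relies on the same tab-separated output.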