Class: Kotoshu::Language::Tokenizer::JapaneseTokenizer
- Defined in:
- lib/kotoshu/language/tokenizer/japanese_tokenizer.rb
Overview
Tokenizer for Japanese text.
Uses the Suika gem for morphological analysis.
Suika is a pure Ruby Japanese morphological analyzer with a built-in dictionary from mecab-ipadic. It provides proper tokenization with part-of-speech information.
Direct Known Subclasses
Constant Summary
- WORD_SEPARATORS =
Japanese word separators - keep it simple since Suika handles tokenization
/[\s"()\[\]{}<>,.;:!?\\\/|`~@#$%^&*·]/.freeze
- @@tagger =
Class variable to hold the Suika tagger instance
nil
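To see which characters WORD_SEPARATORS covers, the pattern can be exercised on its own. This is only a sketch: `SEPARATORS` is a local copy of the regex listed above (with `#` and `$` escaped to rule out any interpolation ambiguity), and this page does not show how the tokenizer itself applies the constant.

```ruby
# Local copy of the pattern, for illustration only.
SEPARATORS = /[\s"()\[\]{}<>,.;:!?\\\/|`~@\#\$%^&*·]/.freeze

sample = "東京(Tokyo),大阪"

# Splitting on the pattern leaves kana and kanji runs intact, because none of
# those characters appear in the character class.
p sample.split(SEPARATORS).reject(&:empty?)

# Full-width punctuation such as "、" is not part of the class.
p "、".match?(SEPARATORS)
```

The class lists ASCII punctuation plus the middle dot, which is consistent with the note that word segmentation is left to Suika.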
Instance Method Summary
Methods inherited from Base
#normalize, #skip_token?, #tokenize_with_positions, #word_boundary_regex, #word_char?
Instance Method Details
#tokenize(text) ⇒ Object
# File 'lib/kotoshu/language/tokenizer/japanese_tokenizer.rb', line 24

def tokenize(text)
  return [] if text.nil? || text.strip.empty?

  # Initialize tagger once (class variable for reuse)
  @@tagger ||= ::Suika::Tagger.new

  # Suika::Tagger#parse returns an array of "surface\tfeatures" strings
  tokens = []
  parsed = @@tagger.parse(text)

  parsed.each do |token|
    # Suika returns: "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ"
    # The surface form is tab-separated from the POS features
    surface = token.split("\t").first
    tokens << surface if surface && !surface.strip.empty?
  end

  tokens
end
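The surface-extraction step can be tried without installing Suika. The following is a minimal sketch, not the library's implementation: `StubTagger` and `surfaces_from` are hypothetical names, and the canned lines only imitate the "surface\tfeatures" format described in the comments above.

```ruby
# Hypothetical stand-in for ::Suika::Tagger. It ignores its input and returns
# canned lines in the "surface\tfeatures" format instead of analysing text.
class StubTagger
  def parse(_text)
    [
      "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ",
      "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
      "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ"
    ]
  end
end

# Hypothetical helper mirroring the loop in #tokenize: keep the text before
# the first tab and drop nil or blank surfaces.
def surfaces_from(tagger, text)
  return [] if text.nil? || text.strip.empty?

  tagger.parse(text).filter_map do |line|
    surface = line.split("\t").first
    surface unless surface.nil? || surface.strip.empty?
  end
end

p surfaces_from(StubTagger.new, "すももももも")
p surfaces_from(StubTagger.new, "   ")
```

Passing the tagger as an argument is a choice made for this sketch so that the parsing logic can be checked in isolation; the documented method instead caches a single tagger in `@@tagger`.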