CppjiebaRb
Ruby bindings for Cppjieba. Requires a C++11 compiler (gcc 4.8+).
The trie has high memory usage: with the default dictionary, it uses about 120 MB.
Installation
Add this line to your application's Gemfile:
gem 'cppjieba_rb', require: false
Or pin a version:
gem 'cppjieba_rb', '~> 0.4.2', require: false
Or install it yourself:
$ gem install cppjieba_rb
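Because the Gemfile entries above use `require: false`, Bundler does not load the gem automatically, so it has to be required explicitly before use. One way to defer that cost until segmentation is actually needed is a memoized accessor. This is a minimal sketch: `LazySegmenter`, `factory`, and `reset!` are hypothetical names and not part of cppjieba_rb. Only `CppjiebaRb::Segment.new` comes from this README.

```ruby
# Hypothetical helper, not part of cppjieba_rb.
# Loads the gem and builds the segmenter on first use, then reuses it.
# Note: this simple memoization is not thread-safe.
module LazySegmenter
  class << self
    # Lets callers swap in another builder (for example a stub in tests).
    attr_writer :factory

    # Returns the shared segmenter, building it on the first call only.
    def instance
      @instance ||= factory.call
    end

    # Drops the cached segmenter so the next call rebuilds it.
    def reset!
      @instance = nil
    end

    private

    # Default builder: require the gem lazily, then create a segmenter
    # in the default mode.
    def factory
      @factory ||= lambda do
        require 'cppjieba_rb'
        CppjiebaRb::Segment.new
      end
    end
  end
end

# Usage (requires the gem to be installed):
#   words = LazySegmenter.instance.segment "令狐冲是云计算行业的专家"
```

The dictionary is then loaded once per process, the first time `LazySegmenter.instance` is called.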
Usage
The segmentation modes are described in the cppjieba project.
Word segmentation Usage
Mix Segment mode (HMM with Max Prob, default):
require 'cppjieba_rb'
seg = CppjiebaRb::Segment.new # equivalent to "CppjiebaRb::Segment.new mode: :mix"
words = seg.segment "令狐冲是云计算行业的专家"
# 令狐冲 是 云 计算 行业 的 专家
Alternatively, use the convenience method:
CppjiebaRb.segment('令狐冲是云计算行业的专家', mode: :mix)
HMM or Max probability (mp) Segment mode:
seg = CppjiebaRb::Segment.new mode: :hmm # or mode: :mp
seg.segment "令狐冲是云计算行业的专家"
Word tagging Usage
require 'cppjieba_rb'
CppjiebaRb.segment_tag "我是蓝翔技工拖拉机学院手扶拖拉机专业的。"
# [{"我"=>"r"}, {"是"=>"v"}, {"蓝翔"=>"x"}, {"技工"=>"n"}, {"拖拉机"=>"n"}, {"学院"=>"n"}, {"手扶拖拉机"=>"n"}, {"专业"=>"n"}, {"的"=>"uj"}, {"。"=>"x"}]
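The tagging result shown above is an array of single-pair hashes, which is awkward to query directly. The following sketch flattens it and selects words by tag. `words_with_tag` is a hypothetical helper, not part of cppjieba_rb, and the sample data is copied from the output above so the snippet runs without the gem.

```ruby
# Sample data copied from the documented segment_tag output.
tagged = [
  {"我"=>"r"}, {"是"=>"v"}, {"蓝翔"=>"x"}, {"技工"=>"n"}, {"拖拉机"=>"n"},
  {"学院"=>"n"}, {"手扶拖拉机"=>"n"}, {"专业"=>"n"}, {"的"=>"uj"}, {"。"=>"x"}
]

# Hypothetical helper: returns the words carrying the given tag,
# in their original order.
def words_with_tag(tagged, tag)
  tagged
    .flat_map(&:to_a)                  # [{w => t}, ...] -> [[w, t], ...]
    .select { |_word, t| t == tag }    # keep pairs with the requested tag
    .map(&:first)                      # keep only the word
end

nouns = words_with_tag(tagged, "n")
# => ["技工", "拖拉机", "学院", "手扶拖拉机", "专业"]
```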
Keyword Extractor Usage
require 'cppjieba_rb'
CppjiebaRb.extract_keyword "我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。", 5
# [
# ["CEO", 11.739204307083542],
# ["升职", 10.8561552143],
# ["加薪", 10.642581114],
# ["手扶拖拉机", 10.0088573539],
# ["巅峰", 9.49395840471]
# ]
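The result above is an array of `[word, weight]` pairs, so ordinary Ruby collection methods are enough to post-process it. The sketch below reuses the documented output as sample data, so it runs without the gem. The `10.5` cutoff is an arbitrary value chosen for illustration.

```ruby
# Sample data copied from the documented extract_keyword output.
keywords = [
  ["CEO", 11.739204307083542],
  ["升职", 10.8561552143],
  ["加薪", 10.642581114],
  ["手扶拖拉机", 10.0088573539],
  ["巅峰", 9.49395840471]
]

# Look up a weight by word.
weights = keywords.to_h
weights["升职"] # => 10.8561552143

# Keep only the keywords at or above an (arbitrary) weight cutoff.
strong = keywords.select { |_word, weight| weight >= 10.5 }.map(&:first)
# => ["CEO", "升职", "加薪"]
```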
Contributing
- Fork it ( http://github.com/fantasticfears/cppjieba_rb/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
TODO
- Include the 367w dict and provide an option to use it.
- cppjieba uses a trie, which consumes a lot of memory.