How to use ChupaText as a Ruby library

You can use ChupaText as a Ruby library. The chupa-text command processes one input file per invocation, so extracting text from N input files means running the command N times and initializing ChupaText N times. For many inputs, this is inefficient.

Using ChupaText as a Ruby library lets you initialize it once and reuse it for every input.

Here is a simple example:

```
require "chupa-text"
gem "chupa-text-decomposer-html"

ChupaText::Decomposers.load

extractor = ChupaText::Extractor.new
extractor.apply_configuration(ChupaText::Configuration.default)

extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data.body)
end
extractor.extract("http://ranguba.org/ja/") do |text_data|
  puts(text_data.body)
end
```
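When there are many inputs, the repeated `extract` calls can be folded into a loop that shares one extractor. The following is only a sketch: `extract_bodies` is a hypothetical helper name, not part of ChupaText, and it assumes nothing beyond the `extract` and `body` calls used in the example above.

```ruby
# Hypothetical helper (not part of ChupaText): collect the body of every
# text data extracted from each input, reusing a single extractor.
#
# `extractor` is expected to respond to #extract(input) and to yield
# objects that respond to #body, like the configured
# ChupaText::Extractor in the example above.
def extract_bodies(extractor, inputs)
  bodies = {}
  inputs.each do |input|
    bodies[input] = []
    extractor.extract(input) do |text_data|
      # One input may yield more than one text data.
      bodies[input] << text_data.body
    end
  end
  bodies
end
```

With the extractor configured as above, `extract_bodies(extractor, ["http://ranguba.org/", "http://ranguba.org/ja/"])` would return a Hash that maps each input to an array of extracted bodies, so ChupaText is initialized once for the whole batch.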

It is better to use Bundler to manage decomposer plugins:

```
# Gemfile
source "https://rubygems.org"

gem "chupa-text-decomposer-html"
gem "chupa-text-decomposer-XXX"
# ...
```

Here is an example that uses the Gemfile:

```
require "bundler/setup"
# bundler/setup only sets up load paths, so chupa-text is still required explicitly.
require "chupa-text"

ChupaText::Decomposers.load

extractor = ChupaText::Extractor.new
extractor.apply_configuration(ChupaText::Configuration.default)

extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data.body)
end
extractor.extract("http://ranguba.org/ja/") do |text_data|
  puts(text_data.body)
end
```

Use ChupaText::Data#[] to get meta-data from the extracted text data. For example, you can get the title of an input HTML document:

```
extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data["title"])
end
```

Which meta-data is available depends on the decomposer. See each decomposer's documentation for details.
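Because the available keys vary by decomposer, it can be convenient to ask for several keys at once and see which ones come back. The following is only a sketch: `collect_metadata` is a hypothetical helper name, not part of ChupaText, and it relies only on `extract` and `ChupaText::Data#[]` as described above. What `#[]` returns for a key the decomposer does not provide is not covered here, so check the value before using it.

```ruby
# Hypothetical helper (not part of ChupaText): read the given meta-data
# keys from every text data extracted from `input`.
#
# `extractor` is expected to respond to #extract(input) and to yield
# objects that respond to #[], like ChupaText::Data.
def collect_metadata(extractor, input, keys)
  results = []
  extractor.extract(input) do |text_data|
    metadata = {}
    keys.each do |key|
      # ChupaText::Data#[] looks up one meta-data value by name.
      metadata[key] = text_data[key]
    end
    results << metadata
  end
  results
end
```

For example, `collect_metadata(extractor, "http://ranguba.org/", ["title"])` would return an array with one Hash per extracted text data, each holding the requested keys.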