Class: Coelacanth::Extractor

Inherits:
Object
Defined in:
lib/coelacanth/extractor.rb,
lib/coelacanth/extractor/utilities.rb,
lib/coelacanth/extractor/normalizer.rb,
lib/coelacanth/extractor/preprocessor.rb,
lib/coelacanth/extractor/weak_ml_probe.rb,
lib/coelacanth/extractor/fallback_probe.rb,
lib/coelacanth/extractor/metadata_probe.rb,
lib/coelacanth/extractor/heuristic_probe.rb,
lib/coelacanth/extractor/image_collector.rb,
lib/coelacanth/extractor/markdown_renderer.rb,
lib/coelacanth/extractor/morphological_analyzer.rb,
lib/coelacanth/extractor/eyecatch_image_extractor.rb,
lib/coelacanth/extractor/markdown_listing_collector.rb

Overview

High-level API for extracting articles without site-specific selectors.

Defined Under Namespace

Modules: Utilities
Classes: EyecatchImageExtractor, FallbackProbe, HeuristicProbe, ImageCollector, MarkdownListingCollector, MarkdownRenderer, MetadataProbe, MorphologicalAnalyzer, Normalizer, PipelineResult, Preprocessor, WeakMlProbe

Instance Method Summary

Instance Method Details

#call(html:, url: nil, response_metadata: nil) ⇒ Object



# File 'lib/coelacanth/extractor.rb', line 29

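def call(html:, url: nil, response_metadata: nil)
  # Preprocess the raw HTML, then normalize it into the document the probes inspect.
  preprocessed_html = Preprocessor.new.call(html: html, url: url)
  document = Normalizer.new.call(html: preprocessed_html, base_url: url)

  # Probes run in order; each is paired with the minimum confidence it must reach.
  [
    [MetadataProbe.new, 0.85],
    [HeuristicProbe.new, 0.75],
    [WeakMlProbe.new, 0.70],
    [FallbackProbe.new, 0.0]
  ].each do |probe, threshold|
    result = probe.call(doc: document, url: url)
    next unless result # the probe found nothing

    # The first result that meets its threshold wins.
    return build_response(result, document:, url:, response_metadata:) if result.confidence.to_f >= threshold
  end

  # No probe qualified: respond with the whole document at zero confidence.
  build_response(
    PipelineResult.new(node: document, source_tag: :none, confidence: 0.0),
    document: document,
    url: url,
    response_metadata: response_metadata
  )
end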
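
The threshold cascade in #call can be tried in isolation. The following is a minimal sketch, not the library's implementation: `Result`, `StubProbe`, `first_confident`, and the source tags other than `:none` are hypothetical stand-ins invented for illustration, and only the probe order, the thresholds, and the `:none` default are taken from the source above.

```ruby
# Hypothetical stand-in for PipelineResult; the real class may carry more fields.
Result = Struct.new(:source_tag, :confidence, keyword_init: true)

# Hypothetical probe that reports a fixed confidence, or nil when it finds nothing.
class StubProbe
  def initialize(tag, confidence)
    @tag = tag
    @confidence = confidence
  end

  def call(doc:, url: nil)
    return nil if @confidence.nil?

    Result.new(source_tag: @tag, confidence: @confidence)
  end
end

# Walk the [probe, threshold] pairs in order and return the first result
# whose confidence meets its threshold; otherwise return a :none result.
def first_confident(probes, doc:, url: nil)
  probes.each do |probe, threshold|
    result = probe.call(doc: doc, url: url)
    next unless result

    return result if result.confidence.to_f >= threshold
  end

  Result.new(source_tag: :none, confidence: 0.0)
end

probes = [
  [StubProbe.new(:metadata, 0.60), 0.85],  # below its threshold, skipped
  [StubProbe.new(:heuristic, nil), 0.75],  # returns nil, skipped
  [StubProbe.new(:weak_ml, 0.72), 0.70],   # first to meet its threshold
  [StubProbe.new(:fallback, 0.10), 0.0]    # never reached in this run
]

winner = first_confident(probes, doc: "<html></html>")
puts winner.source_tag # prints "weak_ml"
```

Because the last pair uses a threshold of 0.0, any non-nil result from the final probe is accepted, so the `:none` default is only reached when every probe returns nil or falls below its threshold.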