Class: Coelacanth::Extractor
- Inherits: Object
  - Object
  - Coelacanth::Extractor
- Defined in:
  - lib/coelacanth/extractor.rb
  - lib/coelacanth/extractor/utilities.rb
  - lib/coelacanth/extractor/normalizer.rb
  - lib/coelacanth/extractor/preprocessor.rb
  - lib/coelacanth/extractor/weak_ml_probe.rb
  - lib/coelacanth/extractor/fallback_probe.rb
  - lib/coelacanth/extractor/metadata_probe.rb
  - lib/coelacanth/extractor/heuristic_probe.rb
  - lib/coelacanth/extractor/image_collector.rb
  - lib/coelacanth/extractor/markdown_renderer.rb
  - lib/coelacanth/extractor/morphological_analyzer.rb
  - lib/coelacanth/extractor/eyecatch_image_extractor.rb
  - lib/coelacanth/extractor/markdown_listing_collector.rb
Overview
High-level API for extracting articles without site-specific selectors.
Defined Under Namespace
Modules: Utilities
Classes: EyecatchImageExtractor, FallbackProbe, HeuristicProbe, ImageCollector, MarkdownListingCollector, MarkdownRenderer, MetadataProbe, MorphologicalAnalyzer, Normalizer, PipelineResult, Preprocessor, WeakMlProbe
Instance Method Summary
- #call(html:, url: nil, response_metadata: nil) ⇒ Object
Instance Method Details
#call(html:, url: nil, response_metadata: nil) ⇒ Object
# File 'lib/coelacanth/extractor.rb', line 29

def call(html:, url: nil, response_metadata: nil)
  # Clean the raw HTML, then normalize it into a document the probes can inspect.
  preprocessed_html = Preprocessor.new.call(html: html, url: url)
  document = Normalizer.new.call(html: preprocessed_html, base_url: url)

  # Try each probe in order; return the first result whose confidence
  # meets that probe's threshold.
  [
    [MetadataProbe.new, 0.85],
    [HeuristicProbe.new, 0.75],
    [WeakMlProbe.new, 0.70],
    [FallbackProbe.new, 0.0]
  ].each do |probe, threshold|
    result = probe.call(doc: document, url: url)
    next unless result

    return build_response(result, document:, url:, response_metadata:) if result.confidence.to_f >= threshold
  end

  # No probe cleared its threshold: return the whole document with zero confidence.
  build_response(
    PipelineResult.new(node: document, source_tag: :none, confidence: 0.0),
    document: document,
    url: url,
    response_metadata:
  )
end
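The probe loop in #call is a confidence-threshold cascade: each probe is tried in order, and the first result that meets its own threshold wins. The following is a minimal, self-contained sketch of that pattern only. It does not use the Coelacanth classes; `Result`, `StubProbe`, and `first_confident` are hypothetical names introduced for illustration, and the stub confidences are made-up values.

```ruby
# Hypothetical stand-in for a probe result; not part of Coelacanth.
Result = Struct.new(:source, :confidence, keyword_init: true)

# Hypothetical probe that always reports a fixed confidence,
# or nil when it finds nothing.
class StubProbe
  def initialize(source, confidence)
    @source = source
    @confidence = confidence
  end

  def call(doc:)
    return nil if @confidence.nil?

    Result.new(source: @source, confidence: @confidence)
  end
end

# Walks the [probe, threshold] pairs in order and returns the first
# result whose confidence meets its threshold. Falls back to a
# zero-confidence :none result when no probe qualifies.
def first_confident(probes_with_thresholds, doc:)
  probes_with_thresholds.each do |probe, threshold|
    result = probe.call(doc: doc)
    next unless result

    return result if result.confidence.to_f >= threshold
  end

  Result.new(source: :none, confidence: 0.0)
end

pipeline = [
  [StubProbe.new(:metadata, 0.60), 0.85],  # below its threshold, skipped
  [StubProbe.new(:heuristic, nil), 0.75],  # returns nil, skipped
  [StubProbe.new(:weak_ml, 0.72), 0.70],   # meets its threshold, wins
  [StubProbe.new(:fallback, 0.10), 0.0]    # never reached in this run
]

winner = first_confident(pipeline, doc: "<html></html>")
puts "#{winner.source} (#{winner.confidence})" # prints "weak_ml (0.72)"
```

Ordering the pairs from strictest to most lenient threshold means a later, weaker probe is consulted only when the earlier ones decline or are not confident enough. A final probe with a threshold of 0.0 acts as a catch-all whenever it returns a result at all.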