Class: Coelacanth::Extractor::EyecatchImageExtractor

Inherits:
Object
Defined in:
lib/coelacanth/extractor/eyecatch_image_extractor.rb

Overview

Finds and downloads the representative image for a document.

Defined Under Namespace

Classes: Result

Constant Summary

POSITIVE_KEYWORDS =
%w[eyecatch hero main featured cover headline banner article primary lead].freeze
NEGATIVE_KEYWORDS =
%w[avatar icon logo emoji badge button profile author comment footer nav thumbnail thumb ad sponsor].freeze
METADATA_SOURCES =
[
  { selector: "meta[property='og:image:secure_url']", attribute: "content", score: 140 },
  { selector: "meta[property='og:image:url']", attribute: "content", score: 135 },
  { selector: "meta[property='og:image']", attribute: "content", score: 130 },
  { selector: "meta[name='twitter:image:src']", attribute: "content", score: 125 },
  { selector: "meta[name='twitter:image']", attribute: "content", score: 120 },
  { selector: "meta[itemprop='image']", attribute: "content", score: 110 },
  { selector: "meta[name='thumbnail']", attribute: "content", score: 100 },
  { selector: "link[rel='image_src']", attribute: "href", score: 95 }
].freeze
JSON_LD_IMAGE_KEYS =
%w[image imageUrl imageURL thumbnail thumbnailUrl thumbnailURL contentUrl contentURL].freeze
LAZY_SOURCE_ATTRIBUTES =
%w[data-src data-original data-lazy-src data-lazy data-url data-image data-preview src].freeze
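The constants above suggest a scoring approach, but this page does not show how they are consumed. The following is a minimal sketch of one plausible use, not the extractor's actual logic. The helper names (best_metadata_source, resolve_source, keyword_score), the 10/15 weights, and the plain-Hash inputs are all assumptions made for illustration. The constants are copied here only so the sketch runs on its own.

```ruby
# Copies of the documented constants, so the sketch is self-contained.
POSITIVE_KEYWORDS = %w[eyecatch hero main featured cover headline banner article primary lead].freeze
NEGATIVE_KEYWORDS = %w[avatar icon logo emoji badge button profile author comment footer nav thumbnail thumb ad sponsor].freeze
LAZY_SOURCE_ATTRIBUTES = %w[data-src data-original data-lazy-src data-lazy data-url data-image data-preview src].freeze
METADATA_SOURCES = [
  { selector: "meta[property='og:image:secure_url']", attribute: "content", score: 140 },
  { selector: "meta[property='og:image:url']", attribute: "content", score: 135 },
  { selector: "meta[property='og:image']", attribute: "content", score: 130 },
  { selector: "meta[name='twitter:image:src']", attribute: "content", score: 125 },
  { selector: "meta[name='twitter:image']", attribute: "content", score: 120 },
  { selector: "meta[itemprop='image']", attribute: "content", score: 110 },
  { selector: "meta[name='thumbnail']", attribute: "content", score: 100 },
  { selector: "link[rel='image_src']", attribute: "href", score: 95 }
].freeze

# Hypothetical helper: given a Hash of selector => value already read from
# the page, return the matching metadata source with the highest score.
def best_metadata_source(found)
  METADATA_SOURCES
    .reject { |source| found[source[:selector]].to_s.empty? }
    .max_by { |source| source[:score] }
end

# Hypothetical helper: return the first non-blank attribute value, checking
# attributes in the order LAZY_SOURCE_ATTRIBUTES lists them (src comes last,
# so a lazy-loading placeholder in src does not win over data-src).
def resolve_source(attributes)
  LAZY_SOURCE_ATTRIBUTES.each do |name|
    value = attributes[name].to_s.strip
    return value unless value.empty?
  end
  nil
end

# Hypothetical helper: score a class/id/alt string by whole-token matches.
# Tokens are used instead of substrings because "ad" would otherwise match
# inside "headline" and "lead". The weights 10 and 15 are assumed.
def keyword_score(descriptor)
  tokens = descriptor.to_s.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
  positives = (tokens & POSITIVE_KEYWORDS).size
  negatives = (tokens & NEGATIVE_KEYWORDS).size
  positives * 10 - negatives * 15
end
```

Under these assumptions, an img element carrying both a placeholder src and a real data-src resolves to the data-src value, and a class such as "author-avatar" scores below one such as "hero-image main".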

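Instance Method Summary

  • #call(doc:, base_url: nil) ⇒ Object
  • #initialize(http_client: Coelacanth::HTTP) ⇒ EyecatchImageExtractor
    Returns a new instance of EyecatchImageExtractor.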

Constructor Details

#initialize(http_client: Coelacanth::HTTP) ⇒ EyecatchImageExtractor

Returns a new instance of EyecatchImageExtractor.



# File 'lib/coelacanth/extractor/eyecatch_image_extractor.rb', line 35

def initialize(http_client: Coelacanth::HTTP)
  @http_client = http_client
end

Instance Method Details

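#call(doc:, base_url: nil) ⇒ Object

Locates the representative image URL in doc, resolving it with base_url, and downloads it. Returns nil when doc is nil or when no image URL can be located; otherwise returns the result of the download.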



# File 'lib/coelacanth/extractor/eyecatch_image_extractor.rb', line 39

def call(doc:, base_url: nil)
  return unless doc

  image_url = locate_image_url(doc, base_url)
  return unless image_url

  download(image_url)
end
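
The guard clauses in #call mean it returns nil both when no document is given and when no image URL is found, and the injected http_client is only touched once a URL exists. The following sketch illustrates that control flow with a stand-in class rather than the real extractor. SketchExtractor, FakeHTTP and its get method, the Hash-shaped doc, and the bodies of locate_image_url and download are all assumptions, since this page does not show them.

```ruby
require "uri"

# Hypothetical HTTP client double; the real Coelacanth::HTTP interface
# is not shown on this page.
class FakeHTTP
  attr_reader :requested

  def initialize
    @requested = []
  end

  # Records the URL and returns a fake payload.
  def get(url)
    @requested << url
    "bytes-for:#{url}"
  end
end

# Stand-in that mirrors only the documented constructor and #call.
class SketchExtractor
  def initialize(http_client:)
    @http_client = http_client
  end

  def call(doc:, base_url: nil)
    return unless doc

    image_url = locate_image_url(doc, base_url)
    return unless image_url

    download(image_url)
  end

  private

  # Assumption: doc is a Hash with an optional :image entry.
  def locate_image_url(doc, base_url)
    path = doc[:image].to_s
    return nil if path.empty?

    base_url ? URI.join(base_url, path).to_s : path
  end

  # Assumption: downloading delegates to the injected client.
  def download(url)
    @http_client.get(url)
  end
end

http = FakeHTTP.new
extractor = SketchExtractor.new(http_client: http)

extractor.call(doc: nil)  # nil: no document
extractor.call(doc: {})   # nil: no image URL located
extractor.call(doc: { image: "/img/hero.jpg" }, base_url: "https://example.com/post/1")
```

Passing the client as a keyword argument, as the documented constructor does, is what makes this kind of substitution possible in tests.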