Class: Html2rss::AutoSource::Scraper::Schema::Thing

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/schema/thing.rb

Overview

Thing acts as the base class for Schema.org schema objects.

Direct Known Subclasses

ItemList, ListItem

Constant Summary

SUPPORTED_TYPES =

Supported Schema.org `@type` values mapped to article extraction.

%w[
  AdvertiserContentArticle AnalysisNewsArticle APIReference Article
  AskPublicNewsArticle BackgroundNewsArticle BlogPosting DiscussionForumPosting
  LiveBlogPosting NewsArticle OpinionNewsArticle Report ReportageNewsArticle
  ReviewNewsArticle SatiricalArticle ScholarlyArticle SocialMediaPosting TechArticle
].to_set.freeze
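
A frozen Set like this makes type checks a constant-time lookup. The following is a minimal sketch of how such a check might look; the `supported?` helper is hypothetical (it is not part of the documented API), the type list is abbreviated, and the sketch assumes `@type` arrives as a single String under the symbol key `:@type`.

```ruby
require 'set'

# Abbreviated copy of the constant, for illustration only.
SUPPORTED_TYPES = %w[Article BlogPosting NewsArticle].to_set.freeze

# Hypothetical helper: true when the object's @type is a supported article type.
def supported?(schema_object)
  SUPPORTED_TYPES.include?(schema_object[:@type])
end

supported?({ '@type': 'NewsArticle' }) # type is in the set
supported?({ '@type': 'Person' })      # type is not in the set
```
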
DEFAULT_ATTRIBUTES =

Attributes exposed by `#call` in generated article hashes.

%i[id title description url image published_at categories].freeze


Constructor Details

#initialize(schema_object, url:) ⇒ Thing

Returns a new instance of Thing.

Parameters:

  • schema_object (Hash{Symbol => Object})

    parsed schema.org object

  • url (String, Html2rss::Url, nil)

    base URL used for relative normalization



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 27

def initialize(schema_object, url:)
  @schema_object = schema_object
  @base_url = normalized_base_url(url)
end

Instance Attribute Details

#base_url ⇒ Object (readonly)

Returns the value of attribute base_url.



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 80

def base_url
  @base_url
end

#schema_object ⇒ Object (readonly)

Returns the value of attribute schema_object.



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 80

def schema_object
  @schema_object
end

Instance Method Details

#call ⇒ Hash

Returns the scraped article hash with DEFAULT_ATTRIBUTES.

Returns:

  • (Hash)

    the scraped article hash with DEFAULT_ATTRIBUTES



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 33

def call = DEFAULT_ATTRIBUTES.to_h { [_1, public_send(_1)] }
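
The one-liner above builds a hash by sending each attribute name to the object as a method call. A minimal sketch of the same pattern, using a hypothetical stand-in class (`FakeThing` and its two attributes are invented for illustration and are not part of html2rss):

```ruby
# Hypothetical stand-in; not the real Thing class.
class FakeThing
  ATTRIBUTES = %i[id title].freeze

  def id = '/posts/1'
  def title = 'Hello'

  # Same pattern as Thing#call: map each attribute name to its reader's value.
  def call = ATTRIBUTES.to_h { [_1, public_send(_1)] }
end

result = FakeThing.new.call
# result is { id: '/posts/1', title: 'Hello' }
```

Because `public_send` is used, every name in the attribute list has to correspond to a public reader method on the object.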

#categories ⇒ Array<String>?

Returns extracted category labels.

Returns:

  • (Array<String>, nil)

    extracted category labels



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 76

def categories
  @categories ||= CategoryExtractor.call(schema_object)
end

#description ⇒ String?

Returns longest available description field.

Returns:

  • (String, nil)

    longest available description field



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 47

def description
  schema_object.values_at(:description, :schema_object_body, :abstract).max_by { _1.to_s.size }
end
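
To illustrate the selection rule: `values_at` returns `nil` for missing keys, and `nil.to_s.size` is 0, so absent fields never win over a present one. A small sketch with invented sample data, using the same keys as the method body above:

```ruby
# Sample data invented for illustration.
schema_object = {
  description: 'Short teaser.',
  abstract: 'A noticeably longer abstract of the article.'
}

candidates = schema_object.values_at(:description, :schema_object_body, :abstract)
longest = candidates.max_by { _1.to_s.size }

# With no description-like keys at all, every candidate is nil and so is the result.
nothing = {}.values_at(:description, :schema_object_body, :abstract).max_by { _1.to_s.size }
```
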

#id ⇒ String?

Returns stable schema object identifier.

Returns:

  • (String, nil)

    stable schema object identifier



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 36

def id
  return @id if defined?(@id)

  id = normalized_id(schema_object[:@id], reference_url: url || base_url) || url&.path.to_s
  @id = id.to_s.empty? ? nil : id
end
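
The `return @id if defined?(@id)` guard memoizes the result even when it is `nil`, which `||=` would not do. A minimal sketch of the difference; `CachedLookup` and its counter are hypothetical and exist only to make the behaviour observable:

```ruby
# Hypothetical class, used only to illustrate the memoization pattern.
class CachedLookup
  attr_reader :computations

  def initialize
    @computations = 0
  end

  # Caches the result even when it is nil, because defined?(@value)
  # is truthy once the variable has been assigned.
  def value
    return @value if defined?(@value)

    @computations += 1
    @value = nil
  end

  # For contrast: ||= re-runs the computation whenever the cached value is nil.
  def value_with_or_equals
    @other ||= begin
      @computations += 1
      nil
    end
  end
end

guarded = CachedLookup.new
3.times { guarded.value }

naive = CachedLookup.new
3.times { naive.value_with_or_equals }
```

Here `guarded` computes once while `naive` computes on every call, which is presumably why `#id`, `#image`, and `#url` use the `defined?` form for values that may legitimately be `nil`.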

#image ⇒ Html2rss::Url?

Returns normalized article image URL.

Returns:

  • (Html2rss::Url, nil)

    normalized article image URL



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 65

def image
  return @image if defined?(@image)

  img_url = image_urls.first
  @image = img_url ? Url.from_relative(img_url, base_url || img_url) : nil
end

#image_urls ⇒ Array<String>

Returns normalized image URL candidates.

Returns:

  • (Array<String>)

    normalized image URL candidates



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 83

def image_urls
  @image_urls ||= schema_object.values_at(:image, :thumbnailUrl).filter_map { image_url_from(_1) }
end

#published_at ⇒ String?

Returns published-at timestamp string.

Returns:

  • (String, nil)

    published-at timestamp string



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 73

def published_at = schema_object[:datePublished]

#title ⇒ String?

Returns article title.

Returns:

  • (String, nil)

    article title



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 44

def title = schema_object[:title]

#url ⇒ Html2rss::Url?

Returns the URL of the schema object.

Returns:

  • (Html2rss::Url, nil)

    the URL of the schema object



# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 52

def url
  return @url if defined?(@url)

  url = schema_object[:url]
  if url.to_s.empty?
    Log.debug("Schema#Thing.url: no url in schema_object: #{schema_object.inspect}")
    return @url = nil
  end

  @url = Url.from_relative(url, base_url || url)
end