Class: Html2rss::AutoSource::Scraper::Schema::Thing
- Inherits:
- Object
  - Object
    - Html2rss::AutoSource::Scraper::Schema::Thing
- Defined in:
- lib/html2rss/auto_source/scraper/schema/thing.rb
Overview
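Thing serves as a kind of base class for Schema.org schema objects. It wraps a single parsed schema object and exposes the article attributes listed in DEFAULT_ATTRIBUTES through #call.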
Constant Summary
- SUPPORTED_TYPES =
Supported Schema.org `@type` values mapped to article extraction.
%w[ AdvertiserContentArticle AnalysisNewsArticle APIReference Article AskPublicNewsArticle BackgroundNewsArticle BlogPosting DiscussionForumPosting LiveBlogPosting NewsArticle OpinionNewsArticle Report ReportageNewsArticle ReviewNewsArticle SatiricalArticle ScholarlyArticle SocialMediaPosting TechArticle ].to_set.freeze
- DEFAULT_ATTRIBUTES =
Attributes exposed by `#call` in generated article hashes.
%i[id title description url image published_at categories].freeze
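As a minimal sketch of how a type set like SUPPORTED_TYPES can gate extraction, the snippet below copies a few entries into a local set and checks a schema object's `@type` against it. The `SAMPLE_SUPPORTED_TYPES` constant, the `supported?` helper, and the sample hashes are hypothetical and not part of html2rss.

```ruby
require 'set'

# Local stand-in mirroring a subset of the documented SUPPORTED_TYPES (assumption).
SAMPLE_SUPPORTED_TYPES = %w[Article BlogPosting NewsArticle TechArticle].to_set.freeze

# Hypothetical helper: true when the object's @type is one of the supported types.
def supported?(schema_object)
  SAMPLE_SUPPORTED_TYPES.include?(schema_object[:@type])
end

news   = { '@type': 'NewsArticle', title: 'Example' }
person = { '@type': 'Person', name: 'Example' }

puts supported?(news)
puts supported?(person)
```

A Set keeps the membership check constant-time regardless of how many types are listed, which is presumably why the constant is converted with `to_set`.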
Instance Attribute Summary
-
#base_url ⇒ Object
readonly
Returns the value of attribute base_url.
-
#schema_object ⇒ Object
readonly
Returns the value of attribute schema_object.
Instance Method Summary
-
#call ⇒ Hash
The scraped article hash with DEFAULT_ATTRIBUTES.
-
#categories ⇒ Array<String>?
Extracted category labels.
-
#description ⇒ String?
Longest available description field.
-
#id ⇒ String?
Stable schema object identifier.
-
#image ⇒ Html2rss::Url?
Normalized article image URL.
-
#image_urls ⇒ Array<String>
Normalized image URL candidates.
-
#initialize(schema_object, url:) ⇒ Thing
constructor
A new instance of Thing.
-
#published_at ⇒ String?
Published-at timestamp string.
-
#title ⇒ String?
Article title.
-
#url ⇒ Html2rss::Url?
The URL of the schema object.
Constructor Details
#initialize(schema_object, url:) ⇒ Thing
Returns a new instance of Thing.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 27
def initialize(schema_object, url:)
  @schema_object = schema_object
  @base_url = normalized_base_url(url)
end
Instance Attribute Details
#base_url ⇒ Object (readonly)
Returns the value of attribute base_url.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 80
def base_url
  @base_url
end
#schema_object ⇒ Object (readonly)
Returns the value of attribute schema_object.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 80
def schema_object
  @schema_object
end
Instance Method Details
#call ⇒ Hash
Returns the scraped article hash with DEFAULT_ATTRIBUTES.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 33
def call = DEFAULT_ATTRIBUTES.to_h { [_1, public_send(_1)] }
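The `to_h` block in #call builds one hash entry per attribute name by calling the public reader of the same name. A minimal sketch of that pattern, using a hypothetical `MiniThing` class with only two attributes (not the real Thing):

```ruby
# Hypothetical stand-in for Thing that exposes only two attributes.
class MiniThing
  ATTRIBUTES = %i[title published_at].freeze

  def initialize(schema_object)
    @schema_object = schema_object
  end

  # Same pattern as the documented #call: each attribute name becomes a key,
  # and its value comes from the public method with that name.
  def call = ATTRIBUTES.to_h { [_1, public_send(_1)] }

  def title = @schema_object[:title]
  def published_at = @schema_object[:datePublished]
end

article = MiniThing.new({ title: 'Hello', datePublished: '2024-01-01' }).call
p article
```

Because the values come from `public_send`, a subclass that overrides one of the readers would change the corresponding entry without touching `call`.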
#categories ⇒ Array<String>?
Returns extracted category labels.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 76
def categories
  @categories ||= CategoryExtractor.call(schema_object)
end
#description ⇒ String?
Returns longest available description field.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 47
def description
  schema_object.values_at(:description, :schema_object_body, :abstract).max_by { _1.to_s.size }
end
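The selection rule in #description can be tried in isolation. The sketch below applies the same `values_at` / `max_by` expression to plain hashes; the `longest_description` helper and the sample data are hypothetical.

```ruby
# Hypothetical helper reproducing the expression used by #description.
def longest_description(schema_object)
  # values_at yields nil for missing keys; nil.to_s.size is 0, so a missing
  # field cannot win against a non-empty string.
  schema_object.values_at(:description, :schema_object_body, :abstract)
               .max_by { _1.to_s.size }
end

sample = {
  description: 'Short teaser.',
  abstract: 'A noticeably longer abstract of the article.'
}

p longest_description(sample)
p longest_description({})
```

When none of the three keys is present, every candidate is nil and the result is nil, which matches the documented `String?` return type.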
#id ⇒ String?
Returns stable schema object identifier.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 36
def id
  return @id if defined?(@id)

  id = normalized_id(schema_object[:@id], reference_url: url || base_url) || url&.path.to_s
  @id = id.to_s.empty? ? nil : id
end
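#id memoizes with `defined?(@id)` rather than `||=`, which also caches a nil result. The following sketch illustrates that difference with a hypothetical `MemoDemo` class that counts how often the lookup body runs; it does not use any html2rss code.

```ruby
# Hypothetical class showing defined?-based memoization of a nil result.
class MemoDemo
  attr_reader :lookups

  def initialize
    @lookups = 0
  end

  def id
    # Once @id has been assigned, even to nil, defined? is truthy and we return early.
    return @id if defined?(@id)

    @lookups += 1
    @id = nil # pretend no identifier could be derived
  end
end

demo = MemoDemo.new
demo.id
demo.id
p demo.lookups
```

With `@id ||= ...` the body would run again on every call whenever the result is nil, so the `defined?` guard avoids repeating the lookup for objects that have no identifier.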
#image ⇒ Html2rss::Url?
Returns normalized article image URL.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 65
def image
  return @image if defined?(@image)

  img_url = image_urls.first
  @image = img_url ? Url.from_relative(img_url, base_url || img_url) : nil
end
#image_urls ⇒ Array<String>
Returns normalized image URL candidates.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 83
def image_urls
  @image_urls ||= schema_object.values_at(:image, :thumbnailUrl).filter_map { image_url_from(_1) }
end
#published_at ⇒ String?
Returns published-at timestamp string.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 73
def published_at = schema_object[:datePublished]
#title ⇒ String?
Returns article title.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 44
def title = schema_object[:title]
#url ⇒ Html2rss::Url?
Returns the URL of the schema object.
# File 'lib/html2rss/auto_source/scraper/schema/thing.rb', line 52
def url
  return @url if defined?(@url)

  url = schema_object[:url]
  if url.to_s.empty?
    Log.debug("Schema#Thing.url: no url in schema_object: #{schema_object.inspect}")
    return @url = nil
  end

  @url = Url.from_relative(url, base_url || url)
end
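The control flow of #url (return nil for a blank value, otherwise resolve against the base URL) can be approximated with the standard library. This sketch swaps in `URI.join` as a stand-in for `Url.from_relative`, whose exact behavior is not shown on this page and may differ; the `resolve_url` helper is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: resolve `url` against `base_url`, or return nil when blank.
# URI.join is an assumption standing in for Html2rss's Url.from_relative.
def resolve_url(url, base_url)
  return nil if url.to_s.empty?

  # Fall back to the URL itself as the base, mirroring `base_url || url`.
  URI.join(base_url || url, url)
end

puts resolve_url('/posts/1', 'https://example.com/blog/')
puts resolve_url('https://other.test/a', 'https://example.com/')
p resolve_url('', 'https://example.com/')
```

In this stand-in, a relative URL with no base would make `URI.join` raise, so callers of a helper like this would want a base URL whenever the schema object may contain relative links.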