Class: Html2rss::AutoSource::Scraper::Microdata
- Inherits:
- Object
- Includes:
- Enumerable
- Defined in:
- lib/html2rss/auto_source/scraper/microdata.rb
Overview
Scrapes Schema.org Microdata items embedded directly in HTML markup.
Constant Summary
- ITEM_SELECTOR =
Selector matching nodes that define a microdata item scope.
'[itemscope][itemtype]'
- SUPPORTED_TYPES =
Schema.org types supported for article extraction via Microdata.
(Schema::Thing::SUPPORTED_TYPES | Set['Product']).freeze
- VALUE_ATTRIBUTES =
Attribute names checked first for microdata property values.
%w[content datetime href src data value].freeze
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
- .normalized_types(itemtype) ⇒ Array<String>
Normalized schema type names.
- .options_key ⇒ Symbol
Scraper config key.
- .supported_root?(node) ⇒ Boolean
- .supported_roots(parsed_body) ⇒ Array<Nokogiri::XML::Element>
Top-level supported Microdata roots.
- .supported_type_name(node) ⇒ String?
Supported schema type name when present.
- .top_level_item?(node) ⇒ Boolean
Instance Method Summary
- #each {|article| ... } ⇒ Enumerator, void
Iterates over normalized article hashes extracted from supported Microdata roots.
- #initialize(parsed_body, url:, **_opts) ⇒ void
constructor
Builds a Microdata scraper for an already parsed response body.
Constructor Details
#initialize(parsed_body, url:, **_opts) ⇒ void
Builds a Microdata scraper for an already parsed response body.
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 71

def initialize(parsed_body, url:, **_opts)
  @parsed_body = parsed_body
  @url = url
end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 22

def articles?(parsed_body)
  supported_roots(parsed_body).any?
end
.normalized_types(itemtype) ⇒ Array<String>
Returns normalized schema type names.
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 47

def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end
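To make the normalization concrete, here is a minimal sketch that copies the method body above into a standalone top-level method so it can be run outside the class. The sample `itemtype` strings are illustrative assumptions, not values taken from the library's tests.

```ruby
# Standalone copy of the normalization logic, for experimentation only.
def normalized_types(itemtype)
  # An itemtype attribute may hold several whitespace-separated URLs.
  itemtype.to_s.split.filter_map do |value|
    # Keep the last path segment, then whatever follows a '#' fragment marker.
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

p normalized_types('https://schema.org/Article')
# => ["Article"]
p normalized_types('http://schema.org/NewsArticle https://schema.org/Product')
# => ["NewsArticle", "Product"]
p normalized_types('https://example.org/vocab#Event')
# => ["Event"]
p normalized_types(nil)
# => []
```

Because the input goes through `to_s` first, a node without an `itemtype` attribute yields an empty array rather than raising.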
.options_key ⇒ Symbol
Returns scraper config key.
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 18

def self.options_key = :microdata
.supported_root?(node) ⇒ Boolean
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 35

def supported_root?(node)
  supported_type_name(node) && top_level_item?(node)
end
.supported_roots(parsed_body) ⇒ Array<Nokogiri::XML::Element>
Returns top-level supported Microdata roots.
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 28

def supported_roots(parsed_body)
  return [] unless parsed_body

  parsed_body.css(ITEM_SELECTOR).select { supported_root?(_1) }
end
.supported_type_name(node) ⇒ String?
Returns supported schema type name when present.
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 41

def supported_type_name(node)
  normalized_types(node['itemtype']).find { SUPPORTED_TYPES.include?(_1) }
end
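The lookup above can be tried in isolation with a rough sketch like the following. It rests on two assumptions: `EXAMPLE_SUPPORTED_TYPES` is a placeholder set (the real constant is built from `Schema::Thing::SUPPORTED_TYPES`, whose contents are not shown here), and a plain Hash stands in for a Nokogiri element, since both respond to `['itemtype']`.

```ruby
require 'set'

# Placeholder set. The real constant is Schema::Thing::SUPPORTED_TYPES | Set['Product'].
EXAMPLE_SUPPORTED_TYPES = Set['Article', 'NewsArticle', 'Product'].freeze

# Same normalization as the class method documented above.
def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

# Returns the first normalized type found in the placeholder set, or nil.
def supported_type_name(node)
  normalized_types(node['itemtype']).find { |type| EXAMPLE_SUPPORTED_TYPES.include?(type) }
end

# Hashes stand in for parsed elements in this sketch.
news   = { 'itemtype' => 'https://schema.org/Thing https://schema.org/NewsArticle' }
person = { 'itemtype' => 'https://schema.org/Person' }

p supported_type_name(news)   # => "NewsArticle"
p supported_type_name(person) # => nil
p supported_type_name({})     # => nil
```

Since `find` stops at the first match, the order of URLs in the `itemtype` attribute decides which name is returned when several are supported.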
.top_level_item?(node) ⇒ Boolean
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 55

def top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end
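The top-level check above has two rejection paths, which the sketch below walks through. `FakeNode` is a hypothetical stand-in that offers only the `attribute` and `ancestors` calls the method relies on. It is not part of html2rss or Nokogiri, and the sample tree is an assumption made for illustration.

```ruby
# Hypothetical stand-in for a parsed element: an attribute hash plus a parent link.
FakeNode = Struct.new(:attrs, :parent) do
  # Returns the attribute value, or nil when the attribute is absent.
  def attribute(name)
    attrs[name]
  end

  # Collects parents from nearest to farthest, similar to a DOM ancestor walk.
  def ancestors
    chain = []
    current = parent
    while current
      chain << current
      current = current.parent
    end
    chain
  end
end

# Same logic as the class method documented above.
def top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end

body    = FakeNode.new({}, nil)
article = FakeNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, body)
review  = FakeNode.new({ 'itemscope' => '', 'itemprop' => 'review' }, article)
rating  = FakeNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Rating' }, review)

p top_level_item?(article) # => true  (no itemprop, no property-item ancestor)
p top_level_item?(review)  # => false (it is itself a property of another item)
p top_level_item?(rating)  # => false (nested inside an item that is a property)
```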
Instance Method Details
#each {|article| ... } ⇒ Enumerator, void
Iterates over normalized article hashes extracted from supported Microdata roots.
# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 81

def each
  return enum_for(:each) unless block_given?

  self.class.supported_roots(parsed_body).each do |root|
    article = article_from(root)
    yield article if article
  end
end
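The `enum_for` guard and the nil-skipping `yield` above follow a common Ruby pattern, sketched below with a hypothetical `ExampleScraper` class. Only the shape of `each` mirrors the documented method. The constructor, the hash-based roots, and the `article_from` placeholder are assumptions made for illustration.

```ruby
# Hypothetical class: only the structure of #each mirrors the method above.
class ExampleScraper
  include Enumerable

  def initialize(roots)
    @roots = roots
  end

  def each
    # Without a block, hand back an Enumerator instead of iterating.
    return enum_for(:each) unless block_given?

    @roots.each do |root|
      article = article_from(root)
      # Roots that produce no article are skipped silently.
      yield article if article
    end
  end

  private

  # Placeholder extraction step: returns nil for roots without a title.
  def article_from(root)
    return nil unless root[:title]

    { title: root[:title] }
  end
end

scraper = ExampleScraper.new([{ title: 'First' }, {}, { title: 'Second' }])

p scraper.each.class              # => Enumerator
p scraper.map { |a| a[:title] }   # => ["First", "Second"]
p scraper.count                   # => 2
```

Including `Enumerable` on top of `each` is what makes `map`, `count`, and `first` available without further code.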