Class: Html2rss::AutoSource::Scraper::Microdata

Inherits: Object

Includes: Enumerable

Defined in: lib/html2rss/auto_source/scraper/microdata.rb

Overview

Scrapes Schema.org Microdata items embedded directly in HTML markup.

Constant Summary

ITEM_SELECTOR = '[itemscope][itemtype]'

Selector matching nodes that define a microdata item scope.

SUPPORTED_TYPES = (Schema::Thing::SUPPORTED_TYPES | Set['Product']).freeze

Schema.org types supported for article extraction via Microdata.

VALUE_ATTRIBUTES = %w[content datetime href src data value].freeze

Attribute names checked first for microdata property values.

Class Method Summary

.articles?(parsed_body) ⇒ Boolean
.normalized_types(itemtype) ⇒ Array<String>

Normalized schema type names.
.options_key ⇒ Symbol

Scraper config key.
.supported_root?(node) ⇒ Boolean
.supported_roots(parsed_body) ⇒ Array<Nokogiri::XML::Element>

Top-level supported Microdata roots.
.supported_type_name(node) ⇒ String?

Supported schema type name when present.
.top_level_item?(node) ⇒ Boolean

Instance Method Summary

#each {|article| ... } ⇒ Enumerator, void

Iterates over normalized article hashes extracted from supported Microdata roots.
#initialize(parsed_body, url:, **_opts) ⇒ void constructor

Builds a Microdata scraper for an already parsed response body.

Constructor Details

#initialize(parsed_body, url:, **_opts) ⇒ `void`

Builds a Microdata scraper for an already parsed response body.

Parameters:

parsed_body (Nokogiri::HTML5::Document, Nokogiri::HTML4::Document, Nokogiri::XML::Node, nil) —

the parsed response body to inspect for top-level Microdata items.
url (Html2rss::Url) —

the absolute page URL used to resolve relative links.
_opts (Hash) —

unused scraper-specific options.

Options Hash (**_opts):

:_reserved (Object) —

reserved for future scraper-specific options

# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 71

def initialize(parsed_body, url:, **_opts)
  @parsed_body = parsed_body
  @url = url
end

Class Method Details

.articles?(parsed_body) ⇒ `Boolean`

Returns whether the parsed body contains at least one supported top-level Microdata root.

Parameters:

parsed_body (Nokogiri::HTML::Document, nil) —

parsed HTML document

Returns:

(Boolean)




# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 22

def articles?(parsed_body)
  supported_roots(parsed_body).any?
end

.normalized_types(itemtype) ⇒ `Array<String>`

Returns normalized schema type names.

Parameters:

itemtype (String, nil) —

raw itemtype attribute value

Returns:

(Array<String>) —

normalized schema type names

# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 47

def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end
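
The normalization above can be tried outside the gem. The following is a minimal sketch that copies the method body shown above into a standalone top-level method. The sample `itemtype` strings are illustrative assumptions, not values taken from the library's tests.

```ruby
# Standalone copy of the documented method body, for experimentation only.
def normalized_types(itemtype)
  # An itemtype attribute may list several whitespace-separated type URLs.
  itemtype.to_s.split.filter_map do |value|
    # Keep the last path segment, then whatever follows a '#' fragment marker.
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

p normalized_types('https://schema.org/Article')
# => ["Article"]
p normalized_types('http://schema.org/NewsArticle https://example.com/vocab#BlogPosting')
# => ["NewsArticle", "BlogPosting"]
p normalized_types(nil)
# => []
```

Because the argument goes through `to_s` first, a missing `itemtype` attribute (`nil`) yields an empty array rather than raising.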

.options_key ⇒ `Symbol`

Returns scraper config key.

Returns:

(Symbol) —

scraper config key

# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 18

def self.options_key = :microdata

.supported_root?(node) ⇒ `Boolean`

Returns whether the node has a supported schema type and is a top-level item.

Parameters:

node (Nokogiri::XML::Element) —

itemscope candidate node

Returns:

(Boolean)




# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 35

def supported_root?(node)
  supported_type_name(node) && top_level_item?(node)
end

.supported_roots(parsed_body) ⇒ `Array<Nokogiri::XML::Element>`

Returns top-level supported Microdata roots.

Parameters:

parsed_body (Nokogiri::HTML::Document, nil) —

parsed HTML document

Returns:

(Array<Nokogiri::XML::Element>) —

top-level supported Microdata roots

# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 28

def supported_roots(parsed_body)
  return [] unless parsed_body

  parsed_body.css(ITEM_SELECTOR).select { supported_root?(_1) }
end

.supported_type_name(node) ⇒ `String`?

Returns supported schema type name when present.

Parameters:

node (Nokogiri::XML::Element) —

itemscope candidate node

Returns:

(String, nil) —

supported schema type name when present




# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 41

def supported_type_name(node)
  normalized_types(node['itemtype']).find { SUPPORTED_TYPES.include?(_1) }
end
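
To show how the lookup behaves, here is a hedged sketch that runs the same `find` over a stand-in set. `SKETCH_SUPPORTED_TYPES` is a hypothetical set: the real contents of `Schema::Thing::SUPPORTED_TYPES` are not shown on this page, and only `'Product'` is confirmed by the constant definition. A plain Hash stands in for the Nokogiri node because the method only calls `node['itemtype']`.

```ruby
require 'set'

# Hypothetical stand-in; only 'Product' is confirmed by the documented constant.
SKETCH_SUPPORTED_TYPES = Set['Article', 'NewsArticle', 'Product'].freeze

# Standalone copy of the documented normalization step.
def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

# Returns the first normalized type found in the supported set, or nil.
def supported_type_name(node)
  normalized_types(node['itemtype']).find { |type| SKETCH_SUPPORTED_TYPES.include?(type) }
end

p supported_type_name({ 'itemtype' => 'https://schema.org/Person https://schema.org/Product' })
# => "Product"
p supported_type_name({ 'itemtype' => 'https://schema.org/Person' })
# => nil
```

When an element lists several types, the first supported one in attribute order wins.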

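.top_level_item?(node) ⇒ `Boolean`

Returns whether the node has no itemprop attribute and no ancestor that carries both itemscope and itemprop.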

Parameters:

node (Nokogiri::XML::Element) —

itemscope candidate node

Returns:

(Boolean)

# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 55

def top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end
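
The rule is easier to see on a small tree. The sketch below is an illustration only: `FakeNode` is a hypothetical stand-in that implements just the two methods the check uses (`attribute` and `ancestors`), so it runs without Nokogiri. The check itself is copied from the method body above.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Element: attributes plus a parent link.
FakeNode = Struct.new(:attrs, :parent) do
  def attribute(name)
    attrs[name]
  end

  # Parent chain from nearest to farthest.
  def ancestors
    chain = []
    current = parent
    while current
      chain << current
      current = current.parent
    end
    chain
  end
end

# Same logic as the documented method.
def top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end

article = FakeNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, nil)
author  = FakeNode.new({ 'itemscope' => '', 'itemprop' => 'author' }, article)
nested  = FakeNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, author)
sibling = FakeNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, article)

p top_level_item?(article) # => true
p top_level_item?(author)  # => false (it is itself a property of another item)
p top_level_item?(nested)  # => false (an ancestor is a property-valued item)
p top_level_item?(sibling) # => true
```

The last case shows that an item nested inside another item still counts as top-level, as long as neither it nor any ancestor item is attached through `itemprop`.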

Instance Method Details

#each {|article| ... } ⇒ `Enumerator`, `void`

Iterates over normalized article hashes extracted from supported Microdata roots.

Yield Parameters:

article (Hash{Symbol => Object}) —

the normalized article attributes.

Returns:

(Enumerator, void) —

an enumerator when no block is given.

# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 81

def each
  return enum_for(:each) unless block_given?

  self.class.supported_roots(parsed_body).each do |root|
    article = article_from(root)
    yield article if article
  end
end
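
The `enum_for` guard combined with `Enumerable` means callers can use the scraper with or without a block. The sketch below reproduces that pattern in a self-contained class. `SketchScraper` and its `article_from` are hypothetical; the real `article_from` is not shown on this page.

```ruby
# Hypothetical class mirroring the documented #each pattern.
class SketchScraper
  include Enumerable

  def initialize(roots)
    @roots = roots
  end

  def each
    # Without a block, hand back a lazy-friendly Enumerator instead of iterating.
    return enum_for(:each) unless block_given?

    @roots.each do |root|
      article = article_from(root)
      yield article if article
    end
  end

  private

  # Assumed extraction step: skip roots that have no usable title.
  def article_from(root)
    return nil if root[:title].to_s.empty?

    { title: root[:title] }
  end
end

scraper = SketchScraper.new([{ title: 'First' }, { title: '' }, { title: 'Second' }])
titles = scraper.map { |article| article[:title] }
p titles
# => ["First", "Second"]
enumerator = scraper.each
p enumerator.class
# => Enumerator
```

Roots for which the extraction step returns `nil` are skipped, so consumers never receive `nil` entries.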
