Class: Html2rss::AutoSource::Scraper::Microdata

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/microdata.rb

Overview

Scrapes Schema.org Microdata items embedded directly in HTML markup.

Constant Summary

ITEM_SELECTOR =

Selector matching nodes that define a microdata item scope.

'[itemscope][itemtype]'
SUPPORTED_TYPES =

Schema.org types supported for article extraction via Microdata.

(Schema::Thing::SUPPORTED_TYPES | Set['Product']).freeze
VALUE_ATTRIBUTES =

Attribute names checked first for microdata property values.

%w[content datetime href src data value].freeze
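
As a rough illustration of how a prioritized attribute list such as VALUE_ATTRIBUTES can be consulted, the sketch below picks the first non-empty attribute in list order. It is not the scraper's actual implementation: the `property_value` helper, the plain Hash standing in for a node, and the fallback to text content are all assumptions made for this example.

```ruby
# Hypothetical sketch; it mirrors the documented constant, not the library code.
VALUE_ATTRIBUTES = %w[content datetime href src data value].freeze

# `attributes` is a plain Hash standing in for a node's attributes (assumption).
# `text` stands in for the node's text content, used as an assumed fallback.
def property_value(attributes, text = nil)
  name = VALUE_ATTRIBUTES.find { |attr| !attributes[attr].to_s.strip.empty? }
  name ? attributes[name].strip : text&.strip
end

property_value({ 'datetime' => '2024-05-01', 'href' => '/post' }) # => "2024-05-01"
property_value({}, ' Hello ')                                      # => "Hello"
```

Because the list is ordered, `datetime` wins over `href` in the first call even though both are present.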

Constructor Details

#initialize(parsed_body, url:, **_opts) ⇒ void

Builds a Microdata scraper for an already parsed response body.

Parameters:

  • parsed_body (Nokogiri::HTML5::Document, Nokogiri::HTML4::Document, Nokogiri::XML::Node, nil)

    the parsed response body to inspect for top-level Microdata items.

  • url (Html2rss::Url)

    the absolute page URL used to resolve relative links.

  • _opts (Hash)

    unused scraper-specific options.

Options Hash (**_opts):

  • :_reserved (Object)

    reserved for future scraper-specific options



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 71

def initialize(parsed_body, url:, **_opts)
  @parsed_body = parsed_body
  @url = url
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Parameters:

  • parsed_body (Nokogiri::HTML::Document, nil)

    parsed HTML document

Returns:

  • (Boolean)

    true when the parsed body contains at least one supported top-level Microdata root


# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 22

def articles?(parsed_body)
  supported_roots(parsed_body).any?
end

.normalized_types(itemtype) ⇒ Array<String>

Returns normalized schema type names.

Parameters:

  • itemtype (String, nil)

    raw itemtype attribute value

Returns:

  • (Array<String>)

    normalized schema type names



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 47

def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end
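
The method body is plain Ruby, so its behavior can be tried in isolation. The sketch below copies it as a top-level method; the sample itemtype values are illustrative and are not taken from the library's tests.

```ruby
# Standalone copy of the method shown above, for experimentation only.
def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    # Keep the last path segment, then the part after any '#' fragment marker.
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

normalized_types('https://schema.org/Article http://schema.org/NewsArticle')
# => ["Article", "NewsArticle"]
normalized_types('http://example.org/vocab#BlogPosting') # => ["BlogPosting"]
normalized_types(nil)                                     # => []
```

The `to_s` calls are what make a missing itemtype attribute (nil) safe: it becomes an empty string, which splits into an empty array.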

.options_key ⇒ Symbol

Returns scraper config key.

Returns:

  • (Symbol)

    scraper config key



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 18

def self.options_key = :microdata

.supported_root?(node) ⇒ Boolean

Parameters:

  • node (Nokogiri::XML::Element)

    itemscope candidate node

Returns:

  • (Boolean)

    true when the node declares a supported Schema.org type and is a top-level item


# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 35

def supported_root?(node)
  supported_type_name(node) && top_level_item?(node)
end

.supported_roots(parsed_body) ⇒ Array<Nokogiri::XML::Element>

Returns top-level supported Microdata roots.

Parameters:

  • parsed_body (Nokogiri::HTML::Document, nil)

    parsed HTML document

Returns:

  • (Array<Nokogiri::XML::Element>)

    top-level supported Microdata roots



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 28

def supported_roots(parsed_body)
  return [] unless parsed_body

  parsed_body.css(ITEM_SELECTOR).select { supported_root?(_1) }
end

.supported_type_name(node) ⇒ String?

Returns supported schema type name when present.

Parameters:

  • node (Nokogiri::XML::Element)

    itemscope candidate node

Returns:

  • (String, nil)

    supported schema type name when present



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 41

def supported_type_name(node)
  normalized_types(node['itemtype']).find { SUPPORTED_TYPES.include?(_1) }
end
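
To illustrate the lookup order, here is a minimal sketch that combines type normalization with a set membership test. EXAMPLE_SUPPORTED_TYPES is a made-up stand-in: apart from 'Product', the real contents of SUPPORTED_TYPES come from Schema::Thing::SUPPORTED_TYPES and are not listed on this page. The sketch also takes the raw itemtype string instead of a node, purely to avoid a Nokogiri dependency.

```ruby
require 'set'

# Assumed example values; only 'Product' is confirmed by the SUPPORTED_TYPES constant.
EXAMPLE_SUPPORTED_TYPES = Set['Article', 'NewsArticle', 'Product'].freeze

def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

# Returns the first declared type that is supported, or nil when none is.
def supported_type_name(itemtype)
  normalized_types(itemtype).find { |type| EXAMPLE_SUPPORTED_TYPES.include?(type) }
end

supported_type_name('https://schema.org/Thing https://schema.org/Product') # => "Product"
supported_type_name('https://schema.org/Person')                           # => nil
```

When an element declares several types, the first supported one in document order is returned, so the order of values in itemtype matters.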

.top_level_item?(node) ⇒ Boolean

Parameters:

  • node (Nokogiri::XML::Element)

    itemscope candidate node

Returns:

  • (Boolean)

    true when the node has no itemprop attribute and no ancestor carries both itemscope and itemprop


# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 55

def top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end
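
The nesting rule can be explored without Nokogiri by using a tiny stand-in node type. In this sketch, FakeNode and its `attribute` and `ancestors` methods are hypothetical; they only imitate the two Nokogiri calls used above.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Element (assumption for illustration).
FakeNode = Struct.new(:attrs, :parent) do
  # Returns the attribute value or nil, enough for the truthiness checks below.
  def attribute(name)
    attrs[name]
  end

  # Walks parent links from the nearest ancestor to the farthest.
  def ancestors
    list = []
    node = parent
    while node
      list << node
      node = node.parent
    end
    list
  end
end

def top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end

article = FakeNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, nil)
author  = FakeNode.new({ 'itemscope' => '', 'itemprop' => 'author' }, article)
address = FakeNode.new({ 'itemscope' => '' }, author)

top_level_item?(article) # => true
top_level_item?(author)  # => false, it is a property of the article
top_level_item?(address) # => false, an ancestor is itself a nested property item
```

The second check is what excludes items that sit inside a nested property item even though they carry no itemprop of their own.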

Instance Method Details

#each {|article| ... } ⇒ Enumerator, void

Iterates over normalized article hashes extracted from supported Microdata roots.

Yield Parameters:

  • article (Hash{Symbol => Object})

    the normalized article attributes.

Returns:

  • (Enumerator, void)

    an enumerator when no block is given.



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 81

def each
  return enum_for(:each) unless block_given?

  self.class.supported_roots(parsed_body).each do |root|
    article = article_from(root)
    yield article if article
  end
end
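
The `enum_for` guard lets callers use the scraper with or without a block, and including Enumerable then supplies `map`, `select`, `first`, and similar methods on top of `each`. Below is a minimal sketch of the same pattern, using a hypothetical TitleList class and made-up data in place of Microdata roots.

```ruby
# Hypothetical class mirroring the each/enum_for pattern; not part of html2rss.
class TitleList
  include Enumerable

  def initialize(raw_titles)
    @raw_titles = raw_titles
  end

  def each
    return enum_for(:each) unless block_given?

    @raw_titles.each do |raw|
      article = article_from(raw)
      yield article if article # skip entries that could not be normalized
    end
  end

  private

  # Stand-in for the scraper's article builder (assumption).
  def article_from(raw)
    title = raw.to_s.strip
    title.empty? ? nil : { title: title }
  end
end

list = TitleList.new([' First ', '', 'Second'])
list.each.class                     # => Enumerator
list.map { |item| item[:title] }    # => ["First", "Second"]
list.first                          # => { title: 'First' }
```

Entries for which the builder returns nil are never yielded, so consumers do not need to filter out nils themselves.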