Class: Html2rss::AutoSource::Scraper::Schema

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb

Overview

Scrapes articles from Schema.org objects found in <script type="application/ld+json"> tags.

Defined Under Namespace

Modules: CategoryExtractor
Classes: ItemList, ListItem, Thing

Constant Summary

TAG_SELECTOR = 'script[type="application/ld+json"]'

Selector for JSON-LD script tags containing Schema.org objects.

Constructor Details

#initialize(parsed_body, url:, **opts) ⇒ Schema

Returns a new instance of Schema.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

  • url (String, Html2rss::Url)

    base page URL

  • opts (Hash)

    scraper-specific options

Options Hash (**opts):

  • :_reserved (Object)

    reserved for future scraper-specific options



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 95

def initialize(parsed_body, url:, **opts)
  @parsed_body = parsed_body
  @url = url
  @opts = opts
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Returns whether the page includes supported schema types.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

Returns:

  • (Boolean)

    whether the page includes supported schema types



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 27

def articles?(parsed_body)
  parsed_body.css(TAG_SELECTOR).any? { |script| supported_schema_type?(script) }
end

.from(object) ⇒ Array<Hash>

Returns a flat array of all supported schema objects by recursively traversing the given `object`.

Parameters:

  • object (Hash, Array, Nokogiri::XML::Element)

Returns:

  • (Array<Hash>)

    the schema_objects, or an empty array



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 46

def from(object)
  case object
  when Nokogiri::XML::Element
    from(parse_script_tag(object))
  when Hash
    supported_schema_object?(object) ? [object] : object.values.flat_map { |item| from(item) }
  when Array
    object.flat_map { |item| from(item) }
  else
    []
  end
end
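
The traversal in `.from` can be tried without the library. The following is a minimal sketch, not the library's code: `collect_supported` and `HYPOTHETICAL_SUPPORTED_TYPES` are names invented for illustration, the type list is an assumption (the real sets live in `Thing::SUPPORTED_TYPES` and `ItemList::SUPPORTED_TYPES`), and the JSON-LD is assumed to be parsed with symbol keys because the methods on this page read `schema_object[:@type]`.

```ruby
require 'json'

# Hypothetical stand-in for Thing::SUPPORTED_TYPES | ItemList::SUPPORTED_TYPES.
HYPOTHETICAL_SUPPORTED_TYPES = %w[Article NewsArticle ItemList].freeze

# Mirrors the Hash/Array/else branches of .from: a supported Hash is
# returned as-is (without descending into it); any other Hash or Array
# is searched recursively; scalars contribute nothing.
def collect_supported(object)
  case object
  when Hash
    if HYPOTHETICAL_SUPPORTED_TYPES.include?(object[:@type])
      [object]
    else
      object.values.flat_map { |value| collect_supported(value) }
    end
  when Array
    object.flat_map { |item| collect_supported(item) }
  else
    []
  end
end

json = <<~JSON
  {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "WebSite", "name": "Example" },
      { "@type": "NewsArticle", "headline": "First" },
      { "@type": "WebPage", "mainEntity": { "@type": "Article", "headline": "Second" } }
    ]
  }
JSON

# Assumption: symbol keys, so that object[:@type] works as in the methods above.
found = collect_supported(JSON.parse(json, symbolize_names: true))
puts found.map { |object| object[:headline] }.inspect
# prints ["First", "Second"]
```

Because a supported Hash is returned without further descent, supported objects nested inside another supported object are not collected separately.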

.options_key ⇒ Symbol

Returns scraper config key.

Returns:

  • (Symbol)

    scraper config key



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 22

def self.options_key = :schema

.scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Thing, ...

Returns the scraper class whose instances respond to `#call`, or nil when the @type is unsupported.

Parameters:

  • schema_object (Hash{Symbol => Object})

    schema object with an @type key

Returns:

  • (Class, nil)

    Thing or ItemList, or nil when the @type is unsupported



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 68

def scraper_for_schema_object(schema_object)
  type = schema_object[:@type]

  if Thing::SUPPORTED_TYPES.member?(type)
    Thing
  elsif ItemList::SUPPORTED_TYPES.member?(type)
    ItemList
  else
    Log.debug("#{name}: unsupported schema object @type=#{type.inspect}")
    nil
  end
end

.supported_schema_object?(object) ⇒ Boolean

Returns whether an extractor exists for the candidate object.

Parameters:

  • object (Hash{Symbol => Object})

    schema candidate object

Returns:

  • (Boolean)

    whether an extractor exists for the candidate object



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 61

def supported_schema_object?(object)
  scraper_for_schema_object(object) ? true : false
end

.supported_schema_type?(script) ⇒ Boolean

Returns whether the tag references a supported schema type.

Parameters:

  • script (Nokogiri::XML::Element)

    schema JSON-LD script tag

Returns:

  • (Boolean)

    whether the tag references a supported schema type



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 33

def supported_schema_type?(script)
  supported_types = Thing::SUPPORTED_TYPES | ItemList::SUPPORTED_TYPES
  supported_types.any? { |type| script.text.match?(/"@type"\s*:\s*"#{Regexp.escape(type)}"/) }
end
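
The pattern used by `.supported_schema_type?` is a textual pre-check on the raw script content, not a JSON parse. The sketch below exercises the same regular expression on a few strings; `mentions_type?` and `HYPOTHETICAL_TYPES` are illustrative names, and the type list is an assumption rather than the library's actual set.

```ruby
# Hypothetical type list; the real one is
# Thing::SUPPORTED_TYPES | ItemList::SUPPORTED_TYPES.
HYPOTHETICAL_TYPES = %w[Article ItemList].freeze

# Same regular expression as in .supported_schema_type?, applied to a String.
def mentions_type?(text, types = HYPOTHETICAL_TYPES)
  types.any? { |type| text.match?(/"@type"\s*:\s*"#{Regexp.escape(type)}"/) }
end

compact     = mentions_type?('{"@type":"ItemList"}')             # whitespace is optional
spaced      = mentions_type?('{ "@type" : "Article" }')          # whitespace is tolerated
other_type  = mentions_type?('{"@type": "WebSite"}')             # type not in the list
longer_name = mentions_type?('{"@type": "NewsArticle"}')         # the opening quote must precede the type name
array_value = mentions_type?('{"@type": ["Article", "Thing"]}')  # array-valued @type

puts [compact, spaced, other_type, longer_name, array_value].inspect
# prints [true, true, false, false, false]
```

As the last case shows, this expression only matches a string-valued `@type`; an array-valued `@type` does not match it.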

Instance Method Details

#each {|Hash| ... } ⇒ Array<Hash>

Yields each scraped article hash. When called without a block, returns an Enumerator.

Yields:

  • (Hash)

    Each scraped article_hash

Returns:

  • (Array<Hash>)

    the scraped article_hashes



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 104

def each(&)
  return enum_for(:each) unless block_given?

  schema_objects.filter_map do |schema_object|
    next unless (klass = self.class.scraper_for_schema_object(schema_object))
    next unless (results = klass.new(schema_object, url:).call)

    if results.is_a?(Array)
      results.each { |result| yield(result) } # rubocop:disable Style/ExplicitBlockArgument
    else
      yield(results)
    end
  end
end
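
Since the class includes Enumerable, defining `#each` is what makes `map`, `first`, `to_a` and similar methods available. The sketch below isolates that pattern: skip empty results, flatten array results, and fall back to `enum_for` when no block is given. `SketchScraper` is a hypothetical stand-in fed with precomputed results; it is not the library class and performs no scraping.

```ruby
# Hypothetical stand-in that mimics the control flow of #each above.
class SketchScraper
  include Enumerable

  # results: an Array whose entries are a Hash, an Array of Hashes, or nil,
  # standing in for what each extractor's #call might return.
  def initialize(results)
    @results = results
  end

  def each
    return enum_for(:each) unless block_given?

    @results.each do |result|
      next unless result # skip extractors that produced nothing

      if result.is_a?(Array)
        result.each { |item| yield item } # flatten list results
      else
        yield result
      end
    end
  end
end

scraper = SketchScraper.new([{ title: 'a' }, nil, [{ title: 'b' }, { title: 'c' }]])

titles = scraper.map { |article| article[:title] }
puts titles.inspect
# prints ["a", "b", "c"]

enumerator = scraper.each # no block given, so an Enumerator is returned
puts enumerator.next.inspect
```

With the Enumerator form, callers can pull articles lazily, for example `scraper.first(2)`, without materializing every result first.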