Class: Html2rss::AutoSource::Scraper::Schema

Inherits:
Object
  • Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb

Overview

Scrapes articles from Schema.org objects by looking for them in <script type="application/ld+json"> ("schema") tags.

Defined Under Namespace

Modules: CategoryExtractor
Classes: ItemList, ListItem, Thing

Constant Summary

TAG_SELECTOR =

Selector for JSON-LD script tags containing Schema.org objects.

'script[type="application/ld+json"]'
SUPPORTED_TYPES_RE =

Pre-compiled regex union for supported schema types. Performs a single pass over script tag text instead of multiple regex matches.

begin
  types = Thing::SUPPORTED_TYPES | ItemList::SUPPORTED_TYPES
  /"@type"\s*:\s*"(?:#{Regexp.union(types.to_a).source})"/
end.freeze
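
As a rough illustration of how such a union behaves, the sketch below rebuilds the pattern from two hypothetical type sets. THING_TYPES and LIST_TYPES are assumptions standing in for Thing::SUPPORTED_TYPES and ItemList::SUPPORTED_TYPES, whose real contents are not shown on this page; demo_supported? is likewise an illustrative helper, not part of the library.

```ruby
require 'set'

# Hypothetical stand-ins; the real sets are defined in Thing and ItemList.
THING_TYPES = Set['Article', 'NewsArticle'].freeze
LIST_TYPES  = Set['ItemList'].freeze

# Same construction as SUPPORTED_TYPES_RE: one alternation over all types,
# so a single match? call covers every supported type.
DEMO_TYPES_RE = begin
  types = THING_TYPES | LIST_TYPES
  /"@type"\s*:\s*"(?:#{Regexp.union(types.to_a).source})"/
end.freeze

# Illustrative helper: checks raw JSON-LD text without parsing it.
def demo_supported?(text)
  text.match?(DEMO_TYPES_RE)
end
```

Because the pattern requires a closing quote directly after the type name, a string such as "ArticleX" does not match, and neither does an @type given as a JSON array.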

Constructor Details

#initialize(parsed_body, url:, **opts) ⇒ Schema

Returns a new instance of Schema.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

  • url (String, Html2rss::Url)

    base page URL

  • opts (Hash)

    scraper-specific options

Options Hash (**opts):

  • :_reserved (Object)

    reserved for future scraper-specific options



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 101

def initialize(parsed_body, url:, **opts)
  @parsed_body = parsed_body
  @url = url
  @opts = opts
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Returns whether the page includes supported schema types.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

Returns:

  • (Boolean)

    whether the page includes supported schema types



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 34

def articles?(parsed_body)
  parsed_body.css(TAG_SELECTOR).any? { |script| supported_schema_type?(script) }
end

.from(object) ⇒ Array<Hash>

Returns a flat array of all supported schema objects by recursively traversing the given `object`.


Parameters:

  • object (Hash, Array, Nokogiri::XML::Element)

Returns:

  • (Array<Hash>)

    the schema_objects, or an empty array



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 52

def from(object)
  case object
  when Nokogiri::XML::Element
    from(parse_script_tag(object))
  when Hash
    supported_schema_object?(object) ? [object] : object.values.flat_map { |item| from(item) }
  when Array
    object.flat_map { |item| from(item) }
  else
    []
  end
end
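
The Hash and Array branches above can be sketched without Nokogiri as follows. This is a minimal illustration, not the library's code: DEMO_SUPPORTED and demo_from are hypothetical names, and the list of supported types is an assumption made for the example.

```ruby
require 'json'

# Hypothetical supported types; the library derives these from Thing and ItemList.
DEMO_SUPPORTED = %w[Article ItemList].freeze

# Mirrors the Hash/Array branches of .from: a supported object is returned
# as-is (without descending into it); anything else is searched recursively.
def demo_from(object)
  case object
  when Hash
    DEMO_SUPPORTED.include?(object[:@type]) ? [object] : object.values.flat_map { |item| demo_from(item) }
  when Array
    object.flat_map { |item| demo_from(item) }
  else
    []
  end
end

json = '{"@graph":[{"@type":"WebSite","name":"Demo"},{"@type":"Article","headline":"Hello"},{"wrapper":{"@type":"ItemList","itemListElement":[]}}]}'
found = demo_from(JSON.parse(json, symbolize_names: true))
found_types = found.map { |obj| obj[:@type] }
```

Here found_types contains only the Article and ItemList entries; the WebSite object is skipped because neither it nor any of its values is a supported object.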

.options_key ⇒ Symbol

Returns the scraper config key.

Returns:

  • (Symbol)

    scraper config key



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 29

def self.options_key = :schema

.scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Thing, ...

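Returns the scraper class matching the object's @type; instances of that class respond to `#call`. Returns nil (and logs a debug message) when the @type is unsupported.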

Parameters:

  • schema_object (Hash{Symbol => Object})

    schema object with an @type key

Returns:

  • (Class, nil)

    Thing or ItemList, or nil when the @type is unsupported



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 74

def scraper_for_schema_object(schema_object)
  type = schema_object[:@type]

  if Thing::SUPPORTED_TYPES.member?(type)
    Thing
  elsif ItemList::SUPPORTED_TYPES.member?(type)
    ItemList
  else
    Log.debug("#{name}: unsupported schema object @type=#{type.inspect}")
    nil
  end
end
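
The same type-to-class dispatch can be sketched with stand-in classes. DemoThing, DemoItemList, and their type sets are hypothetical, and the `find` lookup is a compact variant of the if/elsif chain above rather than the library's implementation (it also omits the debug logging).

```ruby
require 'set'

# Hypothetical extractor classes, standing in for Thing and ItemList.
class DemoThing
  SUPPORTED_TYPES = Set['Article'].freeze
end

class DemoItemList
  SUPPORTED_TYPES = Set['ItemList'].freeze
end

# Returns the first class whose SUPPORTED_TYPES contains the object's @type,
# or nil when no class claims it.
def demo_scraper_for(schema_object)
  type = schema_object[:@type]
  [DemoThing, DemoItemList].find { |klass| klass::SUPPORTED_TYPES.member?(type) }
end

# Boolean wrapper, analogous to supported_schema_object?.
def demo_supported_object?(schema_object)
  !demo_scraper_for(schema_object).nil?
end
```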

.supported_schema_object?(object) ⇒ Boolean

Returns whether an extractor exists for the candidate object.

Parameters:

  • object (Hash{Symbol => Object})

    schema candidate object

Returns:

  • (Boolean)

    whether an extractor exists for the candidate object



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 67

def supported_schema_object?(object)
  scraper_for_schema_object(object) ? true : false
end

.supported_schema_type?(script) ⇒ Boolean

Returns whether the tag references a supported schema type.

Parameters:

  • script (Nokogiri::XML::Element)

    schema JSON-LD script tag

Returns:

  • (Boolean)

    whether the tag references a supported schema type



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 40

def supported_schema_type?(script)
  script.text.match?(SUPPORTED_TYPES_RE)
end

Instance Method Details

#each {|Hash| ... } ⇒ Array<Hash>

Yields each scraped article_hash. When called without a block, returns an Enumerator over the article_hashes instead.

Yields:

  • (Hash)

    Each scraped article_hash

Returns:

  • (Array<Hash>)

    the scraped article_hashes



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 110

def each(&)
  return enum_for(:each) unless block_given?

  schema_objects.filter_map do |schema_object|
    next unless (klass = self.class.scraper_for_schema_object(schema_object))
    next unless (results = klass.new(schema_object, url:).call)

    if results.is_a?(Array)
      results.each { |result| yield(result) } # rubocop:disable Style/ExplicitBlockArgument
    else
      yield(results)
    end
  end
end
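
The pattern used by #each (return an Enumerator when no block is given, skip nil results, and flatten Array results into individual yields) can be sketched in isolation. DemoScraper and its precomputed results are hypothetical; the real method obtains its results by calling the extractor classes.

```ruby
# Hypothetical collection: each entry is one hash, an array of hashes, or nil.
class DemoScraper
  include Enumerable

  def initialize(results)
    @results = results
  end

  def each
    # Without a block, hand back an Enumerator so callers can chain lazily.
    return enum_for(:each) unless block_given?

    @results.each do |result|
      next unless result # skip extractors that produced nothing

      if result.is_a?(Array)
        result.each { |item| yield(item) } # flatten list results
      else
        yield(result)
      end
    end
  end
end

scraper = DemoScraper.new([{ title: 'a' }, nil, [{ title: 'b' }, { title: 'c' }]])
titles = scraper.map { |article| article[:title] }
```

Since the class includes Enumerable, defining each this way is enough to make map, select, first, and similar methods available on the scraper.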