Class: Html2rss::AutoSource::Scraper::Schema
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::Schema
- Includes: Enumerable
- Defined in:
  lib/html2rss/auto_source/scraper/schema.rb,
  lib/html2rss/auto_source/scraper/schema/thing.rb,
  lib/html2rss/auto_source/scraper/schema/item_list.rb,
  lib/html2rss/auto_source/scraper/schema/list_item.rb,
  lib/html2rss/auto_source/scraper/schema/category_extractor.rb
Overview
Scrapes articles from Schema.org objects by looking for them in <script type="application/ld+json"> ("schema") tags.
Defined Under Namespace
Modules: CategoryExtractor
Classes: ItemList, ListItem, Thing
Constant Summary
- TAG_SELECTOR = 'script[type="application/ld+json"]'
  Selector for JSON-LD script tags containing Schema.org objects.
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
  Whether the page includes supported schema types.
- .from(object) ⇒ Array<Hash>
  Returns a flat array of all supported schema objects by recursively traversing the given `object`.
- .options_key ⇒ Symbol
  Scraper config key.
- .scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Thing, ...
  A class responding to `#call`.
- .supported_schema_object?(object) ⇒ Boolean
  Whether an extractor exists for the candidate object.
- .supported_schema_type?(script) ⇒ Boolean
  Whether the tag references a supported schema type.
Instance Method Summary
- #each {|Hash| ... } ⇒ Array<Hash>
  The scraped article_hashes.
- #initialize(parsed_body, url:, **opts) ⇒ Schema (constructor)
  A new instance of Schema.
Constructor Details
#initialize(parsed_body, url:, **opts) ⇒ Schema
Returns a new instance of Schema.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 95
    def initialize(parsed_body, url:, **opts)
      @parsed_body = parsed_body
      @url = url
      @opts = opts
    end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
Returns whether the page includes supported schema types.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 27
    def articles?(parsed_body)
      parsed_body.css(TAG_SELECTOR).any? { |script| supported_schema_type?(script) }
    end
.from(object) ⇒ Array<Hash>
Returns a flat array of all supported schema objects by recursively traversing the given `object`.
:reek:DuplicateMethodCall
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 46
    def from(object)
      case object
      when Nokogiri::XML::Element
        from(parse_script_tag(object))
      when Hash
        supported_schema_object?(object) ? [object] : object.values.flat_map { |item| from(item) }
      when Array
        object.flat_map { |item| from(item) }
      else
        []
      end
    end
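The recursive traversal can be illustrated without the library. The following is a minimal sketch, not the library's implementation: `collect_supported` and `SUPPORTED_TYPES` are hypothetical names, the type list is an assumption, and the Nokogiri branch is left out so that only the Hash and Array handling is shown.

```ruby
require 'json'

# Hypothetical stand-in for the union of Thing::SUPPORTED_TYPES and
# ItemList::SUPPORTED_TYPES; the real lists are defined in the library.
SUPPORTED_TYPES = %w[Article NewsArticle ItemList].freeze

# Mirrors the shape of `.from` for already-parsed JSON (no Nokogiri branch).
def collect_supported(object)
  case object
  when Hash
    if SUPPORTED_TYPES.include?(object[:@type])
      [object] # a supported object is returned as-is and not descended into
    else
      object.values.flat_map { |value| collect_supported(value) }
    end
  when Array
    object.flat_map { |item| collect_supported(item) }
  else
    [] # strings, numbers, nil, ...
  end
end

json = <<~JSON
  {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "WebSite", "name": "Example" },
      { "@type": "Article", "headline": "First" },
      { "@type": "WebPage", "mainEntity": { "@type": "NewsArticle", "headline": "Second" } }
    ]
  }
JSON

parsed = JSON.parse(json, symbolize_names: true)
found = collect_supported(parsed)
puts found.map { |object| object[:headline] }.inspect # prints ["First", "Second"]
```

Because a supported Hash is returned immediately, supported objects nested inside another supported object are not listed separately; they stay embedded in their parent.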
.options_key ⇒ Symbol
Returns scraper config key.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 22
    def self.options_key = :schema
.scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Thing, ...
Returns a class responding to `#call`.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 68
    def scraper_for_schema_object(schema_object)
      type = schema_object[:@type]

      if Thing::SUPPORTED_TYPES.member?(type)
        Thing
      elsif ItemList::SUPPORTED_TYPES.member?(type)
        ItemList
      else
        Log.debug("#{name}: unsupported schema object @type=#{type.inspect}")
        nil
      end
    end
.supported_schema_object?(object) ⇒ Boolean
Returns whether an extractor exists for the candidate object.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 61
    def supported_schema_object?(object)
      scraper_for_schema_object(object) ? true : false
    end
.supported_schema_type?(script) ⇒ Boolean
Returns whether the tag references a supported schema type.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 33
    def supported_schema_type?(script)
      supported_types = Thing::SUPPORTED_TYPES | ItemList::SUPPORTED_TYPES
      supported_types.any? { |type| script.text.match?(/"@type"\s*:\s*"#{Regexp.escape(type)}"/) }
    end
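This check runs against the raw text of the script tag, before any JSON parsing. A small sketch of the same pattern applied to plain strings may make its behavior easier to see. `mentions_supported_type?` and `EXAMPLE_TYPES` are hypothetical names, and the type list is an assumption made for illustration.

```ruby
# Hypothetical type list; the library derives the real one from
# Thing::SUPPORTED_TYPES | ItemList::SUPPORTED_TYPES.
EXAMPLE_TYPES = %w[Article NewsArticle ItemList].freeze

# Same regular expression as in supported_schema_type?, applied to a plain
# string instead of the text of a Nokogiri element.
def mentions_supported_type?(text)
  EXAMPLE_TYPES.any? do |type|
    text.match?(/"@type"\s*:\s*"#{Regexp.escape(type)}"/)
  end
end

puts mentions_supported_type?('{"@type": "Article", "headline": "Hi"}') # prints true
puts mentions_supported_type?('{"@type":"ItemList"}')                    # prints true
puts mentions_supported_type?('{"@type": "WebSite"}')                    # prints false
puts mentions_supported_type?('{"@type": ["Article", "BlogPosting"]}')   # prints false
```

As the last call shows, the pattern only matches a string-valued `@type`; a tag whose only supported type appears inside an array would not pass this text-level check.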
Instance Method Details
#each {|Hash| ... } ⇒ Array<Hash>
Returns the scraped article_hashes.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 104
    def each(&)
      return enum_for(:each) unless block_given?

      schema_objects.filter_map do |schema_object|
        next unless (klass = self.class.scraper_for_schema_object(schema_object))
        next unless (results = klass.new(schema_object, url:).call)

        if results.is_a?(Array)
          results.each { |result| yield(result) } # rubocop:disable Style/ExplicitBlockArgument
        else
          yield(results)
        end
      end
    end
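The pattern used by `#each` (pick an extractor per object, call it, and yield either one result or each element of an Array) can be sketched with stand-in classes. Everything below is hypothetical and written for illustration only: `MiniSchema`, `SingleExtractor`, `ListExtractor`, and the `EXTRACTORS` table are not part of the library, and the real extractors return richer hashes.

```ruby
# Stand-in for an extractor that returns a single Hash.
class SingleExtractor
  def initialize(object, url:)
    @object = object
    @url = url
  end

  def call
    { title: @object[:headline], url: @url }
  end
end

# Stand-in for an extractor that returns an Array of Hashes.
class ListExtractor
  def initialize(object, url:)
    @object = object
    @url = url
  end

  def call
    @object[:itemListElement].map { |item| { title: item[:name], url: @url } }
  end
end

class MiniSchema
  include Enumerable

  # Hypothetical lookup table standing in for scraper_for_schema_object.
  EXTRACTORS = { 'Article' => SingleExtractor, 'ItemList' => ListExtractor }.freeze

  def initialize(schema_objects, url:)
    @schema_objects = schema_objects
    @url = url
  end

  def each
    return enum_for(:each) unless block_given?

    @schema_objects.each do |schema_object|
      klass = EXTRACTORS[schema_object[:@type]]
      next unless klass # unsupported @type: skip

      results = klass.new(schema_object, url: @url).call
      if results.is_a?(Array)
        results.each { |result| yield result }
      else
        yield results
      end
    end
  end
end

objects = [
  { '@type': 'Article', headline: 'Solo' },
  { '@type': 'ItemList', itemListElement: [{ name: 'A' }, { name: 'B' }] },
  { '@type': 'WebSite', name: 'ignored' }
]

titles = MiniSchema.new(objects, url: 'https://example.com').map { |article| article[:title] }
puts titles.inspect # prints ["Solo", "A", "B"]
```

Including Enumerable means that defining `each` is enough to get `map`, `select`, `to_a`, and the rest, and returning `enum_for(:each)` when no block is given lets callers chain lazily on the resulting Enumerator.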