Class: Html2rss::AutoSource::Scraper::Schema
- Inherits: Object
  - Object
    - Html2rss::AutoSource::Scraper::Schema
- Includes:
- Enumerable
- Defined in:
- lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb
Overview
Scrapes articles from Schema.org objects found in <script type="application/ld+json"> ("schema") tags.
Defined Under Namespace
Modules: CategoryExtractor
Classes: ItemList, ListItem, Thing
Constant Summary
- TAG_SELECTOR =
  Selector for JSON-LD script tags containing Schema.org objects.
    'script[type="application/ld+json"]'
- SUPPORTED_TYPES_RE =
  Pre-compiled regex union for supported schema types. Performs a single pass over script tag text instead of multiple regex matches.
    begin
      types = Thing::SUPPORTED_TYPES | ItemList::SUPPORTED_TYPES
      /"@type"\s*:\s*"(?:#{Regexp.union(types.to_a).source})"/
    end.freeze
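A minimal sketch of how such a regex union behaves, using only the standard library. The two type sets below are hypothetical stand-ins; the real lists live in Thing::SUPPORTED_TYPES and ItemList::SUPPORTED_TYPES and are not shown on this page.

```ruby
require 'set'

# Hypothetical type sets, for illustration only.
SKETCH_THING_TYPES = Set['Article', 'NewsArticle'].freeze
SKETCH_ITEM_LIST_TYPES = Set['ItemList'].freeze

# Same construction as SUPPORTED_TYPES_RE: one alternation over all type names.
SKETCH_TYPES_RE = begin
  types = SKETCH_THING_TYPES | SKETCH_ITEM_LIST_TYPES
  /"@type"\s*:\s*"(?:#{Regexp.union(types.to_a).source})"/
end.freeze

SKETCH_TYPES_RE.match?('{"@type": "NewsArticle"}') # => true
SKETCH_TYPES_RE.match?('{"@type": "Person"}')      # => false
# The pattern expects a quoted string right after the colon,
# so an array-valued @type is not matched by this text check.
SKETCH_TYPES_RE.match?('{"@type": ["Article"]}')   # => false
```

The check runs against raw script text, so it works as a cheap pre-filter before any JSON parsing takes place.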
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
  Whether the page includes supported schema types.
- .from(object) ⇒ Array<Hash>
  Returns a flat array of all supported schema objects by recursively traversing the given `object`.
- .options_key ⇒ Symbol
  Scraper config key.
- .scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Thing, ...
  A class responding to `#call`.
- .supported_schema_object?(object) ⇒ Boolean
  Whether an extractor exists for the candidate object.
- .supported_schema_type?(script) ⇒ Boolean
  Whether the tag references a supported schema type.
Instance Method Summary
- #each {|Hash| ... } ⇒ Array<Hash>
  The scraped article_hashes.
- #initialize(parsed_body, url:, **opts) ⇒ Schema (constructor)
  A new instance of Schema.
Constructor Details
#initialize(parsed_body, url:, **opts) ⇒ Schema
Returns a new instance of Schema.
# File 'lib/html2rss/auto_source/scraper/schema.rb', line 101

  def initialize(parsed_body, url:, **opts)
    @parsed_body = parsed_body
    @url = url
    @opts = opts
  end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
Returns whether the page includes supported schema types.
# File 'lib/html2rss/auto_source/scraper/schema.rb', line 34

  def articles?(parsed_body)
    parsed_body.css(TAG_SELECTOR).any? { |script| supported_schema_type?(script) }
  end
.from(object) ⇒ Array<Hash>
Returns a flat array of all supported schema objects by recursively traversing the given `object`.
# File 'lib/html2rss/auto_source/scraper/schema.rb', line 52

  def from(object)
    case object
    when Nokogiri::XML::Element
      from(parse_script_tag(object))
    when Hash
      supported_schema_object?(object) ? [object] : object.values.flat_map { |item| from(item) }
    when Array
      object.flat_map { |item| from(item) }
    else
      []
    end
  end
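The recursive traversal can be sketched on plain Ruby data, leaving out the Nokogiri branch. The helper names (flatten_schema, sketch_supported?) and the list of supported types are assumptions made for this illustration and are not part of the library.

```ruby
# Hypothetical list of supported types, for illustration only.
SKETCH_SUPPORTED_TYPES = %w[Article NewsArticle ItemList].freeze

# Stand-in for .supported_schema_object?
def sketch_supported?(object)
  SKETCH_SUPPORTED_TYPES.include?(object[:@type])
end

# Mirrors the Hash/Array/else branches of .from
def flatten_schema(object)
  case object
  when Hash
    # A supported object is returned as-is; its children are not searched.
    sketch_supported?(object) ? [object] : object.values.flat_map { |item| flatten_schema(item) }
  when Array
    object.flat_map { |item| flatten_schema(item) }
  else
    []
  end
end

graph = {
  '@context': 'https://schema.org',
  '@graph': [
    { '@type': 'WebSite', name: 'Example' },
    { '@type': 'Article', headline: 'Hello' },
    { '@type': 'WebPage', mainEntity: { '@type': 'NewsArticle', headline: 'World' } }
  ]
}

flatten_schema(graph).map { |o| o[:headline] } # => ["Hello", "World"]
```

Unsupported containers such as the WebPage above are searched for nested matches, while scalars and other leaf values contribute nothing to the result.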
.options_key ⇒ Symbol
Returns the scraper config key.
# File 'lib/html2rss/auto_source/scraper/schema.rb', line 29

  def self.options_key = :schema
.scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Thing, ...
Returns a class responding to `#call`.
# File 'lib/html2rss/auto_source/scraper/schema.rb', line 74

  def scraper_for_schema_object(schema_object)
    type = schema_object[:@type]
    if Thing::SUPPORTED_TYPES.member?(type)
      Thing
    elsif ItemList::SUPPORTED_TYPES.member?(type)
      ItemList
    else
      Log.debug("#{name}: unsupported schema object @type=#{type.inspect}")
      nil
    end
  end
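The dispatch on @type can be illustrated with stub classes. ThingStub, ItemListStub, their type sets, and sketch_scraper_for are hypothetical names used only for this sketch; logging is left out.

```ruby
require 'set'

# Hypothetical stand-ins for Thing and ItemList.
class ThingStub
  SUPPORTED_TYPES = Set['Article', 'NewsArticle'].freeze
end

class ItemListStub
  SUPPORTED_TYPES = Set['ItemList'].freeze
end

# Returns the class that handles the object's @type, or nil if none does.
def sketch_scraper_for(schema_object)
  type = schema_object[:@type]
  if ThingStub::SUPPORTED_TYPES.member?(type)
    ThingStub
  elsif ItemListStub::SUPPORTED_TYPES.member?(type)
    ItemListStub
  end
end

sketch_scraper_for({ '@type': 'Article' })   # => ThingStub
sketch_scraper_for({ '@type': 'ItemList' })  # => ItemListStub
sketch_scraper_for({ '@type': 'Person' })    # => nil
sketch_scraper_for({ headline: 'no type' })  # => nil
```

Because the lookup compares @type against set members, an object with no @type key falls through to nil the same way an unsupported type does.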
.supported_schema_object?(object) ⇒ Boolean
Returns whether an extractor exists for the candidate object.
# File 'lib/html2rss/auto_source/scraper/schema.rb', line 67

  def supported_schema_object?(object)
    scraper_for_schema_object(object) ? true : false
  end
.supported_schema_type?(script) ⇒ Boolean
Returns whether the tag references a supported schema type.
# File 'lib/html2rss/auto_source/scraper/schema.rb', line 40

  def supported_schema_type?(script)
    script.text.match?(SUPPORTED_TYPES_RE)
  end
Instance Method Details
#each {|Hash| ... } ⇒ Array<Hash>
Returns the scraped article_hashes.
# File 'lib/html2rss/auto_source/scraper/schema.rb', line 110

  def each(&)
    return enum_for(:each) unless block_given?

    schema_objects.filter_map do |schema_object|
      next unless (klass = self.class.scraper_for_schema_object(schema_object))
      next unless (results = klass.new(schema_object, url:).call)

      if results.is_a?(Array)
        results.each { |result| yield(result) } # rubocop:disable Style/ExplicitBlockArgument
      else
        yield(results)
      end
    end
  end
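A rough sketch of the same iteration pattern on a standalone class: return an Enumerator when no block is given, skip objects without an extractor, and yield array results one element at a time. SketchFeed and its extract helper are hypothetical and simplified (plain each instead of filter_map); they only approximate how an extractor might return either a single hash or an array of hashes.

```ruby
class SketchFeed
  include Enumerable

  def initialize(objects)
    @objects = objects
  end

  def each
    return enum_for(:each) unless block_given?

    @objects.each do |object|
      results = extract(object)
      next unless results # no extractor for this object

      if results.is_a?(Array)
        results.each { |result| yield(result) }
      else
        yield(results)
      end
    end
  end

  private

  # Hypothetical extractor: one hash for an Article, many for an ItemList.
  def extract(object)
    case object[:@type]
    when 'Article' then { title: object[:headline] }
    when 'ItemList' then object[:itemListElement].map { |el| { title: el[:name] } }
    end
  end
end

feed = SketchFeed.new([
  { '@type': 'Article', headline: 'A' },
  { '@type': 'ItemList', itemListElement: [{ name: 'B' }, { name: 'C' }] },
  { '@type': 'Person', name: 'X' }
])

feed.map { |article| article[:title] } # => ["A", "B", "C"]
feed.each.class                        # => Enumerator
```

Including Enumerable means that defining each is enough to get map, select, first, and the other collection methods on the scraper instance.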