Class: Html2rss::AutoSource::Scraper::JsonState
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::JsonState
- Includes: Enumerable
- Defined in: lib/html2rss/auto_source/scraper/json_state.rb
Overview
Scrapes JSON state blobs embedded in script tags, such as those emitted by Next.js, Nuxt, or custom window globals. The scraper searches `<script type="application/json">` tags and well-known JavaScript globals for arrays of article-like hashes and normalises them to a structure compatible with HtmlExtractor.
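The discovery step described above can be sketched with the standard library alone. This is an illustrative approximation, not the library's own scanner: `json_documents_from`, `SCRIPT_JSON`, and `NEXT_DATA_ASSIGNMENT` are hypothetical names, and the regex-based tag matching is an assumption standing in for whatever HTML parsing the real scraper applies to parsed_body.

```ruby
require 'json'

# Hypothetical stand-ins, not part of html2rss.
# Matches the body of <script type="application/json"> tags.
SCRIPT_JSON = %r{<script[^>]*type="application/json"[^>]*>(.*?)</script>}m
# Matches one known global assignment and captures the object literal after it.
NEXT_DATA_ASSIGNMENT =
  /(?:window|self|globalThis)\.__NEXT_DATA__\s*=\s*(\{.*?\})\s*;?\s*<\/script>/m

# Collects every payload that parses as JSON; anything else is skipped.
def json_documents_from(html)
  payloads = html.scan(SCRIPT_JSON).flatten + html.scan(NEXT_DATA_ASSIGNMENT).flatten
  payloads.filter_map do |payload|
    JSON.parse(payload, symbolize_names: true)
  rescue JSON::ParserError
    nil # not valid JSON, so not a usable state blob
  end
end

html = <<~HTML
  <script type="application/json">{"items":[{"title":"A","url":"/a"}]}</script>
  <script>window.__NEXT_DATA__ = {"props":{"posts":[{"headline":"B"}]}};</script>
  <script type="application/json">not json</script>
HTML

documents = json_documents_from(html)
puts documents.length
```

The sketch only handles payloads that are strict JSON. State assigned as a JavaScript expression (for example, a function call) would need different handling.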
Constant Summary
- JSON_SCRIPT_SELECTOR =
Selector for JSON-only script tags.
'script[type="application/json"]'
- GLOBAL_ASSIGNMENT_PATTERNS =
Regex patterns for known global JavaScript state assignments.
[
  /(?:window|self|globalThis)\.__NEXT_DATA__\s*=\s*/m,
  /(?:window|self|globalThis)\.__NUXT__\s*=\s*/m,
  /(?:window|self|globalThis)\.STATE\s*=\s*/m,
  /(?:window|self|globalThis)\.__REDUX_STATE__\s*=\s*/m,
  /(?:window|self|globalThis)\.__PRELOADED_STATE__\s*=\s*/m,
  /(?:window|self|globalThis)\.__APOLLO_STATE__\s*=\s*/m,
  /(?:window|self|globalThis)\.__remixContext\s*=\s*/m,
  /(?:window|self|globalThis)\.__sveltekit_data\s*=\s*/m,
  /(?:window|self|globalThis)\.GATSBY_STATE\s*=\s*/m,
  /(?:window|self|globalThis)\.__ember_meta\s*=\s*/m,
  /(?:window|self|globalThis)\.angular\s*=\s*/m
].freeze
- TITLE_KEYS =
Preferred keys when extracting title-like values from state payloads.
%i[title headline name text].freeze
- URL_KEYS =
Preferred keys when extracting URL-like values from state payloads.
%i[url link href permalink slug path canonicalUrl shortUrl].freeze
- DESCRIPTION_KEYS =
Preferred keys when extracting description-like values from state payloads.
%i[description summary excerpt dek subheading].freeze
- IMAGE_KEYS =
Preferred keys when extracting image-like values from state payloads.
%i[image imageUrl thumbnailUrl thumbnail src featuredImage coverImage heroImage].freeze
- PUBLISHED_AT_KEYS =
Preferred keys when extracting publication timestamps from state payloads.
%i[published_at publishedAt datePublished date publicationDate pubDate updatedAt updated_at createdAt created_at].freeze
- CATEGORY_KEYS =
Preferred keys when extracting category-like values from state payloads.
%i[categories tags section sections topic topics channel].freeze
- ID_KEYS =
Preferred keys when extracting identifier-like values from state payloads.
%i[id guid uuid slug key].freeze
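One way to read the "preferred keys" lists is as an ordered lookup: try each key in turn and keep the first usable value. The sketch below illustrates that reading for two of the lists. `first_present` and `normalise` are hypothetical helpers, and treating list order as priority is an assumption, not something this page confirms about the library's internals.

```ruby
# Key lists copied from the constants above.
TITLE_KEYS = %i[title headline name text].freeze
URL_KEYS = %i[url link href permalink slug path canonicalUrl shortUrl].freeze

# Hypothetical helper: value of the first key in `keys` that is neither nil nor blank.
def first_present(hash, keys)
  keys.each do |key|
    value = hash[key]
    return value unless value.nil? || value.to_s.strip.empty?
  end
  nil
end

# Hypothetical helper: reduces a raw state hash to a small, uniform shape.
def normalise(candidate)
  {
    title: first_present(candidate, TITLE_KEYS),
    url: first_present(candidate, URL_KEYS)
  }
end

candidate = { name: 'Fallback name', headline: 'Release notes', slug: 'release-notes', link: '' }
p normalise(candidate)
# prints {title: "Release notes", url: "release-notes"} (hash syntax varies by Ruby version)
```

Here `headline` wins over `name` because it appears earlier in TITLE_KEYS, and the blank `link` is skipped in favour of `slug`.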
Instance Attribute Summary
- #parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
- .json_documents(parsed_body) ⇒ Array<Hash, Array>
Parsed JSON documents discovered in the response body.
- .options_key ⇒ Symbol
Scraper config key.
Instance Method Summary
- #each {|Hash{Symbol => Object}| ... } ⇒ Enumerator, void
Article enumerator when no block is given.
- #initialize(parsed_body, url:, **_opts) ⇒ JsonState (constructor)
A new instance of JsonState.
Constructor Details
#initialize(parsed_body, url:, **_opts) ⇒ JsonState
Returns a new instance of JsonState.
# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 411
def initialize(parsed_body, url:, **_opts)
  @parsed_body = parsed_body
  @url = url
end
Instance Attribute Details
#parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 416
def parsed_body
  @parsed_body
end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 394
def articles?(parsed_body)
  return false unless parsed_body

  DocumentScanner.json_documents(parsed_body).any? { CandidateDetector.candidate_array?(_1) }
end
.json_documents(parsed_body) ⇒ Array<Hash, Array>
Returns parsed JSON documents discovered in the response body.
# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 402
def json_documents(parsed_body)
  DocumentScanner.json_documents(parsed_body)
end
.options_key ⇒ Symbol
Returns the scraper config key.
# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 390
def self.options_key = :json_state
Instance Method Details
#each {|Hash{Symbol => Object}| ... } ⇒ Enumerator, void
Returns an article enumerator when no block is given.
# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 420
def each
  return enum_for(:each) unless block_given?

  DocumentScanner.json_documents(parsed_body).each do |document|
    discover_articles(document) do |article|
      yield article if article
    end
  end
end
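The `enum_for(:each)` guard combined with `include Enumerable` is a common Ruby pattern, and it can be tried in isolation. The toy class below is a sketch, not the html2rss implementation: `ToyStateScraper` and its `walk` method are hypothetical, and treating any hash with a `:title` key as an article is an assumption made only to keep the example short.

```ruby
# Toy stand-in for the scraper, fed already-parsed documents.
class ToyStateScraper
  include Enumerable

  def initialize(documents)
    @documents = documents
  end

  # Yields each article, or returns an Enumerator when called without a block.
  def each(&block)
    return enum_for(:each) unless block

    @documents.each { |document| walk(document, &block) }
  end

  private

  # Assumption: a hash with a :title key counts as an article.
  # Other hashes and arrays are searched recursively.
  def walk(node, &block)
    case node
    when Hash
      if node.key?(:title)
        block.call(node)
      else
        node.each_value { |value| walk(value, &block) }
      end
    when Array
      node.each { |value| walk(value, &block) }
    end
  end
end

scraper = ToyStateScraper.new(
  [{ props: { posts: [{ title: 'A' }, { title: 'B' }] } }, [{ title: 'C' }]]
)

lazy_articles = scraper.each            # no block: an Enumerator
titles = scraper.map { _1[:title] }     # Enumerable methods build on #each
p titles # prints ["A", "B", "C"]
```

Because `#each` returns an Enumerator when no block is given, callers can chain it (`scraper.each.with_index`) or pull articles one at a time with `next`.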