Class: Html2rss::AutoSource::Scraper::JsonState

Inherits:
Object
  • Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/json_state.rb

Overview

Scrapes JSON state blobs embedded in script tags, such as those emitted by Next.js or Nuxt, or assigned to custom window globals. The scraper searches `<script type="application/json">` tags and well-known JavaScript globals for arrays of article-like hashes and normalises them to a structure compatible with HtmlExtractor.
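The search for "arrays of article-like hashes" can be pictured as a recursive walk over a parsed JSON document. The following is a minimal sketch under stated assumptions, not the scraper's actual detection logic: the helper names `article_like?` and `find_article_arrays`, the shortened key lists, and the rule "at least one title key and one URL key" are all hypothetical.

```ruby
require 'json'

# Hypothetical, shortened key lists loosely mirroring TITLE_KEYS and URL_KEYS.
SKETCH_TITLE_KEYS = %w[title headline name].freeze
SKETCH_URL_KEYS = %w[url link href].freeze

# Assumption: a hash is article-like when it has a title key and a URL key.
def article_like?(value)
  value.is_a?(Hash) &&
    SKETCH_TITLE_KEYS.any? { |key| value.key?(key) } &&
    SKETCH_URL_KEYS.any? { |key| value.key?(key) }
end

# Depth-first walk collecting every non-empty array whose elements are all
# article-like. Other arrays and hashes are descended into.
def find_article_arrays(node, found = [])
  case node
  when Array
    if !node.empty? && node.all? { |item| article_like?(item) }
      found << node
    else
      node.each { |child| find_article_arrays(child, found) }
    end
  when Hash
    node.each_value { |child| find_article_arrays(child, found) }
  end
  found
end

state = JSON.parse(<<~JSON)
  {"props": {"pageProps": {"posts": [
    {"headline": "First post", "url": "/posts/1"},
    {"headline": "Second post", "url": "/posts/2"}
  ], "menu": ["home", "about"]}}}
JSON

article_arrays = find_article_arrays(state)
puts article_arrays.first.map { |post| post['headline'] }.inspect
```

The `menu` array is skipped because its elements are plain strings, while `posts` is collected as a single candidate array.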

Constant Summary

JSON_SCRIPT_SELECTOR =

Selector for JSON-only script tags.

'script[type="application/json"]'
GLOBAL_ASSIGNMENT_PATTERNS =

Regex patterns for known global JavaScript state assignments.

[
  /(?:window|self|globalThis)\.__NEXT_DATA__\s*=\s*/m,
  /(?:window|self|globalThis)\.__NUXT__\s*=\s*/m,
  /(?:window|self|globalThis)\.STATE\s*=\s*/m,
  /(?:window|self|globalThis)\.__REDUX_STATE__\s*=\s*/m,
  /(?:window|self|globalThis)\.__PRELOADED_STATE__\s*=\s*/m,
  /(?:window|self|globalThis)\.__APOLLO_STATE__\s*=\s*/m,
  /(?:window|self|globalThis)\.__remixContext\s*=\s*/m,
  /(?:window|self|globalThis)\.__sveltekit_data\s*=\s*/m,
  /(?:window|self|globalThis)\.GATSBY_STATE\s*=\s*/m,
  /(?:window|self|globalThis)\.__ember_meta\s*=\s*/m,
  /(?:window|self|globalThis)\.angular\s*=\s*/m
].freeze
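Each pattern matches only the assignment prefix (for example `window.__NEXT_DATA__ = `), so the JSON payload is whatever follows the match. A minimal sketch of that idea, assuming the payload runs to the end of the script text; the helper `extract_assigned_json` is hypothetical and the real scanner may locate the end of the payload differently:

```ruby
require 'json'

# One pattern copied from GLOBAL_ASSIGNMENT_PATTERNS above.
NEXT_DATA_PATTERN = /(?:window|self|globalThis)\.__NEXT_DATA__\s*=\s*/m

# Hypothetical helper: returns the parsed JSON that follows the assignment,
# or nil when the pattern does not match or the payload is not valid JSON.
def extract_assigned_json(script_text, pattern)
  match = pattern.match(script_text)
  return nil unless match

  # Everything after "... = ", minus surrounding whitespace and one trailing semicolon.
  payload = match.post_match.strip.delete_suffix(';')
  JSON.parse(payload)
rescue JSON::ParserError
  nil
end

script = 'window.__NEXT_DATA__ = {"props":{"posts":[{"title":"Hello","url":"/hello"}]}};'
data = extract_assigned_json(script, NEXT_DATA_PATTERN)
puts data.dig('props', 'posts', 0, 'title')
```

This sketch would fail on scripts that contain further statements after the assignment; handling those requires finding where the JSON value ends, which is outside the scope of the example.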
TITLE_KEYS =

Preferred keys when extracting title-like values from state payloads.

%i[title headline name text].freeze
URL_KEYS =

Preferred keys when extracting URL-like values from state payloads.

%i[url link href permalink slug path canonicalUrl shortUrl].freeze
DESCRIPTION_KEYS =

Preferred keys when extracting description-like values from state payloads.

%i[description summary excerpt dek subheading].freeze
IMAGE_KEYS =

Preferred keys when extracting image-like values from state payloads.

%i[image imageUrl thumbnailUrl thumbnail src featuredImage coverImage heroImage].freeze
PUBLISHED_AT_KEYS =

Preferred keys when extracting publication timestamps from state payloads.

%i[published_at publishedAt datePublished date publicationDate pubDate updatedAt updated_at
createdAt created_at].freeze
CATEGORY_KEYS =

Preferred keys when extracting category-like values from state payloads.

%i[categories tags section sections topic topics channel].freeze
ID_KEYS =

Preferred keys when extracting identifier-like values from state payloads.

%i[id guid uuid slug key].freeze
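The key lists above are ordered by preference, which suggests a "first present key wins" lookup. The sketch below illustrates that reading; the helper `first_present` is hypothetical, and it assumes the state payload uses symbol keys (for example after parsing with `symbolize_names: true`) and that blank values are skipped.

```ruby
# Key lists copied from the constants above.
TITLE_KEYS = %i[title headline name text].freeze
URL_KEYS = %i[url link href permalink slug path canonicalUrl shortUrl].freeze

# Hypothetical helper: value of the first preferred key that is present
# and not blank, or nil when none of the keys yields a usable value.
def first_present(hash, keys)
  keys.each do |key|
    value = hash[key]
    return value unless value.nil? || value.to_s.strip.empty?
  end
  nil
end

entry = { headline: 'Launch day', name: 'ignored', slug: 'launch-day', href: '' }

normalized = {
  title: first_present(entry, TITLE_KEYS),
  url: first_present(entry, URL_KEYS)
}
puts normalized.inspect
```

Here `headline` wins over `name` because it appears earlier in `TITLE_KEYS`, and the blank `href` is passed over in favour of `slug`.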

Constructor Details

#initialize(parsed_body, url:, **_opts) ⇒ JsonState

Returns a new instance of JsonState.

Parameters:

  • parsed_body (Nokogiri::HTML::Document, nil)

    parsed HTML document

  • url (String, Html2rss::Url)

    page URL used to resolve relative links

  • _opts (Hash)

    scraper-specific options

Options Hash (**_opts):

  • :_reserved (Object)

    reserved for future scraper-specific options



# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 411

def initialize(parsed_body, url:, **_opts)
  @parsed_body = parsed_body
  @url = url
end

Instance Attribute Details

#parsed_body ⇒ Object (readonly)

Returns the value of attribute parsed_body.



# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 416

def parsed_body
  @parsed_body
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Parameters:

  • parsed_body (Nokogiri::HTML::Document, nil)

    parsed HTML document

Returns:

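  • (Boolean)

    whether any JSON document discovered in the body passes CandidateDetector.candidate_array?; false when parsed_body is nil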


# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 394

def articles?(parsed_body)
  return false unless parsed_body

  DocumentScanner.json_documents(parsed_body).any? { CandidateDetector.candidate_array?(_1) }
end

.json_documents(parsed_body) ⇒ Array<Hash, Array>

Returns parsed JSON documents discovered in the response body.

Parameters:

  • parsed_body (Nokogiri::HTML::Document, nil)

    parsed HTML document

Returns:

  • (Array<Hash, Array>)

    parsed JSON documents discovered in the response body



# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 402

def json_documents(parsed_body)
  DocumentScanner.json_documents(parsed_body)
end

.options_key ⇒ Symbol

Returns scraper config key.

Returns:

  • (Symbol)

    scraper config key



# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 390

def self.options_key = :json_state

Instance Method Details

#each {|Hash{Symbol => Object}| ... } ⇒ Enumerator, void

Returns article enumerator when no block is given.

Yields:

  • (Hash{Symbol => Object})

    normalized article hash

Returns:

  • (Enumerator, void)

    article enumerator when no block is given



# File 'lib/html2rss/auto_source/scraper/json_state.rb', line 420

def each
  return enum_for(:each) unless block_given?

  DocumentScanner.json_documents(parsed_body).each do |document|
    discover_articles(document) do |article|
      yield article if article
    end
  end
end
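Because the class includes Enumerable and `#each` returns an enumerator when called without a block, callers get `map`, `first`, `count`, and similar methods for free. The following sketch shows that pattern with a stand-in class; `SketchScraper` and its nested-array input are hypothetical and only mirror the shape of `#each` above, not the real document scanning.

```ruby
# Hypothetical stand-in that mirrors the structure of #each shown above.
class SketchScraper
  include Enumerable

  # documents: an array of arrays, each inner array holding article hashes or nil.
  def initialize(documents)
    @documents = documents
  end

  def each
    # Without a block, hand back an Enumerator over the same sequence.
    return enum_for(:each) unless block_given?

    @documents.each do |document|
      document.each { |article| yield article if article }
    end
  end
end

scraper = SketchScraper.new([[{ title: 'A' }, nil], [{ title: 'B' }]])

titles = scraper.map { |article| article[:title] } # Enumerable#map built on #each
first_article = scraper.each.first                 # blockless call returns an Enumerator
puts titles.inspect
```

The `nil` entry is dropped by the `if article` guard, so only two articles are yielded.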