Module: WebStruct::Http::Shell

Defined in:
lib/webstruct/http/shell.rb

Overview

Heuristic checks on raw HTML to detect typical JavaScript app shells before treating a scrape as successful. Does not execute JavaScript. False positives and false negatives are possible; see predicates and constants.

Constant Summary

MIN_BODY_TEXT =

Visible body text length at or above which the SPA mount heuristic is ignored.

120
HEAVY_SCRIPT_THRESHOLD =

Script tag count at or above which the document is treated as script-heavy.

8
HEAVY_SCRIPT_MAX_TEXT =

Upper bound on visible body text that may still pair with a heavy script load.

90
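
A minimal sketch of how the two heavy-script constants might combine into the :heavy_scripts_low_text signal. The module name, the helper heavy_scripts_low_text?, and its two integer parameters are hypothetical, and treating the text bound as inclusive is an assumption; the library's actual check is not shown on this page.

```ruby
# Hypothetical sketch; not the library's implementation.
module ThresholdSketch
  HEAVY_SCRIPT_THRESHOLD = 8  # value from the constant summary
  HEAVY_SCRIPT_MAX_TEXT  = 90 # value from the constant summary

  module_function

  # script_count:        number of <script> tags in the document
  # visible_text_length: length of the visible body text
  # Assumption: the text bound is inclusive (<=); the docs only say "upper bound".
  def heavy_scripts_low_text?(script_count, visible_text_length)
    script_count >= HEAVY_SCRIPT_THRESHOLD &&
      visible_text_length <= HEAVY_SCRIPT_MAX_TEXT
  end
end
```

Under these assumptions, a page with many script tags but a long visible body would not be flagged, because both conditions must hold.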
SECONDARY_SHELL_SIGNALS =

Symbols that count as secondary evidence of a JS shell. They are not sufficient on their own: #shell? also requires :very_low_text, unless :noscript_gate is present. See #secondary_shell?.

%i[
  next_data
  nuxt
  next_static
  empty_spa_mount
  heavy_scripts_low_text
].freeze
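
The rule described for SECONDARY_SHELL_SIGNALS could be sketched roughly as follows. This is an illustration only: the bodies of #shell? and #secondary_shell? are not shown on this page, the module name is hypothetical, and reading ":noscript_gate" as sufficient by itself is an assumption.

```ruby
# Hypothetical sketch; the library's real #shell? and #secondary_shell? may differ.
module ShellRuleSketch
  SECONDARY_SHELL_SIGNALS = %i[
    next_data
    nuxt
    next_static
    empty_spa_mount
    heavy_scripts_low_text
  ].freeze

  module_function

  # True when at least one secondary signal is present.
  def secondary_shell?(signals)
    signals.any? { |signal| SECONDARY_SHELL_SIGNALS.include?(signal) }
  end

  # Assumption: :noscript_gate decides on its own; otherwise a secondary
  # signal only counts when :very_low_text is also present.
  def shell?(signals)
    return true if signals.include?(:noscript_gate)

    signals.include?(:very_low_text) && secondary_shell?(signals)
  end
end
```

With this reading, a server-rendered Next.js page that carries :next_data but plenty of visible text would not be treated as a shell.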

Class Method Summary collapse

Class Method Details

.detect!(html) ⇒ void

This method returns an undefined value.

Runs shell heuristics on HTML and raises when the page should not be scraped statically.

Parameters:

  • html (String, nil)

    full HTML document (or empty), typically an HTTP body

Raises:

  • (JavaScriptRequiredError)

    when the collected signals satisfy #shell?

# File 'lib/webstruct/http/shell.rb', line 36

def detect!(html)
  signals = signals_for(html)
  raise(JavaScriptRequiredError, signals) if shell?(signals)
end
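
Because detect! signals a shell by raising, a caller would typically rescue the error and fall back to a JavaScript-capable fetch. The sketch below shows that pattern with stand-ins so it runs on its own: StandInShell, its toy marker check, choose_fetch_strategy, and the signals reader on the error class are all hypothetical, and the real error's namespace and constructor are not shown on this page.

```ruby
# Stand-in for the library's JavaScriptRequiredError (assumed shape).
class JavaScriptRequiredError < StandardError
  attr_reader :signals

  def initialize(signals = [])
    @signals = Array(signals)
    super("JavaScript required (signals: #{@signals.join(', ')})")
  end
end

# Stand-in detector with a toy rule; the real heuristics live in
# WebStruct::Http::Shell and are more involved.
module StandInShell
  module_function

  def detect!(html)
    signals = html.to_s.include?("__NEXT_DATA__") ? %i[very_low_text next_data] : []
    raise(JavaScriptRequiredError, signals) unless signals.empty?
  end
end

# Returns :static when the detector stays silent, :headless when it raises.
def choose_fetch_strategy(html, detector: StandInShell)
  detector.detect!(html)
  :static
rescue JavaScriptRequiredError
  :headless
end
```

Passing the detector as a keyword argument keeps the caller testable; in real code it would presumably be WebStruct::Http::Shell.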

.signals_for(html) ⇒ Array<Symbol>

Returns symbolic flags describing possible JS-app-shell characteristics.

Parameters:

  • html (String, nil)

    full HTML document (or empty)

Returns:

  • (Array<Symbol>)

    Heuristic symbols (e.g. :next_data, :very_low_text). Blank input returns []; markup without a body element returns [:no_body].



# File 'lib/webstruct/http/shell.rb', line 46

def signals_for(html)
  return [] if html.nil? || html.strip.empty?

  doc = Nokogiri::HTML(html)
  return [:no_body] if doc.at_css("body").nil?

  collect_signals(doc, html)
end
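
A caller that wants to branch on the result of signals_for could distinguish the two special return values documented above from a list of heuristic flags. The helper name classify_signals and its return symbols are hypothetical. Note that [] may mean blank input or, presumably, a document where nothing was flagged; the page does not say which other inputs yield [].

```ruby
# Hypothetical helper operating on the Array<Symbol> returned by signals_for.
def classify_signals(signals)
  case signals
  when []         then :nothing_flagged  # blank input, or (assumed) no heuristic fired
  when [:no_body] then :no_body_element  # markup parsed without a <body>
  else                 :flagged          # one or more heuristic symbols
  end
end
```

Array's case equality compares contents, so the when clauses match only the exact empty and single-element arrays.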