Module: WebStruct::Http::Shell
- Defined in:
- lib/webstruct/http/shell.rb
Overview
Heuristic checks on raw HTML to detect typical JavaScript app shells before treating a scrape as successful. Does not execute JavaScript. False positives and false negatives are possible; see predicates and constants.
Constant Summary
- MIN_BODY_TEXT = 120
Minimum visible body text length before SPA mount heuristics are ignored.
- HEAVY_SCRIPT_THRESHOLD = 8
Script tag count at or above which the document is treated as script-heavy.
- HEAVY_SCRIPT_MAX_TEXT = 90
Upper bound on visible body text that may still pair with a heavy script load.
- SECONDARY_SHELL_SIGNALS = %i[next_data nuxt next_static empty_spa_mount heavy_scripts_low_text].freeze
Symbols treated as secondary evidence of a JS shell when #shell? also sees :very_low_text. See #secondary_shell?; not sufficient alone without that low-text signal (unless :noscript_gate).
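The constants above can be read as a small decision rule. The sketch below is one possible reading, not the library's implementation: the module ShellRuleSketch and both helper names are hypothetical, the inclusive comparison against HEAVY_SCRIPT_MAX_TEXT is an assumption, and so is the treatment of :noscript_gate as decisive by itself.

```ruby
# Hypothetical restatement of the documented constants and rule.
# Nothing here is taken from lib/webstruct/http/shell.rb beyond the values.
module ShellRuleSketch
  HEAVY_SCRIPT_THRESHOLD = 8
  HEAVY_SCRIPT_MAX_TEXT = 90
  SECONDARY_SHELL_SIGNALS = %i[
    next_data nuxt next_static empty_spa_mount heavy_scripts_low_text
  ].freeze

  module_function

  # Assumed pairing: many script tags together with little visible text.
  # "At or above" comes from the constant's description; treating the text
  # bound as inclusive is a guess.
  def heavy_scripts_low_text?(script_count, text_length)
    script_count >= HEAVY_SCRIPT_THRESHOLD && text_length <= HEAVY_SCRIPT_MAX_TEXT
  end

  # Assumed rule: a secondary signal only counts next to :very_low_text,
  # while :noscript_gate is accepted without it.
  def shell_sketch?(signals)
    return true if signals.include?(:noscript_gate)

    signals.include?(:very_low_text) && (signals & SECONDARY_SHELL_SIGNALS).any?
  end
end
```

Under this reading, `ShellRuleSketch.shell_sketch?(%i[nuxt])` is false because the low-text signal is missing, while `ShellRuleSketch.shell_sketch?(%i[very_low_text nuxt])` is true.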
Class Method Summary
- .detect!(html) ⇒ void
Runs shell heuristics on HTML and raises when the page should not be scraped statically.
- .signals_for(html) ⇒ Array<Symbol>
Returns symbolic flags describing possible JS-app-shell characteristics.
Class Method Details
.detect!(html) ⇒ void
This method returns an undefined value.
Runs shell heuristics on HTML and raises when the page should not be scraped statically.
    # File 'lib/webstruct/http/shell.rb', line 36
    def detect!(html)
      signals = signals_for(html)
      raise(JavaScriptRequiredError, signals) if shell?(signals)
    end
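Because .detect! returns nothing useful and reports its result by raising, a caller would typically wrap it in a rescue. The following is a minimal, self-contained sketch of that pattern. ShellSketch, its placeholder signals_for and shell? bodies, the stand-in error class with its signals reader, and scrape_mode are all hypothetical; only the body of detect! mirrors the listing above.

```ruby
# Stand-in error class so the sketch runs without the gem.
# The signals reader is an assumption about the real error.
class JavaScriptRequiredError < StandardError
  attr_reader :signals

  def initialize(signals)
    @signals = signals
    super("page looks like a JavaScript app shell: #{signals.inspect}")
  end
end

module ShellSketch
  module_function

  # Placeholder detection: flags pages carrying a Next.js data script.
  # The real method parses the document and collects several signals.
  def signals_for(html)
    html.include?("__NEXT_DATA__") ? %i[very_low_text next_data] : []
  end

  # Placeholder predicate; the real #shell? is not shown in the source.
  def shell?(signals)
    signals.include?(:very_low_text)
  end

  # Same control flow as the documented .detect!.
  def detect!(html)
    signals = signals_for(html)
    raise(JavaScriptRequiredError, signals) if shell?(signals)
  end
end

# Hypothetical caller: choose a scraping strategy based on the check.
def scrape_mode(html)
  ShellSketch.detect!(html)
  :static
rescue JavaScriptRequiredError => e
  warn "static scrape rejected: #{e.signals.inspect}"
  :rendered
end
```

Here `raise(JavaScriptRequiredError, signals)` passes the signal list to the error's constructor, so the rescuing code can log or branch on the signals that triggered the rejection.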
.signals_for(html) ⇒ Array<Symbol>
Returns symbolic flags describing possible JS-app-shell characteristics.
    # File 'lib/webstruct/http/shell.rb', line 46
    def signals_for(html)
      return [] if html.nil? || html.strip.empty?

      doc = Nokogiri::HTML(html)
      return [:no_body] if doc.at_css("body").nil?

      collect_signals(doc, html)
    end