# WebStruct
Fetch HTTP(S) URLs and turn responses into a small `WebStruct::Page` object: a normalized Content-Type, the decoded body text, and `content.parsed` data shaped by MIME type (HTML via Nokogiri, JSON, CSV/TSV, XML, plain text, or a raw string for unknown types).

Redirects are followed (with a configurable limit). HTML responses are checked with lightweight heuristics for typical client-rendered “shell” pages; those may raise `JavaScriptRequiredError` so callers do not treat empty SPAs as successful static scrapes.
## Requirements
- Ruby ≥ 3.4.0
## Installation

Add to your Gemfile:

```ruby
gem "webstruct"
```

Install the gem locally:

```shell
bundle install
```
## Usage

```ruby
require "webstruct"

page = WebStruct.scrape("https://example.com/")

page.content_type   # => "text/html" (base type, lowercased, no parameters)
page.content.raw    # => original body string as returned
page.content.parsed # => Nokogiri::HTML::Document for HTML, Hash for JSON, etc.
```
## CSV and TSV header detection
For `text/csv` and `text/tab-separated-values`, `content.parsed` is always a `CSV::Table`. The gem inspects the first two rows (streaming) and decides whether Ruby parses with `headers: true` (the first row becomes column names) or `headers: false` (every row is data, still wrapped as a table).

The rules are conservative: if it is not reasonably clear that the first row is a header row, the body is parsed without headers. That includes single-row files, all-numeric grids (e.g. `"1,2\n3,4"`), duplicate values in the first row, and all-text rows with no numeric-looking cell in the second row (e.g. `"a,b\nc,d"`).

Clear cases such as `"a,b\n1,2"` or `"id,name\n1,Bob"` use `headers: true`, so `#map(&:to_h)` yields one hash per data row. For no-header parses, use `#map(&:fields)` when you want an array of string arrays per row. The heuristics can still be wrong for unusual spreadsheets; post-process `content.parsed` in your app if needed.
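The rules above can be approximated with Ruby's stdlib `CSV`. This is a simplified sketch of a comparable heuristic, not the gem's actual implementation:

```ruby
require "csv"

# Sketch of a conservative header heuristic in the spirit of the rules
# above (assumption: NOT WebStruct's real code). Returns true only when
# the first row plausibly names columns for the data beneath it.
def headers?(body)
  rows = CSV.parse(body).first(2)
  return false if rows.size < 2                  # single-row files: no headers
  first, second = rows
  return false if first.uniq.size != first.size  # duplicate values in row 1
  numeric = ->(cell) { cell.to_s.match?(/\A-?\d+(\.\d+)?\z/) }
  return false if first.any?(&numeric)           # numbers in row 1 look like data
  second.any?(&numeric)                          # need numeric evidence in row 2
end

headers?("a,b\n1,2")      # => true
headers?("id,name\n1,Bob") # => true
headers?("1,2\n3,4")      # => false
headers?("a,b\nc,d")      # => false
headers?("a,b")           # => false
```

If the heuristic disagrees with your data, parse `content.raw` yourself with the `headers:` option you want.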
## HTTP options

Keyword arguments are forwarded to `WebStruct::Http.get`. Common options:

| Option | Purpose |
|---|---|
| `user_agent` | `User-Agent` header (default identifies the gem). |
| `max_redirects` | Maximum number of redirects to follow (default 5). |
| `read_timeout` | Read timeout in seconds (default 10). |
| `open_timeout` | Connect/open timeout in seconds (default 5). |
| `max_body_bytes` | If set, must be a positive Integer; raises `BodyTooLargeError` when the response body exceeds this many bytes (UTF-8 byte size of the materialized body string). Omit for no limit. |
```ruby
WebStruct.scrape(
  "https://example.com/api",
  user_agent: "MyApp/1.0",
  max_redirects: 3,
  read_timeout: 30,
  max_body_bytes: 2_000_000
)
```
## Parsed content

Type resolution uses the `Content-Type` header and light sniffing when the type is missing or generic (for example `application/octet-stream`). Typical mappings:

- HTML / XHTML — `Nokogiri::HTML::Document`
- JSON — `Hash`/`Array`
- CSV / TSV — `CSV::Table`
- XML — `Nokogiri::XML::Document`
- Plain text — normalized string
- Unknown — raw string
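Because `content.parsed` varies by MIME type, callers typically branch on its class. A minimal stdlib-only sketch (plain Ruby values stand in for parsed results; in real use they would come from `page.content.parsed`):

```ruby
require "csv"

# Dispatch on the class of a parsed value, mirroring the mappings above.
# The helper name `summarize` is illustrative, not part of the gem.
def summarize(parsed)
  case parsed
  when CSV::Table  then "table with #{parsed.size} row(s)"
  when Hash, Array then "JSON with #{parsed.size} element(s)"
  when String      then "text, #{parsed.bytesize} byte(s)"
  else parsed.class.name # unknown types fall through to the raw class
  end
end

summarize(CSV.parse("a,b\n1,2", headers: true)) # => "table with 1 row(s)"
summarize({ "id" => 1 })                        # => "JSON with 1 element(s)"
summarize("hello")                              # => "text, 5 byte(s)"
```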
## Errors

| Exception | When |
|---|---|
| `InvalidUrlError` | URL is not an absolute http(s) URL. |
| `ArgumentError` | Invalid `max_body_bytes`. |
| `BodyTooLargeError` | Body byte size exceeds `max_body_bytes`. |
| `JavaScriptRequiredError` | HTML looks like a JS-heavy shell; `#signals` lists heuristic symbols. |
| `ParseError` | Invalid JSON; HTTP and other parser errors are not wrapped. |
## `max_body_bytes` and memory

With the default Faraday adapter, the HTTP client may buffer the full response before the library runs the size check. The limit still avoids MIME classification, shell heuristics, and `Page` parsing for bodies over the cap, which bounds downstream CPU and extra allocations. It does not by itself guarantee a hard upper bound on peak download memory; a streaming cap would require a different adapter or a custom stack.
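Note that the cap is compared against the UTF-8 byte size of the body string, not its character count, so multibyte text can hit the limit sooner than `String#length` suggests:

```ruby
body = "héllo"  # "é" encodes as two bytes in UTF-8

body.length    # => 5 (characters)
body.bytesize  # => 6 (bytes; this is the quantity max_body_bytes limits)
```

With `max_body_bytes: 5`, this body would raise `BodyTooLargeError` even though it is only five characters long.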
## JavaScript and SPAs

This gem does not execute JavaScript. If the server returns a minimal HTML shell (for example, heavy Next/Nuxt markers or a `noscript` gate), `JavaScriptRequiredError` may be raised instead of returning a misleading `Page`. Use a headless browser or another tool if you need rendered DOM content.
## License

MIT. See `LICENSE`.