WebStruct

Fetch HTTP(S) URLs and turn responses into a small WebStruct::Page object: normalized Content-Type, decoded body text, and content.parsed data shaped by MIME (HTML via Nokogiri, JSON, CSV/TSV, XML, plain text, or raw string for unknown types).

Redirects are followed (configurable limit). HTML responses are checked with lightweight heuristics for typical client-rendered “shell” pages; those may raise JavaScriptRequiredError so callers do not treat empty SPAs as successful static scrapes.

Requirements

  • Ruby ≥ 3.4.0

Installation

Add to your Gemfile:

gem "webstruct"

Install the gem locally:

bundle install

Usage

require "webstruct"

page = WebStruct.scrape("https://example.com/")

page.content_type     # => "text/html" (base type, lowercased, no parameters)
page.content.raw      # => original body string as returned
page.content.parsed   # => Nokogiri::HTML::Document for HTML, Hash for JSON, etc.
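
The base-type normalization shown above can be illustrated with plain string handling (a sketch of the described behavior, not the gem's actual code):

```ruby
# Normalize a raw Content-Type header to its lowercased base type,
# dropping parameters such as charset (illustrative only).
raw  = "Text/HTML; charset=UTF-8"
base = raw.split(";").first.strip.downcase
base  # => "text/html"
```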

CSV and TSV header detection

For text/csv and text/tab-separated-values, content.parsed is always a CSV::Table. The gem inspects the first two rows (streaming) and decides whether to parse with headers: true (the first row becomes column names) or headers: false (every row is data, still wrapped as a table).

The rules are conservative: if it is not reasonably clear that the first row is a header row, the body is parsed without headers. That includes single-row files, all-numeric grids (e.g. "1,2\n3,4"), duplicate values in the first row, and all-text rows with no numeric-looking cell in the second row (e.g. "a,b\nc,d").

Clear cases such as "a,b\n1,2" or "id,name\n1,Bob" use headers: true so #map(&:to_h) yields one hash per data row. For no-header parses, use #map(&:fields) when you want an array of string arrays per row. Heuristics can still be wrong for unusual spreadsheets; post-process content.parsed in your app if needed.
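
Both outcomes can be reproduced with Ruby's stdlib CSV (a sketch of the resulting shapes; the gem's own heuristic code is not shown here):

```ruby
require "csv"

# Clear header row: first row is text, second has numeric-looking cells.
table = CSV.parse("id,name\n1,Bob", headers: true)
table.map(&:to_h)    # => [{"id" => "1", "name" => "Bob"}]

# All-numeric grid: no header detected, every row stays data.
rows = CSV.parse("1,2\n3,4")
rows                 # => [["1", "2"], ["3", "4"]]
```

Note that plain CSV.parse returns arrays of fields in the no-header case, whereas the gem still wraps such rows in a CSV::Table, which is where #map(&:fields) applies.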

HTTP options

Keywords are forwarded to WebStruct::Http.get. Common options:

Option           Purpose
user_agent       User-Agent header (default identifies the gem).
max_redirects    Maximum number of redirects to follow (default 5).
read_timeout     Read timeout in seconds (default 10).
open_timeout     Connect/open timeout in seconds (default 5).
max_body_bytes   If set, must be a positive Integer; raises BodyTooLargeError when the response body exceeds this many bytes (UTF-8 byte size of the materialized body string). Omit for no limit.

WebStruct.scrape(
  "https://example.com/api",
  user_agent: "MyApp/1.0",
  max_redirects: 3,
  read_timeout: 30,
  max_body_bytes: 2_000_000
)
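
As a reminder of what the limit measures, Ruby's String#bytesize counts UTF-8 bytes, not characters (illustration only):

```ruby
body = "é" * 1_000
body.length    # => 1000 (characters)
body.bytesize  # => 2000 (bytes; the quantity compared against max_body_bytes)
```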

Parsed content

Resolution uses Content-Type and light sniffing when the type is missing or generic (for example application/octet-stream). Typical mappings:

  • HTML / XHTML — Nokogiri::HTML::Document
  • JSON — Hash / Array
  • CSV / TSV — CSV::Table
  • XML — Nokogiri::XML::Document
  • Plain text — normalized string
  • Unknown — raw string
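
A minimal sketch of this kind of MIME-keyed dispatch, using only stdlib parsers (Nokogiri omitted; the mapping below is an assumption for illustration, not the gem's internals):

```ruby
require "json"
require "csv"

# Map base MIME types to parser lambdas; unknown types fall back to raw.
PARSERS = {
  "application/json" => ->(body) { JSON.parse(body) },
  "text/csv"         => ->(body) { CSV.parse(body, headers: true) },
}
PARSERS.default = ->(body) { body }

PARSERS["application/json"].call('{"ok":true}')  # => {"ok" => true}
PARSERS["image/png"].call("raw bytes")           # returned unchanged
```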

Errors

Exception                  When
InvalidUrlError            URL is not an absolute http(s) URL.
ArgumentError              Invalid max_body_bytes.
BodyTooLargeError          Body byte size exceeds max_body_bytes.
JavaScriptRequiredError    HTML looks like a JS-heavy shell; #signals lists heuristic symbols.
ParseError                 Invalid JSON; HTTP and other parser errors are not wrapped.

max_body_bytes and memory

With the default Faraday adapter, the HTTP client may buffer the full response before the library runs the size check. The limit still avoids MIME classification, shell heuristics, and Page parsing for bodies over the cap, which bounds downstream CPU and extra allocations. It does not by itself guarantee a hard upper bound on peak download memory; streaming caps would require a different adapter or custom stack.

JavaScript and SPAs

This gem does not execute JavaScript. If the server returns a minimal HTML shell (for example heavy Next/Nuxt markers or a noscript gate), JavaScriptRequiredError may be raised instead of returning a misleading Page. Use a headless browser or another tool if you need rendered DOM content.
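
For intuition, a crude version of such a heuristic might look like this (the regex signals here are assumptions for illustration; the gem's actual checks and #signals symbols differ):

```ruby
# Very rough shell-page check: an empty mount node plus a script tag.
html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

empty_mount = html.match?(%r{<div id="(?:root|app)">\s*</div>})
has_script  = html.include?("<script")

empty_mount && has_script  # => true, suggesting a JS-rendered shell
```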

License

MIT. See LICENSE.