WebStruct

Fetch HTTP(S) URLs and turn responses into a small WebStruct::Page object: normalized Content-Type, decoded body text, and content.parsed data shaped by MIME (HTML via Nokogiri, JSON, CSV/TSV, XML, plain text, or raw string for unknown types).

Redirects are followed (configurable limit). HTML responses are checked with lightweight heuristics for typical client-rendered “shell” pages; those may raise JavaScriptRequiredError so callers do not treat empty SPAs as successful static scrapes.

Requirements

  • Ruby ≥ 3.4.0

Installation

Add to your Gemfile:

gem "webstruct"

Install the gem locally:

bundle install

Usage

require "webstruct"

page = WebStruct.scrape("https://example.com/")

page.content_type     # => "text/html" (base type, lowercased, no parameters)
page.content.raw      # => original body string as returned
page.content.parsed   # => Nokogiri::HTML::Document for HTML, Hash for JSON, etc.
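
The base-type normalization shown above can be illustrated with plain string handling (a sketch of the described behavior, not the gem's actual code):

```ruby
# Normalize a raw Content-Type header to its lowercased base type,
# dropping parameters such as charset (illustrative only).
raw  = "Text/HTML; charset=UTF-8"
base = raw.split(";").first.strip.downcase
base  # => "text/html"
```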

CSV and TSV header detection

For text/csv and text/tab-separated-values, content.parsed is always a CSV::Table. The gem inspects the first two rows (streaming) and decides whether to parse with headers: true (the first row becomes column names) or headers: false (every row is data, still wrapped as a table).

The rules are conservative: if it is not reasonably clear that the first row is a header row, the body is parsed without headers. That includes single-row files, all-numeric grids (e.g. "1,2\n3,4"), duplicate values in the first row, and all-text rows with no numeric-looking cell in the second row (e.g. "a,b\nc,d").

Clear cases such as "a,b\n1,2" or "id,name\n1,Bob" use headers: true so #map(&:to_h) yields one hash per data row. For no-header parses, use #map(&:fields) when you want an array of string arrays per row. Heuristics can still be wrong for unusual spreadsheets; post-process content.parsed in your app if needed.
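
Both outcomes can be reproduced with Ruby's stdlib CSV (a sketch of the resulting shapes; the gem's own heuristic code is not shown here):

```ruby
require "csv"

# Clear header row: first row is text, second has numeric-looking cells.
table = CSV.parse("id,name\n1,Bob", headers: true)
table.map(&:to_h)    # => [{"id" => "1", "name" => "Bob"}]

# All-numeric grid: no header detected, every row stays data.
rows = CSV.parse("1,2\n3,4")
rows                 # => [["1", "2"], ["3", "4"]]
```

Note that plain CSV.parse returns arrays of fields in the no-header case, whereas the gem still wraps such rows in a CSV::Table, which is where #map(&:fields) applies.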

HTTP options

Keywords are forwarded to WebStruct::Http.get. Common options:

Option           Purpose
user_agent       User-Agent header (default identifies the gem).
max_redirects    Maximum number of redirects to follow (default 5).
read_timeout     Read timeout in seconds (default 10).
open_timeout     Connect/open timeout in seconds (default 5).
max_body_bytes   If set, must be a positive Integer; raises BodyTooLargeError when the response body exceeds this many bytes (UTF-8 byte size of the materialized body string). Omit for no limit.

WebStruct.scrape(
  "https://example.com/api",
  user_agent: "MyApp/1.0",
  max_redirects: 3,
  read_timeout: 30,
  max_body_bytes: 2_000_000
)
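
As a reminder of what the limit measures, Ruby's String#bytesize counts UTF-8 bytes, not characters (illustration only):

```ruby
body = "é" * 1_000
body.length    # => 1000 (characters)
body.bytesize  # => 2000 (bytes; the quantity compared against max_body_bytes)
```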

Parsed content

Resolution uses Content-Type and light sniffing when the type is missing or generic (for example application/octet-stream). Typical mappings:

  • HTML / XHTML — Nokogiri::HTML::Document
  • JSON — Hash / Array
  • CSV / TSV — CSV::Table
  • XML — Nokogiri::XML::Document
  • Plain text — normalized string
  • Unknown — raw string
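
A minimal sketch of this kind of MIME-keyed dispatch, using only stdlib parsers (Nokogiri omitted; the mapping below is an assumption for illustration, not the gem's internals):

```ruby
require "json"
require "csv"

# Map base MIME types to parser lambdas; unknown types fall back to raw.
PARSERS = {
  "application/json" => ->(body) { JSON.parse(body) },
  "text/csv"         => ->(body) { CSV.parse(body, headers: true) },
}
PARSERS.default = ->(body) { body }

PARSERS["application/json"].call('{"ok":true}')  # => {"ok" => true}
PARSERS["image/png"].call("raw bytes")           # returned unchanged
```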

Errors

Exception                  When
InvalidUrlError            URL is not an absolute http(s) URL.
ArgumentError              Invalid max_body_bytes.
BodyTooLargeError          Body byte size exceeds max_body_bytes.
JavaScriptRequiredError    HTML looks like a JS-heavy shell; #signals lists heuristic symbols.
ParseError                 Invalid JSON; HTTP and other parser errors are not wrapped.

max_body_bytes and memory

With the default Faraday adapter, the HTTP client may buffer the full response before the library runs the size check. The limit still avoids MIME classification, shell heuristics, and Page parsing for bodies over the cap, which bounds downstream CPU and extra allocations. It does not by itself guarantee a hard upper bound on peak download memory; streaming caps would require a different adapter or custom stack.

JavaScript and SPAs

This gem does not execute JavaScript. If the server returns a minimal HTML shell (for example heavy Next/Nuxt markers or a noscript gate), JavaScriptRequiredError may be raised instead of returning a misleading Page. Use a headless browser or another tool if you need rendered DOM content.
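
For intuition, a crude version of such a heuristic might look like this (the regex signals here are assumptions for illustration; the gem's actual checks and #signals symbols differ):

```ruby
# Very rough shell-page check: an empty mount node plus a script tag.
html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

empty_mount = html.match?(%r{<div id="(?:root|app)">\s*</div>})
has_script  = html.include?("<script")

empty_mount && has_script  # => true, suggesting a JS-rendered shell
```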

License

MIT. See LICENSE.