Iriq

Iriq finds the shape of a URL — the structural template you get when you erase the parts that vary and keep the parts that don't. …/users/123 and …/users/999 are the same shape: /users/{user_id}. Feed iriq a pile of messy URLs — a log file, a column of links, free-text prose — and it collapses them into a small set of stable, deterministic route templates. Fifty thousand distinct URLs become twelve shapes.

(An IRI is just a URL — the internationalized superset of URI/URL that also allows non-ASCII characters. If you know URLs, you know IRIs. The name is IRI Query: iriq queries an IRI for its structure.)

Everything iriq does — parsing, normalizing, classifying path and query components, clustering, learning new patterns — exists to derive, render, or group by that shape.

And it gets sharper the more you feed it. Point a corpus at a stream and classifications improve as data flows in — high-churn slots get promoted to placeholders, and whole types emerge that you can't see in any single URL (a position that's always 100–599 is an HTTP status; one bounded to a dozen values is an enum).

$ iriq -n https://foo.com/users/123
https://foo.com/users/{user_id}

It answers questions like:

  • "What routes does this service actually expose?" (cluster a log file)
  • "Which params are stable identifiers vs. churning IDs vs. enums?" (--stats)
  • "Are these 50,000 distinct URLs really just 12 templates?" (clustering)
  • "What does /api/v1/users/abc-123-def become as a route shape?" (/api/{version}/users/{user_id})

Iriq ships as a command-line tool (iriq) and a Rust library.

Quick start

$ iriq https://foo.com/users/123
# parse
original:      https://foo.com/users/123
kind:          url
scheme:        https
host:          foo.com
path_segments: ["users", "123"]
canonical:     https://foo.com/users/123

# normalize
https://foo.com/users/{user_id}

$ iriq -n https://foo.com/users/123
https://foo.com/users/{user_id}

$ iriq -n https://shop.com/pricing/usd?currency=eur
https://shop.com/pricing/USD?currency=EUR     # currency upcased

$ cat access.log | iriq                       # ≥ 10 IRIs → cluster view
[190] docs.example.com  /users/{user_id}
[186] app.example.com   /users/{user_id}
...

$ cat access.log | iriq --stats               # rolling aggregates
$ iriq ./access.log -n                        # auto-detect file → normalize each
$ iriq -J < access.log                        # newline-delimited JSON
$ iriq --corpus c.db < access.log             # persist into a SQLite corpus

Once a corpus has data, -n becomes corpus-informed — a position that only ever holds integers clusters to a single {user_id} shape, and new values normalize to it:

$ for n in 1 2 3 4 5 6 7 8 9 10; do
    iriq --corpus c.db https://api.foo.com/users/$n >/dev/null
  done

$ iriq -n --corpus c.db https://api.foo.com/users/999
https://api.foo.com/users/{user_id}

Two ways to normalize

Pick by the question you're asking:

  • --canonical — clean up this URL, keeping the specifics. HTTP://Foo.com:80/pull/42 → http://foo.com/pull/42 (scheme/host lowercased, default port dropped; path and query left alone). Handy, but table stakes — plenty of libraries do it.
  • --normalize (the default) — find the URL's shape, erasing the specifics into placeholders. …/pull/42 → …/pull/{id}. This is the part you came to iriq for.

Same input, two questions: "what's the clean form of this URL?" vs "what kind of URL is this?" The second is iriq's reason to exist.
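
The --canonical rules named above (lowercase the scheme and host, drop a default port, leave path and query alone) fit in a few lines of plain Rust. This is a sketch for illustration, not iriq's implementation: canonical_sketch is a hypothetical helper, it assumes 80 and 443 are the only default ports, and it ignores userinfo and the other cleanup iriq performs.

```rust
/// Hypothetical helper (not part of iriq): lowercase scheme and host,
/// drop a default port, keep path and query exactly as written.
fn canonical_sketch(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let scheme = scheme.to_ascii_lowercase();

    // The authority ends at the first '/', '?' or '#'.
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let (authority, tail) = rest.split_at(end);
    let mut authority = authority.to_ascii_lowercase();

    // Assumption: only http and https have a default port worth dropping.
    let default_port = match scheme.as_str() {
        "http" => Some(":80"),
        "https" => Some(":443"),
        _ => None,
    };
    if let Some(port) = default_port {
        if authority.ends_with(port) {
            authority.truncate(authority.len() - port.len());
        }
    }

    Some(format!("{}://{}{}", scheme, authority, tail))
}
```

Shape normalization is the other half: it needs segment classification, which canonicalization never touches.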

Install

# Homebrew (recommended)
brew install dpep/tools/iriq

# Cargo, from crates.io
cargo install iriq

# Cargo, from a source checkout
cargo install --path rust/iriq

One crate ships both the library and the iriq binary. Corpora persist to SQLite (bundled, WAL) out of the box — nothing to flag, install, or rebuild.

Use it as a Rust library

cargo add iriq

use iriq::{parse, normalize, Corpus};

let iri = parse("https://foo.com/users/123")?;
iri.host;             // "foo.com"
iri.path_segments;    // ["users", "123"]
iri.canonical();      // "https://foo.com/users/123"

normalize("https://foo.com/users/123")?;   // "https://foo.com/users/{user_id}"

// Streaming clustering against a persistent corpus.
let mut corpus = Corpus::open("c.db")?;
corpus.observe("https://foo.com/users/1")?;
corpus.save("c.db")?;

Full API on docs.rs/iriq; see the crate README for the library tour.

Segment classification

Iriq classifies each path/query segment into one of ~25 types — the first matching rule wins, and heuristics are deterministic:

  • literal — plain word (users, orders, Profile, こんにちは)
  • integer — pure digits below the timestamp range
  • float — decimal with digits on both sides (3.14, -2.5, 1.0)
  • boolean — true / false (any case)
  • version — semver-ish with v prefix (v1, v2.0.1, v1.2.3-beta)
  • locale — BCP 47-ish (en-US, fr_CA, zh-Hant, bare en/fr/ja)
  • currency — ISO 4217 codes (USD, EUR, JPY)
  • uuid — f47ac10b-58cc-4372-a567-0e02b2c3d479
  • date — 2024-05-23, 2024/05/23, 20240523, 05/23/2024. Canonicalized to ISO in --normalize output.
  • timestamp — ISO 8601, or 10/13-digit UNIX epoch
  • hash — 32+ hex chars (md5 / sha)
  • slug — my-cool-post, my_cool_post
  • ipv4 / ipv6 — collapsed to {ip} in normalized output
  • url — https://..., ftp://..., also scheme-less foo.com/path
  • email — local@host.tld
  • phone — E.164 (+15551234567) or NANP (555-666-7777, (555) 666-7777)
  • jwt — three base64url segments separated by dots
  • mime — image/png, application/vnd.api+json
  • filename — .ext for known extensions; per-kind grouping (image/document/data/...)
  • color — hex form (#fff, #ffffff, #ffffff80)
  • coordinate — lat,lng pair with plausible-range validation
  • country — ISO 3166-1 alpha-2 codes (US, JP, GB)
  • base64 — standard base64 blobs with disambiguating +, /, or =
  • opaque_id — short alphanumeric mix that doesn't fit elsewhere
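
To make "first matching rule wins" concrete, here is a small sketch covering a handful of the types above. It is not iriq's classifier: the Kind enum, the helper names, and above all the rule order are assumptions chosen for illustration, and the timestamp cutoff for integers is left out.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind { Boolean, Uuid, Version, Float, Integer, Literal, OpaqueId }

fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

fn is_boolean(s: &str) -> bool {
    s.eq_ignore_ascii_case("true") || s.eq_ignore_ascii_case("false")
}

fn is_uuid(s: &str) -> bool {
    // 8-4-4-4-12 hex digits, hyphens at fixed offsets.
    s.len() == 36
        && s.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

fn is_version(s: &str) -> bool {
    match s.strip_prefix('v') {
        Some(rest) => {
            // Ignore an optional pre-release tag such as "-beta".
            let core = rest.split('-').next().unwrap_or("");
            core.split('.').all(all_digits)
        }
        None => false,
    }
}

fn is_float(s: &str) -> bool {
    let unsigned = s.strip_prefix('-').unwrap_or(s);
    match unsigned.split_once('.') {
        Some((whole, frac)) => all_digits(whole) && all_digits(frac),
        None => false,
    }
}

fn is_literal(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_alphabetic)
}

/// Rules are tried top to bottom; the first match wins.
/// The order here is an assumption, not iriq's actual order.
fn classify(segment: &str) -> Kind {
    if is_boolean(segment) { Kind::Boolean }
    else if is_uuid(segment) { Kind::Uuid }
    else if is_version(segment) { Kind::Version }
    else if is_float(segment) { Kind::Float }
    else if all_digits(segment) { Kind::Integer }
    else if is_literal(segment) { Kind::Literal }
    else { Kind::OpaqueId }
}
```

Order is what keeps the result deterministic: in this sketch "true" would also pass the literal rule, so the boolean rule has to run first.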

RESTful hints

When a variable segment follows a literal one, iriq derives a hint by singularizing the literal and suffixing _id (or _uuid for UUIDs). That's what produces {user_id} from /users/123 and {order_id} from /orders/456. Semantic types (version, locale, currency, date, boolean) skip the hint and surface as {type}: /api/v1/status renders as /api/{version}/status, not the misleading /api/{api_id}/status. Pass -N / --no-hints for mechanical placeholders ({integer} instead of {user_id}).
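
The hint rule can be sketched as below. Everything here is hypothetical: singularize is a deliberately naive stand-in (it would mangle words like "status"), only pure-digit segments count as variable, and the semantic-type exemption is not modeled.

```rust
/// Naive singularizer, an assumption for this sketch only.
fn singularize(word: &str) -> String {
    if let Some(stem) = word.strip_suffix("ies") {
        format!("{}y", stem)
    } else if let Some(stem) = word.strip_suffix('s') {
        stem.to_string()
    } else {
        word.to_string()
    }
}

/// Build a placeholder such as {user_id} or {user_uuid} from the preceding literal.
fn hint(literal: &str, is_uuid: bool) -> String {
    let suffix = if is_uuid { "uuid" } else { "id" };
    format!("{{{}_{}}}", singularize(literal), suffix)
}

/// Render a path, replacing pure-digit segments with a hinted placeholder.
fn render_path(segments: &[&str]) -> String {
    let mut out = String::new();
    let mut previous_literal: Option<&str> = None;
    for &segment in segments {
        out.push('/');
        let is_integer =
            !segment.is_empty() && segment.bytes().all(|b| b.is_ascii_digit());
        if is_integer {
            match previous_literal {
                Some(literal) => out.push_str(&hint(literal, false)),
                // No literal to borrow a name from: fall back to the mechanical form.
                None => out.push_str("{integer}"),
            }
            previous_literal = None;
        } else {
            out.push_str(segment);
            previous_literal = Some(segment);
        }
    }
    out
}
```

With these definitions, render_path(&["users", "123"]) yields /users/{user_id}, matching the example in the text.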

Types only the corpus can see

Four types never come from a single URL — they emerge from the distribution of values a position has held across many observations:

Type         Emerges when a position…
number       holds both integers and floats
year         holds integers that all land in 1900–2100
http_status  holds integers that all land in 100–599
enum         holds a small, bounded set of distinct values

Mechanically, 200 is just an integer. Across ten thousand URLs where that slot is always 100–599, it's an HTTP status. That's the corpus earning its keep.
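
As a rough illustration of how such types could fall out of a value distribution, the sketch below applies the four conditions to everything one position has held. The Emergent enum, the min_observations and enum_max parameters, and the order in which the checks run are all assumptions; iriq's real thresholds and precedence are not described here.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Emergent { Number, Year, HttpStatus, Enum, Unknown }

/// Infer a distribution-level type from all values seen at one position.
/// `min_observations` and `enum_max` are hypothetical tuning knobs.
fn infer(values: &[&str], min_observations: usize, enum_max: usize) -> Emergent {
    // A single URL is never enough evidence.
    if values.len() < min_observations {
        return Emergent::Unknown;
    }

    let ints: Vec<i64> = values.iter().filter_map(|v| v.parse::<i64>().ok()).collect();
    let floats = values
        .iter()
        .filter(|v| v.contains('.') && v.parse::<f64>().is_ok())
        .count();
    let all_ints = ints.len() == values.len();

    if all_ints && ints.iter().all(|n| (100..=599).contains(n)) {
        return Emergent::HttpStatus;
    }
    if all_ints && ints.iter().all(|n| (1900..=2100).contains(n)) {
        return Emergent::Year;
    }
    // A mix of integers and floats, and nothing else.
    if !ints.is_empty() && floats > 0 && ints.len() + floats == values.len() {
        return Emergent::Number;
    }

    let distinct: BTreeSet<&str> = values.iter().copied().collect();
    if distinct.len() <= enum_max {
        return Emergent::Enum;
    }
    Emergent::Unknown
}
```

The range checks run before the enum check in this sketch because a slot that only ever holds 200 and 404 satisfies both; which one should win is a design decision.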

Corpus (streaming + learning)

For processing many identifiers — possibly an unbounded stream — point iriq at a corpus. It maintains rolling aggregates and per-(host, prefix) frequency stats, so classification improves as more data comes in.

--corpus PATH makes the corpus survive across invocations. A .db / .sqlite / .sqlite3 path is stored in SQLite (WAL journaling, incremental UPSERTs — multiple iriq --corpus processes can write concurrently); a .json path writes a plain JSON file instead.
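
The extension-to-backend mapping described above amounts to a small match. This is a sketch only: the Backend enum and backend_for are hypothetical names, and how iriq treats any other extension is not stated, so the sketch reports it as Unknown rather than guessing.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Backend { Sqlite, Json, Unknown }

/// Pick a storage backend from the corpus path's extension.
fn backend_for(path: &str) -> Backend {
    match Path::new(path).extension().and_then(|ext| ext.to_str()) {
        Some("db") | Some("sqlite") | Some("sqlite3") => Backend::Sqlite,
        Some("json") => Backend::Json,
        // Behavior for other extensions is not documented; left open here.
        _ => Backend::Unknown,
    }
}
```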

Re-runnable inference

A corpus persists the source-IRI log alongside the materialized views. --reinfer drops every view and replays the log through the current classifier and reducers. Tune a threshold, swap in a different classifier, or activate new recognizers (below) — then reinfer to see the new results without re-feeding URLs.

$ iriq --corpus c.db --reinfer

Learning new types

Iriq doesn't just classify against a fixed list — it watches the stream and proposes new recognizers for patterns it keeps seeing. When a prefix such as ghp_… or cus_… keeps recurring at a slug position, iriq suggests a recognizer for it, with evidence: coverage, host count, confidence. Proposals are never applied without your say-so: you activate the ones you trust, optionally in bulk with --activate-above and a confidence threshold you choose, and they persist with the corpus. Human-in-the-loop by design.

# Print proposals (human-readable, or --json)
$ iriq --corpus c.db --propose-recognizers

# Auto-activate every proposal with confidence ≥ 0.9, then reinfer
$ iriq --corpus c.db --propose-recognizers --activate-above 0.9

Cross-host shape learning

A route shape that recurs across multiple hosts is independent evidence of a semantic pattern — two unrelated hosts inventing the same /users/{integer} structure by accident is unlikely.

$ iriq --corpus c.db --cross-host-shapes [--min-hosts N]

The same signal feeds back into proposal confidence: each additional host beyond the first adds 0.05 to the score (capped at 1.0), so a prefix proposed on 5 hosts is meaningfully stronger than the same coverage seen on 1 host.
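
The arithmetic is simple enough to write down. In the sketch below, boosted_confidence and its base argument are hypothetical (how the base score is derived from coverage is not covered here); only the 0.05-per-extra-host step and the 1.0 cap come from the text.

```rust
/// Add 0.05 per host beyond the first, capped at 1.0.
/// `base` is whatever score the proposal had before the cross-host boost.
fn boosted_confidence(base: f64, host_count: u32) -> f64 {
    let extra_hosts = host_count.saturating_sub(1);
    (base + 0.05 * f64::from(extra_hosts)).min(1.0)
}
```

For a proposal with a base score of 0.70, one host leaves it at 0.70, while five hosts add 4 × 0.05 and bring it to roughly 0.90. A base of 0.95 on five hosts hits the cap at 1.0.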

Extracting IRIs from text

Pipe-mode extraction picks up explicit-scheme URLs (http, https, ftp, ws, wss, urn) and foo.com/path-style scheme-less URLs (small TLD allow-list, required path). It trims trailing sentence punctuation and preserves balanced parens (https://en.wikipedia.org/wiki/Ruby_(programming_language) stays intact; (see https://foo.com) drops the outer paren).

Known limitations (intentional):

  • Comma is a URL boundary, so query strings like ?q=37.7,-122.4 truncate. Trade-off picked to keep CSV-shaped text working.
  • No HTML entity decoding (&amp; stays as-is).
  • Scheme-less mode skips bare hostnames without a path (too noisy in prose).

Disable scheme-less extraction with --no-scheme-less.
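
The trailing-punctuation and balanced-paren behavior can be approximated as follows. This is a sketch, not iriq's extractor: trim_candidate is a hypothetical helper that assumes a candidate URL has already been located, and the exact punctuation set is a guess.

```rust
/// Trim trailing sentence punctuation from a URL candidate, and drop a
/// trailing ')' only when it has no matching '(' inside the candidate.
fn trim_candidate(candidate: &str) -> &str {
    let mut s = candidate;
    while let Some(last) = s.chars().next_back() {
        let drop_it = match last {
            // Assumed punctuation set; the real list may differ.
            '.' | ',' | ';' | ':' | '!' | '?' => true,
            ')' => s.matches(')').count() > s.matches('(').count(),
            _ => false,
        };
        if !drop_it {
            break;
        }
        s = &s[..s.len() - last.len_utf8()];
    }
    s
}
```

Counting parens is what lets the Wikipedia-style URL keep its closing paren while "(see https://foo.com)" loses the one that belongs to the sentence.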

How it works

Under the shape sits one idea: Position + Evidence. A Position is a slot in a host's structure — a typed path prefix, or a query-param name. Evidence is everything the corpus has observed about that slot: which values, how often, across how many hosts. Strings are observations; types are inferences drawn from the pile. Shape is the surface you see; Position + Evidence is the engine underneath. See docs/ARCHITECTURE.md for the full model.
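
One possible way to picture Position + Evidence as data, using only the standard library. The type and field names below are invented for this sketch and are not taken from docs/ARCHITECTURE.md or the crate; the real model may be organized quite differently.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A slot in a URL structure: a path prefix or a query-param name.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Position {
    PathPrefix(Vec<String>),
    QueryParam(String),
}

/// Everything observed about one slot: which values, how often, on which hosts.
#[derive(Debug, Default)]
struct Evidence {
    value_counts: BTreeMap<String, u64>,
    hosts: BTreeSet<String>,
}

impl Evidence {
    fn observe(&mut self, host: &str, value: &str) {
        *self.value_counts.entry(value.to_string()).or_insert(0) += 1;
        self.hosts.insert(host.to_string());
    }

    fn distinct_values(&self) -> usize {
        self.value_counts.len()
    }

    fn total(&self) -> u64 {
        self.value_counts.values().sum()
    }
}

/// Hypothetical in-memory corpus: one Evidence pile per Position.
#[derive(Debug, Default)]
struct CorpusSketch {
    slots: BTreeMap<Position, Evidence>,
}

impl CorpusSketch {
    fn observe(&mut self, position: Position, host: &str, value: &str) {
        self.slots.entry(position).or_default().observe(host, value);
    }
}
```

Types are then functions of an Evidence value rather than of any single string, which is why replaying the same observations through a changed classifier can produce different results.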

CLI reference

Single input — combined parse + normalize summary; trim with section flags (-p, -n).

Piped stdin — extraction runs by default. Output auto-switches: small inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the cluster view via an ephemeral corpus.

Flag                    Effect
-p, --parse             Show parsed fields
-n, --normalize         Show the shape-normalized form
-c, --canonical         Show the canonical form (no shape normalization)
-j, --json              Emit JSON
-J, --ndjson            Newline-delimited JSON (one object per line); implies --json
-N, --no-hints          Use {integer} etc. instead of {user_id}
--no-scheme-less        Skip foo.com/path-style extraction (explicit-scheme only)
--corpus PATH           Load/create a corpus at PATH (.json or .db/.sqlite/.sqlite3)
--host MODE             Host-keying for clustering: full (default), reg strips subdomains, none ignores host
--stats                 Print rolling aggregates
--reinfer               Drop the materialized views and replay the source-IRI log through the current classifier + reducers
--propose-recognizers   Scan observed values for shape patterns that recur enough to suggest a new recognizer. Combine with --json for structured output
--cross-host-shapes     List route shapes that recur across multiple hosts
--min-observations N    Proposal threshold; default 20
--min-coverage F        Proposal threshold; default 0.7
--min-hosts N           Threshold for both proposals and cross-host shapes; default 1 / 2 respectively
--activate-above F      With --propose-recognizers, auto-activate every proposal whose confidence is ≥ F
completion bash|zsh     Shell completions for bash or zsh
-V, --version           Print version

A positional argument that doesn't parse as an IRI but IS an existing file is read and extracted from automatically — iriq ./access.log and iriq /var/log/foo.log Just Work. (Bare filenames like README.md may still parse as a URL; pipe with cat to disambiguate.)

Exit codes: 0 success, 1 usage error, 2 parse error.

Limitations (intentional)

Iriq does not:

  • Implement RFC 3986, RFC 3987, or the WHATWG URL standard fully.
  • Convert between Unicode (IRI) and punycode (URI) — the display form is preserved as-is.
  • Percent-encode or decode path/query bytes. Bytes are kept as written.
  • Validate scheme-specific structure beyond URL vs. URN.
  • Resolve relative references against a base URL.
  • Round-trip canonical back to the exact original byte-for-byte (whitespace is stripped, default ports are dropped, dot segments are collapsed).

Iriq's focus is the analysis side: classification, normalization, and clustering — not a complete URL implementation.


Contributing

Yes please :)

  1. Fork it
  2. Create your feature branch (git checkout -b my-feature)
  3. Ensure the tests pass (cd rust && cargo test)
  4. Commit your changes (git commit -am 'awesome new feature')
  5. Push your branch (git push origin my-feature)
  6. Create a Pull Request