Class: Iriq::SegmentClassifier

Inherits:
Object
Defined in:
lib/iriq/segment_classifier.rb

Overview

Heuristic classifier for individual path segments and query values.

Returns a symbol from the known TYPES set. Order matters: the first matching rule wins.

Constant Summary

TYPES =

`:number` is a corpus-only umbrella surfaced by Cluster#param_type when both `:integer` and `:float` are observed at the same position without either hitting a clear majority. The classifier never returns `:number` for an individual value: every value is unambiguously one or the other.

`:enum` is similarly corpus-only: it surfaces when a position has a bounded set of distinct values observed across enough samples (see Cluster::ENUM_* thresholds).

%i[literal integer float number uuid date year timestamp hash slug
ipv4 ipv6 url email boolean version locale currency phone jwt mime
file color coordinate country base64 http_status enum opaque_id].freeze
FLOAT_RE =

A float requires a decimal point and digits on both sides. Sign is optional. Bare integers and 4+ char hex/UUID-shaped tokens fall through to their own rules.

/\A-?\d+\.\d+\z/.freeze
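As a quick check of the rule above, this sketch runs the pattern against a few inputs. The pattern is copied from FLOAT_RE and redefined locally so the snippet runs on its own; the sample strings are illustrative.

```ruby
# Pattern copied from FLOAT_RE; redefined locally so this runs standalone.
FLOAT_RE = /\A-?\d+\.\d+\z/

samples = ["3.14", "-0.5", "42", ".5", "5.", "1.2.3", "1e5"]

# Keep only strings with digits on both sides of a single decimal point.
float_matches = samples.select { |s| FLOAT_RE.match?(s) }

p float_matches
# => ["3.14", "-0.5"]
```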
ISO_TIME_RE =

ISO 8601 timestamp shapes (RFC 3339-ish). Date-only forms live on Recognizers::Date / Recognizers::Integer.

/\A\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?(\.\d+)?(Z|[+\-]\d{2}:?\d{2})?\z/.freeze
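The shapes this pattern accepts can be probed directly. The sketch below copies ISO_TIME_RE locally; the sample strings are illustrative and not taken from the library's tests.

```ruby
# Pattern copied from ISO_TIME_RE; redefined locally so this runs standalone.
ISO_TIME_RE = /\A\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?(\.\d+)?(Z|[+\-]\d{2}:?\d{2})?\z/

samples = [
  "2024-01-15T10:30:00Z",          # UTC designator
  "2024-01-15 10:30",              # space separator, seconds omitted
  "2024-01-15T10:30:00.123+0530",  # fractional seconds, offset without colon
  "2024-01-15",                    # date only: no time part
  "2024-13-45T99:99"               # right shape, implausible values
]

iso_matches = samples.select { |s| ISO_TIME_RE.match?(s) }
p iso_matches
```

Only the date-only string is rejected. The last sample matches because the pattern checks shape alone; whether the values are range-checked elsewhere is not shown on this page.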
HASH_RE =
/\A\h{32,}\z/.freeze
SLUG_RE =
/\A[a-z0-9]+(?:[-_][a-z0-9]+)+\z/.freeze
LITERAL_RE =
/\A[\p{L}][\p{L}\p{M}_]*\z/u.freeze
OPAQUE_RE =
/\A[A-Za-z0-9_\-.~]{4,}\z/.freeze
IPV4_RE =

Dotted-quad shape; per-octet bounds are validated in classify_ipv4.

/\A\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\z/.freeze
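The body of classify_ipv4 is not shown on this page, so the following is only a minimal sketch of a shape check followed by per-octet bounds. The helper name `ipv4_like?` is hypothetical.

```ruby
# Pattern copied from IPV4_RE; redefined locally so this runs standalone.
IPV4_RE = /\A\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\z/

# Hypothetical stand-in for classify_ipv4: shape first, then octet bounds.
def ipv4_like?(value)
  return false unless IPV4_RE.match?(value)

  value.split(".").all? { |octet| octet.to_i.between?(0, 255) }
end

samples = ["192.168.0.1", "999.1.1.1", "10.0.0.256", "1.2.3"]
ipv4_matches = samples.select { |s| ipv4_like?(s) }

p ipv4_matches
# => ["192.168.0.1"]
```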
IPV6_RE =

IPv6: matches either the full eight-group form (`a:b:c:d:e:f:g:h`) or any compressed form containing `::`. Rejects bare hex / integers / single-colon strings so we don’t shadow :integer, :hash, etc. Doesn’t accept IPv4-mapped variants (`::ffff:192.0.2.1`); common IPv6 traffic in URLs doesn’t use them.

/\A(?:[0-9a-fA-F]{1,4}(?::[0-9a-fA-F]{1,4}){7}|(?=[0-9a-fA-F:]*::)[0-9a-fA-F:]{2,})\z/.freeze
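The two branches (full form, or anything hex-and-colon containing `::`) can be exercised as follows. The pattern is copied from IPV6_RE; the sample addresses are illustrative.

```ruby
# Pattern copied from IPV6_RE; redefined locally so this runs standalone.
IPV6_RE = /\A(?:[0-9a-fA-F]{1,4}(?::[0-9a-fA-F]{1,4}){7}|(?=[0-9a-fA-F:]*::)[0-9a-fA-F:]{2,})\z/

samples = [
  "2001:db8:0:0:0:0:0:1",  # full eight-group form
  "2001:db8::1",           # compressed
  "::1",                   # loopback
  "::",                    # unspecified address
  "deadbeef",              # bare hex, no colons
  "12:34",                 # single colon
  "::ffff:192.0.2.1"       # IPv4-mapped, contains dots
]

ipv6_matches = samples.select { |s| IPV6_RE.match?(s) }
p ipv6_matches
# => ["2001:db8:0:0:0:0:0:1", "2001:db8::1", "::1", "::"]
```

The compressed branch is deliberately loose: it checks only the character set and the presence of `::`, not group count.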
URL_RE =

URL-as-value: a scheme prefix followed by something non-empty. Used for query params like `?redirect=https://foo.com/bar`.

%r{\A[a-zA-Z][a-zA-Z0-9+.\-]*://\S+\z}.freeze
SCHEMELESS_URL_RE =

Scheme-less URL: `foo.com/path`, `sub.foo.com/`, etc. Requires a dotted host with a TLD-like suffix (≥2 letters) followed by a slash to disambiguate from filenames like `image.png` or version strings like `1.2.3`.

%r{\A[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*\.[a-zA-Z]{2,}/\S*\z}.freeze
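To see how the two URL patterns divide inputs, here is a small sketch. Both patterns are copied from the constants above; the helper `url_shape` and its return labels are hypothetical and are not classifier return values.

```ruby
# Patterns copied from URL_RE and SCHEMELESS_URL_RE; redefined locally.
URL_RE = %r{\A[a-zA-Z][a-zA-Z0-9+.\-]*://\S+\z}
SCHEMELESS_URL_RE = %r{\A[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*\.[a-zA-Z]{2,}/\S*\z}

# Hypothetical helper; the labels are for illustration only.
def url_shape(value)
  return :with_scheme if URL_RE.match?(value)
  return :schemeless if SCHEMELESS_URL_RE.match?(value)

  nil
end

samples = ["https://foo.com/bar", "foo.com/bar", "sub.foo.com/", "image.png", "1.2.3"]
url_shapes = samples.map { |v| url_shape(v) }

p url_shapes
# => [:with_scheme, :schemeless, :schemeless, nil, nil]
```

The required slash is what keeps `image.png` and `1.2.3` out of the scheme-less branch.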
EMAIL_RE =

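Simplified email: local@host.tld. Each host label must start and end with an alphanumeric, so the host has no leading/trailing dots or hyphens; the local part is matched permissively and is not checked for leading/trailing dots. Not RFC 5322 compliant; covers the common shape.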

/\A[A-Za-z0-9._%+\-]+@[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?(?:\.[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?)+\z/.freeze
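A short check of the host-label rules, with the pattern copied from EMAIL_RE. The addresses are made-up examples.

```ruby
# Pattern copied from EMAIL_RE; redefined locally so this runs standalone.
EMAIL_RE = /\A[A-Za-z0-9._%+\-]+@[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?(?:\.[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?)+\z/

samples = [
  "user@example.com",
  "first.last+tag@sub.example.co.uk",
  "user@localhost",     # host needs at least one dot
  "user@-bad.com",      # label may not start with a hyphen
  "user@example..com",  # empty label
  "@example.com"        # empty local part
]

email_matches = samples.select { |s| EMAIL_RE.match?(s) }
p email_matches
# => ["user@example.com", "first.last+tag@sub.example.co.uk"]
```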
BOOLEAN_RE =

Boolean literal, case-insensitive. `0`/`1` look like integers from a single value alone; the corpus’s :enum detection picks them up when they appear as a bounded value set on a param.

/\A(?:true|false)\z/i.freeze
VERSION_RE =

SemVer-ish version tag with explicit `v` prefix. Without the prefix `1.2.3` looks like a float / opaque blob; the `v` keeps it unambiguous from a single value.

/\Av\d+(?:\.\d+)*(?:[-+][A-Za-z0-9.\-]+)?\z/.freeze
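The effect of the mandatory `v` prefix can be checked with a few tags. The pattern is copied from VERSION_RE; the sample tags are illustrative.

```ruby
# Pattern copied from VERSION_RE; redefined locally so this runs standalone.
VERSION_RE = /\Av\d+(?:\.\d+)*(?:[-+][A-Za-z0-9.\-]+)?\z/

samples = ["v1", "v1.2.3", "v2.0.0-rc.1", "v1.0+build.5", "1.2.3", "version", "v1."]

version_matches = samples.select { |s| VERSION_RE.match?(s) }
p version_matches
# => ["v1", "v1.2.3", "v2.0.0-rc.1", "v1.0+build.5"]
```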
LOCALE_RE =

BCP 47-ish locale: 2-3 letter language + separator + 2-4 char region or script. Real-world subtags: ISO 3166-1 region (`US`, `CA`, 2 letters / 3 digits), ISO 15924 script (`Hans`, 4 letters). The bare 2/3-letter case is handled via LOCALE_LANGUAGE_CODES below so we don’t over-classify random short words. A trailing helper (classify_locale_pair) also confirms the language portion is in the allowlist; otherwise things like `by-name` would wrongly promote to :locale.

/\A([a-z]{2,3})[-_]([A-Za-z0-9]{2,4})\z/.freeze
LOCALE_LANGUAGE_CODES =

Inline ISO 639-1 (subset): the language codes we’ll accept as a standalone locale segment. Bare `en` / `fr` / `ja` etc. classify as :locale; tokens not in the list (like the 2-letter literal `to` or `if`) stay as :literal. Curated for the languages that show up in real `?lang=` traffic; expandable as needed.

%w[
  ar bg bn ca cs da de el en es et fa fi fr gu he hi hr hu id it
  ja ka kk km kn ko lt lv mk ml mr ms my nb nl no pa pl pt ro ru
  sk sl sr sv sw ta te th tl tr uk ur vi zh
].to_set.freeze
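The shape-plus-allowlist check described for locale pairs might look like the sketch below. classify_locale_pair itself is not shown on this page, so `locale_pair?` is a hypothetical stand-in, and the language set is abbreviated for the demo.

```ruby
require "set"

# Pattern copied from LOCALE_RE; the code set is an abbreviated subset.
LOCALE_RE = /\A([a-z]{2,3})[-_]([A-Za-z0-9]{2,4})\z/
LOCALE_LANGUAGE_CODES = %w[de en es fr ja pt zh].to_set.freeze

# Hypothetical helper: the pair must match the shape AND the language
# subtag must be in the allowlist.
def locale_pair?(value)
  m = LOCALE_RE.match(value)
  !m.nil? && LOCALE_LANGUAGE_CODES.include?(m[1])
end

samples = ["en-US", "zh_Hans", "by-name", "en", "EN-us"]
locale_matches = samples.select { |s| locale_pair?(s) }

p locale_matches
# => ["en-US", "zh_Hans"]
```

`by-name` passes the regex but is rejected because `by` is not an allowlisted language code.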
LOCALE_BARE_RE =

2 letters only — 3-letter slot is handled by CURRENCY_RE (ISO 4217 codes are 3 chars; ISO 639-2 language codes are too, but we don’t ship that list and would shadow currencies for ambiguous strings).

/\A[a-z]{2}\z/.freeze
CURRENCY_CODES =

ISO 4217 currency codes: an inline allowlist of the 35 most-used codes, which covers the bulk of real traffic. Three-letter all-caps strings (`FAQ`, `FOO`) would otherwise be misclassified as :currency if we relied on shape alone.

%w[
  USD EUR GBP JPY CNY CHF CAD AUD NZD HKD SGD
  INR KRW MXN BRL ZAR SEK NOK DKK PLN CZK HUF
  RUB TRY ILS AED SAR THB IDR PHP VND TWD MYR
  NGN EGP
].to_set.freeze
CURRENCY_RE =
/\A[A-Za-z]{3}\z/.freeze
PHONE_RE =

E.164 phone number: leading `+` then 1-3 digit country code, then up to 14 more digits. Allows separators (space, dash, dot, parens) but they don’t count toward digit length. A standalone `+15551234567` and `+1 (555) 123-4567` both classify; bare digit blobs without `+` stay as :integer / :opaque_id (too ambiguous from a single value).

%r{\A\+(?:[ \-.()\d]){7,20}\z}.freeze
PHONE_NANP_RE =

NANP phone without `+`: `555-666-7777`, `555.666.7777`, `(555) 666-7777`. The area-code + exchange leading-digit constraint (first digit 2-9 in both) is what makes this safe to add without shadowing :integer; bare digit blobs / dotted numerics fall through. Only matches the 10-digit NANP shape; international formats need the explicit `+`.

/\A\(?([2-9]\d{2})\)?[ \-.]?([2-9]\d{2})[ \-.]?(\d{4})\z/.freeze
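The leading-digit constraint can be seen by running the pattern on a few numbers. The pattern is copied from PHONE_NANP_RE; the numbers are made up.

```ruby
# Pattern copied from PHONE_NANP_RE; redefined locally so this runs standalone.
PHONE_NANP_RE = /\A\(?([2-9]\d{2})\)?[ \-.]?([2-9]\d{2})[ \-.]?(\d{4})\z/

samples = [
  "555-666-7777",
  "555.666.7777",
  "(555) 666-7777",
  "5556667777",    # no separators at all
  "155-666-7777",  # area code starts with 1
  "555-066-7777",  # exchange starts with 0
  "1234567890"
]

nanp_matches = samples.select { |s| PHONE_NANP_RE.match?(s) }
p nanp_matches
# => ["555-666-7777", "555.666.7777", "(555) 666-7777", "5556667777"]

# The three capture groups are area code, exchange, and line number.
parts = PHONE_NANP_RE.match("(555) 666-7777").captures
p parts
# => ["555", "666", "7777"]
```

Note that a separator-free 10-digit string such as `5556667777` satisfies the regex in isolation, since every separator is optional. How the classifier orders this rule relative to the integer recognizer is not shown on this page.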
JWT_RE =

JWT: three base64url-encoded segments separated by dots. The header normally starts with `eyJ` (the base64url encoding of the opening `{"` of the JSON header); the pattern itself anchors only on `ey`.

/\Aey[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\z/.freeze
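To illustrate where the `ey` prefix comes from, this sketch base64url-encodes a JSON header and assembles an unsigned, JWT-shaped token. The helper `b64url` and the payload and signature values are assumptions made for the demo; no real signing is performed.

```ruby
# Pattern copied from JWT_RE; redefined locally so this runs standalone.
JWT_RE = /\Aey[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\z/

# Hypothetical helper: strict base64, then the URL-safe alphabet, no padding.
def b64url(str)
  [str].pack("m0").tr("+/", "-_").delete("=")
end

header  = b64url('{"alg":"HS256","typ":"JWT"}')
payload = b64url('{"sub":"1"}')
token   = [header, payload, b64url("signature")].join(".")

p header.start_with?("eyJ")  # => true
p JWT_RE.match?(token)       # => true
p JWT_RE.match?("a.b.c")     # => false
```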
MIME_RE =

MIME / media type: RFC 2046 top-level types plus a subtype. The subtype side is permissive (letters/digits/+-.) so `application/vnd.api+json` and `image/svg+xml` both match.

%r{\A(?:text|image|video|audio|application|multipart|message|font|model)/[A-Za-z0-9!\#$&^_+\-.]+\z}.freeze
FILE_RE =

File: `name.ext` shape where ext is in FILE_EXTENSIONS. The stem can be a slug, opaque-id, or literal; the meaningful signal is the extension. Per-extension grouping (image / document / data / etc.) surfaces via SegmentClassifier.file_kind for verbose displays.

/\A([A-Za-z0-9][A-Za-z0-9_\-.~]*)\.([A-Za-z0-9]{1,8})\z/.freeze
FILE_EXTENSIONS =

Allowlist of common file extensions, keyed by kind. The kind is surfaced via file_kind for verbose output; the type itself is just `:file`. Keep this list curated: random 1-8 char endings can shadow legitimate semantic types (`fr_CA.us`, `1.2.3`).

{
  image:    %w[png jpg jpeg gif webp svg bmp tiff tif ico avif heic heif],
  document: %w[pdf doc docx xls xlsx ppt pptx odt ods odp rtf epub],
  data:     %w[csv tsv json xml yaml yml parquet sqlite db ndjson jsonl],
  text:     %w[txt md log markdown rst],
  web:      %w[html htm css js mjs cjs ts jsx tsx],
  audio:    %w[mp3 wav ogg flac aac m4a opus],
  video:    %w[mp4 mov avi mkv webm flv wmv m4v],
  archive:  %w[zip tar gz bz2 7z rar xz tgz],
  code:     %w[rb py go java c cc cpp h hpp sh swift kt rs],
}.freeze
FILE_EXTENSION_KIND =

Reverse map ext → kind for O(1) lookup. Lowercase keys; classify downcases before consulting.

FILE_EXTENSIONS.each_with_object({}) { |(kind, exts), h|
  exts.each { |e| h[e] = kind }
}.freeze
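The reverse map and the extension lookup fit together as in this sketch. The extension table is abbreviated for the demo, and `file_kind` is written here as a top-level method mirroring the class method documented further down.

```ruby
# Abbreviated copy of FILE_EXTENSIONS for the demo.
FILE_EXTENSIONS = {
  image:   %w[png jpg],
  data:    %w[csv json],
  archive: %w[zip tar gz]
}.freeze

# Reverse map ext => kind, built the same way as FILE_EXTENSION_KIND.
FILE_EXTENSION_KIND = FILE_EXTENSIONS.each_with_object({}) { |(kind, exts), h|
  exts.each { |e| h[e] = kind }
}.freeze

# Mirrors SegmentClassifier.file_kind: take the last extension, downcase it,
# and look it up in the reverse map.
def file_kind(value)
  return nil if value.nil?

  ext = value[/\.([A-Za-z0-9]{1,8})\z/, 1]&.downcase
  ext && FILE_EXTENSION_KIND[ext]
end

p file_kind("logo.PNG")       # => :image
p file_kind("backup.tar.gz")  # => :archive
p file_kind("fr_CA.us")       # => nil
p file_kind("README")         # => nil
```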
COLOR_HEX_RE =

Hex color: `#fff`, `#ffffff`, `#ffffff80` (with alpha). 3/4/6/8 hex chars after the leading `#`. Other color formats (named, rgb(), hsl()) aren’t recognized yet; this is the only one common in URL path/query positions.

/\A#([0-9a-fA-F]{3}|[0-9a-fA-F]{4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})\z/.freeze
COORDINATE_RE =

Coordinate pair: `lat,lng`, both signed decimals. The extractor’s comma boundary means this only survives when present at classify time (e.g. query values fed in already-parsed). Each component is validated for plausible lat/lng range in classify_coordinate.

/\A(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)\z/.freeze
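classify_coordinate is not shown on this page; the following is a minimal sketch of a shape match followed by a range check, assuming the usual bounds of ±90 for latitude and ±180 for longitude. The helper name `coordinate?` is hypothetical.

```ruby
# Pattern copied from COORDINATE_RE; redefined locally so this runs standalone.
COORDINATE_RE = /\A(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)\z/

# Hypothetical stand-in for classify_coordinate (bounds are an assumption).
def coordinate?(value)
  m = COORDINATE_RE.match(value)
  return false unless m

  lat = m[1].to_f
  lng = m[2].to_f
  lat.between?(-90.0, 90.0) && lng.between?(-180.0, 180.0)
end

samples = ["37.7749,-122.4194", "12,34", "91.0,10.0", "37.7749", "37.7749, -122.4194"]
coordinate_matches = samples.select { |s| coordinate?(s) }

p coordinate_matches
# => ["37.7749,-122.4194", "12,34"]
```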
COUNTRY_RE =

ISO 3166-1 alpha-2: 2 letters, validated against the inline allowlist below (so random 2-letter uppercase tokens like `OK` don’t unconditionally promote). Lowercase tokens are routed through :locale by LOCALE_BARE_RE.

/\A[A-Z]{2}\z/.freeze
COUNTRY_CODES =
%w[
  AD AE AF AG AL AM AO AR AT AU AZ
  BA BB BD BE BG BH BJ BM BN BO BR BS BT BW BY BZ
  CA CD CG CH CI CL CM CN CO CR CU CY CZ
  DE DJ DK DM DO DZ
  EC EE EG ER ES ET
  FI FJ FK FM FO FR
  GA GB GE GH GI GL GM GN GR GT GU GW GY
  HK HN HR HT HU
  ID IE IL IM IN IQ IR IS IT
  JM JO JP
  KE KG KH KM KN KP KR KW KY KZ
  LA LB LC LI LK LR LS LT LU LV LY
  MA MC MD ME MG MK ML MM MN MO MR MT MU MV MW MX MY MZ
  NA NE NG NI NL NO NP NR NU NZ
  OM
  PA PE PF PG PH PK PL PR PT PW PY
  QA
  RE RO RS RU RW
  SA SB SC SD SE SG SI SK SL SM SN SO SR SS ST SV SY SZ
  TD TG TH TJ TM TN TO TR TT TV TW TZ
  UA UG US UY UZ
  VA VC VE VG VI VN VU
  WS
  YE
  ZA ZM ZW
].to_set.freeze
BASE64_RE =

Standard base64: at least 16 chars, made up of the base64 alphabet, AND contains one of the disambiguating chars (`+`, `/`, trailing `=` padding) so we don’t shadow plain alphanumeric :opaque_id blobs. URL-safe base64 (which uses `-`/`_`) overlaps too heavily with :slug to discriminate from shape alone.

%r{\A[A-Za-z0-9+/]{16,}={0,2}\z}.freeze
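The regex alone does not enforce the disambiguating-character requirement, so that check presumably lives in a helper not shown here. A minimal sketch of the combined rule follows; the helper name `base64_like?` is hypothetical.

```ruby
# Pattern copied from BASE64_RE; redefined locally so this runs standalone.
BASE64_RE = %r{\A[A-Za-z0-9+/]{16,}={0,2}\z}

# Hypothetical helper: shape match plus at least one disambiguating char.
def base64_like?(value)
  return false unless BASE64_RE.match?(value)

  value.include?("+") || value.include?("/") || value.end_with?("=")
end

padded = ["hello world!!"].pack("m0")  # 13 bytes encode with "==" padding
plain  = "abcdefghijklmnop1234"        # base64 alphabet, nothing distinctive
short  = "ab+/cd=="                    # too short

p base64_like?(padded)  # => true
p base64_like?(plain)   # => false
p base64_like?(short)   # => false
```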
HTTP_STATUS_RANGE =

HTTP status — bare 3-digit integer in the 100..599 window. Same corpus-promotion pattern as :year: a single 3-digit int is ambiguous, but a position whose values cluster inside the HTTP status window is almost certainly statuses. See Cluster#param_type for the promotion.

100..599
YEAR_RANGE =

Plausible year — 4-digit integer in the 1900..2100 window. Checked inside classify_integer so we don’t shadow shorter / longer ints.

1900..2100
CACHE_MAX =

Bounded memoization: classification of a given string is pure, so repeat segments (e.g. /users in countless paths) can be cached. Cap keeps the cache from unbounded growth when inputs are dominated by unique IDs.

10_000
DEFAULT =

Shared singleton — preferred default for callers that don’t bring their own classifier (saves a per-call allocation).

new
PARAM_NAME_HINTS =

Param-name hints: when a value’s classifier output is too generic (`:literal`, `:opaque_id`, `:slug`) to be informative, the param name can supply the type. `?phone=unknown` becomes `:phone` even though `unknown` is a literal. Only “safe” string-shaped types are in the map; numeric types (`:integer`, `:year`, `:http_status`) are handled by range analysis instead.

{
  "phone"        => :phone,
  "tel"          => :phone,
  "telephone"    => :phone,
  "mobile"       => :phone,
  "cell"         => :phone,
  "email"        => :email,
  "e_mail"       => :email,
  "mail"         => :email,
  "locale"       => :locale,
  "lang"         => :locale,
  "language"     => :locale,
  "currency"     => :currency,
  "cur"          => :currency,
  "curr"         => :currency,
  "url"          => :url,
  "uri"          => :url,
  "redirect"     => :url,
  "redirect_url" => :url,
  "return_to"    => :url,
  "return_url"   => :url,
  "callback"     => :url,
  "callback_url" => :url,
  "next_url"     => :url,
  "jwt"          => :jwt,
  "bearer"       => :jwt,
  "auth_token"   => :jwt,
  "mime"         => :mime,
  "content_type" => :mime,
  "media_type"   => :mime,
  "color"        => :color,
  "colour"       => :color,
  "bg"           => :color,
  "background"   => :color,
  "fg"           => :color,
  "foreground"   => :color,
  "coords"       => :coordinate,
  "coordinates"  => :coordinate,
  "geo"          => :coordinate,
  "location"     => :coordinate,
  "position"     => :coordinate,
  "latlng"       => :coordinate,
  "latlon"       => :coordinate,
  "country"      => :country,
  "country_code" => :country,
  "nation"       => :country,
}.freeze
PARAM_HINT_OVERRIDABLE =

Types the param-name hint is allowed to override. Anything more specific (`:integer`, `:uuid`, etc.) already carries useful info, so the classifier wins.

%i[literal opaque_id slug].to_set.freeze

Class Method Summary

Instance Method Summary

Constructor Details

#initialize ⇒ SegmentClassifier

Returns a new instance of SegmentClassifier.



# File 'lib/iriq/segment_classifier.rb', line 204

def initialize
  @cache = {}
  # The recognizer ensemble consulted at classify time. Starts with
  # the built-in three (uuid, date, integer); Corpus#activate_proposal
  # appends SynthesizedRecognizer instances at runtime so a corpus
  # picks up its learned patterns without classifier surgery.
  @recognizers = [Recognizers::UUID, Recognizers::DATE, Recognizers::INTEGER]
end

Class Method Details

.canonical_currency(value) ⇒ Object

Canonicalize a currency code to uppercase ISO 4217. Returns nil if the value isn’t a known code. Used by `--normalize` so /pricing/usd and /pricing/USD both render as /pricing/USD.



# File 'lib/iriq/segment_classifier.rb', line 503

def self.canonical_currency(value)
  return nil if value.nil?
  up = value.upcase
  CURRENCY_CODES.include?(up) ? up : nil
end
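A standalone usage sketch of the same logic is shown below. The code set is abbreviated for the demo, and the method is defined at top level rather than on SegmentClassifier so the snippet runs without the library.

```ruby
require "set"

# Abbreviated copy of CURRENCY_CODES for the demo.
CURRENCY_CODES = %w[USD EUR GBP JPY].to_set.freeze

# Same body as SegmentClassifier.canonical_currency.
def canonical_currency(value)
  return nil if value.nil?

  up = value.upcase
  CURRENCY_CODES.include?(up) ? up : nil
end

p canonical_currency("usd")  # => "USD"
p canonical_currency("Eur")  # => "EUR"
p canonical_currency("FAQ")  # => nil
p canonical_currency(nil)    # => nil
```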

.canonical_date(value) ⇒ Object

Canonicalize a recognized date string to ISO 8601 (YYYY-MM-DD). Returns nil if the value isn’t one of our accepted date forms. Used by `--normalize` so /events/2024/01/15 and /events/20240115 both render as /events/2024-01-15 in the output.



# File 'lib/iriq/segment_classifier.rb', line 513

def self.canonical_date(value)
  return nil if value.nil?
  return nil unless value.is_a?(String)

  canon = Recognizers::Date.canonical(value)
  return canon if canon

  # Compact YYYYMMDD lives on the Integer recognizer for classification,
  # but the canonical form is part of the same date family.
  if Recognizers::Integer::COMPACT_DATE_PATTERN.match?(value)
    y, m, d = value[0, 4], value[4, 2], value[6, 2]
    return "#{y}-#{m}-#{d}" if Recognizers::Date.plausible?(y, m, d)
  end

  nil
end
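The compact YYYYMMDD branch can be approximated without the library. This is only a sketch: Recognizers::Integer::COMPACT_DATE_PATTERN and Recognizers::Date.plausible? are not shown on this page, so it assumes an eight-digit shape and substitutes the standard library's `Date.valid_date?` for the plausibility check. The name `canonical_compact_date` is hypothetical.

```ruby
require "date"

# Assumed shape; the real COMPACT_DATE_PATTERN may differ.
COMPACT_DATE = /\A\d{8}\z/

# Hypothetical helper covering only the compact branch of canonical_date.
def canonical_compact_date(value)
  return nil unless value.is_a?(String) && COMPACT_DATE.match?(value)

  y, m, d = value[0, 4], value[4, 2], value[6, 2]
  # Stand-in for Recognizers::Date.plausible?(y, m, d).
  Date.valid_date?(y.to_i, m.to_i, d.to_i) ? "#{y}-#{m}-#{d}" : nil
end

p canonical_compact_date("20240115")  # => "2024-01-15"
p canonical_compact_date("20241345")  # => nil
p canonical_compact_date(20240115)    # => nil
```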

.color_kind(value) ⇒ Object

Return the kind (`:hex` for now; placeholder for future named / rgb / hsl support) of a color-shaped value, or nil if the value isn’t a recognized color. Used by verbose displays alongside the `:color` type itself.



# File 'lib/iriq/segment_classifier.rb', line 425

def self.color_kind(value)
  return nil if value.nil?
  return :hex if COLOR_HEX_RE.match?(value)

  nil
end

.display_type(type) ⇒ Object

Display name for a type in `--normalize` placeholders. Collapses `:ipv4` and `:ipv6` to `:ip` (callers that want the specific family read it off the classifier directly or via cluster stats).



# File 'lib/iriq/segment_classifier.rb', line 405

def self.display_type(type)
  return :ip if type == :ipv4 || type == :ipv6

  type
end

.file_kind(value) ⇒ Object

Return the kind (`:image`/`:document`/`:data`/…) for a file-shaped value, or nil if the value isn’t a recognized file. Used by verbose displays to subdivide `:file` without polluting the top-level type taxonomy.



# File 'lib/iriq/segment_classifier.rb', line 415

def self.file_kind(value)
  return nil if value.nil?
  ext = value[/\.([A-Za-z0-9]{1,8})\z/, 1]&.downcase
  ext && FILE_EXTENSION_KIND[ext]
end

.param_name_hint(name, current_type) ⇒ Object

Return a hinted type for a param name when the resolved value type is generic. Nil when no hint applies. Both Cluster#param_type (for the corpus path) and Normalizer.shape_query (for one-shot rendering) consult this so corpus + one-shot agree on the override.



# File 'lib/iriq/segment_classifier.rb', line 494

def self.param_name_hint(name, current_type)
  return nil if name.nil? || !PARAM_HINT_OVERRIDABLE.include?(current_type)

  PARAM_NAME_HINTS[name.to_s.downcase]
end
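A standalone sketch of how a caller might combine the hint with an already-resolved type. The hint table is abbreviated, and the method is defined at top level so the snippet runs without the library; the `hint || current` fallback is an assumption about caller behavior.

```ruby
require "set"

# Abbreviated copies of the two constants for the demo.
PARAM_NAME_HINTS = { "phone" => :phone, "lang" => :locale, "redirect" => :url }.freeze
PARAM_HINT_OVERRIDABLE = %i[literal opaque_id slug].to_set.freeze

# Same body as SegmentClassifier.param_name_hint.
def param_name_hint(name, current_type)
  return nil if name.nil? || !PARAM_HINT_OVERRIDABLE.include?(current_type)

  PARAM_NAME_HINTS[name.to_s.downcase]
end

p param_name_hint("phone", :literal)  # => :phone
p param_name_hint(:LANG, :slug)       # => :locale (name is stringified and downcased)
p param_name_hint("phone", :integer)  # => nil (specific types are never overridden)
p param_name_hint("q", :literal)      # => nil (no hint for this name)

# Assumed caller pattern: fall back to the classifier's type when no hint applies.
current  = :literal
resolved = param_name_hint("redirect", current) || current
p resolved  # => :url
```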

Instance Method Details

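#classify(segment) ⇒ Object

Classify a single segment and return one of TYPES. A nil or empty segment is `:literal`. Results are memoized per string; once the cache holds CACHE_MAX entries it is cleared before the next result is stored.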



# File 'lib/iriq/segment_classifier.rb', line 213

def classify(segment)
  return :literal if segment.nil? || segment.empty?

  cached = @cache[segment]
  return cached if cached

  @cache.clear if @cache.size >= CACHE_MAX
  @cache[segment] = compute_classification(segment)
end
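The clear-when-full memoization can be observed in isolation with a toy class. `TinyClassifier`, its counter, and its two-rule `compute_classification` are hypothetical stand-ins; the cap is lowered to 3 so the clear is easy to trigger.

```ruby
# Toy stand-in that reuses the same caching logic as #classify.
class TinyClassifier
  CACHE_MAX = 3 # tiny cap for the demo; the documented value is 10_000

  attr_reader :computed

  def initialize
    @cache = {}
    @computed = 0 # counts cache misses
  end

  def classify(segment)
    return :literal if segment.nil? || segment.empty?

    cached = @cache[segment]
    return cached if cached

    @cache.clear if @cache.size >= CACHE_MAX
    @cache[segment] = compute_classification(segment)
  end

  def cache_size
    @cache.size
  end

  private

  # Hypothetical two-rule classifier, just enough to have something to cache.
  def compute_classification(segment)
    @computed += 1
    segment.match?(/\A\d+\z/) ? :integer : :literal
  end
end

classifier = TinyClassifier.new
classifier.classify("users")
classifier.classify("users")       # served from the cache
misses_after_repeat = classifier.computed

classifier.classify("1")
classifier.classify("2")           # cache now holds 3 entries
classifier.classify("3")           # cache is cleared, then "3" is stored

p misses_after_repeat    # => 1
p classifier.cache_size  # => 1
p classifier.computed    # => 4
```

Clearing the whole cache is cruder than LRU eviction but keeps the hot path to a single hash lookup.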

#recognizers ⇒ Object

Snapshot of the live ensemble. Useful for tests and tooling that want to inspect which Recognizers a corpus is consulting.



# File 'lib/iriq/segment_classifier.rb', line 235

def recognizers
  @recognizers.dup
end

#register_recognizer(recognizer) ⇒ Object

Append a Recognizer to the ensemble. Called by Corpus#activate_proposal to promote a learned RecognizerProposal into a live Recognizer. Busts the classify cache so subsequent classify() calls see the new Recognizer.



# File 'lib/iriq/segment_classifier.rb', line 227

def register_recognizer(recognizer)
  @recognizers << recognizer
  @cache.clear
  recognizer
end

#variable?(type) ⇒ Boolean

Anything except :literal is considered variable for shape/explain.

Returns:

  • (Boolean)


# File 'lib/iriq/segment_classifier.rb', line 240

def variable?(type)
  type != :literal
end