Class: Iriq::Cluster

Inherits:
Object
Defined in:
lib/iriq/cluster.rb

Overview

A group of identifiers that share a host + shape key. Tracks examples and per-position segment statistics so callers can ask which positions are actually stable in practice (e.g. /users/ always literal, /integer always variable).
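As a rough illustration of the per-position bookkeeping described above, the sketch below tallies path segments by position for three made-up paths and marks a position stable when only one distinct value was seen. The paths and local variable names are illustrative assumptions; the sketch uses plain Ruby rather than the Iriq API.

```ruby
# Illustrative only: plain-Ruby stand-in for per-position segment tallies.
paths = ["/users/1", "/users/2", "/users/3"]

segment_counts = []
paths.each do |path|
  segments = path.split("/").reject(&:empty?)
  segments.each_with_index do |seg, i|
    segment_counts[i] ||= Hash.new(0)
    segment_counts[i][seg] += 1
  end
end

# A position is "stable" when every observation carried the same value.
stable = segment_counts.map { |counts| counts.size == 1 }

# segment_counts => [{ "users" => 3 }, { "1" => 1, "2" => 1, "3" => 1 }]
# stable         => [true, false]
```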

Constant Summary

MAX_EXAMPLES = 10
DATE_CONFIDENCE_THRESHOLD = 0.8

Share of date-typed observations required before the corpus promotes a param to :date. 8-digit IDs that happen to start with a year in the 1900..2100 range can look like YYYYMMDD by accident; without a quorum, random IDs would be canonicalized as dates.

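A small worked example of the quorum check may help. date_quorum? is a hypothetical helper, and the tallies are made-up; only the 0.8 threshold comes from the constant above.

```ruby
# Hypothetical helper: does the :date share of a type tally reach the threshold?
def date_quorum?(type_counts, threshold: 0.8)
  total = type_counts.values.sum
  return false if total.zero?

  type_counts.fetch(:date, 0).to_f / total >= threshold
end

date_quorum?({ date: 7, integer: 3 }) # 7 of 10 observations: below quorum
date_quorum?({ date: 9, integer: 1 }) # 9 of 10 observations: quorum reached
```

In the first case the param would not be promoted to :date even though :date is the most common type.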
NUMBER_CONFIDENCE_THRESHOLD = 0.8
NUMBER_SUBTYPE_THRESHOLD = 0.8

:number umbrella thresholds. Promote a position to :number when the combined :integer + :float share reaches NUMBER_CONFIDENCE_THRESHOLD and neither subtype alone reaches NUMBER_SUBTYPE_THRESHOLD: the position has a clear numeric pattern, but it isn't purely ints or purely floats.

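The umbrella rule can be sketched in isolation. number_umbrella? is a hypothetical helper operating on a made-up type tally; it omits the precondition, visible in #param_type, that the dominant type is already :integer or :float.

```ruby
# Hypothetical helper: combined numeric share is high, but neither subtype dominates.
def number_umbrella?(type_counts, subtype_threshold: 0.8, confidence_threshold: 0.8)
  total = type_counts.values.sum
  return false if total.zero?

  int_frac   = type_counts.fetch(:integer, 0).to_f / total
  float_frac = type_counts.fetch(:float, 0).to_f / total

  int_frac < subtype_threshold &&
    float_frac < subtype_threshold &&
    (int_frac + float_frac) >= confidence_threshold
end

number_umbrella?({ integer: 5, float: 4, literal: 1 }) # mixed numerics, combined share 0.9
number_umbrella?({ integer: 9, float: 1 })             # integers alone reach 0.9
number_umbrella?({ integer: 4, float: 3, literal: 3 }) # combined share only 0.7
```

Only the first tally qualifies for :number; the second stays :integer and the third lacks a clear numeric pattern.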
ENUM_MIN_OBSERVATIONS = 20
ENUM_MAX_CARDINALITY = 10
ENUM_MIN_VALUE_COUNT = 2
ENUM_MIN_COVERAGE = 0.95

:enum thresholds. Promote a param to :enum when the corpus has seen enough samples to trust the bound (ENUM_MIN_OBSERVATIONS), the value set is small (ENUM_MAX_CARDINALITY), each tracked value appears at least ENUM_MIN_VALUE_COUNT times (rules out singletons), and the tracked values account for nearly all observations (ENUM_MIN_COVERAGE, which lets a few stragglers through).

YEAR_RANGE = 1900..2100
YEAR_MIN_OBSERVATIONS = 5
YEAR_MIN_DISTINCT = 2
YEAR_MAX_DISTINCT = 150
HTTP_STATUS_RANGE = 100..599
HTTP_STATUS_MIN_OBSERVATIONS = 5
HTTP_STATUS_MIN_DISTINCT = 2
HTTP_STATUS_MAX_DISTINCT = 30

Constructor Details

#initialize(key:, host:, scheme:, shape:, max_values: PositionStats::DEFAULT_MAX_VALUES) ⇒ Cluster

Returns a new instance of Cluster.



# File 'lib/iriq/cluster.rb', line 42

def initialize(key:, host:, scheme:, shape:, max_values: PositionStats::DEFAULT_MAX_VALUES)
  @key            = key
  @host           = host
  @scheme         = scheme
  @shape          = shape
  @shape_object   = nil
  @examples       = []
  @example_keys   = Set.new
  @count          = 0
  @segment_counts = []
  @max_values     = max_values
  # Query-param stats keyed by param name. Each is a PositionStats — same
  # cardinality cap, same type-counts machinery, just indexed by ?key=
  # instead of by path position.
  @param_stats    = {}
end

Instance Attribute Details

#count ⇒ Object (readonly)

Returns the value of attribute count.



# File 'lib/iriq/cluster.rb', line 7

def count
  @count
end

#examples ⇒ Object (readonly)

Returns the value of attribute examples.



# File 'lib/iriq/cluster.rb', line 7

def examples
  @examples
end

#host ⇒ Object (readonly)

Returns the value of attribute host.



# File 'lib/iriq/cluster.rb', line 7

def host
  @host
end

#key ⇒ Object (readonly)

Returns the value of attribute key.



# File 'lib/iriq/cluster.rb', line 7

def key
  @key
end

#max_values ⇒ Object (readonly)

Returns the value of attribute max_values.



# File 'lib/iriq/cluster.rb', line 7

def max_values
  @max_values
end

#param_stats ⇒ Object (readonly)

Returns the value of attribute param_stats.



# File 'lib/iriq/cluster.rb', line 7

def param_stats
  @param_stats
end

#scheme ⇒ Object (readonly)

Returns the value of attribute scheme.



# File 'lib/iriq/cluster.rb', line 7

def scheme
  @scheme
end

#shape ⇒ Object (readonly)

Returns the value of attribute shape.



# File 'lib/iriq/cluster.rb', line 7

def shape
  @shape
end

Class Method Details

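.from_dump(h, max_values: PositionStats::DEFAULT_MAX_VALUES) ⇒ Object

Rebuilds a Cluster from the hash produced by #dump. Examples are re-parsed from their canonical strings, and each segment tally is restored as a hash with a default count of 0.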



# File 'lib/iriq/cluster.rb', line 336

def self.from_dump(h, max_values: PositionStats::DEFAULT_MAX_VALUES)
  cluster = new(
    key: h["key"], host: h["host"], scheme: h["scheme"], shape: h["shape"],
    max_values: max_values,
  )
  cluster.instance_variable_set(:@count, h["count"])
  examples = h["examples"].map { |s| Parser.parse(s) }
  cluster.instance_variable_set(:@examples, examples)
  cluster.instance_variable_set(:@example_keys, examples.map(&:canonical).to_set)
  cluster.instance_variable_set(:@segment_counts, h["segment_counts"].map { |sub| Hash.new(0).merge(sub) })
  params = (h["param_stats"] || {}).transform_values { |sd| PositionStats.from_dump(sd) }
  cluster.instance_variable_set(:@param_stats, params)
  cluster
end
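The Hash.new(0).merge(sub) step matters because hashes that come back from a JSON round trip are plain hashes with no default count. The sketch below shows that step on made-up tallies using only the standard json library; it does not call Iriq.

```ruby
require "json"

# Made-up tallies for two path positions.
segment_counts = [{ "users" => 3 }, { "1" => 1, "2" => 2 }]

json     = JSON.generate("segment_counts" => segment_counts)
restored = JSON.parse(json)["segment_counts"].map { |sub| Hash.new(0).merge(sub) }

# Incrementing an unseen value works because the default of 0 survives the merge.
restored[1]["3"] += 1
```

Without the merge into Hash.new(0), the last line would raise, since a missing key in a parsed hash returns nil.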

.key_for(iri, classifier:, shape: nil, host: nil) ⇒ Object

Shared cluster-key derivation. Returns [key, host, scheme, shape]. Callers that already have a hinted shape can pass it in to skip the recomputation; URN inputs ignore the override and always derive their own shape from the NSS value, and for them the returned host is nil and the shape equals the key. host: overrides iri.host and is used by Corpus when host_strategy collapses subdomains or ignores the host.



# File 'lib/iriq/cluster.rb', line 356

def self.key_for(iri, classifier:, shape: nil, host: nil)
  if iri.urn?
    ns, value = (iri.nss || "").split(":", 2)
    derived = value ? urn_value_shape(ns, value, classifier) : nil
    key     = "urn:#{ns}:#{derived}"
    [key, nil, "urn", key]
  else
    shape ||= PathShape.new(classifier: classifier).for(iri.path_segments)
    effective_host = host.nil? ? iri.host : host
    key = "#{iri.scheme}://#{effective_host}#{shape}"
    [key, effective_host, iri.scheme, shape]
  end
end
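To make the key layout concrete, here is a minimal sketch of the non-URN branch. http_cluster_key is a hypothetical helper, Ruby's standard URI stands in for an Iriq identifier, and the shape string "/users/{integer}" is an assumed example of what PathShape might produce.

```ruby
require "uri"

# Hypothetical helper mirroring the non-URN branch of key_for.
def http_cluster_key(uri, shape, host: nil)
  effective_host = host.nil? ? uri.host : host
  ["#{uri.scheme}://#{effective_host}#{shape}", effective_host, uri.scheme, shape]
end

uri = URI.parse("https://api.example.com/users/42")

default_key, = http_cluster_key(uri, "/users/{integer}")
# default_key => "https://api.example.com/users/{integer}"

collapsed_key, collapsed_host = http_cluster_key(uri, "/users/{integer}", host: "example.com")
# collapsed_key => "https://example.com/users/{integer}"

# URN branch: the namespace-specific string is split once on ":".
ns, value = "isbn:9780306406157".split(":", 2)
# ns => "isbn", value => "9780306406157"
```

Passing host: changes both the key and the returned host, which is how two subdomains can land in the same cluster.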

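.urn_value_shape(ns, value, classifier) ⇒ Object

Shape token for the value part of a URN: the value itself when it is not variable, otherwise a {hint} placeholder, falling back to {type} when no hint is available.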



# File 'lib/iriq/cluster.rb', line 370

def self.urn_value_shape(ns, value, classifier)
  entry = SegmentHints.derive([ns, value], classifier).last
  return entry[:value] unless entry[:variable]

  "{#{entry[:hint] || entry[:type]}}"
end

Instance Method Details

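#add(identifier, classifier: SegmentClassifier::DEFAULT) ⇒ Object

Records one identifier: increments count, keeps up to MAX_EXAMPLES distinct examples (deduplicated by canonical form), tallies each path segment by position, and feeds each query param value into its PositionStats.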



# File 'lib/iriq/cluster.rb', line 59

def add(identifier, classifier: SegmentClassifier::DEFAULT)
  @count += 1
  if @examples.size < MAX_EXAMPLES
    canon = identifier.canonical
    @examples << identifier unless @example_keys.include?(canon)
    @example_keys << canon
  end

  identifier.path_segments.each_with_index do |seg, i|
    @segment_counts[i] ||= Hash.new(0)
    @segment_counts[i][seg] += 1
  end

  return unless identifier.query_params
  identifier.query_params.each do |name, value|
    stats = @param_stats[name] ||= PositionStats.new(max_values: @max_values)
    stats.observe(value.to_s, classifier.classify(value.to_s))
  end
end

#dominant_excluding(stats, skip) ⇒ Object

Most common type in stats.type_counts, excluding skip. Ties are broken lexicographically so the choice is deterministic across runtimes.



# File 'lib/iriq/cluster.rb', line 307

def dominant_excluding(stats, skip)
  best = nil
  best_count = -1
  stats.type_counts.each do |t, n|
    next if t == skip
    if n > best_count || (n == best_count && t.to_s < best.to_s)
      best = t
      best_count = n
    end
  end
  best
end
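The tie-break is easiest to see on a small tally. The sketch below is a hypothetical stand-alone variant that takes the type tally directly instead of a stats object, and expresses the same rule with min_by rather than the explicit loop shown above.

```ruby
# Hypothetical stand-alone variant: highest count wins, ties go to the
# lexicographically smaller type name.
def dominant_excluding(type_counts, skip)
  winner = type_counts.reject { |type, _| type == skip }
                      .min_by { |type, n| [-n, type.to_s] }
  winner&.first
end

dominant_excluding({ date: 6, slug: 2, integer: 2 }, :date) # => :integer
dominant_excluding({ date: 3 }, :date)                      # => nil
```

In the first call :slug and :integer tie at 2, and "integer" sorts before "slug". The nil in the second call is what lets #param_type fall back to :literal.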

#dump ⇒ Object

JSON-friendly dump for persistence (distinct from #to_h, which is a display form). Examples are dumped as canonical strings and re-parsed on load.



# File 'lib/iriq/cluster.rb', line 323

def dump
  {
    "key"            => key,
    "host"           => host,
    "scheme"         => scheme,
    "shape"          => shape,
    "count"          => count,
    "examples"       => examples.map(&:canonical),
    "segment_counts" => @segment_counts.map { |h| h || {} },
    "param_stats"    => @param_stats.transform_values(&:dump),
  }
end

#enum?(stats) ⇒ Boolean

True when stats shows a bounded set of repeated values worth treating as an enum. See ENUM_* constants at the top of this class.

Returns:

  • (Boolean)


# File 'lib/iriq/cluster.rb', line 243

def enum?(stats)
  return false if stats.total < ENUM_MIN_OBSERVATIONS
  return false if stats.cardinality.zero? || stats.cardinality > ENUM_MAX_CARDINALITY
  return false if stats.value_counts.any? { |_, n| n < ENUM_MIN_VALUE_COUNT }

  coverage = stats.value_counts.values.sum.to_f / stats.total
  coverage >= ENUM_MIN_COVERAGE
end
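The four gates can be exercised with a small stand-alone sketch. enum_like? is a hypothetical helper that takes the total and the tracked value counts directly, and it assumes cardinality equals the number of tracked values; the real method reads these from a PositionStats. The default thresholds mirror the ENUM_* constants.

```ruby
# Hypothetical helper applying the same four gates to plain values.
def enum_like?(total, value_counts, min_obs: 20, max_card: 10, min_count: 2, min_coverage: 0.95)
  return false if total < min_obs
  return false if value_counts.empty? || value_counts.size > max_card
  return false if value_counts.any? { |_, n| n < min_count }

  value_counts.values.sum.to_f / total >= min_coverage
end

enum_like?(40, { "active" => 25, "draft" => 10, "archived" => 5 }) # passes every gate
enum_like?(40, { "active" => 25, "draft" => 14, "typo" => 1 })     # fails: singleton value
enum_like?(10, { "a" => 5, "b" => 5 })                             # fails: too few observations
enum_like?(40, { "a" => 20, "b" => 10 })                           # fails: coverage is 0.75
```

The last case models a position where a quarter of the observations fell outside the tracked values.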

#enum_values(stats) ⇒ Object

Distinct values tracked for this param, ordered by descending count (lex tie-break). Returned alongside :enum-typed rows in param_summary so verbose/explain consumers can render the value set.



# File 'lib/iriq/cluster.rb', line 255

def enum_values(stats)
  stats.value_counts.sort_by { |v, n| [-n, v] }.map(&:first)
end

#file_kind_distribution(stats) ⇒ Object

Buckets tracked values by file kind and returns the fraction each kind represents over tracked observations. :unknown covers values that classified as :file but whose extension isn't in the kind allowlist (this shouldn't normally happen, since the classifier already gates on the kind map). The denominator is the sum of value_counts rather than stats.total, so the fractions sum to 1.0 up to rounding; observations beyond the value-tracking cap (max_values, PositionStats::DEFAULT_MAX_VALUES by default) are not reflected.



# File 'lib/iriq/cluster.rb', line 289

def file_kind_distribution(stats)
  return {} if stats.value_counts.empty?

  total = stats.value_counts.values.sum
  return {} if total.zero?

  kinds = Hash.new(0)
  stats.value_counts.each do |value, n|
    kind = SegmentClassifier.file_kind(value) || :unknown
    kinds[kind] += n
  end
  kinds.sort_by { |k, n| [-n, k.to_s] }.to_h.transform_values do |n|
    (n.to_f / total).round(4)
  end
end
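A rough stand-alone version of the bucketing follows. The extension-to-kind map is a hypothetical stand-in for SegmentClassifier.file_kind, and the value counts are made-up.

```ruby
# Hypothetical extension map; the real lookup is SegmentClassifier.file_kind.
kind_for = { ".png" => :image, ".jpg" => :image, ".pdf" => :document }

value_counts = { "a.png" => 3, "b.jpg" => 1, "c.pdf" => 4, "d.xyz" => 2 }
total        = value_counts.values.sum # 10 tracked observations

kinds = Hash.new(0)
value_counts.each do |value, n|
  kinds[kind_for.fetch(File.extname(value), :unknown)] += n
end

distribution = kinds.sort_by { |kind, n| [-n, kind.to_s] }.to_h.transform_values do |n|
  (n.to_f / total).round(4)
end
# distribution => { document: 0.4, image: 0.4, unknown: 0.2 }
```

:document and :image tie at 4 observations, so the name decides their order.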

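#http_status_position?(type, stats) ⇒ Boolean

True when an integer-typed position has at least HTTP_STATUS_MIN_OBSERVATIONS observations, a cardinality between HTTP_STATUS_MIN_DISTINCT and HTTP_STATUS_MAX_DISTINCT, and a numeric min and max that both fall inside HTTP_STATUS_RANGE.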

Returns:

  • (Boolean)


# File 'lib/iriq/cluster.rb', line 231

def http_status_position?(type, stats)
  return false unless type == :integer
  return false if stats.numeric_count.zero?
  return false if stats.cardinality < HTTP_STATUS_MIN_DISTINCT
  return false if stats.cardinality > HTTP_STATUS_MAX_DISTINCT
  return false if stats.total < HTTP_STATUS_MIN_OBSERVATIONS

  HTTP_STATUS_RANGE.cover?(stats.numeric_min) && HTTP_STATUS_RANGE.cover?(stats.numeric_max)
end

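#param_summary ⇒ Object

Per-param summary, ordered by descending count (equivalently, descending presence), with ties broken by name. Each entry is:

{ name: "page", count: N, type: :integer, cardinality: K, presence: 0.83 }

presence is count / @count, the fraction of observations that had this param. Depending on the inferred type, a row may also carry :values, :value_distribution, :subtype_distribution, or :kind_distribution, plus :min, :max, and :avg when numeric values were observed.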



# File 'lib/iriq/cluster.rb', line 111

def param_summary
  return [] if @param_stats.empty?

  @param_stats.map { |name, _stats|
    stats = @param_stats[name]
    type  = param_type(name)
    row   = {
      name:        name,
      count:       stats.total,
      type:        type,
      cardinality: stats.cardinality,
      presence:    @count.positive? ? stats.total.to_f / @count : 0.0,
    }
    row[:values] = enum_values(stats) if type == :enum
    # Verbose value distribution — fractions over tracked occurrences.
    # Boolean and enum positions get the per-value breakdown (e.g.
    # `true: 0.97, false: 0.03`). Number positions get the int-vs-float
    # split via :subtype_distribution.
    if type == :boolean || type == :enum
      row[:value_distribution] = value_distribution(stats)
    end
    if type == :number
      row[:subtype_distribution] = subtype_distribution(stats, %i[integer float])
    end
    # :file kind breakdown — derived from tracked value_counts at
    # summary time. Best-effort: only reflects observations within
    # the value-tracking cap.
    if type == :file
      row[:kind_distribution] = file_kind_distribution(stats)
    end
    if stats.numeric_count.positive?
      row[:min] = stats.numeric_min
      row[:max] = stats.numeric_max
      row[:avg] = stats.numeric_avg
    end
    row
  }.sort_by { |row| [-row[:count], row[:name]] }
end
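The presence arithmetic and row ordering can be sketched with plain hashes. The cluster count and per-param totals are made-up, and the rows carry only a subset of the real keys.

```ruby
# Made-up inputs: 6 identifiers observed, with per-param observation totals.
cluster_count = 6
param_totals  = { "sort" => 2, "q" => 5, "page" => 5 }

rows = param_totals.map do |name, total|
  {
    name:     name,
    count:    total,
    presence: cluster_count.positive? ? total.to_f / cluster_count : 0.0,
  }
end
rows = rows.sort_by { |row| [-row[:count], row[:name]] }

names = rows.map { |row| row[:name] }
# names => ["page", "q", "sort"]
```

"page" and "q" both appear in 5 of 6 identifiers (presence of roughly 0.83), so the name breaks the tie.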

#param_type(name) ⇒ Object

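Returns the type the corpus is confident enough to assign to this param, or nil when the param has no observations. Starting from stats.dominant_type, the checks run in order: :year, then :http_status, then :enum (skipped when the dominant type is :boolean), then the :date gate, then the :number umbrella, and finally the param-name hint. When :date is dominant but below DATE_CONFIDENCE_THRESHOLD, the result falls back to the most common non-date type (or :literal if none exists). Shared by Cluster#param_summary and Corpus#inferred_param_type so both views agree on what the corpus "thinks" about a param.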



# File 'lib/iriq/cluster.rb', line 156

def param_type(name)
  stats = @param_stats[name]
  return nil unless stats
  return nil if stats.total.zero?

  type = stats.dominant_type

  # :year takes priority over :enum for numeric range columns —
  # a "years 2020..2026" position is more useful described as a
  # ranged year than as an enum of those specific values.
  return :year if year_position?(type, stats)
  # :http_status — 3-digit ints clustered in 100..599 are almost
  # certainly HTTP statuses. Same shape as :year (range check) but
  # tighter window. Useful for `?status=...` or path positions that
  # echo a status code.
  return :http_status if http_status_position?(type, stats)

  # :enum check — bounded set of repeated values trumps the underlying
  # value type. `?status=active|draft|archived` surfaces as :enum
  # (with the value list) rather than :literal even though each value
  # individually classifies as a literal. Skip the override when the
  # dominant type is already specific (`:boolean` carries more meaning
  # than a 2-value enum).
  return :enum if enum?(stats) && type != :boolean

  # :date gate — demote when there isn't enough date-typed quorum.
  if type == :date
    date_frac = stats.type_counts[:date].to_f / stats.total
    return type if date_frac >= DATE_CONFIDENCE_THRESHOLD

    return dominant_excluding(stats, :date) || :literal
  end

  # :number umbrella — promote when ints + floats together dominate
  # but neither alone is the clear winner.
  if type == :integer || type == :float
    int_frac   = stats.type_counts[:integer].to_f / stats.total
    float_frac = stats.type_counts[:float].to_f / stats.total
    if int_frac < NUMBER_SUBTYPE_THRESHOLD &&
       float_frac < NUMBER_SUBTYPE_THRESHOLD &&
       (int_frac + float_frac) >= NUMBER_CONFIDENCE_THRESHOLD
      return :number
    end
  end

  # Param-name fallback — `?phone=...` overrides a generic literal
  # type with `:phone` when the value's shape was too weak to detect
  # on its own. Only fires for overridable types (literal/opaque_id/slug).
  if (hint = SegmentClassifier.param_name_hint(name, type))
    return hint
  end

  type
end

#segment_stats ⇒ Object

Per-position summary. stable is true when exactly one distinct value has been seen at that position:

[
  { position: 0, stable: true,  values: { "users" => 3 } },
  { position: 1, stable: false, values: { "1" => 1, "2" => 1, "3" => 1 } },
]


# File 'lib/iriq/cluster.rb', line 84

def segment_stats
  @segment_counts.each_with_index.map do |counts, i|
    {
      position: i,
      stable:   counts.size == 1,
      values:   counts.dup,
    }
  end
end

#shape_object(classifier: SegmentClassifier::DEFAULT) ⇒ Object

Structured Iriq::Shape, lazily derived from the first observed example, or nil if no examples are present yet. The result is cached after the first successful call, so a different classifier: passed on a later call has no effect.



# File 'lib/iriq/cluster.rb', line 12

def shape_object(classifier: SegmentClassifier::DEFAULT)
  return @shape_object if @shape_object
  return nil if @examples.empty?

  @shape_object = Shape.from_segments(@examples.first.path_segments, classifier: classifier)
end

#subtype_distribution(stats, subtypes) ⇒ Object

Slices type_counts to the given subtypes and returns the fraction of stats.total each one represents, omitting subtypes with no observations. Used for the :number umbrella to expose the int-vs-float split.



# File 'lib/iriq/cluster.rb', line 274

def subtype_distribution(stats, subtypes)
  return {} if stats.total.zero?

  subtypes.each_with_object({}) do |t, out|
    n = stats.type_counts[t] || 0
    out[t] = (n.to_f / stats.total).round(4) if n.positive?
  end
end

#to_h ⇒ Object



# File 'lib/iriq/cluster.rb', line 94

def to_h
  {
    key:      key,
    host:     host,
    scheme:   scheme,
    shape:    shape,
    count:    count,
    examples: examples.map(&:canonical),
    segments: segment_stats,
    params:   param_summary,
  }
end

#value_distribution(stats) ⇒ Object

value_distribution returns the fraction of total observations each tracked value represents, ordered by descending count then lex. Used by param_summary for :boolean and :enum positions so callers can render “true 97%, false 3%”-style breakdowns.



# File 'lib/iriq/cluster.rb', line 263

def value_distribution(stats)
  return {} if stats.total.zero?

  stats.value_counts.sort_by { |v, n| [-n, v] }.to_h.transform_values do |n|
    (n.to_f / stats.total).round(4)
  end
end
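A short stand-alone sketch of the same computation follows. This variant is hypothetical in that it takes the tracked counts and the total directly rather than a stats object; the data is made-up.

```ruby
# Hypothetical stand-alone variant: fractions are taken over the overall total.
def value_distribution(value_counts, total)
  return {} if total.zero?

  value_counts.sort_by { |value, n| [-n, value] }.to_h.transform_values do |n|
    (n.to_f / total).round(4)
  end
end

full = value_distribution({ "false" => 3, "true" => 97 }, 100)
# full => { "true" => 0.97, "false" => 0.03 }

capped = value_distribution({ "false" => 3, "true" => 97 }, 120)
# capped fractions sum to less than 1.0: 20 observations were not tracked
```

Because the denominator is the overall total, the fractions only sum to 1.0 when every observation is covered by a tracked value.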

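#year_position?(type, stats) ⇒ Boolean

True when an integer-typed position has at least YEAR_MIN_OBSERVATIONS observations, a cardinality between YEAR_MIN_DISTINCT and YEAR_MAX_DISTINCT, and a numeric min and max that both fall inside YEAR_RANGE.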

Returns:

  • (Boolean)


# File 'lib/iriq/cluster.rb', line 216

def year_position?(type, stats)
  return false unless type == :integer
  return false if stats.numeric_count.zero?
  return false if stats.cardinality < YEAR_MIN_DISTINCT
  return false if stats.cardinality > YEAR_MAX_DISTINCT
  return false if stats.total < YEAR_MIN_OBSERVATIONS

  YEAR_RANGE.cover?(stats.numeric_min) && YEAR_RANGE.cover?(stats.numeric_max)
end
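#year_position? and #http_status_position? share the same range-check shape, which can be sketched once with the bounds passed in. in_range_position? is a hypothetical helper working on a plain array of integers; it assumes cardinality means the number of distinct values and leaves out the type == :integer precondition.

```ruby
# Hypothetical helper: enough observations, a bounded number of distinct
# values, and both extremes inside the expected range.
def in_range_position?(values, range:, min_obs:, min_distinct:, max_distinct:)
  return false if values.empty?

  distinct = values.uniq.size
  return false if distinct < min_distinct || distinct > max_distinct
  return false if values.size < min_obs

  range.cover?(values.min) && range.cover?(values.max)
end

year_rules   = { range: 1900..2100, min_obs: 5, min_distinct: 2, max_distinct: 150 }
status_rules = { range: 100..599, min_obs: 5, min_distinct: 2, max_distinct: 30 }

years    = [2020, 2021, 2021, 2023, 2024]
statuses = [200, 200, 404, 500, 301]

in_range_position?(years, **year_rules)        # true
in_range_position?(statuses, **status_rules)   # true
in_range_position?(statuses, **year_rules)     # false: values fall outside 1900..2100
in_range_position?([2024] * 5, **year_rules)   # false: only one distinct value
```

The two ranges do not overlap, so a single position cannot satisfy both checks.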