Class: Iriq::Cluster
- Inherits: Object
  - Object
    - Iriq::Cluster
- Defined in: lib/iriq/cluster.rb
Overview
A group of identifiers that share a host + shape key. Tracks examples and per-position segment statistics so callers can ask which positions are actually stable in practice (e.g. /users/ always literal, /integer always variable).
Constant Summary
- MAX_EXAMPLES = 10
- DATE_CONFIDENCE_THRESHOLD = 0.8
  Share of date-typed observations required before the corpus promotes a param to :date. 8-digit IDs in the 1900..2100 range look like YYYYMMDD by accident; without quorum we'd canonicalize random IDs.
- NUMBER_CONFIDENCE_THRESHOLD = 0.8
  `:number` umbrella thresholds. Promote a position to :number when the combined :integer + :float share reaches NUMBER_CONFIDENCE_THRESHOLD AND neither subtype alone reaches NUMBER_SUBTYPE_THRESHOLD (we have a clear numeric pattern but it isn't purely ints or purely floats).
- NUMBER_SUBTYPE_THRESHOLD = 0.8
- ENUM_MIN_OBSERVATIONS = 20
  `:enum` thresholds. Promote a param to :enum when the corpus has seen enough samples to trust the bound, the value set is small, each value appears more than once (rules out singletons), and the tracked values account for nearly all observations (lets a few stragglers through).
- ENUM_MAX_CARDINALITY = 10
- ENUM_MIN_VALUE_COUNT = 2
- ENUM_MIN_COVERAGE = 0.95
- YEAR_RANGE = 1900..2100
- YEAR_MIN_OBSERVATIONS = 5
- YEAR_MIN_DISTINCT = 2
- YEAR_MAX_DISTINCT = 150
- HTTP_STATUS_RANGE = 100..599
- HTTP_STATUS_MIN_OBSERVATIONS = 5
- HTTP_STATUS_MIN_DISTINCT = 2
- HTTP_STATUS_MAX_DISTINCT = 30
Instance Attribute Summary
- #count ⇒ Object (readonly)
  Returns the value of attribute count.
- #examples ⇒ Object (readonly)
  Returns the value of attribute examples.
- #host ⇒ Object (readonly)
  Returns the value of attribute host.
- #key ⇒ Object (readonly)
  Returns the value of attribute key.
- #max_values ⇒ Object (readonly)
  Returns the value of attribute max_values.
- #param_stats ⇒ Object (readonly)
  Returns the value of attribute param_stats.
- #scheme ⇒ Object (readonly)
  Returns the value of attribute scheme.
- #shape ⇒ Object (readonly)
  Returns the value of attribute shape.
Class Method Summary
- .from_dump(h, max_values: PositionStats::DEFAULT_MAX_VALUES) ⇒ Object
- .key_for(iri, classifier:, shape: nil, host: nil) ⇒ Object
  Shared cluster-key derivation.
- .urn_value_shape(ns, value, classifier) ⇒ Object
Instance Method Summary
- #add(identifier, classifier: SegmentClassifier::DEFAULT) ⇒ Object
- #dominant_excluding(stats, skip) ⇒ Object
  Most common type in stats.type_counts excluding `skip`; lex tie-break so the choice is deterministic across runtimes.
- #dump ⇒ Object
  JSON-friendly dump for persistence (distinct from #to_h which is a display form).
- #enum?(stats) ⇒ Boolean
  True when stats shows a bounded set of repeated values worth treating as an enum.
- #enum_values(stats) ⇒ Object
  Distinct values tracked for this param, ordered by descending count (lex tie-break).
- #file_kind_distribution(stats) ⇒ Object
  file_kind_distribution buckets tracked values by file kind and returns the fraction each kind represents over tracked observations.
- #http_status_position?(type, stats) ⇒ Boolean
- #initialize(key:, host:, scheme:, shape:, max_values: PositionStats::DEFAULT_MAX_VALUES) ⇒ Cluster (constructor)
  A new instance of Cluster.
- #param_summary ⇒ Object
  Per-param summary, ordered by descending presence.
- #param_type(name) ⇒ Object
  Returns the type the corpus is confident enough to call this param.
- #segment_stats ⇒ Object
  Per-position summary: [ { position: 0, stable: true, values: { "users" => 3 } }, { position: 1, stable: false, values: { "1" => 1, "2" => 1, "3" => 1 } } ].
- #shape_object(classifier: SegmentClassifier::DEFAULT) ⇒ Object
  Structured Shape lazily derived from the first observed example: Iriq::Shape, or nil if no examples are present yet.
- #subtype_distribution(stats, subtypes) ⇒ Object
  subtype_distribution slices type_counts to a specific subset and returns the fraction each subtype represents.
- #to_h ⇒ Object
- #value_distribution(stats) ⇒ Object
  value_distribution returns the fraction of total observations each tracked value represents, ordered by descending count then lex.
- #year_position?(type, stats) ⇒ Boolean
Constructor Details
#initialize(key:, host:, scheme:, shape:, max_values: PositionStats::DEFAULT_MAX_VALUES) ⇒ Cluster
Returns a new instance of Cluster.
    # File 'lib/iriq/cluster.rb', line 42

    def initialize(key:, host:, scheme:, shape:, max_values: PositionStats::DEFAULT_MAX_VALUES)
      @key = key
      @host = host
      @scheme = scheme
      @shape = shape
      @shape_object = nil
      @examples = []
      @example_keys = Set.new
      @count = 0
      @segment_counts = []
      @max_values = max_values
      # Query-param stats keyed by param name. Each is a PositionStats: same
      # cardinality cap, same type-counts machinery, just indexed by ?key=
      # instead of by path position.
      @param_stats = {}
    end
Instance Attribute Details
#count ⇒ Object (readonly)
Returns the value of attribute count.
    # File 'lib/iriq/cluster.rb', line 7

    def count
      @count
    end
#examples ⇒ Object (readonly)
Returns the value of attribute examples.
    # File 'lib/iriq/cluster.rb', line 7

    def examples
      @examples
    end
#host ⇒ Object (readonly)
Returns the value of attribute host.
    # File 'lib/iriq/cluster.rb', line 7

    def host
      @host
    end
#key ⇒ Object (readonly)
Returns the value of attribute key.
    # File 'lib/iriq/cluster.rb', line 7

    def key
      @key
    end
#max_values ⇒ Object (readonly)
Returns the value of attribute max_values.
    # File 'lib/iriq/cluster.rb', line 7

    def max_values
      @max_values
    end
#param_stats ⇒ Object (readonly)
Returns the value of attribute param_stats.
    # File 'lib/iriq/cluster.rb', line 7

    def param_stats
      @param_stats
    end
#scheme ⇒ Object (readonly)
Returns the value of attribute scheme.
    # File 'lib/iriq/cluster.rb', line 7

    def scheme
      @scheme
    end
#shape ⇒ Object (readonly)
Returns the value of attribute shape.
    # File 'lib/iriq/cluster.rb', line 7

    def shape
      @shape
    end
Class Method Details
.from_dump(h, max_values: PositionStats::DEFAULT_MAX_VALUES) ⇒ Object
    # File 'lib/iriq/cluster.rb', line 336

    def self.from_dump(h, max_values: PositionStats::DEFAULT_MAX_VALUES)
      cluster = new(
        key: h["key"], host: h["host"], scheme: h["scheme"], shape: h["shape"],
        max_values: max_values,
      )
      cluster.instance_variable_set(:@count, h["count"])
      examples = h["examples"].map { |s| Parser.parse(s) }
      cluster.instance_variable_set(:@examples, examples)
      cluster.instance_variable_set(:@example_keys, examples.map(&:canonical).to_set)
      cluster.instance_variable_set(:@segment_counts, h["segment_counts"].map { |sub| Hash.new(0).merge(sub) })
      params = (h["param_stats"] || {}).transform_values { |sd| PositionStats.from_dump(sd) }
      cluster.instance_variable_set(:@param_stats, params)
      cluster
    end
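The `Hash.new(0).merge(sub)` step in from_dump matters because a hash that has been through JSON loses its default value. The following is a minimal standalone sketch of that round trip, using only the standard `json` library; the sample counts are made up for illustration and nothing here calls Iriq itself.

```ruby
require "json"

# Made-up per-position counts standing in for @segment_counts.
segment_counts = [{ "users" => 3 }, { "1" => 1, "2" => 2 }]

# Serialize and parse, as a persistence layer might do with #dump output.
payload = JSON.generate("segment_counts" => segment_counts)
parsed = JSON.parse(payload)["segment_counts"]

# Plain parsed hashes return nil for unseen keys; merging into Hash.new(0)
# restores the zero default so `+= 1` keeps working after a reload.
restored = parsed.map { |sub| Hash.new(0).merge(sub) }
restored[0]["orgs"] += 1
```

After this runs, `restored[0]` holds both the reloaded `"users"` count and the new `"orgs"` count, whereas `parsed[0]["orgs"] += 1` would raise because `nil` has no `+`.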
.key_for(iri, classifier:, shape: nil, host: nil) ⇒ Object
Shared cluster-key derivation. Returns [key, host, scheme, shape]. Callers that already have a hinted shape can pass it in to skip the recomputation; URN inputs ignore the override and always derive their own shape from the NSS value. `host:` overrides iri.host; it is used by Corpus when host_strategy collapses subdomains or ignores the host.
    # File 'lib/iriq/cluster.rb', line 356

    def self.key_for(iri, classifier:, shape: nil, host: nil)
      if iri.urn?
        ns, value = (iri.nss || "").split(":", 2)
        derived = value ? urn_value_shape(ns, value, classifier) : nil
        key = "urn:#{ns}:#{derived}"
        [key, nil, "urn", key]
      else
        shape ||= PathShape.new(classifier: classifier).for(iri.path_segments)
        effective_host = host.nil? ? iri.host : host
        key = "#{iri.scheme}://#{effective_host}#{shape}"
        [key, effective_host, iri.scheme, shape]
      end
    end
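To make the host override concrete, here is a small sketch of the non-URN branch only. `SketchIri` and `sketch_key_for` are hypothetical stand-ins (the real input is a parsed Iriq identifier), and the `/users/{integer}` shape string is an assumed format chosen for illustration.

```ruby
# Hypothetical stand-in for a parsed IRI with a precomputed shape.
SketchIri = Struct.new(:scheme, :host, :shape, keyword_init: true)

# Mirrors the non-URN branch: an explicit host: wins over iri.host.
def sketch_key_for(iri, host: nil)
  effective_host = host.nil? ? iri.host : host
  key = "#{iri.scheme}://#{effective_host}#{iri.shape}"
  [key, effective_host, iri.scheme, iri.shape]
end

iri = SketchIri.new(scheme: "https", host: "api.example.com", shape: "/users/{integer}")

default_result = sketch_key_for(iri)                         # uses iri.host
collapsed_result = sketch_key_for(iri, host: "example.com")  # caller-supplied host
```

With a collapsed host, identifiers from different subdomains would land on the same key, which is presumably why Corpus passes `host:` when its host_strategy calls for it.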
.urn_value_shape(ns, value, classifier) ⇒ Object
    # File 'lib/iriq/cluster.rb', line 370

    def self.urn_value_shape(ns, value, classifier)
      entry = SegmentHints.derive([ns, value], classifier).last
      return entry[:value] unless entry[:variable]

      "{#{entry[:hint] || entry[:type]}}"
    end
Instance Method Details
#add(identifier, classifier: SegmentClassifier::DEFAULT) ⇒ Object
    # File 'lib/iriq/cluster.rb', line 59

    def add(identifier, classifier: SegmentClassifier::DEFAULT)
      @count += 1
      if @examples.size < MAX_EXAMPLES
        canon = identifier.canonical
        @examples << identifier unless @example_keys.include?(canon)
        @example_keys << canon
      end
      identifier.path_segments.each_with_index do |seg, i|
        @segment_counts[i] ||= Hash.new(0)
        @segment_counts[i][seg] += 1
      end

      return unless identifier.query_params

      identifier.query_params.each do |name, value|
        stats = @param_stats[name] ||= PositionStats.new(max_values: @max_values)
        stats.observe(value.to_s, classifier.classify(value.to_s))
      end
    end
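The example-tracking part of #add counts every observation but keeps only a capped, de-duplicated sample. A rough standalone sketch of that behavior follows; it uses plain strings in place of identifiers and a hypothetical cap of 3 (the class itself uses MAX_EXAMPLES = 10) so the cutoff is visible.

```ruby
require "set"

SKETCH_MAX_EXAMPLES = 3 # hypothetical smaller cap, for demonstration only

examples = []
example_keys = Set.new
count = 0

# Canonical strings stand in for identifier.canonical.
%w[/a /a /b /c /d].each do |canon|
  count += 1                                   # every observation is counted
  next unless examples.size < SKETCH_MAX_EXAMPLES

  examples << canon unless example_keys.include?(canon) # skip duplicates
  example_keys << canon
end
```

Here `count` ends at 5 while `examples` stops at `["/a", "/b", "/c"]`: the repeated `/a` is skipped and `/d` arrives after the cap is reached.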
#dominant_excluding(stats, skip) ⇒ Object
Most common type in stats.type_counts excluding `skip`; lex tie-break so the choice is deterministic across runtimes.
    # File 'lib/iriq/cluster.rb', line 307

    def dominant_excluding(stats, skip)
      best = nil
      best_count = -1
      stats.type_counts.each do |t, n|
        next if t == skip

        if n > best_count || (n == best_count && t.to_s < best.to_s)
          best = t
          best_count = n
        end
      end
      best
    end
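The tie-break is easiest to see on a small input. This sketch repeats the same loop over a bare hash of type counts (the real method reads them from a stats object); the counts are invented for illustration.

```ruby
# Same selection loop as above, taking the type_counts hash directly.
def dominant_excluding_sketch(type_counts, skip)
  best = nil
  best_count = -1
  type_counts.each do |t, n|
    next if t == skip

    # Higher count wins; on equal counts the lexically smaller name wins.
    if n > best_count || (n == best_count && t.to_s < best.to_s)
      best = t
      best_count = n
    end
  end
  best
end

tied = dominant_excluding_sketch({ date: 6, slug: 2, integer: 2 }, :date)
only_skipped = dominant_excluding_sketch({ date: 6 }, :date)
```

`tied` is `:integer` because `"integer"` sorts before `"slug"` regardless of insertion order, and `only_skipped` is `nil`, which is the case where param_type falls back to `:literal`.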
#dump ⇒ Object
JSON-friendly dump for persistence (distinct from #to_h which is a display form). Examples are dumped as canonical strings and re-parsed on load.
    # File 'lib/iriq/cluster.rb', line 323

    def dump
      {
        "key" => key,
        "host" => host,
        "scheme" => scheme,
        "shape" => shape,
        "count" => count,
        "examples" => examples.map(&:canonical),
        "segment_counts" => @segment_counts.map { |h| h || {} },
        "param_stats" => @param_stats.transform_values(&:dump),
      }
    end
#enum?(stats) ⇒ Boolean
True when stats shows a bounded set of repeated values worth treating as an enum. See ENUM_* constants at the top of this class.
    # File 'lib/iriq/cluster.rb', line 243

    def enum?(stats)
      return false if stats.total < ENUM_MIN_OBSERVATIONS
      return false if stats.cardinality.zero? || stats.cardinality > ENUM_MAX_CARDINALITY
      return false if stats.value_counts.any? { |_, n| n < ENUM_MIN_VALUE_COUNT }

      coverage = stats.value_counts.values.sum.to_f / stats.total
      coverage >= ENUM_MIN_COVERAGE
    end
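As a worked illustration of the four gates, the sketch below applies the same checks to a hypothetical `EnumSketchStats` struct. It assumes cardinality is simply the number of tracked distinct values, which may differ from how PositionStats computes it; the threshold values are copied from the constant summary.

```ruby
# Hypothetical stand-in for PositionStats; cardinality is assumed to be
# the number of tracked distinct values.
EnumSketchStats = Struct.new(:total, :value_counts, keyword_init: true) do
  def cardinality
    value_counts.size
  end
end

SKETCH_ENUM_MIN_OBSERVATIONS = 20
SKETCH_ENUM_MAX_CARDINALITY = 10
SKETCH_ENUM_MIN_VALUE_COUNT = 2
SKETCH_ENUM_MIN_COVERAGE = 0.95

def enum_sketch?(stats)
  return false if stats.total < SKETCH_ENUM_MIN_OBSERVATIONS
  return false if stats.cardinality.zero? || stats.cardinality > SKETCH_ENUM_MAX_CARDINALITY
  return false if stats.value_counts.any? { |_, n| n < SKETCH_ENUM_MIN_VALUE_COUNT }

  coverage = stats.value_counts.values.sum.to_f / stats.total
  coverage >= SKETCH_ENUM_MIN_COVERAGE
end

statuses = EnumSketchStats.new(total: 40, value_counts: { "active" => 25, "draft" => 10, "archived" => 5 })
with_singleton = EnumSketchStats.new(total: 40, value_counts: { "active" => 39, "typo" => 1 })
too_few = EnumSketchStats.new(total: 10, value_counts: { "active" => 6, "draft" => 4 })
```

Only `statuses` passes: `with_singleton` fails the per-value minimum because `"typo"` was seen once, and `too_few` fails the observation floor.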
#enum_values(stats) ⇒ Object
Distinct values tracked for this param, ordered by descending count (lex tie-break). Returned alongside :enum-typed rows in param_summary so verbose/explain consumers can render the value set.
    # File 'lib/iriq/cluster.rb', line 255

    def enum_values(stats)
      stats.value_counts.sort_by { |v, n| [-n, v] }.map(&:first)
    end
#file_kind_distribution(stats) ⇒ Object
file_kind_distribution buckets tracked values by file kind and returns the fraction each kind represents over tracked observations. `:unknown` covers values that classified as :file but whose extension isn't in the kind allowlist (shouldn't normally happen since the classifier already gates on the kind map). The denominator is the sum of tracked value_counts, not stats.total, so the fractions sum to about 1.0 (up to rounding); observations that fell outside the value-tracking cap are not reflected.
    # File 'lib/iriq/cluster.rb', line 289

    def file_kind_distribution(stats)
      return {} if stats.value_counts.empty?

      total = stats.value_counts.values.sum
      return {} if total.zero?

      kinds = Hash.new(0)
      stats.value_counts.each do |value, n|
        kind = SegmentClassifier.file_kind(value) || :unknown
        kinds[kind] += n
      end
      kinds.sort_by { |k, n| [-n, k.to_s] }.to_h.transform_values do |n|
        (n.to_f / total).round(4)
      end
    end
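The bucketing step can be sketched without the classifier. In the example below, `SKETCH_FILE_KINDS` is a hypothetical extension-to-kind map standing in for SegmentClassifier.file_kind, whose real allowlist is not shown on this page.

```ruby
# Hypothetical kind map; the real lookup is SegmentClassifier.file_kind.
SKETCH_FILE_KINDS = { ".png" => :image, ".jpg" => :image, ".pdf" => :document }.freeze

def file_kind_distribution_sketch(value_counts)
  total = value_counts.values.sum
  return {} if total.zero?

  kinds = Hash.new(0)
  value_counts.each do |value, n|
    kind = SKETCH_FILE_KINDS[File.extname(value)] || :unknown
    kinds[kind] += n # weight each value by how often it was seen
  end
  # Largest bucket first, kind name as tie-break, then convert to fractions.
  kinds.sort_by { |k, n| [-n, k.to_s] }.to_h.transform_values do |n|
    (n.to_f / total).round(4)
  end
end

kind_fractions = file_kind_distribution_sketch({ "a.png" => 6, "b.jpg" => 2, "c.pdf" => 2 })
```

Both image files fold into one `:image` bucket weighted by their counts, giving `{ image: 0.8, document: 0.2 }`.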
#http_status_position?(type, stats) ⇒ Boolean
    # File 'lib/iriq/cluster.rb', line 231

    def http_status_position?(type, stats)
      return false unless type == :integer
      return false if stats.numeric_count.zero?
      return false if stats.cardinality < HTTP_STATUS_MIN_DISTINCT
      return false if stats.cardinality > HTTP_STATUS_MAX_DISTINCT
      return false if stats.total < HTTP_STATUS_MIN_OBSERVATIONS

      HTTP_STATUS_RANGE.cover?(stats.numeric_min) &&
        HTTP_STATUS_RANGE.cover?(stats.numeric_max)
    end
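The check above has the same structure as year_position?, only with different constants, so both can be illustrated with one parameterized sketch. `RangeSketchStats` and `ranged_position_sketch?` are hypothetical names, and the sample numbers are invented; the thresholds are the HTTP_STATUS_* values from the constant summary.

```ruby
# Hypothetical stand-in exposing the fields the range checks read.
RangeSketchStats = Struct.new(:total, :cardinality, :numeric_count,
                              :numeric_min, :numeric_max, keyword_init: true)

def ranged_position_sketch?(type, stats, range:, min_obs:, min_distinct:, max_distinct:)
  return false unless type == :integer
  return false if stats.numeric_count.zero?
  return false if stats.cardinality < min_distinct || stats.cardinality > max_distinct
  return false if stats.total < min_obs

  # Both extremes must fall inside the window, so one outlier disqualifies.
  range.cover?(stats.numeric_min) && range.cover?(stats.numeric_max)
end

http_rules = { range: 100..599, min_obs: 5, min_distinct: 2, max_distinct: 30 }

status_like = RangeSketchStats.new(total: 12, cardinality: 4, numeric_count: 12,
                                   numeric_min: 200, numeric_max: 503)
id_like = RangeSketchStats.new(total: 12, cardinality: 12, numeric_count: 12,
                               numeric_min: 7, numeric_max: 90_210)

status_match = ranged_position_sketch?(:integer, status_like, **http_rules)
id_match = ranged_position_sketch?(:integer, id_like, **http_rules)
```

`status_match` is true and `id_match` is false. Swapping in `range: 1900..2100, min_obs: 5, min_distinct: 2, max_distinct: 150` gives the year variant.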
#param_summary ⇒ Object
Per-param summary, ordered by descending presence. Each entry is:
{ name: "page", count: N, type: :integer, cardinality: K, presence: 0.83 }
presence is count / @count, the fraction of observations that had this param.
    # File 'lib/iriq/cluster.rb', line 111

    def param_summary
      return [] if @param_stats.empty?

      @param_stats.map { |name, _stats|
        stats = @param_stats[name]
        type = param_type(name)
        row = {
          name: name,
          count: stats.total,
          type: type,
          cardinality: stats.cardinality,
          presence: @count.positive? ? stats.total.to_f / @count : 0.0,
        }
        row[:values] = enum_values(stats) if type == :enum
        # Verbose value distribution: fractions over tracked occurrences.
        # Boolean and enum positions get the per-value breakdown (e.g.
        # `true: 0.97, false: 0.03`). Number positions get the int-vs-float
        # split via :subtype_distribution.
        if type == :boolean || type == :enum
          row[:value_distribution] = value_distribution(stats)
        end
        if type == :number
          row[:subtype_distribution] = subtype_distribution(stats, %i[integer float])
        end
        # :file kind breakdown, derived from tracked value_counts at
        # summary time. Best-effort: only reflects observations within
        # the value-tracking cap.
        if type == :file
          row[:kind_distribution] = file_kind_distribution(stats)
        end
        if stats.numeric_count.positive?
          row[:min] = stats.numeric_min
          row[:max] = stats.numeric_max
          row[:avg] = stats.numeric_avg
        end
        row
      }.sort_by { |row| [-row[:count], row[:name]] }
    end
#param_type(name) ⇒ Object
Returns the type the corpus is confident enough to call this param, or nil when the param is unknown or has no observations. Starts from stats.dominant_type and applies overrides in order: :year and :http_status for integer positions that fit their ranges, :enum for a bounded set of repeated values (unless the dominant type is :boolean), the :date gate, the :number umbrella, and finally the param-name hint. When :date is dominant but below DATE_CONFIDENCE_THRESHOLD, falls back to the most common non-date type (or :literal if none exists). Shared by Cluster#param_summary and Corpus#inferred_param_type so both views agree on what the corpus "thinks" about a param.
    # File 'lib/iriq/cluster.rb', line 156

    def param_type(name)
      stats = @param_stats[name]
      return nil unless stats
      return nil if stats.total.zero?

      type = stats.dominant_type

      # :year takes priority over :enum for numeric range columns:
      # a "years 2020..2026" position is more useful described as a
      # ranged year than as an enum of those specific values.
      return :year if year_position?(type, stats)

      # :http_status: 3-digit ints clustered in 100..599 are almost
      # certainly HTTP statuses. Same shape as :year (range check) but
      # tighter window. Useful for `?status=...` or path positions that
      # echo a status code.
      return :http_status if http_status_position?(type, stats)

      # :enum check: a bounded set of repeated values trumps the underlying
      # value type. `?status=active|draft|archived` surfaces as :enum
      # (with the value list) rather than :literal even though each value
      # individually classifies as a literal. Skip the override when the
      # dominant type is already specific (`:boolean` carries more meaning
      # than a 2-value enum).
      return :enum if enum?(stats) && type != :boolean

      # :date gate: demote when there isn't enough date-typed quorum.
      if type == :date
        date_frac = stats.type_counts[:date].to_f / stats.total
        return type if date_frac >= DATE_CONFIDENCE_THRESHOLD

        return dominant_excluding(stats, :date) || :literal
      end

      # :number umbrella: promote when ints + floats together dominate
      # but neither alone is the clear winner.
      if type == :integer || type == :float
        int_frac = stats.type_counts[:integer].to_f / stats.total
        float_frac = stats.type_counts[:float].to_f / stats.total
        if int_frac < NUMBER_SUBTYPE_THRESHOLD &&
           float_frac < NUMBER_SUBTYPE_THRESHOLD &&
           (int_frac + float_frac) >= NUMBER_CONFIDENCE_THRESHOLD
          return :number
        end
      end

      # Param-name fallback: `?phone=...` overrides a generic literal
      # type with `:phone` when the value's shape was too weak to detect
      # on its own. Only fires for overridable types (literal/opaque_id/slug).
      if (hint = SegmentClassifier.param_name_hint(name, type))
        return hint
      end

      type
    end
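The :number umbrella condition is the least obvious of these gates, so here is the arithmetic in isolation. This sketch takes a bare type-count hash and a total rather than a stats object; both thresholds are 0.8 as listed in the constant summary, and the sample counts are invented.

```ruby
SKETCH_NUMBER_CONFIDENCE = 0.8
SKETCH_NUMBER_SUBTYPE = 0.8

# True when ints and floats together reach the confidence threshold
# but neither one reaches the subtype threshold on its own.
def number_umbrella_sketch?(type_counts, total)
  int_frac = type_counts.fetch(:integer, 0).to_f / total
  float_frac = type_counts.fetch(:float, 0).to_f / total
  int_frac < SKETCH_NUMBER_SUBTYPE &&
    float_frac < SKETCH_NUMBER_SUBTYPE &&
    (int_frac + float_frac) >= SKETCH_NUMBER_CONFIDENCE
end

mixed = number_umbrella_sketch?({ integer: 6, float: 3, literal: 1 }, 10)      # 0.6 and 0.3
mostly_int = number_umbrella_sketch?({ integer: 9, float: 1 }, 10)             # 0.9 and 0.1
mostly_other = number_umbrella_sketch?({ integer: 3, float: 2, literal: 5 }, 10) # 0.3 and 0.2
```

`mixed` qualifies (combined share 0.9, neither subtype at 0.8), `mostly_int` does not because integers alone clear the subtype threshold, and `mostly_other` does not because the combined share is only 0.5.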
#segment_stats ⇒ Object
Per-position summary:
[
{ position: 0, stable: true, values: { "users" => 3 } },
{ position: 1, stable: false, values: { "1" => 1, "2" => 1, "3" => 1 } },
]
    # File 'lib/iriq/cluster.rb', line 84

    def segment_stats
      @segment_counts.each_with_index.map do |counts, i|
        {
          position: i,
          stable: counts.size == 1,
          values: counts.dup,
        }
      end
    end
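The summary in the comment above can be reproduced with plain arrays. This sketch combines the per-position counting from #add with the mapping in #segment_stats, using three made-up paths already split into segments.

```ruby
# Three observed paths, pre-split into segments (stand-ins for path_segments).
paths = [%w[users 1], %w[users 2], %w[users 3]]

# Counting step, as in #add: one counter hash per path position.
segment_counts = []
paths.each do |segments|
  segments.each_with_index do |seg, i|
    segment_counts[i] ||= Hash.new(0)
    segment_counts[i][seg] += 1
  end
end

# Summary step, as in #segment_stats: a position is stable when exactly
# one distinct value was ever seen there.
summary = segment_counts.each_with_index.map do |counts, i|
  { position: i, stable: counts.size == 1, values: counts.dup }
end
```

Position 0 comes out stable with `{ "users" => 3 }`, and position 1 comes out unstable with one count per id.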
#shape_object(classifier: SegmentClassifier::DEFAULT) ⇒ Object
Structured Shape lazily derived from the first observed example: Iriq::Shape, or nil if no examples are present yet. Cached after the first call.
    # File 'lib/iriq/cluster.rb', line 12

    def shape_object(classifier: SegmentClassifier::DEFAULT)
      return @shape_object if @shape_object
      return nil if @examples.empty?

      @shape_object = Shape.from_segments(@examples.first.path_segments, classifier: classifier)
    end
#subtype_distribution(stats, subtypes) ⇒ Object
subtype_distribution slices type_counts to a specific subset and returns the fraction each subtype represents. Used for the :number umbrella to expose the int-vs-float split.
    # File 'lib/iriq/cluster.rb', line 274

    def subtype_distribution(stats, subtypes)
      return {} if stats.total.zero?

      subtypes.each_with_object({}) do |t, out|
        n = stats.type_counts[t] || 0
        out[t] = (n.to_f / stats.total).round(4) if n.positive?
      end
    end
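A short sketch of the slicing, again over a bare hash and total instead of a stats object. The counts are invented; note that the denominator is the overall total, so other types still dilute the fractions.

```ruby
def subtype_distribution_sketch(type_counts, total, subtypes)
  return {} if total.zero?

  subtypes.each_with_object({}) do |t, out|
    n = type_counts[t] || 0
    # Subtypes that were never observed are left out of the result.
    out[t] = (n.to_f / total).round(4) if n.positive?
  end
end

split = subtype_distribution_sketch({ integer: 6, float: 3, literal: 1 }, 10, %i[integer float])
ints_only = subtype_distribution_sketch({ integer: 4 }, 4, %i[integer float])
```

`split` is `{ integer: 0.6, float: 0.3 }`; the remaining 0.1 belongs to `:literal`, which is outside the requested subset. `ints_only` omits `:float` entirely.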
#to_h ⇒ Object
    # File 'lib/iriq/cluster.rb', line 94

    def to_h
      {
        key: key,
        host: host,
        scheme: scheme,
        shape: shape,
        count: count,
        examples: examples.map(&:canonical),
        segments: segment_stats,
        params: param_summary,
      }
    end
#value_distribution(stats) ⇒ Object
value_distribution returns the fraction of total observations each tracked value represents, ordered by descending count then lex. Used by param_summary for :boolean and :enum positions so callers can render “true 97%, false 3%”-style breakdowns.
    # File 'lib/iriq/cluster.rb', line 263

    def value_distribution(stats)
      return {} if stats.total.zero?

      stats.value_counts.sort_by { |v, n| [-n, v] }.to_h.transform_values do |n|
        (n.to_f / stats.total).round(4)
      end
    end
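The following sketch reproduces the ordering and the "true 97%, false 3%" breakdown on a bare hash. The second call illustrates, under the assumption that untracked observations still count toward the total, why the fractions can sum to less than 1.0 when value tracking is capped.

```ruby
def value_distribution_sketch(value_counts, total)
  return {} if total.zero?

  # Most frequent value first; equal counts fall back to lexical order.
  value_counts.sort_by { |v, n| [-n, v] }.to_h.transform_values do |n|
    (n.to_f / total).round(4)
  end
end

booleans = value_distribution_sketch({ "false" => 3, "true" => 97 }, 100)

# Assumed capped case: 10 observations in total, but only 8 were tracked.
capped = value_distribution_sketch({ "a" => 5, "b" => 3 }, 10)
```

`booleans` is `{ "true" => 0.97, "false" => 0.03 }` with `"true"` listed first, and the `capped` fractions add up to less than 1.0.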
#year_position?(type, stats) ⇒ Boolean
    # File 'lib/iriq/cluster.rb', line 216

    def year_position?(type, stats)
      return false unless type == :integer
      return false if stats.numeric_count.zero?
      return false if stats.cardinality < YEAR_MIN_DISTINCT
      return false if stats.cardinality > YEAR_MAX_DISTINCT
      return false if stats.total < YEAR_MIN_OBSERVATIONS

      YEAR_RANGE.cover?(stats.numeric_min) && YEAR_RANGE.cover?(stats.numeric_max)
    end