Class: Iriq::Corpus

Inherits:
Object
Defined in:
lib/iriq/corpus.rb

Overview

Streaming-friendly observer over a (potentially unbounded) corpus of IRIs. Maintains rolling aggregates and per-(host, prefix) frequency stats so that classification can improve as more data flows in.

The deterministic, single-IRI API (Iriq.normalize/explain) is unchanged; Corpus#normalize and Corpus#explain are the corpus-informed variants.

State lives in a Storage backend (Memory by default; Json or Sqlite when opened against a file). The classification logic on top is identical regardless of where the counters live.

Constant Summary

VARIABLE_DOMINANCE_THRESHOLD =

Type-based: position is “mostly variable” (UUIDs/integers/etc.).

0.8
LITERAL_UNIQUENESS_THRESHOLD =

Cardinality-based: position has mostly distinct literal values, so the literal “type” is misleading — it’s really a variable slot. We trigger on either:

- very high cardinality fraction (most observations are singletons), OR
- moderate cardinality fraction AND high absolute distinct count

The second branch catches realistic streams where popular outliers bring the frac down but the long tail is clearly variable.

0.8
LITERAL_UNIQUENESS_MODERATE_THRESHOLD =
0.5
MIN_CARDINALITY_FOR_INFERENCE =
20
MIN_OBSERVATIONS_FOR_INFERENCE =

Don’t apply corpus heuristics until we have at least this many observations at a position — too easy to be wrong with tiny samples.

5
STABLE_LITERAL_THRESHOLD =

Value-fraction at or above which a literal is considered the stable occupant of its position.

0.5
5
3
HOST_STRATEGIES =
%i[full registrable none].freeze
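
The cardinality rule described for LITERAL_UNIQUENESS_THRESHOLD can be sketched in a few lines. This is a minimal illustration rather than the library's code: the method name `literal_slot_variable?` is hypothetical, and it assumes that "cardinality fraction" means distinct values divided by total observations, compared with `>=`. The threshold values are the ones listed above.

```ruby
# Threshold values copied from the constant summary above.
LITERAL_UNIQUENESS_THRESHOLD          = 0.8
LITERAL_UNIQUENESS_MODERATE_THRESHOLD = 0.5
MIN_CARDINALITY_FOR_INFERENCE         = 20
MIN_OBSERVATIONS_FOR_INFERENCE        = 5

# Hypothetical helper: decides whether a position whose values all look like
# literals should be treated as a variable slot instead.
# Assumption: cardinality fraction = distinct values / total observations.
def literal_slot_variable?(values)
  total = values.size
  return false if total < MIN_OBSERVATIONS_FOR_INFERENCE # sample too small

  distinct = values.uniq.size
  frac     = distinct.to_f / total

  # Branch 1: almost every observation is a singleton.
  return true if frac >= LITERAL_UNIQUENESS_THRESHOLD

  # Branch 2: popular outliers pull the fraction down, but the long tail
  # still holds many distinct values.
  frac >= LITERAL_UNIQUENESS_MODERATE_THRESHOLD &&
    distinct >= MIN_CARDINALITY_FOR_INFERENCE
end

slugs     = (1..30).map { |i| "post-#{i}" }  # 30 distinct / 30 total
with_hits = slugs + ["about"] * 20           # 31 distinct / 50 total
stable    = ["users"] * 40 + ["admin"] * 10  #  2 distinct / 50 total

p literal_slot_variable?(slugs)     # => true  (branch 1)
p literal_slot_variable?(with_hits) # => true  (branch 2)
p literal_slot_variable?(stable)    # => false
```

The second sample shows why the moderate branch exists: one popular value drags the fraction to 0.62, below the 0.8 cut-off, yet 31 distinct values is still strong evidence of a variable slot.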

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, host_strategy: :full, storage: nil) ⇒ Corpus

Returns a new instance of Corpus.

Raises:

  • (ArgumentError)


# File 'lib/iriq/corpus.rb', line 49

def initialize(classifier: SegmentClassifier::DEFAULT,
               max_values_per_position: PositionStats::DEFAULT_MAX_VALUES,
               host_strategy: :full,
               storage: nil)
  raise ArgumentError, "host_strategy must be one of #{HOST_STRATEGIES.inspect}" \
    unless HOST_STRATEGIES.include?(host_strategy)

  @classifier    = classifier
  @host_strategy = host_strategy
  @storage       = storage || Storage::Memory.new(
    classifier: classifier,
    max_values_per_position: max_values_per_position,
  )
end

Instance Attribute Details

#classifier ⇒ Object (readonly)

Returns the value of attribute classifier.



# File 'lib/iriq/corpus.rb', line 47

def classifier
  @classifier
end

#host_strategy ⇒ Object (readonly)

Returns the value of attribute host_strategy.



# File 'lib/iriq/corpus.rb', line 47

def host_strategy
  @host_strategy
end

#storage ⇒ Object (readonly)

Returns the value of attribute storage.



# File 'lib/iriq/corpus.rb', line 47

def storage
  @storage
end

Class Method Details

.from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object



# File 'lib/iriq/corpus.rb', line 564

def self.from_dump(h, classifier: SegmentClassifier::DEFAULT)
  max_values = h.fetch("max_values_per_position", PositionStats::DEFAULT_MAX_VALUES)
  storage = Storage::Memory.new(classifier: classifier, max_values_per_position: max_values)
  storage.load_dump!(h)
  new(classifier: classifier, storage: storage)
end

.load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object



# File 'lib/iriq/corpus.rb', line 571

def self.load(path, classifier: SegmentClassifier::DEFAULT)
  open(path, classifier: classifier)
end

.open(path, classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, host_strategy: :full) ⇒ Object

Open a corpus against `path`. The file extension picks the backend: `.db`/`.sqlite`/`.sqlite3` use SQLite (incremental writes); anything else uses JSON.



# File 'lib/iriq/corpus.rb', line 67

def self.open(path, classifier: SegmentClassifier::DEFAULT,
                    max_values_per_position: PositionStats::DEFAULT_MAX_VALUES,
                    host_strategy: :full)
  storage = Storage.open(path,
                         classifier: classifier,
                         max_values_per_position: max_values_per_position)
  corpus = new(classifier: classifier, storage: storage, host_strategy: host_strategy)
  corpus.send(:reapply_activated_recognizers!) if storage.respond_to?(:each_activated_recognizer)
  corpus
end
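
The extension rule described for .open can be illustrated with a small stand-alone helper. This is a sketch under assumptions: `backend_for` and `SQLITE_EXTENSIONS` are hypothetical names, and the case-insensitive match is a guess, since this page does not say how Storage.open treats case.

```ruby
# Extensions that select the SQLite backend, as listed in the description.
SQLITE_EXTENSIONS = %w[.db .sqlite .sqlite3].freeze

# Hypothetical helper: returns a symbol naming the backend a path would get.
# Assumption: only the final extension matters and case is ignored.
def backend_for(path)
  SQLITE_EXTENSIONS.include?(File.extname(path).downcase) ? :sqlite : :json
end

p backend_for("corpus.sqlite3") # => :sqlite
p backend_for("corpus.json")    # => :json
p backend_for("corpus")         # => :json (no extension falls through to JSON)
```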

Instance Method Details

#activate_proposal(proposal) ⇒ Object

Promote a RecognizerProposal into a live Recognizer for this corpus.

Mechanics:

1. Synthesize a SynthesizedRecognizer from the proposal's prefix.
2. Switch to a per-corpus classifier (if we were sharing the
   module-level DEFAULT) so activation doesn't leak to other
   corpora using the same default singleton.
3. Register the Recognizer on the classifier — the ensemble
   picks it up on the next classify() call.
4. Persist the activation in storage so reopens re-apply it.
5. Reinfer so existing observations get re-classified through
   the new Recognizer.

Returns the synthesized Recognizer.



# File 'lib/iriq/corpus.rb', line 171

def activate_proposal(proposal)
  recognizer = SynthesizedRecognizer.from_proposal(proposal)
  ensure_per_corpus_classifier!
  @classifier.register_recognizer(recognizer)
  if @storage.respond_to?(:record_activated_recognizer)
    @storage.record_activated_recognizer(recognizer.to_dump)
  end
  reinfer
  recognizer
end
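
Step 2 of the mechanics (switching to a per-corpus classifier before mutating it) is the part most easily gotten wrong, so here is a toy model of it. `ToyClassifier` and `ToyCorpus` are hypothetical stand-ins; how `ensure_per_corpus_classifier!` is actually implemented is not shown on this page, and copying via `dup` is an assumption.

```ruby
# Toy classifier: holds an ordered list of recognizer names.
class ToyClassifier
  attr_reader :recognizers

  def initialize(recognizers = [])
    @recognizers = recognizers
  end

  def register_recognizer(recognizer)
    @recognizers << recognizer
  end

  # Copy the list as well, so a dup does not share state with the original.
  def initialize_copy(source)
    super
    @recognizers = source.recognizers.dup
  end

  # Shared module-level default, analogous to SegmentClassifier::DEFAULT.
  DEFAULT = new(%w[uuid integer])
end

class ToyCorpus
  attr_reader :classifier

  def initialize(classifier: ToyClassifier::DEFAULT)
    @classifier = classifier
  end

  def activate(recognizer)
    # Step 2: stop sharing the default before mutating anything.
    @classifier = @classifier.dup if @classifier.equal?(ToyClassifier::DEFAULT)
    # Step 3: register on the (now private) classifier.
    @classifier.register_recognizer(recognizer)
    recognizer
  end
end

a = ToyCorpus.new
b = ToyCorpus.new
a.activate("order_number")

p a.classifier.recognizers # => ["uuid", "integer", "order_number"]
p b.classifier.recognizers # => ["uuid", "integer"]
```

Without the copy in `activate`, corpus `b` would also see `"order_number"`, which is the leak the description warns about.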

#activate_proposals_above(confidence_threshold, **propose_opts) ⇒ Object

Convenience: activate every proposal whose confidence clears the given threshold. Returns the activated Recognizers. Confidence incorporates both per-position coverage AND cross-host corroboration — see RecognizerProposal#compute_confidence.



# File 'lib/iriq/corpus.rb', line 186

def activate_proposals_above(confidence_threshold, **propose_opts)
  proposals = propose_recognizers(**propose_opts)
  proposals.select { |p| p.confidence >= confidence_threshold }.map { |p| activate_proposal(p) }
end

#activated_recognizer_count ⇒ Object

Number of activated recognizers persisted with this corpus.



# File 'lib/iriq/corpus.rb', line 192

def activated_recognizer_count
  return @storage.activated_recognizer_count if @storage.respond_to?(:activated_recognizer_count)
  0
end

#batch(&block) ⇒ Object

Wrap many observations in a single backend transaction. For SQLite this turns thousands of fsyncs into one; for in-memory backends it’s a no-op. Use when ingesting a batch.



# File 'lib/iriq/corpus.rb', line 376

def batch(&block)
  @storage.batch(&block)
end

#close ⇒ Object



# File 'lib/iriq/corpus.rb', line 369

def close
  @storage.close
end

#clusters ⇒ Object



# File 'lib/iriq/corpus.rb', line 337

def clusters
  @storage.clusters
end

#cross_host_shapes(min_hosts: 2) ⇒ Object

Route shapes that recur across `min_hosts` or more distinct hosts. Returns CrossHostShape records sorted by host_count descending, then by observation_count descending, then by shape (stable, deterministic).

Cross-host recurrence is independent evidence of a real semantic pattern: two unrelated hosts inventing the same `/users/integer` structure by accident is unlikely. A natural follow-up is feeding this signal back into RecognizerProposal confidence, since a proposal supported by N hosts is much stronger than one seen on a single host with the same per-position coverage.



# File 'lib/iriq/corpus.rb', line 207

def cross_host_shapes(min_hosts: 2)
  by_shape = Hash.new { |h, k| h[k] = { hosts: Set.new, count: 0 } }
  @storage.clusters.each do |cluster|
    # Skip non-URL clusters (URN clusters have no host).
    next if cluster.host.nil? || cluster.host.empty?

    agg = by_shape[cluster.shape]
    agg[:hosts] << cluster.host
    agg[:count] += cluster.count
  end

  by_shape.filter_map do |shape, data|
    next nil if data[:hosts].size < min_hosts

    CrossHostShape.new(
      shape:             shape,
      hosts:             data[:hosts],
      observation_count: data[:count],
    )
  end.sort_by { |s| [-s.host_count, -s.observation_count, s.shape] }
end
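
To see the aggregation and sort order on concrete data, the same logic can be run against plain Structs. `ToyCluster` and `ToyShape` are hypothetical stand-ins that carry only the fields used here, and the `{integer}` shape notation is an assumption about how shapes are rendered.

```ruby
require "set"

# Stand-ins for the library's cluster and result records.
ToyCluster = Struct.new(:host, :shape, :count, keyword_init: true)
ToyShape   = Struct.new(:shape, :hosts, :observation_count, keyword_init: true) do
  def host_count
    hosts.size
  end
end

def cross_host_shapes(clusters, min_hosts: 2)
  by_shape = Hash.new { |h, k| h[k] = { hosts: Set.new, count: 0 } }
  clusters.each do |c|
    next if c.host.nil? || c.host.empty? # host-less clusters, e.g. URNs

    agg = by_shape[c.shape]
    agg[:hosts] << c.host
    agg[:count] += c.count
  end

  by_shape.filter_map do |shape, data|
    next if data[:hosts].size < min_hosts

    ToyShape.new(shape: shape, hosts: data[:hosts], observation_count: data[:count])
  end.sort_by { |s| [-s.host_count, -s.observation_count, s.shape] }
end

clusters = [
  ToyCluster.new(host: "a.example", shape: "/users/{integer}", count: 10),
  ToyCluster.new(host: "b.example", shape: "/users/{integer}", count: 4),
  ToyCluster.new(host: "a.example", shape: "/about",           count: 7),
  ToyCluster.new(host: nil,         shape: "/users/{integer}", count: 99),
]

result = cross_host_shapes(clusters)
p result.map { |s| [s.shape, s.host_count, s.observation_count] }
# => [["/users/{integer}", 2, 14]]
```

The host-less cluster contributes nothing, and `/about` is dropped because it appears on only one host.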

#dump ⇒ Object

Legacy dump/load (JSON shape)

The pre-Storage release exposed `Corpus#dump`, `Corpus#save(path)`, and `Corpus.load(path)` for JSON-backed persistence. Those names still work but are now thin wrappers around the appropriate Storage backend.



# File 'lib/iriq/corpus.rb', line 560

def dump
  memory_view.to_dump
end

#each_position_stats(&block) ⇒ Object

Iterates Position → PositionStats over all observed positions. Used by inspection tooling; not part of the hot path.



# File 'lib/iriq/corpus.rb', line 333

def each_position_stats(&block)
  @storage.each_position_stats(&block)
end

#effective_host(host) ⇒ Object

Normalize the host for keying purposes. `:full` keeps the original host; `:registrable` collapses subdomains via the inline-PSL heuristic (api.foo.com + app.foo.com → foo.com); `:none` ignores host entirely so clusters group across all hosts by shape alone.



# File 'lib/iriq/corpus.rb', line 82

def effective_host(host)
  case @host_strategy
  when :registrable then RegistrableDomain.for(host)
  when :none        then ""
  else                   host
  end
end
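
The effect of each strategy on a single host can be shown with a stand-alone version. `naive_registrable` is a hypothetical, deliberately simplistic substitute for RegistrableDomain.for: it keeps the last two labels, which is wrong for suffixes such as `co.uk` and is exactly the case a PSL-based heuristic exists to handle.

```ruby
# Naive stand-in for RegistrableDomain.for (assumption, illustration only).
def naive_registrable(host)
  host.split(".").last(2).join(".")
end

# Same case structure as Corpus#effective_host, with the strategy passed in.
def effective_host(host, strategy)
  case strategy
  when :registrable then naive_registrable(host)
  when :none        then ""
  else                   host
  end
end

%i[full registrable none].each do |strategy|
  p [strategy, effective_host("api.foo.com", strategy)]
end
# [:full, "api.foo.com"]
# [:registrable, "foo.com"]
# [:none, ""]
```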

#events_for(input) ⇒ Object

Build the ordered Event list for `input` without applying it. Useful for inspection, tests, and future event-log persistence. Each call is pure: it has no storage side effects.



# File 'lib/iriq/corpus.rb', line 232

def events_for(input)
  iri = coerce(input)
  hinted_entries = SegmentHints.derive(iri.path_segments, @classifier)
  raw_shape    = PathShape.new(classifier: @classifier, hints: false).from_entries(hinted_entries)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true).from_entries(hinted_entries)
  keying_host  = effective_host(iri.host)

  events = [
    Event::HostSeen.new(keying_host),
    Event::PathLengthSeen.new(iri.path_segments.size),
    Event::RawShapeSeen.new(raw_shape),
    Event::FingerprintSeen.new(hinted_shape),
  ]

  prefix = ""
  hinted_entries.each do |entry|
    events << Event::PositionSeen.new(
      Position.path(host: keying_host, prefix: prefix),
      entry[:value], entry[:type],
    )
    prefix = "#{prefix}/#{placeholder(entry)}"
  end

  key, host, scheme, shape = Cluster.key_for(iri, classifier: @classifier, shape: hinted_shape, host: keying_host)
  events << Event::ClusterAddition.new(key, host, scheme, shape, iri)

  events
end
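
The prefix loop in the listing above is what ties each segment to a position: every segment is recorded under the placeholder path of the segments to its left. The sketch below isolates that loop. The `placeholder` body is an assumption (literals keep their value, variables become `{type}`), since the real helper is not shown on this page, and `position_rows` is a hypothetical name.

```ruby
# Assumed placeholder format: literal segments keep their value, variable
# segments are replaced by "{type}".
def placeholder(entry)
  entry[:type] == :literal ? entry[:value] : "{#{entry[:type]}}"
end

# Returns [prefix, value, type] per segment, where prefix is the placeholder
# path of everything to the left of that segment.
def position_rows(entries)
  prefix = ""
  entries.map do |entry|
    row = [prefix, entry[:value], entry[:type]]
    prefix = "#{prefix}/#{placeholder(entry)}"
    row
  end
end

entries = [
  { value: "users",  type: :literal },
  { value: "42",     type: :integer },
  { value: "orders", type: :literal },
]

position_rows(entries).each { |row| p row }
# ["", "users", :literal]
# ["/users", "42", :integer]
# ["/users/{integer}", "orders", :literal]
```

Because the prefix uses the placeholder rather than the raw value, `/users/42/orders` and `/users/43/orders` feed the same third position.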

#explain(input) ⇒ Object

Per-segment explanation with corpus-informed `classification`. Returns an array of entries shaped like the Explanation rows plus `classification:`, which is one of :stable_literal, :variable_identifier, :rare_literal, :ambiguous, or :corpus_inferred_variable.



# File 'lib/iriq/corpus.rb', line 319

def explain(input)
  iri = coerce(input)
  annotate_segments(iri).map do |entry|
    entry.reject { |k, _| k == :prefix }
  end
end

#fingerprint_counts ⇒ Object



# File 'lib/iriq/corpus.rb', line 329

def fingerprint_counts; @storage.fingerprint_counts; end

#host_counts ⇒ Object



# File 'lib/iriq/corpus.rb', line 326

def host_counts;        @storage.host_counts;        end

#normalize(input) ⇒ Object

Corpus-informed normalization. Falls back to mechanical normalization when the corpus has no signal for a position. Implemented as a thin call into Normalizer with `evidence: self`; the corpus-informed path and query rendering live in #render_path / #render_query below (the evidence-source interface).



# File 'lib/iriq/corpus.rb', line 266

def normalize(input)
  iri = coerce(input)
  Normalizer.normalize_identifier(iri, classifier: @classifier, hints: true, evidence: self)
end

#observe(input) ⇒ Object

Observe a single IRI. Returns an Observation.

Internally: builds an Event list for the IRI, then applies each event through the Reducer registry inside a single storage transaction. The event list is transient today — a future commit can persist it and replay against alternate reducers / thresholds for re-runnable inference. See lib/iriq/event.rb and lib/iriq/reducer.rb.



# File 'lib/iriq/corpus.rb', line 97

def observe(input)
  iri     = coerce(input)
  events  = events_for(iri)
  cluster = nil

  @storage.transaction do |s|
    events.each do |e|
      result = Reducer.apply(e, s)
      cluster = result if e.is_a?(Event::ClusterAddition)
    end
    s.record_observation(iri.canonical) if s.respond_to?(:record_observation)
  end

  Observation.new(corpus: self, identifier: iri, cluster: cluster)
end

#observed_iri_count ⇒ Object

Number of IRIs in the source-IRI log. The materialized views are derived from this log; reinfer replays it.



# File 'lib/iriq/corpus.rb', line 139

def observed_iri_count
  return @storage.observed_iri_count if @storage.respond_to?(:observed_iri_count)
  0
end

#params_for(input) ⇒ Object

Inferred params for the cluster `input` would fall into. Returns the same shape as Cluster#param_summary, which is useful for “what query params might this URL accept?” tooling. Returns an empty array if no cluster has been observed for this shape yet.



# File 'lib/iriq/corpus.rb', line 305

def params_for(input)
  iri = coerce(input)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true)
                          .from_entries(SegmentHints.derive(iri.path_segments, @classifier))
  key, * = Cluster.key_for(iri, classifier: @classifier, shape: hinted_shape,
                           host: effective_host(iri.host))
  cluster = @storage.cluster_for(key)
  cluster ? cluster.param_summary : []
end

#path_length_counts ⇒ Object



# File 'lib/iriq/corpus.rb', line 327

def path_length_counts; @storage.path_length_counts; end

#propose_recognizers(strategies: ProposalStrategy::DEFAULTS, **opts) ⇒ Object

Scan observed values for shape patterns that recur frequently enough to suggest a new Recognizer. Returns RecognizerProposal records; nothing is automatically applied — the proposal carries enough evidence for a human to decide whether to bake the Recognizer in.

Strategies are pluggable; the default set lives in Iriq::ProposalStrategy::DEFAULTS. Pass `strategies:` to limit or extend the set. Pass `min_observations:` / `min_coverage:` / `min_hosts:` to tune what passes the noise floor.



# File 'lib/iriq/corpus.rb', line 153

def propose_recognizers(strategies: ProposalStrategy::DEFAULTS, **opts)
  strategies.flat_map { |s| s.propose(@storage, **opts) }
end

#raw_shape_counts ⇒ Object



# File 'lib/iriq/corpus.rb', line 328

def raw_shape_counts;   @storage.raw_shape_counts;   end

#reinfer ⇒ Object

Drop every materialized view (host counts, position stats, clusters, …) and rebuild them by replaying the source-IRI log through the current events + reducers pipeline. Useful for:

- Tuning thresholds (swap a Corpus constant, call reinfer)
- Swapping the classifier (open the Corpus with a different
  classifier, call reinfer — events are re-derived from raw IRIs)
- Recovering after a Reducer-set change

Wrapped in a single backend transaction so a failure mid-replay leaves the prior views intact.



# File 'lib/iriq/corpus.rb', line 124

def reinfer
  @storage.transaction do |s|
    iris = []
    s.each_observed_iri { |canonical| iris << canonical }
    s.clear_materialized_views
    iris.each do |canonical|
      iri = Parser.parse(canonical)
      events_for(iri).each { |e| Reducer.apply(e, s) }
    end
  end
  nil
end
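
The replay idea behind reinfer (keep the raw inputs, treat everything else as a rebuildable view) can be demonstrated with a toy store. Everything here is hypothetical: `ToyStore`, the lambda "reducers", and the last-two-labels host grouping are illustration only, and the transactional rollback described above is not modeled.

```ruby
require "uri"

# Toy store: an append-only log of raw inputs plus one derived view.
class ToyStore
  attr_reader :log, :host_counts

  def initialize
    @log = []
    @host_counts = Hash.new(0)
  end

  def clear_materialized_views
    @host_counts = Hash.new(0)
  end
end

# Record the raw input, then update the view through the given reducer.
def observe(store, url, reducer)
  store.log << url
  reducer.call(store, url)
end

# Drop the view and rebuild it from the log with a (possibly new) reducer.
def reinfer(store, reducer)
  urls = store.log.dup # snapshot the log before dropping views
  store.clear_materialized_views
  urls.each { |url| reducer.call(store, url) }
  nil
end

by_full_host = ->(store, url) { store.host_counts[URI(url).host] += 1 }
by_last_two  = lambda do |store, url|
  store.host_counts[URI(url).host.split(".").last(2).join(".")] += 1
end

store = ToyStore.new
observe(store, "https://api.foo.com/a", by_full_host)
observe(store, "https://app.foo.com/b", by_full_host)
p store.host_counts.keys # => ["api.foo.com", "app.foo.com"]

reinfer(store, by_last_two)
p store.host_counts.keys # => ["foo.com"]
```

The log is untouched by the replay, so the grouping rule can be changed again later and the views rebuilt once more.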

#render_path(iri, _classifier, _hints) ⇒ Object

Evidence-source interface, called by Normalizer when this Corpus is passed as `evidence:`. Renders the path using corpus-informed classifications (variability promotion, popular-outlier preservation). Always emits a leading “/”: an empty path collapses to “/” to match mechanical output and anchor any trailing query.



# File 'lib/iriq/corpus.rb', line 276

def render_path(iri, _classifier, _hints)
  tokens = annotate_segments(iri).map { |entry| corpus_token(entry) }
  "/" + tokens.join("/")
end
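
The leading-slash behavior follows directly from how `Array#join` treats an empty array. A two-line check, with the hypothetical name `render_tokens` standing in for the last line of the method and `{integer}` as an assumed token format:

```ruby
# Joins already-rendered segment tokens into a path with a leading slash.
def render_tokens(tokens)
  "/" + tokens.join("/")
end

p render_tokens(["users", "{integer}"]) # => "/users/{integer}"
p render_tokens([])                     # => "/"
```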

#render_query(iri, _classifier = @classifier) ⇒ Object

Evidence-source interface — render the query string with cluster-inferred param types where available. The mechanical NullEvidenceSource provides the classifier-only fallback; this version prefers the cluster’s observed type per param (dominant type_count, subject to the corpus thresholds).



# File 'lib/iriq/corpus.rb', line 286

def render_query(iri, _classifier = @classifier)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true)
                          .from_entries(SegmentHints.derive(iri.path_segments, @classifier))
  key, * = Cluster.key_for(iri, classifier: @classifier, shape: hinted_shape,
                           host: effective_host(iri.host))
  cluster = @storage.cluster_for(key)

  iri.query_params.keys.sort.map do |k|
    v = iri.query_params[k].to_s
    type = inferred_param_type(cluster, k, v)
    shaped = render_param_value(v, type)
    "#{k}=#{shaped}"
  end.join("&")
end
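
Sorting the keys before rendering is what makes the output independent of parameter order in the input. The sketch below keeps that part and replaces the cluster lookup with a hypothetical `shape_value` that only recognizes integers; the library consults the cluster's observed types instead.

```ruby
# Hypothetical type inference: digit-only values become "{integer}",
# everything else is kept verbatim.
def shape_value(value)
  value.match?(/\A\d+\z/) ? "{integer}" : value
end

# Render params in key order so equivalent queries produce the same string.
def render_query(params)
  params.keys.sort.map { |k| "#{k}=#{shape_value(params[k].to_s)}" }.join("&")
end

p render_query({ "page" => "3", "lang" => "en" })  # => "lang=en&page={integer}"
p render_query({ "lang" => "en", "page" => "17" }) # => "lang=en&page={integer}"
```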

#save(path = nil) ⇒ Object

Persist the corpus.

save()           → flush the backend in place (JSON writes its file,
                   SQLite is already on disk).
save(same_path)  → same as save() — idempotent for the backend's path.
save(other_path) → export to other_path as JSON, regardless of the
                   live backend.


# File 'lib/iriq/corpus.rb', line 360

def save(path = nil)
  backend_path = @storage.respond_to?(:path) ? @storage.path : nil
  if path.nil? || path == backend_path
    @storage.save
  else
    write_json_dump(path)
  end
end

#size ⇒ Object



# File 'lib/iriq/corpus.rb', line 341

def size
  @storage.cluster_size
end

#stats_for(host_or_position, prefix = nil) ⇒ Object

Stats for a given (host, path-prefix) — useful for tests and debugging. Returns nil if nothing has been observed there. Accepts either a Position or (host, prefix) for ergonomics.



# File 'lib/iriq/corpus.rb', line 348

def stats_for(host_or_position, prefix = nil)
  position = host_or_position.is_a?(Position) ? host_or_position : Position.path(host: host_or_position, prefix: prefix)
  @storage.position_stats(position)
end