Class: Iriq::Corpus
- Inherits: Object
  - Object
  - Iriq::Corpus
- Defined in: lib/iriq/corpus.rb
Overview
Streaming-friendly observer over a (potentially unbounded) corpus of IRIs. Maintains rolling aggregates and per-(host, prefix) frequency stats so that classification can improve as more data flows in.
The deterministic, single-IRI API (Iriq.normalize/explain) is unchanged; Corpus#normalize and Corpus#explain are the corpus-informed variants.
State lives in a Storage backend (Memory by default; Json or Sqlite when opened against a file). The classification logic on top is identical regardless of where the counters live.
Constant Summary
- VARIABLE_DOMINANCE_THRESHOLD =
  Type-based: position is “mostly variable” (UUIDs/integers/etc.).
  0.8
- LITERAL_UNIQUENESS_THRESHOLD =
  Cardinality-based: position has mostly distinct literal values, so the literal “type” is misleading; it is really a variable slot. We trigger on either:
  - very high cardinality fraction (most observations are singletons), OR
  - moderate cardinality fraction AND high absolute distinct count
  The second branch catches realistic streams where popular outliers bring the fraction down but the long tail is clearly variable.
  0.8
- LITERAL_UNIQUENESS_MODERATE_THRESHOLD =
  0.5
- MIN_CARDINALITY_FOR_INFERENCE =
  20
- MIN_OBSERVATIONS_FOR_INFERENCE =
  Don’t apply corpus heuristics until we have at least this many observations at a position; it is too easy to be wrong with tiny samples.
  5
- STABLE_LITERAL_THRESHOLD =
  Value-fraction at or above which a literal is considered the stable occupant of its position.
  0.5
- POPULAR_MIN_COUNT =
  Within a high-cardinality literal position (mostly singletons), a specific value qualifies as a “popular outlier”, and is preserved as :stable_literal instead of being lumped into :corpus_inferred_variable, when its count is at least POPULAR_MIN_COUNT and its frequency is at least POPULAR_BASELINE_MULTIPLE × the uniform baseline (1/cardinality).
  5
- POPULAR_BASELINE_MULTIPLE =
  3
- HOST_STRATEGIES =
  %i[full registrable none].freeze
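The cardinality-based rule and the sample-size guard can be read as the sketch below. It is not the library's implementation: `literal_position_variable?` and its keyword arguments are hypothetical names, the constant values are copied from the summary, and two details are assumptions: that the comparisons are inclusive, and that MIN_CARDINALITY_FOR_INFERENCE is the "high absolute distinct count" bound of the second branch.

```ruby
# Values copied from the constant summary.
LITERAL_UNIQUENESS_THRESHOLD          = 0.8
LITERAL_UNIQUENESS_MODERATE_THRESHOLD = 0.5
MIN_CARDINALITY_FOR_INFERENCE         = 20
MIN_OBSERVATIONS_FOR_INFERENCE        = 5

# Hypothetical helper: decides whether a literal position behaves like a
# variable slot, given how many observations and distinct values it has seen.
def literal_position_variable?(observations:, distinct:)
  # Too few samples: make no corpus-based inference at all.
  return false if observations < MIN_OBSERVATIONS_FOR_INFERENCE

  fraction = distinct.to_f / observations
  # Branch 1: almost every observation is a singleton.
  return true if fraction >= LITERAL_UNIQUENESS_THRESHOLD

  # Branch 2 (assumed pairing): moderate fraction plus many distinct values.
  fraction >= LITERAL_UNIQUENESS_MODERATE_THRESHOLD &&
    distinct >= MIN_CARDINALITY_FOR_INFERENCE
end
```

With these numbers, 60 distinct values in 100 observations counts as variable through the second branch, while 6 distinct values in 10 observations does not.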
Instance Attribute Summary
- #classifier ⇒ Object (readonly)
  Returns the value of attribute classifier.
- #host_strategy ⇒ Object (readonly)
  Returns the value of attribute host_strategy.
- #storage ⇒ Object (readonly)
  Returns the value of attribute storage.
Class Method Summary
- .from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object
- .load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object
- .open(path, classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, host_strategy: :full) ⇒ Object
  Open a corpus against `path`.
Instance Method Summary
- #activate_proposal(proposal) ⇒ Object
  Promote a RecognizerProposal into a live Recognizer for this corpus.
- #activate_proposals_above(confidence_threshold, **propose_opts) ⇒ Object
  Convenience: activate every proposal whose confidence clears the given threshold.
- #activated_recognizer_count ⇒ Object
  Number of activated recognizers persisted with this corpus.
- #batch(&block) ⇒ Object
  Wrap many observations in a single backend transaction.
- #close ⇒ Object
- #clusters ⇒ Object
- #cross_host_shapes(min_hosts: 2) ⇒ Object
  Route shapes that recur across `min_hosts` or more distinct hosts.
- #dump ⇒ Object
  Legacy dump/load (JSON shape).
- #each_position_stats(&block) ⇒ Object
  Iterates Position → PositionStats over all observed positions.
- #effective_host(host) ⇒ Object
  Normalize the host for keying purposes.
- #events_for(input) ⇒ Object
  Build the ordered Event list for `input` without applying it.
- #explain(input) ⇒ Object
  Per-segment explanation with corpus-informed `classification`.
- #fingerprint_counts ⇒ Object
- #host_counts ⇒ Object
- #initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, host_strategy: :full, storage: nil) ⇒ Corpus (constructor)
  A new instance of Corpus.
- #normalize(input) ⇒ Object
  Corpus-informed normalization.
- #observe(input) ⇒ Object
  Observe a single IRI.
- #observed_iri_count ⇒ Object
  Number of IRIs in the source-IRI log.
- #params_for(input) ⇒ Object
  Inferred params for the cluster `input` would fall into.
- #path_length_counts ⇒ Object
- #propose_recognizers(strategies: ProposalStrategy::DEFAULTS, **opts) ⇒ Object
  Scan observed values for shape patterns that recur frequently enough to suggest a new Recognizer.
- #raw_shape_counts ⇒ Object
- #reinfer ⇒ Object
  Drop every materialized view (host counts, position stats, clusters, …) and rebuild them by replaying the source-IRI log through the current events + reducers pipeline.
- #render_path(iri, _classifier, _hints) ⇒ Object
  Evidence-source interface, called by Normalizer when this Corpus is passed as `evidence:`.
- #render_query(iri, _classifier = @classifier) ⇒ Object
  Evidence-source interface: render the query string with cluster-inferred param types where available.
- #save(path = nil) ⇒ Object
  Persist the corpus.
- #size ⇒ Object
- #stats_for(host_or_position, prefix = nil) ⇒ Object
  Stats for a given (host, path-prefix); useful for tests and debugging.
Constructor Details
#initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, host_strategy: :full, storage: nil) ⇒ Corpus
Returns a new instance of Corpus.
# File 'lib/iriq/corpus.rb', line 49
def initialize(classifier: SegmentClassifier::DEFAULT,
               max_values_per_position: PositionStats::DEFAULT_MAX_VALUES,
               host_strategy: :full,
               storage: nil)
  raise ArgumentError, "host_strategy must be one of #{HOST_STRATEGIES.inspect}" \
    unless HOST_STRATEGIES.include?(host_strategy)

  @classifier = classifier
  @host_strategy = host_strategy
  @storage = storage || Storage::Memory.new(
    classifier: classifier,
    max_values_per_position: max_values_per_position,
  )
end
Instance Attribute Details
#classifier ⇒ Object (readonly)
Returns the value of attribute classifier.
# File 'lib/iriq/corpus.rb', line 47
def classifier
  @classifier
end
#host_strategy ⇒ Object (readonly)
Returns the value of attribute host_strategy.
# File 'lib/iriq/corpus.rb', line 47
def host_strategy
  @host_strategy
end
#storage ⇒ Object (readonly)
Returns the value of attribute storage.
# File 'lib/iriq/corpus.rb', line 47
def storage
  @storage
end
Class Method Details
.from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object
# File 'lib/iriq/corpus.rb', line 564
def self.from_dump(h, classifier: SegmentClassifier::DEFAULT)
  max_values = h.fetch("max_values_per_position", PositionStats::DEFAULT_MAX_VALUES)
  storage = Storage::Memory.new(classifier: classifier, max_values_per_position: max_values)
  storage.load_dump!(h)
  new(classifier: classifier, storage: storage)
end
.load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object
# File 'lib/iriq/corpus.rb', line 571
def self.load(path, classifier: SegmentClassifier::DEFAULT)
  open(path, classifier: classifier)
end
.open(path, classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, host_strategy: :full) ⇒ Object
Open a corpus against `path`. The file extension picks the backend: `.db`/`.sqlite`/`.sqlite3` use SQLite (incremental writes); anything else uses JSON.
# File 'lib/iriq/corpus.rb', line 67
def self.open(path, classifier: SegmentClassifier::DEFAULT,
              max_values_per_position: PositionStats::DEFAULT_MAX_VALUES,
              host_strategy: :full)
  storage = Storage.open(path, classifier: classifier, max_values_per_position: max_values_per_position)
  corpus = new(classifier: classifier, storage: storage, host_strategy: host_strategy)
  corpus.send(:reapply_activated_recognizers!) if storage.respond_to?(:each_activated_recognizer)
  corpus
end
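The extension rule for `.open` can be illustrated in isolation. This is a sketch only: `backend_for` is a hypothetical helper that returns a symbol instead of building a Storage object, and treating the match as case-sensitive is an assumption, since the page does not show `Storage.open`.

```ruby
# Extensions the documentation lists as selecting the SQLite backend.
SQLITE_EXTENSIONS = %w[.db .sqlite .sqlite3].freeze

# Hypothetical helper mirroring the documented rule: SQLite for the
# listed extensions, JSON for everything else (including no extension).
def backend_for(path)
  SQLITE_EXTENSIONS.include?(File.extname(path)) ? :sqlite : :json
end

backend_for("data/iris.db")   # SQLite backend
backend_for("corpus.sqlite3") # SQLite backend
backend_for("corpus.json")    # JSON backend
backend_for("corpus")         # JSON backend (no extension)
```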
Instance Method Details
#activate_proposal(proposal) ⇒ Object
Promote a RecognizerProposal into a live Recognizer for this corpus.
Mechanics:
1. Synthesize a SynthesizedRecognizer from the proposal's prefix.
2. Switch to a per-corpus classifier (if we were sharing the
module-level DEFAULT) so activation doesn't leak to other
corpora using the same default singleton.
3. Register the Recognizer on the classifier — the ensemble
picks it up on the next classify() call.
4. Persist the activation in storage so reopens re-apply it.
5. Reinfer so existing observations get re-classified through
the new Recognizer.
Returns the synthesized Recognizer.
# File 'lib/iriq/corpus.rb', line 171
def activate_proposal(proposal)
  recognizer = SynthesizedRecognizer.from_proposal(proposal)
  ensure_per_corpus_classifier!
  @classifier.register_recognizer(recognizer)
  if @storage.respond_to?(:record_activated_recognizer)
    @storage.record_activated_recognizer(recognizer.to_dump)
  end
  reinfer
  recognizer
end
#activate_proposals_above(confidence_threshold, **propose_opts) ⇒ Object
Convenience: activate every proposal whose confidence clears the given threshold. Returns the activated Recognizers. Confidence incorporates both per-position coverage AND cross-host corroboration — see RecognizerProposal#compute_confidence.
# File 'lib/iriq/corpus.rb', line 186
def activate_proposals_above(confidence_threshold, **propose_opts)
  proposals = propose_recognizers(**propose_opts)
  proposals.select { |p| p.confidence >= confidence_threshold }.map { |p| activate_proposal(p) }
end
#activated_recognizer_count ⇒ Object
Number of activated recognizers persisted with this corpus.
# File 'lib/iriq/corpus.rb', line 192
def activated_recognizer_count
  return @storage.activated_recognizer_count if @storage.respond_to?(:activated_recognizer_count)
  0
end
#batch(&block) ⇒ Object
Wrap many observations in a single backend transaction. For SQLite this turns thousands of fsyncs into one; for in-memory backends it’s a no-op. Use when ingesting a batch.
# File 'lib/iriq/corpus.rb', line 376
def batch(&block)
  @storage.batch(&block)
end
#close ⇒ Object
# File 'lib/iriq/corpus.rb', line 369
def close
  @storage.close
end
#clusters ⇒ Object
# File 'lib/iriq/corpus.rb', line 337
def clusters
  @storage.clusters
end
#cross_host_shapes(min_hosts: 2) ⇒ Object
Route shapes that recur across `min_hosts` or more distinct hosts. Returns CrossHostShape records sorted by host_count descending, then by observation_count descending, then by shape (stable, deterministic).
Cross-host recurrence is independent evidence of a real semantic pattern: two unrelated hosts inventing the same `/users/integer` structure by accident is unlikely. A natural follow-up is feeding this signal back into RecognizerProposal confidence: a proposal supported by N hosts is much stronger than one seen on a single host with the same per-position coverage.
# File 'lib/iriq/corpus.rb', line 207
def cross_host_shapes(min_hosts: 2)
  by_shape = Hash.new { |h, k| h[k] = { hosts: Set.new, count: 0 } }
  @storage.clusters.each do |cluster|
    # Skip non-URL clusters (URN clusters have no host).
    next if cluster.host.nil? || cluster.host.empty?

    agg = by_shape[cluster.shape]
    agg[:hosts] << cluster.host
    agg[:count] += cluster.count
  end

  by_shape.filter_map do |shape, data|
    next nil if data[:hosts].size < min_hosts

    CrossHostShape.new(
      shape: shape,
      hosts: data[:hosts],
      observation_count: data[:count],
    )
  end.sort_by { |s| [-s.host_count, -s.observation_count, s.shape] }
end
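To make the grouping and the three-level sort concrete, here is a self-contained sketch of the same aggregation over plain Structs. `Cluster` and `ShapeSummary` are hypothetical stand-ins for the library's cluster and CrossHostShape records, and the shape strings are invented sample data.

```ruby
require "set"

# Hypothetical stand-ins for the library's records.
Cluster = Struct.new(:host, :shape, :count)
ShapeSummary = Struct.new(:shape, :hosts, :observation_count) do
  def host_count
    hosts.size
  end
end

def cross_host_shapes(clusters, min_hosts: 2)
  by_shape = Hash.new { |h, k| h[k] = { hosts: Set.new, count: 0 } }
  clusters.each do |c|
    next if c.host.nil? || c.host.empty? # host-less clusters (e.g. URNs)

    by_shape[c.shape][:hosts] << c.host
    by_shape[c.shape][:count] += c.count
  end

  # Keep shapes seen on enough hosts, then sort: most hosts first,
  # then most observations, then shape string as a deterministic tiebreak.
  by_shape.filter_map do |shape, data|
    next if data[:hosts].size < min_hosts

    ShapeSummary.new(shape, data[:hosts], data[:count])
  end.sort_by { |s| [-s.host_count, -s.observation_count, s.shape] }
end

clusters = [
  Cluster.new("a.example", "/users/{integer}", 10),
  Cluster.new("b.example", "/users/{integer}", 4),
  Cluster.new("c.example", "/users/{integer}", 1),
  Cluster.new("a.example", "/posts/{uuid}", 7),
  Cluster.new("b.example", "/posts/{uuid}", 9),
  Cluster.new("a.example", "/about", 50),
  Cluster.new(nil, "urn:isbn", 3),
]
result = cross_host_shapes(clusters)
```

In this sample, `/users/{integer}` sorts ahead of `/posts/{uuid}` because it appears on three hosts rather than two, even though it has fewer total observations; `/about` is dropped for appearing on a single host.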
#dump ⇒ Object
Legacy dump/load (JSON shape)
The pre-Storage release exposed `Corpus#dump`, `Corpus#save(path)`, and `Corpus.load(path)` for JSON-backed persistence. Those names still work but are now thin wrappers around the appropriate Storage backend.
# File 'lib/iriq/corpus.rb', line 560
def dump
  memory_view.to_dump
end
#each_position_stats(&block) ⇒ Object
Iterates Position → PositionStats over all observed positions. Used by inspection tooling; not part of the hot path.
# File 'lib/iriq/corpus.rb', line 333
def each_position_stats(&block)
  @storage.each_position_stats(&block)
end
#effective_host(host) ⇒ Object
Normalize the host for keying purposes. `:full` keeps the original host; `:registrable` collapses subdomains via the inline-PSL heuristic (api.foo.com + app.foo.com → foo.com); `:none` ignores host entirely so clusters group across all hosts by shape alone.
# File 'lib/iriq/corpus.rb', line 82
def effective_host(host)
  case @host_strategy
  when :registrable then RegistrableDomain.for(host)
  when :none then ""
  else host
  end
end
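The effect of each strategy on a host key can be sketched without the library. `naive_registrable` below is an assumption standing in for `RegistrableDomain.for`: it keeps the last two labels and so would mishandle multi-label public suffixes such as `co.uk`, which a real PSL heuristic has to cover. The two-argument `effective_host` is likewise a hypothetical free-function form of the method.

```ruby
HOST_STRATEGIES = %i[full registrable none].freeze

# Naive stand-in for RegistrableDomain.for: keeps only the last two labels.
def naive_registrable(host)
  host.split(".").last(2).join(".")
end

# Hypothetical free-function version of Corpus#effective_host.
def effective_host(host, strategy)
  raise ArgumentError, "unknown strategy #{strategy.inspect}" unless HOST_STRATEGIES.include?(strategy)

  case strategy
  when :registrable then naive_registrable(host)
  when :none        then "" # every host shares one empty key
  else host
  end
end

effective_host("api.foo.com", :full)        # host kept as-is
effective_host("api.foo.com", :registrable) # subdomain collapsed
effective_host("api.foo.com", :none)        # host ignored
```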
#events_for(input) ⇒ Object
Build the ordered Event list for `input` without applying it. Useful for inspection, tests, and future event-log persistence. Each call is pure: no storage side-effects.
# File 'lib/iriq/corpus.rb', line 232
def events_for(input)
  iri = coerce(input)
  hinted_entries = SegmentHints.derive(iri.path_segments, @classifier)
  raw_shape = PathShape.new(classifier: @classifier, hints: false).from_entries(hinted_entries)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true).from_entries(hinted_entries)
  host_key = effective_host(iri.host) # host used for keying, per #effective_host

  events = [
    Event::HostSeen.new(host_key),
    Event::PathLengthSeen.new(iri.path_segments.size),
    Event::RawShapeSeen.new(raw_shape),
    Event::FingerprintSeen.new(hinted_shape),
  ]

  prefix = ""
  hinted_entries.each do |entry|
    events << Event::PositionSeen.new(
      Position.path(host: host_key, prefix: prefix),
      entry[:value],
      entry[:type],
    )
    prefix = "#{prefix}/#{placeholder(entry)}"
  end

  key, host, scheme, shape = Cluster.key_for(iri, classifier: @classifier, shape: hinted_shape, host: host_key)
  events << Event::ClusterAddition.new(key, host, scheme, shape, iri)
  events
end
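The way `events_for` derives one position per segment, each keyed by the placeholder-rendered prefix of the segments before it, is easier to see on a concrete path. The sketch below is an illustration under assumptions: `placeholder` is a toy classifier that only recognizes integers, the `{integer}` token syntax is invented, and `positions_for` is a hypothetical helper returning `[prefix, value]` pairs instead of Event objects.

```ruby
# Toy classifier: integers become a typed placeholder, anything else is
# kept as a literal. The real library delegates this to its classifier.
def placeholder(segment)
  segment.match?(/\A\d+\z/) ? "{integer}" : segment
end

# Returns one [prefix, value] pair per path segment. The prefix for a
# segment is built from the placeholders of all segments before it.
def positions_for(path)
  prefix = ""
  path.split("/").reject(&:empty?).map do |segment|
    pair = [prefix, segment]
    prefix = "#{prefix}/#{placeholder(segment)}"
    pair
  end
end

positions_for("/users/42/posts/7")
# → [["", "users"], ["/users", "42"],
#    ["/users/{integer}", "posts"], ["/users/{integer}/posts", "7"]]
```

Because the prefix uses placeholders instead of raw values, `/users/42/posts/7` and `/users/99/posts/3` contribute to the same four positions.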
#explain(input) ⇒ Object
Per-segment explanation with corpus-informed `classification`. Returns an array of entries shaped like the Explanation rows, plus a `classification:` key whose value is one of :stable_literal, :variable_identifier, :rare_literal, :ambiguous, :corpus_inferred_variable.
# File 'lib/iriq/corpus.rb', line 319
def explain(input)
  iri = coerce(input)
  annotate_segments(iri).map do |entry|
    entry.reject { |k, _| k == :prefix }
  end
end
#fingerprint_counts ⇒ Object
# File 'lib/iriq/corpus.rb', line 329
def fingerprint_counts; @storage.fingerprint_counts; end
#host_counts ⇒ Object
# File 'lib/iriq/corpus.rb', line 326
def host_counts; @storage.host_counts; end
#normalize(input) ⇒ Object
Corpus-informed normalization. Falls back to mechanical normalization when the corpus has no signal for a position. Implemented as a thin call into Normalizer with `evidence: self`; the corpus-informed path and query rendering live in #render_path / #render_query below (the evidence-source interface).
# File 'lib/iriq/corpus.rb', line 266
def normalize(input)
  iri = coerce(input)
  Normalizer.normalize_identifier(iri, classifier: @classifier, hints: true, evidence: self)
end
#observe(input) ⇒ Object
Observe a single IRI. Returns an Observation.
Internally: builds an Event list for the IRI, then applies each event through the Reducer registry inside a single storage transaction. The event list is transient today — a future commit can persist it and replay against alternate reducers / thresholds for re-runnable inference. See lib/iriq/event.rb and lib/iriq/reducer.rb.
# File 'lib/iriq/corpus.rb', line 97
def observe(input)
  iri = coerce(input)
  events = events_for(iri)

  cluster = nil
  @storage.transaction do |s|
    events.each do |e|
      result = Reducer.apply(e, s)
      cluster = result if e.is_a?(Event::ClusterAddition)
    end
    s.record_observation(iri.canonical) if s.respond_to?(:record_observation)
  end

  Observation.new(corpus: self, identifier: iri, cluster: cluster)
end
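The split between building events and applying them through a reducer registry can be shown with a much smaller model. Everything here is hypothetical: the two Struct event types, the `REDUCERS` table, and the Hash standing in for a storage backend are illustrative, not the contents of lib/iriq/event.rb or lib/iriq/reducer.rb.

```ruby
# Hypothetical event types; the library's Event classes carry more fields.
HostSeen       = Struct.new(:host)
PathLengthSeen = Struct.new(:length)

# Hypothetical reducer registry: one lambda per event class, each mutating
# a plain Hash of counters that stands in for the storage backend.
REDUCERS = {
  HostSeen       => ->(e, state) { state[:hosts][e.host] += 1 },
  PathLengthSeen => ->(e, state) { state[:lengths][e.length] += 1 },
}.freeze

def apply(event, state)
  REDUCERS.fetch(event.class).call(event, state)
end

# Pure step: derive events from the input without touching any state.
def events_for(host, path)
  segments = path.split("/").reject(&:empty?)
  [HostSeen.new(host), PathLengthSeen.new(segments.size)]
end

state = { hosts: Hash.new(0), lengths: Hash.new(0) }
events_for("example.com", "/users/42").each { |e| apply(e, state) }
events_for("example.com", "/about").each { |e| apply(e, state) }
```

Keeping event derivation pure is what lets the same events be replayed later against different reducers, which is the direction the description above points to.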
#observed_iri_count ⇒ Object
Number of IRIs in the source-IRI log. The materialized views are derived from this log; reinfer replays it.
# File 'lib/iriq/corpus.rb', line 139
def observed_iri_count
  return @storage.observed_iri_count if @storage.respond_to?(:observed_iri_count)
  0
end
#params_for(input) ⇒ Object
Inferred params for the cluster `input` would fall into. Returns the same shape as Cluster#param_summary, which is useful for “what query params might this URL accept?” tooling. Returns an empty array if no cluster has been observed for this shape yet.
# File 'lib/iriq/corpus.rb', line 305
def params_for(input)
  iri = coerce(input)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true)
                          .from_entries(SegmentHints.derive(iri.path_segments, @classifier))
  key, * = Cluster.key_for(iri, classifier: @classifier, shape: hinted_shape,
                           host: effective_host(iri.host))
  cluster = @storage.cluster_for(key)
  cluster ? cluster.param_summary : []
end
#path_length_counts ⇒ Object
# File 'lib/iriq/corpus.rb', line 327
def path_length_counts; @storage.path_length_counts; end
#propose_recognizers(strategies: ProposalStrategy::DEFAULTS, **opts) ⇒ Object
Scan observed values for shape patterns that recur frequently enough to suggest a new Recognizer. Returns RecognizerProposal records; nothing is automatically applied — the proposal carries enough evidence for a human to decide whether to bake the Recognizer in.
Strategies are pluggable; the default set lives in Iriq::ProposalStrategy::DEFAULTS. Pass `strategies:` to limit / extend. Pass `min_observations:` / `min_coverage:` / `min_hosts:` to tune what passes the noise floor.
# File 'lib/iriq/corpus.rb', line 153
def propose_recognizers(strategies: ProposalStrategy::DEFAULTS, **opts)
  strategies.flat_map { |s| s.propose(@storage, **opts) }
end
#raw_shape_counts ⇒ Object
# File 'lib/iriq/corpus.rb', line 328
def raw_shape_counts; @storage.raw_shape_counts; end
#reinfer ⇒ Object
Drop every materialized view (host counts, position stats, clusters, …) and rebuild them by replaying the source-IRI log through the current events + reducers pipeline. Useful for:
- Tuning thresholds (swap a Corpus constant, call reinfer)
- Swapping the classifier (open the Corpus with a different
classifier, call reinfer — events are re-derived from raw IRIs)
- Recovering after a Reducer-set change
Wrapped in a single backend transaction so a failure mid-replay leaves the prior views intact.
# File 'lib/iriq/corpus.rb', line 124
def reinfer
  @storage.transaction do |s|
    iris = []
    s.each_observed_iri { |canonical| iris << canonical }
    s.clear_materialized_views
    iris.each do |canonical|
      iri = Parser.parse(canonical)
      events_for(iri).each { |e| Reducer.apply(e, s) }
    end
  end
  nil
end
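The idea behind reinfer, treating the source-IRI log as the truth and every count as a derived view, can be sketched with a toy store. `TinyStore` and the two host-extraction lambdas are hypothetical. The sketch also swaps in a different safety technique: where the library wraps the replay in a backend transaction, this version builds the new view on the side and assigns it only after the replay finishes.

```ruby
require "uri"

# Hypothetical in-memory store: an append-only log plus one derived view.
class TinyStore
  attr_reader :log, :host_counts

  def initialize
    @log = []
    @host_counts = Hash.new(0)
  end

  def observe(iri, &host_of)
    @log << iri
    @host_counts[host_of.call(iri)] += 1
  end

  # Rebuild the view from the log. The new view replaces the old one only
  # after the replay completes, so an error mid-replay changes nothing.
  def reinfer(&host_of)
    rebuilt = Hash.new(0)
    @log.each { |iri| rebuilt[host_of.call(iri)] += 1 }
    @host_counts = rebuilt
    nil
  end
end

full        = ->(iri) { URI(iri).host }
registrable = ->(iri) { URI(iri).host.split(".").last(2).join(".") } # naive

store = TinyStore.new
store.observe("https://api.foo.com/a", &full)
store.observe("https://app.foo.com/b", &full)
before = store.host_counts.dup

store.reinfer(&registrable) # same log, new keying rule
after = store.host_counts
```

Here `before` has one entry per subdomain, while `after` has a single `foo.com` entry with a count of 2; no IRI had to be observed again.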
#render_path(iri, _classifier, _hints) ⇒ Object
Evidence-source interface, called by Normalizer when this Corpus is passed as `evidence:`. Renders the path using corpus-informed classifications (variability promotion, popular-outlier preservation). Always emits a leading “/”; an empty path collapses to “/” to match mechanical output and anchor any trailing query.
# File 'lib/iriq/corpus.rb', line 276
def render_path(iri, _classifier, _hints)
  tokens = annotate_segments(iri).map { |entry| corpus_token(entry) }
  "/" + tokens.join("/")
end
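Popular-outlier preservation follows the rule stated for POPULAR_MIN_COUNT and POPULAR_BASELINE_MULTIPLE. A minimal sketch of that test, assuming the frequency is the value's share of all observations at the position and that both comparisons are inclusive; `popular_outlier?` and the sample counts are hypothetical.

```ruby
# Values copied from the constant summary.
POPULAR_MIN_COUNT         = 5
POPULAR_BASELINE_MULTIPLE = 3

# Hypothetical helper: value_counts maps each literal seen at one position
# to its observation count.
def popular_outlier?(value, value_counts)
  count = value_counts.fetch(value, 0)
  total = value_counts.values.sum
  return false if total.zero?

  baseline  = 1.0 / value_counts.size # uniform share per distinct value
  frequency = count.to_f / total
  count >= POPULAR_MIN_COUNT && frequency >= POPULAR_BASELINE_MULTIPLE * baseline
end

# 40 usernames seen once each, plus "me" seen 20 times.
counts = (1..40).to_h { |i| ["user#{i}", 1] }
counts["me"] = 20

popular_outlier?("me", counts)     # frequent enough to stay a literal
popular_outlier?("user7", counts)  # a singleton, treated as variable
```

In a position like `/profile/<name>`, this is what would let `/profile/me` keep its literal segment while the long tail of usernames renders as a variable slot.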
#render_query(iri, _classifier = @classifier) ⇒ Object
Evidence-source interface — render the query string with cluster-inferred param types where available. The mechanical NullEvidenceSource provides the classifier-only fallback; this version prefers the cluster’s observed type per param (dominant type_count, subject to the corpus thresholds).
# File 'lib/iriq/corpus.rb', line 286
def render_query(iri, _classifier = @classifier)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true)
                          .from_entries(SegmentHints.derive(iri.path_segments, @classifier))
  key, * = Cluster.key_for(iri, classifier: @classifier, shape: hinted_shape,
                           host: effective_host(iri.host))
  cluster = @storage.cluster_for(key)

  iri.query_params.keys.sort.map do |k|
    v = iri.query_params[k].to_s
    type = inferred_param_type(cluster, k, v)
    shaped = render_param_value(v, type)
    "#{k}=#{shaped}"
  end.join("&")
end
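The output shape of query rendering, keys sorted and values replaced by a type token where a type is known, can be sketched on a plain Hash. This is a simplification under assumptions: `placeholder_for` infers the type from the value alone, whereas the method above consults the cluster's observed types, and the `{integer}` / `{uuid}` token syntax is invented for illustration.

```ruby
# Hypothetical value-only type inference (the library uses cluster stats).
def placeholder_for(value)
  case value
  when /\A\d+\z/ then "{integer}"
  when /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/ then "{uuid}"
  else value # no type known: keep the literal
  end
end

# Sort keys so that parameter order in the input never changes the output.
def render_query(params)
  params.keys.sort.map { |k| "#{k}=#{placeholder_for(params[k].to_s)}" }.join("&")
end

render_query(
  "sort" => "asc",
  "page" => "3",
  "id"   => "550e8400-e29b-41d4-a716-446655440000",
)
# → "id={uuid}&page={integer}&sort=asc"
```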
#save(path = nil) ⇒ Object
Persist the corpus.
save() → flush the backend in place (JSON writes its file,
SQLite is already on disk).
save(same_path) → same as save() — idempotent for the backend's path.
save(other_path) → export to other_path as JSON, regardless of the
live backend.
# File 'lib/iriq/corpus.rb', line 360
def save(path = nil)
  backend_path = @storage.respond_to?(:path) ? @storage.path : nil
  if path.nil? || path == backend_path
    @storage.save
  else
    write_json_dump(path)
  end
end
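The three documented cases of `save` reduce to one comparison between the requested path and the backend's own path. The sketch below isolates that decision; `save_action` is a hypothetical helper that returns a symbol naming the action instead of performing any I/O.

```ruby
# Hypothetical decision helper. backend_path is nil for a backend that
# has no file of its own (for example the in-memory one).
def save_action(path, backend_path)
  (path.nil? || path == backend_path) ? :flush_backend : :export_json
end

save_action(nil, "corpus.db")            # save()           → flush in place
save_action("corpus.db", "corpus.db")    # save(same_path)  → flush in place
save_action("export.json", "corpus.db")  # save(other_path) → JSON export
```

One consequence worth noting: a backend without a path flushes on `save()` but exports JSON for any explicit path, since no path can equal `nil`.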
#size ⇒ Object
# File 'lib/iriq/corpus.rb', line 341
def size
  @storage.cluster_size
end
#stats_for(host_or_position, prefix = nil) ⇒ Object
Stats for a given (host, path-prefix) — useful for tests and debugging. Returns nil if nothing has been observed there. Accepts either a Position or (host, prefix) for ergonomics.
# File 'lib/iriq/corpus.rb', line 348
def stats_for(host_or_position, prefix = nil)
  position = host_or_position.is_a?(Position) ? host_or_position : Position.path(host: host_or_position, prefix: prefix)
  @storage.position_stats(position)
end