Class: Iriq::Corpus
- Inherits: Object
  - Object
  - Iriq::Corpus
- Defined in: lib/iriq/corpus.rb
Overview
Streaming-friendly observer over a (potentially unbounded) corpus of IRIs. Maintains rolling aggregates and per-(host, prefix) frequency stats so that classification can improve as more data flows in.
The deterministic, single-IRI API (Iriq.normalize/explain) is unchanged; Corpus#normalize and Corpus#explain are the corpus-informed variants.
State lives in a Storage backend (Memory by default; Json or Sqlite when opened against a file). The classification logic on top is identical regardless of where the counters live.
Constant Summary
- VARIABLE_DOMINANCE_THRESHOLD = 0.8
  Type-based: position is “mostly variable” (UUIDs/integers/etc.).
- LITERAL_UNIQUENESS_THRESHOLD = 0.8
  Cardinality-based: position has mostly distinct literal values, so the literal “type” is misleading; it’s really a variable slot. We trigger on either:
  - very high cardinality fraction (most observations are singletons), OR
  - moderate cardinality fraction AND high absolute distinct count
  The second branch catches realistic streams where popular outliers bring the fraction down but the long tail is clearly variable.
- LITERAL_UNIQUENESS_MODERATE_THRESHOLD = 0.5
- MIN_CARDINALITY_FOR_INFERENCE = 20
- MIN_OBSERVATIONS_FOR_INFERENCE = 5
  Don’t apply corpus heuristics until we have at least this many observations at a position; it is too easy to be wrong with tiny samples.
- STABLE_LITERAL_THRESHOLD = 0.5
  Value-fraction at or above which a literal is considered the stable occupant of its position.
- POPULAR_MIN_COUNT = 5
  Within a high-cardinality literal position (mostly singletons), a specific value qualifies as a “popular outlier” when its count is at least POPULAR_MIN_COUNT and its frequency is at least POPULAR_BASELINE_MULTIPLE × the uniform baseline (1/cardinality). Such a value is preserved as :stable_literal instead of being lumped into :corpus_inferred_variable.
- POPULAR_BASELINE_MULTIPLE = 3
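The thresholds above can be read as two small predicates. The following is a minimal sketch, not the library's implementation: the constant values are copied from the summary, but the method names `high_cardinality?` and `popular_outlier?`, their keyword arguments, and the exact way the checks are composed are assumptions made for illustration.

```ruby
# Values copied from the constant summary; everything else is hypothetical.
LITERAL_UNIQUENESS_THRESHOLD          = 0.8
LITERAL_UNIQUENESS_MODERATE_THRESHOLD = 0.5
MIN_CARDINALITY_FOR_INFERENCE         = 20
MIN_OBSERVATIONS_FOR_INFERENCE        = 5
POPULAR_MIN_COUNT                     = 5
POPULAR_BASELINE_MULTIPLE             = 3

# True when a literal position looks like a variable slot:
# either most values are distinct, or a moderate share is distinct
# and the absolute number of distinct values is high.
def high_cardinality?(distinct:, total:)
  return false if total < MIN_OBSERVATIONS_FOR_INFERENCE # too few samples

  fraction = distinct.to_f / total
  fraction >= LITERAL_UNIQUENESS_THRESHOLD ||
    (fraction >= LITERAL_UNIQUENESS_MODERATE_THRESHOLD &&
     distinct >= MIN_CARDINALITY_FOR_INFERENCE)
end

# True when one value stands out enough to be kept as a literal:
# it has a minimum count and beats the uniform baseline (1/cardinality)
# by the configured multiple.
def popular_outlier?(count:, total:, cardinality:)
  return false if count < POPULAR_MIN_COUNT

  baseline = 1.0 / cardinality
  (count.to_f / total) >= POPULAR_BASELINE_MULTIPLE * baseline
end
```

For example, 30 distinct values out of 50 observations gives a fraction of 0.6, which passes only through the second branch because 30 is at least MIN_CARDINALITY_FOR_INFERENCE.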
Instance Attribute Summary
- #storage ⇒ Object (readonly)
  Returns the value of attribute storage.
Class Method Summary
- .from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object
- .load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object
- .open(path, classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES) ⇒ Object
  Open a corpus against `path`.
Instance Method Summary
- #batch(&block) ⇒ Object
  Wrap many observations in a single backend transaction.
- #close ⇒ Object
- #clusters ⇒ Object
- #dump ⇒ Object
  Legacy dump/load (JSON shape).
- #each_position_stats(&block) ⇒ Object
  Iterates (host, prefix) → PositionStats over all observed positions.
- #explain(input) ⇒ Object
  Per-segment explanation with corpus-informed `classification`.
- #fingerprint_counts ⇒ Object
- #host_counts ⇒ Object
- #initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, storage: nil) ⇒ Corpus (constructor)
  A new instance of Corpus.
- #normalize(input) ⇒ Object
  Corpus-informed normalization.
- #observe(input) ⇒ Object
  Observe a single IRI.
- #path_length_counts ⇒ Object
- #raw_shape_counts ⇒ Object
- #save(path = nil) ⇒ Object
  Persist the corpus.
- #size ⇒ Object
- #stats_for(host, prefix) ⇒ Object
  Stats for a given (host, prefix_shape); useful for tests and debugging.
Constructor Details
#initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, storage: nil) ⇒ Corpus
Returns a new instance of Corpus.
# File 'lib/iriq/corpus.rb', line 47
def initialize(classifier: SegmentClassifier::DEFAULT,
               max_values_per_position: PositionStats::DEFAULT_MAX_VALUES,
               storage: nil)
  @classifier = classifier
  @storage = storage || Storage::Memory.new(
    classifier: classifier,
    max_values_per_position: max_values_per_position,
  )
end
Instance Attribute Details
#storage ⇒ Object (readonly)
Returns the value of attribute storage.
# File 'lib/iriq/corpus.rb', line 45
def storage
  @storage
end
Class Method Details
.from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object
# File 'lib/iriq/corpus.rb', line 282
def self.from_dump(h, classifier: SegmentClassifier::DEFAULT)
  max_values = h.fetch("max_values_per_position", PositionStats::DEFAULT_MAX_VALUES)
  storage = Storage::Memory.new(classifier: classifier, max_values_per_position: max_values)
  storage.load_dump!(h)
  new(classifier: classifier, storage: storage)
end
.load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object
# File 'lib/iriq/corpus.rb', line 289
def self.load(path, classifier: SegmentClassifier::DEFAULT)
  open(path, classifier: classifier)
end
.open(path, classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES) ⇒ Object
Open a corpus against `path`. File extension picks the backend: `.db`/`.sqlite`/`.sqlite3` use SQLite (incremental writes); anything else uses JSON.
# File 'lib/iriq/corpus.rb', line 60
def self.open(path, classifier: SegmentClassifier::DEFAULT,
              max_values_per_position: PositionStats::DEFAULT_MAX_VALUES)
  storage = Storage.open(path,
                         classifier: classifier,
                         max_values_per_position: max_values_per_position)
  new(classifier: classifier, storage: storage)
end
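The extension rule described for `.open` can be sketched as a small lookup. This is an illustration only: the actual dispatch happens inside `Storage.open`, which is not shown on this page, and the helper name `backend_for`, the constant `SQLITE_EXTENSIONS`, the returned symbols, and the case-sensitive comparison are all assumptions.

```ruby
# Hypothetical helper mirroring the documented rule:
# .db / .sqlite / .sqlite3 select SQLite, anything else selects JSON.
SQLITE_EXTENSIONS = %w[.db .sqlite .sqlite3].freeze

def backend_for(path)
  # File.extname returns the last extension including the dot, or "" if none.
  # Matching is case-sensitive here; the library may behave differently.
  SQLITE_EXTENSIONS.include?(File.extname(path.to_s)) ? :sqlite : :json
end
```

Under this sketch, a path with no extension falls through to JSON, which matches the "anything else" wording above.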
Instance Method Details
#batch(&block) ⇒ Object
Wrap many observations in a single backend transaction. For SQLite this turns thousands of fsyncs into one; for in-memory backends it’s a no-op. Use when ingesting a batch.
# File 'lib/iriq/corpus.rb', line 169
def batch(&block)
  @storage.batch(&block)
end
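To see why batching matters for a durable backend, consider a toy model that only counts simulated commits. `CountingStore` is a hypothetical stand-in written for this example; it is not one of the library's Storage backends, and its treatment of nested batches is an assumption.

```ruby
# Toy stand-in for a durable backend. It performs no I/O and only
# counts how many commits a real backend would have issued.
class CountingStore
  attr_reader :commits, :writes

  def initialize
    @commits = 0
    @writes = 0
    @in_batch = false
  end

  # Each write commits on its own unless a batch is open.
  def write(_value)
    @writes += 1
    @commits += 1 unless @in_batch
  end

  # Runs the block with per-write commits suppressed, then commits once.
  def batch
    return yield if @in_batch # nested batches join the outer one (assumption)

    @in_batch = true
    begin
      yield
    ensure
      @in_batch = false
    end
    @commits += 1
  end
end
```

With the documented API, the corresponding call shape would be `corpus.batch { iris.each { |iri| corpus.observe(iri) } }`, where `iris` is any enumerable of inputs.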
#close ⇒ Object
# File 'lib/iriq/corpus.rb', line 162
def close
  @storage.close
end
#clusters ⇒ Object
# File 'lib/iriq/corpus.rb', line 132
def clusters
  @storage.clusters
end
#dump ⇒ Object
Legacy dump/load (JSON shape)
The pre-Storage release exposed `Corpus#dump`, `Corpus#save(path)`, and `Corpus.load(path)` for JSON-backed persistence. Those names still work but are now thin wrappers around the appropriate Storage backend.
# File 'lib/iriq/corpus.rb', line 278
def dump
  memory_view.to_dump
end
#each_position_stats(&block) ⇒ Object
Iterates (host, prefix) → PositionStats over all observed positions. Used by inspection tooling; not part of the hot path.
# File 'lib/iriq/corpus.rb', line 128
def each_position_stats(&block)
  @storage.each_position_stats(&block)
end
#explain(input) ⇒ Object
Per-segment explanation with corpus-informed `classification`. Returns an array of entries shaped like the Explanation rows, plus a `classification:` key whose value is one of :stable_literal, :variable_identifier, :rare_literal, :ambiguous, or :corpus_inferred_variable.
# File 'lib/iriq/corpus.rb', line 114
def explain(input)
  iri = coerce(input)
  annotate_segments(iri).map do |entry|
    entry.reject { |k, _| k == :prefix }
  end
end
#fingerprint_counts ⇒ Object
# File 'lib/iriq/corpus.rb', line 124
def fingerprint_counts; @storage.fingerprint_counts; end
#host_counts ⇒ Object
# File 'lib/iriq/corpus.rb', line 121
def host_counts; @storage.host_counts; end
#normalize(input) ⇒ Object
Corpus-informed normalization. Falls back to mechanical normalization when the corpus has no signal for a position.
# File 'lib/iriq/corpus.rb', line 97
def normalize(input)
  iri = coerce(input)
  return Normalizer.normalize_identifier(iri) if iri.urn? || iri.path_segments.empty?

  tokens = annotate_segments(iri).map { |entry| corpus_token(entry) }
  out = +""
  out << "#{iri.scheme}://" if iri.scheme
  out << iri.host if iri.host
  out << ":#{iri.port}" if iri.port
  out << "/" << tokens.join("/")
  out
end
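The final assembly step of `#normalize` can be exercised in isolation. In this sketch, `Parts` is a hypothetical stand-in for the parsed IRI object (only the three fields the assembly reads), `assemble` is a made-up name, and the `{id}` token is an invented example; the real token format comes from `corpus_token`, which is not shown here.

```ruby
# Hypothetical stand-in for the parsed IRI; not the library's class.
Parts = Struct.new(:scheme, :host, :port, keyword_init: true)

# Rebuilds the normalized string from optional scheme/host/port
# and the per-segment tokens, mirroring the listing above.
def assemble(parts, tokens)
  out = +"" # unary plus yields a mutable string even under frozen literals
  out << "#{parts.scheme}://" if parts.scheme
  out << parts.host if parts.host
  out << ":#{parts.port}" if parts.port
  out << "/" << tokens.join("/")
  out
end
```

Each component is appended only when present, so an input without scheme, host, or port degrades to a bare path.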
#observe(input) ⇒ Object
Observe a single IRI. Returns an Observation.
# File 'lib/iriq/corpus.rb', line 69
def observe(input)
  iri = coerce(input)
  hinted_entries = SegmentHints.derive(iri.path_segments, @classifier)
  raw_shape = PathShape.new(classifier: @classifier, hints: false).from_entries(hinted_entries)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true).from_entries(hinted_entries)

  cluster = nil
  @storage.transaction do |s|
    s.increment_host(iri.host)
    s.increment_path_length(iri.path_segments.size)
    s.increment_raw_shape(raw_shape)
    s.increment_fingerprint(hinted_shape)

    prefix = ""
    hinted_entries.each do |entry|
      s.observe_position(iri.host, prefix, entry[:value], entry[:type])
      prefix = "#{prefix}/#{placeholder(entry)}"
    end

    key, host, scheme, shape = Cluster.key_for(iri, classifier: @classifier, shape: hinted_shape)
    cluster = s.add_to_cluster(key, host, scheme, shape, iri)
  end

  Observation.new(corpus: self, identifier: iri, cluster: cluster)
end
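The prefix walk inside `#observe` records each segment under the shape of everything before it, then extends the prefix with that segment's placeholder. The sketch below isolates that loop. The `placeholder` rule used here (literals keep their value, other types become `{type}`) and the `:literal`/`:integer` type names are assumptions, since the real helper and type vocabulary are not shown on this page; `position_keys` is a made-up name.

```ruby
# Hypothetical placeholder rule, for illustration only.
def placeholder(entry)
  entry[:type] == :literal ? entry[:value] : "{#{entry[:type]}}"
end

# Returns [prefix, value, type] triples in the order the loop in
# #observe would pass them to observe_position (minus the host).
def position_keys(entries)
  prefix = ""
  entries.map do |entry|
    key = [prefix, entry[:value], entry[:type]]
    prefix = "#{prefix}/#{placeholder(entry)}" # extend after recording
    key
  end
end
```

Because the prefix is built from placeholders rather than raw values, two IRIs that differ only in a variable segment share the same prefix for every later position.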
#path_length_counts ⇒ Object
# File 'lib/iriq/corpus.rb', line 122
def path_length_counts; @storage.path_length_counts; end
#raw_shape_counts ⇒ Object
# File 'lib/iriq/corpus.rb', line 123
def raw_shape_counts; @storage.raw_shape_counts; end
#save(path = nil) ⇒ Object
Persist the corpus.
save() → flush the backend in place (JSON writes its file,
SQLite is already on disk).
save(same_path) → same as save() — idempotent for the backend's path.
save(other_path) → export to other_path as JSON, regardless of the
live backend.
# File 'lib/iriq/corpus.rb', line 153
def save(path = nil)
  backend_path = @storage.respond_to?(:path) ? @storage.path : nil
  if path.nil? || path == backend_path
    @storage.save
  else
    write_json_dump(path)
  end
end
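The three documented cases of `#save` reduce to a single comparison between the requested path and the backend's own path. A minimal sketch of that decision follows, with a made-up helper name `save_action` and made-up result symbols; it performs no I/O.

```ruby
# Returns a symbol naming the branch #save would take.
# backend_path is nil for a backend that has no path of its own.
def save_action(requested_path, backend_path)
  if requested_path.nil? || requested_path == backend_path
    :flush_backend # save() or save(same_path)
  else
    :export_json   # save(other_path)
  end
end
```

Note that for a backend without a path, any explicit path takes the export branch, while a bare `save` still flushes the backend.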
#size ⇒ Object
# File 'lib/iriq/corpus.rb', line 136
def size
  @storage.cluster_size
end
#stats_for(host, prefix) ⇒ Object
Stats for a given (host, prefix_shape) — useful for tests and debugging. Returns nil if nothing has been observed there.
# File 'lib/iriq/corpus.rb', line 142
def stats_for(host, prefix)
  @storage.position_stats(host, prefix)
end