Class: Iriq::Corpus
- Inherits: Object
  - Object
  - Iriq::Corpus
- Defined in: lib/iriq/corpus.rb
Overview
Streaming-friendly observer over a (potentially unbounded) corpus of IRIs. Maintains rolling aggregates and per-(host, prefix) frequency stats so that classification can improve as more data flows in.
The deterministic, single-IRI API (Iriq.normalize/explain) is unchanged; Corpus#normalize and Corpus#explain are the corpus-informed variants.
Constant Summary
- VARIABLE_DOMINANCE_THRESHOLD = 0.8
  Type-based: position is “mostly variable” (UUIDs/integers/etc.).
- LITERAL_UNIQUENESS_THRESHOLD = 0.8
  Cardinality-based: position has mostly distinct literal values, so the literal “type” is misleading; it is really a variable slot. We trigger on either:
  - very high cardinality fraction (most observations are singletons), OR
  - moderate cardinality fraction AND high absolute distinct count
  The second branch catches realistic streams where popular outliers bring the fraction down but the long tail is clearly variable.
- LITERAL_UNIQUENESS_MODERATE_THRESHOLD = 0.5
- MIN_CARDINALITY_FOR_INFERENCE = 20
- MIN_OBSERVATIONS_FOR_INFERENCE = 5
  Don’t apply corpus heuristics until we have at least this many observations at a position; it is too easy to be wrong with tiny samples.
- STABLE_LITERAL_THRESHOLD = 0.5
  Value-fraction at or above which a literal is considered the stable occupant of its position.
- POPULAR_MIN_COUNT = 5
  Within a high-cardinality literal position (mostly singletons), a specific value qualifies as a “popular outlier” when its count is at least POPULAR_MIN_COUNT and its frequency is at least POPULAR_BASELINE_MULTIPLE × the uniform baseline (1/cardinality). Such a value is preserved as :stable_literal instead of being lumped into :corpus_inferred_variable.
- POPULAR_BASELINE_MULTIPLE = 3
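The two cardinality rules above can be sketched in a few lines. This is an illustration, not the library's implementation: the constant values come from the summary, but the method names, the `counts` hash, the reading of "cardinality fraction" as distinct values divided by total observations, the use of MIN_CARDINALITY_FOR_INFERENCE as the "high absolute distinct count" bound, and the `>=` comparisons are all assumptions.

```ruby
# Sketch only. Constant values are copied from the summary above; the
# method names and the exact comparisons (>=) are assumptions.
LITERAL_UNIQUENESS_THRESHOLD          = 0.8
LITERAL_UNIQUENESS_MODERATE_THRESHOLD = 0.5
MIN_CARDINALITY_FOR_INFERENCE         = 20
MIN_OBSERVATIONS_FOR_INFERENCE        = 5
POPULAR_MIN_COUNT                     = 5
POPULAR_BASELINE_MULTIPLE             = 3

# counts: Hash of literal value => number of observations at one position.
def high_cardinality_position?(counts)
  total = counts.values.sum
  return false if total < MIN_OBSERVATIONS_FOR_INFERENCE

  distinct = counts.size
  frac = distinct.to_f / total # assumed meaning of "cardinality fraction"
  frac >= LITERAL_UNIQUENESS_THRESHOLD ||
    (frac >= LITERAL_UNIQUENESS_MODERATE_THRESHOLD &&
      distinct >= MIN_CARDINALITY_FOR_INFERENCE)
end

# A value is a popular outlier when it is seen often enough in absolute
# terms and well above the uniform baseline of 1/cardinality.
def popular_outlier?(counts, value)
  count = counts.fetch(value, 0)
  return false if count < POPULAR_MIN_COUNT

  total = counts.values.sum
  baseline = 1.0 / counts.size
  count.to_f / total >= POPULAR_BASELINE_MULTIPLE * baseline
end

# 30 singleton slugs plus one value seen 20 times:
# 31 distinct values over 50 observations, a fraction of 0.62.
counts = (1..30).to_h { |i| ["slug-#{i}", 1] }
counts["featured"] = 20

high_cardinality_position?(counts)   # => true (moderate fraction, 31 distinct)
popular_outlier?(counts, "featured") # => true (0.4 is above 3 * 1/31)
popular_outlier?(counts, "slug-1")   # => false (count is below 5)
```

The sample data exercises the second branch: the popular value pulls the fraction below 0.8, yet the long tail of singletons still marks the position as variable.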
Instance Attribute Summary
- #fingerprint_counts ⇒ Object (readonly)
  Returns the value of attribute fingerprint_counts.
- #host_counts ⇒ Object (readonly)
  Returns the value of attribute host_counts.
- #path_length_counts ⇒ Object (readonly)
  Returns the value of attribute path_length_counts.
- #position_stats ⇒ Object (readonly)
  Returns the value of attribute position_stats.
- #raw_shape_counts ⇒ Object (readonly)
  Returns the value of attribute raw_shape_counts.
Class Method Summary
- .from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object
- .load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object
Instance Method Summary
- #clusters ⇒ Object
- #dump ⇒ Object
- #explain(input) ⇒ Object
  Per-segment explanation with corpus-informed `classification`.
- #initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES) ⇒ Corpus (constructor)
  A new instance of Corpus.
- #normalize(input) ⇒ Object
  Corpus-informed normalization.
- #observe(input) ⇒ Object
  Observe a single IRI.
- #save(path) ⇒ Object
- #size ⇒ Object
- #stats_for(host, prefix) ⇒ Object
  Stats for a given (host, prefix_shape); useful for tests and debugging.
Constructor Details
#initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES) ⇒ Corpus
Returns a new instance of Corpus.
# File 'lib/iriq/corpus.rb', line 44
def initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES)
  @classifier = classifier
  @max_values_per_position = max_values_per_position
  @host_counts = Hash.new(0)
  @path_length_counts = Hash.new(0)
  @raw_shape_counts = Hash.new(0)
  @fingerprint_counts = Hash.new(0)
  @position_stats = {}
  @clusterer = Clusterer.new(classifier: classifier)
end
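The constructor builds its rolling aggregates on `Hash.new(0)`. As a minimal standalone illustration of that idiom (the host names are made up, and this is not how the library feeds the counters):

```ruby
# Hash.new(0) gives missing keys a default of 0, so a counter can be
# incremented without first checking whether the key exists.
host_counts = Hash.new(0)

%w[example.com example.com api.example.com].each do |host|
  host_counts[host] += 1
end

host_counts["example.com"]           # => 2
host_counts["never-seen.test"]       # => 0 (the default value)
host_counts.key?("never-seen.test")  # => false (reading does not store a key)
```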
Instance Attribute Details
#fingerprint_counts ⇒ Object (readonly)
Returns the value of attribute fingerprint_counts.
# File 'lib/iriq/corpus.rb', line 41
def fingerprint_counts
  @fingerprint_counts
end
#host_counts ⇒ Object (readonly)
Returns the value of attribute host_counts.
# File 'lib/iriq/corpus.rb', line 41
def host_counts
  @host_counts
end
#path_length_counts ⇒ Object (readonly)
Returns the value of attribute path_length_counts.
# File 'lib/iriq/corpus.rb', line 41
def path_length_counts
  @path_length_counts
end
#position_stats ⇒ Object (readonly)
Returns the value of attribute position_stats.
# File 'lib/iriq/corpus.rb', line 41
def position_stats
  @position_stats
end
#raw_shape_counts ⇒ Object (readonly)
Returns the value of attribute raw_shape_counts.
# File 'lib/iriq/corpus.rb', line 41
def raw_shape_counts
  @raw_shape_counts
end
Class Method Details
.from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object
# File 'lib/iriq/corpus.rb', line 247
def self.from_dump(h, classifier: SegmentClassifier::DEFAULT)
  c = new(
    classifier: classifier,
    max_values_per_position: h.fetch("max_values_per_position", PositionStats::DEFAULT_MAX_VALUES),
  )
  c.instance_variable_set(:@host_counts, Hash.new(0).merge(h["host_counts"]))
  c.instance_variable_set(:@path_length_counts, Hash.new(0).merge(h["path_length_counts"].transform_keys(&:to_i)))
  c.instance_variable_set(:@raw_shape_counts, Hash.new(0).merge(h["raw_shape_counts"]))
  c.instance_variable_set(:@fingerprint_counts, Hash.new(0).merge(h["fingerprint_counts"]))
  stats = h["position_stats"].each_with_object({}) do |(host, prefix, sdump), acc|
    acc[[host, prefix]] = PositionStats.from_dump(sdump)
  end
  c.instance_variable_set(:@position_stats, stats)
  c.instance_variable_set(:@clusterer, Clusterer.from_dump(h["clusterer"], classifier: classifier))
  c
end
.load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object
# File 'lib/iriq/corpus.rb', line 264
def self.load(path, classifier: SegmentClassifier::DEFAULT)
  from_dump(JSON.parse(File.read(path)), classifier: classifier)
end
Instance Method Details
#clusters ⇒ Object
# File 'lib/iriq/corpus.rb', line 92
def clusters
  @clusterer.clusters
end
#dump ⇒ Object
# File 'lib/iriq/corpus.rb', line 229
def dump
  {
    "host_counts" => @host_counts,
    "path_length_counts" => @path_length_counts.transform_keys(&:to_s),
    "raw_shape_counts" => @raw_shape_counts,
    "fingerprint_counts" => @fingerprint_counts,
    "max_values_per_position" => @max_values_per_position,
    "position_stats" => @position_stats.map { |(host, prefix), s| [host, prefix, s.dump] },
    "clusterer" => @clusterer.dump,
  }
end
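The key conversions in #dump and .from_dump exist because JSON object keys are always strings. The following standalone sketch round-trips a simplified state through JSON using the same conversions; the sample data, the string-valued prefix, and the plain Hash standing in for a PositionStats dump are assumptions for illustration.

```ruby
require "json"

# Hypothetical in-memory state: integer keys and [host, prefix] array keys.
path_length_counts = Hash.new(0).merge({ 2 => 7, 3 => 1 })
position_stats = { ["example.com", "/users"] => { "total" => 8 } }

# Dump: JSON object keys must be strings and cannot be arrays, so integer
# keys are stringified and composite keys are flattened into triples.
dumped = {
  "path_length_counts" => path_length_counts.transform_keys(&:to_s),
  "position_stats" => position_stats.map { |(host, prefix), s| [host, prefix, s] },
}
parsed = JSON.parse(JSON.generate(dumped))

# Restore: convert keys back to integers and rebuild the composite-key Hash.
# Hash#merge returns a copy of the receiver, so the default of 0 survives.
restored_lengths = Hash.new(0).merge(parsed["path_length_counts"].transform_keys(&:to_i))
restored_stats = parsed["position_stats"].each_with_object({}) do |(host, prefix, s), acc|
  acc[[host, prefix]] = s
end

restored_lengths[2]   # => 7
restored_lengths[99]  # => 0
restored_stats[["example.com", "/users"]] # => {"total"=>8}
```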
#explain(input) ⇒ Object
Per-segment explanation with corpus-informed `classification`. Returns an array of entries shaped like the Explanation rows plus a `classification:` key whose value is one of :stable_literal, :variable_identifier, :rare_literal, :ambiguous, :corpus_inferred_variable.
# File 'lib/iriq/corpus.rb', line 85
def explain(input)
  iri = coerce(input)
  annotate_segments(iri).map do |entry|
    entry.reject { |k, _| k == :prefix }
  end
end
#normalize(input) ⇒ Object
Corpus-informed normalization. Falls back to mechanical normalization when the corpus has no signal for a position.
# File 'lib/iriq/corpus.rb', line 68
def normalize(input)
  iri = coerce(input)
  return Normalizer.normalize_identifier(iri) if iri.urn? || iri.path_segments.empty?

  tokens = annotate_segments(iri).map { |entry| corpus_token(entry) }
  out = +""
  out << "#{iri.scheme}://" if iri.scheme
  out << iri.host if iri.host
  out << ":#{iri.port}" if iri.port
  out << "/" << tokens.join("/")
  out
end
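The string assembly at the end of #normalize can be tried in isolation. In this sketch `ParsedIri`, `assemble`, and the `{id}` placeholder token are hypothetical stand-ins; the source does not show what tokens `corpus_token` actually produces.

```ruby
# Hypothetical stand-in for a parsed IRI; not the library's own type.
ParsedIri = Struct.new(:scheme, :host, :port, keyword_init: true)

# Mirrors the assembly steps: each part is appended only when present.
def assemble(iri, tokens)
  out = +"" # unary plus yields an unfrozen String, safe to append to
  out << "#{iri.scheme}://" if iri.scheme
  out << iri.host if iri.host
  out << ":#{iri.port}" if iri.port
  out << "/" << tokens.join("/")
  out
end

iri = ParsedIri.new(scheme: "https", host: "example.com", port: 8443)
assemble(iri, ["users", "{id}"]) # => "https://example.com:8443/users/{id}"

# Missing members default to nil and are skipped.
assemble(ParsedIri.new(host: "example.com"), ["a"]) # => "example.com/a"
```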
#observe(input) ⇒ Object
Observe a single IRI. Returns an Observation.
# File 'lib/iriq/corpus.rb', line 57
def observe(input)
  iri = coerce(input)
  hinted_entries = SegmentHints.derive(iri.path_segments, @classifier)
  record_aggregates(iri, hinted_entries)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true).from_entries(hinted_entries)
  cluster = @clusterer.add(iri, shape: hinted_shape)
  Observation.new(corpus: self, identifier: iri, cluster: cluster)
end
#save(path) ⇒ Object
# File 'lib/iriq/corpus.rb', line 241
def save(path)
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(dump))
  File.rename(tmp, path)
end
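#save writes to a sibling temp file and then renames it over the target, so a reader should not see a half-written file (rename is atomic on POSIX filesystems when both paths are on the same filesystem). A standalone sketch of the same pattern follows; `atomic_write_json` and the sample payload are hypothetical.

```ruby
require "json"
require "tmpdir"

# Write the full payload to "<path>.tmp", then rename it into place.
def atomic_write_json(path, data)
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(data))
  File.rename(tmp, path)
end

# Dir.mktmpdir with a block removes the directory afterwards and returns
# the block's value.
result = Dir.mktmpdir do |dir|
  path = File.join(dir, "corpus.json")
  atomic_write_json(path, { "host_counts" => { "example.com" => 2 } })
  { data: JSON.parse(File.read(path)), tmp_left: File.exist?("#{path}.tmp") }
end

result[:data]     # => {"host_counts"=>{"example.com"=>2}}
result[:tmp_left] # => false
```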
#size ⇒ Object
# File 'lib/iriq/corpus.rb', line 96
def size
  @clusterer.size
end
#stats_for(host, prefix) ⇒ Object
Stats for a given (host, prefix_shape) — useful for tests and debugging. Returns nil if nothing has been observed there.
# File 'lib/iriq/corpus.rb', line 102
def stats_for(host, prefix)
  @position_stats[[host, prefix]]
end
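The lookup works because Ruby arrays hash by value, so a freshly built `[host, prefix]` pair matches a stored key with the same contents. A tiny illustration with made-up keys and a placeholder symbol in place of a real stats object:

```ruby
# Arrays are compared by content when used as Hash keys.
position_stats = {}
position_stats[["example.com", "/users"]] = :stats_placeholder

position_stats[["example.com", "/users"]]   # => :stats_placeholder
position_stats[["example.com", "/orders"]]  # => nil (nothing observed there)
```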