Class: Iriq::Corpus

Inherits:
Object
Defined in:
lib/iriq/corpus.rb

Overview

Streaming-friendly observer over a (potentially unbounded) corpus of IRIs. Maintains rolling aggregates and per-(host, prefix) frequency stats so that classification can improve as more data flows in.

The deterministic, single-IRI API (Iriq.normalize and Iriq.explain) is unchanged; Corpus#normalize and Corpus#explain are the corpus-informed variants.
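To make the "rolling aggregates and per-(host, prefix) frequency stats" idea concrete, here is a minimal sketch of a streaming observer. It is not the Iriq implementation: `MiniCorpus` is a hypothetical name, it uses Ruby's standard `URI` parser rather than Iriq's own IRI type, and it keys stats on the literal path prefix (the real class appears to key on a prefix shape).

```ruby
require "uri"

# Hypothetical miniature of the idea; NOT the Iriq::Corpus implementation.
class MiniCorpus
  attr_reader :host_counts, :position_stats

  def initialize
    # host => number of observations
    @host_counts = Hash.new(0)
    # [host, prefix] => { segment value => count }
    @position_stats = Hash.new { |h, k| h[k] = Hash.new(0) }
  end

  # Record one URL and update the counters incrementally.
  def observe(url)
    uri = URI.parse(url)
    @host_counts[uri.host] += 1

    prefix = ""
    uri.path.split("/").reject(&:empty?).each do |segment|
      # Count which value occupied this position under this prefix.
      @position_stats[[uri.host, prefix]][segment] += 1
      prefix = "#{prefix}/#{segment}"
    end
    self
  end
end
```

Because every update is a counter increment, the observer never needs to retain the URLs themselves, which is what makes this style suitable for unbounded streams.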

Constant Summary

VARIABLE_DOMINANCE_THRESHOLD =

Type-based: position is “mostly variable” (UUIDs/integers/etc.).

0.8
LITERAL_UNIQUENESS_THRESHOLD =

Cardinality-based: position has mostly distinct literal values, so the literal “type” is misleading — it’s really a variable slot. We trigger on either:

- very high cardinality fraction (most observations are singletons), OR
- moderate cardinality fraction AND high absolute distinct count

The second branch catches realistic streams where popular outliers bring the frac down but the long tail is clearly variable.

0.8
LITERAL_UNIQUENESS_MODERATE_THRESHOLD =
0.5
MIN_CARDINALITY_FOR_INFERENCE =
20
MIN_OBSERVATIONS_FOR_INFERENCE =

Don’t apply corpus heuristics until we have at least this many observations at a position — too easy to be wrong with tiny samples.

5
STABLE_LITERAL_THRESHOLD =

Value-fraction at or above which a literal is considered the stable occupant of its position.

0.5
5
3
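The two-branch cardinality rule described for LITERAL_UNIQUENESS_THRESHOLD can be sketched as follows. The constant values are copied from the summary above, but how they combine is an assumption: the sketch takes the "cardinality fraction" to be distinct values divided by total observations, and pairs the moderate threshold with MIN_CARDINALITY_FOR_INFERENCE. The module and method names are hypothetical.

```ruby
# Hypothetical sketch; the way the thresholds combine is assumed,
# not taken from lib/iriq/corpus.rb.
module CardinalitySketch
  LITERAL_UNIQUENESS_THRESHOLD          = 0.8
  LITERAL_UNIQUENESS_MODERATE_THRESHOLD = 0.5
  MIN_CARDINALITY_FOR_INFERENCE         = 20
  MIN_OBSERVATIONS_FOR_INFERENCE        = 5

  # value_counts: Hash of literal value => times seen at one position.
  def self.literal_slot_looks_variable?(value_counts)
    total    = value_counts.values.sum
    distinct = value_counts.size
    # Too few observations: refuse to infer anything.
    return false if total < MIN_OBSERVATIONS_FOR_INFERENCE

    frac = distinct.to_f / total
    # Branch 1: almost every observation is a distinct value.
    return true if frac >= LITERAL_UNIQUENESS_THRESHOLD

    # Branch 2: popular outliers lower the fraction, but the
    # long tail of distinct values is still large.
    frac >= LITERAL_UNIQUENESS_MODERATE_THRESHOLD &&
      distinct >= MIN_CARDINALITY_FOR_INFERENCE
  end
end
```

For example, a position seen 50 times with one popular value (21 hits) and 29 singletons has a fraction of 0.6, which fails the first branch but passes the second because it holds 30 distinct values.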

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES) ⇒ Corpus

Returns a new instance of Corpus.



# File 'lib/iriq/corpus.rb', line 44

def initialize(classifier: SegmentClassifier::DEFAULT,
               max_values_per_position: PositionStats::DEFAULT_MAX_VALUES)
  @classifier              = classifier
  @max_values_per_position = max_values_per_position
  @host_counts             = Hash.new(0)
  @path_length_counts      = Hash.new(0)
  @raw_shape_counts        = Hash.new(0)
  @fingerprint_counts      = Hash.new(0)
  @position_stats          = {}
  @clusterer               = Clusterer.new(classifier: classifier)
end

Instance Attribute Details

#fingerprint_counts ⇒ Object (readonly)

Returns the value of attribute fingerprint_counts.



# File 'lib/iriq/corpus.rb', line 41

def fingerprint_counts
  @fingerprint_counts
end

#host_counts ⇒ Object (readonly)

Returns the value of attribute host_counts.



# File 'lib/iriq/corpus.rb', line 41

def host_counts
  @host_counts
end

#path_length_counts ⇒ Object (readonly)

Returns the value of attribute path_length_counts.



# File 'lib/iriq/corpus.rb', line 41

def path_length_counts
  @path_length_counts
end

#position_stats ⇒ Object (readonly)

Returns the value of attribute position_stats.



# File 'lib/iriq/corpus.rb', line 41

def position_stats
  @position_stats
end

#raw_shape_counts ⇒ Object (readonly)

Returns the value of attribute raw_shape_counts.



# File 'lib/iriq/corpus.rb', line 41

def raw_shape_counts
  @raw_shape_counts
end

Class Method Details

.from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object



# File 'lib/iriq/corpus.rb', line 247

def self.from_dump(h, classifier: SegmentClassifier::DEFAULT)
  c = new(
    classifier: classifier,
    max_values_per_position: h.fetch("max_values_per_position", PositionStats::DEFAULT_MAX_VALUES),
  )
  c.instance_variable_set(:@host_counts,        Hash.new(0).merge(h["host_counts"]))
  c.instance_variable_set(:@path_length_counts, Hash.new(0).merge(h["path_length_counts"].transform_keys(&:to_i)))
  c.instance_variable_set(:@raw_shape_counts,   Hash.new(0).merge(h["raw_shape_counts"]))
  c.instance_variable_set(:@fingerprint_counts, Hash.new(0).merge(h["fingerprint_counts"]))
  stats = h["position_stats"].each_with_object({}) do |(host, prefix, sdump), acc|
    acc[[host, prefix]] = PositionStats.from_dump(sdump)
  end
  c.instance_variable_set(:@position_stats, stats)
  c.instance_variable_set(:@clusterer, Clusterer.from_dump(h["clusterer"], classifier: classifier))
  c
end

.load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object



# File 'lib/iriq/corpus.rb', line 264

def self.load(path, classifier: SegmentClassifier::DEFAULT)
  from_dump(JSON.parse(File.read(path)), classifier: classifier)
end

Instance Method Details

#clusters ⇒ Object



# File 'lib/iriq/corpus.rb', line 92

def clusters
  @clusterer.clusters
end

#dump ⇒ Object



# File 'lib/iriq/corpus.rb', line 229

def dump
  {
    "host_counts"             => @host_counts,
    "path_length_counts"      => @path_length_counts.transform_keys(&:to_s),
    "raw_shape_counts"        => @raw_shape_counts,
    "fingerprint_counts"      => @fingerprint_counts,
    "max_values_per_position" => @max_values_per_position,
    "position_stats"          => @position_stats.map { |(host, prefix), s| [host, prefix, s.dump] },
    "clusterer"               => @clusterer.dump,
  }
end
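The key conversions in #dump and .from_dump exist because JSON object keys are always strings and JSON has no tuple type. The sketch below shows the same round trip in isolation: integer keys go out as strings and come back via `to_i`, and `[host, prefix]` keys are flattened into triples. `dump_state` and `load_state` are hypothetical helper names, and plain hashes stand in for PositionStats objects.

```ruby
require "json"

# Hypothetical helpers illustrating the round trip; plain hashes
# stand in for Iriq's PositionStats objects.
def dump_state(path_length_counts, position_stats)
  JSON.generate({
    # JSON keys must be strings, so integer lengths are stringified.
    "path_length_counts" => path_length_counts.transform_keys(&:to_s),
    # Array keys cannot be JSON object keys; emit [host, prefix, stats] triples.
    "position_stats"     => position_stats.map { |(host, prefix), s| [host, prefix, s] },
  })
end

def load_state(json)
  h = JSON.parse(json)
  # Hash#merge keeps the receiver's default, so missing keys still read as 0.
  lengths = Hash.new(0).merge(h["path_length_counts"].transform_keys(&:to_i))
  stats = h["position_stats"].each_with_object({}) do |(host, prefix, s), acc|
    acc[[host, prefix]] = s
  end
  [lengths, stats]
end
```

Merging into `Hash.new(0)` rather than assigning the parsed hash directly matters: it restores the zero default that counters rely on after a reload.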

#explain(input) ⇒ Object

Per-segment explanation with a corpus-informed `classification`. Returns an array of entries shaped like the Explanation rows, each with an added `classification:` key whose value is one of :stable_literal, :variable_identifier, :rare_literal, :ambiguous, or :corpus_inferred_variable.



# File 'lib/iriq/corpus.rb', line 85

def explain(input)
  iri = coerce(input)
  annotate_segments(iri).map do |entry|
    entry.reject { |k, _| k == :prefix }
  end
end

#normalize(input) ⇒ Object

Corpus-informed normalization. Falls back to mechanical normalization when the corpus has no signal for a position.



# File 'lib/iriq/corpus.rb', line 68

def normalize(input)
  iri = coerce(input)
  return Normalizer.normalize_identifier(iri) if iri.urn? || iri.path_segments.empty?

  tokens = annotate_segments(iri).map { |entry| corpus_token(entry) }
  out = +""
  out << "#{iri.scheme}://" if iri.scheme
  out << iri.host if iri.host
  out << ":#{iri.port}" if iri.port
  out << "/" << tokens.join("/")
  out
end
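The fallback behaviour can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: `normalize_sketch` and `mechanical_token` are hypothetical names, the `{integer}` and `{var}` placeholders are invented token spellings, the "mechanical" rule is reduced to a digits-only check, and Ruby's standard `URI` replaces Iriq's IRI type (so default ports are filtered explicitly).

```ruby
require "uri"

# Hypothetical mechanical fallback: treat all-digit segments as variable.
def mechanical_token(segment)
  segment.match?(/\A\d+\z/) ? "{integer}" : segment
end

# corpus_tokens: Hash of segment index => token the corpus is confident
# about (a hypothetical stand-in for corpus-derived signal).
def normalize_sketch(url, corpus_tokens = {})
  uri = URI.parse(url)
  segments = uri.path.split("/").reject(&:empty?)

  tokens = segments.each_with_index.map do |seg, i|
    # Prefer the corpus signal; fall back to the mechanical rule.
    corpus_tokens.fetch(i) { mechanical_token(seg) }
  end

  out = +""
  out << "#{uri.scheme}://" if uri.scheme
  out << uri.host if uri.host
  # URI always reports a port, so only emit non-default ones.
  out << ":#{uri.port}" if uri.port && uri.port != uri.default_port
  out << "/" << tokens.join("/")
  out
end
```

With no corpus signal, `normalize_sketch("https://example.com/users/42")` yields `"https://example.com/users/{integer}"`; passing `{ 1 => "{var}" }` overrides the second segment even when it is a word such as `alice`.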

#observe(input) ⇒ Object

Observe a single IRI. Returns an Observation.



# File 'lib/iriq/corpus.rb', line 57

def observe(input)
  iri = coerce(input)
  hinted_entries = SegmentHints.derive(iri.path_segments, @classifier)
  record_aggregates(iri, hinted_entries)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true).from_entries(hinted_entries)
  cluster = @clusterer.add(iri, shape: hinted_shape)
  Observation.new(corpus: self, identifier: iri, cluster: cluster)
end

#save(path) ⇒ Object



# File 'lib/iriq/corpus.rb', line 241

def save(path)
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(dump))
  File.rename(tmp, path)
end

#size ⇒ Object



# File 'lib/iriq/corpus.rb', line 96

def size
  @clusterer.size
end

#stats_for(host, prefix) ⇒ Object

Stats for a given (host, prefix_shape) — useful for tests and debugging. Returns nil if nothing has been observed there.



# File 'lib/iriq/corpus.rb', line 102

def stats_for(host, prefix)
  @position_stats[[host, prefix]]
end