Class: Iriq::Corpus

Inherits:
Object
Defined in:
lib/iriq/corpus.rb

Overview

Streaming-friendly observer over a (potentially unbounded) corpus of IRIs. Maintains rolling aggregates and per-(host, prefix) frequency stats so that classification can improve as more data flows in.

The deterministic, single-IRI API (Iriq.normalize/explain) is unchanged; Corpus#normalize and Corpus#explain are the corpus-informed variants.

State lives in a Storage backend (Memory by default; Json or Sqlite when opened against a file). The classification logic on top is identical regardless of where the counters live.

Constant Summary

VARIABLE_DOMINANCE_THRESHOLD =

Type-based: position is “mostly variable” (UUIDs/integers/etc.).

0.8
LITERAL_UNIQUENESS_THRESHOLD =

Cardinality-based: position has mostly distinct literal values, so the literal “type” is misleading — it’s really a variable slot. We trigger on either:

- very high cardinality fraction (most observations are singletons), OR
- moderate cardinality fraction AND high absolute distinct count

The second branch catches realistic streams where a few popular outliers pull the cardinality fraction down but the long tail is clearly variable.

0.8
LITERAL_UNIQUENESS_MODERATE_THRESHOLD =
0.5
MIN_CARDINALITY_FOR_INFERENCE =
20
MIN_OBSERVATIONS_FOR_INFERENCE =

Don’t apply corpus heuristics until we have at least this many observations at a position — too easy to be wrong with tiny samples.

5
STABLE_LITERAL_THRESHOLD =

Value-fraction at or above which a literal is considered the stable occupant of its position.

0.5
5
3
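
The cardinality rule described above can be sketched as a standalone predicate. This is only an illustration, not the library's implementation: the method name `literal_looks_variable?`, its array input, and the way the constants are combined (fraction of distinct values compared against the two uniqueness thresholds, distinct count compared against MIN_CARDINALITY_FOR_INFERENCE) are assumptions inferred from the constant descriptions.

```ruby
# Hypothetical stand-in for the corpus heuristic; not the library's code.
# The constant values are copied from the summary above.
LITERAL_UNIQUENESS_THRESHOLD          = 0.8
LITERAL_UNIQUENESS_MODERATE_THRESHOLD = 0.5
MIN_CARDINALITY_FOR_INFERENCE         = 20
MIN_OBSERVATIONS_FOR_INFERENCE        = 5

# values: the literal strings seen at one (host, prefix) position.
def literal_looks_variable?(values)
  total = values.size
  # Too few samples to infer anything.
  return false if total < MIN_OBSERVATIONS_FOR_INFERENCE

  distinct = values.uniq.size
  fraction = distinct.to_f / total

  # Branch 1: nearly every observation is a singleton.
  return true if fraction >= LITERAL_UNIQUENESS_THRESHOLD

  # Branch 2: popular outliers lower the fraction, but the tail is long.
  fraction >= LITERAL_UNIQUENESS_MODERATE_THRESHOLD &&
    distinct >= MIN_CARDINALITY_FOR_INFERENCE
end

slugs     = (1..10).map { |i| "slug-#{i}" }             # all distinct
long_tail = (1..30).map { |i| "post-#{i}" } + ["home"] * 20
stable    = ["users"] * 50

literal_looks_variable?(slugs)      # => true  (branch 1)
literal_looks_variable?(long_tail)  # => true  (branch 2)
literal_looks_variable?(stable)     # => false
```

In the `long_tail` case the fraction is 31/50 = 0.62, below the high threshold, so only the second branch (moderate fraction plus at least 20 distinct values) flags the position.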

Constructor Details

#initialize(classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES, storage: nil) ⇒ Corpus

Returns a new instance of Corpus.



# File 'lib/iriq/corpus.rb', line 47

def initialize(classifier: SegmentClassifier::DEFAULT,
               max_values_per_position: PositionStats::DEFAULT_MAX_VALUES,
               storage: nil)
  @classifier = classifier
  @storage    = storage || Storage::Memory.new(
    classifier: classifier,
    max_values_per_position: max_values_per_position,
  )
end

Instance Attribute Details

#storage ⇒ Object (readonly)

Returns the value of attribute storage.

# File 'lib/iriq/corpus.rb', line 45

def storage
  @storage
end

Class Method Details

.from_dump(h, classifier: SegmentClassifier::DEFAULT) ⇒ Object



# File 'lib/iriq/corpus.rb', line 282

def self.from_dump(h, classifier: SegmentClassifier::DEFAULT)
  max_values = h.fetch("max_values_per_position", PositionStats::DEFAULT_MAX_VALUES)
  storage = Storage::Memory.new(classifier: classifier, max_values_per_position: max_values)
  storage.load_dump!(h)
  new(classifier: classifier, storage: storage)
end

.load(path, classifier: SegmentClassifier::DEFAULT) ⇒ Object



# File 'lib/iriq/corpus.rb', line 289

def self.load(path, classifier: SegmentClassifier::DEFAULT)
  open(path, classifier: classifier)
end

.open(path, classifier: SegmentClassifier::DEFAULT, max_values_per_position: PositionStats::DEFAULT_MAX_VALUES) ⇒ Object

Open a corpus against `path`. The file extension picks the backend: `.db`/`.sqlite`/`.sqlite3` use SQLite (incremental writes); anything else uses JSON.

# File 'lib/iriq/corpus.rb', line 60

def self.open(path, classifier: SegmentClassifier::DEFAULT,
                    max_values_per_position: PositionStats::DEFAULT_MAX_VALUES)
  storage = Storage.open(path,
                         classifier: classifier,
                         max_values_per_position: max_values_per_position)
  new(classifier: classifier, storage: storage)
end
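
The extension rule can be illustrated in isolation. The real dispatch happens inside Storage.open, whose source is not shown on this page, so the helper below is a hypothetical stand-in: `backend_for`, the `SQLITE_EXTENSIONS` constant, the returned symbols, and the case-sensitive matching are all assumptions.

```ruby
# Hypothetical helper; the real dispatch lives in Storage.open, not shown here.
SQLITE_EXTENSIONS = %w[.db .sqlite .sqlite3].freeze

# Returns :sqlite for the documented SQLite extensions, :json for anything else.
# Matching is case-sensitive in this sketch, which is an assumption.
def backend_for(path)
  SQLITE_EXTENSIONS.include?(File.extname(path)) ? :sqlite : :json
end

backend_for("corpus.sqlite3")  # => :sqlite
backend_for("corpus.json")     # => :json
backend_for("corpus")          # => :json (no extension falls through to JSON)
```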

Instance Method Details

#batch(&block) ⇒ Object

Wrap many observations in a single backend transaction. For SQLite this turns thousands of fsyncs into one; for in-memory backends it’s a no-op. Use when ingesting a batch.



# File 'lib/iriq/corpus.rb', line 169

def batch(&block)
  @storage.batch(&block)
end

#close ⇒ Object

# File 'lib/iriq/corpus.rb', line 162

def close
  @storage.close
end

#clusters ⇒ Object

# File 'lib/iriq/corpus.rb', line 132

def clusters
  @storage.clusters
end

#dump ⇒ Object

Legacy dump/load (JSON shape)

The pre-Storage release exposed `Corpus#dump`, `Corpus#save(path)`, and `Corpus.load(path)` for JSON-backed persistence. Those names still work but are now thin wrappers around the appropriate Storage backend.

# File 'lib/iriq/corpus.rb', line 278

def dump
  memory_view.to_dump
end

#each_position_stats(&block) ⇒ Object

Iterates (host, prefix) → PositionStats over all observed positions. Used by inspection tooling; not part of the hot path.



# File 'lib/iriq/corpus.rb', line 128

def each_position_stats(&block)
  @storage.each_position_stats(&block)
end

#explain(input) ⇒ Object

Per-segment explanation with corpus-informed `classification`. Returns an array of entries shaped like the Explanation rows, plus a `classification:` key whose value is one of :stable_literal, :variable_identifier, :rare_literal, :ambiguous, or :corpus_inferred_variable.

# File 'lib/iriq/corpus.rb', line 114

def explain(input)
  iri = coerce(input)
  annotate_segments(iri).map do |entry|
    entry.reject { |k, _| k == :prefix }
  end
end

#fingerprint_counts ⇒ Object

# File 'lib/iriq/corpus.rb', line 124

def fingerprint_counts; @storage.fingerprint_counts; end

#host_counts ⇒ Object

# File 'lib/iriq/corpus.rb', line 121

def host_counts;        @storage.host_counts;        end

#normalize(input) ⇒ Object

Corpus-informed normalization. Falls back to mechanical normalization when the corpus has no signal for a position.



# File 'lib/iriq/corpus.rb', line 97

def normalize(input)
  iri = coerce(input)
  return Normalizer.normalize_identifier(iri) if iri.urn? || iri.path_segments.empty?

  tokens = annotate_segments(iri).map { |entry| corpus_token(entry) }
  out = +""
  out << "#{iri.scheme}://" if iri.scheme
  out << iri.host if iri.host
  out << ":#{iri.port}" if iri.port
  out << "/" << tokens.join("/")
  out
end
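
The string assembly at the end of #normalize can be tried on its own. The sketch below lifts only that step into a free-standing method; the name `assemble`, its keyword arguments, and the `{id}` token are hypothetical, since the real tokens come from `corpus_token`, which is not shown here.

```ruby
# Hypothetical extraction of the assembly step; inputs are passed in explicitly
# instead of being read from an IRI object.
def assemble(scheme:, host:, port:, tokens:)
  out = +""                              # unfrozen, mutable string
  out << "#{scheme}://" if scheme
  out << host if host
  out << ":#{port}" if port
  out << "/" << tokens.join("/")
  out
end

assemble(scheme: "https", host: "example.com", port: nil,
         tokens: ["users", "{id}"])
# => "https://example.com/users/{id}"

assemble(scheme: nil, host: nil, port: nil, tokens: ["users", "{id}"])
# => "/users/{id}"
```

As in the listed source, only scheme, host, port, and path tokens are written to the output on this code path.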

#observe(input) ⇒ Object

Observe a single IRI. Returns an Observation.



# File 'lib/iriq/corpus.rb', line 69

def observe(input)
  iri = coerce(input)
  hinted_entries = SegmentHints.derive(iri.path_segments, @classifier)
  raw_shape    = PathShape.new(classifier: @classifier, hints: false).from_entries(hinted_entries)
  hinted_shape = PathShape.new(classifier: @classifier, hints: true).from_entries(hinted_entries)

  cluster = nil
  @storage.transaction do |s|
    s.increment_host(iri.host)
    s.increment_path_length(iri.path_segments.size)
    s.increment_raw_shape(raw_shape)
    s.increment_fingerprint(hinted_shape)

    prefix = ""
    hinted_entries.each do |entry|
      s.observe_position(iri.host, prefix, entry[:value], entry[:type])
      prefix = "#{prefix}/#{placeholder(entry)}"
    end

    key, host, scheme, shape = Cluster.key_for(iri, classifier: @classifier, shape: hinted_shape)
    cluster = s.add_to_cluster(key, host, scheme, shape, iri)
  end

  Observation.new(corpus: self, identifier: iri, cluster: cluster)
end
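
The prefix loop inside #observe is what makes statistics per-(host, prefix): each segment is recorded under the placeholder path of the segments before it. A minimal sketch of that walk follows, using a toy classifier and a toy placeholder format; `toy_type`, `toy_placeholder`, `position_keys`, and the `{integer}` notation are assumptions, because the library's classifier and `placeholder` helper are not shown on this page.

```ruby
# Hypothetical classifier: all-digit segments are integers, the rest literals.
def toy_type(segment)
  segment.match?(/\A\d+\z/) ? :integer : :literal
end

# Hypothetical placeholder format: literals stay as-is, variables become {type}.
def toy_placeholder(value, type)
  type == :literal ? value : "{#{type}}"
end

# Returns [prefix, value, type] per segment, mirroring the loop in #observe.
def position_keys(segments)
  prefix = ""
  segments.map do |value|
    type = toy_type(value)
    key  = [prefix, value, type]
    prefix = "#{prefix}/#{toy_placeholder(value, type)}"
    key
  end
end

position_keys(%w[users 42 posts])
# => [["", "users", :literal],
#     ["/users", "42", :integer],
#     ["/users/{integer}", "posts", :literal]]
```

Under these assumptions, `/users/42/posts` and `/users/7/posts` both record `posts` under the prefix `/users/{integer}`, so their counts accumulate at the same position.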

#path_length_counts ⇒ Object

# File 'lib/iriq/corpus.rb', line 122

def path_length_counts; @storage.path_length_counts; end

#raw_shape_counts ⇒ Object

# File 'lib/iriq/corpus.rb', line 123

def raw_shape_counts;   @storage.raw_shape_counts;   end

#save(path = nil) ⇒ Object

Persist the corpus.

save()           → flush the backend in place (JSON writes its file,
                   SQLite is already on disk).
save(same_path)  → same as save() — idempotent for the backend's path.
save(other_path) → export to other_path as JSON, regardless of the
                   live backend.


# File 'lib/iriq/corpus.rb', line 153

def save(path = nil)
  backend_path = @storage.respond_to?(:path) ? @storage.path : nil
  if path.nil? || path == backend_path
    @storage.save
  else
    write_json_dump(path)
  end
end
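
The three save cases can be exercised against a stand-in backend. This is a sketch under stated assumptions: `ToyBackend`, its `saves` counter, `save_corpus`, and the use of `JSON.generate` for the export are hypothetical; the real export goes through `write_json_dump`, whose body is not shown on this page.

```ruby
require "json"
require "tmpdir"

# Hypothetical stand-in for a file-backed storage backend.
class ToyBackend
  attr_reader :path, :saves

  def initialize(path)
    @path  = path
    @saves = 0
  end

  # Counts in-place flushes instead of touching the disk.
  def save
    @saves += 1
  end

  def to_dump
    { "hosts" => { "example.com" => 1 } }
  end
end

# Mirrors the branch in Corpus#save.
def save_corpus(backend, path = nil)
  backend_path = backend.respond_to?(:path) ? backend.path : nil
  if path.nil? || path == backend_path
    backend.save                                      # flush in place
  else
    File.write(path, JSON.generate(backend.to_dump))  # export a JSON copy
  end
end

Dir.mktmpdir do |dir|
  live    = File.join(dir, "corpus.db")
  backend = ToyBackend.new(live)

  save_corpus(backend)                                  # save()
  save_corpus(backend, live)                            # save(same_path)
  save_corpus(backend, File.join(dir, "export.json"))   # save(other_path)

  backend.saves  # => 2; the third call wrote export.json instead
end
```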

#size ⇒ Object

# File 'lib/iriq/corpus.rb', line 136

def size
  @storage.cluster_size
end

#stats_for(host, prefix) ⇒ Object

Stats for a given (host, prefix_shape) — useful for tests and debugging. Returns nil if nothing has been observed there.



# File 'lib/iriq/corpus.rb', line 142

def stats_for(host, prefix)
  @storage.position_stats(host, prefix)
end