Module: ToyTokenDrift
- Defined in: lib/toy/dev/toy_token_drift.rb
Class Method Summary
- .corpus_freq(seqs_path, vocab_size) ⇒ Object
  One-time corpus frequency histogram.
- .emit_per_token(sess, t_embed, snap_mat, freqs, vocab_size, d_model, step, t_now) ⇒ Object
  Emit one `drift` event per vocab row.
- .snapshot(sess, t_embed) ⇒ Object
  Snapshot the full embed table into a Mat (held in main scope as the step-0 baseline).
Class Method Details
.corpus_freq(seqs_path, vocab_size) ⇒ Object
One-time corpus frequency histogram. Returns an Array<Integer> of length vocab_size where index = token_id and value = occurrence count across all lines in the seqs file. IDs outside [0, vocab_size) are silently skipped (corpus-vs-vocab drift is the caller’s concern).
# File 'lib/toy/dev/toy_token_drift.rb', line 30

def self.corpus_freq(seqs_path, vocab_size)
  counts = [0]; counts.pop
  i = 0
  while i < vocab_size
    counts.push(0)
    i = i + 1
  end
  raw = File.read(seqs_path)
  lines = raw.split("\n")
  li = 0
  while li < lines.length
    parts = lines[li].split(" ")
    pi = 0
    while pi < parts.length
      tid = parts[pi].to_i
      if tid >= 0 && tid < vocab_size
        counts[tid] = counts[tid] + 1
      end
      pi = pi + 1
    end
    li = li + 1
  end
  counts
end
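As a rough illustration of the counting rule, here is a minimal standalone sketch that works on an in-memory string instead of a file. `count_ids` is a hypothetical helper, not part of ToyTokenDrift; it only mirrors the documented behavior (whitespace-separated IDs, out-of-range IDs skipped). One thing worth knowing: Ruby's String#to_i returns 0 for non-numeric text, so a stray word in the input would be counted as token 0 rather than skipped.

```ruby
# Hypothetical helper: same counting rule as corpus_freq, applied to a string.
def count_ids(text, vocab_size)
  counts = Array.new(vocab_size, 0)
  text.split("\n").each do |line|
    line.split(" ").each do |part|
      tid = part.to_i # non-numeric text becomes 0
      counts[tid] += 1 if tid >= 0 && tid < vocab_size
    end
  end
  counts
end

sample = "0 1 1\n2 7 -1\n" # 7 and -1 fall outside [0, 3)
counts = count_ids(sample, 3)
stray = count_ids("hello", 2) # "hello".to_i is 0, so it lands in slot 0
```

Comparing the sum of the histogram with the total number of IDs in the file is one way for a caller to notice how many IDs were dropped.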
.emit_per_token(sess, t_embed, snap_mat, freqs, vocab_size, d_model, step, t_now) ⇒ Object
Emit one `drift` event per vocab row. Each event carries:
param = "token_embd.weight"
token_id = the row index
cos_to_init, l2_to_init = per-row metric vs snap_mat
freq = corpus occurrence count
# File 'lib/toy/dev/toy_token_drift.rb', line 71

def self.emit_per_token(sess, t_embed, snap_mat, freqs, vocab_size, d_model, step, t_now)
  n = vocab_size * d_model
  cur = Mat.new(1, n)
  TinyNN.tnn_download_to_f64_array(sess, t_embed, cur.flat, n)

  row = 0
  while row < vocab_size
    base = row * d_model
    dot = 0.0
    sum_sq_s = 0.0
    sum_sq_c = 0.0
    sum_sq_diff = 0.0
    d = 0
    while d < d_model
      sv = snap_mat.flat[base + d]
      cv = cur.flat[base + d]
      dot = dot + sv * cv
      sum_sq_s = sum_sq_s + sv * sv
      sum_sq_c = sum_sq_c + cv * cv
      diff = sv - cv
      sum_sq_diff = sum_sq_diff + diff * diff
      d = d + 1
    end
    norm_s = sum_sq_s ** 0.5
    norm_c = sum_sq_c ** 0.5
    cos_to_init = 0.0
    if norm_s > 0.0 && norm_c > 0.0
      cos_to_init = dot / (norm_s * norm_c)
    end
    l2_to_init = sum_sq_diff ** 0.5

    freq = 0
    if row < freqs.length
      freq = freqs[row]
    end

    ev = SpinelKit::Json::Builder.new
    ev.add_str("kind", "drift")
    ev.add_str("phase", "train")
    ev.add_num("t", t_now)
    ev.add_num("step", step)
    ev.add_str("param", "token_embd.weight")
    ev.add_num("token_id", row)
    ev.add_num("cos_to_init", cos_to_init)
    ev.add_num("l2_to_init", l2_to_init)
    ev.add_num("freq", freq)
    TinyNN.tnn_events_emit(ev.dump)
    row = row + 1
  end
end
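To make the two per-row metrics easier to check by hand, the following sketch computes them for a single pair of rows held in plain Ruby arrays. `row_drift` is a hypothetical helper introduced only for illustration; it assumes the same definitions as emit_per_token (cosine similarity with a zero-norm guard, Euclidean distance) and does not touch Mat, TinyNN, or the event builder.

```ruby
# Hypothetical helper: cosine similarity and L2 distance between an
# initial row and its current value, both given as Arrays of Floats.
def row_drift(init_row, cur_row)
  dot = 0.0
  sum_sq_i = 0.0
  sum_sq_c = 0.0
  sum_sq_diff = 0.0
  init_row.each_with_index do |iv, d|
    cv = cur_row[d]
    dot += iv * cv
    sum_sq_i += iv * iv
    sum_sq_c += cv * cv
    sum_sq_diff += (iv - cv)**2
  end
  norm_i = Math.sqrt(sum_sq_i)
  norm_c = Math.sqrt(sum_sq_c)
  # As in emit_per_token, a zero-norm row reports cosine 0.0
  # instead of dividing by zero.
  cos = norm_i > 0.0 && norm_c > 0.0 ? dot / (norm_i * norm_c) : 0.0
  [cos, Math.sqrt(sum_sq_diff)]
end

cos_same, l2_same = row_drift([3.0, 4.0], [3.0, 4.0]) # unchanged row
cos_orth, l2_orth = row_drift([1.0, 0.0], [0.0, 1.0]) # rotated by 90 degrees
cos_zero, l2_zero = row_drift([0.0, 0.0], [3.0, 4.0]) # zero-initialized row
```

Because of the guard, a cos_to_init of 0.0 can mean either an orthogonal row or a zero-norm row; l2_to_init helps tell the two apart.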
.snapshot(sess, t_embed) ⇒ Object
Snapshot the full embed table into a Mat (held in main scope as the step-0 baseline). Returns a Mat of length vocab_size * d_model. Its row-major flat buffer has the same layout as the tensor’s ggml column-major data slot (ne=[d_model, vocab] → flat[row*d_model + d]).
# File 'lib/toy/dev/toy_token_drift.rb', line 59

def self.snapshot(sess, t_embed)
  n = TinyNN.tnn_tensor_nelements(t_embed)
  m = Mat.new(1, n)
  TinyNN.tnn_download_to_f64_array(sess, t_embed, m.flat, n)
  m
end
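The indexing rule flat[row*d_model + d] can be sketched with a plain Ruby Array standing in for the snapshot's flat buffer. `embed_row` is a hypothetical helper, and treating the buffer as an ordinary Array is an assumption made for the example; the real Mat type may expose its data differently.

```ruby
# Hypothetical helper: slice one token's embedding out of a flat buffer
# laid out as flat[row * d_model + d].
def embed_row(flat, d_model, row)
  flat[row * d_model, d_model] # Array#[] with (start, length)
end

d_model = 3
flat = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3] # two rows of width 3
row0 = embed_row(flat, d_model, 0)
row1 = embed_row(flat, d_model, 1)
```

Each token's d_model values sit next to each other in the buffer, which is why emit_per_token can walk a row with a single base offset.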