Module: ToyTokenDrift

Defined in:
lib/toy/dev/toy_token_drift.rb

Class Method Details

.corpus_freq(seqs_path, vocab_size) ⇒ Object

Builds a one-time corpus frequency histogram. Returns an Array<Integer> of length vocab_size in which index = token_id and value = the number of occurrences across all lines of the seqs file (whitespace-separated token IDs, one sequence per line). IDs outside [0, vocab_size) are silently skipped; corpus-vs-vocab drift is the caller's concern.



# File 'lib/toy/dev/toy_token_drift.rb', line 30

def self.corpus_freq(seqs_path, vocab_size)
  counts = [0]; counts.pop
  i = 0
  while i < vocab_size
    counts.push(0)
    i = i + 1
  end
  raw   = File.read(seqs_path)
  lines = raw.split("\n")
  li = 0
  while li < lines.length
    parts = lines[li].split(" ")
    pi = 0
    while pi < parts.length
      tid = parts[pi].to_i
      if tid >= 0 && tid < vocab_size
        counts[tid] = counts[tid] + 1
      end
      pi = pi + 1
    end
    li = li + 1
  end
  counts
end
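
As a minimal sketch of the behavior described above, the following stand-alone Ruby reproduces the histogram on a tiny temporary file. The name corpus_freq_sketch and the sample data are hypothetical, and the method is written in ordinary idiomatic Ruby rather than copied from the module. One point worth knowing: String#to_i returns 0 for non-numeric text, so under this logic a stray non-numeric token would be counted as ID 0.

```ruby
require "tempfile"

# Hypothetical stand-in mirroring the documented behavior of
# ToyTokenDrift.corpus_freq; it is not the module's own code.
def corpus_freq_sketch(seqs_path, vocab_size)
  counts = Array.new(vocab_size, 0)
  File.foreach(seqs_path) do |line|
    line.split(" ").each do |tok|
      tid = tok.to_i
      # IDs outside [0, vocab_size) are skipped silently.
      counts[tid] += 1 if tid >= 0 && tid < vocab_size
    end
  end
  counts
end

# Writes three short sequences to a temp file and counts them
# against a vocabulary of size 4 (IDs 5 and -1 fall outside it).
def corpus_freq_demo
  Tempfile.create("seqs") do |f|
    f.write("0 1 1\n2 5 1\n-1 3\n")
    f.flush
    corpus_freq_sketch(f.path, 4)
  end
end

p corpus_freq_demo  # prints [1, 3, 1, 1]
```

The returned array can be passed as the freqs argument of emit_per_token, where index and token row line up one to one.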

.emit_per_token(sess, t_embed, snap_mat, freqs, vocab_size, d_model, step, t_now) ⇒ Object

Emits one `drift` event per vocab row. Each event carries:

param      = "token_embd.weight"
token_id   = the row index
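cos_to_init = cosine similarity between the current row and the same row of snap_mat (0.0 if either row has zero norm)
l2_to_init  = Euclidean distance between the current row and the same row of snap_mat
freq        = corpus occurrence count taken from freqs (0 if the row index is beyond freqs.length)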
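
To make the event shape concrete, here is a rough sketch that assembles one event with the same field names, using Ruby's standard json library. The helper name drift_event_sketch and the sample values are hypothetical; the module itself serializes through SpinelKit::Json::Builder, whose exact output formatting may differ from what JSON.generate produces.

```ruby
require "json"

# Hypothetical helper: one drift event with the documented fields,
# plus the kind/phase/t/step fields that the method also sets.
def drift_event_sketch(row, cos_to_init, l2_to_init, freq, step, t_now)
  event = {
    "kind"        => "drift",
    "phase"       => "train",
    "t"           => t_now,
    "step"        => step,
    "param"       => "token_embd.weight",
    "token_id"    => row,
    "cos_to_init" => cos_to_init,
    "l2_to_init"  => l2_to_init,
    "freq"        => freq
  }
  JSON.generate(event)
end

puts drift_event_sketch(7, 0.5, 0.25, 42, 100, 1.5)
```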


# File 'lib/toy/dev/toy_token_drift.rb', line 71

def self.emit_per_token(sess, t_embed, snap_mat, freqs,
                          vocab_size, d_model, step, t_now)
  n = vocab_size * d_model
  cur = Mat.new(1, n)
  TinyNN.tnn_download_to_f64_array(sess, t_embed, cur.flat, n)

  row = 0
  while row < vocab_size
    base = row * d_model
    dot = 0.0
    sum_sq_s = 0.0
    sum_sq_c = 0.0
    sum_sq_diff = 0.0
    d = 0
    while d < d_model
      sv = snap_mat.flat[base + d]
      cv = cur.flat[base + d]
      dot = dot + sv * cv
      sum_sq_s = sum_sq_s + sv * sv
      sum_sq_c = sum_sq_c + cv * cv
      diff = sv - cv
      sum_sq_diff = sum_sq_diff + diff * diff
      d = d + 1
    end
    norm_s = sum_sq_s ** 0.5
    norm_c = sum_sq_c ** 0.5
    cos_to_init = 0.0
    if norm_s > 0.0 && norm_c > 0.0
      cos_to_init = dot / (norm_s * norm_c)
    end
    l2_to_init = sum_sq_diff ** 0.5
    freq = 0
    if row < freqs.length
      freq = freqs[row]
    end

    ev = SpinelKit::Json::Builder.new
    ev.add_str("kind",  "drift")
    ev.add_str("phase", "train")
    ev.add_num("t",           t_now)
    ev.add_num("step",        step)
    ev.add_str("param",       "token_embd.weight")
    ev.add_num("token_id",    row)
    ev.add_num("cos_to_init", cos_to_init)
    ev.add_num("l2_to_init",  l2_to_init)
    ev.add_num("freq",        freq)
    TinyNN.tnn_events_emit(ev.dump)
    row = row + 1
  end
end
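
The per-row metrics can be checked by hand on a small example. The sketch below is a hypothetical helper (row_drift_sketch is not part of the module) in which plain Arrays stand in for Mat#flat; it computes the same cosine similarity and L2 distance for one row, including the zero-norm fallback to 0.0.

```ruby
# Hypothetical helper: cosine similarity and L2 distance for one row of
# two flat, row-major buffers laid out as flat[row * d_model + d].
def row_drift_sketch(snap_flat, cur_flat, row, d_model)
  base = row * d_model
  sv = snap_flat[base, d_model]
  cv = cur_flat[base, d_model]
  dot    = sv.zip(cv).sum { |s, c| s * c }
  norm_s = Math.sqrt(sv.sum { |s| s * s })
  norm_c = Math.sqrt(cv.sum { |c| c * c })
  l2     = Math.sqrt(sv.zip(cv).sum { |s, c| (s - c)**2 })
  # Cosine is undefined when either vector is all zeros; report 0.0 then.
  cos = norm_s > 0.0 && norm_c > 0.0 ? dot / (norm_s * norm_c) : 0.0
  [cos, l2]
end

snap = [1.0, 0.0,  0.0, 2.0,   0.0, 0.0]
cur  = [1.0, 0.0,  0.0, -2.0,  3.0, 4.0]
3.times { |row| p row_drift_sketch(snap, cur, row, 2) }
# prints [1.0, 0.0], then [-1.0, 4.0], then [0.0, 5.0]
```

Row 0 is unchanged, row 1 has flipped direction, and row 2 started at zero, so its cosine falls back to 0.0 while its L2 distance still reports the movement.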

.snapshot(sess, t_embed) ⇒ Object

Snapshots the full embedding table into a Mat (held in main scope as the step-0 baseline). Returns a 1 x n Mat, where n = vocab_size * d_model is the tensor's element count. The flat buffer keeps the same order as the tensor's ggml data slot: with ne=[d_model, vocab], d_model is the fastest-varying dimension, so component d of token row is at flat[row*d_model + d].



# File 'lib/toy/dev/toy_token_drift.rb', line 59

def self.snapshot(sess, t_embed)
  n = TinyNN.tnn_tensor_nelements(t_embed)
  m = Mat.new(1, n)
  TinyNN.tnn_download_to_f64_array(sess, t_embed, m.flat, n)
  m
end
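
The indexing rule flat[row*d_model + d] can be illustrated without a session. In this sketch a plain Array stands in for the snapshot's flat buffer, and embed_row_sketch is a hypothetical helper, not something the module provides.

```ruby
# Hypothetical helper: slice one token's embedding out of a flat buffer
# laid out as flat[row * d_model + d].
def embed_row_sketch(flat, row, d_model)
  flat[row * d_model, d_model]
end

vocab_size = 3
d_model    = 2
# Fill the buffer with 0.0, 1.0, ... so positions are easy to read off.
flat = Array.new(vocab_size * d_model) { |i| i.to_f }

p embed_row_sketch(flat, 2, d_model)  # prints [4.0, 5.0]
```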