Module: ToyCorpusLoader

Defined in:
lib/toy/io/toy_corpus_loader.rb

Overview

E2.4 / GH#14 — streaming token-corpus loader.

Reads packed i32 tokens from a binary file (produced by prep/pretokenize_corpus.py) in fixed-size sequences. Caller owns the byte-offset cursor.
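The packed layout can be illustrated in pure Ruby. This is a minimal sketch rather than the loader's actual code path (the loader calls into C): the module PackedTokenSketch and its methods are hypothetical, and little-endian byte order is an assumption, since the source does not state how prep/pretokenize_corpus.py orders bytes.

```ruby
require "tempfile"

# Hypothetical helper module; not part of the toy library.
module PackedTokenSketch
  TOKEN_BYTES = 4 # one int32 per token

  # Write tokens as packed little-endian int32 (byte order is an assumption).
  def self.write_tokens(path, tokens)
    File.binwrite(path, tokens.pack("l<*"))
  end

  # Read up to n_tokens tokens starting at byte_offset; may return fewer at EOF.
  def self.read_tokens(path, byte_offset, n_tokens)
    bytes = IO.binread(path, n_tokens * TOKEN_BYTES, byte_offset)
    return [] if bytes.nil? # offset at or past EOF
    bytes.unpack("l<*")     # a trailing partial token, if any, is ignored
  end
end

Tempfile.create("tokens") do |f|
  f.close
  PackedTokenSketch.write_tokens(f.path, [101, 7, 42, 9, 5])
  # Token index 2 lives at byte offset 2 * TOKEN_BYTES = 8.
  p PackedTokenSketch.read_tokens(f.path, 2 * PackedTokenSketch::TOKEN_BYTES, 2)
  # → [42, 9]
end
```

Because the cursor is a byte offset, token index k corresponds to byte offset k * TOKEN_BYTES.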

Wraparound: when a read runs past EOF, the loader keeps the tokens it got and fills the rest of the sequence from offset 0. This mirrors the standard “epoch” pattern: the corpus is finite, so when train_steps × T (T being the tokens read per step) exceeds the corpus token count, reads wrap and the data repeats.
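Since the caller owns the cursor, it also has to wrap the cursor to stay in step with the loader. One possible way to do that is sketched here; CursorSketch and next_offset are hypothetical names, and the sketch assumes the corpus size is a multiple of 4 bytes and at least one sequence long.

```ruby
# Hypothetical helper; not part of ToyCorpusLoader.
module CursorSketch
  TOKEN_BYTES = 4

  # Advance the byte cursor by one sequence, wrapping at the end of the corpus.
  def self.next_offset(byte_offset, n_tokens, corpus_bytes)
    (byte_offset + n_tokens * TOKEN_BYTES) % corpus_bytes
  end
end

corpus_bytes = 10 * CursorSketch::TOKEN_BYTES # a 10-token corpus
offset = 0
offsets = []
4.times do
  offsets << offset
  offset = CursorSketch.next_offset(offset, 4, corpus_bytes)
end
p offsets
# → [0, 16, 32, 8]
```

The third read starts at byte 32 (token 8), takes tokens 8 and 9, then wraps for tokens 0 and 1, so the next cursor is byte 8 (token 2).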

The C-side primitive (tnn_read_i32_file) does one fopen+fseek+fread per call. In a streaming training loop that is one small file read per training step, which is negligible. If higher throughput is ever needed, mmap the file once and slice it (a follow-up, not urgent at TinyStories or 10M-token-shard scale).
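Ruby's standard library has no mmap, so the sketch below swaps in a plain read-once-and-slice approach: the whole file is loaded with File.binread and each sequence is sliced from memory. It illustrates the idea only. InMemoryCorpusSketch is a hypothetical class, little-endian int32 is an assumption, and unlike read_seq it does not pad when the corpus is shorter than the request.

```ruby
require "tempfile"

# Hypothetical class; not part of the toy library.
class InMemoryCorpusSketch
  TOKEN_BYTES = 4

  def initialize(path)
    @bytes = File.binread(path) # one read for the whole file
  end

  # Return n_tokens tokens starting at byte_offset, wrapping to offset 0 at EOF.
  # Assumes little-endian int32 and a corpus of at least n_tokens tokens.
  def read_seq(byte_offset, n_tokens)
    want = n_tokens * TOKEN_BYTES
    head = @bytes.byteslice(byte_offset, want) || ""  # nil when offset is past EOF
    tail = @bytes.byteslice(0, want - head.bytesize) || ""
    (head + tail).unpack("l<*")
  end
end

Tempfile.create("corpus") do |f|
  f.close
  File.binwrite(f.path, [10, 20, 30, 40, 50].pack("l<*"))
  corpus = InMemoryCorpusSketch.new(f.path)
  p corpus.read_seq(12, 4)
  # → [40, 50, 10, 20]
end
```

This trades memory for fewer file operations, which only pays off if per-step reads ever become a measurable cost.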

Constant Summary

TOKEN_BYTES = 4

Size of one int32 token, in bytes.

Class Method Summary

Class Method Details

.read_seq(path, byte_offset, n_tokens) ⇒ Object

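Read exactly n_tokens tokens starting at byte_offset and return them as an Array of Integers. On a short read at EOF, the tokens already read are kept and the remainder is read from offset 0. Only one wrap is attempted: if the wrap read also comes up short (the corpus cannot supply the remainder), the rest of the array stays 0 and a warning is printed. If the C call returns a negative code, a warning is printed and the buffer (initialised to zeros) is returned as-is.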
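The three non-error outcomes can be modelled on a plain in-memory array. This is a hedged sketch of the behaviour described above, not the method itself: ReadSeqModel is a hypothetical name, it takes a token offset instead of a byte offset, and it performs no file I/O and prints no warnings.

```ruby
# Hypothetical model of read_seq on an in-memory token array.
module ReadSeqModel
  def self.read_seq(corpus, token_offset, n_tokens)
    head = corpus[token_offset, n_tokens] || []  # first read, may be short at EOF
    return head if head.length == n_tokens

    remainder = n_tokens - head.length
    tail = corpus[0, remainder]                  # single wrap read from the start
    pad = Array.new(remainder - tail.length, 0)  # zeros if the wrap read is short too
    head + tail + pad
  end
end

corpus = [1, 2, 3, 4, 5]
p ReadSeqModel.read_seq(corpus, 0, 3)
# → [1, 2, 3]
p ReadSeqModel.read_seq(corpus, 3, 4)
# → [4, 5, 1, 2]
p ReadSeqModel.read_seq(corpus, 0, 12)
# → [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 0, 0]
```

In the last call the corpus holds 5 tokens, the wrap read supplies 5 of the remaining 7, and the final 2 slots stay 0; that is the case in which the real method warns.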



# File 'lib/toy/io/toy_corpus_loader.rb', line 25

def self.read_seq(path, byte_offset, n_tokens)
  buf = Array.new(n_tokens, 0)
  got = TinyNN.tnn_read_i32_file(path, byte_offset, n_tokens, buf)
  if got == n_tokens
    return buf
  end
  if got < 0
    puts "WARN: ToyCorpusLoader.read_seq rc=" + got.to_s + " path=" + path + " offset=" + byte_offset.to_s
    return buf
  end
  # Short read at EOF: wrap to 0 and try to fill the rest from the
  # start. If that also short-reads, the corpus cannot supply the
  # remainder, so the tail stays zero-padded and a warning is printed
  # (on every such call, not just the first).
  remainder = n_tokens - got
  wrap_buf = Array.new(remainder, 0)
  got2 = TinyNN.tnn_read_i32_file(path, 0, remainder, wrap_buf)
  if got2 < remainder
    puts "WARN: corpus shorter than n_tokens=" + n_tokens.to_s +
         " (got " + got.to_s + " + " + got2.to_s + "); padding with 0"
  end
  i = 0
  while i < remainder
    buf[got + i] = wrap_buf[i]
    i = i + 1
  end
  buf
end