Class: Ignis::Shared::NvArray

Inherits:
Object
  • Object
show all
Defined in:
lib/nnw/shared/nv_array.rb

Overview

NvArray — The canonical GPU tensor type for the entire Ignis system.

Ignis creates them. NvCCL moves them. WNAIS serializes them to NOVA. There is exactly ONE definition of NvArray in the codebase.

Thread safety: owner transitions, ref_count changes, and free operations are protected by a per-instance Mutex.

Constant Summary collapse

DTYPE_SIZES =

Returns dtype to byte size mapping.

Returns:

  • (Hash{Symbol => Integer})

    dtype to byte size mapping

{
  float16:  2,
  float32:  4,
  float64:  8,
  int32:    4,
  int64:    8,
  uint8:    1,
  bfloat16: 2
}.freeze
VALID_DTYPES =

Returns valid dtype symbols.

Returns:

  • (Array<Symbol>)

    valid dtype symbols

DTYPE_SIZES.keys.freeze
VALID_OWNERS =

Returns valid owner symbols.

Returns:

  • (Array<Symbol>)

    valid owner symbols

%i[nvruby nvccl wnais].freeze
MEMCPY_HOST_TO_DEVICE =

cudaMemcpy direction constants

1
MEMCPY_DEVICE_TO_HOST =
2
MEMCPY_DEVICE_TO_DEVICE =
3
CUDA_HOST_ALLOC_DEFAULT =

cudaHostAlloc flags

0
@@next_id =
0
@@id_mutex =
Mutex.new

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil) ⇒ NvArray

Initialize a new NvArray.

Parameters:

  • shape (Array<Integer>)

    tensor dimensions

  • dtype (Symbol)

    data type

  • device_id (Integer) (defaults to: 0)

    GPU device index

  • ptr (Fiddle::Pointer, nil) (defaults to: nil)

    pre-allocated device memory pointer

  • stream (Fiddle::Pointer, nil) (defaults to: nil)

    CUDA stream pointer

  • owner (Symbol) (defaults to: :nvruby)

    initial memory owner

Raises:

  • (ArgumentError)

    if shape, dtype, or owner are invalid



87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
# File 'lib/nnw/shared/nv_array.rb', line 87

def initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil)
  validate_shape!(shape)
  validate_dtype!(dtype)
  validate_owner!(owner)

  @shape = shape.dup.freeze
  @dtype = dtype
  @device_id = device_id
  @ptr = ptr
  @stream = stream
  @pinned_host_ptr = nil
  @owner = owner
  @ref_count = 0
  @mutex = Mutex.new
  @freed = false
  @created_at = Time.now
  # Memory ownership: we own (and must free) the device buffer only if we
  # allocate it ourselves. Arrays constructed with an external `ptr:` (slice,
  # reshape, from_device_ptr) are VIEWS — they must never free it.
  @owns_memory = ptr.nil?
  # Views retain a reference to their parent so the parent (and its memory)
  # stays alive for at least as long as the view does.
  @parent = parent

  @@id_mutex.synchronize do
    @id = @@next_id
    @@next_id += 1
  end
end

Instance Attribute Details

#created_atTime (readonly)

Returns creation timestamp.

Returns:

  • (Time)

    creation timestamp



73
74
75
# File 'lib/nnw/shared/nv_array.rb', line 73

def created_at
  @created_at
end

#device_idInteger (readonly)

Returns GPU device index.

Returns:

  • (Integer)

    GPU device index



52
53
54
# File 'lib/nnw/shared/nv_array.rb', line 52

def device_id
  @device_id
end

#dtypeSymbol (readonly)

Returns data type (:float16, :float32, :float64, :int32, :int64, :uint8, :bfloat16).

Returns:

  • (Symbol)

    data type (:float16, :float32, :float64, :int32, :int64, :uint8, :bfloat16)



49
50
51
# File 'lib/nnw/shared/nv_array.rb', line 49

def dtype
  @dtype
end

#idInteger (readonly)

Returns unique identifier for this array instance.

Returns:

  • (Integer)

    unique identifier for this array instance



70
71
72
# File 'lib/nnw/shared/nv_array.rb', line 70

def id
  @id
end

#ownerSymbol (readonly)

Returns current memory owner (:nvruby, :nvccl, or :wnais).

Returns:

  • (Symbol)

    current memory owner (:nvruby, :nvccl, or :wnais)



64
65
66
# File 'lib/nnw/shared/nv_array.rb', line 64

def owner
  @owner
end

#pinned_host_ptrFiddle::Pointer? (readonly)

Returns pinned host memory pointer (nullable, for P2P staging).

Returns:

  • (Fiddle::Pointer, nil)

    pinned host memory pointer (nullable, for P2P staging)



61
62
63
# File 'lib/nnw/shared/nv_array.rb', line 61

def pinned_host_ptr
  @pinned_host_ptr
end

#ptrFiddle::Pointer? (readonly)

Returns device memory pointer.

Returns:

  • (Fiddle::Pointer, nil)

    device memory pointer



55
56
57
# File 'lib/nnw/shared/nv_array.rb', line 55

def ptr
  @ptr
end

#ref_countInteger (readonly)

Returns thread-safe reference count for shared staging.

Returns:

  • (Integer)

    thread-safe reference count for shared staging



67
68
69
# File 'lib/nnw/shared/nv_array.rb', line 67

def ref_count
  @ref_count
end

#shapeArray<Integer> (readonly)

Returns tensor shape dimensions.

Returns:

  • (Array<Integer>)

    tensor shape dimensions



46
47
48
# File 'lib/nnw/shared/nv_array.rb', line 46

def shape
  @shape
end

#streamFiddle::Pointer? (readonly)

Returns CUDA stream pointer (nullable).

Returns:

  • (Fiddle::Pointer, nil)

    CUDA stream pointer (nullable)



58
59
60
# File 'lib/nnw/shared/nv_array.rb', line 58

def stream
  @stream
end

Class Method Details

.cuda_runtimeModule

CUDA runtime Fiddle bindings — lazily loaded singleton.

Returns:

  • (Module)

    module with CUDA runtime functions



612
613
614
# File 'lib/nnw/shared/nv_array.rb', line 612

def cuda_runtime
  @cuda_runtime ||= load_cuda_runtime
end

.release_finalizer(addr) ⇒ Proc

Finalizer that frees an owned device allocation on GC. Captures only the raw address (not self, which would pin the object and defeat GC) and swallows errors (interpreter shutdown may have unloaded the runtime).

Parameters:

  • addr (Integer)

    device pointer address

Returns:

  • (Proc)


621
622
623
624
625
626
627
628
629
# File 'lib/nnw/shared/nv_array.rb', line 621

def release_finalizer(addr)
  proc do
    begin
      cuda_runtime.cudaFree(Fiddle::Pointer.new(addr))
    rescue StandardError
      nil
    end
  end
end

Instance Method Details

#cloneNvArray

Deep-copy into a fresh, independently-owned device buffer (device→device cudaMemcpy). Unlike #slice, the returned array shares NO storage with self and owns its memory (registered for finalization, freed on GC).

The autograd tape relies on this: it accumulates gradients in place, so it must guarantee accumulator buffers never alias. Backward closures are free to return shared buffers (e.g. ‘+` returns [grad, grad]); the tape clones to restore exclusive ownership. DtoD copies raw bytes, so it is exact and dtype-agnostic (no float↔half round trip).

Returns:

  • (NvArray)

    independent owned copy with identical shape/dtype/values

Raises:

  • (RuntimeError)

    if this array has been freed or has no device pointer



285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
# File 'lib/nnw/shared/nv_array.rb', line 285

def clone
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} has no device pointer" if @ptr.nil?
  end

  copy = NvArray.new(shape: @shape, dtype: @dtype, device_id: @device_id,
                     stream: @stream, owner: @owner)
  # allocate_into the copy so the finalizer is registered on `copy` (owns_memory).
  dst = copy.send(:allocate_device_memory, size_bytes)
  copy.instance_variable_set(:@ptr, dst)

  status = cuda_rt.cudaMemcpy(dst, @ptr, size_bytes, MEMCPY_DEVICE_TO_DEVICE)
  raise "cudaMemcpy DtoD failed with status #{status}" unless status.zero?

  copy
end

#decrement_ref!Integer

Atomically decrement the reference count.

Returns:

  • (Integer)

    new ref_count value

Raises:

  • (RuntimeError)

    if ref_count is already 0



490
491
492
493
494
495
496
# File 'lib/nnw/shared/nv_array.rb', line 490

def decrement_ref!
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} ref_count is already 0" if @ref_count <= 0
    @ref_count -= 1
  end
end

#device_ffi_ptrFFI::Pointer

Device pointer wrapped as an FFI::Pointer for FFI-bound library calls (cuBLAS/cuSOLVER/cuFFT/cuRAND/cuSPARSE and the JIT kernel launcher).

Returns:

  • (FFI::Pointer)


175
176
177
178
# File 'lib/nnw/shared/nv_array.rb', line 175

def device_ffi_ptr
  to_device if @ptr.nil?
  ::FFI::Pointer.new(@ptr.to_i)
end

#device_indexInteger

Returns device index (Ignis::NvArray naming).

Returns:

  • (Integer)

    device index (Ignis::NvArray naming)



150
151
152
# File 'lib/nnw/shared/nv_array.rb', line 150

def device_index
  @device_id
end

#dtype_sizeInteger

Bytes per element for the current dtype.

Returns:

  • (Integer)


131
132
133
# File 'lib/nnw/shared/nv_array.rb', line 131

def dtype_size
  DTYPE_SIZES.fetch(@dtype)
end

#free!void

This method returns an undefined value.

Free device memory. Raises if ref_count > 0.

Raises:



445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
# File 'lib/nnw/shared/nv_array.rb', line 445

def free!
  @mutex.synchronize do
    raise "NvArray##{@id} has already been freed" if @freed

    # Refuse to free while pinned (ref_count > 0): another component holds
    # this buffer for shared staging (the documented contract). GC-time
    # reclamation goes through the finalizer, not free!, so this guard only
    # gates EXPLICIT frees — it never blocks normal collection.
    if @ref_count > 0
      raise MemoryContractViolation,
            "Cannot free NvArray##{@id} while ref_count=#{@ref_count} > 0"
    end

    unless @pinned_host_ptr.nil?
      cuda_rt.cudaFreeHost(@pinned_host_ptr)
      @pinned_host_ptr = nil
    end

    # Only the owner frees the device buffer. Views (slice/reshape) share the
    # parent's allocation and must NOT free it (that was the reshape/slice
    # double-free path); the parent frees it via free! or its GC finalizer.
    if @owns_memory && !@ptr.nil?
      status = cuda_rt.cudaFree(@ptr)
      raise "cudaFree failed with status #{status}" unless status.zero?
    end

    @ptr = nil
    @freed = true
  end
  # Cancel the GC finalizer so we don't cudaFree the same pointer twice.
  ObjectSpace.undefine_finalizer(self)
end

#freed?Boolean

Whether this array has been freed.

Returns:

  • (Boolean)


194
195
196
# File 'lib/nnw/shared/nv_array.rb', line 194

def freed?
  @mutex.synchronize { @freed }
end

#from_host(data) ⇒ self

Copy data from a Ruby Array to device memory.

Parameters:

  • data (Array<Numeric>)

    flat array of values to copy

Returns:

  • (self)

Raises:

  • (ArgumentError)

    if data size doesn’t match tensor element count

  • (RuntimeError)

    if array has been freed or cudaMemcpy fails



224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
# File 'lib/nnw/shared/nv_array.rb', line 224

def from_host(data)
  unless data.is_a?(Array) && data.length == numel
    raise ArgumentError, "Expected #{numel} elements, got #{data.length}"
  end

  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed

    if @ptr.nil?
      @ptr = allocate_device_memory(size_bytes)
    end
  end

  host_buf = pack_host_buffer(data)
  status = cuda_rt.cudaMemcpy(@ptr, host_buf, size_bytes, MEMCPY_HOST_TO_DEVICE)
  raise "cudaMemcpy HtoD failed with status #{status}" unless status.zero?

  self
end

#from_host_raw(bytes) ⇒ self

Copy a raw little-endian binary string straight to device memory.

The bytes must already be in the device dtype’s native layout (this is how safetensors / NOVA data is stored), so no per-element conversion is done —this avoids the lossy float<->half round trip that #from_host would incur.

Parameters:

  • bytes (String)

    binary string of exactly size_bytes length

Returns:

  • (self)

Raises:

  • (ArgumentError)

    if the byte count doesn’t match size_bytes

  • (RuntimeError)

    if array has been freed or cudaMemcpy fails



254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
# File 'lib/nnw/shared/nv_array.rb', line 254

def from_host_raw(bytes)
  unless bytes.bytesize == size_bytes
    raise ArgumentError, "Expected #{size_bytes} bytes, got #{bytes.bytesize}"
  end

  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed

    @ptr = allocate_device_memory(size_bytes) if @ptr.nil?
  end

  host_buf = Fiddle::Pointer.malloc(bytes.bytesize)
  host_buf[0, bytes.bytesize] = bytes
  status = cuda_rt.cudaMemcpy(@ptr, host_buf, size_bytes, MEMCPY_HOST_TO_DEVICE)
  raise "cudaMemcpy HtoD failed with status #{status}" unless status.zero?

  self
end

#increment_ref!Integer

Atomically increment the reference count.

Returns:

  • (Integer)

    new ref_count value



480
481
482
483
484
485
# File 'lib/nnw/shared/nv_array.rb', line 480

def increment_ref!
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    @ref_count += 1
  end
end

#ndimInteger

Returns number of dimensions.

Returns:

  • (Integer)

    number of dimensions



145
146
147
# File 'lib/nnw/shared/nv_array.rb', line 145

def ndim
  @shape.length
end

#numelInteger

Total number of elements in the tensor.

Returns:

  • (Integer)


119
120
121
# File 'lib/nnw/shared/nv_array.rb', line 119

def numel
  @shape.reduce(1, :*)
end

#on_device?Boolean

Returns whether device memory is allocated.

Returns:

  • (Boolean)

    whether device memory is allocated



155
156
157
# File 'lib/nnw/shared/nv_array.rb', line 155

def on_device?
  !@ptr.nil?
end

#pin!Fiddle::Pointer

Allocate pinned host memory for P2P staging.

Returns:

  • (Fiddle::Pointer)

    the pinned host pointer

Raises:

  • (RuntimeError)

    if cudaHostAlloc fails or already pinned



409
410
411
412
413
414
415
416
417
418
419
420
421
422
# File 'lib/nnw/shared/nv_array.rb', line 409

def pin!
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} is already pinned" unless @pinned_host_ptr.nil?

    ptr_buf = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
    status = cuda_rt.cudaHostAlloc(ptr_buf, size_bytes, CUDA_HOST_ALLOC_DEFAULT)
    raise "cudaHostAlloc failed with status #{status}" unless status.zero?

    @pinned_host_ptr = Fiddle::Pointer.new(ptr_buf[0, Fiddle::SIZEOF_VOIDP].unpack1('Q'))
  end

  @pinned_host_ptr
end

#size_bytesInteger

Size in bytes of the tensor data on device.

Returns:

  • (Integer)


125
126
127
# File 'lib/nnw/shared/nv_array.rb', line 125

def size_bytes
  numel * dtype_size
end

#slice(dim, start, len) ⇒ NvArray

Create a zero-copy slice along a dimension.

Returns a new NvArray that shares the same device memory but with an offset pointer and adjusted shape. No data is copied.

Parameters:

  • dim (Integer)

    dimension to slice along

  • start (Integer)

    starting index in the dimension

  • len (Integer)

    number of elements to include

Returns:

  • (NvArray)

    new array sharing device memory (no copy)

Raises:

  • (ArgumentError)

    if dim, start, or len are out of bounds



338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
# File 'lib/nnw/shared/nv_array.rb', line 338

def slice(dim, start, len)
  raise ArgumentError, "Dimension #{dim} out of range for shape #{@shape}" unless dim >= 0 && dim < @shape.length
  raise ArgumentError, "Slice range [#{start}, #{start + len}) exceeds dim size #{@shape[dim]}" unless start >= 0 && (start + len) <= @shape[dim]

  # A pointer-offset view is only CONTIGUOUS — and thus correct for a plain
  # numel-length read (to_host, kernels) — when nothing varies in the
  # dimensions BEFORE `dim`. For dim>0 with non-unit leading dims the slice
  # is strided (scattered across memory), which this zero-copy view cannot
  # represent: a consumer would read the wrong, contiguous elements with no
  # error. Fail loud instead of returning silently-wrong data.
  leading = @shape[0...dim].reduce(1, :*)
  if leading > 1
    raise ArgumentError,
          "slice(dim=#{dim}, ...) on shape #{@shape} is a strided (non-contiguous) view, " \
          "which NvArray#slice cannot represent; only contiguous slices are supported " \
          "(dim 0, or leading dims of size 1). Use a gather/copy kernel for strided slices."
  end

  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
  end

  new_shape = @shape.dup
  new_shape[dim] = len

  # Compute byte offset: product of trailing dimensions * start * dtype_size
  trailing = @shape[(dim + 1)..].reduce(1, :*)
  offset_bytes = start * trailing * dtype_size

  sliced_ptr = @ptr.nil? ? nil : Fiddle::Pointer.new(@ptr.to_i + offset_bytes, size_bytes - offset_bytes)

  sliced = NvArray.new(
    shape: new_shape,
    dtype: @dtype,
    device_id: @device_id,
    ptr: sliced_ptr,
    stream: @stream,
    owner: @owner,
    parent: self # view: non-owning, keeps parent alive (no leak / no double-free)
  )

  sliced
end

#to_deviceself

Ensure device memory is allocated. Shared arrays are device-resident, so this just allocates on first use; it exists for API parity with Ignis::NvArray#to_device.

Returns:

  • (self)


163
164
165
166
167
168
169
170
# File 'lib/nnw/shared/nv_array.rb', line 163

def to_device(*)
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed

    @ptr = allocate_device_memory(size_bytes) if @ptr.nil?
  end
  self
end

#to_hostArray<Numeric>

Copy device memory to host and return as a flat Ruby Array.

Uses cudaMemcpy with DtoH direction. The returned array contains numeric values decoded according to the dtype.

Returns:

  • (Array<Numeric>)

    flat array of host-side values

Raises:

  • (RuntimeError)

    if array has been freed or cudaMemcpy fails



205
206
207
208
209
210
211
212
213
214
215
216
# File 'lib/nnw/shared/nv_array.rb', line 205

def to_host
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} has no device pointer" if @ptr.nil?
  end

  host_buf = Fiddle::Pointer.malloc(size_bytes)
  status = cuda_rt.cudaMemcpy(host_buf, @ptr, size_bytes, MEMCPY_DEVICE_TO_HOST)
  raise "cudaMemcpy DtoH failed with status #{status}" unless status.zero?

  unpack_host_buffer(host_buf)
end

#to_sString Also known as: inspect

Human-readable string representation.

Returns:

  • (String)


500
501
502
503
# File 'lib/nnw/shared/nv_array.rb', line 500

def to_s
  "#<Ignis::Shared::NvArray id=#{@id} shape=#{@shape} dtype=#{@dtype} " \
    "device=#{@device_id} owner=#{@owner} ref_count=#{@ref_count} freed=#{@freed}>"
end

#transfer_ownership(new_owner) ⇒ Symbol

Atomically transfer memory ownership to a new owner.

Parameters:

  • new_owner (Symbol)

    the new owner (:nvruby, :nvccl, or :wnais)

Returns:

  • (Symbol)

    the new owner

Raises:



388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
# File 'lib/nnw/shared/nv_array.rb', line 388

def transfer_ownership(new_owner)
  validate_owner!(new_owner)

  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed

    if @ref_count > 1
      raise MemoryContractViolation,
            "Cannot transfer ownership of NvArray##{@id} while ref_count=#{@ref_count} > 1"
    end

    @owner = new_owner
  end

  new_owner
end

#unpin!void

This method returns an undefined value.

Free pinned host memory.

Raises:

  • (RuntimeError)

    if no pinned memory exists or cudaFreeHost fails



428
429
430
431
432
433
434
435
436
437
438
# File 'lib/nnw/shared/nv_array.rb', line 428

def unpin!
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} has no pinned memory" if @pinned_host_ptr.nil?

    status = cuda_rt.cudaFreeHost(@pinned_host_ptr)
    raise "cudaFreeHost failed with status #{status}" unless status.zero?

    @pinned_host_ptr = nil
  end
end

#write_rows!(src, start_row) ⇒ self

Copy a contiguous source array into this buffer starting at row start_row (device→device). Used to append K/V rows into a preallocated KV cache in O(row) instead of reallocating + recopying the whole cache each step.

Parameters:

  • src (NvArray)

    contiguous source ([r, cols] matching this array’s cols)

  • start_row (Integer)

    destination row offset (0-based)

Returns:

  • (self)

Raises:

  • (RuntimeError)

    if freed/unallocated, or the write would overflow



311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
# File 'lib/nnw/shared/nv_array.rb', line 311

def write_rows!(src, start_row)
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} has no device pointer" if @ptr.nil?
  end
  row_bytes = (numel / @shape[0]) * dtype_size
  offset = start_row * row_bytes
  if offset + src.size_bytes > size_bytes
    raise "write_rows! overflow: writing #{src.size_bytes} bytes at row #{start_row} " \
          "(offset #{offset}) exceeds #{size_bytes}-byte buffer"
  end
  dst = Fiddle::Pointer.new(@ptr.to_i + offset)
  status = cuda_rt.cudaMemcpy(dst, src.ptr, src.size_bytes, MEMCPY_DEVICE_TO_DEVICE)
  raise "cudaMemcpy DtoD (write_rows!) failed with status #{status}" unless status.zero?
  self
end

#zero!self

Zero the device buffer with cudaMemset (device-side). This is ~20x faster than the old ‘from_host(Array.new(numel, 0.0))` idiom, which allocated a huge Ruby array, packed it, and H2D-copied it on every op (0.5ms+/op, and seconds for the 38M-element LM-head weight transpose).

Returns:

  • (self)


185
186
187
188
189
190
# File 'lib/nnw/shared/nv_array.rb', line 185

def zero!
  to_device if @ptr.nil?
  status = cuda_rt.cudaMemset(@ptr, 0, size_bytes)
  raise "cudaMemset failed with status #{status}" unless status.zero?
  self
end