Class: Ignis::Shared::NvArray
- Inherits:
-
Object
- Object
- Ignis::Shared::NvArray
- Defined in:
- lib/nnw/shared/nv_array.rb
Overview
NvArray — The canonical GPU tensor type for the entire Ignis system.
Ignis creates them. NvCCL moves them. WNAIS serializes them to NOVA. There is exactly ONE definition of NvArray in the codebase.
Thread safety: owner transitions, ref_count changes, and free operations are protected by a per-instance Mutex.
Constant Summary collapse
- DTYPE_SIZES =
Returns dtype to byte size mapping.
{ float16: 2, float32: 4, float64: 8, int32: 4, int64: 8, uint8: 1, bfloat16: 2 }.freeze
- VALID_DTYPES =
Returns valid dtype symbols.
DTYPE_SIZES.keys.freeze
- VALID_OWNERS =
Returns valid owner symbols.
%i[nvruby nvccl wnais].freeze
- MEMCPY_HOST_TO_DEVICE =
cudaMemcpy direction constants
1- MEMCPY_DEVICE_TO_HOST =
2- MEMCPY_DEVICE_TO_DEVICE =
3- CUDA_HOST_ALLOC_DEFAULT =
cudaHostAlloc flags
0- @@next_id =
0- @@id_mutex =
Mutex.new
Instance Attribute Summary collapse
-
#created_at ⇒ Time
readonly
Creation timestamp.
-
#device_id ⇒ Integer
readonly
GPU device index.
-
#dtype ⇒ Symbol
readonly
Data type (:float16, :float32, :float64, :int32, :int64, :uint8, :bfloat16).
-
#id ⇒ Integer
readonly
Unique identifier for this array instance.
-
#owner ⇒ Symbol
readonly
Current memory owner (:nvruby, :nvccl, or :wnais).
-
#pinned_host_ptr ⇒ Fiddle::Pointer?
readonly
Pinned host memory pointer (nullable, for P2P staging).
-
#ptr ⇒ Fiddle::Pointer?
readonly
Device memory pointer.
-
#ref_count ⇒ Integer
readonly
Thread-safe reference count for shared staging.
-
#shape ⇒ Array<Integer>
readonly
Tensor shape dimensions.
-
#stream ⇒ Fiddle::Pointer?
readonly
CUDA stream pointer (nullable).
Class Method Summary collapse
-
.cuda_runtime ⇒ Module
CUDA runtime Fiddle bindings — lazily loaded singleton.
-
.release_finalizer(addr) ⇒ Proc
Finalizer that frees an owned device allocation on GC.
Instance Method Summary collapse
-
#clone ⇒ NvArray
Deep-copy into a fresh, independently-owned device buffer (device→device cudaMemcpy).
-
#decrement_ref! ⇒ Integer
Atomically decrement the reference count.
-
#device_ffi_ptr ⇒ FFI::Pointer
Device pointer wrapped as an FFI::Pointer for FFI-bound library calls (cuBLAS/cuSOLVER/cuFFT/cuRAND/cuSPARSE and the JIT kernel launcher).
-
#device_index ⇒ Integer
Device index (Ignis::NvArray naming).
-
#dtype_size ⇒ Integer
Bytes per element for the current dtype.
-
#free! ⇒ void
Free device memory.
-
#freed? ⇒ Boolean
Whether this array has been freed.
-
#from_host(data) ⇒ self
Copy data from a Ruby Array to device memory.
-
#from_host_raw(bytes) ⇒ self
Copy a raw little-endian binary string straight to device memory.
-
#increment_ref! ⇒ Integer
Atomically increment the reference count.
-
#initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil) ⇒ NvArray
constructor
Initialize a new NvArray.
-
#ndim ⇒ Integer
Number of dimensions.
-
#numel ⇒ Integer
Total number of elements in the tensor.
-
#on_device? ⇒ Boolean
Whether device memory is allocated.
-
#pin! ⇒ Fiddle::Pointer
Allocate pinned host memory for P2P staging.
-
#size_bytes ⇒ Integer
Size in bytes of the tensor data on device.
-
#slice(dim, start, len) ⇒ NvArray
Create a zero-copy slice along a dimension.
-
#to_device ⇒ self
Ensure device memory is allocated.
-
#to_host ⇒ Array<Numeric>
Copy device memory to host and return as a flat Ruby Array.
-
#to_s ⇒ String
(also: #inspect)
Human-readable string representation.
-
#transfer_ownership(new_owner) ⇒ Symbol
Atomically transfer memory ownership to a new owner.
-
#unpin! ⇒ void
Free pinned host memory.
-
#write_rows!(src, start_row) ⇒ self
Copy a contiguous source array into this buffer starting at row
start_row(device→device). -
#zero! ⇒ self
Zero the device buffer with cudaMemset (device-side).
Constructor Details
#initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil) ⇒ NvArray
Initialize a new NvArray.
87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 |
# File 'lib/nnw/shared/nv_array.rb', line 87 def initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil) validate_shape!(shape) validate_dtype!(dtype) validate_owner!(owner) @shape = shape.dup.freeze @dtype = dtype @device_id = device_id @ptr = ptr @stream = stream @pinned_host_ptr = nil @owner = owner @ref_count = 0 @mutex = Mutex.new @freed = false @created_at = Time.now # Memory ownership: we own (and must free) the device buffer only if we # allocate it ourselves. Arrays constructed with an external `ptr:` (slice, # reshape, from_device_ptr) are VIEWS — they must never free it. @owns_memory = ptr.nil? # Views retain a reference to their parent so the parent (and its memory) # stays alive for at least as long as the view does. @parent = parent @@id_mutex.synchronize do @id = @@next_id @@next_id += 1 end end |
Instance Attribute Details
#created_at ⇒ Time (readonly)
Returns creation timestamp.
73 74 75 |
# File 'lib/nnw/shared/nv_array.rb', line 73 def created_at @created_at end |
#device_id ⇒ Integer (readonly)
Returns GPU device index.
52 53 54 |
# File 'lib/nnw/shared/nv_array.rb', line 52 def device_id @device_id end |
#dtype ⇒ Symbol (readonly)
Returns data type (:float16, :float32, :float64, :int32, :int64, :uint8, :bfloat16).
49 50 51 |
# File 'lib/nnw/shared/nv_array.rb', line 49 def dtype @dtype end |
#id ⇒ Integer (readonly)
Returns unique identifier for this array instance.
70 71 72 |
# File 'lib/nnw/shared/nv_array.rb', line 70 def id @id end |
#owner ⇒ Symbol (readonly)
Returns current memory owner (:nvruby, :nvccl, or :wnais).
64 65 66 |
# File 'lib/nnw/shared/nv_array.rb', line 64 def owner @owner end |
#pinned_host_ptr ⇒ Fiddle::Pointer? (readonly)
Returns pinned host memory pointer (nullable, for P2P staging).
61 62 63 |
# File 'lib/nnw/shared/nv_array.rb', line 61 def pinned_host_ptr @pinned_host_ptr end |
#ptr ⇒ Fiddle::Pointer? (readonly)
Returns device memory pointer.
55 56 57 |
# File 'lib/nnw/shared/nv_array.rb', line 55 def ptr @ptr end |
#ref_count ⇒ Integer (readonly)
Returns thread-safe reference count for shared staging.
67 68 69 |
# File 'lib/nnw/shared/nv_array.rb', line 67 def ref_count @ref_count end |
#shape ⇒ Array<Integer> (readonly)
Returns tensor shape dimensions.
46 47 48 |
# File 'lib/nnw/shared/nv_array.rb', line 46 def shape @shape end |
#stream ⇒ Fiddle::Pointer? (readonly)
Returns CUDA stream pointer (nullable).
58 59 60 |
# File 'lib/nnw/shared/nv_array.rb', line 58 def stream @stream end |
Class Method Details
.cuda_runtime ⇒ Module
CUDA runtime Fiddle bindings — lazily loaded singleton.
612 613 614 |
# File 'lib/nnw/shared/nv_array.rb', line 612 def cuda_runtime @cuda_runtime ||= load_cuda_runtime end |
.release_finalizer(addr) ⇒ Proc
Finalizer that frees an owned device allocation on GC. Captures only the raw address (not self, which would pin the object and defeat GC) and swallows errors (interpreter shutdown may have unloaded the runtime).
621 622 623 624 625 626 627 628 629 |
# File 'lib/nnw/shared/nv_array.rb', line 621 def release_finalizer(addr) proc do begin cuda_runtime.cudaFree(Fiddle::Pointer.new(addr)) rescue StandardError nil end end end |
Instance Method Details
#clone ⇒ NvArray
Deep-copy into a fresh, independently-owned device buffer (device→device cudaMemcpy). Unlike #slice, the returned array shares NO storage with self and owns its memory (registered for finalization, freed on GC).
The autograd tape relies on this: it accumulates gradients in place, so it must guarantee accumulator buffers never alias. Backward closures are free to return shared buffers (e.g. ‘+` returns [grad, grad]); the tape clones to restore exclusive ownership. DtoD copies raw bytes, so it is exact and dtype-agnostic (no float↔half round trip).
285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 |
# File 'lib/nnw/shared/nv_array.rb', line 285 def clone @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed raise "NvArray##{@id} has no device pointer" if @ptr.nil? end copy = NvArray.new(shape: @shape, dtype: @dtype, device_id: @device_id, stream: @stream, owner: @owner) # allocate_into the copy so the finalizer is registered on `copy` (owns_memory). dst = copy.send(:allocate_device_memory, size_bytes) copy.instance_variable_set(:@ptr, dst) status = cuda_rt.cudaMemcpy(dst, @ptr, size_bytes, MEMCPY_DEVICE_TO_DEVICE) raise "cudaMemcpy DtoD failed with status #{status}" unless status.zero? copy end |
#decrement_ref! ⇒ Integer
Atomically decrement the reference count.
490 491 492 493 494 495 496 |
# File 'lib/nnw/shared/nv_array.rb', line 490 def decrement_ref! @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed raise "NvArray##{@id} ref_count is already 0" if @ref_count <= 0 @ref_count -= 1 end end |
#device_ffi_ptr ⇒ FFI::Pointer
Device pointer wrapped as an FFI::Pointer for FFI-bound library calls (cuBLAS/cuSOLVER/cuFFT/cuRAND/cuSPARSE and the JIT kernel launcher).
175 176 177 178 |
# File 'lib/nnw/shared/nv_array.rb', line 175 def device_ffi_ptr to_device if @ptr.nil? ::FFI::Pointer.new(@ptr.to_i) end |
#device_index ⇒ Integer
Returns device index (Ignis::NvArray naming).
150 151 152 |
# File 'lib/nnw/shared/nv_array.rb', line 150 def device_index @device_id end |
#dtype_size ⇒ Integer
Bytes per element for the current dtype.
131 132 133 |
# File 'lib/nnw/shared/nv_array.rb', line 131 def dtype_size DTYPE_SIZES.fetch(@dtype) end |
#free! ⇒ void
This method returns an undefined value.
Free device memory. Raises if ref_count > 0.
445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 |
# File 'lib/nnw/shared/nv_array.rb', line 445 def free! @mutex.synchronize do raise "NvArray##{@id} has already been freed" if @freed # Refuse to free while pinned (ref_count > 0): another component holds # this buffer for shared staging (the documented contract). GC-time # reclamation goes through the finalizer, not free!, so this guard only # gates EXPLICIT frees — it never blocks normal collection. if @ref_count > 0 raise MemoryContractViolation, "Cannot free NvArray##{@id} while ref_count=#{@ref_count} > 0" end unless @pinned_host_ptr.nil? cuda_rt.cudaFreeHost(@pinned_host_ptr) @pinned_host_ptr = nil end # Only the owner frees the device buffer. Views (slice/reshape) share the # parent's allocation and must NOT free it (that was the reshape/slice # double-free path); the parent frees it via free! or its GC finalizer. if @owns_memory && !@ptr.nil? status = cuda_rt.cudaFree(@ptr) raise "cudaFree failed with status #{status}" unless status.zero? end @ptr = nil @freed = true end # Cancel the GC finalizer so we don't cudaFree the same pointer twice. ObjectSpace.undefine_finalizer(self) end |
#freed? ⇒ Boolean
Whether this array has been freed.
194 195 196 |
# File 'lib/nnw/shared/nv_array.rb', line 194 def freed? @mutex.synchronize { @freed } end |
#from_host(data) ⇒ self
Copy data from a Ruby Array to device memory.
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 |
# File 'lib/nnw/shared/nv_array.rb', line 224 def from_host(data) unless data.is_a?(Array) && data.length == numel raise ArgumentError, "Expected #{numel} elements, got #{data.length}" end @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed if @ptr.nil? @ptr = allocate_device_memory(size_bytes) end end host_buf = pack_host_buffer(data) status = cuda_rt.cudaMemcpy(@ptr, host_buf, size_bytes, MEMCPY_HOST_TO_DEVICE) raise "cudaMemcpy HtoD failed with status #{status}" unless status.zero? self end |
#from_host_raw(bytes) ⇒ self
Copy a raw little-endian binary string straight to device memory.
The bytes must already be in the device dtype’s native layout (this is how safetensors / NOVA data is stored), so no per-element conversion is done —this avoids the lossy float<->half round trip that #from_host would incur.
254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 |
# File 'lib/nnw/shared/nv_array.rb', line 254 def from_host_raw(bytes) unless bytes.bytesize == size_bytes raise ArgumentError, "Expected #{size_bytes} bytes, got #{bytes.bytesize}" end @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed @ptr = allocate_device_memory(size_bytes) if @ptr.nil? end host_buf = Fiddle::Pointer.malloc(bytes.bytesize) host_buf[0, bytes.bytesize] = bytes status = cuda_rt.cudaMemcpy(@ptr, host_buf, size_bytes, MEMCPY_HOST_TO_DEVICE) raise "cudaMemcpy HtoD failed with status #{status}" unless status.zero? self end |
#increment_ref! ⇒ Integer
Atomically increment the reference count.
480 481 482 483 484 485 |
# File 'lib/nnw/shared/nv_array.rb', line 480 def increment_ref! @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed @ref_count += 1 end end |
#ndim ⇒ Integer
Returns number of dimensions.
145 146 147 |
# File 'lib/nnw/shared/nv_array.rb', line 145 def ndim @shape.length end |
#numel ⇒ Integer
Total number of elements in the tensor.
119 120 121 |
# File 'lib/nnw/shared/nv_array.rb', line 119 def numel @shape.reduce(1, :*) end |
#on_device? ⇒ Boolean
Returns whether device memory is allocated.
155 156 157 |
# File 'lib/nnw/shared/nv_array.rb', line 155 def on_device? !@ptr.nil? end |
#pin! ⇒ Fiddle::Pointer
Allocate pinned host memory for P2P staging.
409 410 411 412 413 414 415 416 417 418 419 420 421 422 |
# File 'lib/nnw/shared/nv_array.rb', line 409 def pin! @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed raise "NvArray##{@id} is already pinned" unless @pinned_host_ptr.nil? ptr_buf = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP) status = cuda_rt.cudaHostAlloc(ptr_buf, size_bytes, CUDA_HOST_ALLOC_DEFAULT) raise "cudaHostAlloc failed with status #{status}" unless status.zero? @pinned_host_ptr = Fiddle::Pointer.new(ptr_buf[0, Fiddle::SIZEOF_VOIDP].unpack1('Q')) end @pinned_host_ptr end |
#size_bytes ⇒ Integer
Size in bytes of the tensor data on device.
125 126 127 |
# File 'lib/nnw/shared/nv_array.rb', line 125 def size_bytes numel * dtype_size end |
#slice(dim, start, len) ⇒ NvArray
Create a zero-copy slice along a dimension.
Returns a new NvArray that shares the same device memory but with an offset pointer and adjusted shape. No data is copied.
338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 |
# File 'lib/nnw/shared/nv_array.rb', line 338 def slice(dim, start, len) raise ArgumentError, "Dimension #{dim} out of range for shape #{@shape}" unless dim >= 0 && dim < @shape.length raise ArgumentError, "Slice range [#{start}, #{start + len}) exceeds dim size #{@shape[dim]}" unless start >= 0 && (start + len) <= @shape[dim] # A pointer-offset view is only CONTIGUOUS — and thus correct for a plain # numel-length read (to_host, kernels) — when nothing varies in the # dimensions BEFORE `dim`. For dim>0 with non-unit leading dims the slice # is strided (scattered across memory), which this zero-copy view cannot # represent: a consumer would read the wrong, contiguous elements with no # error. Fail loud instead of returning silently-wrong data. leading = @shape[0...dim].reduce(1, :*) if leading > 1 raise ArgumentError, "slice(dim=#{dim}, ...) on shape #{@shape} is a strided (non-contiguous) view, " \ "which NvArray#slice cannot represent; only contiguous slices are supported " \ "(dim 0, or leading dims of size 1). Use a gather/copy kernel for strided slices." end @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed end new_shape = @shape.dup new_shape[dim] = len # Compute byte offset: product of trailing dimensions * start * dtype_size trailing = @shape[(dim + 1)..].reduce(1, :*) offset_bytes = start * trailing * dtype_size sliced_ptr = @ptr.nil? ? nil : Fiddle::Pointer.new(@ptr.to_i + offset_bytes, size_bytes - offset_bytes) sliced = NvArray.new( shape: new_shape, dtype: @dtype, device_id: @device_id, ptr: sliced_ptr, stream: @stream, owner: @owner, parent: self # view: non-owning, keeps parent alive (no leak / no double-free) ) sliced end |
#to_device ⇒ self
Ensure device memory is allocated. Shared arrays are device-resident, so this just allocates on first use; it exists for API parity with Ignis::NvArray#to_device.
163 164 165 166 167 168 169 170 |
# File 'lib/nnw/shared/nv_array.rb', line 163 def to_device(*) @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed @ptr = allocate_device_memory(size_bytes) if @ptr.nil? end self end |
#to_host ⇒ Array<Numeric>
Copy device memory to host and return as a flat Ruby Array.
Uses cudaMemcpy with DtoH direction. The returned array contains numeric values decoded according to the dtype.
205 206 207 208 209 210 211 212 213 214 215 216 |
# File 'lib/nnw/shared/nv_array.rb', line 205 def to_host @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed raise "NvArray##{@id} has no device pointer" if @ptr.nil? end host_buf = Fiddle::Pointer.malloc(size_bytes) status = cuda_rt.cudaMemcpy(host_buf, @ptr, size_bytes, MEMCPY_DEVICE_TO_HOST) raise "cudaMemcpy DtoH failed with status #{status}" unless status.zero? unpack_host_buffer(host_buf) end |
#to_s ⇒ String Also known as: inspect
Human-readable string representation.
500 501 502 503 |
# File 'lib/nnw/shared/nv_array.rb', line 500 def to_s "#<Ignis::Shared::NvArray id=#{@id} shape=#{@shape} dtype=#{@dtype} " \ "device=#{@device_id} owner=#{@owner} ref_count=#{@ref_count} freed=#{@freed}>" end |
#transfer_ownership(new_owner) ⇒ Symbol
Atomically transfer memory ownership to a new owner.
388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 |
# File 'lib/nnw/shared/nv_array.rb', line 388 def transfer_ownership(new_owner) validate_owner!(new_owner) @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed if @ref_count > 1 raise MemoryContractViolation, "Cannot transfer ownership of NvArray##{@id} while ref_count=#{@ref_count} > 1" end @owner = new_owner end new_owner end |
#unpin! ⇒ void
This method returns an undefined value.
Free pinned host memory.
428 429 430 431 432 433 434 435 436 437 438 |
# File 'lib/nnw/shared/nv_array.rb', line 428 def unpin! @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed raise "NvArray##{@id} has no pinned memory" if @pinned_host_ptr.nil? status = cuda_rt.cudaFreeHost(@pinned_host_ptr) raise "cudaFreeHost failed with status #{status}" unless status.zero? @pinned_host_ptr = nil end end |
#write_rows!(src, start_row) ⇒ self
Copy a contiguous source array into this buffer starting at row start_row (device→device). Used to append K/V rows into a preallocated KV cache in O(row) instead of reallocating + recopying the whole cache each step.
311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 |
# File 'lib/nnw/shared/nv_array.rb', line 311 def write_rows!(src, start_row) @mutex.synchronize do raise "NvArray##{@id} has been freed" if @freed raise "NvArray##{@id} has no device pointer" if @ptr.nil? end row_bytes = (numel / @shape[0]) * dtype_size offset = start_row * row_bytes if offset + src.size_bytes > size_bytes raise "write_rows! overflow: writing #{src.size_bytes} bytes at row #{start_row} " \ "(offset #{offset}) exceeds #{size_bytes}-byte buffer" end dst = Fiddle::Pointer.new(@ptr.to_i + offset) status = cuda_rt.cudaMemcpy(dst, src.ptr, src.size_bytes, MEMCPY_DEVICE_TO_DEVICE) raise "cudaMemcpy DtoD (write_rows!) failed with status #{status}" unless status.zero? self end |
#zero! ⇒ self
Zero the device buffer with cudaMemset (device-side). This is ~20x faster than the old ‘from_host(Array.new(numel, 0.0))` idiom, which allocated a huge Ruby array, packed it, and H2D-copied it on every op (0.5ms+/op, and seconds for the 38M-element LM-head weight transpose).
185 186 187 188 189 190 |
# File 'lib/nnw/shared/nv_array.rb', line 185 def zero! to_device if @ptr.nil? status = cuda_rt.cudaMemset(@ptr, 0, size_bytes) raise "cudaMemset failed with status #{status}" unless status.zero? self end |