Class: Ignis::Shared::NvArray

Inherits:

Object

Object
Ignis::Shared::NvArray

show all

Defined in:: lib/nnw/shared/nv_array.rb

Overview

NvArray — The canonical GPU tensor type for the entire Ignis system.

Ignis creates them. NvCCL moves them. WNAIS serializes them to NOVA. There is exactly ONE definition of NvArray in the codebase.

Thread safety: owner transitions, ref_count changes, and free operations are protected by a per-instance Mutex.

Constant Summary collapse

DTYPE_SIZES = Returns dtype to byte size mapping. Returns: (Hash{Symbol => Integer}) — dtype to byte size mapping

{
  float16:  2,
  float32:  4,
  float64:  8,
  int32:    4,
  int64:    8,
  uint8:    1,
  bfloat16: 2
}.freeze

VALID_DTYPES = Returns valid dtype symbols. Returns: (Array<Symbol>) — valid dtype symbols

DTYPE_SIZES.keys.freeze

VALID_OWNERS = Returns valid owner symbols. Returns: (Array<Symbol>) — valid owner symbols

%i[nvruby nvccl wnais].freeze

MEMCPY_HOST_TO_DEVICE = cudaMemcpy direction constants

MEMCPY_DEVICE_TO_HOST =

MEMCPY_DEVICE_TO_DEVICE =

CUDA_HOST_ALLOC_DEFAULT = cudaHostAlloc flags

@@next_id =

@@id_mutex =

Mutex.new

Instance Attribute Summary collapse

#created_at ⇒ Time readonly

Creation timestamp.
#device_id ⇒ Integer readonly

GPU device index.
#dtype ⇒ Symbol readonly

Data type (:float16, :float32, :float64, :int32, :int64, :uint8, :bfloat16).
#id ⇒ Integer readonly

Unique identifier for this array instance.
#owner ⇒ Symbol readonly

Current memory owner (:nvruby, :nvccl, or :wnais).
#pinned_host_ptr ⇒ Fiddle::Pointer^? readonly

Pinned host memory pointer (nullable, for P2P staging).
#ptr ⇒ Fiddle::Pointer^? readonly

Device memory pointer.
#ref_count ⇒ Integer readonly

Thread-safe reference count for shared staging.
#shape ⇒ Array<Integer> readonly

Tensor shape dimensions.
#stream ⇒ Fiddle::Pointer^? readonly

CUDA stream pointer (nullable).

Class Method Summary collapse

.cuda_runtime ⇒ Module

CUDA runtime Fiddle bindings — lazily loaded singleton.
.release_finalizer(addr) ⇒ Proc

Finalizer that frees an owned device allocation on GC.

Instance Method Summary collapse

#clone ⇒ NvArray

Deep-copy into a fresh, independently-owned device buffer (device→device cudaMemcpy).
#decrement_ref! ⇒ Integer

Atomically decrement the reference count.
#device_ffi_ptr ⇒ FFI::Pointer

Device pointer wrapped as an FFI::Pointer for FFI-bound library calls (cuBLAS/cuSOLVER/cuFFT/cuRAND/cuSPARSE and the JIT kernel launcher).
#device_index ⇒ Integer

Device index (Ignis::NvArray naming).
#dtype_size ⇒ Integer

Bytes per element for the current dtype.
#free! ⇒ void

Free device memory.
#freed? ⇒ Boolean

Whether this array has been freed.
#from_host(data) ⇒ self

Copy data from a Ruby Array to device memory.
#from_host_raw(bytes) ⇒ self

Copy a raw little-endian binary string straight to device memory.
#increment_ref! ⇒ Integer

Atomically increment the reference count.
#initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil) ⇒ NvArray constructor

Initialize a new NvArray.
#ndim ⇒ Integer

Number of dimensions.
#numel ⇒ Integer

Total number of elements in the tensor.
#on_device? ⇒ Boolean

Whether device memory is allocated.
#pin! ⇒ Fiddle::Pointer

Allocate pinned host memory for P2P staging.
#size_bytes ⇒ Integer

Size in bytes of the tensor data on device.
#slice(dim, start, len) ⇒ NvArray

Create a zero-copy slice along a dimension.
#to_device ⇒ self

Ensure device memory is allocated.
#to_host ⇒ Array<Numeric>

Copy device memory to host and return as a flat Ruby Array.
#to_s ⇒ String (also: #inspect)

Human-readable string representation.
#transfer_ownership(new_owner) ⇒ Symbol

Atomically transfer memory ownership to a new owner.
#unpin! ⇒ void

Free pinned host memory.
#write_rows!(src, start_row) ⇒ self

Copy a contiguous source array into this buffer starting at row start_row (device→device).
#zero! ⇒ self

Zero the device buffer with cudaMemset (device-side).

Constructor Details

#initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil) ⇒ `NvArray`

Initialize a new NvArray.

Parameters:

shape (Array<Integer>) —

tensor dimensions
dtype (Symbol) —

data type
device_id (Integer) (defaults to: 0) —

GPU device index
ptr (Fiddle::Pointer, nil) (defaults to: nil) —

pre-allocated device memory pointer
stream (Fiddle::Pointer, nil) (defaults to: nil) —

CUDA stream pointer
owner (Symbol) (defaults to: :nvruby) —

initial memory owner

Raises:

(ArgumentError) —

if shape, dtype, or owner are invalid

# File 'lib/nnw/shared/nv_array.rb', line 87

def initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil)
  validate_shape!(shape)
  validate_dtype!(dtype)
  validate_owner!(owner)

  @shape = shape.dup.freeze
  @dtype = dtype
  @device_id = device_id
  @ptr = ptr
  @stream = stream
  @pinned_host_ptr = nil
  @owner = owner
  @ref_count = 0
  @mutex = Mutex.new
  @freed = false
  @created_at = Time.now
  # Memory ownership: we own (and must free) the device buffer only if we
  # allocate it ourselves. Arrays constructed with an external `ptr:` (slice,
  # reshape, from_device_ptr) are VIEWS — they must never free it.
  @owns_memory = ptr.nil?
  # Views retain a reference to their parent so the parent (and its memory)
  # stays alive for at least as long as the view does.
  @parent = parent

  @@id_mutex.synchronize do
    @id = @@next_id
    @@next_id += 1
  end
end

Instance Attribute Details

#created_at ⇒ `Time` (readonly)

Returns creation timestamp.

Returns:

(Time) —

creation timestamp



73
74
75

# File 'lib/nnw/shared/nv_array.rb', line 73

def created_at
  @created_at
end

#device_id ⇒ `Integer` (readonly)

Returns GPU device index.

Returns:

(Integer) —

GPU device index



52
53
54

# File 'lib/nnw/shared/nv_array.rb', line 52

def device_id
  @device_id
end

#dtype ⇒ `Symbol` (readonly)

Returns data type (:float16, :float32, :float64, :int32, :int64, :uint8, :bfloat16).

Returns:

(Symbol) —

data type (:float16, :float32, :float64, :int32, :int64, :uint8, :bfloat16)



49
50
51

# File 'lib/nnw/shared/nv_array.rb', line 49

def dtype
  @dtype
end

#id ⇒ `Integer` (readonly)

Returns unique identifier for this array instance.

Returns:

(Integer) —

unique identifier for this array instance



70
71
72

# File 'lib/nnw/shared/nv_array.rb', line 70

def id
  @id
end

#owner ⇒ `Symbol` (readonly)

Returns current memory owner (:nvruby, :nvccl, or :wnais).

Returns:

(Symbol) —

current memory owner (:nvruby, :nvccl, or :wnais)



64
65
66

# File 'lib/nnw/shared/nv_array.rb', line 64

def owner
  @owner
end

#pinned_host_ptr ⇒ `Fiddle::Pointer`^? (readonly)

Returns pinned host memory pointer (nullable, for P2P staging).

Returns:

(Fiddle::Pointer, nil) —

pinned host memory pointer (nullable, for P2P staging)



61
62
63

# File 'lib/nnw/shared/nv_array.rb', line 61

def pinned_host_ptr
  @pinned_host_ptr
end

#ptr ⇒ `Fiddle::Pointer`^? (readonly)

Returns device memory pointer.

Returns:

(Fiddle::Pointer, nil) —

device memory pointer



55
56
57

# File 'lib/nnw/shared/nv_array.rb', line 55

def ptr
  @ptr
end

#ref_count ⇒ `Integer` (readonly)

Returns thread-safe reference count for shared staging.

Returns:

(Integer) —

thread-safe reference count for shared staging



67
68
69

# File 'lib/nnw/shared/nv_array.rb', line 67

def ref_count
  @ref_count
end

#shape ⇒ `Array<Integer>` (readonly)

Returns tensor shape dimensions.

Returns:

(Array<Integer>) —

tensor shape dimensions



46
47
48

# File 'lib/nnw/shared/nv_array.rb', line 46

def shape
  @shape
end

#stream ⇒ `Fiddle::Pointer`^? (readonly)

Returns CUDA stream pointer (nullable).

Returns:

(Fiddle::Pointer, nil) —

CUDA stream pointer (nullable)



58
59
60

# File 'lib/nnw/shared/nv_array.rb', line 58

def stream
  @stream
end

Class Method Details

.cuda_runtime ⇒ `Module`

CUDA runtime Fiddle bindings — lazily loaded singleton.

Returns:

(Module) —

module with CUDA runtime functions



612
613
614

# File 'lib/nnw/shared/nv_array.rb', line 612

def cuda_runtime
  @cuda_runtime ||= load_cuda_runtime
end

.release_finalizer(addr) ⇒ `Proc`

Finalizer that frees an owned device allocation on GC. Captures only the raw address (not self, which would pin the object and defeat GC) and swallows errors (interpreter shutdown may have unloaded the runtime).

Parameters:

addr (Integer) —

device pointer address

Returns:

(Proc)

# File 'lib/nnw/shared/nv_array.rb', line 621

def release_finalizer(addr)
  proc do
    begin
      cuda_runtime.cudaFree(Fiddle::Pointer.new(addr))
    rescue StandardError
      nil
    end
  end
end

Instance Method Details

#clone ⇒ `NvArray`

Deep-copy into a fresh, independently-owned device buffer (device→device cudaMemcpy). Unlike #slice, the returned array shares NO storage with self and owns its memory (registered for finalization, freed on GC).

The autograd tape relies on this: it accumulates gradients in place, so it must guarantee accumulator buffers never alias. Backward closures are free to return shared buffers (e.g. ‘+` returns [grad, grad]); the tape clones to restore exclusive ownership. DtoD copies raw bytes, so it is exact and dtype-agnostic (no float↔half round trip).

Returns:

(NvArray) —

independent owned copy with identical shape/dtype/values

Raises:

(RuntimeError) —

if this array has been freed or has no device pointer

# File 'lib/nnw/shared/nv_array.rb', line 285

def clone
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} has no device pointer" if @ptr.nil?
  end

  copy = NvArray.new(shape: @shape, dtype: @dtype, device_id: @device_id,
                     stream: @stream, owner: @owner)
  # allocate_into the copy so the finalizer is registered on `copy` (owns_memory).
  dst = copy.send(:allocate_device_memory, size_bytes)
  copy.instance_variable_set(:@ptr, dst)

  status = cuda_rt.cudaMemcpy(dst, @ptr, size_bytes, MEMCPY_DEVICE_TO_DEVICE)
  raise "cudaMemcpy DtoD failed with status #{status}" unless status.zero?

  copy
end

#decrement_ref! ⇒ `Integer`

Atomically decrement the reference count.

Returns:

(Integer) —

new ref_count value

Raises:

(RuntimeError) —

if ref_count is already 0

# File 'lib/nnw/shared/nv_array.rb', line 490

def decrement_ref!
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} ref_count is already 0" if @ref_count <= 0
    @ref_count -= 1
  end
end

#device_ffi_ptr ⇒ `FFI::Pointer`

Device pointer wrapped as an FFI::Pointer for FFI-bound library calls (cuBLAS/cuSOLVER/cuFFT/cuRAND/cuSPARSE and the JIT kernel launcher).

Returns:

(FFI::Pointer)

# File 'lib/nnw/shared/nv_array.rb', line 175

def device_ffi_ptr
  to_device if @ptr.nil?
  ::FFI::Pointer.new(@ptr.to_i)
end

#device_index ⇒ `Integer`

Returns device index (Ignis::NvArray naming).

Returns:

(Integer) —

device index (Ignis::NvArray naming)



150
151
152

# File 'lib/nnw/shared/nv_array.rb', line 150

def device_index
  @device_id
end

#dtype_size ⇒ `Integer`

Bytes per element for the current dtype.

Returns:

(Integer)



131
132
133

# File 'lib/nnw/shared/nv_array.rb', line 131

def dtype_size
  DTYPE_SIZES.fetch(@dtype)
end

#free! ⇒ `void`

This method returns an undefined value.

Free device memory. Raises if ref_count > 0.

Raises:

(MemoryContractViolation) —

if ref_count > 0
(RuntimeError) —

if already freed or cudaFree fails

# File 'lib/nnw/shared/nv_array.rb', line 445

def free!
  @mutex.synchronize do
    raise "NvArray##{@id} has already been freed" if @freed

    # Refuse to free while pinned (ref_count > 0): another component holds
    # this buffer for shared staging (the documented contract). GC-time
    # reclamation goes through the finalizer, not free!, so this guard only
    # gates EXPLICIT frees — it never blocks normal collection.
    if @ref_count > 0
      raise MemoryContractViolation,
            "Cannot free NvArray##{@id} while ref_count=#{@ref_count} > 0"
    end

    unless @pinned_host_ptr.nil?
      cuda_rt.cudaFreeHost(@pinned_host_ptr)
      @pinned_host_ptr = nil
    end

    # Only the owner frees the device buffer. Views (slice/reshape) share the
    # parent's allocation and must NOT free it (that was the reshape/slice
    # double-free path); the parent frees it via free! or its GC finalizer.
    if @owns_memory && !@ptr.nil?
      status = cuda_rt.cudaFree(@ptr)
      raise "cudaFree failed with status #{status}" unless status.zero?
    end

    @ptr = nil
    @freed = true
  end
  # Cancel the GC finalizer so we don't cudaFree the same pointer twice.
  ObjectSpace.undefine_finalizer(self)
end

#freed? ⇒ `Boolean`

Whether this array has been freed.

Returns:

(Boolean)



194
195
196

# File 'lib/nnw/shared/nv_array.rb', line 194

def freed?
  @mutex.synchronize { @freed }
end

#from_host(data) ⇒ `self`

Copy data from a Ruby Array to device memory.

Parameters:

data (Array<Numeric>) —

flat array of values to copy

Returns:

(self)

Raises:

(ArgumentError) —

if data size doesn’t match tensor element count
(RuntimeError) —

if array has been freed or cudaMemcpy fails

# File 'lib/nnw/shared/nv_array.rb', line 224

def from_host(data)
  unless data.is_a?(Array) && data.length == numel
    raise ArgumentError, "Expected #{numel} elements, got #{data.length}"
  end

  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed

    if @ptr.nil?
      @ptr = allocate_device_memory(size_bytes)
    end
  end

  host_buf = pack_host_buffer(data)
  status = cuda_rt.cudaMemcpy(@ptr, host_buf, size_bytes, MEMCPY_HOST_TO_DEVICE)
  raise "cudaMemcpy HtoD failed with status #{status}" unless status.zero?

  self
end

#from_host_raw(bytes) ⇒ `self`

Copy a raw little-endian binary string straight to device memory.

The bytes must already be in the device dtype’s native layout (this is how safetensors / NOVA data is stored), so no per-element conversion is done —this avoids the lossy float<->half round trip that #from_host would incur.

Parameters:

bytes (String) —

binary string of exactly size_bytes length

Returns:

(self)

Raises:

(ArgumentError) —

if the byte count doesn’t match size_bytes
(RuntimeError) —

if array has been freed or cudaMemcpy fails

# File 'lib/nnw/shared/nv_array.rb', line 254

def from_host_raw(bytes)
  unless bytes.bytesize == size_bytes
    raise ArgumentError, "Expected #{size_bytes} bytes, got #{bytes.bytesize}"
  end

  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed

    @ptr = allocate_device_memory(size_bytes) if @ptr.nil?
  end

  host_buf = Fiddle::Pointer.malloc(bytes.bytesize)
  host_buf[0, bytes.bytesize] = bytes
  status = cuda_rt.cudaMemcpy(@ptr, host_buf, size_bytes, MEMCPY_HOST_TO_DEVICE)
  raise "cudaMemcpy HtoD failed with status #{status}" unless status.zero?

  self
end

#increment_ref! ⇒ `Integer`

Atomically increment the reference count.

Returns:

(Integer) —

new ref_count value

# File 'lib/nnw/shared/nv_array.rb', line 480

def increment_ref!
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    @ref_count += 1
  end
end

#ndim ⇒ `Integer`

Returns number of dimensions.

Returns:

(Integer) —

number of dimensions



145
146
147

# File 'lib/nnw/shared/nv_array.rb', line 145

def ndim
  @shape.length
end

#numel ⇒ `Integer`

Total number of elements in the tensor.

Returns:

(Integer)



119
120
121

# File 'lib/nnw/shared/nv_array.rb', line 119

def numel
  @shape.reduce(1, :*)
end

#on_device? ⇒ `Boolean`

Returns whether device memory is allocated.

Returns:

(Boolean) —

whether device memory is allocated



155
156
157

# File 'lib/nnw/shared/nv_array.rb', line 155

def on_device?
  !@ptr.nil?
end

#pin! ⇒ `Fiddle::Pointer`

Allocate pinned host memory for P2P staging.

Returns:

(Fiddle::Pointer) —

the pinned host pointer

Raises:

(RuntimeError) —

if cudaHostAlloc fails or already pinned

# File 'lib/nnw/shared/nv_array.rb', line 409

def pin!
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} is already pinned" unless @pinned_host_ptr.nil?

    ptr_buf = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
    status = cuda_rt.cudaHostAlloc(ptr_buf, size_bytes, CUDA_HOST_ALLOC_DEFAULT)
    raise "cudaHostAlloc failed with status #{status}" unless status.zero?

    @pinned_host_ptr = Fiddle::Pointer.new(ptr_buf[0, Fiddle::SIZEOF_VOIDP].unpack1('Q'))
  end

  @pinned_host_ptr
end

#size_bytes ⇒ `Integer`

Size in bytes of the tensor data on device.

Returns:

(Integer)



125
126
127

# File 'lib/nnw/shared/nv_array.rb', line 125

def size_bytes
  numel * dtype_size
end

#slice(dim, start, len) ⇒ `NvArray`

Create a zero-copy slice along a dimension.

Returns a new NvArray that shares the same device memory but with an offset pointer and adjusted shape. No data is copied.

Parameters:

dim (Integer) —

dimension to slice along
start (Integer) —

starting index in the dimension
len (Integer) —

number of elements to include

Returns:

(NvArray) —

new array sharing device memory (no copy)

Raises:

(ArgumentError) —

if dim, start, or len are out of bounds

# File 'lib/nnw/shared/nv_array.rb', line 338

def slice(dim, start, len)
  raise ArgumentError, "Dimension #{dim} out of range for shape #{@shape}" unless dim >= 0 && dim < @shape.length
  raise ArgumentError, "Slice range [#{start}, #{start + len}) exceeds dim size #{@shape[dim]}" unless start >= 0 && (start + len) <= @shape[dim]

  # A pointer-offset view is only CONTIGUOUS — and thus correct for a plain
  # numel-length read (to_host, kernels) — when nothing varies in the
  # dimensions BEFORE `dim`. For dim>0 with non-unit leading dims the slice
  # is strided (scattered across memory), which this zero-copy view cannot
  # represent: a consumer would read the wrong, contiguous elements with no
  # error. Fail loud instead of returning silently-wrong data.
  leading = @shape[0...dim].reduce(1, :*)
  if leading > 1
    raise ArgumentError,
          "slice(dim=#{dim}, ...) on shape #{@shape} is a strided (non-contiguous) view, " \
          "which NvArray#slice cannot represent; only contiguous slices are supported " \
          "(dim 0, or leading dims of size 1). Use a gather/copy kernel for strided slices."
  end

  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
  end

  new_shape = @shape.dup
  new_shape[dim] = len

  # Compute byte offset: product of trailing dimensions * start * dtype_size
  trailing = @shape[(dim + 1)..].reduce(1, :*)
  offset_bytes = start * trailing * dtype_size

  sliced_ptr = @ptr.nil? ? nil : Fiddle::Pointer.new(@ptr.to_i + offset_bytes, size_bytes - offset_bytes)

  sliced = NvArray.new(
    shape: new_shape,
    dtype: @dtype,
    device_id: @device_id,
    ptr: sliced_ptr,
    stream: @stream,
    owner: @owner,
    parent: self # view: non-owning, keeps parent alive (no leak / no double-free)
  )

  sliced
end

#to_device ⇒ `self`

Ensure device memory is allocated. Shared arrays are device-resident, so this just allocates on first use; it exists for API parity with Ignis::NvArray#to_device.

Returns:

(self)

# File 'lib/nnw/shared/nv_array.rb', line 163

def to_device(*)
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed

    @ptr = allocate_device_memory(size_bytes) if @ptr.nil?
  end
  self
end

#to_host ⇒ `Array<Numeric>`

Copy device memory to host and return as a flat Ruby Array.

Uses cudaMemcpy with DtoH direction. The returned array contains numeric values decoded according to the dtype.

Returns:

(Array<Numeric>) —

flat array of host-side values

Raises:

(RuntimeError) —

if array has been freed or cudaMemcpy fails

# File 'lib/nnw/shared/nv_array.rb', line 205

def to_host
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} has no device pointer" if @ptr.nil?
  end

  host_buf = Fiddle::Pointer.malloc(size_bytes)
  status = cuda_rt.cudaMemcpy(host_buf, @ptr, size_bytes, MEMCPY_DEVICE_TO_HOST)
  raise "cudaMemcpy DtoH failed with status #{status}" unless status.zero?

  unpack_host_buffer(host_buf)
end

#to_s ⇒ `String` Also known as: inspect

Human-readable string representation.

Returns:

(String)

# File 'lib/nnw/shared/nv_array.rb', line 500

def to_s
  "#<Ignis::Shared::NvArray id=#{@id} shape=#{@shape} dtype=#{@dtype} " \
    "device=#{@device_id} owner=#{@owner} ref_count=#{@ref_count} freed=#{@freed}>"
end

#transfer_ownership(new_owner) ⇒ `Symbol`

Atomically transfer memory ownership to a new owner.

Parameters:

new_owner (Symbol) —

the new owner (:nvruby, :nvccl, or :wnais)

Returns:

(Symbol) —

the new owner

Raises:

(MemoryContractViolation) —

if ref_count > 1 during transfer
(ArgumentError) —

if new_owner is invalid

# File 'lib/nnw/shared/nv_array.rb', line 388

def transfer_ownership(new_owner)
  validate_owner!(new_owner)

  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed

    if @ref_count > 1
      raise MemoryContractViolation,
            "Cannot transfer ownership of NvArray##{@id} while ref_count=#{@ref_count} > 1"
    end

    @owner = new_owner
  end

  new_owner
end

#unpin! ⇒ `void`

This method returns an undefined value.

Free pinned host memory.

Raises:

(RuntimeError) —

if no pinned memory exists or cudaFreeHost fails

# File 'lib/nnw/shared/nv_array.rb', line 428

def unpin!
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} has no pinned memory" if @pinned_host_ptr.nil?

    status = cuda_rt.cudaFreeHost(@pinned_host_ptr)
    raise "cudaFreeHost failed with status #{status}" unless status.zero?

    @pinned_host_ptr = nil
  end
end

#write_rows!(src, start_row) ⇒ `self`

Copy a contiguous source array into this buffer starting at row start_row (device→device). Used to append K/V rows into a preallocated KV cache in O(row) instead of reallocating + recopying the whole cache each step.

Parameters:

src (NvArray) —

contiguous source ([r, cols] matching this array’s cols)
start_row (Integer) —

destination row offset (0-based)

Returns:

(self)

Raises:

(RuntimeError) —

if freed/unallocated, or the write would overflow

# File 'lib/nnw/shared/nv_array.rb', line 311

def write_rows!(src, start_row)
  @mutex.synchronize do
    raise "NvArray##{@id} has been freed" if @freed
    raise "NvArray##{@id} has no device pointer" if @ptr.nil?
  end
  row_bytes = (numel / @shape[0]) * dtype_size
  offset = start_row * row_bytes
  if offset + src.size_bytes > size_bytes
    raise "write_rows! overflow: writing #{src.size_bytes} bytes at row #{start_row} " \
          "(offset #{offset}) exceeds #{size_bytes}-byte buffer"
  end
  dst = Fiddle::Pointer.new(@ptr.to_i + offset)
  status = cuda_rt.cudaMemcpy(dst, src.ptr, src.size_bytes, MEMCPY_DEVICE_TO_DEVICE)
  raise "cudaMemcpy DtoD (write_rows!) failed with status #{status}" unless status.zero?
  self
end

#zero! ⇒ `self`

Zero the device buffer with cudaMemset (device-side). This is ~20x faster than the old ‘from_host(Array.new(numel, 0.0))` idiom, which allocated a huge Ruby array, packed it, and H2D-copied it on every op (0.5ms+/op, and seconds for the 38M-element LM-head weight transpose).

Returns:

(self)

# File 'lib/nnw/shared/nv_array.rb', line 185

def zero!
  to_device if @ptr.nil?
  status = cuda_rt.cudaMemset(@ptr, 0, size_bytes)
  raise "cudaMemset failed with status #{status}" unless status.zero?
  self
end

Class: Ignis::Shared::NvArray

Overview

Constant Summary collapse

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil) ⇒ NvArray

Instance Attribute Details

#created_at ⇒ Time (readonly)

#device_id ⇒ Integer (readonly)

#dtype ⇒ Symbol (readonly)

#id ⇒ Integer (readonly)

#owner ⇒ Symbol (readonly)

#pinned_host_ptr ⇒ Fiddle::Pointer? (readonly)

#ptr ⇒ Fiddle::Pointer? (readonly)

#ref_count ⇒ Integer (readonly)

#shape ⇒ Array<Integer> (readonly)

#stream ⇒ Fiddle::Pointer? (readonly)

Class Method Details

.cuda_runtime ⇒ Module

.release_finalizer(addr) ⇒ Proc

Instance Method Details

#clone ⇒ NvArray

#decrement_ref! ⇒ Integer

#device_ffi_ptr ⇒ FFI::Pointer

#device_index ⇒ Integer

#dtype_size ⇒ Integer

#free! ⇒ void

#freed? ⇒ Boolean

#from_host(data) ⇒ self

#from_host_raw(bytes) ⇒ self

#increment_ref! ⇒ Integer

#ndim ⇒ Integer

#numel ⇒ Integer

#on_device? ⇒ Boolean

#pin! ⇒ Fiddle::Pointer

#size_bytes ⇒ Integer

#slice(dim, start, len) ⇒ NvArray

#to_device ⇒ self

#to_host ⇒ Array<Numeric>

#to_s ⇒ String Also known as: inspect

#transfer_ownership(new_owner) ⇒ Symbol

#unpin! ⇒ void

#write_rows!(src, start_row) ⇒ self

#zero! ⇒ self

#initialize(shape:, dtype:, device_id: 0, ptr: nil, stream: nil, owner: :nvruby, parent: nil) ⇒ `NvArray`

#created_at ⇒ `Time` (readonly)

#device_id ⇒ `Integer` (readonly)

#dtype ⇒ `Symbol` (readonly)

#id ⇒ `Integer` (readonly)

#owner ⇒ `Symbol` (readonly)

#pinned_host_ptr ⇒ `Fiddle::Pointer`^? (readonly)

#ptr ⇒ `Fiddle::Pointer`^? (readonly)

#ref_count ⇒ `Integer` (readonly)

#shape ⇒ `Array<Integer>` (readonly)

#stream ⇒ `Fiddle::Pointer`^? (readonly)

.cuda_runtime ⇒ `Module`

.release_finalizer(addr) ⇒ `Proc`

#clone ⇒ `NvArray`

#decrement_ref! ⇒ `Integer`

#device_ffi_ptr ⇒ `FFI::Pointer`

#device_index ⇒ `Integer`

#dtype_size ⇒ `Integer`

#free! ⇒ `void`

#freed? ⇒ `Boolean`

#from_host(data) ⇒ `self`

#from_host_raw(bytes) ⇒ `self`

#increment_ref! ⇒ `Integer`

#ndim ⇒ `Integer`

#numel ⇒ `Integer`

#on_device? ⇒ `Boolean`

#pin! ⇒ `Fiddle::Pointer`

#size_bytes ⇒ `Integer`

#slice(dim, start, len) ⇒ `NvArray`

#to_device ⇒ `self`

#to_host ⇒ `Array<Numeric>`

#to_s ⇒ `String` Also known as: inspect

#transfer_ownership(new_owner) ⇒ `Symbol`

#unpin! ⇒ `void`

#write_rows!(src, start_row) ⇒ `self`

#zero! ⇒ `self`