Module: Ignis::Half

Defined in: lib/nvruby/half.rb

Overview

Pure-Ruby IEEE-754 half-precision (binary16) and bfloat16 <-> float32 conversion.

Neither Ruby nor FFI has a native 16-bit float type, so NvArray stores fp16/bf16 values as raw uint16 bit patterns. These helpers convert between those bit patterns and Ruby Floats, using round-to-nearest-even when encoding and handling subnormals, overflow, infinities, and NaN.

This module is the single source of truth for half-precision conversion across both NvArray classes (Ignis::NvArray and Ignis::Shared::NvArray) and the safetensors codec, so the math cannot drift between them.
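As a rough illustration of the storage model described above, the sketch below packs a few fp16 bit patterns into a little-endian uint16 buffer and reads them back with Array#pack and String#unpack. The helper names pack_half_bits and unpack_half_bits are hypothetical and not part of Ignis::Half; how NvArray actually lays out its buffer is not shown in this document, so the little-endian layout is an assumption.

```ruby
# Hypothetical helpers, for illustration only (not part of Ignis::Half).

# Pack 16-bit patterns into a little-endian binary string, 2 bytes per element.
def pack_half_bits(patterns)
  patterns.map { |b| b & 0xFFFF }.pack("v*")
end

# Recover the 16-bit patterns from such a buffer.
def unpack_half_bits(buffer)
  buffer.unpack("v*")
end

patterns = [0x3C00, 0xC000, 0x7C00] # fp16 bit patterns for 1.0, -2.0, +Inf
buffer   = pack_half_bits(patterns)

puts buffer.bytesize                      # 6
puts unpack_half_bits(buffer) == patterns # true
```

Each element costs exactly two bytes, and the values only become Ruby Floats when a conversion helper decodes them.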

Class Method Summary

.bf16_to_f32(bits) ⇒ Float

Decode a bfloat16 bit pattern to a Ruby Float.

.f16_to_f32(bits) ⇒ Float

Decode an IEEE-754 binary16 (fp16) bit pattern to a Ruby Float.

.f32_to_bf16(value) ⇒ Integer

Encode a Float as a bfloat16 bit pattern (the upper 16 bits of float32), using round-to-nearest-even.

.f32_to_f16(value) ⇒ Integer

Encode a Float as an IEEE-754 binary16 (fp16) bit pattern.

Class Method Details

.bf16_to_f32(bits) ⇒ `Float`

Decode a bfloat16 bit pattern to a Ruby Float.

Parameters:

bits (Integer): 16-bit unsigned

Returns:

(Float)



# File 'lib/nvruby/half.rb', line 93

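def bf16_to_f32(bits)
  # Place the 16 bits in the upper half of a float32 word, then reinterpret
  # that little-endian word as a single-precision float.
  [(bits & 0xFFFF) << 16].pack("V").unpack1("e")
end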
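To make the decode concrete, here is a minimal sketch that applies the same shift-and-reinterpret step to a few well-known bit patterns. The method name bf16_decode_sketch is hypothetical and exists only so the example runs on its own.

```ruby
# Illustrative copy of the decode step; the name is hypothetical.
def bf16_decode_sketch(bits)
  word = (bits & 0xFFFF) << 16  # bf16 occupies the top half of a float32 word
  [word].pack("V").unpack1("e") # reinterpret the 32-bit word as a float
end

puts bf16_decode_sketch(0x3F80)      # 1.0
puts bf16_decode_sketch(0x4049)      # 3.140625
puts bf16_decode_sketch(0xFF80)      # -Infinity
puts bf16_decode_sketch(0x7FC0).nan? # true
```

Because bfloat16 shares float32's 8-bit exponent, decoding needs no rebiasing or subnormal handling; the low 16 mantissa bits are simply zero.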

.f16_to_f32(bits) ⇒ `Float`

Decode an IEEE-754 binary16 (fp16) bit pattern to a Ruby Float.

Parameters:

bits (Integer): 16-bit unsigned

Returns:

(Float)

# File 'lib/nvruby/half.rb', line 57

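def f16_to_f32(bits)
  sign = (bits >> 15) & 0x1  # 1 sign bit
  exp  = (bits >> 10) & 0x1F # 5 exponent bits, bias 15
  mant = bits & 0x3FF        # 10 mantissa bits

  if exp.zero?
    # Signed zero
    return sign.zero? ? 0.0 : -0.0 if mant.zero?

    # Subnormal: no implicit leading 1, fixed exponent of -14
    val = (mant / 1024.0) * (2.0**-14)
  elsif exp == 0x1F
    # NaN (the sign and payload are not preserved)
    return Float::NAN unless mant.zero?

    return sign.zero? ? Float::INFINITY : -Float::INFINITY
  else
    # Normal: implicit leading 1, unbiased exponent of exp - 15
    val = (1.0 + mant / 1024.0) * (2.0**(exp - 15))
  end

  sign.zero? ? val : -val
end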
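The same decoding can be restated more compactly with Math.ldexp, which may help when checking the listing by hand. This is an alternative formulation written for illustration, not the module's code; the name decode_f16_sketch is hypothetical. It relies on the identities (1 + mant/1024) * 2^(exp - 15) = (1024 + mant) * 2^(exp - 25) for normals and (mant/1024) * 2^-14 = mant * 2^-24 for subnormals.

```ruby
# Hypothetical restatement of the binary16 decode using Math.ldexp.
def decode_f16_sketch(bits)
  sign = bits[15].zero? ? 1.0 : -1.0 # Integer#[] reads a single bit
  exp  = (bits >> 10) & 0x1F
  mant = bits & 0x3FF

  # All-ones exponent: infinity when the mantissa is zero, NaN otherwise.
  return (mant.zero? ? sign * Float::INFINITY : Float::NAN) if exp == 0x1F

  if exp.zero?
    sign * Math.ldexp(mant, -24)              # zero and subnormals
  else
    sign * Math.ldexp(mant | 0x400, exp - 25) # normals, implicit 1 restored
  end
end

puts decode_f16_sketch(0x3C00) # 1.0
puts decode_f16_sketch(0x7BFF) # 65504.0
puts decode_f16_sketch(0xFC00) # -Infinity
```

0x7BFF is the largest finite binary16 value and 0x0001 is the smallest positive subnormal, 2^-24.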

.f32_to_bf16(value) ⇒ `Integer`

Encode a Float as a bfloat16 bit pattern (the upper 16 bits of float32), using round-to-nearest-even.

Parameters:

value (Numeric)

Returns:

(Integer): 16-bit unsigned (0..0xFFFF)

# File 'lib/nvruby/half.rb', line 80

def f32_to_bf16(value)
  bits = [value.to_f].pack("e").unpack1("V")
  # NaN: keep sign+exponent, force a non-zero mantissa (quiet NaN)
  return ((bits >> 16) | 0x0040) & 0xFFFF if (bits & 0x7FFFFFFF) > 0x7F800000

  lsb = (bits >> 16) & 1
  bits += 0x7FFF + lsb # round half to even
  (bits >> 16) & 0xFFFF
end
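The effect of adding 0x7FFF + lsb is easiest to see next to plain truncation. The sketch below compares the two on values chosen to sit exactly on, or just above, a halfway point between adjacent bfloat16 values. All three helper names are hypothetical, and the rounding helper omits the NaN branch of the listing for brevity.

```ruby
# Hypothetical helpers for illustration; not part of Ignis::Half.

# float32 bit pattern of a Ruby Float (pack rounds it to single precision first).
def f32_bits(value)
  [value.to_f].pack("e").unpack1("V")
end

# Truncation: simply drop the low 16 bits.
def bf16_truncate_sketch(value)
  (f32_bits(value) >> 16) & 0xFFFF
end

# Round-to-nearest-even, same arithmetic as the listing (NaN handling omitted).
def bf16_rne_sketch(value)
  bits = f32_bits(value)
  lsb  = (bits >> 16) & 1
  ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF
end

tie_down = 1.0 + 2.0**-8            # halfway between 0x3F80 and 0x3F81
tie_up   = 1.0 + 2.0**-7 + 2.0**-8  # halfway between 0x3F81 and 0x3F82
above    = 1.0 + 2.0**-8 + 2.0**-20 # just above the first halfway point

printf("%04X %04X\n", bf16_truncate_sketch(tie_down), bf16_rne_sketch(tie_down)) # 3F80 3F80
printf("%04X %04X\n", bf16_truncate_sketch(tie_up),   bf16_rne_sketch(tie_up))   # 3F81 3F82
printf("%04X %04X\n", bf16_truncate_sketch(above),    bf16_rne_sketch(above))    # 3F80 3F81
```

Both ties land on the pattern whose lowest bit is zero, while anything past the halfway point rounds up regardless of parity.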

.f32_to_f16(value) ⇒ `Integer`

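Encode a Float as an IEEE-754 binary16 (fp16) bit pattern, using round-to-nearest-even. Values too large for binary16 become signed infinity, and values too small for a subnormal become signed zero.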

Parameters:

value (Numeric)

Returns:

(Integer): 16-bit unsigned (0..0xFFFF)

# File 'lib/nvruby/half.rb', line 19

def f32_to_f16(value)
  bits = [value.to_f].pack("e").unpack1("V") # float32 little-endian bit pattern
  sign = (bits >> 16) & 0x8000
  exp  = (bits >> 23) & 0xFF
  mant = bits & 0x7FFFFF

  # Inf / NaN
  return sign | (mant.zero? ? 0x7C00 : 0x7E00) if exp == 0xFF

  e = exp - 127 + 15

  if e >= 0x1F
    # Overflow -> signed Inf
    sign | 0x7C00
  elsif e <= 0
    # Subnormal or zero
    return sign if e < -10 # too small to represent even as a subnormal

    m = mant | 0x800000 # restore the implicit leading 1
    shift = 14 - e # 14..24
    half = m >> shift
    rem = m & ((1 << shift) - 1)
    halfway = 1 << (shift - 1)
    half += 1 if rem > halfway || (rem == halfway && (half & 1) == 1) # round half to even
    sign | half
  else
    # Normal
    half = sign | (e << 10) | (mant >> 13)
    rem = mant & 0x1FFF
    # round half to even; a carry correctly propagates mantissa -> exponent (incl. -> Inf)
    half += 1 if rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1)
    half & 0xFFFF
  end
end
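The branch structure of the encoder follows from the rebiased exponent e = exp - 127 + 15. As a reading aid, the sketch below reports which branch a given value would enter before any rounding happens. The method f16_branch_sketch and its symbol names are hypothetical; they mirror the conditions in the listing but are not part of Ignis::Half.

```ruby
# Hypothetical helper: names the encoder branch a value falls into,
# judged on the rebiased exponent alone (rounding is not applied here).
def f16_branch_sketch(value)
  bits = [value.to_f].pack("e").unpack1("V") # float32 bit pattern
  exp  = (bits >> 23) & 0xFF
  return :inf_or_nan if exp == 0xFF

  e = exp - 127 + 15 # rebias from float32 (bias 127) to binary16 (bias 15)
  if e >= 0x1F
    :overflow_to_inf
  elsif e < -10
    :flush_to_zero
  elsif e <= 0
    :subnormal
  else
    :normal
  end
end

puts f16_branch_sketch(1.0)      # normal
puts f16_branch_sketch(65536.0)  # overflow_to_inf
puts f16_branch_sketch(2.0**-15) # subnormal
puts f16_branch_sketch(2.0**-26) # flush_to_zero
```

The branch is only the starting point: rounding in the normal branch can still carry into the exponent and produce infinity, and a value in the subnormal branch can round down to zero or up to the smallest normal.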
