Module: Ignis::Half

Defined in:
lib/nvruby/half.rb

Overview

Pure-Ruby IEEE-754 half-precision (binary16) and bfloat16 <-> float32 conversion.

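Neither Ruby nor FFI has a native 16-bit float type, so NvArray stores fp16/bf16 values as raw uint16 bit patterns. These helpers convert between those bit patterns and Ruby Floats. Encoding rounds to nearest with ties to even and handles subnormals, overflow to infinity, infinities, and NaN. Decoding of finite values is exact, because every fp16 and bf16 value is representable as a Ruby Float.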

This is the single source of truth for half conversion across BOTH NvArray classes (Ignis::NvArray and Ignis::Shared::NvArray) and the safetensors codec, so the math cannot drift between them.
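
As a rough illustration of what "raw uint16 bit patterns" means in practice, the sketch below packs a few bfloat16 patterns into a binary String and reads them back using only core `Array#pack` and `String#unpack`. The little-endian `v` layout and the names `patterns`, `buffer`, and `decoded` are assumptions made for this example; they are not taken from NvArray's storage code.

```ruby
# Hypothetical example: three bfloat16 values held as 16-bit unsigned integers.
patterns = [0x3F80, 0xC000, 0x0000] # 1.0, -2.0, 0.0 in bfloat16

# Pack as little-endian uint16 ("v"), the way a raw 16-bit buffer might be laid out.
buffer = patterns.pack("v*")

# Read the patterns back and widen each one to float32 by placing it
# in the upper 16 bits of a 32-bit word, then reinterpreting that word.
decoded = buffer.unpack("v*").map { |b| [b << 16].pack("V").unpack1("e") }
```

Each element occupies two bytes in the buffer, and no floating-point arithmetic happens until a pattern is decoded.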

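Class Method Summary

  • .bf16_to_f32(bits) ⇒ Float

    Decode a bfloat16 bit pattern to a Ruby Float.

  • .f16_to_f32(bits) ⇒ Float

    Decode an IEEE-754 binary16 (fp16) bit pattern to a Ruby Float.

  • .f32_to_bf16(value) ⇒ Integer

    Encode a Float as bfloat16 (the upper 16 bits of float32), round-to-nearest-even.

  • .f32_to_f16(value) ⇒ Integer

    Encode a Float as an IEEE-754 binary16 (fp16) bit pattern.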

Class Method Details

.bf16_to_f32(bits) ⇒ Float

Decode a bfloat16 bit pattern to a Ruby Float.

Parameters:

  • bits (Integer)

    16-bit unsigned

Returns:

  • (Float)


# File 'lib/nvruby/half.rb', line 93

def bf16_to_f32(bits)
  # bfloat16 is the upper half of a float32: shift into place, then reinterpret the 32 bits
  [(bits & 0xFFFF) << 16].pack("V").unpack1("e")
end
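
A few hand-checked bit patterns make the decoding concrete. The sketch below restates the one-line conversion as a local lambda (`decode`, a name chosen for this example) so that it runs without loading the library; the variable names are likewise illustrative.

```ruby
# Same expression as the listing above, wrapped in a lambda so the snippet is standalone.
decode = ->(bits) { [(bits & 0xFFFF) << 16].pack("V").unpack1("e") }

approx_pi = decode.call(0x4049) # 3.140625: only 8 significant bits survive
pos_inf   = decode.call(0x7F80) # exponent all ones, mantissa zero: +Infinity
neg_inf   = decode.call(0xFF80) # same with the sign bit set: -Infinity
nan       = decode.call(0x7FC0) # exponent all ones, mantissa non-zero: NaN
neg_zero  = decode.call(0x8000) # sign bit only: -0.0
```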

.f16_to_f32(bits) ⇒ Float

Decode an IEEE-754 binary16 (fp16) bit pattern to a Ruby Float.

Parameters:

  • bits (Integer)

    16-bit unsigned

Returns:

  • (Float)


# File 'lib/nvruby/half.rb', line 57

def f16_to_f32(bits)
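  sign = (bits >> 15) & 0x1  # 1 sign bit
  exp  = (bits >> 10) & 0x1F # 5 exponent bits, bias 15
  mant = bits & 0x3FF        # 10 mantissa bits

  if exp.zero?
    # Zero keeps its sign
    return sign.zero? ? 0.0 : -0.0 if mant.zero?

    # Subnormal: no implicit leading 1, fixed exponent of -14
    val = (mant / 1024.0) * (2.0**-14)
  elsif exp == 0x1F
    # NaN: the sign and payload bits are not preserved
    return Float::NAN unless mant.zero?

    return sign.zero? ? Float::INFINITY : -Float::INFINITY
  else
    # Normal: implicit leading 1, unbiased exponent
    val = (1.0 + mant / 1024.0) * (2.0**(exp - 15))
  end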

  sign.zero? ? val : -val
end
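
To see the two formulas in the listing at work, the following sketch splits three well-known binary16 patterns into their fields and evaluates them by hand. The helper `f16_fields` and the variable names are hypothetical, introduced only for this illustration.

```ruby
# Split a binary16 pattern into its three fields (helper name is hypothetical).
def f16_fields(bits)
  { sign: (bits >> 15) & 0x1, exp: (bits >> 10) & 0x1F, mant: bits & 0x3FF }
end

# 0x3C00: exp = 15 (the bias), mant = 0, so (1 + 0/1024) * 2**0
one = f16_fields(0x3C00)
one_value = (1.0 + one[:mant] / 1024.0) * (2.0**(one[:exp] - 15))

# 0x7BFF: largest finite value, exp = 30, mant = 1023
max = f16_fields(0x7BFF)
max_value = (1.0 + max[:mant] / 1024.0) * (2.0**(max[:exp] - 15))

# 0x0001: smallest subnormal, exp = 0, mant = 1, so (1/1024) * 2**-14
tiny = f16_fields(0x0001)
tiny_value = (tiny[:mant] / 1024.0) * (2.0**-14)
```

The three results are 1.0, 65504.0, and 2**-24 respectively, which are the unit value, the largest finite fp16 value, and the smallest positive fp16 value.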

.f32_to_bf16(value) ⇒ Integer

Encode a Float as bfloat16 (the upper 16 bits of its float32 representation), rounding to nearest with ties to even. NaN inputs are returned as quiet NaNs.

Parameters:

  • value (Numeric)

Returns:

  • (Integer)

    16-bit unsigned (0..0xFFFF)



# File 'lib/nvruby/half.rb', line 80

def f32_to_bf16(value)
  bits = [value.to_f].pack("e").unpack1("V")
  # NaN: keep sign+exponent, force a non-zero mantissa (quiet NaN)
  return ((bits >> 16) | 0x0040) & 0xFFFF if (bits & 0x7FFFFFFF) > 0x7F800000

  lsb = (bits >> 16) & 1
  bits += 0x7FFF + lsb # round half to even
  (bits >> 16) & 0xFFFF
end
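
The rounding step relies on a bias trick: adding `0x7FFF + lsb` before truncating carries into the kept bits whenever the discarded low half exceeds `0x8000`, and on an exact tie only when the kept value is odd. The sketch below isolates that step on raw 32-bit patterns; `round_upper16` is a hypothetical name, and the NaN guard from the listing is deliberately left out.

```ruby
# Round a 32-bit float pattern to its upper 16 bits, ties to even.
# Mirrors only the non-NaN path of the listing; the method name is hypothetical.
def round_upper16(bits32)
  lsb = (bits32 >> 16) & 1 # lowest bit that will be kept
  ((bits32 + 0x7FFF + lsb) >> 16) & 0xFFFF
end

below    = round_upper16(0x3F80_7FFF) # below halfway: rounds down to 0x3F80
tie_even = round_upper16(0x3F80_8000) # exact tie, kept lsb is 0: stays 0x3F80
tie_odd  = round_upper16(0x3F81_8000) # exact tie, kept lsb is 1: rounds up to 0x3F82
above    = round_upper16(0x3F80_8001) # above halfway: rounds up to 0x3F81
overflow = round_upper16(0x7F7F_FFFF) # largest finite float32: carry yields +Inf, 0x7F80
```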

.f32_to_f16(value) ⇒ Integer

Encode a Float as an IEEE-754 binary16 (fp16) bit pattern.

Parameters:

  • value (Numeric)

Returns:

  • (Integer)

    16-bit unsigned (0..0xFFFF)



# File 'lib/nvruby/half.rb', line 19

def f32_to_f16(value)
  bits = [value.to_f].pack("e").unpack1("V") # float32 little-endian bit pattern
  sign = (bits >> 16) & 0x8000
  exp  = (bits >> 23) & 0xFF
  mant = bits & 0x7FFFFF

  # Inf / NaN
  return sign | (mant.zero? ? 0x7C00 : 0x7E00) if exp == 0xFF

  e = exp - 127 + 15

  if e >= 0x1F
    # Overflow -> signed Inf
    sign | 0x7C00
  elsif e <= 0
    # Subnormal or zero
    return sign if e < -10 # too small to represent even as a subnormal

    m = mant | 0x800000 # restore the implicit leading 1
    shift = 14 - e # 14..24
    half = m >> shift
    rem = m & ((1 << shift) - 1)
    halfway = 1 << (shift - 1)
    half += 1 if rem > halfway || (rem == halfway && (half & 1) == 1) # round half to even
    sign | half
  else
    # Normal
    half = sign | (e << 10) | (mant >> 13)
    rem = mant & 0x1FFF
    # round half to even; a carry correctly propagates mantissa -> exponent (incl. -> Inf)
    half += 1 if rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1)
    half & 0xFFFF
  end
end
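
The carry behaviour noted in the normal branch is easiest to follow with concrete inputs. The sketch below restates only that branch as a standalone method (`f16_normal_bits`, a hypothetical name) and assumes the input is finite and already within the fp16 normal range, so the Inf/NaN, overflow, and subnormal branches are omitted.

```ruby
# Normal-range branch only, restated so the snippet runs standalone.
# Hypothetical helper: assumes 0 < e < 0x1F, i.e. a finite fp16-normal input.
def f16_normal_bits(value)
  bits = [value.to_f].pack("e").unpack1("V") # float32 bit pattern
  sign = (bits >> 16) & 0x8000
  e    = ((bits >> 23) & 0xFF) - 127 + 15    # rebias 127 -> 15
  mant = bits & 0x7FFFFF
  half = sign | (e << 10) | (mant >> 13)     # keep the top 10 mantissa bits
  rem  = mant & 0x1FFF                       # the 13 discarded bits
  half += 1 if rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1)
  half & 0xFFFF
end

exact     = f16_normal_bits(1.0)                 # 0x3C00
tie_even  = f16_normal_bits(1.0 + 2.0**-11)      # tie, even neighbour wins: 0x3C00
tie_odd   = f16_normal_bits(1.0 + 3 * 2.0**-11)  # tie above odd 0x3C01: 0x3C02
carry_exp = f16_normal_bits(2.0 - 2.0**-12)      # mantissa overflows into exponent: 0x4000 (2.0)
carry_inf = f16_normal_bits(65520.0)             # tie above 0x7BFF: carry reaches 0x7C00 (+Inf)
negative  = f16_normal_bits(-1.0)                # sign bit set: 0xBC00
```

Because the mantissa and exponent fields are adjacent, a single `+ 1` is enough: when the mantissa is all ones, the carry increments the exponent, and at the top of the range it produces the infinity pattern.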