Module: Ignis::Half
Defined in: lib/nvruby/half.rb
Overview
Pure-Ruby IEEE-754 half-precision (binary16) and bfloat16 <-> float32 conversion.
Ruby and FFI have no native 16-bit float type, so NvArray stores fp16/bf16 as raw uint16 bit patterns. These helpers convert to/from Ruby Floats with correct round-to-nearest-even rounding and proper subnormal / overflow / inf / NaN handling.
This is the single source of truth for half conversion across BOTH NvArray classes (Ignis::NvArray and Ignis::Shared::NvArray) and the safetensors codec, so the math cannot drift between them.
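The storage scheme described above can be sketched as follows. This is a minimal illustration, not the NvArray implementation: the module name `HalfStorageSketch` is hypothetical, its two methods are local copies of the bf16 bodies documented on this page so the snippet runs without the library, and the use of `pack("v*")` as the uint16 buffer layout is an assumption.

```ruby
# Hypothetical standalone helpers mirroring the documented bf16 bodies,
# so this snippet runs without loading the library.
module HalfStorageSketch
  module_function

  def f32_to_bf16(value)
    bits = [value.to_f].pack("e").unpack1("V")
    # NaN: keep sign+exponent, force a non-zero mantissa
    return ((bits >> 16) | 0x0040) & 0xFFFF if (bits & 0x7FFFFFFF) > 0x7F800000

    bits += 0x7FFF + ((bits >> 16) & 1) # round half to even
    (bits >> 16) & 0xFFFF
  end

  def bf16_to_f32(bits)
    [(bits & 0xFFFF) << 16].pack("V").unpack1("e")
  end
end

values  = [1.0, -2.5, 0.15625] # all exactly representable in bfloat16
encoded = values.map { |v| HalfStorageSketch.f32_to_bf16(v) }

# "v*" packs each Integer as a little-endian unsigned 16-bit word,
# which is one plausible layout for a raw uint16 buffer.
buffer  = encoded.pack("v*")
decoded = buffer.unpack("v*").map { |b| HalfStorageSketch.bf16_to_f32(b) }

puts encoded.map { |b| format("0x%04X", b) }.join(" ")
puts decoded.inspect
```

Each element occupies two bytes in the buffer, and values that bfloat16 can represent exactly survive the round trip unchanged.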
Class Method Summary
- .bf16_to_f32(bits) ⇒ Float
  Decode a bfloat16 bit pattern to a Ruby Float.
- .f16_to_f32(bits) ⇒ Float
  Decode an IEEE-754 binary16 (fp16) bit pattern to a Ruby Float.
- .f32_to_bf16(value) ⇒ Integer
  Encode a Float as bfloat16 (the upper 16 bits of float32), round-to-nearest-even.
- .f32_to_f16(value) ⇒ Integer
  Encode a Float as an IEEE-754 binary16 (fp16) bit pattern.
Class Method Details
.bf16_to_f32(bits) ⇒ Float
Decode a bfloat16 bit pattern to a Ruby Float.
# File 'lib/nvruby/half.rb', line 93
def bf16_to_f32(bits)
  [(bits & 0xFFFF) << 16].pack("V").unpack1("e")
end
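A few worked bit patterns may help show why the decode is a single shift: a bfloat16 value is the top half of a float32, so padding with 16 zero bits yields the float32 it stands for. The lambda `decode_bf16` below is a hypothetical local stand-in with the same body as the documented method, defined so the snippet runs on its own.

```ruby
# Hypothetical stand-in with the same body as the documented method,
# defined locally so the snippet runs without the library.
decode_bf16 = ->(bits) { [(bits & 0xFFFF) << 16].pack("V").unpack1("e") }

one      = decode_bf16.call(0x3F80)   # sign 0, biased exponent 127, mantissa 0
pi_ish   = decode_bf16.call(0x4049)   # 2 * (1 + 73/128)
neg_zero = decode_bf16.call(0x8000)   # only the sign bit is set
pos_inf  = decode_bf16.call(0x7F80)   # exponent all ones, mantissa 0
masked   = decode_bf16.call(0x1_3F80) # bits above the low 16 are discarded

puts [one, pi_ish, neg_zero, pos_inf, masked].inspect
```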
.f16_to_f32(bits) ⇒ Float
Decode an IEEE-754 binary16 (fp16) bit pattern to a Ruby Float.
# File 'lib/nvruby/half.rb', line 57
def f16_to_f32(bits)
  sign = (bits >> 15) & 0x1
  exp = (bits >> 10) & 0x1F
  mant = bits & 0x3FF

  if exp.zero?
    return sign.zero? ? 0.0 : -0.0 if mant.zero?

    val = (mant / 1024.0) * (2.0**-14)
  elsif exp == 0x1F
    return Float::NAN unless mant.zero?

    return sign.zero? ? Float::INFINITY : -Float::INFINITY
  else
    val = (1.0 + mant / 1024.0) * (2.0**(exp - 15))
  end

  sign.zero? ? val : -val
end
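The three branches of the decoder (zero/subnormal, Inf/NaN, normal) can be exercised with one bit pattern each. The sketch below assumes nothing beyond the body shown above; `F16DecodeSketch` is a hypothetical module holding a local copy of that body so the snippet is runnable without the library.

```ruby
# Hypothetical module with a local copy of the documented decoder.
module F16DecodeSketch
  module_function

  def f16_to_f32(bits)
    sign = (bits >> 15) & 0x1
    exp  = (bits >> 10) & 0x1F
    mant = bits & 0x3FF

    if exp.zero?
      return sign.zero? ? 0.0 : -0.0 if mant.zero?

      val = (mant / 1024.0) * (2.0**-14) # subnormal: no implicit leading 1
    elsif exp == 0x1F
      return Float::NAN unless mant.zero?

      return sign.zero? ? Float::INFINITY : -Float::INFINITY
    else
      val = (1.0 + mant / 1024.0) * (2.0**(exp - 15))
    end

    sign.zero? ? val : -val
  end
end

smallest_subnormal = F16DecodeSketch.f16_to_f32(0x0001) # 2**-24
smallest_normal    = F16DecodeSketch.f16_to_f32(0x0400) # 2**-14
one                = F16DecodeSketch.f16_to_f32(0x3C00) # exponent bias is 15
largest_finite     = F16DecodeSketch.f16_to_f32(0x7BFF) # (1 + 1023/1024) * 2**15
negative_inf       = F16DecodeSketch.f16_to_f32(0xFC00)
not_a_number       = F16DecodeSketch.f16_to_f32(0x7E00)

puts [smallest_subnormal, smallest_normal, one, largest_finite].inspect
```

Note that every NaN pattern decodes to `Float::NAN`, so the sign and payload of a NaN are not preserved by this path.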
.f32_to_bf16(value) ⇒ Integer
Encode a Float as bfloat16 (the upper 16 bits of float32), round-to-nearest-even.
# File 'lib/nvruby/half.rb', line 80
def f32_to_bf16(value)
  bits = [value.to_f].pack("e").unpack1("V")
  # NaN: keep sign+exponent, force a non-zero mantissa (quiet NaN)
  return ((bits >> 16) | 0x0040) & 0xFFFF if (bits & 0x7FFFFFFF) > 0x7F800000

  lsb = (bits >> 16) & 1
  bits += 0x7FFF + lsb # round half to even
  (bits >> 16) & 0xFFFF
end
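The `0x7FFF + lsb` bias is what turns truncation into round-half-to-even: an exact tie (discarded bits equal to 0x8000) carries into the kept half only when the kept half is odd. The sketch below illustrates this with inputs chosen to sit on or near a tie. `Bf16RoundingSketch` and its helper `f32_from_bits` are hypothetical names; the encoder is a local copy of the documented body.

```ruby
# Hypothetical module: local copy of the documented encoder plus a helper
# that builds a Float from an explicit float32 bit pattern.
module Bf16RoundingSketch
  module_function

  def f32_to_bf16(value)
    bits = [value.to_f].pack("e").unpack1("V")
    return ((bits >> 16) | 0x0040) & 0xFFFF if (bits & 0x7FFFFFFF) > 0x7F800000

    lsb = (bits >> 16) & 1
    bits += 0x7FFF + lsb # round half to even
    (bits >> 16) & 0xFFFF
  end

  def f32_from_bits(bits)
    [bits].pack("V").unpack1("e")
  end
end

s = Bf16RoundingSketch

tie_to_even_down = s.f32_to_bf16(1.0 + 2.0**-8)              # float32 0x3F808000, kept half 0x3F80 is even
tie_to_even_up   = s.f32_to_bf16(1.0 + 3 * 2.0**-8)          # float32 0x3F818000, kept half 0x3F81 is odd
just_above_tie   = s.f32_to_bf16(s.f32_from_bits(0x3F808001)) # one float32 ulp past the tie
carry_to_inf     = s.f32_to_bf16(s.f32_from_bits(0x7F7FFFFF)) # largest finite float32
encoded_nan      = s.f32_to_bf16(Float::NAN)

puts [tie_to_even_down, tie_to_even_up, just_above_tie, carry_to_inf]
  .map { |b| format("0x%04X", b) }.join(" ")
```

The last finite case shows the carry running through the mantissa into the exponent, which yields the bfloat16 infinity pattern rather than wrapping.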
.f32_to_f16(value) ⇒ Integer
Encode a Float as an IEEE-754 binary16 (fp16) bit pattern.
# File 'lib/nvruby/half.rb', line 19
def f32_to_f16(value)
  bits = [value.to_f].pack("e").unpack1("V") # float32 little-endian bit pattern
  sign = (bits >> 16) & 0x8000
  exp = (bits >> 23) & 0xFF
  mant = bits & 0x7FFFFF

  # Inf / NaN
  return sign | (mant.zero? ? 0x7C00 : 0x7E00) if exp == 0xFF

  e = exp - 127 + 15
  if e >= 0x1F
    # Overflow -> signed Inf
    sign | 0x7C00
  elsif e <= 0
    # Subnormal or zero
    return sign if e < -10 # too small to represent even as a subnormal

    m = mant | 0x800000 # restore the implicit leading 1
    shift = 14 - e # 14..24
    half = m >> shift
    rem = m & ((1 << shift) - 1)
    halfway = 1 << (shift - 1)
    half += 1 if rem > halfway || (rem == halfway && (half & 1) == 1) # round half to even
    sign | half
  else
    # Normal
    half = sign | (e << 10) | (mant >> 13)
    rem = mant & 0x1FFF
    # round half to even; a carry correctly propagates mantissa -> exponent (incl. -> Inf)
    half += 1 if rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1)
    half & 0xFFFF
  end
end
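The boundary cases are where an fp16 encoder most often goes wrong, so a sketch that probes them may be useful: the largest finite value, the tie that rounds up into Inf, and the smallest subnormal with the tie just below it. `F16EncodeSketch` is a hypothetical module holding a local copy of the body above so the snippet runs without the library; the input values are chosen here for illustration.

```ruby
# Hypothetical module with a local copy of the documented encoder.
module F16EncodeSketch
  module_function

  def f32_to_f16(value)
    bits = [value.to_f].pack("e").unpack1("V")
    sign = (bits >> 16) & 0x8000
    exp  = (bits >> 23) & 0xFF
    mant = bits & 0x7FFFFF

    return sign | (mant.zero? ? 0x7C00 : 0x7E00) if exp == 0xFF # Inf / NaN

    e = exp - 127 + 15 # rebias from float32 (127) to fp16 (15)
    if e >= 0x1F
      sign | 0x7C00 # overflow -> signed Inf
    elsif e <= 0
      return sign if e < -10 # below half the smallest subnormal

      m = mant | 0x800000 # restore the implicit leading 1
      shift = 14 - e
      half = m >> shift
      rem = m & ((1 << shift) - 1)
      halfway = 1 << (shift - 1)
      half += 1 if rem > halfway || (rem == halfway && (half & 1) == 1)
      sign | half
    else
      half = sign | (e << 10) | (mant >> 13)
      rem = mant & 0x1FFF
      half += 1 if rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1)
      half & 0xFFFF
    end
  end
end

enc = F16EncodeSketch.method(:f32_to_f16)

one             = enc.call(1.0)
tenth           = enc.call(0.1)      # not exactly representable; rounds down
largest_finite  = enc.call(65_504.0)
below_tie       = enc.call(65_519.0) # still rounds to the largest finite value
tie_into_inf    = enc.call(65_520.0) # exact tie, odd mantissa, carries into Inf
min_subnormal   = enc.call(2.0**-24)
tie_into_zero   = enc.call(2.0**-25) # exact tie between 0 and 2**-24, even is 0
negative_zero   = enc.call(-0.0)

puts [one, tenth, largest_finite, tie_into_inf, min_subnormal]
  .map { |b| format("0x%04X", b) }.join(" ")
```

Because the input first passes through `pack("e")`, a Ruby Float (a double) is rounded to float32 before being rounded to fp16, so inputs that are not exact float32 values are rounded twice.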