Float4

MXFP4-compatible 4-bit floating point types and block formats for Rust.

Generate Convert Improve

Install / Use

/learn @EricLBuehler/Float4

About this skill

Quality Score

0/100

README

`float4`

MXFP4-compatible 4-bit floating point types and block formats for Rust.

This crate provides low-precision floating-point types following the OCP MX specification, designed for efficient storage and computation in machine learning applications where extreme quantization is beneficial.

Available Types

F4E2M1: 4-bit floating-point with 2 exponent bits and 1 mantissa bit
F4E2M1x2: Packed pair of two F4E2M1 values in a single byte (NVIDIA __nv_fp4x2_e2m1 compatible)
E8M0: 8-bit scale factor representing powers of two (2^-127 to 2^127)
MXFP4Block: Block format storing 32 F4E2M1 values with a shared E8M0 scale

Features

Extreme compression: 4× smaller than f32 with MXFP4Block format
IEEE 754 compliant rounding: Round-to-nearest-even for F4E2M1
Power-of-two scales: E8M0 provides exact scaling without rounding errors
Efficient block storage: Pack multiple values with shared scale factor
NVIDIA compatible packing: F4E2M1x2 matches __nv_fp4x2_e2m1 layout for zero-copy CUDA interop
Pack/unpack utilities: Convert between F4E2M1 slices and packed F4E2M1x2 vectors
Comprehensive API: Conversions, constants, and trait implementations

Quick Start

Add this to your Cargo.toml:

[dependencies]
float4 = "0.2"

Example Usage

use float4::F4E2M1;

// Create from f64
let a = F4E2M1::from_f64(1.5);
assert_eq!(a.to_f64(), 1.5);

// Create from raw bits
let b = F4E2M1::from_bits(0x3); // 0b0011 = 1.5
assert_eq!(b.to_f64(), 1.5);

// Arithmetic operations (via f64 conversion)
let x = F4E2M1::from_f64(2.0);
let y = F4E2M1::from_f64(3.0);
let sum = F4E2M1::from_f64(x.to_f64() + y.to_f64());
assert_eq!(sum.to_f64(), 5.0); // May round to nearest representable value

// Constants
assert_eq!(F4E2M1::MAX.to_f64(), 6.0);
assert_eq!(F4E2M1::MIN.to_f64(), -6.0);
assert_eq!(F4E2M1::EPSILON.to_f64(), 0.5);

Packed Pairs (F4E2M1x2)

Two F4E2M1 values packed into a single byte, matching NVIDIA's __nv_fp4x2_e2m1 layout (lower nibble = first value, upper nibble = second value):

use float4::{F4E2M1, F4E2M1x2, pack, unpack};

// Pack two values into one byte
let pair = F4E2M1x2::new(F4E2M1::from_f64(1.5), F4E2M1::from_f64(-2.0));
assert_eq!(pair.lo().to_f64(), 1.5);
assert_eq!(pair.hi().to_f64(), -2.0);

// Convert from f32 pairs directly
let pair = F4E2M1x2::from_f32_pair(3.0, 0.5);
let (a, b) = pair.to_f32_pair();
assert_eq!(a, 3.0);
assert_eq!(b, 0.5);

// Pack a slice of F4E2M1 values into pairs
let values = vec![
    F4E2M1::from_f64(1.0),
    F4E2M1::from_f64(2.0),
    F4E2M1::from_f64(3.0),
    F4E2M1::from_f64(4.0),
];
let packed = pack(&values);   // [F4E2M1x2(1.0, 2.0), F4E2M1x2(3.0, 4.0)]
let unpacked = unpack(&packed); // [1.0, 2.0, 3.0, 4.0]
assert_eq!(values, unpacked);

Block Format Example

use float4::{F4E2M1, E8M0, MXFP4Block};

// Original data
let data = vec![1.5, -2.0, 0.5, 3.0, 1.0, -0.5];

// Compute scale factor (rounds up to power of 2)
let scale = E8M0::from_f32_slice(&data);
assert_eq!(scale.to_f64(), 4.0); // 3.0 rounds up to 4.0

// Quantize to F4E2M1
let mut quantized = [F4E2M1::from_f64(0.0); 32];
for (i, &value) in data.iter().enumerate() {
    quantized[i] = F4E2M1::from_f64(value as f64 / scale.to_f64());
}

// Pack into block (17 bytes for 32 values vs 128 bytes for f32)
let block = MXFP4Block::from_f32_slice(quantized, scale);

// Retrieve values
let restored = block.to_f32_array();
assert_eq!(restored[0], 1.5);
assert_eq!(restored[1], -2.0);

E8M0 Scale Factors

The E8M0 type represents scale factors as exact powers of two:

use float4::E8M0;

// Exact powers of two are preserved
let scale = E8M0::from(4.0);
assert_eq!(scale.to_f64(), 4.0);

// Non-powers round UP to next power of two
let scale = E8M0::from(3.0);
assert_eq!(scale.to_f64(), 4.0);  // 3.0 → 4.0

let scale = E8M0::from(5.0);
assert_eq!(scale.to_f64(), 8.0);  // 5.0 → 8.0

// Computing scale from data
let data = [1.5, -2.0, 0.5, 3.0];
let scale = E8M0::from_f32_slice(&data);
assert_eq!(scale.to_f64(), 4.0);  // max(|data|) = 3.0 → 4.0

Key characteristics:

Range: 2^-127 to 2^127
Always rounds UP (toward positive infinity)
No rounding errors when scaling by powers of two
Ideal for block quantization schemes

Representable Values

F4E2M1 can exactly represent 16 distinct values:

| Value | Bit Pattern | Type | |-------|-------------|------| | 0.0 | 0000 | Zero | | 0.5 | 0001 | Subnormal | | 1.0 | 0010 | Normal | | 1.5 | 0011 | Normal | | 2.0 | 0100 | Normal | | 3.0 | 0101 | Normal | | 4.0 | 0110 | Normal | | 6.0 | 0111 | Normal | | -0.0 | 1000 | Negative zero | | -0.5 | 1001 | Subnormal | | -1.0 | 1010 | Normal | | -1.5 | 1011 | Normal | | -2.0 | 1100 | Normal | | -3.0 | 1101 | Normal | | -4.0 | 1110 | Normal | | -6.0 | 1111 | Normal |

Special Values

Unlike standard floating point formats, F4E2M1 has no representation for infinity or NaN. These values saturate to the maximum representable value:

use float4::F4E2M1;

assert_eq!(F4E2M1::from_f64(f64::INFINITY).to_f64(), 6.0);
assert_eq!(F4E2M1::from_f64(f64::NEG_INFINITY).to_f64(), -6.0);
assert_eq!(F4E2M1::from_f64(f64::NAN).to_f64(), 6.0);

Related Skills

node-connect

345.9k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

106.4k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

345.9k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

345.9k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。