Juzfs

A comprehensive implementation of "The Google File System" paper in Rust.

Install / Use

/learn @pixperk/Juzfs

README

juzfs

A complete implementation of the Google File System in Rust. Raw TCP, no frameworks, no shortcuts.

For a detailed breakdown of the GFS architecture, see The Google File System: A Detailed Breakdown.

Why GFS

GFS was built around assumptions that most filesystems ignore: files are huge, reads are sequential, writes are almost always appends, and disks fail constantly. Rather than fighting these realities with a general-purpose POSIX layer, Google designed around them. One master, big chunks, relaxed consistency for appends. juzfs takes this same approach and implements it in Rust with async I/O.

Architecture

GFS Architecture

Three components, all talking raw TCP:

Master

The single metadata server. Holds the file namespace (filenames to chunk handle lists), chunk metadata (version, primary lease, replica locations), and the state of every registered chunkserver. Everything lives in memory behind RwLocks. The master never sees file data -- clients ask it "where does this chunk live?", get an answer, and go talk to chunkservers directly. Keeping the master off the data path is the core scalability trick in GFS.

Chunk locations are deliberately not persisted. On restart, the master knows nothing about which chunkserver holds what. As chunkservers boot and heartbeat in, they report their full chunk inventory with version numbers. The master rebuilds its location map from these reports. This is more robust than persisting locations because the chunkservers are always the ground truth about what's physically on their disks.
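The rebuild can be sketched as a map keyed by chunk handle, repopulated as inventories arrive. This is illustrative only: `LocationMap` and `report_inventory` are assumed names, and handles and versions are modeled as plain integers.

```rust
use std::collections::HashMap;

/// Illustrative location map, rebuilt purely from heartbeat reports.
#[derive(Default)]
struct LocationMap {
    /// chunk handle -> (chunkserver address, reported version)
    locations: HashMap<u64, Vec<(String, u64)>>,
}

impl LocationMap {
    /// Called when a chunkserver heartbeats in with its full inventory
    /// of (handle, version) pairs.
    fn report_inventory(&mut self, server: &str, inventory: &[(u64, u64)]) {
        for &(handle, version) in inventory {
            let replicas = self.locations.entry(handle).or_default();
            // Drop any earlier report from this server, then record the new one.
            replicas.retain(|(s, _)| s != server);
            replicas.push((server.to_string(), version));
        }
    }

    fn replicas(&self, handle: u64) -> Vec<&str> {
        self.locations
            .get(&handle)
            .map(|v| v.iter().map(|(s, _)| s.as_str()).collect())
            .unwrap_or_default()
    }
}
```

Because each report replaces the server's previous entry, a restarted master converges on the true layout after one heartbeat round.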

All metadata mutations (file creation, chunk allocation, lease grants) are logged to an append-only operation log before being applied to memory. The master recovers from this log on startup, and periodically snapshots its state to a checkpoint file for faster recovery. More on this in the Operation Log and Checkpointing sections.
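The ordering constraint, log first and apply second, is the whole point. A minimal sketch, with a line-per-record text log standing in for the real binary format and a `Vec<String>` standing in for the master's metadata maps:

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};

/// Log-then-apply: the record must be durable on disk before the in-memory
/// metadata changes, so a crash can always be replayed from the log.
fn log_then_apply(path: &str, record: &str, state: &mut Vec<String>) -> io::Result<()> {
    let mut log = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(log, "{}", record)?;
    log.sync_all()?; // flushed to disk first...
    state.push(record.to_string()); // ...only then applied to memory
    Ok(())
}
```

Reversing the two steps would allow a crash window where memory holds a mutation the log never saw, which recovery could not reproduce.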

Chunkservers

The data layer. Each chunkserver manages a local directory with three types of files per chunk:

chunks/00000001.chunk    -- raw data, up to 64MB
checksums/00000001.csum  -- CRC32 per 64KB block, packed big-endian
versions/00000001.ver    -- chunk version number

Chunks default to 64MB (configurable via CLI for testing). Every 64KB block gets a CRC32 checksum computed on write and verified on every read. A mismatch means corruption, and the read fails so the client can try another replica.
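A sketch of the per-block scheme, with a bitwise CRC32 (IEEE polynomial) standing in for whatever CRC implementation juzfs actually uses:

```rust
/// Bitwise CRC32 (IEEE polynomial); a table-driven or hardware CRC would
/// normally be used, this is just for illustration.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    !crc
}

const BLOCK: usize = 64 * 1024;

/// Computed once per 64KB block on write, stored alongside the chunk.
fn block_checksums(data: &[u8]) -> Vec<u32> {
    data.chunks(BLOCK).map(crc32).collect()
}

/// On read, recompute and compare; any mismatch means the block is corrupt
/// and the read must fail so the client can try another replica.
fn verify(data: &[u8], stored: &[u32]) -> bool {
    block_checksums(data) == stored
}
```

Checksumming per 64KB block rather than per chunk means a read of a few blocks only has to verify those blocks, not 64MB.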

For writes, chunkservers maintain an LRU push buffer (32 entries). Data arrives in phase 1 of the write protocol and sits in memory until the primary triggers a commit in phase 2. A monotonic AtomicU64 serial counter on the primary ensures all replicas apply mutations in identical order.

Every 5 seconds, each chunkserver heartbeats to the master with its chunk list (handles + versions) and real available disk space (calculated from actual file sizes, not hardcoded).

Client

A Rust library (src/client.rs) that handles all the protocol complexity. It caches chunk metadata locally with a 30-second staleness window to avoid redundant master lookups. The client manages the full read, write, and append flows, including transparent retry when a chunk fills up during record append.
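A sketch of what such a cache might look like; the `ChunkCache` type and its field names are assumptions, not the actual client internals.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const STALENESS: Duration = Duration::from_secs(30);

/// Illustrative client-side cache: entries older than the staleness window
/// count as misses, forcing a fresh lookup against the master.
struct ChunkCache {
    entries: HashMap<u64, (Vec<String>, Instant)>, // handle -> (replicas, fetched_at)
}

impl ChunkCache {
    fn new() -> Self {
        ChunkCache { entries: HashMap::new() }
    }

    fn get(&self, handle: u64) -> Option<&[String]> {
        match self.entries.get(&handle) {
            Some((replicas, fetched_at)) if fetched_at.elapsed() < STALENESS => {
                Some(replicas.as_slice())
            }
            _ => None, // miss or stale: caller goes back to the master
        }
    }

    fn put(&mut self, handle: u64, replicas: Vec<String>) {
        self.entries.insert(handle, (replicas, Instant::now()));
    }
}
```

The window trades a small chance of talking to a moved replica (which fails and triggers a refetch) against one fewer master round-trip per operation.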

Protocol

Custom TCP framing with magic bytes and length-prefixed payloads:

[magic: 2B "JF"][version: 1B][msg_type: 1B][payload_len: 4B][payload]

The framing code is compact. Here's the core of send_frame:

const MAGIC: [u8; 2] = [0x4A, 0x46]; // 'J' 'F'
const VERSION: u8 = 1;

pub async fn send_frame<T: Serialize>(
    stream: &mut TcpStream, msg_type: MessageType, payload: &T,
) -> io::Result<()> {
    let body = bincode::serialize(payload)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    let mut header = [0u8; 8];
    header[0..2].copy_from_slice(&MAGIC);
    header[2] = VERSION;
    header[3] = msg_type as u8;
    header[4..8].copy_from_slice(&(body.len() as u32).to_le_bytes());
    stream.write_all(&header).await?;
    stream.write_all(&body).await?;
    Ok(())
}

Magic bytes 0x4A 0x46 catch misframed connections immediately. The msg_type byte routes messages:

| msg_type | Direction                  |
|----------|----------------------------|
| 1        | Client to Master           |
| 2        | Master to Client           |
| 3        | ChunkServer to Master      |
| 4        | Master to ChunkServer      |
| 5        | Client to ChunkServer      |
| 6        | ChunkServer to Client      |
| 7        | ChunkServer to ChunkServer |
| 8        | ChunkServer Ack            |

Payloads are bincode v1. Each direction has its own serde enum, so extending the protocol is just adding a variant.
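The receive side mirrors the framing above. As a sketch, header validation can be a pure function; the tokio read half and the bincode decode are omitted, and `parse_header` is an illustrative name rather than the actual juzfs API.

```rust
const MAGIC: [u8; 2] = [0x4A, 0x46]; // 'J' 'F'
const VERSION: u8 = 1;

/// Validate the fixed 8-byte header and extract routing info:
/// (msg_type, payload_len). A bad magic fails immediately, before any
/// payload bytes are read.
fn parse_header(header: &[u8; 8]) -> Result<(u8, u32), String> {
    if header[0..2] != MAGIC {
        return Err("bad magic: misframed connection".to_string());
    }
    if header[2] != VERSION {
        return Err(format!("unsupported protocol version {}", header[2]));
    }
    let msg_type = header[3];
    let payload_len = u32::from_le_bytes([header[4], header[5], header[6], header[7]]);
    Ok((msg_type, payload_len))
}
```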

Reads

The master is never on the read path. The client asks it for metadata once, then reads directly from chunkservers.

  1. GetFileChunks returns the ordered list of chunk handles for a file
  2. GetChunkLocations returns which chunkservers hold each chunk
  3. Client sends Read { handle, offset, length } directly to a chunkserver
  4. If that replica fails (network error, checksum mismatch), try the next one

For multi-chunk reads, the client calculates which chunks the byte range spans and issues per-chunk reads with the correct local offsets. If the requested length exceeds what the chunk actually has, the chunkserver clamps it and returns what's available.
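The offset math can be sketched as a pure planning function (illustrative, not the actual client code): each sub-read stops at the next chunk boundary.

```rust
const CHUNK_SIZE: u64 = 64 * 1024 * 1024;

/// Split a file-level (offset, length) read into per-chunk sub-reads of
/// (chunk_index, local_offset, length).
fn plan_read(offset: u64, length: u64) -> Vec<(u64, u64, u64)> {
    let mut reads = Vec::new();
    let (mut off, mut remaining) = (offset, length);
    while remaining > 0 {
        let local = off % CHUNK_SIZE;
        let take = remaining.min(CHUNK_SIZE - local); // stop at the chunk boundary
        reads.push((off / CHUNK_SIZE, local, take));
        off += take;
        remaining -= take;
    }
    reads
}
```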

Streaming reads pipe chunks through a tokio mpsc channel with a 4-chunk lookahead buffer. A background task fetches chunks in order and feeds them into the channel. The consumer processes data as it arrives without loading the entire file into memory.
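The same lookahead idea can be shown with std threads and a bounded channel instead of tokio's mpsc: a `sync_channel` of capacity 4 lets the fetcher run at most four chunks ahead of the consumer.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Backpressured streaming: the fetcher blocks on `send` once four chunks
/// are buffered, so memory stays bounded regardless of file size.
fn stream_chunks(chunk_ids: Vec<u64>) -> Vec<u64> {
    let (tx, rx) = sync_channel::<u64>(4); // 4-chunk lookahead buffer
    let fetcher = thread::spawn(move || {
        for id in chunk_ids {
            // In juzfs this would be a network read of chunk `id`.
            tx.send(id).unwrap();
        }
        // Dropping `tx` here closes the channel and ends the consumer loop.
    });
    let out: Vec<u64> = rx.iter().collect(); // consumer drains in order
    fetcher.join().unwrap();
    out
}
```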

Writes

Writes separate data flow from control flow using a two-phase protocol. This is the most interesting part of the design.

Write and Mutation Flow

sequenceDiagram
    participant C as Client
    participant P as Primary
    participant S1 as Secondary 1
    participant S2 as Secondary 2

    Note over C,S2: Phase 1 -- data pipeline
    C->>P: ForwardData(handle, data, [S1, S2])
    P->>S1: ForwardData(handle, data, [S2])
    S1->>S2: ForwardData(handle, data, [])

    Note over C,S2: Phase 2 -- primary coordinates commit
    C->>P: Write(handle, [S1, S2])
    P->>P: serial=next(), flush to disk
    P->>S1: CommitWrite(handle, serial)
    S1-->>P: ack
    P->>S2: CommitWrite(handle, serial)
    S2-->>P: ack
    P-->>C: Ok

Phase 1 pushes data through a chunkserver chain. The client sends to the first node, which buffers and forwards to the next, and so on. Data flows in a single network pass rather than the client uploading to each replica separately. All nodes buffer in memory (the LRU push buffer), nothing touches disk yet. If any node in the chain fails to forward, the error propagates back through the entire chain to the client via ChunkServerAck::Error:

// chunkserver buffers data, then forwards to next in chain
cs.buffer_push(handle, data.clone()).await;

let mut chain_err: Option<String> = None;
if let Some((next, rest)) = remaining.split_first() {
    let fwd = ChunkServerToChunkServer::ForwardData {
        handle, data, remaining: rest.to_vec(),
    };
    match TcpStream::connect(next).await {
        Ok(mut next_conn) => {
            // send and read downstream ack...
        }
        Err(e) => {
            chain_err = Some(format!("forward to {} unreachable: {}", next, e));
        }
    }
}
match chain_err {
    Some(e) => ChunkServerAck::Error(e),
    None => ChunkServerAck::Ok,
}

Phase 2 is where ordering happens. The client tells the primary to commit. The primary assigns a monotonic serial number, flushes its own buffer to disk, then sends CommitWrite to each secondary with that serial. Secondaries flush and ack. Only after all replicas confirm does the primary ack the client.

The serial number is the consistency mechanism. Two concurrent writes to the same chunk get serialized by the primary. Every replica applies them in serial order, so they converge to identical state.
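A minimal sketch of that mechanism, assuming a global counter on the primary and replicas that apply buffered mutations in serial order (names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// The primary's monotonic counter; every commit gets the next serial.
static SERIAL: AtomicU64 = AtomicU64::new(0);

fn next_serial() -> u64 {
    SERIAL.fetch_add(1, Ordering::SeqCst)
}

/// A replica applies mutations strictly in serial order, so any two
/// replicas seeing the same (serial, data) pairs converge to the same
/// bytes regardless of network arrival order.
fn apply_in_order(mut pending: Vec<(u64, &str)>) -> String {
    pending.sort_by_key(|&(serial, _)| serial);
    pending.into_iter().map(|(_, data)| data).collect()
}
```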

Leases

Before any write, the client asks the master for a primary via GetPrimary. The master grants a 60-second lease to one chunkserver. The same lease is reused for all writes during that window. No round-trip to the master per write, just per lease period.

When a lease expires and a new one is needed, the master bumps the chunk version, notifies all replicas via UpdateVersion, and returns the new primary. This version bump is how stale replicas get caught (more on that below).
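A sketch of the lease-plus-version bookkeeping on the master; `ChunkMeta` and `get_primary` are assumed names, not the real juzfs structs.

```rust
use std::time::{Duration, Instant};

const LEASE_DURATION: Duration = Duration::from_secs(60);

/// Illustrative per-chunk state held by the master.
struct ChunkMeta {
    version: u64,
    lease: Option<(String, Instant)>, // (primary address, granted_at)
}

impl ChunkMeta {
    /// Reuse a live lease; otherwise bump the version and grant a new one.
    fn get_primary(&mut self, candidate: &str) -> (String, u64) {
        if let Some((primary, granted_at)) = &self.lease {
            if granted_at.elapsed() < LEASE_DURATION {
                return (primary.clone(), self.version); // same lease window
            }
        }
        self.version += 1; // replicas that miss UpdateVersion are now detectably stale
        self.lease = Some((candidate.to_string(), Instant::now()));
        (candidate.to_string(), self.version)
    }
}
```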

Record Append

The dominant write pattern in GFS. The client provides only data, no offset. The primary decides where to place it. Multiple clients can append concurrently to the same file without any coordination between them.

Record Append

The flow:

  1. Client finds the last chunk of the file and sends Append { handle, data, secondaries } to its primary
  2. Primary checks: does current_size + data_len fit within the chunk size?
  3. Fits: primary sets offset = current_size, appends locally, sends CommitAppend with the same offset and data to all secondaries. Returns AppendOk { offset } to the client.
  4. Doesn't fit: primary pads the chunk to exactly 64MB on all replicas and returns RetryNewChunk. The client allocates a new chunk via the master, invalidates its metadata cache, and retries. This loop is transparent to the caller.

The padding matters. Without it, a half-full chunk on one replica could accept a conflicting append from a different client, breaking cross-replica consistency. Padding seals the chunk on all replicas simultaneously.
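The primary's fits-or-pads decision from the steps above can be sketched as follows; `AppendDecision` and `decide_append` are illustrative names, not the real juzfs types.

```rust
const CHUNK_SIZE: u64 = 64 * 1024 * 1024;

/// What the primary decides for an append request.
#[derive(Debug, PartialEq)]
enum AppendDecision {
    /// Fits: commit at the current end of the chunk on every replica.
    Commit { offset: u64 },
    /// Doesn't fit: pad to CHUNK_SIZE on every replica, sealing the chunk;
    /// the client retries against a freshly allocated chunk.
    PadAndRetry { pad: u64 },
}

fn decide_append(current_size: u64, data_len: u64) -> AppendDecision {
    if current_size + data_len <= CHUNK_SIZE {
        AppendDecision::Commit { offset: current_size }
    } else {
        AppendDecision::PadAndRetry { pad: CHUNK_SIZE - current_size }
    }
}
```

Note the record is never split across chunks: either it fits whole, or the remainder of the chunk is sacrificed as padding.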

Chunk Versioning

Every chunk starts at version 0. The version increments each time the master grants a new lease for that chunk. This is how the system detects replicas that missed mutations: a chunkserver that was down during a lease grant still reports the old version in its heartbeat, so the master can recognize the replica as stale and stop handing it out to clients.

No findings