SkillAgentSearch skills...

Dv5das

Data Availability Sampling (DAS) on a Discovery-v5 DHT overlay

Install / Use

/learn @protolambda/Dv5das
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Implementing Data Availability Sampling (DAS)

There's a lot of history to unpack here. Vitalik posted about the "Endgame": where ethereum could be headed, assuming MEV is fundamental. Scaling data-availability is one of the fundamental endgame requirements.

DAS is not new though, there is a lot of background.

A brief summary:

So "wen data-availability sampling?" Well, we need to solve the network layer still. This is a new proposal how it could work.

Prototypes

The DAS-in-practice doc made a start, and a call with eth2 implementers was hosted in October 2020 to talk about implementation. Then Dec 1 (2020) the beacon-chain launched, followed by holidays: no time from implementers for bleeding edge work end of 2020.

There were some educated guesses for an initial design direction though:

  • Random sampling results in high subscription churn in GossipSub.
  • Sampling needs to be very fast, DHTs are slow, right?
  • GossipSub can handle attestation load, surely we can handle sampling if we avoid the churn?

And so the idea of a push-variant started: everyone rotates random subscriptions to sample-specific GossipSub topics slowly, to avoid churn. Then seed the subnets via the shard topic (horizontal), and propagate samples to subscribers (vertical).

I was too optimistic, and started putting together a libp2p simulation of this GossipSub approach: eth2-das.

The networking approach seemed ok-ish, but doubts started to arise around the altruism:

  • How do we serve historical shard-data?
  • How do we make sure sampling works past just the most recent things?
  • If you propagate on 1000+ different topics, and the network is random and peer-fan is low, don't we eat a lot of undesired bandwidth?

With peer-scoring and banning, and frequent enough tiny tasks for the scores to make sense, you may have some chance it works. Still, if the networking is too heavy, altruism with peer-scoring is not going to work. The prototype was then abandoned.

KZG libs

Enough other problems to solve for implementers though, like using actual KZG proofs in prototypes.

So in January I ported over the python KZG10/FK20 code to optimized Go code: Go-KZG. We formed a small working-group, and more ports, based on the Python and Go code, were created:

  • C-KZG, using BLST by Ben Edington (Teku team). FK20, with optimized go code, then written in C, and hooked to the fasted BLS library. Amazing :clap:
  • JC-KZG, java bindings to C-KZG, developed by Anton Nashatyrev (Teku team).
  • Constantine (research code), KZG FFTs and proofs ported to Nim by Mamy Ratsimbazafy (Nimbus team).
  • KZG in rust, Rust port by students of Saulius Grigaitis / SIFRAI-tech (Grandine team).

Revisiting the DAS network problem

Theory and crypto-libs are there, and the Merge is almost here (Catalyst, Rayonism, Amphora, Kintsugi, Kiln), now back to the networking problem.

The protocol has two independent identity layers, because of privacy and redundancy: (everyone can setup their validators however, wherever and act however they like):

  • consensus layer (validator entities signing stuff)
  • network layer (nodes hosting validators)

To the consensus layer, DAS only needs to be accurate. The data that is available must not be ignored, or we get an unhealthy chain.

The network layer serves the above, requiring:

  • Fast sample responses (keep accuracy for best case)
  • Resilience (keep accuracy for worst case, e.g. DoS attacks and maintain randomness for security)
  • Low resources (keep accuracy for lazy case, no excuses not to run it)

Also note that data-availability only requires data to be published to anyone unconditionally, not to persist it forever (incentivized honest nodes can keep storing it, as long as they can keep getting it).

Some ideas to create these properties:

  • Reduce the number of hops for fast samples queries, and saving resources
  • No small meshes as requester, to preserve sample randomness
  • Distribute the samples as much as possible, the uniform random sampling will load-balance the work
  • Distributing helps offload privacy/DoS risk from data publishers
  • Reduce the overhead per sample as much as possible, to save resources
  • Rotate out data slow enough to serve queries, but fast enough to keep the relevant

Does this sound familiar? Discv5 solves a very similar problem!

What discv5 offers:

  • The node tables can be large (16 x 16 = 256 records, plus replacements queue) and the network is not that much larger (rough estimate: 10k validating nodes, = approx 40 times the size). Number of hops will be low if some type of DHT search even works only a little bit.
  • Only UDP communication: effective with resource to keep that many peers.
  • Information distribution is similar: every record has a tiny signature to authenticate the data. Samples have a tiny proof to verify.
  • Just like sending a record in a single packet, send a sample with a single packet.
  • Discovery so far is used to retrieve records at random locations, and we do not need to advertise locations, they can be deterministic based on the sample hash
  • Built to scale horizontally, and scale through participation

Caveat: discv5 is bigger than ethereum discovery, not all nodes may want to answer sample requests. We'll need some type of overlay of ethereum-only discv5 nodes.

Extending Discv5

But how does it fit discv5?

Sampling

Discv5 offers TALKREQ/TALKRESP to extend the protocol with application-layer messaging. Not all discv5 implementations support this yet, but it is part of the spec and we can start with it in Go.

Although the TALKREQ/TALKRESP are intended for initial handshakes / negotiation of extensions, the communication in this protocol is infrequent and simple enough that TALKREQ/TALKRESP should work fine (open for suggestions however).

The primary roles in the DHT are:

  1. Seed the DHT with sample bundles as publisher
  2. Fan-out the bundles as relayer
  3. Query the DHT for samples as searcher
  4. Respond to queries as host

And the samples all get hashed to an ID in the DHT: H(fork_digest, randao_mix, data_id, sample_index) -> node_id:

  • fork_digest (f) - Bytes4, fork digest to separate networks
  • randao_mix (r) - Bytes32, is randomness picked based on the sample time: this prevents mining of node identities close to a specific sample far in the future
  • data_id (x) - Bytes48, is the KZG commitment of the data: samples are unique to the data they are part of
  • sample_index (i) - uint64 (little endian), identifies the sample

The fork_digest and randao_mix can be determined based on time (slot t), and are not communicated or stored in the sample protocol.

Samples are stored in a key-value store. Key: t,x,i (f and r are determined based on slot t) Value: sample data points and proof

Seeding

Generally we are not worried about the cost of the publisher as much: a block builder has more resources and has incentive to avoid orphaning their block because of missed/late publishing.

Before publishing the data, nodes should have the expected KZG commitment to all of it (proposer signed KZG commitment from builder on global topic).

The builder then generates the samples 0,...,N and hashes them to their DHT location H(f,r,x,0),...,H(f,r,x,N), and distributes them in bundles (see )

Each node that receives a sample can verify the sample proof against the global KZG commitment, and limit spam by e.g. testing if the message is close enough to their identity.

Nodes that receive the samples can propagate them further to other nodes in the DHT radius of the sample: some redundancy is balances load and speeds up search.

Sybil attacks to capture the first distributed samples may be a problem, but a builder can score their peers, select long-lived peers,

View on GitHub
GitHub Stars35
CategoryDevelopment
Updated8mo ago
Forks2

Languages

Go

Security Score

82/100

Audited on Jul 17, 2025

No findings