fix: enable new reactor in the e2e test #1657

Open
wants to merge 15 commits into base: feature/recovery

Conversation

@evan-forbes (Member) commented Mar 6, 2025

Description

adds catchup and proof verification to fix the e2e test. still has some unit testing gaps though

evan-forbes self-assigned this Mar 6, 2025
evan-forbes changed the title Evan/recovery/e2e test fix: enable new reactor in the e2e test Mar 6, 2025
@evan-forbes (Member Author)

need to debug the e2e test after merging

@evan-forbes (Member Author)

Currently, proofs are not sent during catchup. This is fine if the node downloads the compact block; however, when the node doesn't download the compact block, it cannot reconstruct the proofs itself. This occurs when the consensus reactor reconstructs the PartSetHeader from the commit and the node catches up.

To fix this, we need to optionally send proofs in the recovery part and add the ability to ask for proofs in the want message, at least for now while we need to remain backwards compatible.
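A rough sketch of the message shapes this implies, written as plain Go structs (the field names mirror the diff below, but the package name, exact types, and proof encoding are assumptions, not the real proto definitions):

// Sketch only: hypothetical Go mirrors of the wire messages described above.
package sketch

// WantParts asks a peer for specific parts of a proposal. Prove is the new,
// optional flag that additionally requests the merkle proofs for those parts;
// it is only needed during catchup, when the compact block was never
// downloaded and the proofs cannot be reconstructed locally.
type WantParts struct {
	Parts  []byte // compact encoding of the requested part indexes (assumed)
	Height int64
	Round  int32
	Prove  bool // defaults to false so existing peers keep working unchanged
}

// RecoveryPart carries a part back to the requester. Proof stays empty on the
// happy path (the receiver rebuilds it from the compact block) and is only
// populated when the corresponding want set Prove.
type RecoveryPart struct {
	Height int64
	Round  int32
	Index  uint32
	Data   []byte
	Proof  []byte // serialized merkle proof, present only during catchup (assumed)
}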

@evan-forbes evan-forbes marked this pull request as ready for review March 18, 2025 23:16
@evan-forbes evan-forbes requested a review from a team as a code owner March 18, 2025 23:16
@evan-forbes evan-forbes requested review from rootulp and rach-id and removed request for a team March 18, 2025 23:16
Comment on lines +32 to +69
	missing := prop.block.BitArray().Not()
	if missing.IsEmpty() {
		// this should never be hit due to the check above.
		continue
	}
	missingParts := partSet.BitArray().Not()
	wantPart := &proptypes.WantParts{
		Parts:  missingParts,
		Height: targetHeight,
		Round:  round,
	}

	// make requests from different peers
	peers = shuffle(peers)

	for _, peer := range peers {
		mc := missing.Copy()
		reqs, has := peer.GetRequests(height, round)
		if has {
			mc = mc.Sub(reqs)
		}

		if mc.IsEmpty() {
			continue
		}

		e := p2p.Envelope{
			ChannelID: WantChannel,
			Message: &protoprop.WantParts{
				Parts:  *missing.ToProto(),
				Height: height,
				Round:  round,
				Prove:  true,
			},
		}

		if !p2p.TrySendEnvelopeShim(peer.peer, e, blockProp.Logger) { //nolint:staticcheck
			blockProp.Logger.Error("failed to send want part", "peer", peer, "height", height, "round", round)
			continue
		}

		// keep track of which requests we've made this attempt.
		missing.Sub(mc)
		peer.AddRequests(height, round, missing)
Member Author

note that we need to test this function in a unit test, and ensure that we're only actually max-requesting parts once, in a follow-up

Comment on lines +74 to +99
func (blockProp *Reactor) AddCommitment(height int64, round int32, psh *types.PartSetHeader) {
	blockProp.pmtx.Lock()
	defer blockProp.pmtx.Unlock()

	if blockProp.proposals[height] == nil {
		blockProp.proposals[height] = make(map[int32]*proposalData)
	}

	combinedSet := proptypes.NewCombinedPartSetFromOriginal(types.NewPartSetFromHeader(*psh), true)

	if blockProp.proposals[height][round] != nil {
		return
	}

	blockProp.proposals[height][round] = &proposalData{
		compactBlock: &proptypes.CompactBlock{
			Proposal: types.Proposal{
				Height: height,
				Round:  round,
			},
		},
		catchup:     true,
		block:       combinedSet,
		maxRequests: bits.NewBitArray(int(psh.Total * 2)), // this assumes that the parity parts are the same size
	}
}
Member Author

this allows the consensus reactor to add a PartSetHeader each time the node falls behind and then later sees a valid commit
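For illustration, a hypothetical call site, assuming it lives in the same package as the reactor above (onCatchupCommit is not a real function; AddCommitment and types.PartSetHeader come from the diff):

// onCatchupCommit sketches how the consensus reactor could use AddCommitment:
// after reconstructing the PartSetHeader from a commit it has seen for a block
// it never downloaded, it registers that header so the propagation reactor can
// start requesting parts and verifying them against psh.Hash. The early return
// in AddCommitment makes repeated calls for the same (height, round) harmless.
func onCatchupCommit(blockProp *Reactor, height int64, round int32, psh *types.PartSetHeader) {
	blockProp.AddCommitment(height, round, psh)
}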

Comment on lines -185 to -196
	// delete all but the last round for each remaining height except the current.
	// this is because we need to keep the last round for the current height.
	for height := range p.proposals {
		if height == p.currentHeight {
			continue
		}
		for round := range p.proposals[height] {
			if round <= p.currentRound-int32(keepRecentRounds) {
				delete(p.proposals[height], round)
			}
		}
	}
Member Author

simplified pruning by just pruning everything after the consensus reactor commits to a block (instead of pruning after we receive new compact blocks). This makes pruning rounds more difficult, though.
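A minimal sketch of that pruning rule, using a stand-in type (the struct, field names, and pruneAfterCommit are assumptions based on the snippets in this PR; blockCacheSize matches the constant discussed further down):

package sketch

import "sync"

// proposalCache is a stripped-down stand-in for the reactor's proposal cache:
// proposals keyed by height, then by round.
type proposalCache struct {
	pmtx      sync.Mutex
	proposals map[int64]map[int32]struct{} // value type elided for the sketch
}

const blockCacheSize = 5 // number of recent heights to keep

// pruneAfterCommit drops every cached height older than the last
// blockCacheSize committed heights, rounds and all, once the consensus
// reactor commits a block, instead of pruning when new compact blocks arrive.
func (p *proposalCache) pruneAfterCommit(committedHeight int64) {
	p.pmtx.Lock()
	defer p.pmtx.Unlock()

	for height := range p.proposals {
		if height <= committedHeight-int64(blockCacheSize) {
			delete(p.proposals, height)
		}
	}
}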

round: -1, // meaning "latest round"
round: -2, // meaning "latest round"
Member Author

I think we don't need this

func (blockProp *Reactor) handleHaves(peer p2p.ID, haves *proptypes.HaveParts, bypassRequestLimit bool) {
func (blockProp *Reactor) handleHaves(peer p2p.ID, haves *proptypes.HaveParts, _ bool) {
Member Author

note that we should remove the unused arg

Comment on lines -205 to +191
p.wants.AddBitArray(haves)
p.haves.AddBitArray(haves)
Member Author

I don't even know how this was working at all before.

super annoying bug omg

Comment on lines +374 to +392
// TestHugeBlock doesn't have a success or failure condition yet, although one could be added. It is very useful for debugging however
func TestHugeBlock(t *testing.T) {
	p2pCfg := cfg.DefaultP2PConfig()
	p2pCfg.SendRate = 5000000
	p2pCfg.RecvRate = 5000000

	nodes := 20

	reactors, _ := createTestReactors(nodes, p2pCfg, false, "/home/evan/data/experiments/celestia/fast-recovery/debug")

	cleanup, _, sm := state.SetupTestCase(t)
	t.Cleanup(func() {
		cleanup(t)
	})

	prop, ps, _, metaData := createTestProposal(sm, 1, 32, 1000000)

	reactors[1].ProposeBlock(prop, ps, metaData)
}
Member Author

this is only for fun / help debugging. we should expand on it in the future to actually check invariants

Comment on lines -1344 to +1351
stateMachineValidBlock, err := cs.blockExec.ProcessProposal(cs.ProposalBlock)
if err != nil {
cs.Logger.Error("state machine returned an error when trying to process proposal block", "err", err)
return
}
// todo: re-enable after the fast testnet
// stateMachineValidBlock, err := cs.blockExec.ProcessProposal(cs.ProposalBlock)
// if err != nil {
// cs.Logger.Error("state machine returned an error when trying to process proposal block", "err", err)
// return
// }
stateMachineValidBlock := true
Member Author

note that we should re-enable this after the mammoth testnet

Comment on lines +274 to +283
	// todo: we need to figure out a way to get the proof for a part that was
	// sent during catchup.
	proof := cb.GetProof(part.Index)
	if proof == nil {
		if part.Proof == nil {
			blockProp.Logger.Error("proof not found", "peer", peer, "height", part.Height, "round", part.Round, "part", part.Index)
			return
		}
		proof = part.Proof
	}
Member Author

now we get the proofs that are stored in the compact block instead of including them in messages; proofs are only included in messages during catchup
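For context, a minimal sketch of the check that makes the proof necessary, assuming it sits in the same package as the reactor with the upstream crypto/merkle and types imports (verifyRecoveredPart is a hypothetical helper, not code from this PR):

// verifyRecoveredPart shows the invariant being protected: whether the proof
// came from the compact block or was carried in a catchup message, it must tie
// the part bytes back to the PartSetHeader's root hash before the part is
// accepted into the part set.
func verifyRecoveredPart(psh types.PartSetHeader, part *types.Part, proof *merkle.Proof) error {
	return proof.Verify(psh.Hash, part.Bytes)
}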

// todo: add a request limit for each part to avoid downloading the block too
// many times. atm, this code will request the same part from every peer.
func (blockProp *Reactor) retryWants(currentHeight int64, currentRound int32) {
data := blockProp.dumpAll()
Member

hmm, so this supports gaps, right?

// only re-request original parts that are missing, not parity parts.
missing := prop.block.BitArray().Not()
if missing.IsEmpty() {
// this should never be hit due to the check above.
Member

maybe log something here since it's never supposed to happen

// TODO document and explain the parameters
// chunkIndexes creates a nested slice of starting and ending indexes for each
// chunk. totalSize indicates the number of chunks. chunkSize indicates the size
// of each chunk..
Member

Suggested change
// of each chunk..
// of each chunk.
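For what it's worth, a sketch of one plausible reading of the intended behavior (a guess, which is exactly why better docs are needed): split totalSize items into chunks of at most chunkSize and return half-open [start, end) index pairs.

func chunkIndexes(totalSize, chunkSize int) [][2]int {
	if totalSize <= 0 || chunkSize <= 0 {
		return nil
	}
	out := make([][2]int, 0, (totalSize+chunkSize-1)/chunkSize)
	for start := 0; start < totalSize; start += chunkSize {
		end := start + chunkSize
		if end > totalSize {
			end = totalSize
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

// e.g. chunkIndexes(10, 4) -> [[0 4] [4 8] [8 10]]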

var hasStored *types.BlockMeta
if height < p.currentHeight {
hasStored = p.store.LoadBlockMeta(height)
}

cachedProps, has := p.proposals[height]
cachedProp, hasRound := cachedProps[round]

// if the round is less than zero, then they're asking for the latest
Member

Suggested change
// if the round is less than zero, then they're asking for the latest
// if the round is less than -1, then they're asking for the latest

// prune deletes all cached compact blocks for heights less than the provided
// height and round.
//
// todo: also prune rounds. this requires prune in the consensus reactor after
Member

this is pruning rounds too no? since we delete all the proposals for that round

hc.Sub(fullReqs)

if hc.IsEmpty() {
return
Member

this also shouldn't happen since we're checking if the block is complete, maybe write some log

}

reqLimit := 1
Member

so in the happy path, we only send wants once?

@@ -106,7 +104,7 @@ func (blockProp *Reactor) handleHaves(peer p2p.ID, haves *proptypes.HaveParts, b
},
}

if !p2p.SendEnvelopeShim(p.peer, e, blockProp.Logger) { //nolint:staticcheck
if !p2p.TrySendEnvelopeShim(p.peer, e, blockProp.Logger) { //nolint:staticcheck
Member

since we're not making sure the message is sent, does it make sense to increase the request limit to 2 for additional redundancy?

@@ -17,8 +17,7 @@ import (
)

const (
// TODO: set a valid max msg size
maxMsgSize = 1048576
maxMsgSize = 4194304 // 4MiB
Member

any rationale for 4MiB?

// blockCacheSize determines the number of blocks to keep in the cache.
// After each block is committed, only the last `blockCacheSize` blocks are
// kept.
blockCacheSize = 5
Member

IMO, this should be increased. Assuming 128MB blocks, this will hold roughly 800MB of RAM (including proofs and other overhead), which is still small.

The issue is the block time of ~3s: this only keeps the compact blocks from the last ~15s. If a node falls behind for more than 15s (which I think is quite likely to happen), will block sync be started automatically? I don't think that's the case, and this will leave the node hanging if all peers prune.

Maybe increase it to 25, which would keep ~4GB of RAM, but that's fine IMO.
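A quick back-of-envelope check of those numbers (the block size, overhead factor, and block time are assumptions taken from this comment, not measurements):

package main

import "fmt"

func main() {
	const (
		blockSizeMB    = 128.0 // assumed max block size
		overheadFactor = 1.25  // rough allowance for proofs and other bookkeeping
		blockTimeSec   = 3.0   // assumed block time
	)
	for _, cacheSize := range []int{5, 25} {
		ramMB := float64(cacheSize) * blockSizeMB * overheadFactor
		windowSec := float64(cacheSize) * blockTimeSec
		fmt.Printf("blockCacheSize=%d -> ~%.0f MB of RAM, ~%.0fs of history\n",
			cacheSize, ramMB, windowSec)
	}
}

which gives roughly 800 MB and 15s of history for 5, versus roughly 4 GB and 75s for 25.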

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: None yet

2 participants