Skip to content

rotation of outbound peers#3037

Merged
pompon0 merged 70 commits intomainfrom
gprusak-stablehash
Mar 19, 2026
Merged

rotation of outbound peers#3037
pompon0 merged 70 commits intomainfrom
gprusak-stablehash

Conversation

@pompon0
Copy link
Copy Markdown
Contributor

@pompon0 pompon0 commented Mar 6, 2026

To make the p2p network uniformly random, nodes need to be able to change their peers periodically (otherwise network would be centralized at bootstrap peers). This PR implements an algorithm based on stable hashing, which makes nodes assign random priority to every node ID, and connect to peers with higher priority.

This pr also separates inbound and outbound connection pools to simplify logic. In particular it is possible that there will be 2 concurrent connections between 2 peers (inbound + outbound). Avoiding duplicate connections is best effort - nodes try not to dial peers that they are connected to, but in case 2 peers dial each other at the same time they let those 2 connections be.

Basic requirements:

  • peermanager maintains a pex table: addresses of peers of our peers. This bounds the network view size (even though it is an unauthenticated network) that our node needs to maintain and provides enough exposure to select new peers. Peers are periodically reporting their connections, therefore peermanager keeps only fresh addresses
  • on startup a spike of dials is expected - node will try to connect to peers that it was connected to before restart
  • during stable operation node will dial peers at a low rate like 1/s or 0.1/s, and the node should be selected from a fresh set of addresses - i.e. we cannot snapshot a list of currently available addresses and try to dial them all (it will take hours)
  • despite low dial rate, node should attempt to round robin over the ever changing peer candidates set - i.e. it should not get stuck dialing the same bad address over and over
  • in the stable-hash-based approach, each peer ID obtains a priority for dialing - it should be taken into account
  • implementation should support replacing low priority peers with higher priority peers (to support convergence to a random graph). The churn of the connections should be low though, so that connection efficiency is not affected. My initial guesstimate would be that we should allow replacing a connection every ~1min.
  • We need to support a pex table with ~100 * 100 = 10k addresses total (100 connections per peer is a safe estimate with the current implementation). Whether we can affort just rank all the addresses on every dial attempt is a borderline IMO.
  • the addresses inserted to pex table should be made available for dialing ASAP, without any active polling, if possible.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 6, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMar 19, 2026, 2:52 PM

Comment on lines +28 to +37
for id := range s.last {
if _,ok := GetAny(conns,id); !ok {
delete(s.last, id)
update = PeerUpdate{
NodeID: id,
Status: PeerStatusDown,
}
return true
}
}

Check warning

Code scanning / CodeQL

Iteration over map Warning

Iteration over map may be a possible source of non-determinism
Copy link
Copy Markdown
Contributor

@wen-coding wen-coding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I think my main concern is we are trusting PEX from any connected peer blindly, but we don't have to fix that in this PR.

}
// We have found a new best candidate.
best = utils.Some(addr.pNodeID())
clear(bestAddrs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So an attacker doesn't know your seed, but it can inject fake addresses and try its luck on the fake addresses being high priority for someone, and then we will periodically try the fake addresses?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, also there is nothing much to do here - the attacker can do sybil attack and sybil peers are not distinguishable from other peers.

}))
for endpoint := range endpointSet {
c, err := utils.WithOptTimeout1(ctx, r.options.DialTimeout, func(ctx context.Context) (tcp.Conn, error) {
return tcp.Dial(ctx, endpoint.AddrPort)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if an attacker tries to use you to launch DoS attack, it can tell you IP of validator B belongs to node A. Of course you will find out later after dialing succeeds and you find out this is node B not node A, but we don't remove the address right? So the attacker can kind of delays your connection to A, and make you launch a mini dos attack to B?
We might need some feedback if we see a node mismatch, also we might want to continue dialing other IP of A? (You don't have to do that in this PR though)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dial all the IPs of A that we know of (see the outer loop). Attacker can slightly delay us connecting to A, but it doesn't matter. There is not much we can do about the fact that peer can send us mismatching (node key, ip) pairs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can punish sender of the PEX if it gives us random garbage. Maybe the sender should only give us outbound connection so they are sure the IP:port is correct. But we don't have to do that until later.

Copy link
Copy Markdown
Contributor Author

@pompon0 pompon0 Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can punish sender of the PEX if it gives us random garbage

what if peer of our peer is the attacker though and presents us with a different identity than when it did to our peer?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I guess there are two layers:

  1. I care about staked validators a lot, and I trust staked validators from staked peers
  2. For other non-staked validators I don't care that much, this is best effort. The important ones I care are in my persist list always.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that, we should treat staked nodes separately. For that we need to have a way to distinguish them first.

@pompon0 pompon0 enabled auto-merge March 18, 2026 19:08
@pompon0 pompon0 added this pull request to the merge queue Mar 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026
@pompon0 pompon0 enabled auto-merge March 19, 2026 12:48
@pompon0 pompon0 added this pull request to the merge queue Mar 19, 2026
Merged via the queue into main with commit cc96111 Mar 19, 2026
38 checks passed
@pompon0 pompon0 deleted the gprusak-stablehash branch March 19, 2026 13:54
@pompon0
Copy link
Copy Markdown
Contributor Author

pompon0 commented Mar 19, 2026

/backport

@seidroid
Copy link
Copy Markdown

seidroid bot commented Mar 19, 2026

Created backport PR for release/v6.4:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-3037-to-release/v6.4
git worktree add --checkout .worktree/backport-3037-to-release/v6.4 backport-3037-to-release/v6.4
cd .worktree/backport-3037-to-release/v6.4
git reset --hard HEAD^
git cherry-pick -x cc96111b94678fe786ce5fbc27ee8ed0ba155cf1
git push --force-with-lease

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants