This issue is probably better split into sub-issues, but for now I just want to provide a high-level overview of tasks we MAY need to implement for Anchor.
These are things currently in the Lighthouse peer manager that might need a new version or a port to Anchor.
PeerDB
I understand there is a peer store in libp2p now, but let me highlight the features of the Lighthouse version, in case we need to add any of these.
The PeerDB in Lighthouse stores a list of peers and their state/info. It is the source of truth for everything we know about our peers, both previously and currently connected. It acts as a kind of LRU cache, but with a separate cache for each of three states.
Connected Peers - This is unbounded at the PeerDB level, but is limited by the number of connections we can have in libp2p. For obvious reasons we store which peers we are connected to.
Disconnected Peers - We keep a cache (500-1000, configurable via a const) of recently disconnected peers. I don't know if the peer-store does this. We keep this for scoring: if a peer immediately disconnects from us and re-connects, we don't want it to reset its score, i.e. a peer can't be naughty, disconnect, re-connect and be considered perfectly fine. It is also handy to have a list of recently disconnected peers for debugging purposes. We can also use this list to reconnect to peers that may be useful to us (i.e. they are on a subnet we need).
Banned Peers - There is also a cache of roughly the 500 most recently banned peers. When a peer gets banned, we want to keep track of it (and its IP) so that we can prevent future connections. This prevents a banned peer from reconnecting for a period of time (scoring determines how long; more on that later).
I think these features are nice: we can configure the size of these caches and bound them based on memory usage.
Main code location is here: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/peer_manager/peerdb.rs
PeerInfo
This is the struct holding the kind of information we want to store for each peer. It contains general things like seen IP addresses, multiaddrs and ENRs, as well as more specific info; for Anchor we might want to know the operator id or other qbft/ssv-specific things. It's here: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/peer_manager/peerdb/peer_info.rs#L22
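As a very rough sketch of how these two pieces could fit together in Anchor (the type names, cache bounds and PeerInfo fields below are illustrative assumptions, not the Lighthouse or libp2p API):

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative stand-ins; a real implementation would use libp2p::PeerId,
// Multiaddr, Enr, etc.
type PeerId = String;

#[derive(Default, Clone)]
struct PeerInfo {
    seen_addresses: Vec<String>, // multiaddrs / IPs we have observed
    operator_id: Option<u64>,    // hypothetical SSV-specific field
    score: f64,                  // carried across disconnects
}

// Assumed cache bounds; in Lighthouse these are configurable consts.
const MAX_DISCONNECTED_PEERS: usize = 1000;
const MAX_BANNED_PEERS: usize = 500;

#[derive(Default)]
struct PeerDb {
    /// Unbounded here; bounded in practice by libp2p's connection limits.
    connected: HashMap<PeerId, PeerInfo>,
    /// Recently disconnected peers, oldest first, so state survives a quick
    /// disconnect/reconnect cycle.
    disconnected: VecDeque<(PeerId, PeerInfo)>,
    /// Recently banned peers (and their info/IPs) so we can refuse reconnects.
    banned: VecDeque<(PeerId, PeerInfo)>,
}

impl PeerDb {
    fn on_connected(&mut self, peer: PeerId) {
        // Restore any state kept from a recent disconnect, so a peer cannot
        // reset its score by disconnecting and reconnecting.
        let idx = self.disconnected.iter().position(|(id, _)| *id == peer);
        let info = match idx {
            Some(i) => self
                .disconnected
                .remove(i)
                .map(|(_, info)| info)
                .unwrap_or_default(),
            None => PeerInfo::default(),
        };
        self.connected.insert(peer, info);
    }

    fn on_disconnected(&mut self, peer: &PeerId) {
        if let Some(info) = self.connected.remove(peer) {
            self.disconnected.push_back((peer.clone(), info));
            // Bound the cache by dropping the oldest entries.
            while self.disconnected.len() > MAX_DISCONNECTED_PEERS {
                self.disconnected.pop_front();
            }
        }
    }

    fn on_banned(&mut self, peer: PeerId) {
        let info = self.connected.remove(&peer).unwrap_or_default();
        self.banned.push_back((peer, info));
        while self.banned.len() > MAX_BANNED_PEERS {
            self.banned.pop_front();
        }
    }

    fn is_banned(&self, peer: &PeerId) -> bool {
        self.banned.iter().any(|(id, _)| id == peer)
    }
}
```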
Scoring
This is a bigger and more pervasive topic. A few years ago I wrote an overview. Essentially, we want to score peers based on what we observe. There are two main uses:
DOS and attack prevention - Malicious nodes can be identified, kicked and banned to prevent a whole suite of attacks
Performance - We can identify peers that are slow, faulty or serving us invalid data and remove them in favour of other peers
We want a useful API so that most parts of the codebase can score a peer. Here are some simple examples:
Someone sends a malformed message on a gossip topic - They are faulty or malicious, so we want to penalize them. This will be caught in the network section of the code, so it should be scored there.
Someone sends a QBFT message with an invalid signature - They are faulty or malicious, but we will only detect this in the signature verification which will occur in the processor or thereabouts, so the processor needs to send a message to penalize the peer.
Someone sends an invalid QBFT message, e.g. an invalid round change - We can only catch this in the QBFT logic, so the peer needs to be downscored there.
In Lighthouse, in the network layer, we built an API called report_peer (here), where various parts of the codebase can report a peer for some action, and it is scored accordingly depending on the severity of its mistake. There are cases of obviously malicious actions where we want to instantly ban the peer and prevent connections.
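A sketch of what an equivalent reporting API could look like in Anchor, assuming a tokio mpsc channel into the peer manager; the names (PeerAction, PeerReporter and the variants) are hypothetical:

```rust
use tokio::sync::mpsc;

type PeerId = String; // stand-in for libp2p::PeerId

// Hypothetical actions Anchor subsystems might report; each maps to a
// penalty severity decided in one place, inside the peer manager.
#[derive(Debug)]
enum PeerAction {
    MalformedGossipMessage, // caught in the network layer
    InvalidQbftSignature,   // caught in the processor
    InvalidQbftMessage,     // caught in the QBFT logic (e.g. a bad round change)
    Fatal,                  // obviously malicious: ban immediately
}

#[derive(Clone)]
struct PeerReporter {
    tx: mpsc::UnboundedSender<(PeerId, PeerAction)>,
}

impl PeerReporter {
    /// Any part of the codebase holds a cheap clone of this handle and calls
    /// report_peer; the peer manager owns the scoring policy.
    fn report_peer(&self, peer: PeerId, action: PeerAction) {
        let _ = self.tx.send((peer, action));
    }
}
```

The gossip handler, processor and QBFT logic would each hold a clone of the reporter, so the mapping from action to penalty lives in a single place.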
The scoring system there works on a scale from 0 to -100 (i.e. no positive score); only bad events can happen. If a peer reaches a score of -20 it gets disconnected, but we allow it to connect back (if we have spare peer slots). If it reaches -50, it gets banned. It can be very bad and reach -100. The score decays over time, i.e. it moves back towards 0, so a peer can become unbanned, but it takes longer for one at -100 than for one at -51. The timing is configurable. Most of this logic is here.
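As a sketch of the arithmetic involved (the thresholds match the ones described above; the half-life value and function names are made-up examples):

```rust
// Score bands: 0 is neutral, everything below it is a penalty.
const MIN_SCORE: f64 = -100.0;
const DISCONNECT_THRESHOLD: f64 = -20.0; // disconnected, may reconnect
const BAN_THRESHOLD: f64 = -50.0;        // banned until the score decays back

/// Exponential decay towards 0, so a peer at -100 stays banned longer than
/// one at -51. The half-life here is an arbitrary example value.
fn decayed_score(score: f64, elapsed_secs: f64) -> f64 {
    const HALF_LIFE_SECS: f64 = 600.0;
    (score * (-std::f64::consts::LN_2 * elapsed_secs / HALF_LIFE_SECS).exp())
        .clamp(MIN_SCORE, 0.0)
}

fn should_disconnect(score: f64) -> bool {
    score <= DISCONNECT_THRESHOLD
}

fn is_banned(score: f64) -> bool {
    score <= BAN_THRESHOLD
}
```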
We also keep track of banned peers' IP addresses. If enough peers get banned on a /24 subnet, the entire subnet gets banned. This is to prevent peers from attacking us, changing their IP and continuing; perhaps we want this logic too. This is handled in the peer-db via helper functions.
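A rough illustration of the /24 bookkeeping, assuming IPv4 addresses and an arbitrary example threshold:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Example threshold: if this many banned peers share a /24, ban the range.
const BANNED_PEERS_PER_SUBNET: usize = 3;

/// The first three octets identify the /24.
fn slash24(ip: Ipv4Addr) -> [u8; 3] {
    let o = ip.octets();
    [o[0], o[1], o[2]]
}

#[derive(Default)]
struct BannedIps {
    banned_per_subnet: HashMap<[u8; 3], usize>,
}

impl BannedIps {
    fn on_peer_banned(&mut self, ip: Ipv4Addr) {
        *self.banned_per_subnet.entry(slash24(ip)).or_insert(0) += 1;
    }

    /// Refuse new connections from any /24 that has accumulated enough bans.
    fn is_subnet_banned(&self, ip: Ipv4Addr) -> bool {
        self.banned_per_subnet
            .get(&slash24(ip))
            .map_or(false, |n| *n >= BANNED_PEERS_PER_SUBNET)
    }
}
```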
Connection Status
This is a huge burden and where a lot of the spaghetti code comes into the peer manager. Over time the API for rust-libp2p has changed a lot. We needed a wrapper over the swarm events to effectively keep track of all the peers in our db and their status. Because there are a ridiculous number of edge cases and state transitions, over time we've managed to handle them all, but the code is complicated and somewhat scattered. I've made multiple attempts at trying to group this code into one place, i.e. here.
I'm hoping that we can shift all of this logic into rust-libp2p and essentially get rust-libp2p to manage the connection state of all our peers in the peer-store and remove all of this logic from the peer manager. cc @jxs
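In the meantime, a minimal sketch of the kind of state this wrapper tracks, assuming a recent rust-libp2p where SwarmEvent takes a single behaviour-event type parameter (the exact variant fields differ between versions, hence the `..` patterns):

```rust
use std::collections::HashMap;

use libp2p::swarm::SwarmEvent;
use libp2p::PeerId;

/// Our own view of a peer's connection state, mirrored from swarm events.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConnectionStatus {
    Connected { connections: u32 },
    Disconnected,
}

#[derive(Default)]
struct ConnectionTracker {
    peers: HashMap<PeerId, ConnectionStatus>,
}

impl ConnectionTracker {
    /// Update our view from the swarm's events. TBehaviourEvent is whatever
    /// event type the Anchor behaviour emits; other variants (dial failures,
    /// incoming connection errors, etc.) would need handling here too.
    fn on_swarm_event<TBehaviourEvent>(&mut self, event: &SwarmEvent<TBehaviourEvent>) {
        match event {
            SwarmEvent::ConnectionEstablished { peer_id, num_established, .. } => {
                self.peers.insert(
                    *peer_id,
                    ConnectionStatus::Connected { connections: num_established.get() },
                );
            }
            SwarmEvent::ConnectionClosed { peer_id, num_established, .. } => {
                let status = if *num_established == 0 {
                    ConnectionStatus::Disconnected
                } else {
                    ConnectionStatus::Connected { connections: *num_established }
                };
                self.peers.insert(*peer_id, status);
            }
            _ => {}
        }
    }
}
```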
Maintain Peer Slots
This is one of the most important features, imo. In Lighthouse, we allow 10% more connections than our --target-peers count. The reason for this is that a network with a hard limit on peer connections becomes difficult for new peers to join. We found that most nodes on the network filled all their connections with each other; anytime a new peer wanted to join the network, all the discovered peers would reject them because they had too many peers. Having a 10% buffer resolves this, and also gives us the added feature that we can prune the excess in a way that maximizes the quality and usefulness of our peers.
In Lighthouse, if we have a target peer count of 100, we allow 10 extra peers to connect to us. Every 30 seconds, we prune excess peers. There are a number of competing objectives when pruning peers and the ordering is documented in the code. Some of the things we want to optimize for are (a rough sketch of such a pruning pass follows the list):
Peer Score (We want to remove poorly performing peers in favour of newer or better peers)
Uniform Subnets - In Lighthouse we rotate through subnets, so having a uniform distribution of peers across all subnets is beneficial, as we don't have to use discovery to find them. If lots of peers are on one subnet, we prune them in favour of a more distributed set. Over time, Lighthouse should hit a steady state of a uniform distribution. In Anchor, we may only need a collection of specific subnets, so we may want to prune peers that are not on those subnets.
Operator Ids - We may want to keep peers that are directly related to our committees for faster and more direct communication
Various other metrics we may want to optimize our peer set for.
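A rough sketch of such a pruning pass (the 10% buffer and the 30-second cadence come from the description above; the types, scores and tie-breaking rules are simplified stand-ins):

```rust
use std::collections::HashMap;

type PeerId = String; // stand-in for libp2p::PeerId
type SubnetId = u64;  // stand-in for an SSV subnet identifier

struct ConnectedPeer {
    id: PeerId,
    score: f64,
    subnets: Vec<SubnetId>,
}

const TARGET_PEERS: usize = 100;

/// Accept inbound connections up to 10% above the target, so new nodes can
/// still join a "full" network.
fn accept_new_connection(currently_connected: usize) -> bool {
    currently_connected < TARGET_PEERS + TARGET_PEERS / 10
}

/// Run every ~30s: choose which excess peers to disconnect, worst score
/// first, then peers sitting on the most crowded subnets. The real ordering
/// rules (operator ids, protected subnets, etc.) would be more involved.
fn peers_to_prune(mut peers: Vec<ConnectedPeer>) -> Vec<PeerId> {
    if peers.len() <= TARGET_PEERS {
        return Vec::new();
    }
    let excess = peers.len() - TARGET_PEERS;

    // How crowded is each subnet among our currently connected peers?
    let mut subnet_counts: HashMap<SubnetId, usize> = HashMap::new();
    for p in &peers {
        for s in &p.subnets {
            *subnet_counts.entry(*s).or_insert(0) += 1;
        }
    }
    let crowding = |p: &ConnectedPeer| {
        p.subnets.iter().map(|s| subnet_counts[s]).max().unwrap_or(0)
    };

    // Worst peers first: lowest score, then highest subnet crowding.
    peers.sort_by(|a, b| {
        a.score
            .partial_cmp(&b.score)
            .unwrap_or(std::cmp::Ordering::Equal)
            .then_with(|| crowding(b).cmp(&crowding(a)))
    });

    peers.into_iter().take(excess).map(|p| p.id).collect()
}
```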
Maintain Peers Per Subnet
Given a specific subnet that we need to listen and publish to, we need to make sure we have peers on that subnet, otherwise we can't receive or publish messages. In Lighthouse this is quite involved because we rotate subnets quite frequently and for various reasons. In Anchor, subnets are more static, so it should be simpler. The complicated code (although I recently simplified it a lot) in Lighthouse is here (it's not actually in the peer manager :p).
In Anchor, we probably just need to check every 30 seconds or so whether we have a minimum number of peers on the subnets we require and, if we don't, perform discoveries for them. I think we just need some service or task that maintains our peer count for the subnets we need.
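A sketch of that task, assuming a tokio interval and a hypothetical discovery handle (Discovery, discover_peers_on_subnet and the minimum count are all made up for illustration):

```rust
use std::collections::{HashMap, HashSet};
use std::time::Duration;

type SubnetId = u64; // stand-in for an SSV subnet identifier

/// Hypothetical handle into the discovery layer.
trait Discovery {
    fn discover_peers_on_subnet(&self, subnet: SubnetId);
}

const MIN_PEERS_PER_SUBNET: usize = 3; // example minimum
const CHECK_INTERVAL: Duration = Duration::from_secs(30);

/// Every 30 seconds, check each subnet we care about and kick off discovery
/// for any that has fallen below the minimum peer count.
async fn maintain_subnet_peers<D: Discovery>(
    discovery: D,
    required_subnets: HashSet<SubnetId>,
    peer_counts: impl Fn() -> HashMap<SubnetId, usize>,
) {
    let mut interval = tokio::time::interval(CHECK_INTERVAL);
    loop {
        interval.tick().await;
        let counts = peer_counts();
        for subnet in &required_subnets {
            if counts.get(subnet).copied().unwrap_or(0) < MIN_PEERS_PER_SUBNET {
                discovery.discover_peers_on_subnet(*subnet);
            }
        }
    }
}
```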
FIN
I think this covers the high-level features inside the peer manager that might be useful to Anchor. We can decide whether we want any or all of these, and potentially a better way to implement or port them.
cc @jking-aus @Zacholme7 @diegomrsantos @dknopik @ThreeHrSleep @jxs