Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid pubkey cache timeouts #2381

Closed
wants to merge 3 commits into from

Conversation

paulhauner
Copy link
Member

Issue Addressed

Proposed Changes

Move to a two-lock system for the ValidatorPubkeyCache which will not block read-access to the cache whilst writing to the DB or decompressing validator pubkeys.

Additional Info

NA

@paulhauner paulhauner added the work-in-progress PR is a work-in-progress label May 31, 2021
@paulhauner
Copy link
Member Author

I think this should be a target for v1.5.0 (i.e., not v1.4.0)

@paulhauner paulhauner added the v1.5.0 For inclusion in v1.5.0 release label Jun 2, 2021
@paulhauner paulhauner force-pushed the pubkey-cache-timeouts branch from 829e563 to 1b0965f Compare July 12, 2021 00:50
@paulhauner paulhauner removed the v1.5.0 For inclusion in v1.5.0 release label Jul 22, 2021
@paulhauner paulhauner added the v1.5.1 To be included in the v1.5.1 relase label Aug 2, 2021
@michaelsproul michaelsproul added v1.5.2 The release after v1.5.1 v2.0.0 Altair on mainnet release (v2.0.0) and removed v1.5.1 To be included in the v1.5.1 relase v1.5.2 The release after v1.5.1 labels Aug 26, 2021
@michaelsproul michaelsproul removed the v2.0.0 Altair on mainnet release (v2.0.0) label Sep 27, 2021
bors bot pushed a commit that referenced this pull request Oct 19, 2021
## Issue Addressed

Mitigates #1096

## Proposed Changes

Add a flag to the beacon node called `--disable-lock-timeouts` which allows opting out of lock timeouts.

The lock timeouts serve a dual purpose:

1. They prevent any single operation from hogging the lock for too long. When a timeout occurs it logs a nasty error which indicates that there's suboptimal lock use occurring, which we can then act on.
2. They allow deadlock detection. We're fairly sure there are no deadlocks left in Lighthouse anymore but the timeout locks offer a safeguard against that.

However, timeouts on locks are not without downsides:

They allow for the possibility of livelock, particularly on slower hardware. If lock timeouts keep failing spuriously the node can be prevented from making any progress, even if it would be able to make progress slowly without the timeout. One particularly concerning scenario which could occur would be if a DoS attack succeeded in slowing block signature verification times across the network, and all Lighthouse nodes got livelocked because they timed out repeatedly. This could also occur on just a subset of nodes (e.g. dual core VPSs or Raspberri Pis).

By making the behaviour runtime configurable this PR allows us to choose the behaviour we want depending on circumstance. I suspect that long term we could make the timeout-free approach the default (#2381 moves in this direction) and just enable the timeouts on our testnet nodes for debugging purposes. This PR conservatively leaves the default as-is so we can gain some more experience before switching the default.
@paulhauner
Copy link
Member Author

Closing due due to staleness. We've removed a bunch of these timeouts already and if we wanted to remove more we'd be best to start with an entirely new PR.

@paulhauner paulhauner closed this Nov 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
work-in-progress PR is a work-in-progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants