test_download_remote_layers_api and similar: don't start safekeepers to validate on-demand downloads #4081

Closed
problame opened this issue Apr 26, 2023 · 5 comments
Labels: a/tech_debt (Area: related to tech debt), a/test/flaky (Area: related to flaky tests), a/test (Area: related to testing), c/storage/pageserver (Component: storage: pageserver), triaged (bugs that were already triaged)

Comments

@problame
Contributor

I've recently found the following procedure to prevent a bunch of flakiness.
We should follow it in older tests, such as the original on-demand download tests (test_download_remote_layers_api).

  • fire up the full stack
  • insert data using endpoint / pgbench
  • shut down endpoint
  • shut down safekeepers
  • do whatever needs to be tested in pageserver
  • if we need to validate contents, start endpoint in read-only mode
  • BUT: never start the safekeepers again :D
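
The steps above can be sketched as a small ordering helper. This is a minimal illustration only: `FakeStack` and all of its method names are hypothetical stand-ins invented for this sketch, not the fixtures of the real test suite. It just records the order of calls so the sequence is explicit.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStack:
    """Hypothetical stand-in for the test environment; records call order."""

    log: list = field(default_factory=list)
    safekeepers_running: bool = False

    def start_all(self):
        self.safekeepers_running = True
        self.log.append("start_all")

    def run_workload(self):
        self.log.append("run_workload")

    def stop_endpoint(self):
        self.log.append("stop_endpoint")

    def stop_safekeepers(self):
        self.safekeepers_running = False
        self.log.append("stop_safekeepers")

    def start_endpoint_read_only(self):
        self.log.append("start_endpoint_read_only")


def run_pageserver_test(stack, exercise_pageserver, validate=None):
    """Run the procedure from the list above, in order."""
    stack.start_all()          # fire up the full stack
    stack.run_workload()       # insert data (endpoint / pgbench)
    stack.stop_endpoint()      # no new WAL is generated from here on
    stack.stop_safekeepers()   # no buffered WAL can reach the pageserver
    exercise_pageserver()      # whatever needs to be tested in pageserver
    if validate is not None:
        stack.start_endpoint_read_only()
        validate()
    # The safekeepers must still be stopped at the end of the test.
    assert not stack.safekeepers_running
```

A test would pass its pageserver-specific steps as `exercise_pageserver` and, if contents need checking, a read-only query as `validate`.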

For example, the test_download_remote_layers_api doesn't do this.
At the end, it starts SKs again just to make a read-only query that is supposed to do layer downloads.

(This is a follow-up to #4024 (review).)

@problame problame added c/storage/pageserver Component: storage: pageserver a/test Area: related to testing a/tech_debt Area: related to tech debt a/test/flaky Area: related to flaky tests labels Apr 26, 2023
@arssher
Contributor

arssher commented Apr 26, 2023

I don't immediately see how that helps. The source of flakiness I've mostly encountered is the logic that waits until data from compute, up to some LSN, arrives at the pageserver, but the steps above involve no such waiting. This is probably about something different, though.

@problame problame changed the title test_download_remote_layers_api and dimilar: don't start safekeepers to validate on-demand downloads test_download_remote_layers_api and similar: don't start safekeepers to validate on-demand downloads Apr 26, 2023
@problame
Contributor Author

It's for tests that want to control layer creation.
These tests need to prevent a new layer from being created.
At least when I started at Neon, I thought it'd be sufficient to stop the endpoint.
But obviously it's not sufficient, because the safekeepers act as a buffer.
By stopping the safekeepers, and never starting them again, we ensure that no new WAL will arrive at pageserver.
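
A toy model may make the buffering argument concrete. The classes below are hypothetical and deliberately simplistic (a real safekeeper is far more than an in-memory queue); the sketch only shows that WAL already handed to a safekeeper can still reach the pageserver after the endpoint has stopped, unless the safekeeper is stopped too.

```python
from collections import deque


class Pageserver:
    def __init__(self):
        self.received = []

    def ingest(self, record):
        self.received.append(record)


class Safekeeper:
    """Toy buffer between the endpoint and the pageserver (assumption)."""

    def __init__(self, pageserver):
        self.buffer = deque()
        self.pageserver = pageserver
        self.running = True

    def append(self, record):
        self.buffer.append(record)

    def stream_once(self):
        """Forward one buffered record if running; report whether one was sent."""
        if self.running and self.buffer:
            self.pageserver.ingest(self.buffer.popleft())
            return True
        return False

    def stop(self):
        self.running = False


def late_arrivals(stop_safekeeper):
    """Count WAL records reaching the pageserver after the endpoint stopped."""
    ps = Pageserver()
    sk = Safekeeper(ps)
    for i in range(3):
        sk.append(f"wal-{i}")  # endpoint writes three records
    sk.stream_once()           # only one is forwarded before the endpoint stops
    # Endpoint is stopped here: nothing new is appended from now on.
    if stop_safekeeper:
        sk.stop()
    before = len(ps.received)
    while sk.stream_once():    # time passes; the safekeeper keeps streaming
        pass
    return len(ps.received) - before
```

In this model, stopping only the endpoint still lets the two buffered records arrive later, which is the kind of unexpected ingest that can trigger layer creation in the middle of a test.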

So yes, it's for a different flakiness than the one you fixed in #4024

@LizardWizzard LizardWizzard added the triaged bugs that were already triaged label Jul 3, 2023
@koivunej
Member

koivunej commented Jul 3, 2023

#3697 works like this.

@LizardWizzard
Contributor

TODO: create some test suite docs/tips and link to it from CONTRIBUTING.md

@problame
Contributor Author

problame commented Sep 6, 2024

The team is aware of this as a best practice, and nowadays we have better monitoring of flaky tests and find issues earlier.

@problame problame closed this as not planned Won't fix, can't repro, duplicate, stale Sep 6, 2024