Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a lockfile when opening config files #1196

Merged
merged 6 commits into from
May 13, 2024
Merged

Conversation

gefjon
Copy link
Contributor

@gefjon gefjon commented May 2, 2024

Description of Changes

In the past, we've had issues where multiple concurrent CLI processes would race to read and write the CLI config file,
leading to data loss.

We considered using flock/LockFileEx and blocking until the file became available, but unfortunately it's not possible to atomically create and lock a nonexistent file, which we need to do in the case where the configuration doesn't yet exist.

Instead, we opt for a classic lockfile-based scheme: Before opening a config file foo.conf, attempt to exclusively create foo.lock, and panic if the exclusive creation fails. Once it becomes clear that we will not write the config any more, i.e. in Config::drop, delete the lockfile, allowing another process to operate.

This means that attempting to run multiple concurrent Spacetime CLI processes with the same config file is now a hard error.

API and ABI breaking changes

Arguably this is a CLI interface break, in the sense that running multiple concurrent CLI processes with the same config file is now a hard error, where before it sometimes worked.

Expected complexity level and risk

3 - CLI stuff is finicky and requires manual testing.

Testing

  • Using the new CLI:
    • Create an identity.
    • List identities for a particular server, a non-default server, and all servers.
    • Publish a database.
    • Fingerprint a server.
    • Add a server configuration.
    • Delete an identity.
    • Access a module's logs.
    • Run a SQL query against a module.
  • Create a file ~/.spacetime/config.lock, then attempt the above, and observe the error.
  • Delete ~/.spacetime/config.lock, try again, and see that it works again.
  • Invoke the CLI with invalid arguments to trigger a Clap parse error (e.g. spacetime lsit), then run again with a valid parse (e.g. spacetime identity list), and verify that there is no stale lockfile preventing the corrected command.
  • Delete the ~/.spacetime directory, then run a CLI command and verify that the lockfile can be created despite the directory not existing.
  • Do whatever cursed concurrent invocations were exhibiting the bug before, and observe that they now show an error rather than silently corrupting or losing data.

In the past, we've had issues where multiple concurrent CLI processes
would race to read and write the CLI config file,
leading to data loss.

We considered using `flock`/`LockFileEx` and blocking until the file became available,
but unfortunately it's not possible to atomically create and lock a nonexistent file,
which we need to do in the case where the configuration doesn't yet exist.

Instead, we opt for a classic lockfile-based scheme:
Before opening a config file `foo.conf`, attempt to exclusively create `foo.lock`,
and panic if the exclusive creation fails.
Once it becomes clear that we will not write the config any more,
i.e. in `Config::drop`,
delete the lockfile, allowing another process to operate.

This means that attempting to run multiple concurrent Spacetime CLI processes
with the same config file is now a hard error.
@gefjon gefjon added the CLI only This change only affects the CLI behavior label May 2, 2024
@gefjon gefjon requested review from bfops and jdetter May 2, 2024 19:00
This commit fixes two CI failures:

- `spacetime start`, and a few other CLI subcommands, do not access their `Config` at all,
  but the CLI constructs it unconditionally in `main`,
  which made it an error to run any CLI command while `spacetime start` was running.
  This is fixed by having subcommands which don't need a `Config`
  drop it before doing anything.
- Contrary to my assumption,
  the test configuration created by `Config::new_with_localhost` does get `drop`ped,
  because the test harness `clone`s is and passes an owned version to the CLI.
  This was causing it to attempt to delete the empty path, which failed.
  This is fixed by having the home configuration be `Option`al,
  and setting it to `None` in tests.
@gefjon
Copy link
Contributor Author

gefjon commented May 2, 2024

Ugh, if Clap parsing fails, it doesn't panic; instead it directly calls exit (presumably in order to suppress the panic message). This means that unwinding doesn't happen and destructors are never called, so lockfiles get left in place. Gross.

Edit: Fixed in latest commit; adding more testing steps.

Perform Clap argument parsing as the very first thing in a CLI process,
before locking the config,
because Clap calls `exit` directly on error rather than panicing
(presumably to have more control over error output),
which prevents destructors from running,
leaving stale lockfiles.
Also deduplicate logic for finding config file paths.
@gefjon gefjon requested a review from bfops May 3, 2024 16:31
Copy link
Contributor

@kazimuth kazimuth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, otherwise LGTM 👍

if let Some(config_path) = Self::find_config_path(&cur_dir) {
Self::load_from_file(&config_path)
.inspect_err(|e| eprintln!("config file {config_path:?} is invalid: {e:#}"))
.unwrap_or_default()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it was like this before, but it might be worth raising the error here. I would prefer a CLI tool crashes rather than continuing with unknown defaults when there's a config error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, but I'd like @jdetter to weigh in before I change any semantics more than is required for the ticket.

.with_context(|| format!("Unable to remove lockfile {:?}", self.path))
.unwrap();
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: FWIW this is veeeeery close to a Config-independent struct.

do you have any opinions on factoring it out a la https://github.com/clockworklabs/SpacetimeDB/pull/1202/files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong feelings one way or the other. If you prefer that, I'm into it.

Copy link
Collaborator

@bfops bfops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, left some nits and one I-think-bug.

What do you think about splitting out the internal-only refactors into a base PR like https://github.com/clockworklabs/SpacetimeDB/pull/1203/files ?

(I suspect it would make this diff hard-to-argue)

@bfops bfops added the release-any To be landed in any release window label May 13, 2024
Copy link
Collaborator

@jdetter jdetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested + working. Originally I hit a snag on windows but I am now not able to reproduce it at all and it might be a somewhat rare issue. Even if the issue occurs it didn't currupt my config file which is what I care about.

Thanks Phoebe 👍

@jdetter jdetter added this pull request to the merge queue May 13, 2024
Merged via the queue into master with commit f6573c4 May 13, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLI only This change only affects the CLI behavior release-any To be landed in any release window
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants