VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg) #8380
I got this just running the ZTS snapshot portion twice in a row, which might be good because I'm researching the "Unable to automount.." issues and that directly precedes the panic.
@PaulZ-98 if you're able to reproduce this it would be great if you could get to the root cause. It's been difficult to reproduce locally, and the bots do occasionally encounter it.
I was able to reproduce this with a version from Jan 31, but not with a version pulled on March 7.
Works with tip at: commit becdcec
Reproducible/fails with tip at: commit 57dc41d
If I have time I will do a git bisect to try and find what fixes it. Have you seen it recently with buildbot @behlendorf?
@PaulZ-98 now that you mention it, I haven't. This build on Feb 25th from PR 8418 is the last time any of the bots hit this. That PR was based off 2d76ab9, which narrows it down a little further. It would be ideal if we could bisect this. I'll make sure to keep an eye out for any more instances of this in the CI.
I confirmed with git bisect, and also by hand-applying the patch, that @jwk404's fix #8409 eliminates the problem. It is still a bit concerning that when clone_001_pos cleanup fails, you can later get a panic. Here's the test log from when clone_001_pos cleanup fails (pre-8409 fix).
The PANIC occurs in snapshot_009_pos.ksh when we create a zvol and immediately snap -r the whole pool. Somehow, this situation (state left behind by the clone_001_pos cleanup failure) causes dp_dirty_dirs to have entries at the end of dsl_pool_sync (because the synctasks from …
It is probably a similar situation to https://www.illumos.org/issues/10445, where we can see a panic with snap-slider and with destroying an older BE where we have several clones of snapshots.
@PaulZ-98 this is definitely still concerning. I'm glad to see the spurious failures in the CI go away, but clearly there's still a bug here which needs to be fixed.
As mentioned, testpool/testvol@testsnap still exists at the time of running snapshot_009_pos.ksh. 009 attempts a recursive snapshot (snap -r) of the whole pool. Most of the dsl_dataset_snapshot_check calls then return 0, except the testpool/testvol@testsnap check, which returns EEXIST. So the dsl_dataset_snapshot_sync calls never happen for any of the -r snaps. At this point we have some dp_dirty_dirs entries courtesy of dsl_dataset_snapshot_check_impl --> dsl_dataset_snapshot_reserve_space, but the MOS is never dirtied because we don't take any snapshots. This complicates spa_sync_iterate_to_convergence, because normally when you had dp_dirty_dirs you also dirtied the MOS.
I don't see an easy way to undo the work done by dsl_dataset_snapshot_reserve_space because the dsl_sync_task design doesn't have a cleanup function capability.
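To make that failure mode concrete, here is a deliberately simplified, self-contained C model of the state described above. It is not ZFS source; the struct, fields, and helpers are hypothetical stand-ins. The check phase queues a dirty dir, the sync tasks never run because of the EEXIST, and a convergence loop that keys only off a dirty MOS exits with the dirty dir still queued, which is the condition the VERIFY catches.

```c
/*
 * Toy model, not ZFS code: every name below is a hypothetical stand-in.
 * It shows how dp_dirty_dirs can end up non-empty after the sync loop
 * when the MOS was never dirtied.
 */
#include <stdbool.h>
#include <stdio.h>

struct pool {
	int	dirty_dirs;	/* stand-in for entries on dp_dirty_dirs */
	bool	mos_dirty;	/* stand-in for "the MOS has dirty dbufs" */
};

/* Check phase: reserving space dirties a dsl_dir_t, then EEXIST is hit. */
static void snapshot_check(struct pool *dp)
{
	dp->dirty_dirs++;	/* models dsl_dataset_snapshot_reserve_space */
}

/* Because the check failed, no snapshots are taken and the MOS stays clean. */
static void snapshot_sync_skipped(struct pool *dp)
{
	(void) dp;		/* dsl_dataset_snapshot_sync never runs */
}

/*
 * Simplified convergence loop: it keeps iterating only while the MOS is
 * dirty, on the (normally valid) assumption that dirty dirs imply a
 * dirty MOS.
 */
static void sync_iterate_to_convergence(struct pool *dp)
{
	while (dp->mos_dirty) {
		dp->dirty_dirs = 0;
		dp->mos_dirty = false;
	}
}

int main(void)
{
	struct pool dp = { 0, false };

	snapshot_check(&dp);
	snapshot_sync_skipped(&dp);
	sync_iterate_to_convergence(&dp);

	/* Mirrors VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)). */
	if (dp.dirty_dirs != 0)
		printf("%d dirty dir(s) left after sync: panic condition\n",
		    dp.dirty_dirs);
	return (0);
}
```

Run as-is, the model reports one leftover dirty dir, the state the real assertion refuses to accept at the end of the sync path.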
I'm not sure why the snap -r is not prevented in user space / libzfs. In trivial cases, where a descendant dataset of a snap -r already has that snap, it fails.
Here's a potential fix. Don't reserve space in dsl_dataset_snapshot_check_impl. Just use that code to find the errors, if any (in this case EEXIST). Then, only if no errors are found, dirty the dsl_dir_t by reserving the space in a second loop through the to-be-taken snaps.
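A minimal sketch of that two-pass idea follows, under the assumption that validation and space reservation can be cleanly separated; snap_req_t, snap_exists(), and reserve_space() are invented stand-ins, not the real ZFS interfaces.

```c
/*
 * Sketch of the proposed two-pass check, not the actual patch. Pass 1
 * only validates; pass 2 reserves space (which dirties the dsl_dir_t)
 * and runs only if every snapshot in the -r set passed.
 */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct snap_req {
	const char	*name;		/* full snapshot name */
	bool		exists;		/* stand-in for the EEXIST lookup */
} snap_req_t;

static int
snap_exists(const snap_req_t *sr)
{
	return (sr->exists ? EEXIST : 0);
}

static void
reserve_space(const snap_req_t *sr)
{
	/*
	 * The real code would call dsl_dataset_snapshot_reserve_space(),
	 * which dirties the dsl_dir_t and puts it on dp_dirty_dirs.
	 */
	(void) sr;
}

static int
snapshot_check_two_pass(snap_req_t *reqs, size_t n)
{
	int error = 0;

	/* Pass 1: find errors only; do not dirty anything yet. */
	for (size_t i = 0; i < n; i++) {
		int err = snap_exists(&reqs[i]);
		if (err != 0 && error == 0)
			error = err;
	}
	if (error != 0)
		return (error);		/* e.g. EEXIST; nothing was dirtied */

	/* Pass 2: all checks passed, now reserve space for each snap. */
	for (size_t i = 0; i < n; i++)
		reserve_space(&reqs[i]);

	return (0);
}

int main(void)
{
	snap_req_t reqs[] = {
		{ "testpool@testsnap", false },
		{ "testpool/testvol@testsnap", true },	/* leftover snap */
	};

	/* Returns EEXIST and, crucially, leaves no dsl_dir_t dirtied. */
	return (snapshot_check_two_pass(reqs, 2) == EEXIST ? 0 : 1);
}
```

The property that matters is that nothing lands on dp_dirty_dirs unless the whole -r set is actually going to be snapshotted.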
So with the above change, if clone_001_pos cleanup fails, then snapshot_009_pos (snap -r) also fails later with EEXIST. To test this you must withhold the cleanup retry fix from clone_001_pos, because you won't see the PANIC unless clone_001_pos cleanup fails.
@PaulZ-98 that makes sense. But I think this code also needs to somehow handle the case where … It also looks like …
Good points @behlendorf, I'll look at it further.
Now that we know what's going on it'd be nice to add a test case to verify the eventual fix and avoid regressions. |
Yes @behlendorf, I'm taking clone_001_pos without any cleanup and the start of snapshot_009_pos, and combining them into a diabolical test that does panic ZFS when you don't have this fix.
A certain arrangement of existing snaps with the same name as specified in a new snap -r command can create a situation during the sync phase with entries in dp_dirty_dirs but no dirty dbufs in the MOS. Also, when the zfs tests are destroying datasets and then the pool, there is a race condition between dmu_objset_evict and dbuf_evict_one, resulting in a hang. Signed-off-by: Paul Zuchowski <[email protected]> Fixes openzfs#8380
The new test will cause the panic reliably when you don't have this fix, and it also revealed a different issue when destroying the datasets from the new test and moving on to destroy the pool.
And the related stacks...
I believe that the dbuf_evict_one thread has no knowledge of the dmu_objset_evict process, and they can fight over the dn_dbufs list and its contents. In the PR for this, I have introduced some mutual exclusion such that dbuf_evict_one just returns if the chosen dbuf's objset is already being evicted.
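A rough illustration of that mutual exclusion follows. It is not the PR's actual diff; the types, field names, and locking are assumptions made to keep the example self-contained.

```c
/*
 * Sketch only: hypothetical types modeling the described back-off.
 * The eviction thread checks whether the chosen dbuf's objset is
 * already being torn down and, if so, returns without touching it.
 */
#include <pthread.h>
#include <stdbool.h>

typedef struct objset {
	pthread_mutex_t	os_lock;
	bool		os_evicting;	/* set once dmu_objset_evict starts */
} objset_t;

typedef struct dbuf {
	objset_t	*db_objset;
} dbuf_t;

/* Stand-in for dmu_objset_evict(): mark the objset, then tear it down. */
static void
objset_evict_begin(objset_t *os)
{
	pthread_mutex_lock(&os->os_lock);
	os->os_evicting = true;
	pthread_mutex_unlock(&os->os_lock);
	/* ... walk and free the dn_dbufs list without interference ... */
}

/* Stand-in for dbuf_evict_one(): back off if the objset is going away. */
static void
dbuf_evict_one_sketch(dbuf_t *db)
{
	objset_t *os = db->db_objset;
	bool skip;

	pthread_mutex_lock(&os->os_lock);
	skip = os->os_evicting;
	pthread_mutex_unlock(&os->os_lock);

	if (skip)
		return;		/* dmu_objset_evict owns these dbufs now */

	/* ... otherwise evict db as usual ... */
}

int main(void)
{
	objset_t os = { PTHREAD_MUTEX_INITIALIZER, false };
	dbuf_t db = { &os };

	objset_evict_begin(&os);
	dbuf_evict_one_sketch(&db);	/* returns immediately, no race */
	return (0);
}
```

The design choice is simply to let dmu_objset_evict own the dn_dbufs teardown once eviction has started, instead of letting the two paths race for the list.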
I managed to trigger this in my home lab running fc34dfb while running …
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information

Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 18.04
Linux Kernel | 4.15.0-39-generic
Architecture | x86_64
ZFS Version | zfs-0.8.0-rc3-43-g0902c4577f4b
SPL Version |
Describe the problem you're observing
Occasional failed assertion when running the ZTS. To my knowledge this has only been reproduced thus far by the CI.
Describe how to reproduce the problem
Occasionally reproduced by the CI when running the ZTS.
http://build.zfsonlinux.org/builders/Ubuntu%2017.04%20x86_64%20Coverage%20%28TEST%29/builds/4914/steps/shell_9/logs/console
Include any warning/errors/backtraces from the system logs