
VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg) #8380

Closed
behlendorf opened this issue Feb 5, 2019 · 19 comments · May be fixed by #8578
Labels
Component: Test Suite Indicates an issue with the test framework or a test case Status: Stale No recent activity for issue Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@behlendorf
Contributor

System information

Distribution Name | Ubuntu
Distribution Version | 18.04
Linux Kernel | 4.15.0-39-generic
Architecture | x86_64
ZFS Version | zfs-0.8.0-rc3-43-g0902c4577f4b
SPL Version |

Describe the problem you're observing

Occasional failed assertion when running the ZTS. To my knowledge this has only been reproduced thus far by the CI.

[30973.259253] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
[30973.264795] PANIC at spa.c:7957:spa_sync_iterate_to_convergence()

Describe how to reproduce the problem

Occasionally reproduced by the CI when running the ZTS.

http://build.zfsonlinux.org/builders/Ubuntu%2017.04%20x86_64%20Coverage%20%28TEST%29/builds/4914/steps/shell_9/logs/console

Include any warning/errors/backtraces from the system logs

[30973.259253] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
[30973.264795] PANIC at spa.c:7957:spa_sync_iterate_to_convergence()
[30973.270149] Showing stack for process 9302
[30973.270154] CPU: 1 PID: 9302 Comm: txg_sync Tainted: P           OE   4.13.0-coverage1 #2
[30973.270162] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[30973.270166] Call Trace:
[30973.270175]  dump_stack+0xb0/0xe6
[30973.270196]  spl_dumpstack+0x39/0x50 [spl]
[30973.270212]  spl_panic+0xd7/0x140 [spl]
[30973.270452]  ? dbuf_rele+0x61/0xb0 [zfs]
[30973.270643]  ? dmu_buf_rele+0x16/0x20 [zfs]
[30973.270851]  ? zap_unlockdir+0x6b/0xd0 [zfs]
[30973.271060]  ? zap_lookup_norm+0x9c/0xc0 [zfs]
[30973.271257]  ? queued_spin_unlock+0x13/0x20 [zfs]
[30973.271465]  ? queued_spin_unlock+0x13/0x20 [zfs]
[30973.271672]  spa_sync_iterate_to_convergence+0x3e0/0x430 [zfs]
[30973.271879]  spa_sync+0x469/0xb50 [zfs]
[30973.272093]  txg_sync_thread+0x3bc/0x590 [zfs]
[30973.272306]  ? txg_dispatch_callbacks+0x180/0x180 [zfs]
[30973.272325]  thread_generic_wrapper+0xaf/0x100 [spl]
[30973.272335]  kthread+0x150/0x1b0
[30973.272353]  ? IS_ERR+0x20/0x20 [spl]
[30973.272356]  ? kthread_flush_work+0x1a0/0x1a0
@behlendorf behlendorf added the Component: Test Suite Indicates an issue with the test framework or a test case label Feb 5, 2019
@behlendorf behlendorf added this to the 0.8.0 milestone Feb 5, 2019
@PaulZ-98
Contributor

PaulZ-98 commented Feb 20, 2019

I got this just by running the ZTS snapshot portion twice in a row, which might be useful because I'm researching the "Unable to automount..." issues that directly precede the panic.

Feb 20 12:45:38 ub16fresh kernel: [72120.548071] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.1/testpool/[email protected]: 256
Feb 20 12:45:38 ub16fresh kernel: [72120.609377] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.2/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.678067] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.3/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.775444] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.4/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.850295] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.5/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.914487] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.6/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.975676] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.7/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72121.037588] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.8/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72121.103074] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.9/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72121.603153] WARNING: Unable to automount /var/tmp/testdir1/.zfs/snapshot/testsnap/testpool/testctr/testfs1@testsnap: 256
Feb 20 12:45:40 ub16fresh kernel: [72122.151665] WARNING: Unable to automount /var/tmp/testdir1/.zfs/snapshot/testsnap/testpool/testctr/testfs1@testsnap: 256
Feb 20 12:45:43 ub16fresh kernel: [72125.046754] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
Feb 20 12:45:43 ub16fresh kernel: [72125.046758] PANIC at spa.c:7957:spa_sync_iterate_to_convergence()
Feb 20 12:45:43 ub16fresh kernel: [72125.046759] Showing stack for process 20380
Feb 20 12:45:43 ub16fresh kernel: [72125.046761] CPU: 0 PID: 20380 Comm: txg_sync Tainted: P           OE   4.10.0-28-generic #32~16.04.2-Ubuntu
Feb 20 12:45:43 ub16fresh kernel: [72125.046761] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Feb 20 12:45:43 ub16fresh kernel: [72125.046762] Call Trace:
Feb 20 12:45:43 ub16fresh kernel: [72125.046775]  dump_stack+0x63/0x90
Feb 20 12:45:43 ub16fresh kernel: [72125.046781]  spl_dumpstack+0x29/0x30 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.046784]  spl_panic+0xc5/0x100 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.046786]  ? spl_kmem_zalloc+0xd9/0x170 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.046789]  ? spl_kmem_free+0x33/0x40 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.046822]  ? dmu_tx_destroy+0x12d/0x150 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046843]  ? dmu_tx_commit+0x190/0x200 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046849]  ? mutex_lock+0x12/0x40
Feb 20 12:45:43 ub16fresh kernel: [72125.046878]  ? txg_verify+0x14/0xe0 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046905]  ? txg_list_empty_impl+0x34/0x70 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046933]  spa_sync+0x1805/0x1840 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046959]  ? zfs_refcount_remove_many+0x1c4/0x250 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046987]  ? spa_txg_history_init_io+0x104/0x110 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.047015]  txg_sync_thread+0x2f2/0x690 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.047041]  ? txg_quiesce_thread+0x550/0x550 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.047045]  thread_generic_wrapper+0x7b/0xc0 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.047048]  kthread+0x109/0x140
Feb 20 12:45:43 ub16fresh kernel: [72125.047051]  ? __thread_exit+0x20/0x20 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.047052]  ? kthread_create_on_node+0x60/0x60
Feb 20 12:45:43 ub16fresh kernel: [72125.047053]  ret_from_fork+0x2c/0x40

@behlendorf
Contributor Author

@PaulZ-98 if you're able to reproduce this it would be great if you could get to the root cause. It's been difficult to reproduce locally, and the bots do occasionally encounter it.

@PaulZ-98
Contributor

PaulZ-98 commented Mar 8, 2019

I was able to reproduce this with a version from Jan 31, but not with a version pulled on March 7.

Work with tip at:

commit becdcec
Author: Tony Hutter [email protected]
Date: Wed Mar 6 16:03:03 2019 -0800

kernel_fpu fixes

Reproducible/fails with tip at:

commit 57dc41d
Author: Tony Hutter [email protected]
Date: Thu Jan 31 10:51:18 2019 -0800

Fix zpool iostat -w header names

If I have time I will do a git bisect to try to find what fixes it.

Have you seen it recently with buildbot @behlendorf?

@behlendorf
Contributor Author

@PaulZ-98 now that you mention it, I haven't. This build on Feb 25th from PR 8418 is the last time any of the bots hit this. That PR was based off 2d76ab9, which narrows it down a little further. It would be ideal if we could bisect this. I'll make sure to keep an eye out for any more instances of this in the CI.

@PaulZ-98
Contributor

PaulZ-98 commented Mar 9, 2019

I confirmed with git bisect, and also by hand-applying the patch, that @jwk404's fix #8409 eliminates the problem. It is still a bit concerning that when clone_001_pos cleanup fails, you can later get a panic.

Here's the test log from when clone_001_pos cleanup fails (pre 8409 fix).

10:42:38.79 NOTE: Performing local cleanup via log_onexit (cleanup_all)
10:42:39.04 SUCCESS: zfs destroy -Rf testpool/testfs@testsnap
10:42:39.04 SUCCESS: rm -rf /var/tmp/testdir.0
10:42:39.10 ERROR: zfs destroy -Rf testpool/testvol@testsnap exited 1
10:42:39.10 cannot destroy 'testpool/testclone1': dataset is busy cannot destroy snapshot testpool/testvol@testsnap: snapshot is cloned
10:42:39.10 NOTE: Performing test-fail callback (/root/zap_prefetch/zfs/tests/zfs-tests/callbacks/zfs_dbgmsg.ksh)

The PANIC occurs in snapshot_009_pos.ksh when we create a zvol and immediately snap -r the whole pool. Somehow this situation (the state left behind by the clone_001_pos cleanup failure) causes dp_dirty_dirs to have entries at the end of dsl_pool_sync (because the synctasks from zfs snapshot -r testpool@testsnap dirtied the dirs). Normally we count on dp_dirty_dirs being empty, but here the ASSERT is tripped later during spa_sync.

@ikozhukhov
Contributor

This is probably a similar situation to https://www.illumos.org/issues/10445, where we can see a panic with snap-slider and when destroying an older BE where we have several clones of snapshots.

@behlendorf
Contributor Author

@PaulZ-98 this is definitely still concerning. I'm glad to see the spurious failures in the CI go away, but clearly there's still a bug here which needs to be fixed.

@PaulZ-98
Contributor

PaulZ-98 commented Mar 12, 2019

As mentioned, testpool/testvol@testsnap still exists at the time of running snapshot_009_pos.ksh.

009 attempts zfs snapshot -r testpool@testsnap

Most of the dsl_dataset_snapshot_check calls then return 0, except the testpool/testvol@testsnap check, which returns EEXIST. So the dsl_dataset_snapshot_sync calls never happen for any of the -r snaps. At this point we have some dp_dirty_dirs courtesy of dsl_dataset_snapshot_check_impl --> dsl_dataset_snapshot_reserve_space, but the MOS is never dirtied because we don't take any snapshots.

This complicates spa_sync_iterate_to_convergence, because normally when dp_dirty_dirs has entries the MOS has also been dirtied.

7948  		if (pass == 1 &&
7949  		    spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
7950  		    !dmu_objset_is_dirty(mos, txg)) {
7951  			/*
7952  			 * Nothing changed on the first pass, therefore this
7953  			 * TXG is a no-op. Avoid syncing deferred frees, so
7954  			 * that we can keep this TXG as a no-op.
7955  			 */
7956  			ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
7957  			ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7958  			ASSERT(txg_list_empty(&dp->dp_sync_tasks, txg));
7959  			ASSERT(txg_list_empty(&dp->dp_early_sync_tasks, txg));
7960  			break;
7961  		}

I don't see an easy way to undo the work done by dsl_dataset_snapshot_reserve_space, because the dsl_sync_task design doesn't provide a cleanup-callback capability.
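
For illustration only, here is a toy user-space sketch of the failure mode described above (made-up names, plain C, not ZFS code): each per-dataset check reserves space as a side effect, so once a later check fails with EEXIST the sync task never runs, yet the dirs reserved by the earlier checks stay dirty.

/*
 * Toy model of a check phase with side effects. check_with_side_effect()
 * stands in for dsl_dataset_snapshot_check_impl() calling
 * dsl_dataset_snapshot_reserve_space(); "dirty" stands in for an entry
 * on dp_dirty_dirs. Names and structure are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_dir {
	const char *name;
	bool dirty;		/* would-be entry on the dirty-dir list */
	bool snap_exists;	/* triggers the EEXIST case */
};

static int
check_with_side_effect(struct toy_dir *d)
{
	if (d->snap_exists)
		return (EEXIST);
	d->dirty = true;	/* space reserved => dir dirtied */
	return (0);
}

int
main(void)
{
	struct toy_dir dirs[3] = {
		{ "toy/fs1", false, false },
		{ "toy/fs2", false, false },
		{ "toy/vol", false, true },	/* snapshot already exists */
	};
	int rv = 0;

	for (int i = 0; i < 3; i++) {
		int err = check_with_side_effect(&dirs[i]);
		if (err != 0)
			rv = err;	/* with rv != 0 the sync task never runs */
	}

	/* rv == EEXIST, yet fs1 and fs2 are dirty: the dp_dirty_dirs shape. */
	for (int i = 0; i < 3; i++)
		printf("%s dirty=%d (rv=%d)\n", dirs[i].name, dirs[i].dirty, rv);
	return (0);
}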

@PaulZ-98
Contributor

I'm not sure why the snap -r isn't prevented in user space / libzfs. In the trivial case where a descendant dataset of a snap -r already has that snapshot, it does fail.

@PaulZ-98
Contributor

Here's a potential fix: don't reserve space in dsl_dataset_snapshot_check_impl; just use that code to find any errors (in this case EEXIST). Then, only if no errors are found, dirty the dsl_dir_t by reserving the space in a second loop over the to-be-taken snapshots.

diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index 168aea8..a2e5df7 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1419,10 +1419,6 @@ dsl_dataset_snapshot_check_impl(dsl_dataset_t *ds, const char *snapname,
                        return (error);
        }
 
-       error = dsl_dataset_snapshot_reserve_space(ds, tx);
-       if (error != 0)
-               return (error);
-
        return (0);
 }
 
@@ -1566,6 +1562,28 @@ dsl_dataset_snapshot_check(void *arg, dmu_tx_t *tx)
                }
        }
 
+       if (rv != 0)
+               return (rv);
+
+       /* Now reserve space since we passed all the checks. */
+       for (pair = nvlist_next_nvpair(ddsa->ddsa_snaps, NULL);
+           pair != NULL; pair = nvlist_next_nvpair(ddsa->ddsa_snaps, pair)) {
+               int error = 0;
+               char *name, *atp = NULL;
+               dsl_dataset_t *ds;
+               char dsname[ZFS_MAX_DATASET_NAME_LEN];
+
+               name = nvpair_name(pair);
+               atp = strchr(name, '@');
+               (void) strlcpy(dsname, name, atp - name + 1);
+               (void) dsl_dataset_hold(dp, dsname, FTAG, &ds);
+               error = dsl_dataset_snapshot_reserve_space(ds, tx);
+               dsl_dataset_rele(ds, FTAG);
+               if (error) {
+                       rv = error;
+                       break;
+               }
+       }
        return (rv);
 }

@PaulZ-98
Contributor

So with the above change, if clone_001_pos cleanup fails, then snapshot_009_pos (snap -r) also fails later with EEXIST. To test this you must withhold the cleanup retry fix from clone_001_pos, because you won't see the PANIC unless clone_001_pos cleanup fails.

@behlendorf
Contributor Author

@PaulZ-98 that makes sense. But I think this code also needs to somehow handle the case where dsl_dataset_snapshot_reserve_space() fails with ENOSPC. Otherwise we'll end up in the same situation again where the sync task doesn't run but the dsl_dir_t has been dirtied.

It also looks like dsl_dataset_snapshot_tmp_check depends on dsl_dataset_snapshot_check_impl so I'd expect it to suffer from the same issue.
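
To make that ENOSPC concern concrete, here is a toy continuation of the earlier user-space sketch (again made-up names, not ZFS code and not the proposed patch): if the reservation fails part-way through the second loop, the dirs dirtied before the failure are left behind exactly as in the EEXIST case.

/*
 * Toy model of a second loop that dirties dirs one at a time and can
 * fail part-way through with ENOSPC. All names are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_dir {
	const char *name;
	uint64_t needed;	/* space the reservation would consume */
	bool dirty;
};

static int
reserve_in_second_loop(struct toy_dir *dirs, int n, uint64_t avail)
{
	for (int i = 0; i < n; i++) {
		if (dirs[i].needed > avail)
			return (ENOSPC);	/* dirs[0..i-1] remain dirty */
		avail -= dirs[i].needed;
		dirs[i].dirty = true;		/* reservation dirties the dir */
	}
	return (0);
}

int
main(void)
{
	struct toy_dir dirs[3] = {
		{ "toy/fs1", 100, false },
		{ "toy/fs2", 100, false },
		{ "toy/fs3", 400, false },	/* too large: forces ENOSPC */
	};
	int rv = reserve_in_second_loop(dirs, 3, 300);

	/* rv == ENOSPC, yet fs1 and fs2 are dirty and no sync task will run. */
	for (int i = 0; i < 3; i++)
		printf("%s dirty=%d (rv=%d)\n", dirs[i].name, dirs[i].dirty, rv);
	return (0);
}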

@PaulZ-98
Contributor

Good points @behlendorf, I'll look at it further.

@behlendorf
Contributor Author

Now that we know what's going on, it'd be nice to add a test case to verify the eventual fix and avoid regressions.

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 20, 2019
@behlendorf behlendorf removed this from the 0.8.0 milestone Mar 20, 2019
@PaulZ-98
Contributor

PaulZ-98 commented Mar 28, 2019

Yes @behlendorf, I'm taking clone_001_pos without any cleanup plus the start of snapshot_009_pos, and combining them into a diabolical test that does panic ZFS when you don't have this fix.

PaulZ-98 added a commit to datto/zfs that referenced this issue Apr 4, 2019
A certain arrangement of existing snapshots with the same
name as specified in a new snap -r command can
create a situation during the sync phase with entries
in dp_dirty_dirs but no dirty dbufs in the MOS.

Also when zfs tests are destroying datasets and then
the pool there is a race condition between
dmu_objset_evict and dbuf_evict_one, resulting in
a hang.

Signed-off-by: Paul Zuchowski <[email protected]>
Fixes openzfs#8380
@PaulZ-98
Contributor

PaulZ-98 commented Apr 4, 2019

The new test causes the panic reliably when you don't have this fix, and it also revealed a different issue when destroying the datasets from the new test and then moving on to destroy the pool.

Apr  2 10:23:42 ub16fresh kernel: [ 2030.021096] VERIFY(diff > 0 ? child == 1 : child == 0) failed
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021099] PANIC at avl.c:620:avl_insert_here()
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021100] Showing stack for process 8721
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021102] CPU: 0 PID: 8721 Comm: dbu_evict Tainted: P           OE   4.10.0-28-generic #32~16.04.2-Ubuntu
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021103] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021104] Call Trace:
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021110]  dump_stack+0x63/0x90
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021117]  spl_dumpstack+0x29/0x30 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021120]  spl_panic+0xc5/0x100 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021168]  ? __list_add+0x17/0x40 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021203]  ? queued_spin_unlock+0xb/0x10 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021236]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021267]  ? __raw_spin_unlock+0x9/0x10 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021289]  ? dbuf_cache_target_bytes+0x23/0x40 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021311]  ? dbuf_evict_notify+0x1e/0x50 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021332]  ? dbuf_rele_and_unlock+0x517/0x5b0 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021352]  ? aggsum_add+0x107/0x120 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021355]  ? __kmalloc_node+0x1f0/0x2a0
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021358]  ? spl_kmem_alloc_impl+0x79/0x100 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021360]  avl_insert_here+0x3d4/0x410 [zavl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021387]  dnode_evict_dbufs+0x1f7/0x280 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021411]  dmu_objset_evict_dbufs+0x136/0x220 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021435]  dmu_objset_evict+0x5e/0x170 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021461]  dsl_dataset_evict_async+0x3b/0x280 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021465]  taskq_thread+0x262/0x4d0 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021467]  ? wake_up_q+0x70/0x70
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021494]  ? dsl_dataset_feature_is_active+0x20/0x20 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021496]  kthread+0x109/0x140
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021500]  ? taskq_thread_spawn+0x50/0x50 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021501]  ? kthread_create_on_node+0x60/0x60
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021504]  ret_from_fork+0x2c/0x40

And the related stacks...

root 2833 2 0 16:05 ? 00:00:00 [dbu_evict]
[<ffffffffc069d46b>] spl_panic+0xeb/0x100 [spl]
[<ffffffffc03e7084>] avl_insert_here+0x3d4/0x410 [zavl]
[<ffffffffc0839de7>] dnode_evict_dbufs+0x1f7/0x280 [zfs]
[<ffffffffc081b816>] dmu_objset_evict_dbufs+0x136/0x220 [zfs]
[<ffffffffc081f7ae>] dmu_objset_evict+0x5e/0x170 [zfs]
[<ffffffffc083df5b>] dsl_dataset_evict_async+0x3b/0x280 [zfs]
[<ffffffffc06a5bb2>] taskq_thread+0x262/0x4d0 [spl]
[<ffffffff8eea8de9>] kthread+0x109/0x140
[<ffffffff8f6d413c>] ret_from_fork+0x2c/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
root 2834 2 0 16:05 ? 00:00:02 [dbuf_evict]
[<ffffffffc0803a5b>] dbuf_destroy+0x1fb/0x6f0 [zfs]
[<ffffffffc0804061>] dbuf_evict_one+0x111/0x1c0 [zfs]
[<ffffffffc0804251>] dbuf_evict_thread+0x141/0x270 [zfs]
[<ffffffffc06a6e4b>] thread_generic_wrapper+0x7b/0xc0 [spl]
[<ffffffff8eea8de9>] kthread+0x109/0x140
[<ffffffff8f6d413c>] ret_from_fork+0x2c/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
root 18742 17666 0 16:27 pts/18 00:00:00 /root/snapexists/zfs/cmd/zpool/.libs/lt-zpool destroy -f testpool1
[<ffffffffc06a48e4>] taskq_wait+0x64/0xa0 [spl]
[<ffffffffc080bbc5>] dmu_buf_user_evict_wait+0x15/0x20 [zfs]
[<ffffffffc08a1cab>] spa_evicting_os_wait+0x9b/0xd0 [zfs]
[<ffffffffc0895840>] spa_export_common+0x3f0/0x510 [zfs]
[<ffffffffc089597a>] spa_destroy+0x1a/0x20 [zfs]
[<ffffffffc08fcc7a>] zfs_ioc_pool_destroy+0x1a/0x20 [zfs]
[<ffffffffc090090f>] zfsdev_ioctl+0x60f/0x6c0 [zfs]
[<ffffffff8f058dc1>] do_vfs_ioctl+0xa1/0x5f0
[<ffffffff8f059389>] SyS_ioctl+0x79/0x90
[<ffffffff8f6d3ebb>] entry_SYSCALL_64_fastpath+0x1e/0xad
[<ffffffffffffffff>] 0xffffffffffffffff

@PaulZ-98
Contributor

PaulZ-98 commented Apr 4, 2019

I believe that the dbuf_evict_one thread has no knowledge of the dmu_objset_evict process, and they can fight over the dn_dbufs list and its contents. In the PR for this, I have introduced some mutual exclusion such that dbuf_evict_one just returns if the chosen dbuf's objset is already being evicted.
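
As a rough sketch of that idea (plain pthreads and made-up names, not the PR's actual code), the single-buffer evictor can check an "objset is being evicted" flag under a lock and simply return, leaving the whole-objset eviction path to walk the dbuf list alone.

/*
 * Toy mutual-exclusion model: toy_dbuf_evict_one() stands in for
 * dbuf_evict_one() and toy_objset_evict() for dmu_objset_evict().
 * All names and the locking scheme here are hypothetical.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_objset {
	pthread_mutex_t lock;
	bool evicting;		/* set by the whole-objset eviction path */
	int ndbufs;		/* stand-in for the dn_dbufs contents */
};

static void
toy_dbuf_evict_one(struct toy_objset *os)
{
	pthread_mutex_lock(&os->lock);
	if (os->evicting) {
		/* Objset teardown owns the list; back off. */
		pthread_mutex_unlock(&os->lock);
		return;
	}
	if (os->ndbufs > 0)
		os->ndbufs--;	/* evict a single buffer */
	pthread_mutex_unlock(&os->lock);
}

static void *
toy_objset_evict(void *arg)
{
	struct toy_objset *os = arg;

	pthread_mutex_lock(&os->lock);
	os->evicting = true;
	os->ndbufs = 0;		/* drop everything at once */
	pthread_mutex_unlock(&os->lock);
	return (NULL);
}

int
main(void)
{
	struct toy_objset os = { PTHREAD_MUTEX_INITIALIZER, false, 8 };
	pthread_t tid;

	pthread_create(&tid, NULL, toy_objset_evict, &os);
	toy_dbuf_evict_one(&os);	/* either evicts one buf or backs off */
	pthread_join(tid, NULL);

	printf("ndbufs=%d evicting=%d\n", os.ndbufs, os.evicting);
	return (0);
}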

@gamanakis
Contributor

gamanakis commented Aug 24, 2020

I managed to trigger this in my home lab (running fc34dfb) while running zfs snapshot -r pool@snap.

[Aug24 08:48] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
[  +0.000010] PANIC at spa.c:8922:spa_sync_iterate_to_convergence()
[  +0.000002] Showing stack for process 54206
[  +0.000002] CPU: 3 PID: 54206 Comm: txg_sync Kdump: loaded Tainted: P           OE     5.7.10-1-vfio #1
[  +0.000001] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ELITE/Z390 AORUS ELITE-CF, BIOS F10c 12/18/2019
[  +0.000001] Call Trace:
[  +0.000005]  dump_stack+0x64/0x88
[  +0.000009]  spl_panic+0xef/0x117 [spl]
[  +0.000078]  ? dbuf_rele_and_unlock+0x4e8/0x620 [zfs]
[  +0.000089]  ? txg_verify+0xf/0xe0 [zfs]
[  +0.000068]  spa_sync_iterate_to_convergence+0x2e0/0x310 [zfs]
[  +0.000076]  spa_sync+0x384/0x860 [zfs]
[  +0.000063]  txg_sync_thread+0x302/0x460 [zfs]
[  +0.000058]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[  +0.000006]  thread_generic_wrapper+0x78/0xb0 [spl]
[  +0.000006]  ? IS_ERR+0x10/0x10 [spl]
[  +0.000003]  kthread+0x117/0x130
[  +0.000001]  ? __kthread_bind_mask+0x60/0x60
[  +0.000003]  ret_from_fork+0x1f/0x40

@stale

stale bot commented Aug 24, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 24, 2021
@stale stale bot closed this as completed Nov 22, 2021