
VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg) #8380

Closed
behlendorf opened this issue Feb 5, 2019 · 19 comments · May be fixed by #8578
Labels
Component: Test Suite Indicates an issue with the test framework or a test case Status: Stale No recent activity for issue Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@behlendorf
Contributor

System information

Distribution Name | Ubuntu
Distribution Version | 18.04
Linux Kernel | 4.15.0-39-generic
Architecture | x86_64
ZFS Version | zfs-0.8.0-rc3-43-g0902c4577f4b
SPL Version |

Describe the problem you're observing

Occasional failed assertion when running the ZTS. To my knowledge this has only been reproduced thus far by the CI.

[30973.259253] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
[30973.264795] PANIC at spa.c:7957:spa_sync_iterate_to_convergence()

Describe how to reproduce the problem

Occasionally reproduced by the CI when running the ZTS.

http://build.zfsonlinux.org/builders/Ubuntu%2017.04%20x86_64%20Coverage%20%28TEST%29/builds/4914/steps/shell_9/logs/console

Include any warning/errors/backtraces from the system logs

[30973.259253] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
[30973.264795] PANIC at spa.c:7957:spa_sync_iterate_to_convergence()
[30973.270149] Showing stack for process 9302
[30973.270154] CPU: 1 PID: 9302 Comm: txg_sync Tainted: P           OE   4.13.0-coverage1 #2
[30973.270162] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[30973.270166] Call Trace:
[30973.270175]  dump_stack+0xb0/0xe6
[30973.270196]  spl_dumpstack+0x39/0x50 [spl]
[30973.270212]  spl_panic+0xd7/0x140 [spl]
[30973.270452]  ? dbuf_rele+0x61/0xb0 [zfs]
[30973.270643]  ? dmu_buf_rele+0x16/0x20 [zfs]
[30973.270851]  ? zap_unlockdir+0x6b/0xd0 [zfs]
[30973.271060]  ? zap_lookup_norm+0x9c/0xc0 [zfs]
[30973.271257]  ? queued_spin_unlock+0x13/0x20 [zfs]
[30973.271465]  ? queued_spin_unlock+0x13/0x20 [zfs]
[30973.271672]  spa_sync_iterate_to_convergence+0x3e0/0x430 [zfs]
[30973.271879]  spa_sync+0x469/0xb50 [zfs]
[30973.272093]  txg_sync_thread+0x3bc/0x590 [zfs]
[30973.272306]  ? txg_dispatch_callbacks+0x180/0x180 [zfs]
[30973.272325]  thread_generic_wrapper+0xaf/0x100 [spl]
[30973.272335]  kthread+0x150/0x1b0
[30973.272353]  ? IS_ERR+0x20/0x20 [spl]
[30973.272356]  ? kthread_flush_work+0x1a0/0x1a0
@behlendorf behlendorf added the Component: Test Suite Indicates an issue with the test framework or a test case label Feb 5, 2019
@behlendorf behlendorf added this to the 0.8.0 milestone Feb 5, 2019
@PaulZ-98
Contributor

PaulZ-98 commented Feb 20, 2019

I got this just by running the ZTS snapshot portion twice in a row, which might be useful because I'm researching the "Unable to automount..." issues that directly precede the panic.

Feb 20 12:45:38 ub16fresh kernel: [72120.548071] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.1/testpool/[email protected]: 256
Feb 20 12:45:38 ub16fresh kernel: [72120.609377] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.2/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.678067] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.3/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.775444] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.4/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.850295] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.5/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.914487] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.6/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72120.975676] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.7/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72121.037588] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.8/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72121.103074] WARNING: Unable to automount /var/tmp/testdir/.zfs/snapshot/testsnap.9/testpool/[email protected]: 256
Feb 20 12:45:39 ub16fresh kernel: [72121.603153] WARNING: Unable to automount /var/tmp/testdir1/.zfs/snapshot/testsnap/testpool/testctr/testfs1@testsnap: 256
Feb 20 12:45:40 ub16fresh kernel: [72122.151665] WARNING: Unable to automount /var/tmp/testdir1/.zfs/snapshot/testsnap/testpool/testctr/testfs1@testsnap: 256
Feb 20 12:45:43 ub16fresh kernel: [72125.046754] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
Feb 20 12:45:43 ub16fresh kernel: [72125.046758] PANIC at spa.c:7957:spa_sync_iterate_to_convergence()
Feb 20 12:45:43 ub16fresh kernel: [72125.046759] Showing stack for process 20380
Feb 20 12:45:43 ub16fresh kernel: [72125.046761] CPU: 0 PID: 20380 Comm: txg_sync Tainted: P           OE   4.10.0-28-generic #32~16.04.2-Ubuntu
Feb 20 12:45:43 ub16fresh kernel: [72125.046761] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Feb 20 12:45:43 ub16fresh kernel: [72125.046762] Call Trace:
Feb 20 12:45:43 ub16fresh kernel: [72125.046775]  dump_stack+0x63/0x90
Feb 20 12:45:43 ub16fresh kernel: [72125.046781]  spl_dumpstack+0x29/0x30 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.046784]  spl_panic+0xc5/0x100 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.046786]  ? spl_kmem_zalloc+0xd9/0x170 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.046789]  ? spl_kmem_free+0x33/0x40 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.046822]  ? dmu_tx_destroy+0x12d/0x150 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046843]  ? dmu_tx_commit+0x190/0x200 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046849]  ? mutex_lock+0x12/0x40
Feb 20 12:45:43 ub16fresh kernel: [72125.046878]  ? txg_verify+0x14/0xe0 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046905]  ? txg_list_empty_impl+0x34/0x70 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046933]  spa_sync+0x1805/0x1840 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046959]  ? zfs_refcount_remove_many+0x1c4/0x250 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.046987]  ? spa_txg_history_init_io+0x104/0x110 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.047015]  txg_sync_thread+0x2f2/0x690 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.047041]  ? txg_quiesce_thread+0x550/0x550 [zfs]
Feb 20 12:45:43 ub16fresh kernel: [72125.047045]  thread_generic_wrapper+0x7b/0xc0 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.047048]  kthread+0x109/0x140
Feb 20 12:45:43 ub16fresh kernel: [72125.047051]  ? __thread_exit+0x20/0x20 [spl]
Feb 20 12:45:43 ub16fresh kernel: [72125.047052]  ? kthread_create_on_node+0x60/0x60
Feb 20 12:45:43 ub16fresh kernel: [72125.047053]  ret_from_fork+0x2c/0x40

@behlendorf
Contributor Author

@PaulZ-98 if you're able to reproduce this it would be great if you could get to the root cause. It's been difficult to reproduce locally, and the bots do occasionally encounter it.

@PaulZ-98
Contributor

PaulZ-98 commented Mar 8, 2019

I was able to reproduce this with a version from Jan 31, but not with a version pulled on March 7.

Work with tip at:

commit becdcec
Author: Tony Hutter [email protected]
Date: Wed Mar 6 16:03:03 2019 -0800

kernel_fpu fixes

Reproducible/fails with tip at:

commit 57dc41d
Author: Tony Hutter [email protected]
Date: Thu Jan 31 10:51:18 2019 -0800

Fix zpool iostat -w header names

If I have time I will do a git bisect to try to find what fixes it.

Have you seen it recently with buildbot @behlendorf?

@behlendorf
Contributor Author

@PaulZ-98 now that you mention it, I haven't. This build on Feb 25th from PR 8418 is the last time any of the bots hit this. That PR was based off 2d76ab9, which narrows it down a little further. It would be ideal if we could bisect this. I'll make sure to keep an eye out for any more instances of this in the CI.

@PaulZ-98
Contributor

PaulZ-98 commented Mar 9, 2019

I confirmed with git bisect, and also by hand-applying the patch, that @jwk404's fix #8409 eliminates the problem. It is still a bit concerning that when clone_001_pos cleanup fails, you can later get a panic.

Here's the test log from when clone_001_pos cleanup fails (pre 8409 fix).

10:42:38.79 NOTE: Performing local cleanup via log_onexit (cleanup_all)
10:42:39.04 SUCCESS: zfs destroy -Rf testpool/testfs@testsnap
10:42:39.04 SUCCESS: rm -rf /var/tmp/testdir.0
10:42:39.10 ERROR: zfs destroy -Rf testpool/testvol@testsnap exited 1
10:42:39.10 cannot destroy 'testpool/testclone1': dataset is busy cannot destroy snapshot testpool/testvol@testsnap: snapshot is cloned
10:42:39.10 NOTE: Performing test-fail callback (/root/zap_prefetch/zfs/tests/zfs-tests/callbacks/zfs_dbgmsg.ksh)

The PANIC occurs in snapshot_009_pos.ksh when we create a zvol and immediately snap -r the whole pool. Somehow this situation (the state left behind by the clone_001_pos cleanup failure) causes dp_dirty_dirs to have entries at the end of dsl_pool_sync (because the synctasks from zfs snapshot -r testpool@testsnap dirtied the dirs). Normally we count on dp_dirty_dirs being empty, but here the ASSERT is tripped later during spa_sync.

@ikozhukhov
Contributor

This is probably a similar situation to https://www.illumos.org/issues/10445, where we can see a panic with snap-slider and when destroying an older BE where we have several clones of snapshots.

@behlendorf
Contributor Author

@PaulZ-98 this is definitely still concerning. I'm glad to see the spurious failures in the CI go away, but clearly there's still a bug here which needs to be fixed.

@PaulZ-98
Contributor

PaulZ-98 commented Mar 12, 2019

As mentioned, testpool/testvol@testsnap still exists at the time of running snapshot_009_pos.ksh.

009 attempts zfs snapshot -r testpool@testsnap

Most of the dsl_dataset_snapshot_check calls then return 0, except the testpool/testvol@testsnap check, which returns EEXIST. So the dsl_dataset_snapshot_sync calls never happen for any of the -r snaps. At this point we have some dp_dirty_dirs courtesy of dsl_dataset_snapshot_check_impl --> dsl_dataset_snapshot_reserve_space, but the MOS is never dirtied because we don't take any snapshots.

This complicates spa_sync_iterate_to_convergence, because normally when dp_dirty_dirs has entries the MOS has also been dirtied.

7948  		if (pass == 1 &&
7949  		    spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
7950  		    !dmu_objset_is_dirty(mos, txg)) {
7951  			/*
7952  			 * Nothing changed on the first pass, therefore this
7953  			 * TXG is a no-op. Avoid syncing deferred frees, so
7954  			 * that we can keep this TXG as a no-op.
7955  			 */
7956  			ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
7957  			ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7958  			ASSERT(txg_list_empty(&dp->dp_sync_tasks, txg));
7959  			ASSERT(txg_list_empty(&dp->dp_early_sync_tasks, txg));
7960  			break;
7961  		}

I don't see an easy way to undo the work done by dsl_dataset_snapshot_reserve_space, because the dsl_sync_task design doesn't provide a cleanup-callback capability.
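
For illustration only, here is a toy user-space sketch of the failure mode described above (made-up names, plain C, not ZFS code): each per-dataset check reserves space as a side effect, so once a later check fails with EEXIST the sync task never runs, yet the dirs reserved by the earlier checks stay dirty.

/*
 * Toy model of a check phase with side effects. check_with_side_effect()
 * stands in for dsl_dataset_snapshot_check_impl() calling
 * dsl_dataset_snapshot_reserve_space(); "dirty" stands in for an entry
 * on dp_dirty_dirs. Names and structure are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_dir {
	const char *name;
	bool dirty;		/* would-be entry on the dirty-dir list */
	bool snap_exists;	/* triggers the EEXIST case */
};

static int
check_with_side_effect(struct toy_dir *d)
{
	if (d->snap_exists)
		return (EEXIST);
	d->dirty = true;	/* space reserved => dir dirtied */
	return (0);
}

int
main(void)
{
	struct toy_dir dirs[3] = {
		{ "toy/fs1", false, false },
		{ "toy/fs2", false, false },
		{ "toy/vol", false, true },	/* snapshot already exists */
	};
	int rv = 0;

	for (int i = 0; i < 3; i++) {
		int err = check_with_side_effect(&dirs[i]);
		if (err != 0)
			rv = err;	/* with rv != 0 the sync task never runs */
	}

	/* rv == EEXIST, yet fs1 and fs2 are dirty: the dp_dirty_dirs shape. */
	for (int i = 0; i < 3; i++)
		printf("%s dirty=%d (rv=%d)\n", dirs[i].name, dirs[i].dirty, rv);
	return (0);
}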

@PaulZ-98
Contributor

I'm not sure why the snap -r isn't prevented in user space / libzfs. In the trivial case where a descendant dataset of a snap -r already has that snapshot, it does fail.

@PaulZ-98
Contributor

Here's a potential fix: don't reserve space in dsl_dataset_snapshot_check_impl; just use that code to find any errors (in this case EEXIST). Then, only if no errors are found, dirty the dsl_dir_t by reserving the space in a second loop over the to-be-taken snapshots.

diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index 168aea8..a2e5df7 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1419,10 +1419,6 @@ dsl_dataset_snapshot_check_impl(dsl_dataset_t *ds, const char *snapname,
                        return (error);
        }
 
-       error = dsl_dataset_snapshot_reserve_space(ds, tx);
-       if (error != 0)
-               return (error);
-
        return (0);
 }
 
@@ -1566,6 +1562,28 @@ dsl_dataset_snapshot_check(void *arg, dmu_tx_t *tx)
                }
        }
 
+       if (rv != 0)
+               return (rv);
+
+       /* Now reserve space since we passed all the checks. */
+       for (pair = nvlist_next_nvpair(ddsa->ddsa_snaps, NULL);
+           pair != NULL; pair = nvlist_next_nvpair(ddsa->ddsa_snaps, pair)) {
+               int error = 0;
+               char *name, *atp = NULL;
+               dsl_dataset_t *ds;
+               char dsname[ZFS_MAX_DATASET_NAME_LEN];
+
+               name = nvpair_name(pair);
+               atp = strchr(name, '@');
+               (void) strlcpy(dsname, name, atp - name + 1);
+               (void) dsl_dataset_hold(dp, dsname, FTAG, &ds);
+               error = dsl_dataset_snapshot_reserve_space(ds, tx);
+               dsl_dataset_rele(ds, FTAG);
+               if (error) {
+                       rv = error;
+                       break;
+               }
+       }
        return (rv);
 }

@PaulZ-98
Contributor

So with the above change, if clone_001_pos cleanup fails, then snapshot_009_pos (snap -r) also fails later with EEXIST. To test this you must withhold the cleanup retry fix from clone_001_pos, because you won't see the PANIC unless clone_001_pos cleanup fails.

@behlendorf
Contributor Author

@PaulZ-98 that makes sense. But I think this code also needs to somehow handle the case where dsl_dataset_snapshot_reserve_space() fails with ENOSPC. Otherwise we'll end up in the same situation again where the sync task doesn't run but the dsl_dir_t has been dirtied.

It also looks like dsl_dataset_snapshot_tmp_check depends on dsl_dataset_snapshot_check_impl so I'd expect it to suffer from the same issue.
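
To make that ENOSPC concern concrete, here is a toy continuation of the earlier user-space sketch (again made-up names, not ZFS code and not the proposed patch): if the reservation fails part-way through the second loop, the dirs dirtied before the failure are left behind exactly as in the EEXIST case.

/*
 * Toy model of a second loop that dirties dirs one at a time and can
 * fail part-way through with ENOSPC. All names are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_dir {
	const char *name;
	uint64_t needed;	/* space the reservation would consume */
	bool dirty;
};

static int
reserve_in_second_loop(struct toy_dir *dirs, int n, uint64_t avail)
{
	for (int i = 0; i < n; i++) {
		if (dirs[i].needed > avail)
			return (ENOSPC);	/* dirs[0..i-1] remain dirty */
		avail -= dirs[i].needed;
		dirs[i].dirty = true;		/* reservation dirties the dir */
	}
	return (0);
}

int
main(void)
{
	struct toy_dir dirs[3] = {
		{ "toy/fs1", 100, false },
		{ "toy/fs2", 100, false },
		{ "toy/fs3", 400, false },	/* too large: forces ENOSPC */
	};
	int rv = reserve_in_second_loop(dirs, 3, 300);

	/* rv == ENOSPC, yet fs1 and fs2 are dirty and no sync task will run. */
	for (int i = 0; i < 3; i++)
		printf("%s dirty=%d (rv=%d)\n", dirs[i].name, dirs[i].dirty, rv);
	return (0);
}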

@PaulZ-98
Contributor

Good points @behlendorf, I'll look at it further.

@behlendorf
Contributor Author

Now that we know what's going on, it'd be nice to add a test case to verify the eventual fix and avoid regressions.

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 20, 2019
@behlendorf behlendorf removed this from the 0.8.0 milestone Mar 20, 2019
@PaulZ-98
Contributor

PaulZ-98 commented Mar 28, 2019

Yes @behlendorf, I'm taking clone_001_pos without any cleanup plus the start of snapshot_009_pos, and combining them into a diabolical test that does panic ZFS when you don't have this fix.

PaulZ-98 added a commit to datto/zfs that referenced this issue Apr 4, 2019
A certain arrangement of existing snapshots with the same
name as specified in a new snap -r command can
create a situation during the sync phase with entries
in dp_dirty_dirs but no dirty dbufs in the MOS.

Also when zfs tests are destroying datasets and then
the pool there is a race condition between
dmu_objset_evict and dbuf_evict_one, resulting in
a hang.

Signed-off-by: Paul Zuchowski <[email protected]>
Fixes openzfs#8380
@PaulZ-98
Contributor

PaulZ-98 commented Apr 4, 2019

The new test causes the panic reliably when you don't have this fix, and it also revealed a different issue when destroying the datasets from the new test and then moving on to destroy the pool.

Apr  2 10:23:42 ub16fresh kernel: [ 2030.021096] VERIFY(diff > 0 ? child == 1 : child == 0) failed
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021099] PANIC at avl.c:620:avl_insert_here()
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021100] Showing stack for process 8721
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021102] CPU: 0 PID: 8721 Comm: dbu_evict Tainted: P           OE   4.10.0-28-generic #32~16.04.2-Ubuntu
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021103] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021104] Call Trace:
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021110]  dump_stack+0x63/0x90
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021117]  spl_dumpstack+0x29/0x30 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021120]  spl_panic+0xc5/0x100 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021168]  ? __list_add+0x17/0x40 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021203]  ? queued_spin_unlock+0xb/0x10 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021236]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021267]  ? __raw_spin_unlock+0x9/0x10 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021289]  ? dbuf_cache_target_bytes+0x23/0x40 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021311]  ? dbuf_evict_notify+0x1e/0x50 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021332]  ? dbuf_rele_and_unlock+0x517/0x5b0 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021352]  ? aggsum_add+0x107/0x120 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021355]  ? __kmalloc_node+0x1f0/0x2a0
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021358]  ? spl_kmem_alloc_impl+0x79/0x100 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021360]  avl_insert_here+0x3d4/0x410 [zavl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021387]  dnode_evict_dbufs+0x1f7/0x280 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021411]  dmu_objset_evict_dbufs+0x136/0x220 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021435]  dmu_objset_evict+0x5e/0x170 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021461]  dsl_dataset_evict_async+0x3b/0x280 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021465]  taskq_thread+0x262/0x4d0 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021467]  ? wake_up_q+0x70/0x70
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021494]  ? dsl_dataset_feature_is_active+0x20/0x20 [zfs]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021496]  kthread+0x109/0x140
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021500]  ? taskq_thread_spawn+0x50/0x50 [spl]
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021501]  ? kthread_create_on_node+0x60/0x60
Apr  2 10:23:42 ub16fresh kernel: [ 2030.021504]  ret_from_fork+0x2c/0x40

And the related stacks...

root 2833 2 0 16:05 ? 00:00:00 [dbu_evict]
[<ffffffffc069d46b>] spl_panic+0xeb/0x100 [spl]
[<ffffffffc03e7084>] avl_insert_here+0x3d4/0x410 [zavl]
[<ffffffffc0839de7>] dnode_evict_dbufs+0x1f7/0x280 [zfs]
[<ffffffffc081b816>] dmu_objset_evict_dbufs+0x136/0x220 [zfs]
[<ffffffffc081f7ae>] dmu_objset_evict+0x5e/0x170 [zfs]
[<ffffffffc083df5b>] dsl_dataset_evict_async+0x3b/0x280 [zfs]
[<ffffffffc06a5bb2>] taskq_thread+0x262/0x4d0 [spl]
[<ffffffff8eea8de9>] kthread+0x109/0x140
[<ffffffff8f6d413c>] ret_from_fork+0x2c/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
root 2834 2 0 16:05 ? 00:00:02 [dbuf_evict]
[<ffffffffc0803a5b>] dbuf_destroy+0x1fb/0x6f0 [zfs]
[<ffffffffc0804061>] dbuf_evict_one+0x111/0x1c0 [zfs]
[<ffffffffc0804251>] dbuf_evict_thread+0x141/0x270 [zfs]
[<ffffffffc06a6e4b>] thread_generic_wrapper+0x7b/0xc0 [spl]
[<ffffffff8eea8de9>] kthread+0x109/0x140
[<ffffffff8f6d413c>] ret_from_fork+0x2c/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
root 18742 17666 0 16:27 pts/18 00:00:00 /root/snapexists/zfs/cmd/zpool/.libs/lt-zpool destroy -f testpool1
[<ffffffffc06a48e4>] taskq_wait+0x64/0xa0 [spl]
[<ffffffffc080bbc5>] dmu_buf_user_evict_wait+0x15/0x20 [zfs]
[<ffffffffc08a1cab>] spa_evicting_os_wait+0x9b/0xd0 [zfs]
[<ffffffffc0895840>] spa_export_common+0x3f0/0x510 [zfs]
[<ffffffffc089597a>] spa_destroy+0x1a/0x20 [zfs]
[<ffffffffc08fcc7a>] zfs_ioc_pool_destroy+0x1a/0x20 [zfs]
[<ffffffffc090090f>] zfsdev_ioctl+0x60f/0x6c0 [zfs]
[<ffffffff8f058dc1>] do_vfs_ioctl+0xa1/0x5f0
[<ffffffff8f059389>] SyS_ioctl+0x79/0x90
[<ffffffff8f6d3ebb>] entry_SYSCALL_64_fastpath+0x1e/0xad
[<ffffffffffffffff>] 0xffffffffffffffff

@PaulZ-98
Contributor

PaulZ-98 commented Apr 4, 2019

I believe that the dbuf_evict_one thread has no knowledge of the dmu_objset_evict process, and they can fight over the dn_dbufs list and its contents. In the PR for this, I have introduced some mutual exclusion such that dbuf_evict_one just returns if the chosen dbuf's objset is already being evicted.
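
As a rough sketch of that idea (plain pthreads and made-up names, not the PR's actual code), the single-buffer evictor can check an "objset is being evicted" flag under a lock and simply return, leaving the whole-objset eviction path to walk the dbuf list alone.

/*
 * Toy mutual-exclusion model: toy_dbuf_evict_one() stands in for
 * dbuf_evict_one() and toy_objset_evict() for dmu_objset_evict().
 * All names and the locking scheme here are hypothetical.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_objset {
	pthread_mutex_t lock;
	bool evicting;		/* set by the whole-objset eviction path */
	int ndbufs;		/* stand-in for the dn_dbufs contents */
};

static void
toy_dbuf_evict_one(struct toy_objset *os)
{
	pthread_mutex_lock(&os->lock);
	if (os->evicting) {
		/* Objset teardown owns the list; back off. */
		pthread_mutex_unlock(&os->lock);
		return;
	}
	if (os->ndbufs > 0)
		os->ndbufs--;	/* evict a single buffer */
	pthread_mutex_unlock(&os->lock);
}

static void *
toy_objset_evict(void *arg)
{
	struct toy_objset *os = arg;

	pthread_mutex_lock(&os->lock);
	os->evicting = true;
	os->ndbufs = 0;		/* drop everything at once */
	pthread_mutex_unlock(&os->lock);
	return (NULL);
}

int
main(void)
{
	struct toy_objset os = { PTHREAD_MUTEX_INITIALIZER, false, 8 };
	pthread_t tid;

	pthread_create(&tid, NULL, toy_objset_evict, &os);
	toy_dbuf_evict_one(&os);	/* either evicts one buf or backs off */
	pthread_join(tid, NULL);

	printf("ndbufs=%d evicting=%d\n", os.ndbufs, os.evicting);
	return (0);
}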

@gamanakis
Contributor

gamanakis commented Aug 24, 2020

I managed to trigger this in my home lab (running fc34dfb) while running zfs snapshot -r pool@snap.

[Aug24 08:48] VERIFY(txg_list_empty(&dp->dp_dirty_dirs, txg)) failed
[  +0.000010] PANIC at spa.c:8922:spa_sync_iterate_to_convergence()
[  +0.000002] Showing stack for process 54206
[  +0.000002] CPU: 3 PID: 54206 Comm: txg_sync Kdump: loaded Tainted: P           OE     5.7.10-1-vfio #1
[  +0.000001] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ELITE/Z390 AORUS ELITE-CF, BIOS F10c 12/18/2019
[  +0.000001] Call Trace:
[  +0.000005]  dump_stack+0x64/0x88
[  +0.000009]  spl_panic+0xef/0x117 [spl]
[  +0.000078]  ? dbuf_rele_and_unlock+0x4e8/0x620 [zfs]
[  +0.000089]  ? txg_verify+0xf/0xe0 [zfs]
[  +0.000068]  spa_sync_iterate_to_convergence+0x2e0/0x310 [zfs]
[  +0.000076]  spa_sync+0x384/0x860 [zfs]
[  +0.000063]  txg_sync_thread+0x302/0x460 [zfs]
[  +0.000058]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[  +0.000006]  thread_generic_wrapper+0x78/0xb0 [spl]
[  +0.000006]  ? IS_ERR+0x10/0x10 [spl]
[  +0.000003]  kthread+0x117/0x130
[  +0.000001]  ? __kthread_bind_mask+0x60/0x60
[  +0.000003]  ret_from_fork+0x1f/0x40

@stale

stale bot commented Aug 24, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 24, 2021
@stale stale bot closed this as completed Nov 22, 2021