
Linux: Fix zfs_prune panics #16770

Merged
merged 1 commit into openzfs:master from snajpa:fix-lowmem on Nov 21, 2024
Conversation

snajpa
Contributor

@snajpa snajpa commented Nov 16, 2024

Motivation and Context

Linux: Fix zfs_prune panics:
#16324

Description

Linux: Fix zfs_prune panics

by protecting against sb->s_shrink eviction on umount with newer kernels

deactivate_locked_super calls shrinker_free and only then
sops->kill_sb cb, resulting in UAF on umount when trying
to reach for the shrinker functions in zpl_prune_sb of
in-umount dataset
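
For reference, the guard this PR adds in zpl_prune_sb() (module/os/linux/zfs/zpl_super.c) is essentially the following; this is reconstructed from the removed lines of the follow-up diff quoted later in this thread, not copied from the merged commit itself:

	if (atomic_inc_not_zero(&sb->s_active)) {
		/* s_active was non-zero, so umount is not tearing the sb
		 * down yet; hold the extra reference across the prune. */
		(void) -zfs_prune(sb, nr_to_scan, &objects);
		atomic_dec(&sb->s_active);
	}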

How Has This Been Tested?

A low-memory-scenario docker pull with zfs as the storage backend now passes. A template build at vpsFree also passes.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Nov 16, 2024
@snajpa snajpa marked this pull request as ready for review November 16, 2024 15:00
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Nov 16, 2024
@satmandu
Contributor

@behlendorf Any chance of getting this reviewed for 2.3.0-rc4 (#16760)?
(I'm biased as someone who makes heavy use of docker.)

@behlendorf behlendorf self-requested a review November 19, 2024 18:06
@snajpa snajpa force-pushed the fix-lowmem branch 2 times, most recently from 38893b6 to 649c441 on November 20, 2024 15:26
Contributor

@behlendorf behlendorf left a comment

The free_inode change makes sense, thanks for spotting it. Could you just move it into its own PR?

@snajpa
Contributor Author

snajpa commented Nov 20, 2024

The free_inode change makes sense, thanks for spotting it. Could you just move it into its own PR?

#16788

@snajpa snajpa changed the title from "Linux: fix 2 lowmem bugs" to "Linux: Fix zfs_prune panics" on Nov 20, 2024
by protecting against sb->s_shrink eviction on umount with newer kernels

deactivate_locked_super calls shrinker_free and only then
sops->kill_sb cb, resulting in UAF on umount when trying
to reach for the shrinker functions in zpl_prune_sb of
in-umount dataset

Signed-off-by: Pavel Snajdr <[email protected]>
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 21, 2024
@AllKind
Contributor

AllKind commented Nov 21, 2024

@tonyhutter Maybe it'd be good to have this fix in 2.2.7?

@behlendorf behlendorf merged commit 38c0324 into openzfs:master Nov 21, 2024
23 checks passed
@satmandu satmandu mentioned this pull request Nov 22, 2024
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Nov 23, 2024
by protecting against sb->s_shrink eviction on umount with newer kernels

deactivate_locked_super calls shrinker_free and only then
sops->kill_sb cb, resulting in UAF on umount when trying
to reach for the shrinker functions in zpl_prune_sb of
in-umount dataset

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Adam Moss <[email protected]>
Signed-off-by: Pavel Snajdr <[email protected]>
Closes openzfs#16770
ixhamza pushed a commit to truenas/zfs that referenced this pull request Dec 2, 2024
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Dec 3, 2024
arter97 pushed a commit to arter97/zfs that referenced this pull request Dec 9, 2024
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Jan 26, 2025
@youzhongyang
Contributor

This PR may be the culprit of an issue we observed on one of our servers.

When trying to destroy a dataset, it failed with 'dataset busy':

# zfs destroy {zfs dataset}
cannot destroy '{zfs dataset}': dataset is busy

The dataset was unmounted; when trying to mount it with 'zfs mount {zfs dataset}', the mount got completely stuck with the following stack trace:

[<0>] grab_super_dead+0xd8/0x140
[<0>] sget+0x1ce/0x2a0
[<0>] zpl_mount_impl+0xd3/0x290 [zfs]
[<0>] zpl_mount+0x32/0x80 [zfs]
[<0>] legacy_get_tree+0x2d/0x60
[<0>] vfs_get_tree+0x2c/0xf0
[<0>] do_new_mount+0x192/0x330
[<0>] path_mount+0x1d9/0x820
[<0>] __x64_sys_mount+0x121/0x160
[<0>] x64_sys_call+0x1a6a/0x1c90
[<0>] do_syscall_64+0x58/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2

where it got stuck here:
https://github.com/torvalds/linux/blob/v6.6/fs/super.c#L582

Inspecting the super block of the dataset shows it has s_active 0:

crash> super_block.s_flags,s_active,s_count -x ffff88c07decd800
  s_flags = 0x60010400,
  s_active = {
    counter = 0x0
  },
  s_count = 0x2,

The s_flags = SB_ACTIVE | SB_BORN | SB_POSIXACL | SB_NOATIME.

It seems that when zpl_prune_sb() messes with sb->s_active while a 'zfs destroy' is in progress, we can get into this weird situation.

The normal cleanup procedure is cleanup_mnt() -> deactivate_super() -> deactivate_locked_super() -> zpl_kill_sb(). sb->s_active is atomically checked and changed in deactivate_super() and deactivate_locked_super(), so in theory zpl_prune_sb() could get in the way and alter the normal procedure by messing with the sb->s_active counter.

I am working on reproducing the issue, but it doesn't seem to be easy - it takes perfect timing for it to happen.

I apologize if it's inappropriate to post in a closed PR.

@youzhongyang
Contributor

I was able to reproduce the issue by simulating what zfs_prune_cb() does in a dummy driver.

Here is the code snippet:

/*
 * Simulates the old zpl_prune_sb() s_active dance against a dataset's
 * super_block. test_sb_t and get_ds_sb() are helpers local to the dummy
 * driver; get_ds_sb() looks up the super_block for the given path.
 */
static void test_sb(void *data)
{
        test_sb_t *pdata = data;
        struct super_block *sb = NULL;
        printk(KERN_INFO "loops %lu sleep_ms %lu sleep_ms_iter %lu path %s\n",
          pdata->loops, pdata->sleep_ms, pdata->sleep_ms_iter, pdata->path);
        sb = get_ds_sb(pdata->path);
        if (!sb)
                return;
        printk(KERN_INFO "super_block %px s_active %d s_count %d s_fs_info %px\n",
          sb, sb->s_active.counter, sb->s_count, sb->s_fs_info);

        while (1) {
                /* Stop once the sb has been torn down or unhashed. */
                if (sb->s_fs_info == NULL ||
                  (sb->s_instances.next == NULL && sb->s_instances.pprev == NULL))
                        break;
                /*
                 * Same guard the old zpl_prune_sb() used: bump s_active if it
                 * is non-zero, hold the reference for a while, then drop it.
                 */
                if (atomic_inc_not_zero(&sb->s_active)) {
                        schedule_timeout_interruptible(msecs_to_jiffies(pdata->sleep_ms));
                        atomic_dec(&sb->s_active);
                        schedule_timeout_interruptible(msecs_to_jiffies(pdata->sleep_ms_iter));
                        if (pdata->loops == 0)
                                continue;
                        pdata->loops--;
                        if (pdata->loops == 0)
                                break;
                        continue;
                }
                break;
        }
}
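
For context, a plausible definition of the test_sb_t parameter block, inferred from how its fields are used above (the actual definition is not shown here):

typedef struct test_sb {
        unsigned long loops;         /* iterations; 0 = keep going until the sb goes away */
        unsigned long sleep_ms;      /* how long to hold the extra s_active reference */
        unsigned long sleep_ms_iter; /* pause between iterations */
        char *path;                  /* dataset/mountpoint path passed to get_ds_sb() */
} test_sb_t;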

@snajpa - may I have your attention? It seems this PR fixes one issue but introduces another.

@snajpa
Contributor Author

snajpa commented Feb 20, 2025

@youzhongyang that is weird, I used atomic_inc_not_zero specifically to avoid increasing it from zero to one in edge cases where the sb is in the process of going away; am I missing something at first glance?

@snajpa
Contributor Author

snajpa commented Feb 20, 2025

@youzhongyang can you try to change the condition in zpl_prune_sb from

	if (atomic_inc_not_zero(&sb->s_active)) {

to

	if ((sb->s_flags & SB_ACTIVE) && atomic_inc_not_zero(&sb->s_active)) {

and try to reproduce with that?

this was a thing I was unsure was worth doing (or whether it was worth the effort to make it work if it didn't, can't remember); maybe skipping it altogether wasn't the best of ideas, though it didn't show up in any testing nor in our prod so far

@youzhongyang
Contributor

@snajpa I added (sb->s_flags & SB_ACTIVE); the issue is still reproducible.

@youzhongyang
Contributor

Here is my explanation:

In the Linux kernel, deactivate_super() and deactivate_locked_super() are as follows:

void deactivate_super(struct super_block *s)
{
	if (!atomic_add_unless(&s->s_active, -1, 1)) {
		__super_lock_excl(s);
		deactivate_locked_super(s);
	}
}

void deactivate_locked_super(struct super_block *s)
{
	struct file_system_type *fs = s->s_type;
	if (atomic_dec_and_test(&s->s_active)) {
		shrinker_free(s->s_shrink);
		fs->kill_sb(s);

		kill_super_notify(s);
		...
		put_filesystem(fs);
		put_super(s);
	} else {
		super_unlock_excl(s);
	}
}

deactivate_super():
If sb->s_active is 1, then hold sb->s_umount semaphore
for 'write' (by down_write()), call deactivate_locked_super();
otherwise atomically decrement sb->s_active by 1 and return.

deactivate_locked_super():
Atomically decrement sb->s_active by 1, if it becomes 0,
then call shrinker_free(), zpl_kill_sb(), and so on; otherwise
release the sb->s_umount semaphore.

If an outsider (such as zpl_prune_sb()) manipulates the sb->s_active counter, then even though each update happens atomically, as you can see from the above analysis the code path taken when sb->s_active drops to 0 may be missed and never get executed.
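
Concretely, here is one possible interleaving, assuming the pre-patch zpl_prune_sb() guard (atomic_inc_not_zero()/atomic_dec() around the prune) and the deactivate_super()/deactivate_locked_super() code quoted above; this is a reconstruction of the race, not a captured trace:

umount task                                   shrinker callback (old zpl_prune_sb)
-----------                                   ------------------------------------
                                              atomic_inc_not_zero(&sb->s_active)
                                                /* s_active: 1 -> 2, prune runs */
cleanup_mnt()
  deactivate_super()
    atomic_add_unless(&s_active, -1, 1)
      /* s_active is 2, so it is simply
         decremented to 1 and we return;
         deactivate_locked_super() is
         never called */
                                              atomic_dec(&sb->s_active)
                                                /* s_active: 1 -> 0, but nothing
                                                   here calls shrinker_free(),
                                                   ->kill_sb() or put_super() */

The superblock is left with s_active == 0 but is never killed; presumably zpl_kill_sb()/zfs_umount() never runs, so the objset stays held ('zfs destroy' reports the dataset busy), and a later mount finds the stale superblock in sget() and blocks in grab_super_dead(), which matches the stack trace and crash output above.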

@snajpa
Contributor Author

snajpa commented Feb 27, 2025

@youzhongyang well if you have any ideas how to fix it, please go ahead and try :)

if not, I should be able to get to this in about a week +-

we'll probably need to figure out yet another way to avoid the shrinkers going away from under zfs_prune, if s_active really can't be used; SB_ACTIVE wasn't enough though, so it's a bit of a pickle :D

@youzhongyang
Contributor

@snajpa - I thought about it but haven't been able to figure out a better way. Relying on the info provided by super_block seems inadequate.

We never had a panic caused by zfs_prune, so we reverted this commit on our servers. Please go ahead and try fixing it.

@snajpa
Contributor Author

snajpa commented Mar 6, 2025

@youzhongyang if you have some time to spare, can I ask you to try to reproduce with this patch?

diff --git a/module/os/linux/zfs/zpl_super.c b/module/os/linux/zfs/zpl_super.c
index e27e9fd02..3bc2e587e 100644
--- a/module/os/linux/zfs/zpl_super.c
+++ b/module/os/linux/zfs/zpl_super.c
@@ -385,17 +385,15 @@ zpl_prune_sb(uint64_t nr_to_scan, void *arg)
        int objects = 0;
 
        /*
-        * deactivate_locked_super calls shrinker_free and only then
-        * sops->kill_sb cb, resulting in UAF on umount when trying to reach
-        * for the shrinker functions in zpl_prune_sb of in-umount dataset.
-        * Increment if s_active is not zero, but don't prune if it is -
-        * umount could be underway.
+        * Ensure the superblock is not in the process of being torn down.
         */
-       if (atomic_inc_not_zero(&sb->s_active)) {
-               (void) -zfs_prune(sb, nr_to_scan, &objects);
-               atomic_dec(&sb->s_active);
+       if (down_read_trylock(&sb->s_umount)) {
+               if (!(sb->s_flags & SB_DYING) && sb->s_root &&
+                   (sb->s_flags & SB_BORN)) {
+                       (void) -zfs_prune(sb, nr_to_scan, &objects);
+                   }
+               up_read(&sb->s_umount);
        }
-
 }
 
 const struct super_operations zpl_super_operations = {
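
The idea here appears to be to stop touching s_active altogether: deactivate_locked_super() runs with sb->s_umount held for write (deactivate_super() takes it via __super_lock_excl() before calling it, as quoted above), so holding s_umount for read in zpl_prune_sb() should keep shrinker_free() and ->kill_sb() from running concurrently, while down_read_trylock() simply skips the prune when an umount already owns the lock. The SB_BORN/SB_DYING/s_root checks additionally skip superblocks that are still being set up or already being torn down.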

@youzhongyang
Contributor

@snajpa - I tested your patch; the issue is no longer reproducible. Thanks.

@snajpa
Contributor Author

snajpa commented Mar 6, 2025

@youzhongyang awesome, thank you very much! I'm going to make a PR with that patch now. Thanks again for all your time spent on this, very much appreciated!

@snajpa snajpa mentioned this pull request Mar 6, 2025