
SEEK_DATA fails upon a file just opened after being rewritten by mmap #11697

Closed
ned14 opened this issue Mar 5, 2021 · 4 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments


ned14 commented Mar 5, 2021

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04
Linux Kernel 5.8.0-44-generic
Architecture x64
ZFS Version 0.8.4-1ubuntu11.1
SPL Version 0.8.4-1ubuntu11.1

Describe the problem you're observing

We tested a custom database after moving from Ubuntu 18.04 to 20.04 and found a very odd regression in ZFS. For brevity, I have repeated only the pertinent parts of the strace here:

openat(21, "010100", O_RDWR|O_CREAT, 0660) = 22
fstat(22, {st_mode=S_IFREG|0660, st_size=136, ...}) = 0
fstat(22, {st_mode=S_IFREG|0660, st_size=136, ...}) = 0
mmap(NULL, 136, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NORESERVE, 22, 0) = 0x7f23ba468000
flock(22, LOCK_EX)                      = 0

Opens the metadata file for modification, maps it into memory, and takes an exclusive lock. Existing metadata is read from the map.

fstat(22, {st_mode=S_IFREG|0660, st_size=136, ...}) = 0
ftruncate(22, 394)                      = 0

Resizes the metadata file to the new metadata size.

fstat(22, {st_mode=S_IFREG|0660, st_size=394, ...}) = 0
munmap(0x7f23ba468000, 4096)            = 0
fstat(22, {st_mode=S_IFREG|0660, st_size=394, ...}) = 0
mmap(NULL, 394, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NORESERVE, 22, 0) = 0x7f23ba468000

Remaps the file to accommodate the larger size, and writes the updated metadata into the map.

flock(22, LOCK_UN)                      = 0
munmap(0x7f23ba468000, 4096)            = 0
close(22)                               = 0

Releases the exclusive lock, unmaps the file, and closes it.
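The write-side sequence above can be sketched as follows (a minimal illustration of the access pattern, not the actual database code; the filename matches the trace, but the sizes and payload bytes are illustrative):

```python
import fcntl
import mmap
import os
import tempfile

# Rewrite a small file through a shared mapping, growing it with
# ftruncate first, then close it -- mirroring the strace above.
path = os.path.join(tempfile.mkdtemp(), "010100")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o660)
os.ftruncate(fd, 136)                      # file starts at 136 bytes
m = mmap.mmap(fd, 136, mmap.MAP_SHARED)    # map it into memory
fcntl.flock(fd, fcntl.LOCK_EX)             # take the exclusive lock
m[:8] = b"metadata"                        # illustrative payload via the map
m.close()                                  # munmap the old mapping
os.ftruncate(fd, 394)                      # grow to the new size
m = mmap.mmap(fd, 394, mmap.MAP_SHARED)    # remap at the larger size
m[136:144] = b"newdata!"                   # write updated metadata
fcntl.flock(fd, fcntl.LOCK_UN)             # release the lock
m.close()                                  # unmap
os.close(fd)                               # close the file
```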

openat(6, "b64e11/0001/a66d73/0000000000/010100", O_RDONLY) = 22
fstat(22, {st_mode=S_IFREG|0660, st_size=394, ...}) = 0
lseek(22, 0, SEEK_DATA)                 = -1 ENXIO (No such device or address)

Reopens the metadata file that was just updated; however, SEEK_DATA says it contains no valid extents (ENXIO). This causes a sparse-file content-clone routine to replicate the file as all zero bits, because that would be correct if the source file genuinely had no valid extents. That, in turn, causes a chain of failures further down the line. It took some hours to discover the originating cause.
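The failure mode can be illustrated with a minimal probe (the helper name is hypothetical; it mirrors what a SEEK_DATA-based sparse-copy routine does): an ENXIO from lseek is taken to mean "no data extents at all", which is exactly what makes the clone come out as all zeroes.

```python
import errno
import os
import tempfile

def first_data_offset(fd):
    """Offset of the first data extent, or None when the kernel
    reports ENXIO, i.e. 'no data extents at all'. Hypothetical
    helper standing in for the sparse-copy routine's probe."""
    try:
        return os.lseek(fd, 0, os.SEEK_DATA)
    except OSError as e:
        if e.errno == errno.ENXIO:
            return None  # caller would clone the file as all zeroes
        raise

# On a correctly behaving filesystem a freshly written file reports
# data at offset 0. On the affected ZFS dataset this same probe
# returned ENXIO for a 394-byte file that plainly contained data.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 394)
    path = f.name
fd = os.open(path, os.O_RDONLY)
print(first_data_offset(fd))
os.close(fd)
os.remove(path)
```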

Describe how to reproduce the problem

I can't describe how to reproduce the problem, as this is a proprietary database under load. I can tell you the following, however:

  1. If you wait for a small period of time, SEEK_DATA returns the correct answer.
  2. fsyncing the mapped file after modification and before close does not fix the problem.
  3. If you precede the SEEK_DATA with a single-byte read() of that file, it suddenly works.

We've gone with workaround 3 for production, and it seems to work well.
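Workaround 3 (the single-byte read before SEEK_DATA) can be sketched as a small wrapper; the helper name is hypothetical:

```python
import os

def seek_data_after_read(fd, offset=0):
    """Hypothetical wrapper for workaround 3: issue a throwaway
    single-byte read() at the probe offset before SEEK_DATA. On the
    affected ZFS version this was enough to make SEEK_DATA report
    the file's extents correctly."""
    os.pread(fd, 1, offset)  # result discarded; only the side effect matters
    return os.lseek(fd, offset, os.SEEK_DATA)
```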

Include any warning/errors/backtraces from the system logs

N/A

@ned14 ned14 added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Mar 5, 2021

ahrens commented Mar 5, 2021

It seems like if we don't know if there is data present, we should err on the side of treating it as DATA, not HOLE.

If you precede the SEEK_DATA with a single byte read() of that file, suddenly it works

That's a bit surprising; I wonder how that fixes it.

FYI, you can set the module parameter zfs_dmu_offset_next_sync=1 to make SEEK_DATA/HOLE work properly (at some performance cost).
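The parameter can be set at runtime through sysfs, or persisted via modprobe configuration (these are the standard Linux module-parameter locations, not specific to this report):

```shell
# Runtime: make SEEK_DATA/SEEK_HOLE report accurately (costs performance).
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

# Persistent across reboots:
echo "options zfs zfs_dmu_offset_next_sync=1" | sudo tee -a /etc/modprobe.d/zfs.conf
```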


ned14 commented Mar 5, 2021

It seems like if we don't know if there is data present, we should err on the side of treating it as DATA, not HOLE.

That's a good start, but may I suggest something a bit more intelligent than the current behaviour?

If you think about it, nobody really cares about sparse copies of small files, so for files below some size threshold (e.g. 1 MB), if you don't know, return DATA.

If, however, the file is larger than that threshold, then forcing a txg sync to find holes (i.e. zfs_dmu_offset_next_sync=1) becomes worth it. BUT...

95% of the time, large files will contain no holes, and wasting time looking for holes in them makes no sense. If you had a spare bit somewhere in the metadata that could be set to 1 or 0 according to whether the file contains at least one hole, you could avoid forcing a txg sync to find holes except where you really have to.
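The suggested policy could look roughly like this. This is a sketch only: every name is hypothetical, and `known_hole_free` stands in for the proposed spare metadata bit; it is not how ZFS is actually structured.

```python
import errno
import os

SMALL_FILE_THRESHOLD = 1 << 20  # e.g. 1 MB, the threshold suggested above

def classify(fd, offset, file_size, known_hole_free):
    """Decide DATA vs HOLE for a SEEK_DATA-style query.

    - Small file: when in doubt, just report DATA (sparse copies of
      small files are not worth optimising).
    - Large file flagged hole-free: report DATA without a txg sync.
    - Otherwise: pay for the accurate, synced answer.
    """
    if file_size < SMALL_FILE_THRESHOLD or known_hole_free:
        return "DATA"
    return accurate_seek(fd, offset)

def accurate_seek(fd, offset):
    # Placeholder for the expensive, always-correct path
    # (equivalent to running with zfs_dmu_offset_next_sync=1).
    try:
        os.lseek(fd, offset, os.SEEK_DATA)
        return "DATA"
    except OSError as e:
        if e.errno == errno.ENXIO:
            return "HOLE"
        raise
```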

However, certainly as an immediate step, returning DATA instead of HOLE when we don't know is an urgent fix. The open source file-system abstraction layer we use has file and directory clone routines that always do optimised sparse copies, so right now it is completely broken on ZFS for file and directory clones wherever a directory tree is undergoing rapid mutation. I can think of a few other things that would also be broken, e.g. rsync, which we use to take backups of a live DB currently in use, and which is also sparse-file aware.


stale bot commented Mar 6, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Mar 6, 2022
@behlendorf
Contributor

This issue should be fixed by commit de198f2; the fix was included in the 2.0.7 and 2.1.2 releases.
