
SEEK_DATA fails upon a file just opened after being rewritten by mmap #11697

Closed
ned14 opened this issue Mar 5, 2021 · 4 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments


ned14 commented Mar 5, 2021

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04
Linux Kernel 5.8.0-44-generic
Architecture x64
ZFS Version 0.8.4-1ubuntu11.1
SPL Version 0.8.4-1ubuntu11.1

Describe the problem you're observing

We tested a custom database after moving from Ubuntu 18.04 to 20.04 and found a very odd regression in ZFS. For brevity, I have repeated only the pertinent parts of the strace here:

openat(21, "010100", O_RDWR|O_CREAT, 0660) = 22
fstat(22, {st_mode=S_IFREG|0660, st_size=136, ...}) = 0
fstat(22, {st_mode=S_IFREG|0660, st_size=136, ...}) = 0
mmap(NULL, 136, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NORESERVE, 22, 0) = 0x7f23ba468000
flock(22, LOCK_EX)                      = 0

Opens the metadata file for modification, maps it into memory, and takes an exclusive lock. Existing metadata is read from the map.

fstat(22, {st_mode=S_IFREG|0660, st_size=136, ...}) = 0
ftruncate(22, 394)                      = 0

Resizes the metadata file to the new metadata size.

fstat(22, {st_mode=S_IFREG|0660, st_size=394, ...}) = 0
munmap(0x7f23ba468000, 4096)            = 0
fstat(22, {st_mode=S_IFREG|0660, st_size=394, ...}) = 0
mmap(NULL, 394, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NORESERVE, 22, 0) = 0x7f23ba468000

Remaps the file to accommodate the larger size, and writes the updated metadata into the map.

flock(22, LOCK_UN)                      = 0
munmap(0x7f23ba468000, 4096)            = 0
close(22)                               = 0

Releases the exclusive lock, unmaps the file, and closes it.
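The write-side sequence above can be sketched as follows (a minimal illustration of the access pattern, not the actual database code; the filename matches the trace, but the sizes and payload bytes are illustrative):

```python
import fcntl
import mmap
import os
import tempfile

# Rewrite a small file through a shared mapping, growing it with
# ftruncate first, then close it -- mirroring the strace above.
path = os.path.join(tempfile.mkdtemp(), "010100")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o660)
os.ftruncate(fd, 136)                      # file starts at 136 bytes
m = mmap.mmap(fd, 136, mmap.MAP_SHARED)    # map it into memory
fcntl.flock(fd, fcntl.LOCK_EX)             # take the exclusive lock
m[:8] = b"metadata"                        # illustrative payload via the map
m.close()                                  # munmap the old mapping
os.ftruncate(fd, 394)                      # grow to the new size
m = mmap.mmap(fd, 394, mmap.MAP_SHARED)    # remap at the larger size
m[136:144] = b"newdata!"                   # write updated metadata
fcntl.flock(fd, fcntl.LOCK_UN)             # release the lock
m.close()                                  # unmap
os.close(fd)                               # close the file
```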

openat(6, "b64e11/0001/a66d73/0000000000/010100", O_RDONLY) = 22
fstat(22, {st_mode=S_IFREG|0660, st_size=394, ...}) = 0
lseek(22, 0, SEEK_DATA)                 = -1 ENXIO (No such device or address)

Reopens the metadata file that was just updated; however, SEEK_DATA says it contains no valid extents (ENXIO). This causes a sparse-file content-clone routine to replicate the file as all zero bits, because that would be correct if the source file genuinely had no valid extents. That, in turn, causes a chain of failures further down the line. It took some hours to discover the originating cause.
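The failure mode can be illustrated with a minimal probe (the helper name is hypothetical; it mirrors what a SEEK_DATA-based sparse-copy routine does): an ENXIO from lseek is taken to mean "no data extents at all", which is exactly what makes the clone come out as all zeroes.

```python
import errno
import os
import tempfile

def first_data_offset(fd):
    """Offset of the first data extent, or None when the kernel
    reports ENXIO, i.e. 'no data extents at all'. Hypothetical
    helper standing in for the sparse-copy routine's probe."""
    try:
        return os.lseek(fd, 0, os.SEEK_DATA)
    except OSError as e:
        if e.errno == errno.ENXIO:
            return None  # caller would clone the file as all zeroes
        raise

# On a correctly behaving filesystem a freshly written file reports
# data at offset 0. On the affected ZFS dataset this same probe
# returned ENXIO for a 394-byte file that plainly contained data.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 394)
    path = f.name
fd = os.open(path, os.O_RDONLY)
print(first_data_offset(fd))
os.close(fd)
os.remove(path)
```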

Describe how to reproduce the problem

I can't describe how to reproduce the problem, as this is a proprietary database under load. I can tell you the following, however:

  1. If you wait for a small period of time, SEEK_DATA returns the correct answer.
  2. fsyncing the mapped file after modification and before close does not fix the problem.
  3. If you precede the SEEK_DATA with a single-byte read() of that file, it suddenly works.

We've gone with workaround 3 for production, and it seems to work well.
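Workaround 3 (the single-byte read before SEEK_DATA) can be sketched as a small wrapper; the helper name is hypothetical:

```python
import os

def seek_data_after_read(fd, offset=0):
    """Hypothetical wrapper for workaround 3: issue a throwaway
    single-byte read() at the probe offset before SEEK_DATA. On the
    affected ZFS version this was enough to make SEEK_DATA report
    the file's extents correctly."""
    os.pread(fd, 1, offset)  # result discarded; only the side effect matters
    return os.lseek(fd, offset, os.SEEK_DATA)
```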

Include any warning/errors/backtraces from the system logs

N/A

@ned14 ned14 added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Mar 5, 2021

ahrens commented Mar 5, 2021

It seems like if we don't know if there is data present, we should err on the side of treating it as DATA, not HOLE.

If you precede the SEEK_DATA with a single byte read() of that file, suddenly it works

That's a bit surprising; I wonder how that fixes it.

FYI, you can set the module parameter zfs_dmu_offset_next_sync=1 to make SEEK_DATA/HOLE work properly (at some performance cost).
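The parameter can be set at runtime through sysfs, or persisted via modprobe configuration (these are the standard Linux module-parameter locations, not specific to this report):

```shell
# Runtime: make SEEK_DATA/SEEK_HOLE report accurately (costs performance).
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

# Persistent across reboots:
echo "options zfs zfs_dmu_offset_next_sync=1" | sudo tee -a /etc/modprobe.d/zfs.conf
```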


ned14 commented Mar 5, 2021

It seems like if we don't know if there is data present, we should err on the side of treating it as DATA, not HOLE.

That's a good start, but may I suggest something a bit more intelligent than the current behaviour?

If you think about it, nobody really cares about sparse copies of small files, so for files below some size threshold (e.g. 1 MB), if you don't know, return DATA.

If, however, the file is larger than that threshold, then forcing a txg sync to find holes (i.e. zfs_dmu_offset_next_sync=1) becomes worth it. BUT...

95% of the time, large files will contain no holes, and wasting time looking for holes in them makes no sense. If you had a spare bit somewhere in the metadata that could be set to 1 or 0 according to whether the file contains at least one hole, you could avoid forcing a txg sync to find holes except where you really have to.
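The suggested policy could look roughly like this. This is a sketch only: every name is hypothetical, and `known_hole_free` stands in for the proposed spare metadata bit; it is not how ZFS is actually structured.

```python
import errno
import os

SMALL_FILE_THRESHOLD = 1 << 20  # e.g. 1 MB, the threshold suggested above

def classify(fd, offset, file_size, known_hole_free):
    """Decide DATA vs HOLE for a SEEK_DATA-style query.

    - Small file: when in doubt, just report DATA (sparse copies of
      small files are not worth optimising).
    - Large file flagged hole-free: report DATA without a txg sync.
    - Otherwise: pay for the accurate, synced answer.
    """
    if file_size < SMALL_FILE_THRESHOLD or known_hole_free:
        return "DATA"
    return accurate_seek(fd, offset)

def accurate_seek(fd, offset):
    # Placeholder for the expensive, always-correct path
    # (equivalent to running with zfs_dmu_offset_next_sync=1).
    try:
        os.lseek(fd, offset, os.SEEK_DATA)
        return "DATA"
    except OSError as e:
        if e.errno == errno.ENXIO:
            return "HOLE"
        raise
```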

However, certainly as an immediate step, returning DATA instead of HOLE when we don't know is an urgent fix. The open source file-system abstraction layer we use has file and directory clone routines that always do optimised sparse copies, so right now it is completely broken on ZFS for file and directory clones wherever a directory tree is undergoing rapid mutation. I can think of a few other things that would also be broken, e.g. rsync, which we use to take backups of a live DB currently in use, and which is also sparse-file aware.


stale bot commented Mar 6, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Mar 6, 2022
@behlendorf
Contributor

This issue should be fixed by commit de198f2; the fix was included in the 2.0.7 and 2.1.2 releases.
