-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fiemap[Still needs to address comments from original PR7545] #9554
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9554 +/- ##
===========================================
- Coverage 79.02% 48.85% -30.18%
===========================================
Files 418 276 -142
Lines 123686 95488 -28198
===========================================
- Hits 97748 46646 -51102
- Misses 25938 48842 +22904
Continue to review full report at Codecov.
|
The FIEMAP ioctl is the standard Linux user space mechanism for inspecting physical layout of a file on disk. The ioctl returns a list of extents each of which describes a region of the file. Compatible blocks are merged in to larger extents when they are physically contiguous and have identical flags. The following per-extent flags are supported by ZFS. FIEMAP_EXTENT_LAST - The last extent in the mapping FIEMAP_EXTENT_UNKNOWN - Set on gang blocks FIEMAP_EXTENT_DELALLOC - Dirty extents not yet written FIEMAP_EXTENT_ENCODED - Set on compressed extents FIEMAP_EXTENT_DATA_ENCRYPTED - Set on encrypted extents FIEMAP_EXTENT_NOT_ALIGNED - Extent is not block aligned FIEMAP_EXTENT_DATA_INLINE - Set on embedded block pointers FIEMAP_EXTENT_UNWRITTEN - Set on holes (normally not reported) FIEMAP_EXTENT_MERGED - Multiple block pointers were merged FIEMAP_EXTENT_SHARED - Set on deduplicated extents The following flags are supported when requesting extents. Note that *_COPIES, *_NOMERGE, and *_HOLE are currently specific to ZFS but are applicable to other Linux file systems. FIEMAP_FLAG_SYNC - Sync the file first FIEMAP_FLAG_COPIES - Include all copies of an extent FIEMAP_FLAG_NOMERGE - Don't merge blocks in to extents FIEMAP_FLAG_HOLES - Include unwritten holes as an extent Finally, the following reserved fields in the fiemap_extent structure have been utilized and should be reserved to prevent future conflicts. .fe_reserved[0] - Unique device ID; top-level VDEV ID for ZFS .fe_reserved64[0] - Physical length of an extent Future work: * The FIEMAP_FLAG_XATTR flag is not supported. Implementing this should be relatively straight forward if it is needed. This would be a handy way to determine if the xattrs have been stored in the dnode, spill block, or an external object. * The FIEMAP_FLAG_CACHE flag is not supported. We rely on the ARC to keep the right blocks cached. * The lseek(2) SEEK_HOLE and SEEK_DATA flags still rely on the dmu_offset_next() function. The zpl_llseek() function can be updated to use FIEMAP to quickly determine the next extent. This has the advantage of correctly accounting for newly dirtied or freed blocks without forcing a txg sync. The FIEMAP interface is fully described at: * https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#264
Now all fiemap zfs-tests pass except 1, fiemap_free, which is a regression in master. ... home/rohan/zol-ws/zfs/tests/test-runner/bin/test-runner.py -c "tests/runfiles/fiemap.run" -T "functional" -i "/home/rohan/zol-ws/zfs/tests/zfs-tests" -I "1" Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/setup (run as root) [00:01] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_sync (run as root) [00:16] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_flags (run as root) [00:10] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_delalloc (run as root) [00:31] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_free (run as root) [00:01] [FAIL] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_copies (run as root) [00:02] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_nomerge (run as root) [00:03] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_holes_sanity (run as root) [01:08] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/cleanup (run as root) [00:00] [PASS] Results Summary PASS 8 FAIL 1 Running Time: 00:02:15 Percent passed: 88.9% ... Signed-off-by: Rohan Puri <[email protected]>
Signed-off-by: Rohan Puri <[email protected]>
The test was failing since fiemap range trees, work on {offset, len}, while dn_free_ranges range trees work on {blkid, nblks}, do the necessary conversion. Signed-off-by: Rohan Puri <[email protected]>
Hi Pavel,
Yes sure, sounds good.
Thanks,
Rohan
…On Sat, Nov 9, 2019 at 5:51 AM Pavel Snajdr ***@***.***> wrote:
RHEL7 .fiemap dir inode operation is in struct inode_operations_wrapper,
just as .renameat2 - actually these two are the only ones; I have a
commit testing for renameat2 being in the wrapper struct, you can see what
I'm talking about here: 164ec05#diff-8171081b574f4a45da3be9970a68ce68R684
<164ec05#diff-8171081b574f4a45da3be9970a68ce68R684>
I could make the test more generic and @rohan-puri
<https://github.com/rohan-puri> you could use it in your patch then too.
@behlendorf <https://github.com/behlendorf> what do you think?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9554>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADHJKXI4X6PNHJVDXTWA3LQSX7APANCNFSM4JJMERCQ>
.
|
@rohan-puri never mind, I should have verified things before writing anything, it's |
@snajpa : ok :) |
@rohan-puri Good work taking this on! Could you also fillin the details of your PR, instead of relying on people looking up the old one? |
Any update on this? This feature would be incredibly useful. |
@rohan-puri @behlendorf @mmaybee Were you using Linux kernel's "generic_block_fiemap" anywhere? Just wondering if you saw no need for it since you're doing something different than I thought. Though I did just notice @adilger's post at #264 (comment) And yes I know our friends removed the export free-for all generic_block_fiemap and replaced it with the GPL-only iomap_fiemap On July 27th 2021. (And they're just getting warmed up, hence some of my other recent issues I posted...) fs: REMOVE generic_block_fiemap torvalds/linux@9acb9c4 |
@jittygitty thanks for taking an interest in this PR, but to be clear the kernel changes you referenced don't have any impact on the ability to implement this feature. What's needed is for someone to address the unresolved feedback provided in the original PR review. This still involves a significant amount of development work which I haven't had the time to pursue. If someone would like to pick up this work, I'm happy to help point them at what still needs to be done. |
@behlendorf Yes it would be beneficial to explain the path of the implementation, I did not look at code much since I have only very basic C knowledge so I did not have full grasp if you avoided iomap usage due to exports or if there's no help in using. But either way yes if you could explain what needs to be done, I can try to hire someone to give a crack at it and would be good to point them here and read yours or other knowledgeable contributors explanation, instead as I've done before and try to bang my head and understand some basics to try and gain myself enough info on for example fiemap path for reflink implementing. I'd rather not torture my head and potentially get things wrong. I had also asked for @pjd to provide what he needs for someone to finish implementing reflinks using "his BRT" method for Linux, I'm not against having two methods compete if one method/BRT is not going to waste anyway since will stay with BSD. Plus I'm not even "sure" if via fiemap I can bypass using BRT for reflinks, and use what, some kind of clone of xfs b+ refcounttree etc, not sure how BRT differs or if its actually same thing since I didn't "study the code of either" and prefer someone just spell it out. I want the feature for linux for myself, and simply wanted to gauge interest to help "crowd-fund" issues such as reflinks and offline-deduplication to get it done faster on linux. @Ornias1993 I agree with you that there shouldn't be so much fear and avoidance of export_symbol_gpl, I would simply use anything we need. I've given some preliminary reasons why in #13415 But I should add that I still prefer being reserved for politeness sake, there's no reason openzfs can't be good friends with Linux Kernel guys, their real hatred is for CLOSED SOURCE binaries, OpenZFS complies with basically "all" of the "REASONS" why they went with GPL when they did, ie their intent. OpenZFS is truly open source. (Though it is unfortunate with DDN and purchase of Nexenta..., had some questions and complaints to let out, but I'll refrain in case there's openzfs contributors on here from Nexenta that might get upset, plus this is Fiemap issue tracker after all.) |
The next steps for FIEMAP I believe are pretty straight forward. They code needs to be rebased against the current master branch and the outstanding feedback from comment in the PR needs to be resolved. In particular, there was one throrny design issue remaining regarding efficiently handling in-flight I/O which still needs to be resolve and that will take some careful work. Beyond that it of course needs to be tested and verified to work correctly. As always, if someone if willing to tackle this remaining work that would be great.
These are complementary but different features, so developer time/interest permitting, I'd like to see both implemented. The vast majority of the BRT work is platform independent, this is the hard part of implementing the functionality for ZFS. Wiring it up for reflink on Linux should be relatively straight forward and is something we'll want to sort out in that PR. |
@behlendorf Ok thanks, I think you missed one of my questions though, and I agree with you in general that cross-platform is better all else being equal. When I started looking into reflinks + offline-dedupe, I came to conclusion to do it via "FIEMAP". But I thought implementing reflink via FIEMAP was "different" than how @pjd was doing it with his implementation via "BRT". Sorry, I know its kind of a messy brainstorm but if you read my comment in #13349 especially the clonefiles.pdf link (see #13349 (comment) ), I think you'll see my thoughts and can correct me. Anyway I will contact someone I had in mind to see about taking a shot at finishing this 9554 FIEMAP, but you are saying that the original issue #7545 needs to be looked at, yet is all the code from that one already in this one, ie work in which one now? What I was looking for though was a short "onboarding" for would be coder on this who does not have prior ZFS knowledge but is simply a good C/C++ coder perhaps with some kernel experience. I wanted to know if there's any "primer" that would be very helpful for them to look at before jumping in on this FIEMAP, since again, imagine its someone who never saw ZFS code. |
I see what you're getting at. I think the confusion here is that FIEMAP is solely about reporting logical/physical file extents. Now user space applications like
The original #7575 has all the relevant review comments and latest fully working version of the code. I'd suggest starting there. Unfortunately, one of the reasons this stalled out is there are some challenging design and implementation details to work out. In particular around reconciling file extents for uncommitted dirty data blocks and pending frees which have been modified in the not-yet-synced-to-disk transaction groups. While I wouldn't want to discourage any one from taking on this work, getting it right is going to be a little bit tricky and require some solid knowledge of the internals. I wish I could point you at a primer, but I'm not aware of anything along those lines. There is a fair amount of high level design documentation available, but only the code is really going to have the level of detail needed. |
@behlendorf Thanks ok I understand someone new to zfs will need read through the code carefully and learn it that way. I will try get someone to take a look, wish I hadn't ditched class to hack gopher systems in library...so I wouldn't have to 'hire' coders. As far as FIEMAP I knew it provided extents interface to underlying blocks but I thought if we're going to use the Linux VFS ioctl FICLONE, FICLONERANGE and FIDEDUPERANGE, then I thought we had to speak "extents" through those ioctls since I thought they didn't talk blocks and we/zfs weren't extent based already since I saw this FIEMAP issue outstanding. I thought we could leverage existing reflink, dedupe, and copy_file_range code in the Linux kernel via those ficlone/ficlonerange/fideduperange and lessen the work on zfs side. I thought that @pjd had created some "custom ioctl" for zfs maybe and wasn't using ficlone. BRT I had thought was like the XFS reference count B+tree for reflinks but maybe I'm wrong? At least for zfs I hear it can't be as straight-forward as it seems on XFS? But I can't figure out exactly why, as to what complicates things for us compared to xfs etc. What was really depressing me a bit much was hearing it would basically be impossible for us to preserve reflinks over zfs send/receive and when I read that supposedly BTRFS "can" preserve their reflinks over btrfs send/receive, I was like "how"? On a related note, will FIEMAP (and filefrag support) help us get better tools to "defrag" files via maybe background low 'scrub' priority rewriting? (I guess this could be done now with zdb zfs frag check utils?) I'm also looking for a good write-up as to why Block Pointer Rewrite is so difficult and why its really the "only way" some have said, to deal with long-term fragmentation. Also know of anything about the Panzura ZFS dedupe tech that was supposed to be coming to ZFS? |
@behlendorf, if one of the main obstacles for landing FIEMAP support is the "inflight writes" handling ( FIEMAP isn't used so often on still-dirty files, and in the cases that it is most users probably care about the on-disk allocations and not that X MB of unwritten data is still pending in memory. For those few that care about unwritten data, the complete lack of FIEMAP is equally non-functional, and that could be added incrementally as needed after the core on-disk allocation reporting is available. |
is there any update on this? |
Indeed, any update on this? |
rebased (badly, but compiles and works) on commit d98973d if anyone wants to clean it up, add back the tests and make a proper PR: fiemap-diff.txt |
Note that there is development again on the upstream kernel to add compressed extent support for FIEMAP: https://lore.kernel.org/linux-fsdevel/[email protected]/ The patches so far are using the
Note that this is all still under discussion and subject to change. |
regular selfish bump |
Motivation and Context
Description
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by
.