summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/extent_io.c
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: fix corruption reading shared and compressed extents after hole punchingFilipe Manana2019-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream. In the past we had data corruption when reading compressed extents that are shared within the same file and they are consecutive, this got fixed by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents"). However there was a case that was missing in those fixes, which is when the shared and compressed extents are referenced with a non-zero offset. The following shell script creates a reproducer for this issue: #!/bin/bash mkfs.btrfs -f /dev/sdc &> /dev/null mount -o compress /dev/sdc /mnt/sdc # Create a file with 3 consecutive compressed extents, each has an # uncompressed size of 128Kb and a compressed size of 4Kb. for ((i = 1; i <= 3; i++)); do head -c 4096 /dev/zero for ((j = 1; j <= 31; j++)); do head -c 4096 /dev/zero | tr '\0' "\377" done done > /mnt/sdc/foobar sync echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)" # Clone the first extent into offsets 128K and 256K. xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar sync echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)" # Punch holes into the regions that are already full of zeroes. xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar sync echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" echo "Dropping page cache..." sysctl -q vm.drop_caches=1 echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" umount /dev/sdc When running the script we get the following output: Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar linked 131072/131072 bytes at offset 131072 128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec) linked 131072/131072 bytes at offset 262144 128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec) Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Dropping page cache... Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar This happens because after reading all the pages of the extent in the range from 128K to 256K for example, we read the hole at offset 256K and then when reading the page at offset 260K we don't submit the existing bio, which is responsible for filling all the page in the range 128K to 256K only, therefore adding the pages from range 260K to 384K to the existing bio and submitting it after iterating over the entire range. Once the bio completes, the uncompressed data fills only the pages in the range 128K to 256K because there's no more data read from disk, leaving the pages in the range 260K to 384K unfilled. It is just a slightly different variant of what was solved by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents"). Fix this by forcing a bio submit, during readpages(), whenever we find a compressed extent map for a page that is different from the extent map for the previous page or has a different starting offset (in case it's the same compressed extent), instead of the extent map's original start offset. A test case for fstests follows soon. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents") Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") Cc: stable@vger.kernel.org # 4.3+ Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs: don't open code lru_to_page()Nikolay Borisov2019-01-041-2/+1
| | | | | | | | | | | | | | | | | | | Multiple filesystems open code lru_to_page(). Rectify this by moving the macro from mm_inline (which is specific to lru stuff) to the more generic mm.h header and start using the macro where appropriate. No functional changes. Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Acked-by: "Yan, Zheng" <zyan@redhat.com> [ceph] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* btrfs: Fix typos in comments and stringsAndrea Gelmini2018-12-171-2/+2
| | | | | | | | | The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Refactor main loop in extent_readpagesNikolay Borisov2018-12-171-20/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | extent_readpages processes all pages in the readlist in batches of 16, this is implemented by a single for loop but thanks to an if condition the loop does 2 things based on whether we've filled the batch or not. Additionally due to the structure of the code there is an additional check which deals with partial batches. Streamline all of this by explicitly using two loops. The outter one is used to process all pages while the inner one just fills in the batch of 16 (currently). Due to this new structure the code guarantees that all pages are processed in the loop hence the code to deal with any leftovers is eliminated. This also enable the compiler to inline __extent_readpages: ./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160) Function old new delta extent_readpages 476 1136 +660 __extent_readpages 820 - -820 Total: Before=44315, After=44155, chg -0.36% Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use offset_in_page instead of open-coding itJohannes Thumshirn2018-12-171-29/+24
| | | | | | | | | | | | | Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an offset into a page. So replace them by the offset_in_page() macro instead of open-coding it if they're not used as an alignment check. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove always true if branch in find_delalloc_rangeLu Fengqi2018-12-171-16/+16
| | | | | | | | | | | The @found is always false when it comes to the if branch. Besides, the bool type is more suitable for @found. Change the return value of the function and its caller to bool as well. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Refactor btrfs_merge_bio_hookNikolay Borisov2018-12-171-2/+2
| | | | | | | | | | | | | | This function really checks whether adding more data to the bio will straddle a stripe/chunk. So first let's give it a more appropraite name - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never used to just remove it. Thirdly, pages are submitted to either btree or data inodes so it's guaranteed that tree->ops is set so replace the check with an ASSERT. Finally, document the parameters of the function. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't initialize 'offset' in map_private_extent_buffer()Johannes Thumshirn2018-12-171-1/+1
| | | | | | | | | | | | | | In map_private_extent_buffer() the 'offset' variable is initialized to a page aligned version of the 'start' parameter. But later on it is overwritten with either the offset from the extent buffer's start or 0. So get rid of the initial initialization. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::readpage_io_failed_hookNikolay Borisov2018-12-171-36/+35
| | | | | | | | | | | | | | | | | | | For data inodes this hook does nothing but to return -EAGAIN which is used to signal to the endio routines that this bio belongs to a data inode. If this is the case the actual retrying is handled by bio_readpage_error. Alternatively, if this bio belongs to the btree inode then btree_io_failed_hook just does some cleanup and doesn't retry anything. This patch simplifies the code flow by eliminating readpage_io_failed_hook and instead open-coding btree_io_failed_hook in end_bio_extent_readpage. Also eliminate some needless checks since IO is always performed on either data inode or btree inode, both of which are guaranteed to have their extent_io_tree::ops set. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: replace btrfs_io_bio::end_io with a simple helperDavid Sterba2018-12-171-2/+1
| | | | | | | | | | | | | | The end_io callback implemented as btrfs_io_bio_endio_readpage only calls kfree. Also the callback is set only in case the csum buffer is allocated and not pointing to the inline buffer. We can use that information to drop the indirection and call a helper that will free the csums only in the right case. This shrinks struct btrfs_io_bio by 8 bytes. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use EXPORT_FOR_TESTS for conditionally exported functionsJohannes Thumshirn2018-12-171-11/+2
| | | | | | | | | | | | | | | | | | | Several functions in BTRFS are only used inside the source file they are declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared with the unit tests code. Before the introduction of the EXPORT_FOR_TESTS macro, these functions could not be declared as static and the compiler had a harder task when optimizing and inlining them. As we have EXPORT_FOR_TESTS now, use it where appropriate to support the compiler. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Replace BUG_ON with ASSERT in find_lock_delalloc_rangeNikolay Borisov2018-12-171-1/+1
| | | | | | | | | | | | | | | lock_delalloc_pages should only return 2 values - 0 in case of success and -EAGAIN if the range of pages to be locked should be shrunk due to some of gone. Manual inspections confirms that this is indeed the case since __process_pages_contig is where lock_delalloc_pages gets its return value. The latter always returns 0 or -EAGAIN so the invariant holds. No functional changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Sink find_lock_delalloc_range's 'max_bytes' argumentNikolay Borisov2018-12-171-6/+5
| | | | | | | | | | | | All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's reduce the argument count and make that a local variable. No functional changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use tagged writepage to mitigate livelock of snapshotEthan Lien2018-12-171-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Snapshot is expected to be fast. But if there are writers steadily creating dirty pages in our subvolume, the snapshot may take a very long time to complete. To fix the problem, we use tagged writepage for snapshot flusher as we do in the generic write_cache_pages(), so we can omit pages dirtied after the snapshot command. This does not change the semantics regarding which data get to the snapshot, if there are pages being dirtied during the snapshotting operation. There's a sync called before snapshot is taken in old/new case, any IO in flight just after that may be in the snapshot but this depends on other system effects that might still sync the IO. We do a simple snapshot speed test on a Intel D-1531 box: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 1m58sec patched: 6.54sec This is the best case for this patch since for a sequential write case, we omit nearly all pages dirtied after the snapshot command. For a multi writers, random write test: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 15.83sec patched: 10.35sec The improvement is smaller compared to the sequential write case, since we omit only half of the pages dirtied after snapshot command. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove unused extent_state argument from ↵Nikolay Borisov2018-12-171-7/+5
| | | | | | | | | | | | | | btrfs_writepage_endio_finish_ordered This parameter was never used, yet was part of the interface of the function ever since its introduction as extent_io_ops::writepage_end_io_hook in e6dcd2dc9c48 ("Btrfs: New data=ordered implementation"). Now that NULL is passed everywhere as a value for this parameter let's remove it for good. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_page_data argument from writepage_delallocNikolay Borisov2018-12-171-7/+4
| | | | | | | | | | | | | | The only remaining use of the 'epd' argument in writepage_delalloc is to reference the extent_io_tree which was set in extent_writepages. Since it is guaranteed that page->mapping of any page passed to writepage_delalloc (and __extent_writepage as the sole caller) to be equal to that passed in extent_writepages we can directly get the io_tree via the already passed inode (which is also taken from page->mapping->host). No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Move epd::extent_locked check to writepage_delalloc's callerNikolay Borisov2018-12-171-7/+8
| | | | | | | | | | | If epd::extent_locked is set then writepage_delalloc terminates. Make this a bit more apparent in the caller by simply bubbling the check up. This enables to remove epd as an argument to writepage_delalloc in a future patch. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Adjust loop in free_extent_bufferNikolay Borisov2018-12-171-1/+3
| | | | | | | | | | | | | | | | | | The loop construct in free_extent_buffer was added in 242e18c7c1a8 ("Btrfs: reduce lock contention on extent buffer locks") as means of reducing the times the eb lock is taken, the non-last ref count is decremented and lock is released. As the special handling of UNMAPPED extent buffers was removed now there is only one decrement op which is happening for EXTENT_BUFFER_UNMAPPED case. This commit modifies the loop condition so that in case of UNMAPPED buffers the eb's lock is taken only if we are 100% sure the eb is going to be freed by the current executor of the code. Additionally, remove superfluous ref count ops in btrfs test. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove special handling of EXTENT_BUFFER_UNMAPPED while freeingNikolay Borisov2018-12-171-11/+0
| | | | | | | | | | | | | | | Now that the whole of btrfs code has been audited for eb reference count management it's time to remove the hunk in free_extent_buffer that essentially considered the condition "eb->ref == 2 && EXTENT_BUFFER_DUMMY" to equal "eb->ref = 1". Also remove the last location which takes an extra reference count in alloc_test_extent_buffer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::split_extent_hook callbackNikolay Borisov2018-12-171-8/+2
| | | | | | | | | | | This is the counterpart to merge_extent_hook, similarly, it's used only for data/freespace inodes so let's remove it, rename it and call it directly where necessary. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::merge_extent_hook callbackNikolay Borisov2018-12-171-9/+8
| | | | | | | | | | | | This callback is used only for data and free space inodes. Such inodes are guaranteed to have their extent_io_tree::private_data set to the inode struct. Exploit this fact to directly call the function. Also give it a more descriptive name. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::clear_bit_hook callbackNikolay Borisov2018-12-171-8/+4
| | | | | | | | | | | This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent), similar to what was done before remove clear_bit_hook and rename the function. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::set_bit_hook extent_io callbackNikolay Borisov2018-12-171-8/+3
| | | | | | | | | | | | | | | | | | | | | | This callback is used to properly account delalloc extents for data inodes (ordinary file inodes and freespace v1 inodes). Those can be easily identified since they have their extent_io trees ->private_data member point to the inode. Let's exploit this fact to remove the needless indirection through extent_io_hooks and directly call the function. Also give the function a name which reflects its purpose - btrfs_set_delalloc_extent. This patch also modified test_find_delalloc so that the extent_io_tree used for testing doesn't have its ->private_data set which would have caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root member not being initialised. The old version of the code also didn't call set_bit_hook since the extent_io ops weren't set for the inode. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::check_extent_io_range callbackNikolay Borisov2018-12-171-3/+12
| | | | | | | | | | | | | This callback was only used in debug builds by btrfs_leak_debug_check. A better approach is to move its implementation in btrfs_leak_debug_check and ensure the latter is only executed for extent tree which have ->private_data set i.e. relate to a data node and not the btree one. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::writepage_end_io_hookNikolay Borisov2018-12-171-21/+12
| | | | | | | | | | | | | | | This callback is ony ever called for data page writeout so there is no need to actually abstract it via extent_io_ops. Lets just export it, remove the definition of the callback and call it directly in the functions that invoke the callback. Also rename the function to btrfs_writepage_endio_finish_ordered since what it really does is account finished io in the ordered extent data structures. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::writepage_start_hookNikolay Borisov2018-12-171-13/+10
| | | | | | | | | | | | | | This hook is called only from __extent_writepage_io which is already called only from the data page writeout path. So there is no need to make an indirect call via extent_io_ops. This patch just removes the callback definition, exports the callback function and calls it directly at the only call site. Also give the function a more descriptive name. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove extent_io_ops::fill_delallocNikolay Borisov2018-12-171-11/+9
| | | | | | | | | | | | | | This callback is called only from writepage_delalloc which in turn is guaranteed to be called from the data page writeout path. In the end there is no reason to have the call to this function to be indrected via the extent_io_ops structure. This patch removes the callback definition, exports the function and calls it directly. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to btrfs_run_delalloc_range ] Signed-off-by: David Sterba <dsterba@suse.com>
* Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds2018-10-281-7/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull XArray conversion from Matthew Wilcox: "The XArray provides an improved interface to the radix tree data structure, providing locking as part of the API, specifying GFP flags at allocation time, eliminating preloading, less re-walking the tree, more efficient iterations and not exposing RCU-protected pointers to its users. This patch set 1. Introduces the XArray implementation 2. Converts the pagecache to use it 3. Converts memremap to use it The page cache is the most complex and important user of the radix tree, so converting it was most important. Converting the memremap code removes the only other user of the multiorder code, which allows us to remove the radix tree code that supported it. I have 40+ followup patches to convert many other users of the radix tree over to the XArray, but I'd like to get this part in first. The other conversions haven't been in linux-next and aren't suitable for applying yet, but you can see them in the xarray-conv branch if you're interested" * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits) radix tree: Remove multiorder support radix tree test: Convert multiorder tests to XArray radix tree tests: Convert item_delete_rcu to XArray radix tree tests: Convert item_kill_tree to XArray radix tree tests: Move item_insert_order radix tree test suite: Remove multiorder benchmarking radix tree test suite: Remove __item_insert memremap: Convert to XArray xarray: Add range store functionality xarray: Move multiorder_check to in-kernel tests xarray: Move multiorder_shrink to kernel tests xarray: Move multiorder account test in-kernel radix tree test suite: Convert iteration test to XArray radix tree test suite: Convert tag_tagged_items to XArray radix tree: Remove radix_tree_clear_tags radix tree: Remove radix_tree_maybe_preload_order radix tree: Remove split/join code radix tree: Remove radix_tree_update_node_t page cache: Finish XArray conversion dax: Convert page fault handlers to XArray ...
| * btrfs: Convert page cache to XArrayMatthew Wilcox2018-10-211-5/+3
| | | | | | | | | | Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: David Sterba <dsterba@suse.com>
| * pagevec: Use xa_mark_tMatthew Wilcox2018-10-211-2/+2
| | | | | | | | | | | | Removes sparse warnings. Signed-off-by: Matthew Wilcox <willy@infradead.org>
* | btrfs: tests: add separate stub for find_lock_delalloc_rangeDavid Sterba2018-10-151-1/+12
| | | | | | | | | | | | | | | | | | | | | | The helper find_lock_delalloc_range is now conditionally built static, dpending on whether the self-tests are enabled or not. There's a macro that is supposed to hide the export, used only once. To discourage further use, drop it an add a public wrapper for the helper needed by tests. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: skip set_page_dirty if eb pages are already dirtyLiu Bo2018-10-151-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages are dirty, so no need to set pages dirty again. Ftrace showed that the loop took 10us on my dev box, so removing this can save us at least 10us if eb is already dirty and otherwise avoid a potentially expensive calls to set_page_dirty. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: assert page dirty bit on extent buffer pagesLiu Bo2018-10-151-0/+6
| | | | | | | | | | | | | | | | | | | | Just in case that someone breaks the rule that pages are dirty as long as eb is dirty. The next patch will dirty the pages conditionally. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: use next_state in find_first_extent_bitLiu Bo2018-10-151-6/+1
|/ | | | | | | | | | As next_state() is already defined to get the next state, use it in find_first_extent_bit. No functional changes. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: readpages() should submit IO as read-aheadJens Axboe2018-08-171-1/+1
| | | | | | | | | | | | | | | a_ops->readpages() is only ever used for read-ahead. Ensure that we pass this information down to the block layer. Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* btrfs: drop extent_io_ops::set_range_writeback callbackDavid Sterba2018-08-061-9/+1
| | | | | | | | | The data and metadata callback implementation both use the same function. We can remove the call indirection and intermediate helper completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: drop extent_io_ops::merge_bio_hook callbackDavid Sterba2018-08-061-2/+2
| | | | | | | | The data and metadata callback implementation both use the same function. We can remove the call indirection completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: drop extent_io_ops::tree_fs_info callbackDavid Sterba2018-08-061-10/+4
| | | | | | | | All implementations of the callback are trivial and do the same and there's only one user. Merge everything together. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_UNMAPPEDNikolay Borisov2018-08-061-5/+5
| | | | | | | | | | | | | | | | EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have this flag set are not in any way dummy. Rather, they are private in the sense that are not mapped and linked to the global buffer tree. This flag has subtle implications to the way free_extent_buffer works for example, as well as controls whether page->mapping->private_lock is held during extent_buffer release. Pages for an unmapped buffer cannot be under io, nor can they be written by a 3rd party so taking the lock is unnecessary. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ EXTENT_BUFFER_UNMAPPED, update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Document locking requirement via lockdep_assert_heldNikolay Borisov2018-08-061-1/+2
| | | | | | | | | | Remove stale comment since there is no longer an eb->eb_lock and document the locking expectation with a lockdep_assert_held statement. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename btrfs_release_extent_buffer_pageDavid Sterba2018-08-061-4/+4
| | | | | | | | The function used to release one page (and always the first one), but not anymore since a50924e3a4d7fccb0ecfbd4 ("btrfs: drop constant param from btrfs_release_extent_buffer_page"). Update the name and comment. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Refactor loop in btrfs_release_extent_buffer_pageNikolay Borisov2018-08-061-9/+6
| | | | | | | | | | | | | | | | | The purpose of the function is to free all the pages comprising an extent buffer. This can be achieved with a simple for loop rather than the slightly more involved 'do {} while' construct. So rewrite the loop using a 'for' construct. Additionally we can never have an extent_buffer that has 0 pages so remove the check for index == 0. No functional changes. The reversed order used to have a meaning in the past where the first page served as a blocking point for several callers. See eg 4f2de97acee6532b36dd6e99 ("Btrfs: set page->private to the eb"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Reword dodgy comments in alloc_extent_bufferNikolay Borisov2018-08-061-9/+8
| | | | | | | | | | | | Commit eb14ab8ed24a ("Btrfs: fix page->private races") fixed a genuine race between extent buffer initialisation and btree_releasepage. Unfortunately as the code has evolved the comments weren't changed which made them slightly wrong and they weren't very clear in the fist place. Fix this by (hopefully) rewording them in a more approachable manner. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Simplify page unlocking in alloc_extent_bufferNikolay Borisov2018-08-061-7/+2
| | | | | | | | | | | | | | | | | | | | | Current version of the page unlocking code was added in 727011e07cbd ("Btrfs: allow metadata blocks larger than the page size") but even in this commit that particular flag was never used per-se. In fact, btrfs only uses PageChecked for data pages to identify pages which have been dirtied but don't have ORDERED bit set. For more information see 247e743cbe6e ("Btrfs: Use async helpers to deal with pages that have been improperly dirtied"). However, this doesn't apply to extent buffer pages. The important bit here is that the pages are unlocked AFTER the extent buffer has been properly recorded in the radix tree to avoid races with btree_releasepage. Let's exploit this fact and simplify the page unlocking sequence by unlocking the pages in-order and removing the redundant PageChecked flag setting/clearing. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: open-code bio_set_op_attrsDavid Sterba2018-08-061-1/+1
| | | | | | The helper is trivial and marked as deprecated. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: switch types to int when counting eb pagesDavid Sterba2018-08-061-22/+22
| | | | | | | | | The loops iterating eb pages use unsigned long, that's an overkill as we know that there are at most 16 pages (64k / 4k), and 4 by default (with nodesize 16k). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: pass only eb to num_extent_pagesDavid Sterba2018-08-061-15/+15
| | | | | | | | | | Almost all callers pass the start and len as 2 arguments but this is not necessary, all the information is provided by the eb. By reordering the calls to num_extent_pages, we don't need the local variables with start/len. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Merge tag 'for-4.18-rc5-tag' of ↵Linus Torvalds2018-07-211-2/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fix of a corruption regarding fsync and clone, under some very specific conditions explained in the patch. The fix is marked for stable 3.16+ so I'd like to get it merged now given the impact" * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix file data corruption after cloning a range and fsync
| * Btrfs: fix file data corruption after cloning a range and fsyncFilipe Manana2018-07-191-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we clone a range into a file we can end up dropping existing extent maps (or trimming them) and replacing them with new ones if the range to be cloned overlaps with a range in the destination inode. When that happens we add the new extent maps to the list of modified extents in the inode's extent map tree, so that a "fast" fsync (the flag BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps and log corresponding extent items. However, at the end of range cloning operation we do truncate all the pages in the affected range (in order to ensure future reads will not get stale data). Sometimes this truncation will release the corresponding extent maps besides the pages from the page cache. If this happens, then a "fast" fsync operation will miss logging some extent items, because it relies exclusively on the extent maps being present in the inode's extent tree, leading to data loss/corruption if the fsync ends up using the same transaction used by the clone operation (that transaction was not committed in the meanwhile). An extent map is released through the callback btrfs_invalidatepage(), which gets called by truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The later ends up calling try_release_extent_mapping() which will release the extent map if some conditions are met, like the file size being greater than 16Mb, gfp flags allow blocking and the range not being locked (which is the case during the clone operation) nor being the extent map flagged as pinned (also the case for cloning). The following example, turned into a test for fstests, reproduces the issue: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar # reflink destination offset corresponds to the size of file bar, # 2728Kb minus 4Kb. $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar $ md5sum /mnt/bar 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar <power fail> $ mount /dev/sdb /mnt $ md5sum /mnt/bar 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the # power failure In the above example, the destination offset of the clone operation corresponds to the size of the "bar" file minus 4Kb. So during the clone operation, the extent map covering the range from 2572Kb to 2728Kb gets trimmed so that it ends at offset 2724Kb, and a new extent map covering the range from 2724Kb to 11724Kb is created. So at the end of the clone operation when we ask to truncate the pages in the range from 2724Kb to 2724Kb + 15908Kb, the page invalidation callback ends up removing the new extent map (through try_release_extent_mapping()) when the page at offset 2724Kb is passed to that callback. Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent map is removed at try_release_extent_mapping(), forcing the next fsync to search for modified extents in the fs/subvolume tree instead of relying on the presence of extent maps in memory. This way we can continue doing a "fast" fsync if the destination range of a clone operation does not overlap with an existing range or if any of the criteria necessary to remove an extent map at try_release_extent_mapping() is not met (file size not bigger then 16Mb or gfp flags do not allow blocking). CC: stable@vger.kernel.org # 3.16+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'for-4.18-rc1-tag' of ↵Linus Torvalds2018-06-261-1/+4
|\| | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two regression fixes and an incorrect error value propagation fix from 'rename exchange'" * tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix return value on rename exchange failure btrfs: fix invalid-free in btrfs_extent_same Btrfs: fix physical offset reported by fiemap for inline extents
OpenPOWER on IntegriCloud