summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
* btrfs: don't enospc all tickets on flush failureJosef Bacik2019-04-051-20/+25
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f91587e4151e84f798f37839dddd3e4152fb4c76 ] With the introduction of the per-inode block_rsv it became possible to have really really large reservation requests made because of data fragmentation. Since the ticket stuff assumed that we'd always have relatively small reservation requests it just killed all tickets if we were unable to satisfy the current request. However, this is generally not the case anymore. So fix this logic to instead see if we had a ticket that we were able to give some reservation to, and if we were continue the flushing loop again. Likewise we make the tickets use the space_info_add_old_bytes() method of returning what reservation they did receive in hopes that it could satisfy reservations down the line. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* btrfs: qgroup: Make qgroup async transaction commit more aggressiveQu Wenruo2019-04-051-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f5fef4593653dfa2a865c485bb81415de51d5c99 ] [BUG] Btrfs qgroup will still hit EDQUOT under the following case: $ dev=/dev/test/test $ mnt=/mnt/btrfs $ umount $mnt &> /dev/null $ umount $dev &> /dev/null $ mkfs.btrfs -f $dev $ mount $dev $mnt -o nospace_cache $ btrfs subv create $mnt/subv $ btrfs quota enable $mnt $ btrfs quota rescan -w $mnt $ btrfs qgroup limit -e 1G $mnt/subv $ fallocate -l 900M $mnt/subv/padding $ sync $ rm $mnt/subv/padding # Hit EDQUOT $ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file [CAUSE] Since commit a514d63882c3 ("btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT"), btrfs is not forced to commit transaction to reclaim more quota space. Instead, we just check pertrans metadata reservation against some threshold and try to do asynchronously transaction commit. However in above case, the pertrans metadata reservation is pretty small thus it will never trigger asynchronous transaction commit. [FIX] Instead of only accounting pertrans metadata reservation, we calculate how much free space we have, and if there isn't much free space left, commit transaction asynchronously to try to free some space. This may slow down the fs when we have less than 32M free qgroup space, but should reduce a lot of false EDQUOT, so the cost should be acceptable. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* btrfs: save drop_progress if we drop refs at allJosef Bacik2019-04-051-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit aea6f028d01d629eda2e958ccd1133e805cda159 ] Previously we only updated the drop_progress key if we were in the DROP_REFERENCE stage of snapshot deletion. This is because the UPDATE_BACKREF stage checks the flags of the blocks it's converting to FULL_BACKREF, so if we go over a block we processed before it doesn't matter, we just don't do anything. The problem is in do_walk_down() we will go ahead and drop the roots reference to any blocks that we know we won't need to walk into. Given subvolume A and snapshot B. The root of B points to all of the nodes that belong to A, so all of those nodes have a refcnt > 1. If B did not modify those blocks it'll hit this condition in do_walk_down if (!wc->update_ref || generation <= root->root_key.offset) goto skip; and in "goto skip" we simply do a btrfs_free_extent() for that bytenr that we point at. Now assume we modified some data in B, and then took a snapshot of B and call it C. C points to all the nodes in B, making every node the root of B points to have a refcnt > 1. This assumes the root level is 2 or higher. We delete snapshot B, which does the above work in do_walk_down, free'ing our ref for nodes we share with A that we didn't modify. Now we hit a node we _did_ modify, thus we own. We need to walk down into this node and we set wc->stage == UPDATE_BACKREF. We walk down to level 0 which we also own because we modified data. We can't walk any further down and thus now need to walk up and start the next part of the deletion. Now walk_up_proc is supposed to put us back into DROP_REFERENCE, but there's an exception to this if (level < wc->shared_level) goto out; we are at level == 0, and our shared_level == 1. We skip out of this one and go up to level 1. Since path->slots[1] < nritems we path->slots[1]++ and break out of walk_up_tree to stop our transaction and loop back around. Now in btrfs_drop_snapshot we have this snippet if (wc->stage == DROP_REFERENCE) { level = wc->level; btrfs_node_key(path->nodes[level], &root_item->drop_progress, path->slots[level]); root_item->drop_level = level; } our stage == UPDATE_BACKREF still, so we don't update the drop_progress key. This is a problem because we would have done btrfs_free_extent() for the nodes leading up to our current position. If we crash or unmount here and go to remount we'll start over where we were before and try to free our ref for blocks we've already freed, and thus abort() out. Fix this by keeping track of the last place we dropped a reference for our block in do_walk_down. Then if wc->stage == UPDATE_BACKREF we know we'll start over from a place we meant to, and otherwise things continue to work as they did before. I have a complicated reproducer for this problem, without this patch we'll fail to fsck the fs when replaying the log writes log. With this patch we can replay the whole log without any fsck or mount failures. The steps to reproduce this easily are sort of tricky, I had to add a couple of debug patches to the kernel in order to make it easy, basically I just needed to make sure we did actually commit the transaction every time we finished a walk_down_tree/walk_up_tree combo. The reproducer: 1) Creates a base subvolume. 2) Creates 100k files in the subvolume. 3) Snapshots the base subvolume (snap1). 4) Touches files 5000-6000 in snap1. 5) Snapshots snap1 (snap2). 6) Deletes snap1. I do this with dm-log-writes, and then replay to every FUA in the log and fsck the fs. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ copy reproducer steps ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* Btrfs: fix assertion failure on fsync with NO_HOLES enabledFilipe Manana2019-04-031-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0ccc3876e4b2a1559a4dbe3126dda4459d38a83b upstream. Back in commit a89ca6f24ffe4 ("Btrfs: fix fsync after truncate when no_holes feature is enabled") I added an assertion that is triggered when an inline extent is found to assert that the length of the (uncompressed) data the extent represents is the same as the i_size of the inode, since that is true most of the time I couldn't find or didn't remembered about any exception at that time. Later on the assertion was expanded twice to deal with a case of a compressed inline extent representing a range that matches the sector size followed by an expanding truncate, and another case where fallocate can update the i_size of the inode without adding or updating existing extents (if the fallocate range falls entirely within the first block of the file). These two expansion/fixes of the assertion were done by commit 7ed586d0a8241 ("Btrfs: fix assertion on fsync of regular file when using no-holes feature") and commit 6399fb5a0b69a ("Btrfs: fix assertion failure during fsync in no-holes mode"). These however missed the case where an falloc expands the i_size of an inode to exactly the sector size and inline extent exists, for example: $ mkfs.btrfs -f -O no-holes /dev/sdc $ mount /dev/sdc /mnt $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar wrote 1096/1096 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec) $ xfs_io -c "falloc 1096 3000" /mnt/foobar $ xfs_io -c "fsync" /mnt/foobar Segmentation fault $ dmesg [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727 [701253.602962] ------------[ cut here ]------------ [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533! [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G W 5.0.0-rc8-btrfs-next-45 #1 [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs] (...) [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286 [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000 [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868 [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000 [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18 [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80 [701253.607769] FS: 00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000 [701253.608163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0 [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [701253.609608] Call Trace: [701253.609994] btrfs_log_inode+0xdfb/0xe40 [btrfs] [701253.610383] btrfs_log_inode_parent+0x2be/0xa60 [btrfs] [701253.610770] ? do_raw_spin_unlock+0x49/0xc0 [701253.611150] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [701253.611537] btrfs_sync_file+0x3b2/0x440 [btrfs] [701253.612010] ? do_sysinfo+0xb0/0xf0 [701253.612552] do_fsync+0x38/0x60 [701253.612988] __x64_sys_fsync+0x10/0x20 [701253.613360] do_syscall_64+0x60/0x1b0 [701253.613733] entry_SYSCALL_64_after_hwframe+0x49/0xbe [701253.614103] RIP: 0033:0x7f14da4e66d0 (...) [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0 [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003 [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010 [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260 [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10 (...) [701253.619941] ---[ end trace e088d74f132b6da5 ]--- Updating the assertion again to allow for this particular case would result in a meaningless assertion, plus there is currently no risk of logging content that would result in any corruption after a log replay if the size of the data encoded in an inline extent is greater than the inode's i_size (which is not currently possibe either with or without compression), therefore just remove the assertion. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: Avoid possible qgroup_rsv_size overflow in ↵Nikolay Borisov2019-04-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_calculate_inode_block_rsv_size commit 139a56170de67101791d6e6c8e940c6328393fe9 upstream. qgroup_rsv_size is calculated as the product of outstanding_extent * fs_info->nodesize. The product is calculated with 32 bit precision since both variables are defined as u32. Yet qgroup_rsv_size expects a 64 bit result. Avoid possible multiplication overflow by casting outstanding_extent to u64. Such overflow would in the worst case (64K nodesize) require more than 65536 extents, which is quite large and i'ts not likely that it would happen in practice. Fixes-coverity-id: 1435101 Fixes: ff6bc37eb7f6 ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: Fix bound checking in qgroup_trace_new_subtree_blocksNikolay Borisov2019-04-031-2/+2
| | | | | | | | | | | | | | | | | | | | | commit 7ff2c2a1a71e83f74574b8001ea88deb3c166ad7 upstream. If 'cur_level' is 7 then the bound checking at the top of the function will actually pass. Later on, it's possible to dereference ds_path->nodes[cur_level+1] which will be an out of bounds. The correct check will be cur_level >= BTRFS_MAX_LEVEL - 1 . Fixes-coverty-id: 1440918 Fixes-coverty-id: 1440911 Fixes: ea49f3e73c4b ("btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree") CC: stable@vger.kernel.org # 4.20+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: raid56: properly unmap parity page in finish_parity_scrub()Andrea Righi2019-04-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3897b6f0a859288c22fb793fad11ec2327e60fcd upstream. Parity page is incorrectly unmapped in finish_parity_scrub(), triggering a reference counter bug on i386, i.e.: [ 157.662401] kernel BUG at mm/highmem.c:349! [ 157.666725] invalid opcode: 0000 [#1] SMP PTI The reason is that kunmap(p_page) was completely left out, so we never did an unmap for the p_page and the loop unmapping the rbio page was iterating over the wrong number of stripes: unmapping should be done with nr_data instead of rbio->real_stripes. Test case to reproduce the bug: - create a raid5 btrfs filesystem: # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde - mount it: # mount /dev/sdb /mnt - run btrfs scrub in a loop: # while :; do btrfs scrub start -BR /mnt; done BugLink: https://bugs.launchpad.net/bugs/1812845 Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: don't report readahead errors and don't update statisticsDavid Sterba2019-04-031-1/+1
| | | | | | | | | | | | | | | | | | | | | commit 0cc068e6ee59c1fffbfa977d8bf868b7551d80ac upstream. As readahead is an optimization, all errors are usually filtered out, but still properly handled when the real read call is done. The commit 5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead") added REQ_RAHEAD to readpages() because that's only used for readahead (despite what one would expect from the callback name). This causes a flood of messages and inflated read error stats, so skip reporting in case it's readahead. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403 Reported-by: LimeTech <tomm@lime-technology.com> Fixes: 5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: remove WARN_ON in log_dir_itemsJosef Bacik2019-04-031-2/+9
| | | | | | | | | | | | | | | | | | | | | commit 2cc8334270e281815c3850c3adea363c51f21e0d upstream. When Filipe added the recursive directory logging stuff in 2f2ff0ee5e430 ("Btrfs: fix metadata inconsistencies after directory fsync") he specifically didn't take the directory i_mutex for the children directories that we need to log because of lockdep. This is generally fine, but can lead to this WARN_ON() tripping if we happen to run delayed deletion's in between our first search and our second search of dir_item/dir_indexes for this directory. We expect this to happen, so the WARN_ON() isn't necessary. Drop the WARN_ON() and add a comment so we know why this case can happen. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix incorrect file size after shrinking truncate and fsyncFilipe Manana2019-04-031-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit bf504110bc8aa05df48b0e5f0aa84bfb81e0574b upstream. If we do a shrinking truncate against an inode which is already present in the respective log tree and then rename it, as part of logging the new name we end up logging an inode item that reflects the old size of the file (the one which we previously logged) and not the new smaller size. The decision to preserve the size previously logged was added by commit 1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to inode") in order to avoid data loss after replaying the log. However that decision is only needed for the case the logged inode size is smaller then the current size of the inode, as explained in that commit's change log. If the current size of the inode is smaller then the previously logged size, we know a shrinking truncate happened and therefore need to use that smaller size. Example to trigger the problem: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo $ xfs_io -c "fsync" /mnt/foo $ xfs_io -c "truncate 3000" /mnt/foo $ mv /mnt/foo /mnt/bar $ xfs_io -c "fsync" /mnt/bar <power failure> $ mount /dev/sdb /mnt $ od -t x1 -A d /mnt/bar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 0008000 Once we rename the file, we log its name (and inode item), and because the inode was already logged before in the current transaction, we log it with a size of 8000 bytes because that is the size we previously logged (with the first fsync). As part of the rename, besides logging the inode, we do also sync the log, which is done since commit d4682ba03ef618 ("Btrfs: sync log after logging new name"), so the next fsync against our inode is effectively a no-op, since no new changes happened since the rename operation. Even if did not sync the log during the rename operation, the same problem (fize size of 8000 bytes instead of 3000 bytes) would be visible after replaying the log if the log ended up getting synced to disk through some other means, such as for example by fsyncing some other modified file. In the example above the fsync after the rename operation is there just because not every filesystem may guarantee logging/journalling the inode (and syncing the log/journal) during the rename operation, for example it is needed for f2fs, but not for ext4 and xfs. Fix this scenario by, when logging a new name (which is triggered by rename and link operations), using the current size of the inode instead of the previously logged inode size. A test case for fstests follows soon. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695 CC: stable@vger.kernel.org # 4.4+ Reported-by: Seulbae Kim <seulbae@gatech.edu> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix deadlock between clone/dedupe and renameFilipe Manana2019-03-231-18/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4ea748e1d2c9f8a27332b949e8210dbbf392987e upstream. Reflinking (clone/dedupe) and rename are operations that operate on two inodes and therefore need to lock them in the same order to avoid ABBA deadlocks. It happens that Btrfs' reflink implementation always locked them in a different order from VFS's lock_two_nondirectories() helper, which is used by the rename code in VFS, resulting in ABBA type deadlocks. Btrfs' locking order: static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2) { if (inode1 < inode2) swap(inode1, inode2); inode_lock_nested(inode1, I_MUTEX_PARENT); inode_lock_nested(inode2, I_MUTEX_CHILD); } VFS's locking order: void lock_two_nondirectories(struct inode *inode1, struct inode *inode2) { if (inode1 > inode2) swap(inode1, inode2); if (inode1 && !S_ISDIR(inode1->i_mode)) inode_lock(inode1); if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1) inode_lock_nested(inode2, I_MUTEX_NONDIR2); } Fix this by killing the btrfs helper function that does the double inode locking and replace it with VFS's helper lock_two_nondirectories(). Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 416161db9b63e3 ("btrfs: offline dedupe") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix corruption reading shared and compressed extents after hole punchingFilipe Manana2019-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream. In the past we had data corruption when reading compressed extents that are shared within the same file and they are consecutive, this got fixed by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents"). However there was a case that was missing in those fixes, which is when the shared and compressed extents are referenced with a non-zero offset. The following shell script creates a reproducer for this issue: #!/bin/bash mkfs.btrfs -f /dev/sdc &> /dev/null mount -o compress /dev/sdc /mnt/sdc # Create a file with 3 consecutive compressed extents, each has an # uncompressed size of 128Kb and a compressed size of 4Kb. for ((i = 1; i <= 3; i++)); do head -c 4096 /dev/zero for ((j = 1; j <= 31; j++)); do head -c 4096 /dev/zero | tr '\0' "\377" done done > /mnt/sdc/foobar sync echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)" # Clone the first extent into offsets 128K and 256K. xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar sync echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)" # Punch holes into the regions that are already full of zeroes. xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar sync echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" echo "Dropping page cache..." sysctl -q vm.drop_caches=1 echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" umount /dev/sdc When running the script we get the following output: Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar linked 131072/131072 bytes at offset 131072 128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec) linked 131072/131072 bytes at offset 262144 128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec) Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Dropping page cache... Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar This happens because after reading all the pages of the extent in the range from 128K to 256K for example, we read the hole at offset 256K and then when reading the page at offset 260K we don't submit the existing bio, which is responsible for filling all the page in the range 128K to 256K only, therefore adding the pages from range 260K to 384K to the existing bio and submitting it after iterating over the entire range. Once the bio completes, the uncompressed data fills only the pages in the range 128K to 256K because there's no more data read from disk, leaving the pages in the range 260K to 384K unfilled. It is just a slightly different variant of what was solved by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents"). Fix this by forcing a bio submit, during readpages(), whenever we find a compressed extent map for a page that is different from the extent map for the previous page or has a different starting offset (in case it's the same compressed extent), instead of the extent map's original start offset. A test case for fstests follows soon. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents") Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") Cc: stable@vger.kernel.org # 4.3+ Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: init csum_list before possible freeDan Robertson2019-03-231-1/+1
| | | | | | | | | | | | | | | | | | commit e49be14b8d80e23bb7c53d78c21717a474ade76b upstream. The scrub_ctx csum_list member must be initialized before scrub_free_ctx is called. If the csum_list is not initialized beforehand, the list_empty call in scrub_free_csums will result in a null deref if the allocation fails in the for loop. Fixes: a2de733c78fa ("btrfs: scrub") CC: stable@vger.kernel.org # 3.0+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Dan Robertson <dan@dlrobertson.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: ensure that a DUP or RAID1 block group has exactly two stripesJohannes Thumshirn2019-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 349ae63f40638a28c6fce52e8447c2d14b84cc0c upstream. We recently had a customer issue with a corrupted filesystem. When trying to mount this image btrfs panicked with a division by zero in calc_stripe_length(). The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length() takes this value and divides it by the number of copies the RAID profile is expected to have to calculate the amount of data stripes. As a DUP profile is expected to have 2 copies this division resulted in 1/2 = 0. Later then the 'data_stripes' variable is used as a divisor in the stripe length calculation which results in a division by 0 and thus a kernel panic. When encountering a filesystem with a DUP block group and a 'num_stripes' value unequal to 2, refuse mounting as the image is corrupted and will lead to unexpected behaviour. Code inspection showed a RAID1 block group has the same issues. Fixes: e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: drop the lock on error in btrfs_dev_replace_cancelDan Carpenter2019-03-231-0/+1
| | | | | | | | | | | | | | | | | | | | | commit 669e859b5ea7c6f4fce0149d3907c64e550c294b upstream. We should drop the lock on this error path. This has been found by a static tool. The lock needs to be released, it's there to protect access to the dev_replace members and is not supposed to be left locked. The value of state that's being switched would need to be artifically changed to an invalid value so the default: branch is taken. Fixes: d189dd70e255 ("btrfs: fix use-after-free due to race between replace start and cancel") CC: stable@vger.kernel.org # 5.0+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: scrub: fix circular locking dependency warningAnand Jain2019-03-231-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1cec3f27168d7835ff3d23ab371cd548440131bb upstream. This fixes a longstanding lockdep warning triggered by fstests/btrfs/011. Circular locking dependency check reports warning[1], that's because the btrfs_scrub_dev() calls the stack #0 below with, the fs_info::scrub_lock held. The test case leading to this warning: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /btrfs $ btrfs scrub start -B /btrfs In fact we have fs_info::scrub_workers_refcnt to track if the init and destroy of the scrub workers are needed. So once we have incremented and decremented the fs_info::scrub_workers_refcnt value in the thread, its ok to drop the scrub_lock, and then actually do the btrfs_destroy_workqueue() part. So this patch drops the scrub_lock before calling btrfs_destroy_workqueue(). [359.258534] ====================================================== [359.260305] WARNING: possible circular locking dependency detected [359.261938] 5.0.0-rc6-default #461 Not tainted [359.263135] ------------------------------------------------------ [359.264672] btrfs/20975 is trying to acquire lock: [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540 [359.268416] [359.268416] but task is already holding lock: [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs] [359.272418] [359.272418] which lock already depends on the new lock. [359.272418] [359.274692] [359.274692] the existing dependency chain (in reverse order) is: [359.276671] [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}: [359.278187] __mutex_lock+0x86/0x9c0 [359.279086] btrfs_scrub_pause+0x31/0x100 [btrfs] [359.280421] btrfs_commit_transaction+0x1e4/0x9e0 [btrfs] [359.281931] close_ctree+0x30b/0x350 [btrfs] [359.283208] generic_shutdown_super+0x64/0x100 [359.284516] kill_anon_super+0x14/0x30 [359.285658] btrfs_kill_super+0x12/0xa0 [btrfs] [359.286964] deactivate_locked_super+0x29/0x60 [359.288242] cleanup_mnt+0x3b/0x70 [359.289310] task_work_run+0x98/0xc0 [359.290428] exit_to_usermode_loop+0x83/0x90 [359.291445] do_syscall_64+0x15b/0x180 [359.292598] entry_SYSCALL_64_after_hwframe+0x49/0xbe [359.294011] [359.294011] -> #2 (sb_internal#2){.+.+}: [359.295432] __sb_start_write+0x113/0x1d0 [359.296394] start_transaction+0x369/0x500 [btrfs] [359.297471] btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs] [359.298629] normal_work_helper+0xcd/0x530 [btrfs] [359.299698] process_one_work+0x246/0x610 [359.300898] worker_thread+0x3c/0x390 [359.302020] kthread+0x116/0x130 [359.303053] ret_from_fork+0x24/0x30 [359.304152] [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}: [359.306100] process_one_work+0x21f/0x610 [359.307302] worker_thread+0x3c/0x390 [359.308465] kthread+0x116/0x130 [359.309357] ret_from_fork+0x24/0x30 [359.310229] [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}: [359.311812] lock_acquire+0x90/0x180 [359.312929] flush_workqueue+0xaa/0x540 [359.313845] drain_workqueue+0xa1/0x180 [359.314761] destroy_workqueue+0x17/0x240 [359.315754] btrfs_destroy_workqueue+0x57/0x200 [btrfs] [359.317245] scrub_workers_put+0x2c/0x60 [btrfs] [359.318585] btrfs_scrub_dev+0x336/0x590 [btrfs] [359.319944] btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs] [359.321622] btrfs_ioctl+0x28a4/0x2e40 [btrfs] [359.322908] do_vfs_ioctl+0xa2/0x6d0 [359.324021] ksys_ioctl+0x3a/0x70 [359.325066] __x64_sys_ioctl+0x16/0x20 [359.326236] do_syscall_64+0x54/0x180 [359.327379] entry_SYSCALL_64_after_hwframe+0x49/0xbe [359.328772] [359.328772] other info that might help us debug this: [359.328772] [359.330990] Chain exists of: [359.330990] (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock [359.330990] [359.334376] Possible unsafe locking scenario: [359.334376] [359.336020] CPU0 CPU1 [359.337070] ---- ---- [359.337821] lock(&fs_info->scrub_lock); [359.338506] lock(sb_internal#2); [359.339506] lock(&fs_info->scrub_lock); [359.341461] lock((wq_completion)"%s-%s""btrfs", name); [359.342437] [359.342437] *** DEADLOCK *** [359.342437] [359.343745] 1 lock held by btrfs/20975: [359.344788] #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs] [359.346778] [359.346778] stack backtrace: [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461 [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [359.350501] Call Trace: [359.350931] dump_stack+0x67/0x90 [359.351676] print_circular_bug.isra.37.cold.56+0x15c/0x195 [359.353569] check_prev_add.constprop.44+0x4f9/0x750 [359.354849] ? check_prev_add.constprop.44+0x286/0x750 [359.356505] __lock_acquire+0xb84/0xf10 [359.357505] lock_acquire+0x90/0x180 [359.358271] ? flush_workqueue+0x87/0x540 [359.359098] flush_workqueue+0xaa/0x540 [359.359912] ? flush_workqueue+0x87/0x540 [359.360740] ? drain_workqueue+0x1e/0x180 [359.361565] ? drain_workqueue+0xa1/0x180 [359.362391] drain_workqueue+0xa1/0x180 [359.363193] destroy_workqueue+0x17/0x240 [359.364539] btrfs_destroy_workqueue+0x57/0x200 [btrfs] [359.365673] scrub_workers_put+0x2c/0x60 [btrfs] [359.366618] btrfs_scrub_dev+0x336/0x590 [btrfs] [359.367594] ? start_transaction+0xa1/0x500 [btrfs] [359.368679] btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs] [359.369545] btrfs_ioctl+0x28a4/0x2e40 [btrfs] [359.370186] ? __lock_acquire+0x263/0xf10 [359.370777] ? kvm_clock_read+0x14/0x30 [359.371392] ? kvm_sched_clock_read+0x5/0x10 [359.372248] ? sched_clock+0x5/0x10 [359.372786] ? sched_clock_cpu+0xc/0xc0 [359.373662] ? do_vfs_ioctl+0xa2/0x6d0 [359.374552] do_vfs_ioctl+0xa2/0x6d0 [359.375378] ? do_sigaction+0xff/0x250 [359.376233] ksys_ioctl+0x3a/0x70 [359.376954] __x64_sys_ioctl+0x16/0x20 [359.377772] do_syscall_64+0x54/0x180 [359.378841] entry_SYSCALL_64_after_hwframe+0x49/0xbe [359.380422] RIP: 0033:0x7f5429296a97 Backporting to older kernels: scrub_nocow_workers must be freed the same way as the others. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: setup a nofs context for memory allocation at __btrfs_set_aclFilipe Manana2019-03-231-0/+9
| | | | | | | | | | | | | | | | | commit a0873490660246db587849a9e172f2b7b21fa88a upstream. We are holding a transaction handle when setting an acl, therefore we can not allocate the xattr value buffer using GFP_KERNEL, as we could deadlock if reclaim is triggered by the allocation, therefore setup a nofs context. Fixes: 39a27ec1004e8 ("btrfs: use GFP_KERNEL for xattr and acl allocations") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()Filipe Manana2019-03-231-0/+8
| | | | | | | | | | | | | | | | | commit b89f6d1fcb30a8cbdc18ce00c7d93792076af453 upstream. We are holding a transaction handle when creating a tree, therefore we can not allocate the root using GFP_KERNEL, as we could deadlock if reclaim is triggered by the allocation, therefore setup a nofs context. Fixes: 74e4d82757f74 ("btrfs: let callers of btrfs_alloc_root pass gfp flags") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge tag 'for-5.0-rc4-tag' of ↵Linus Torvalds2019-02-034-38/+71
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - regression fix: transaction commit can run away due to delayed ref waiting heuristic, this is not necessary now because of the proper reservation mechanism introduced in 5.0 - regression fix: potential crash due to use-before-check of an ERR_PTR return value - fix for transaction abort during transaction commit that needs to properly clean up pending block groups - fix deadlock during b-tree node/leaf splitting, when this happens on some of the fundamental trees, we must prevent new tree block allocation to re-enter indirectly via the block group flushing path - potential memory leak after errors during mount * tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: On error always free subvol_name in btrfs_mount btrfs: clean up pending block groups when transaction commit aborts btrfs: fix potential oops in device_list_add btrfs: don't end the transaction for delayed refs in throttle Btrfs: fix deadlock when allocating tree block during leaf/node split
| * btrfs: On error always free subvol_name in btrfs_mountEric W. Biederman2019-01-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The subvol_name is allocated in btrfs_parse_subvol_options and is consumed and freed in mount_subvol. Add a free to the error paths that don't call mount_subvol so that it is guaranteed that subvol_name is freed when an error happens. Fixes: 312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()") Cc: stable@vger.kernel.org # v4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: clean up pending block groups when transaction commit abortsDavid Sterba2019-01-301-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fstests generic/475 stresses transaction aborts and can reveal space accounting or use-after-free bugs regarding block goups. In this case the pending block groups that remain linked to the structures after transaction commit aborts in the middle. The corrupted slabs lead to failures in following tests, eg. generic/476 [ 8172.752887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 [ 8172.755799] #PF error: [normal kernel read fault] [ 8172.757571] PGD 661ae067 P4D 661ae067 PUD 3db8e067 PMD 0 [ 8172.759000] Oops: 0000 [#1] PREEMPT SMP [ 8172.760209] CPU: 0 PID: 39 Comm: kswapd0 Tainted: G W 5.0.0-rc2-default #408 [ 8172.762495] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [ 8172.765772] RIP: 0010:shrink_page_list+0x2f9/0xe90 [ 8172.770453] RSP: 0018:ffff967f00663b18 EFLAGS: 00010287 [ 8172.771184] RAX: 0000000000000000 RBX: ffff967f00663c20 RCX: 0000000000000000 [ 8172.772850] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8c0620ab20e0 [ 8172.774629] RBP: ffff967f00663dd8 R08: 0000000000000000 R09: 0000000000000000 [ 8172.776094] R10: ffff8c0620ab22f8 R11: ffff8c063f772688 R12: ffff967f00663b78 [ 8172.777533] R13: ffff8c063f625600 R14: ffff8c063f625608 R15: dead000000000200 [ 8172.778886] FS: 0000000000000000(0000) GS:ffff8c063d400000(0000) knlGS:0000000000000000 [ 8172.780545] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8172.781787] CR2: 0000000000000058 CR3: 000000004e962000 CR4: 00000000000006f0 [ 8172.783547] Call Trace: [ 8172.784112] shrink_inactive_list+0x194/0x410 [ 8172.784747] shrink_node_memcg.constprop.85+0x3a5/0x6a0 [ 8172.785472] shrink_node+0x62/0x1e0 [ 8172.786011] balance_pgdat+0x216/0x460 [ 8172.786577] kswapd+0xe3/0x4a0 [ 8172.787085] ? finish_wait+0x80/0x80 [ 8172.787795] ? balance_pgdat+0x460/0x460 [ 8172.788799] kthread+0x116/0x130 [ 8172.789640] ? kthread_create_on_node+0x60/0x60 [ 8172.790323] ret_from_fork+0x24/0x30 [ 8172.794253] CR2: 0000000000000058 or accounting errors at umount time: [ 8159.537251] WARNING: CPU: 2 PID: 19031 at fs/btrfs/extent-tree.c:5987 btrfs_free_block_groups+0x3d5/0x410 [btrfs] [ 8159.543325] CPU: 2 PID: 19031 Comm: umount Tainted: G W 5.0.0-rc2-default #408 [ 8159.545472] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [ 8159.548155] RIP: 0010:btrfs_free_block_groups+0x3d5/0x410 [btrfs] [ 8159.554030] RSP: 0018:ffff967f079cbde8 EFLAGS: 00010206 [ 8159.555144] RAX: 0000000001000000 RBX: ffff8c06366cf800 RCX: 0000000000000000 [ 8159.556730] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8c06255ad800 [ 8159.558279] RBP: ffff8c0637ac0000 R08: 0000000000000001 R09: 0000000000000000 [ 8159.559797] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8c0637ac0108 [ 8159.561296] R13: ffff8c0637ac0158 R14: 0000000000000000 R15: dead000000000100 [ 8159.562852] FS: 00007f7f693b9fc0(0000) GS:ffff8c063d800000(0000) knlGS:0000000000000000 [ 8159.564839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8159.566160] CR2: 00007f7f68fab7b0 CR3: 000000000aec7000 CR4: 00000000000006e0 [ 8159.567898] Call Trace: [ 8159.568597] close_ctree+0x17f/0x350 [btrfs] [ 8159.569628] generic_shutdown_super+0x64/0x100 [ 8159.570808] kill_anon_super+0x14/0x30 [ 8159.571857] btrfs_kill_super+0x12/0xa0 [btrfs] [ 8159.573063] deactivate_locked_super+0x29/0x60 [ 8159.574234] cleanup_mnt+0x3b/0x70 [ 8159.575176] task_work_run+0x98/0xc0 [ 8159.576177] exit_to_usermode_loop+0x83/0x90 [ 8159.577315] do_syscall_64+0x15b/0x180 [ 8159.578339] entry_SYSCALL_64_after_hwframe+0x49/0xbe This fix is based on 2 Josef's patches that used sideefects of btrfs_create_pending_block_groups, this fix introduces the helper that does what we need. CC: stable@vger.kernel.org # 4.4+ CC: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix potential oops in device_list_addAl Viro2019-01-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its result before the check for IS_ERR() is a bad idea. Fixes: d1a63002829a4 ("btrfs: add members to fs_devices to track fsid changes") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: don't end the transaction for delayed refs in throttleJosef Bacik2019-01-281-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously callers to btrfs_end_transaction_throttle() would commit the transaction if there wasn't enough delayed refs space. This happens in relocation, and if the fs is relatively empty we'll run out of delayed refs space basically immediately, so we'll just be stuck in this loop of committing the transaction over and over again. This code existed because we didn't have a good feedback mechanism for running delayed refs, but with the delayed refs rsv we do now. Delete this throttling code and let the btrfs_start_transaction() in relocation deal with putting pressure on the delayed refs infrastructure. With this patch we no longer take 5 minutes to balance a metadata only fs. Qu has submitted a fstest to catch slow balance or excessive transaction commits. Steps to reproduce: * create subvolume * create many (eg. 16000) inlined files, of size 2KiB * iteratively snapshot and touch several files to trigger metadata updates * start balance -m Reported-by: Qu Wenruo <wqu@suse.com> Fixes: 64403612b73a ("btrfs: rework btrfs_check_space_for_delayed_refs") Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add tags and steps to reproduce ] Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix deadlock when allocating tree block during leaf/node splitFilipe Manana2019-01-281-28/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When splitting a leaf or node from one of the trees that are modified when flushing pending block groups (extent, chunk, device and free space trees), we need to allocate a new tree block, which in turn can result in the need to allocate a new block group. After allocating the new block group we may need to flush new block groups that were previously allocated during the course of the current transaction, which is what may cause a deadlock due to attempts to write lock twice the same leaf or node, as when splitting a leaf or node we are holding a write lock on it and its parent node. The same type of deadlock can also happen when increasing the tree's height, since we are holding a lock on the existing root while allocating the tree block to use as the new root node. An example trace when the deadlock happens during the leaf split path is: [27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G W 4.19.16 #1 [27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018 [27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] (...) [27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246 [27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001 [27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48 [27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000 [27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000 [27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18 [27175.303601] FS: 0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000 [27175.304468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0 [27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [27175.307940] Call Trace: [27175.308802] btrfs_search_slot+0x779/0x9a0 [btrfs] [27175.309669] ? update_space_info+0xba/0xe0 [btrfs] [27175.310534] btrfs_insert_empty_items+0x67/0xc0 [btrfs] [27175.311397] btrfs_insert_item+0x60/0xd0 [btrfs] [27175.312253] btrfs_create_pending_block_groups+0xee/0x210 [btrfs] [27175.313116] do_chunk_alloc+0x25f/0x300 [btrfs] [27175.313984] find_free_extent+0x706/0x10d0 [btrfs] [27175.314855] btrfs_reserve_extent+0x9b/0x1d0 [btrfs] [27175.315707] btrfs_alloc_tree_block+0x100/0x5b0 [btrfs] [27175.316548] split_leaf+0x130/0x610 [btrfs] [27175.317390] btrfs_search_slot+0x94d/0x9a0 [btrfs] [27175.318235] btrfs_insert_empty_items+0x67/0xc0 [btrfs] [27175.319087] alloc_reserved_file_extent+0x84/0x2c0 [btrfs] [27175.319938] __btrfs_run_delayed_refs+0x596/0x1150 [btrfs] [27175.320792] btrfs_run_delayed_refs+0xed/0x1b0 [btrfs] [27175.321643] delayed_ref_async_start+0x81/0x90 [btrfs] [27175.322491] normal_work_helper+0xd0/0x320 [btrfs] [27175.323328] ? move_linked_works+0x6e/0xa0 [27175.324160] process_one_work+0x191/0x370 [27175.324976] worker_thread+0x4f/0x3b0 [27175.325763] kthread+0xf8/0x130 [27175.326531] ? rescuer_thread+0x320/0x320 [27175.327284] ? kthread_create_worker_on_cpu+0x50/0x50 [27175.328027] ret_from_fork+0x35/0x40 [27175.328741] ---[ end trace 300a1b9f0ac30e26 ]--- Fix this by preventing the flushing of new blocks groups when splitting a leaf/node and when inserting a new root node for one of the trees modified by the flushing operation, similar to what is done when COWing a node/leaf from on of these trees. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383 Reported-by: Eli V <eliventer@gmail.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'for-5.0-rc2-tag' of ↵Linus Torvalds2019-01-214-10/+35
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A handful of fixes (some of them in testing for a long time): - fix some test failures regarding cleanup after transaction abort - revert of a patch that could cause a deadlock - delayed iput fixes, that can help in ENOSPC situation when there's low space and a lot data to write" * tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: wakeup cleaner thread when adding delayed iput btrfs: run delayed iputs before committing btrfs: wait on ordered extents on abort cleanup btrfs: handle delayed ref head accounting cleanup in abort Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"
| * btrfs: wakeup cleaner thread when adding delayed iputJosef Bacik2019-01-183-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cleaner thread usually takes care of delayed iputs, with the exception of the btrfs_end_transaction_throttle path. Delaying iputs means we are potentially delaying the eviction of an inode and it's respective space. The cleaner thread only gets woken up every 30 seconds, or when we require space. If there are a lot of inodes that need to be deleted we could induce a serious amount of latency while we wait for these inodes to be evicted. So instead wakeup the cleaner if it's not already awake to process any new delayed iputs we add to the list. If we suddenly need space we will less likely be backed up behind a bunch of inodes that are waiting to be deleted, and we could possibly free space before we need to get into the flushing logic which will save us some latency. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: run delayed iputs before committingJosef Bacik2019-01-181-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | Delayed iputs means we can have final iputs of deleted inodes in the queue, which could potentially generate a lot of pinned space that could be free'd. So before we decide to commit the transaction for ENOPSC reasons, run the delayed iputs so that any potential space is free'd up. If there is and we freed enough we can then commit the transaction and potentially be able to make our reservation. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: wait on ordered extents on abort cleanupJosef Bacik2019-01-181-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we flip read-only before we initiate writeback on all dirty pages for ordered extents we've created then we'll have ordered extents left over on umount, which results in all sorts of bad things happening. Fix this by making sure we wait on ordered extents if we have to do the aborted transaction cleanup stuff. generic/475 can produce this warning: [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs] [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G W 5.0.0-rc1-default+ #394 [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs] [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286 [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000 [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000 [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000 [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0 [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000 [ 8531.202707] FS: 00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000 [ 8531.204851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0 [ 8531.207516] Call Trace: [ 8531.208175] btrfs_free_fs_roots+0xdb/0x170 [btrfs] [ 8531.210209] ? wait_for_completion+0x5b/0x190 [ 8531.211303] close_ctree+0x157/0x350 [btrfs] [ 8531.212412] generic_shutdown_super+0x64/0x100 [ 8531.213485] kill_anon_super+0x14/0x30 [ 8531.214430] btrfs_kill_super+0x12/0xa0 [btrfs] [ 8531.215539] deactivate_locked_super+0x29/0x60 [ 8531.216633] cleanup_mnt+0x3b/0x70 [ 8531.217497] task_work_run+0x98/0xc0 [ 8531.218397] exit_to_usermode_loop+0x83/0x90 [ 8531.219324] do_syscall_64+0x15b/0x180 [ 8531.220192] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8531.221286] RIP: 0033:0x7fc68e5e4d07 [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6 [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07 [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80 [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80 [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80 [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000 Leaving a tree in the rb-tree: 3853 void btrfs_free_fs_root(struct btrfs_root *root) 3854 { 3855 iput(root->ino_cache_inode); 3856 WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); CC: stable@vger.kernel.org Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add stacktrace ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle delayed ref head accounting cleanup in abortJosef Bacik2019-01-183-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We weren't doing any of the accounting cleanup when we aborted transactions. Fix this by making cleanup_ref_head_accounting global and calling it from the abort code, this fixes the issue where our accounting was all wrong after the fs aborts. The test generic/475 on a 2G VM can trigger the problems eg.: [ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_grou +ps+0x3dc/0x410 [btrfs] [ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394 [ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626 +cc-prebuilt.qemu-project.org 04/01/2014 [ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs] [ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206 [ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000 [ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400 [ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000 [ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108 [ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100 [ 8502.170296] FS: 00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000 [ 8502.172439] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0 [ 8502.175094] Call Trace: [ 8502.175759] close_ctree+0x17f/0x350 [btrfs] [ 8502.176721] generic_shutdown_super+0x64/0x100 [ 8502.177702] kill_anon_super+0x14/0x30 [ 8502.178607] btrfs_kill_super+0x12/0xa0 [btrfs] [ 8502.179602] deactivate_locked_super+0x29/0x60 [ 8502.180595] cleanup_mnt+0x3b/0x70 [ 8502.181406] task_work_run+0x98/0xc0 [ 8502.182255] exit_to_usermode_loop+0x83/0x90 [ 8502.183113] do_syscall_64+0x15b/0x180 [ 8502.183919] entry_SYSCALL_64_after_hwframe+0x49/0xbe Corresponding to release_global_block_rsv() { ... WARN_ON(fs_info->delayed_refs_rsv.reserved > 0); CC: stable@vger.kernel.org Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add log dump ] Signed-off-by: David Sterba <dsterba@suse.com>
| * Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"David Sterba2019-01-181-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit e73e81b6d0114d4a303205a952ab2e87c44bd279. This patch causes a few problems: - adds latency to btrfs_finish_ordered_io - as btrfs_finish_ordered_io is used for free space cache, generating more work from btrfs_btree_balance_dirty_nodelay could end up in the same workque, effectively deadlocking 12260 kworker/u96:16+btrfs-freespace-write D [<0>] balance_dirty_pages+0x6e6/0x7ad [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90 [<0>] btrfs_finish_ordered_io+0x3da/0x770 [<0>] normal_work_helper+0x1c5/0x5a0 [<0>] process_one_work+0x1ee/0x5a0 [<0>] worker_thread+0x46/0x3d0 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff Transaction commit will wait on the freespace cache: 838 btrfs-transacti D [<0>] btrfs_start_ordered_extent+0x154/0x1e0 [<0>] btrfs_wait_ordered_range+0xbd/0x110 [<0>] __btrfs_wait_cache_io+0x49/0x1a0 [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0 [<0>] commit_cowonly_roots+0x215/0x2b0 [<0>] btrfs_commit_transaction+0x37e/0x910 [<0>] transaction_kthread+0x14d/0x180 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff And then writepages ends up waiting on transaction commit: 9520 kworker/u96:13+flush-btrfs-1 D [<0>] wait_current_trans+0xac/0xe0 [<0>] start_transaction+0x21b/0x4b0 [<0>] cow_file_range_inline+0x10b/0x6b0 [<0>] cow_file_range.isra.69+0x329/0x4a0 [<0>] run_delalloc_range+0x105/0x3c0 [<0>] writepage_delalloc+0x119/0x180 [<0>] __extent_writepage+0x10c/0x390 [<0>] extent_write_cache_pages+0x26f/0x3d0 [<0>] extent_writepages+0x4f/0x80 [<0>] do_writepages+0x17/0x60 [<0>] __writeback_single_inode+0x59/0x690 [<0>] writeback_sb_inodes+0x291/0x4e0 [<0>] __writeback_inodes_wb+0x87/0xb0 [<0>] wb_writeback+0x3bb/0x500 [<0>] wb_workfn+0x40d/0x610 [<0>] process_one_work+0x1ee/0x5a0 [<0>] worker_thread+0x1e0/0x3d0 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff Eventually, we have every process in the system waiting on balance_dirty_pages(), and nobody is able to make progress on page writeback. The original patch tried to fix an OOM condition, that happened on 4.4 but no success reproducing that on later kernels (4.19 and 4.20). This is more likely a problem in OOM itself. Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/ Reported-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org # 4.18+ CC: ethanlien <ethanlien@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'for-5.0-rc1-tag' of ↵Linus Torvalds2019-01-143-13/+64
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - two regression fixes in clone/dedupe ioctls, the generic check callback needs to lock extents properly and wait for io to avoid problems with writeback and relocation - fix deadlock when using free space tree due to block group creation - a recently added check refuses a valid fileystem with seeding device, make that work again with a quickfix, proper solution needs more intrusive changes * tag 'for-5.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: Use real device structure to verify dev extent Btrfs: fix deadlock when using free space tree due to block group creation Btrfs: fix race between reflink/dedupe and relocation Btrfs: fix race between cloning range ending at eof and writeback
| * btrfs: Use real device structure to verify dev extentQu Wenruo2019-01-101-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel message: BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0 BTRFS error (device dm-6): failed to verify dev extents against chunks: -117 BTRFS error (device dm-6): open_ctree failed [CAUSE] Commit cf90d884b347 ("btrfs: Introduce mount time chunk <-> dev extent mapping check") introduced strict check on dev extents. We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and only dependent on @devid to find the real device. For seed devices, we call clone_fs_devices() in open_seed_devices() to allow us search seed devices directly. However clone_fs_devices() just populates devices with devid and dev uuid, without populating other essential members, like disk_total_bytes. This makes any device returned by btrfs_find_device(fs_info, devid, NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev extents on the seed device will not pass the device boundary check. [FIX] This patch will try to verify the device returned by btrfs_find_device() and if it's a dummy then re-search in seed devices. Fixes: cf90d884b347 ("btrfs: Introduce mount time chunk <-> dev extent mapping check") CC: stable@vger.kernel.org # 4.19+ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix deadlock when using free space tree due to block group creationFilipe Manana2019-01-091-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When modifying the free space tree we can end up COWing one of its extent buffers which in turn might result in allocating a new chunk, which in turn can result in flushing (finish creation) of pending block groups. If that happens we can deadlock because creating a pending block group needs to update the free space tree, and if any of the updates tries to modify the same extent buffer that we are COWing, we end up in a deadlock since we try to write lock twice the same extent buffer. So fix this by skipping pending block group creation if we are COWing an extent buffer from the free space tree. This is a case missed by commit 5ce555578e091 ("Btrfs: fix deadlock when writing out free space caches"). Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173 Fixes: 5ce555578e091 ("Btrfs: fix deadlock when writing out free space caches") CC: stable@vger.kernel.org # 4.18+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix race between reflink/dedupe and relocationFilipe Manana2019-01-091-6/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent rework that makes btrfs' remap_file_range operation use the generic helper generic_remap_file_range_prep() introduced a race between relocation and reflinking (for both cloning and deduplication) the file extents between the source and destination inodes. This happens because we no longer lock the source range anymore, and we do not lock it anymore because we wait for direct IO writes and writeback to complete early on the code path right after locking the inodes, which guarantees no other file operations interfere with the reflinking. However there is one exception which is relocation, since it replaces the byte number of file extents items in the fs tree after locking the range the file extent items represent. This is a problem because after finding each file extent to clone in the fs tree, the reflink process copies the file extent item into a local buffer, releases the search path, inserts new file extent items in the destination range and then increments the reference count for the extent mentioned in the file extent item that it previously copied to the buffer. If right after copying the file extent item into the buffer and releasing the path the relocation process updates the file extent item to point to the new extent, the reflink process ends up creating a delayed reference to increment the reference count of the old extent, for which the relocation process already created a delayed reference to drop it. This results in failure to run delayed references because we will attempt to increment the count of a reference that was already dropped. This is illustrated by the following diagram: CPU 1 CPU 2 relocation is running btrfs_clone_files() btrfs_clone() --> finds extent item in source range point to extent at bytenr X --> copies it into a local buffer --> releases path replace_file_extents() --> successfully locks the range represented by the file extent item --> replaces disk_bytenr field in the file extent item with some other value Y --> creates delayed reference to increment reference count for extent at bytenr Y --> creates delayed reference to drop the extent at bytenr X --> starts transaction --> creates delayed reference to increment extent at bytenr X <delayed references are run, due to a transaction commit for example, and the transaction is aborted with -EIO because we attempt to increment reference count for the extent at bytenr X after we freed it> When this race is hit the running transaction ends up getting aborted with an -EIO error and a trace like the following is produced: [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1 [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202 [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001 [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000 [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000 [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548 [ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000 [ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0 [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4382.564887] Call Trace: [ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs] [ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs] [ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs] [ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs] [ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs] [ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs] [ 4382.568112] ? _raw_spin_unlock+0x24/0x30 [ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs] [ 4382.569006] create_subvol+0x3c8/0x830 [btrfs] [ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60 [ 4382.570822] ? __sb_start_write+0xd4/0x1c0 [ 4382.571262] ? mnt_want_write_file+0x24/0x50 [ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs] [ 4382.572155] ? _copy_from_user+0x66/0x90 [ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs] [ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs] [ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570 [ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0 [ 4382.574379] ? _raw_spin_unlock+0x24/0x30 [ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0 [ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0 [ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 4382.576020] do_vfs_ioctl+0xa2/0x6f0 [ 4382.576405] ksys_ioctl+0x70/0x80 [ 4382.576776] __x64_sys_ioctl+0x16/0x20 [ 4382.577137] do_syscall_64+0x60/0x1b0 [ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7 [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003 [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044 [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010 [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050 [ 4382.580792] irq event stamp: 0 [ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null) [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null) [ 4382.582413] ---[ end trace d3c188e3e9367382 ]--- [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure [ 4382.624295] BTRFS info (device sdc): forced readonly Fix this by locking the source range before searching for the file extent items in the fs tree, since the relocation process will try to lock the range a file extent item represents before updating it with the new extent location. Fixes: 34a28e3d7753 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix race between cloning range ending at eof and writebackFilipe Manana2019-01-091-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent rework that makes btrfs' remap_file_range operation use the generic helper generic_remap_file_range_prep() introduced a race between writeback and cloning a range that covers the eof extent of the source file into a destination offset that is greater then the same file's size. This happens because we now wait for writeback to complete before doing the truncation of the eof block, while previously we did the truncation and then waited for writeback to complete. This leads to a race between writeback of the truncated block and cloning the file extents in the source range, because we copy each file extent item we find in the fs root into a buffer, then release the path and then increment the reference count for the extent referred in that file extent item we copied, which can no longer exist if writeback of the truncated eof block completes after we copied the file extent item into the buffer and before we incremented the reference count. This is illustrated by the following diagram: CPU 1 CPU 2 btrfs_clone_files() btrfs_cont_expand() btrfs_truncate_block() --> zeroes part of the page containg eof, marking it for delalloc btrfs_clone() --> finds extent item covering eof, points to extent at bytenr X --> copies it into a local buffer --> releases path writeback starts btrfs_finish_ordered_io() insert_reserved_file_extent() __btrfs_drop_extents() --> creates delayed reference to drop the extent at bytenr X --> starts transaction --> creates delayed reference to increment extent at bytenr X <delayed references are run, due to a transaction commit for example, and the transaction is aborted with -EIO because we attempt to increment reference count for the extent at bytenr X after we freed it> When this race is hit the running transaction ends up getting aborted with an -EIO error and a trace like the following is produced: [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1 [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202 [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001 [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000 [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000 [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548 [ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000 [ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0 [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4382.564887] Call Trace: [ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs] [ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs] [ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs] [ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs] [ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs] [ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs] [ 4382.568112] ? _raw_spin_unlock+0x24/0x30 [ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs] [ 4382.569006] create_subvol+0x3c8/0x830 [btrfs] [ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60 [ 4382.570822] ? __sb_start_write+0xd4/0x1c0 [ 4382.571262] ? mnt_want_write_file+0x24/0x50 [ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs] [ 4382.572155] ? _copy_from_user+0x66/0x90 [ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs] [ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs] [ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570 [ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0 [ 4382.574379] ? _raw_spin_unlock+0x24/0x30 [ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0 [ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0 [ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 4382.576020] do_vfs_ioctl+0xa2/0x6f0 [ 4382.576405] ksys_ioctl+0x70/0x80 [ 4382.576776] __x64_sys_ioctl+0x16/0x20 [ 4382.577137] do_syscall_64+0x60/0x1b0 [ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7 [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003 [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044 [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010 [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050 [ 4382.580792] irq event stamp: 0 [ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null) [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null) [ 4382.582413] ---[ end trace d3c188e3e9367382 ]--- [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure [ 4382.624295] BTRFS info (device sdc): forced readonly Fix this by waiting for writeback to complete after truncating the eof block. Fixes: 34a28e3d7753 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'mount.part1' of ↵Linus Torvalds2019-01-052-75/+11
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount API prep from Al Viro: "Mount API prereqs. Mostly that's LSM mount options cleanups. There are several minor fixes in there, but nothing earth-shattering (leaks on failure exits, mostly)" * 'mount.part1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (27 commits) mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNT smack: rewrite smack_sb_eat_lsm_opts() smack: get rid of match_token() smack: take the guts of smack_parse_opts_str() into a new helper LSM: new method: ->sb_add_mnt_opt() selinux: rewrite selinux_sb_eat_lsm_opts() selinux: regularize Opt_... names a bit selinux: switch away from match_token() selinux: new helper - selinux_add_opt() LSM: bury struct security_mnt_opts smack: switch to private smack_mnt_opts selinux: switch to private struct selinux_mnt_opts LSM: hide struct security_mnt_opts from any generic code selinux: kill selinux_sb_get_mnt_opts() LSM: turn sb_eat_lsm_opts() into a method nfs_remount(): don't leak, don't ignore LSM options quietly btrfs: sanitize security_mnt_opts use selinux; don't open-code a loop in sb_finish_set_opts() LSM: split ->sb_set_mnt_opts() out of ->sb_kern_mount() new helper: security_sb_eat_lsm_opts() ...
| * | LSM: hide struct security_mnt_opts from any generic codeAl Viro2018-12-211-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Keep void * instead, allocate on demand (in parse_str_opts, at the moment). Eventually both selinux and smack will be better off with private structures with several strings in those, rather than this "counter and two pointers to dynamically allocated arrays" ugliness. This commit allows to do that at leisure, without disrupting anything outside of given module. Changes: * instead of struct security_mnt_opt use an opaque pointer initialized to NULL. * security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and security_free_mnt_opts() take it as var argument (i.e. as void **); call sites are unchanged. * security_sb_set_mnt_opts() and security_sb_remount() take it by value (i.e. as void *). * new method: ->sb_free_mnt_opts(). Takes void *, does whatever freeing that needs to be done. * ->sb_set_mnt_opts() and ->sb_remount() might get NULL as mnt_opts argument, meaning "empty". Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | btrfs: sanitize security_mnt_opts useAl Viro2018-12-212-58/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) keeping a copy in btrfs_fs_info is completely pointless - we never use it for anything. Getting rid of that allows for simpler calling conventions for setup_security_options() (caller is responsible for freeing mnt_opts in all cases). 2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(), same as we would if not for FS_BINARY_MOUNTDATA. Behaviours *are* close (in fact, selinux sb_set_mnt_opts() ought to punt to sb_remount() in "already initialized" case), but let's handle that uniformly. And the only reason why the original btrfs changes didn't go for security_sb_remount() in btrfs_remount() case is that it hadn't been exported. Let's export it for a while - it'll be going away soon anyway. Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | new helper: security_sb_eat_lsm_opts()Al Viro2018-12-211-14/+1
| | | | | | | | | | | | | | | | | | | | | | | | combination of alloc_secdata(), security_sb_copy_data(), security_sb_parse_opt_str() and free_secdata(). Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2019-01-051-2/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge more updates from Andrew Morton: - procfs updates - various misc bits - lib/ updates - epoll updates - autofs - fatfs - a few more MM bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits) mm/page_io.c: fix polled swap page in checkpatch: add Co-developed-by to signature tags docs: fix Co-Developed-by docs drivers/base/platform.c: kmemleak ignore a known leak fs: don't open code lru_to_page() fs/: remove caller signal_pending branch predictions mm/: remove caller signal_pending branch predictions arch/arc/mm/fault.c: remove caller signal_pending_branch predictions kernel/sched/: remove caller signal_pending branch predictions kernel/locking/mutex.c: remove caller signal_pending branch predictions mm: select HAVE_MOVE_PMD on x86 for faster mremap mm: speed up mremap by 20x on large regions mm: treewide: remove unused address argument from pte_alloc functions initramfs: cleanup incomplete rootfs scripts/gdb: fix lx-version string output kernel/kcov.c: mark write_comp_data() as notrace kernel/sysctl: add panic_print into sysctl panic: add options to print system info when panic happens bfs: extra sanity checking and static inode bitmap exec: separate MM_ANONPAGES and RLIMIT_STACK accounting ...
| * | | fs: don't open code lru_to_page()Nikolay Borisov2019-01-041-2/+1
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multiple filesystems open code lru_to_page(). Rectify this by moving the macro from mm_inline (which is specific to lru stuff) to the more generic mm.h header and start using the macro where appropriate. No functional changes. Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Acked-by: "Yan, Zheng" <zyan@redhat.com> [ceph] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* / | Remove 'type' argument from access_ok() functionLinus Torvalds2019-01-031-1/+1
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | btrfs: Fix typos in comments and stringsAndrea Gelmini2018-12-1725-69/+70
| | | | | | | | | | | | | | | | | | The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: improve error handling of btrfs_add_linkJohannes Thumshirn2018-12-171-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the error handling block, err holds the return value of either btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked since it's introduction with commit fe66a05a0679 (Btrfs: improve error handling for btrfs_insert_dir_item callers) in 2012. If the error handling in the error handling fails, there's not much left to do and the abort either happened earlier in the callees or is necessary here. So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort the transaction, but still return the original code of the failure stored in 'ret' as this will be reported to the user. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: use generic_remap_file_range_prep() for cloning and deduplicationFilipe Manana2018-12-171-491/+133
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since cloning and deduplication are no longer Btrfs specific operations, we now have generic code to handle parameter validation, compare file ranges used for deduplication, clear capabilities when cloning, etc. This change makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a few bugs, such as: 1) When cloning, the destination file's capabilities were not dropped (the fstest generic/513 tests this); 2) We were not checking if the destination file is immutable; 3) Not checking if either the source or destination files are swap files (swap file support is coming soon for Btrfs); 4) System limits were not checked (resource limits and O_LARGEFILE). Note that the generic helper generic_remap_file_range_prep() does start and waits for writeback by calling filemap_write_and_wait_range(), however that is not enough for Btrfs for two reasons: 1) With compression, we need to start writeback twice in order to get the pages marked for writeback and ordered extents created; 2) filemap_write_and_wait_range() (and all its other variants) only waits for the IO to complete, but we need to wait for the ordered extents to finish, so that when we do the actual reflinking operations the file extent items are in the fs tree. This is also important due to the fact that the generic helper, for the deduplication case, compares the contents of the pages in the requested range, which might require reading extents from disk in the very unlikely case that pages get invalidated after writeback finishes (so the file extent items must be up to date in the fs tree). Since these reasons are specific to Btrfs we have to do it in the Btrfs code before calling generic_remap_file_range_prep(). This also results in a simpler way of dealing with existing delalloc in the source/target ranges, specially for the deduplication case where we used to lock all the pages first and then if we found any dealloc for the range, or ordered extent, we would unlock the pages trigger writeback and wait for ordered extents to complete, then lock all the pages again and check if deduplication can be done. So now we get a simpler approach: lock the inodes, then trigger writeback and then wait for ordered extents to complete. So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it) to eliminate duplicated code, fix a few bugs and benefit from future bug fixes done there - for example the recent clone and dedupe bugs involving reflinking a partial EOF block got a counterpart fix in the generic helper, since it affected all filesystems supporting these operations, so we no longer need special checks in Btrfs for them. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Refactor main loop in extent_readpagesNikolay Borisov2018-12-171-20/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | extent_readpages processes all pages in the readlist in batches of 16, this is implemented by a single for loop but thanks to an if condition the loop does 2 things based on whether we've filled the batch or not. Additionally due to the structure of the code there is an additional check which deals with partial batches. Streamline all of this by explicitly using two loops. The outter one is used to process all pages while the inner one just fills in the batch of 16 (currently). Due to this new structure the code guarantees that all pages are processed in the loop hence the code to deal with any leftovers is eliminated. This also enable the compiler to inline __extent_readpages: ./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160) Function old new delta extent_readpages 476 1136 +660 __extent_readpages 820 - -820 Total: Before=44315, After=44155, chg -0.36% Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove 1st shrink/grow phase from balanceNikolay Borisov2018-12-171-53/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first step of the rebalance process ensures there is 1MiB free on each device. This number seems rather small. And in fact when talking to the original authors their opinions were: "man that's a little bonkers" "i don't think we even need that code anymore" "I think it was there to make sure we had room for the blank 1M at the beginning. I bet it goes all the way back to v0" "we just don't need any of that tho, i say we just delete it" Clearly, this piece of code has lost its original intent throughout the years. It doesn't really bring any real practical benefits to the relocation process. Additionally, this patch makes the balance process more lightweight by removing a pair of shrink/grow operations which are rather expensive for heavily populated filesystems. This is mainly due to shrink requiring relocating block groups, involving heavy use of the btree. The intermediate shrink/grow can fail and leave the filesystem in a middle state that would need to be changed back by the user. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: send, fix race with transaction commits that create snapshotsFilipe Manana2018-12-171-6/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we create a snapshot of a snapshot currently being used by a send operation, we can end up with send failing unexpectedly (returning -ENOENT error to user space for example). The following diagram shows how this happens. CPU 1 CPU2 CPU3 btrfs_ioctl_send() (...) create_snapshot() -> creates snapshot of a root used by the send task btrfs_commit_transaction() create_pending_snapshot() __get_inode_info() btrfs_search_slot() btrfs_search_slot_get_root() down_read commit_root_sem get reference on eb of the commit root -> eb with bytenr == X up_read commit_root_sem btrfs_cow_block(root node) btrfs_free_tree_block() -> creates delayed ref to free the extent btrfs_run_delayed_refs() -> runs the delayed ref, adds extent to fs_info->pinned_extents btrfs_finish_extent_commit() unpin_extent_range() -> marks extent as free in the free space cache transaction commit finishes btrfs_start_transaction() (...) btrfs_cow_block() btrfs_alloc_tree_block() btrfs_reserve_extent() -> allocates extent at bytenr == X btrfs_init_new_buffer(bytenr X) btrfs_find_create_tree_block() alloc_extent_buffer(bytenr X) find_extent_buffer(bytenr X) -> returns existing eb, which the send task got (...) -> modifies content of the eb with bytenr == X -> uses an eb that now belongs to some other tree and no more matches the commit root of the snapshot, resuts will be unpredictable The consequences of this race can be various, and can lead to searches in the commit root performed by the send task failing unexpectedly (unable to find inode items, returning -ENOENT to user space, for example) or not failing because an inode item with the same number was added to the tree that reused the metadata extent, in which case send can behave incorrectly in the worst case or just fail later for some reason. Fix this by performing a copy of the commit root's extent buffer when doing a search in the context of a send operation. CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e2e9: Btrfs: move get root out of btrfs_search_slot to a helper CC: stable@vger.kernel.org # 4.4.x: f9ddfd0592a: Btrfs: remove unused check of skip_locking CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: use nofs context when initializing security xattrs to avoid deadlockFilipe Manana2018-12-171-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | When initializing the security xattrs, we are holding a transaction handle therefore we need to use a GFP_NOFS context in order to avoid a deadlock with reclaim in case it's triggered. Fixes: 39a27ec1004e8 ("btrfs: use GFP_KERNEL for xattr and acl allocations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: run delayed items before dropping the snapshotJosef Bacik2018-12-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With my delayed refs patches in place we started seeing a large amount of aborts in __btrfs_free_extent: BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0 Call Trace: ? btrfs_merge_delayed_refs+0xaf/0x340 __btrfs_run_delayed_refs+0x6ea/0xfc0 ? btrfs_set_path_blocking+0x31/0x60 btrfs_run_delayed_refs+0xeb/0x180 btrfs_commit_transaction+0x179/0x7f0 ? btrfs_check_space_for_delayed_refs+0x30/0x50 ? should_end_transaction.isra.19+0xe/0x40 btrfs_drop_snapshot+0x41c/0x7c0 btrfs_clean_one_deleted_snapshot+0xb5/0xd0 cleaner_kthread+0xf6/0x120 kthread+0xf8/0x130 ? btree_invalidatepage+0x90/0x90 ? kthread_bind+0x10/0x10 ret_from_fork+0x35/0x40 This was because btrfs_drop_snapshot depends on the root not being modified while it's dropping the snapshot. It will unlock the root node (and really every node) as it walks down the tree, only to re-lock it when it needs to do something. This is a problem because if we modify the tree we could cow a block in our path, which frees our reference to that block. Then once we get back to that shared block we'll free our reference to it again, and get ENOENT when trying to lookup our extent reference to that block in __btrfs_free_extent. This is ultimately happening because we have delayed items left to be processed for our deleted snapshot _after_ all of the inodes are closed for the snapshot. We only run the delayed inode item if we're deleting the inode, and even then we do not run the delayed insertions or delayed removals. These can be run at any point after our final inode does its last iput, which is what triggers the snapshot deletion. We can end up with the snapshot deletion happening and then have the delayed items run on that file system, resulting in the above problem. This problem has existed forever, however my patches made it much easier to hit as I wake up the cleaner much more often to deal with delayed iputs, which made us more likely to start the snapshot dropping work before the transaction commits, which is when the delayed items would generally be run. Before, generally speaking, we would run the delayed items, commit the transaction, and wakeup the cleaner thread to start deleting snapshots, which means we were less likely to hit this problem. You could still hit it if you had multiple snapshots to be deleted and ended up with lots of delayed items, but it was definitely harder. Fix for now by simply running all the delayed items before starting to drop the snapshot. We could make this smarter in the future by making the delayed items per-root, and then simply drop any delayed items for roots that we are going to delete. But for now just a quick and easy solution is the safest. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
OpenPOWER on IntegriCloud