summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus-4.4' of ↵Linus Torvalds2015-11-2712-66/+219
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has Mark Fasheh's patches to fix quota accounting during subvol deletion, which we've been working on for a while now. The patch is pretty small but it's a key fix. Otherwise it's a random assortment" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix balance range usage filters in 4.4-rc btrfs: qgroup: account shared subtree during snapshot delete Btrfs: use btrfs_get_fs_root in resolve_indirect_ref btrfs: qgroup: fix quota disable during rescan Btrfs: fix race between cleaner kthread and space cache writeout Btrfs: fix scrub preventing unused block groups from being deleted Btrfs: fix race between scrub and block group deletion btrfs: fix rcu warning during device replace btrfs: Continue replace when set_block_ro failed btrfs: fix clashing number of the enhanced balance usage filter Btrfs: fix the number of transaction units needed to remove a block group Btrfs: use global reserve when deleting unused block group after ENOSPC Btrfs: tests: checking for NULL instead of IS_ERR() btrfs: fix signed overflows in btrfs_sync_file
| * btrfs: fix balance range usage filters in 4.4-rcHolger Hoffstätte2015-11-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a regression in 4.4-rc since commit bc3094673f22 (btrfs: extend balance filter usage to take minimum and maximum) in that existing (non-ranged) balance with -dusage=x no longer works; all chunks are skipped. After staring at the code for a while and wondering why a non-ranged balance would even need min and max thresholds (..which then were not set correctly, leading to the bug) I realized that the only problem was the fact that the filter functions were named wrong, thanks to patching copypasta. Simply renaming both functions lets the existing btrfs-progs call balance with -dusage=x and now the non-ranged filter function is invoked, properly using only a single chunk limit. Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Fixes: bc3094673f22 ("btrfs: extend balance filter usage to take minimum and maximum") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: qgroup: account shared subtree during snapshot deleteMark Fasheh2015-11-252-7/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.') removed our qgroup accounting during btrfs_drop_snapshot(). Predictably, this results in qgroup numbers going bad shortly after a snapshot is removed. Fix this by adding a dirty extent record when we encounter extents during our shared subtree walk. This effectively restores the functionality we had with the original shared subtree walking code in 1152651 (btrfs: qgroup: account shared subtrees during snapshot delete). The idea with the original patch (and this one) is that shared subtrees can get skipped during drop_snapshot. The shared subtree walk then allows us a chance to visit those extents and add them to the qgroup work for later processing. This ultimately makes the accounting for drop snapshot work. The new qgroup code nicely handles all the other extents during the tree walk via the ref dec/inc functions so we don't have to add actions beyond what we had originally. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: use btrfs_get_fs_root in resolve_indirect_refJosef Bacik2015-11-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The backref code will look up the fs_root we're trying to resolve our indirect refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT if the ref is 0. This isn't helpful for the qgroup stuff with snapshot delete as it won't be able to search down the snapshot we are deleting, which will cause us to miss roots. So use btrfs_get_fs_root and send false for check_ref so we can always get the root we're looking for. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: qgroup: fix quota disable during rescanJustin Maggard2015-11-251-1/+2
| | | | | | | | | | | | | | | | | | | | There's a race condition that leads to a NULL pointer dereference if you disable quotas while a quota rescan is running. To fix this, we just need to wait for the quota rescan worker to actually exit before tearing down the quota structures. Signed-off-by: Justin Maggard <jmaggard@netgear.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix race between cleaner kthread and space cache writeoutFilipe Manana2015-11-251-13/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a block group becomes unused and the cleaner kthread is currently running, we can end up getting the current transaction aborted with error -ENOENT when we try to commit the transaction, leading to the following trace: [59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]() [59779.272594] BTRFS: Transaction aborted (error -2) (...) [59779.291137] Call Trace: [59779.291621] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [59779.292543] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [59779.293435] [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [59779.295000] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [59779.296138] [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs] [59779.297663] [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [59779.299141] [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs] [59779.300359] [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs] [59779.301805] [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs] [59779.302893] [<ffffffff81196634>] sync_filesystem+0x7f/0x93 (...) [59779.318186] ---[ end trace 577e2daff90da33a ]--- The following diagram illustrates a sequence of steps leading to this problem: CPU 1 CPU 2 <at transaction N> adds bg A to list fs_info->unused_bgs adds bg B to list fs_info->unused_bgs <transaction kthread commits transaction N and wakes up the cleaner kthread> cleaner kthread delete_unused_bgs() sees bg A in list fs_info->unused_bgs btrfs_start_transaction() <transaction N + 1 starts> deletes bg A update_block_group(bg C) --> adds bg C to list fs_info->unused_bgs deletes bg B sees bg C in the list fs_info->unused_bgs btrfs_remove_chunk(bg C) btrfs_remove_block_group(bg C) --> checks if the block group is in a dirty list, and because it isn't now, it does nothing --> the block group item is deleted from the extent tree --> adds bg C to list transaction->dirty_bgs some task calls btrfs_commit_transaction(t N + 1) commit_cowonly_roots() btrfs_write_dirty_block_groups() --> sees bg C in cur_trans->dirty_bgs --> calls write_one_cache_group() which returns -ENOENT because it did not find the block group item in the extent tree --> transaction aborte with -ENOENT because write_one_cache_group() returned that error So fix this by adding a block group to the list of dirty block groups before adding it to the list of unused block groups. This happened on a stress test using fsstress plus concurrent calls to fallocate 20G and truncate (releasing part of the space allocated with fallocate). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix scrub preventing unused block groups from being deletedFilipe Manana2015-11-253-1/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently scrub can race with the cleaner kthread when the later attempts to delete an unused block group, and the result is preventing the cleaner kthread from ever deleting later the block group - unless the block group becomes used and unused again. The following diagram illustrates that race: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs and removes it from that list scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO sees the block group is already RO and therefore doesn't delete it nor adds it back to unused list So fix this by making scrub add the block group again to the list of unused block groups if the block group is still unused when it finished scrubbing it and it hasn't been removed already. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix race between scrub and block group deletionFilipe Manana2015-11-251-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Scrub can race with the cleaner kthread deleting block groups that are unused (and with relocation too) leading to a failure with error -EINVAL that gets returned to user space. The following diagram illustrates how it happens: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs sets block group to RO btrfs_remove_chunk(bg X) deletes device extents scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO (again) btrfs_remove_block_group(bg X) deletes block group from fs_info->block_group_cache_tree removes extent map from fs_info->mapping_tree scrub_chunk(offset X) searches fs_info->mapping_tree for extent map starting at offset X --> doesn't find any such extent map --> returns -EINVAL and scrub errors out to userspace with -EINVAL Fix this by dealing with an extent map lookup failure as an indicator of block group deletion. Issue reproduced with fstest btrfs/071. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: fix rcu warning during device replaceDavid Sterba2015-11-251-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The test btrfs/011 triggers a rcu warning Reviewed-by: Anand Jain <anand.jain@oracle.com> =============================== [ INFO: suspicious RCU usage. ] 4.4.0-rc1-default+ #286 Tainted: G W ------------------------------- fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 4 locks held by btrfs/28786: 0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs] 1: (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs] 2: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs] 3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs] stack backtrace: CPU: 0 PID: 28786 Comm: btrfs Tainted: G W 4.4.0-rc1-default+ #286 Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010 0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001 0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78 ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220 Call Trace: [<ffffffff8141d47b>] dump_stack+0x4f/0x74 [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140 [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs] [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80 [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs] [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs] [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs] [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs] [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs] [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0 [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs] [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0 [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0 [<ffffffff811e3336>] ? __fget_light+0x86/0xb0 [<ffffffff811e3369>] ? __fdget+0x9/0x20 [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80 [<ffffffff811d7483>] SyS_ioctl+0x53/0x80 [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f This is because of unprotected use of rcu_dereference in btrfs_scratch_superblocks. We can't add rcu locks around the whole function because we read the superblock. The fix will use the rcu string buffer directly without the rcu locking. Thi is safe as the device will not go away in the meantime. We're holding the device list mutexes. Restructuring the code to narrow down the rcu section turned out to be impossible, we need to call filp_open (through update_dev_time) on the buffer and this could call kmalloc/__might_sleep. We could call kstrdup with GFP_ATOMIC but it's not absolutely necessary. Fixes: 12b1c2637b6e (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks) Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: Continue replace when set_block_ro failedZhaolei2015-11-251-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfstests/011 failed in node with small_size filesystem. Can be reproduced by following script: DEV_LIST="/dev/vdd /dev/vde" DEV_REPLACE="/dev/vdf" do_test() { local mkfs_opt="$1" local size="$2" dmesg -c >/dev/null umount $SCRATCH_MNT &>/dev/null echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}" mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1 mount "${DEV_LIST[0]}" $SCRATCH_MNT echo -n "Writing big files" dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1 for ((i = 1; i <= size; i++)); do echo -n . /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1 done echo echo Start replace btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || { dmesg return 1 } return 0 } # Set size to value near fs size # for example, 1897 can trigger this bug in 2.6G device. # ./do_test "-d raid1 -m raid1" 1897 System will report replace fail with following warning in dmesg: [ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started [ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28 [ 135.543505] ------------[ cut here ]------------ [ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440() [ 135.545276] Modules linked in: [ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256 [ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014 [ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000 [ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4 [ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70 [ 135.550758] Call Trace: [ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55 [ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0 [ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20 [ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440 [ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0 [ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0 [ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580 [ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70 [ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90 [ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80 [ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]--- Reason: When big data writen to fs, the whole free space will be allocated for data chunk. And operation as scrub need to set_block_ro(), and when there is only one metadata chunk in system(or other metadata chunks are all full), the function will try to allocate a new chunk, and failed because no space in device. Fix: When set_block_ro failed for metadata chunk, it is not a problem because scrub_lock paused commit_trancaction in same time, and metadata are always cowed, so the on-the-fly writepages will not write data into same place with scrub/replace. Let replace continue in this case is no problem. Tested by above script, and xfstests/011, plus 100 times xfstests/070. Changelog v1->v2: 1: Add detail comments in source and commit-message. 2: Add dmesg detail into commit-message. 3: Limit return value of -ENOSPC to be passed. All suggested by: Filipe Manana <fdmanana@gmail.com> Suggested-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: fix clashing number of the enhanced balance usage filterDavid Sterba2015-11-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | I've accidentally picked an already used number for the enhanced usage filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase, no backward compatibility issues. Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: bc3094673f22 ("btrfs: extend balance filter usage to take minimum and maximum") Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix the number of transaction units needed to remove a block groupFilipe Manana2015-11-253-5/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were using only 1 transaction unit when attempting to delete an unused block group but in reality we need 3 + N units, where N corresponds to the number of stripes. We were accounting only for the addition of the orphan item (for the block group's free space cache inode) but we were not accounting that we need to delete one block group item from the extent tree, one free space item from the tree of tree roots and N device extent items from the device tree. While one unit is not enough, it worked most of the time because for each single unit we are too pessimistic and assume an entire tree path, with the highest possible heigth (8), needs to be COWed with eventual node splits at every possible level in the tree, so there was usually enough reserved space for removing all the items and adding the orphan item. However after adding the orphan item, writepages() can by called by the VM subsystem against the btree inode when we are under memory pressure, which causes writeback to start for the nodes we COWed before, this forces the operation to remove the free space item to COW again some (or all of) the same nodes (in the tree of tree roots). Even without writepages() being called, we could fail with ENOSPC because these items are located in multiple trees and one of them might have a higher heigth and require node/leaf splits at many levels, exhausting all the reserved space before removing all the items and adding the orphan. In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group"), we attempted to fix a BUG_ON due to ENOSPC when trying to add the orphan item by making the cleaner kthread reserve one transaction unit before attempting to remove the block group, but this was not enough. We had a couple user reports still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on a 4.2-rc6 kernel for example: http://www.spinics.net/lists/linux-btrfs/msg46070.html So fix this by reserving all the necessary units of metadata. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: use global reserve when deleting unused block group after ENOSPCFilipe Manana2015-11-256-26/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: tests: checking for NULL instead of IS_ERR()Dan Carpenter2015-11-251-1/+3
| | | | | | | | | | | | | | | | | | btrfs_alloc_dummy_root() return an error pointer on failure, it never returns NULL. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: fix signed overflows in btrfs_sync_fileDavid Sterba2015-11-251-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The calculation of range length in btrfs_sync_file leads to signed overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin. https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 The fsync call passes 0 and LLONG_MAX, the range length does not fit to loff_t and overflows, but the value is converted to u64 so it silently works as expected. The minimal fix is a typecast to u64, switching functions to take (start, end) instead of (start, len) would be more intrusive. Coccinelle script found that there's one more opencoded calculation of the length. <smpl> @@ loff_t start, end; @@ * end - start </smpl> CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
* | Merge branch 'for-linus-4.4' of ↵Linus Torvalds2015-11-138-146/+163
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes and cleanups from Chris Mason: "Some of this got cherry-picked from a github repo this week, but I verified the patches. We have three small scrub cleanups and a collection of fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: Use fs_info directly in btrfs_delete_unused_bgs btrfs: Fix lost-data-profile caused by balance bg btrfs: Fix lost-data-profile caused by auto removing bg btrfs: Remove len argument from scrub_find_csum btrfs: Reduce unnecessary arguments in scrub_recheck_block btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum btrfs: scrub: setup all fields for sblock_to_check btrfs: scrub: set error stats when tree block spanning stripes Btrfs: fix race when listing an inode's xattrs Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow Btrfs: fix race leading to incorrect item deletion when dropping extents Btrfs: fix sleeping inside atomic context in qgroup rescan worker Btrfs: fix race waiting for qgroup rescan worker btrfs: qgroup: exit the rescan worker during umount Btrfs: fix extent accounting for partial direct IO writes
| * btrfs: Use fs_info directly in btrfs_delete_unused_bgsZhao Lei2015-11-101-2/+2
| | | | | | | | | | | | | | | | No need to use root->fs_info in btrfs_delete_unused_bgs(), use fs_info directly instead. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: Fix lost-data-profile caused by balance bgZhao Lei2015-11-101-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reproduce: (In integration-4.3 branch) TEST_DEV=(/dev/vdg /dev/vdh) TEST_DIR=/mnt/tmp umount "$TEST_DEV" >/dev/null mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" btrfs balance start -dusage=0 $TEST_DIR btrfs filesystem usage $TEST_DIR dd if=/dev/zero of="$TEST_DIR"/file count=100 btrfs filesystem usage $TEST_DIR Result: We can see "no data chunk" in first "btrfs filesystem usage": # btrfs filesystem usage $TEST_DIR Overall: ... Metadata,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB Metadata,RAID1: Size:122.88MiB, Used:112.00KiB /dev/vdg 122.88MiB /dev/vdh 122.88MiB System,single: Size:4.00MiB, Used:0.00B /dev/vdg 4.00MiB System,RAID1: Size:8.00MiB, Used:16.00KiB /dev/vdg 8.00MiB /dev/vdh 8.00MiB Unallocated: /dev/vdg 1.06GiB /dev/vdh 1.07GiB And "data chunks changed from raid1 to single" in second "btrfs filesystem usage": # btrfs filesystem usage $TEST_DIR Overall: ... Data,single: Size:256.00MiB, Used:0.00B /dev/vdh 256.00MiB Metadata,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB Metadata,RAID1: Size:122.88MiB, Used:112.00KiB /dev/vdg 122.88MiB /dev/vdh 122.88MiB System,single: Size:4.00MiB, Used:0.00B /dev/vdg 4.00MiB System,RAID1: Size:8.00MiB, Used:16.00KiB /dev/vdg 8.00MiB /dev/vdh 8.00MiB Unallocated: /dev/vdg 1.06GiB /dev/vdh 841.92MiB Reason: btrfs balance delete last data chunk in case of no data in the filesystem, then we can see "no data chunk" by "fi usage" command. And when we do write operation to fs, the only available data profile is 0x0, result is all new chunks are allocated single type. Fix: Allocate a data chunk explicitly to ensure we don't lose the raid profile for data. Test: Test by above script, and confirmed the logic by debug output. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: Fix lost-data-profile caused by auto removing bgZhao Lei2015-11-101-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reproduce: (In integration-4.3 branch) TEST_DEV=(/dev/vdg /dev/vdh) TEST_DIR=/mnt/tmp umount "$TEST_DEV" >/dev/null mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" umount "$TEST_DEV" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" btrfs filesystem usage $TEST_DIR We can see the data chunk changed from raid1 to single: # btrfs filesystem usage $TEST_DIR Data,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB # Reason: When a empty filesystem mount with -o nospace_cache, the last data blockgroup will be auto-removed in umount. Then if we mount it again, there is no data chunk in the filesystem, so the only available data profile is 0x0, result is all new chunks are created as single type. Fix: Don't auto-delete last blockgroup for a raid type. Test: Test by above script, and confirmed the logic by debug output. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: Remove len argument from scrub_find_csumZhao Lei2015-11-101-4/+3
| | | | | | | | | | | | | | It is useless. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: Reduce unnecessary arguments in scrub_recheck_blockZhao Lei2015-11-101-20/+8
| | | | | | | | | | | | | | | | We don't need pass so many arguments for recheck sblock now, this patch cleans them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for ↵Zhao Lei2015-11-101-94/+36
| | | | | | | | | | | | | | | | | | | | scrub_recheck_block_checksum We can use existing scrub_checksum_data() and scrub_checksum_tree_block() for scrub_recheck_block_checksum(), instead of write duplicated code. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksumZhao Lei2015-11-101-0/+4
| | | | | | | | | | | | | | | | | | | | | | We should reset sblock->xxx_error stats before calling scrub_recheck_block_checksum(). Current code run correctly because all sblock are allocated by k[cz]alloc(), and the error stats are not got changed. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: scrub: setup all fields for sblock_to_checkZhao Lei2015-11-101-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | scrub_setup_recheck_block() isn't setup all necessary fields for sblock_to_check because history reason. So current code need more arguments in severial functions, and more local variables, just to passing these lacked values to necessary place. This patch setup above fields to sblock_to_check in scrub_setup_recheck_block(), for: 1: more cleanup for function arg, local variable 2: to make sblock_to_check complete, then we can use sblock_to_check without concern about some uninitialized member. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: scrub: set error stats when tree block spanning stripesZhao Lei2015-11-101-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is better to show error stats to user when we found tree block spanning stripes. On a btrfs created by old version of btrfs-convert: Before patch: # btrfs scrub start -B /dev/vdh scrub done for 8b342d35-2904-41ab-b3cb-2f929709cf47 scrub started at Tue Aug 25 21:19:09 2015 and finished after 00:00:00 total bytes scrubbed: 53.54MiB with 0 errors # dmesg ... [ 128.711434] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832 [ 128.712744] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368 ... After patch: # btrfs scrub start -B /dev/vdh scrub done for ff7f844b-7a4e-4b1a-88a9-8252ab25be1b scrub started at Tue Aug 25 21:42:29 2015 and finished after 00:00:00 total bytes scrubbed: 53.60MiB with 2 errors error details: corrected errors: 0, uncorrectable errors: 2, unverified errors: 0 ERROR: There are uncorrectable errors. # dmesg ...omit... # Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix race when listing an inode's xattrsFilipe Manana2015-11-091-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When listing a inode's xattrs we have a time window where we race against a concurrent operation for adding a new hard link for our inode that makes us not return any xattr to user space. In order for this to happen, the first xattr of our inode needs to be at slot 0 of a leaf and the previous leaf must still have room for an inode ref (or extref) item, and this can happen because an inode's listxattrs callback does not lock the inode's i_mutex (nor does the VFS does it for us), but adding a hard link to an inode makes the VFS lock the inode's i_mutex before calling the inode's link callback. If we have the following leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 XATTR_ITEM 12345), ... ] slot N - 2 slot N - 1 slot 0 The race illustrated by the following sequence diagram is possible: CPU 1 CPU 2 btrfs_listxattr() searches for key (257 XATTR_ITEM 0) gets path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() releases the path adds key (257 INODE_REF 666) to the end of leaf X (slot N), and leaf X now has N + 1 items searches for the key (257 INODE_REF 256), with path->keep_locks == 1, because that is the last key it saw in leaf X before releasing the path ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in leaf X, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 666) btrfs_listxattr's loop iteration sees that the type of the key pointed by the path is different from the type BTRFS_XATTR_ITEM_KEY and so it breaks the loop and stops looking for more xattr items --> the application doesn't get any xattr listed for our inode So fix this by breaking the loop only if the key's type is greater than BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * Btrfs: fix race leading to BUG_ON when running delalloc for nodatacowFilipe Manana2015-11-091-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we are using the NO_HOLES feature, we have a tiny time window when running delalloc for a nodatacow inode where we can race with a concurrent link or xattr add operation leading to a BUG_ON. This happens because at run_delalloc_nocow() we end up casting a leaf item of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a file extent item (struct btrfs_file_extent_item) and then analyse its extent type field, which won't match any of the expected extent types (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an explicit BUG_ON(1). The following sequence diagram shows how the race happens when running a no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 (Note the implicit hole for inode 257 regarding the [0, 8K[ range) CPU 1 CPU 2 run_dealloc_nocow() btrfs_lookup_file_extent() --> searches for a key with value (257 EXTENT_DATA 4096) in the fs/subvol tree --> returns us a path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() --> releases the path hard link added to our inode, with key (257 INODE_REF 500) added to the end of leaf X, so leaf X now has N + 1 keys --> searches for the key (257 INODE_REF 256), because it was the last key in leaf X before it released the path, with path->keep_locks set to 1 --> ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in the leaf, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 500) the loop iteration of run_dealloc_nocow() does not break out the loop and continues because the key referenced in the path at path->nodes[0] and path->slots[0] is for inode 257, its type is < BTRFS_EXTENT_DATA_KEY and its offset (500) is less then our delalloc range's end (8192) the item pointed by the path, an inode reference item, is (incorrectly) interpreted as a file extent item and we get an invalid extent type, leading to the BUG_ON(1): if (extent_type == BTRFS_FILE_EXTENT_REG || extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) } else { BUG_ON(1) } The same can happen if a xattr is added concurrently and ends up having a key with an offset smaller then the delalloc's range end. So fix this by skipping keys with a type smaller than BTRFS_EXTENT_DATA_KEY. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * Btrfs: fix race leading to incorrect item deletion when dropping extentsFilipe Manana2015-11-081-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running a stress test I got the following warning triggered: [191627.672810] ------------[ cut here ]------------ [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]() (...) [191627.701485] Call Trace: [191627.702037] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [191627.702992] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [191627.704091] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [191627.705380] [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.706637] [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c [191627.707789] [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.709155] [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0 [191627.712444] [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18 [191627.714162] [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs] [191627.715887] [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs] [191627.717287] [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs] [191627.728865] [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs] [191627.730045] [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs] [191627.731256] [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs] [191627.732661] [<ffffffff81061119>] process_one_work+0x24c/0x4ae [191627.733822] [<ffffffff810615b0>] worker_thread+0x206/0x2c2 [191627.734857] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f [191627.736052] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f [191627.737349] [<ffffffff810669a6>] kthread+0xef/0xf7 [191627.738267] [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28 [191627.739330] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad [191627.741976] [<ffffffff81465592>] ret_from_fork+0x42/0x70 [191627.743080] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad [191627.744206] ---[ end trace bbfddacb7aaada8d ]--- $ cat -n fs/btrfs/file.c 691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans, (...) 758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 759 if (key.objectid > ino || 760 key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) 761 break; 762 763 fi = btrfs_item_ptr(leaf, path->slots[0], 764 struct btrfs_file_extent_item); 765 extent_type = btrfs_file_extent_type(leaf, fi); 766 767 if (extent_type == BTRFS_FILE_EXTENT_REG || 768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) 774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) 778 } else { 779 WARN_ON(1); 780 extent_end = search_start; 781 } (...) This happened because the item we were processing did not match a file extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this case we cast the item to a struct btrfs_file_extent_item pointer and then find a type field value that does not match any of the expected values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens due to a tiny time window where a race can happen as exemplified below. For example, consider the following scenario where we're using the NO_HOLES feature and we have the following two neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather than explicit because NO_HOLES is enabled). Now if our inode has an ordered extent for the range [4K, 8K[ that is finishing, the following can happen: CPU 1 CPU 2 btrfs_finish_ordered_io() insert_reserved_file_extent() __btrfs_drop_extents() Searches for the key (257 EXTENT_DATA 4096) through btrfs_lookup_file_extent() Key not found and we get a path where path->nodes[0] == leaf X and path->slots[0] == N Because path->slots[0] is >= btrfs_header_nritems(leaf X), we call btrfs_next_leaf() btrfs_next_leaf() releases the path inserts key (257 INODE_REF 4096) at the end of leaf X, leaf X now has N + 1 keys, and the new key is at slot N btrfs_next_leaf() searches for key (257 INODE_REF 256), with path->keep_locks set to 1, because it was the last key it saw in leaf X finds it in leaf X again and notices it's no longer the last key of the leaf, so it returns 0 with path->nodes[0] == leaf X and path->slots[0] == N (which is now < btrfs_header_nritems(leaf X)), pointing to the new key (257 INODE_REF 4096) __btrfs_drop_extents() casts the item at path->nodes[0], slot path->slots[0], to a struct btrfs_file_extent_item - it does not skip keys for the target inode with a type less than BTRFS_EXTENT_DATA_KEY (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY) sees a bogus value for the type field triggering the WARN_ON in the trace shown above, and sets extent_end = search_start (4096) does the if-then-else logic to fixup 0 length extent items created by a past bug from hole punching: if (extent_end == key.offset && extent_end >= search_start) goto delete_extent_item; that evaluates to true and it ends up deleting the key pointed to by path->slots[0], (257 INODE_REF 4096), from leaf X The same could happen for example for a xattr that ends up having a key with an offset value that matches search_start (very unlikely but not impossible). So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are skipped, never casted to struct btrfs_file_extent_item and never deleted by accident. Also protect against the unexpected case of getting a key for a lower inode number by skipping that key and issuing a warning. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * Btrfs: fix sleeping inside atomic context in qgroup rescan workerFilipe Manana2015-11-051-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are holding a btree path with spinning locks and then we attempt to clone an extent buffer, which calls kmem_cache_alloc() and this function can sleep, causing the following trace to be reported on a debug kernel: [107118.218536] BUG: sleeping function called from invalid context at mm/slab.c:2871 [107118.224110] in_atomic(): 1, irqs_disabled(): 0, pid: 19148, name: kworker/u32:3 [107118.226120] INFO: lockdep is turned off. [107118.226843] Preemption disabled at:[<ffffffffa05ffa22>] btrfs_clear_lock_blocking_rw+0x96/0xea [btrfs] [107118.229175] CPU: 3 PID: 19148 Comm: kworker/u32:3 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [107118.231326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [107118.233687] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs] [107118.236835] 0000000000000000 ffff880424bf3b78 ffffffff812566f4 0000000000000000 [107118.238369] ffff880424bf3ba0 ffffffff81070664 ffffffff817f1cd5 0000000000000b37 [107118.239769] 0000000000000000 ffff880424bf3bc8 ffffffff8107070a 0000000000008850 [107118.241244] Call Trace: [107118.241729] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [107118.242602] [<ffffffff81070664>] ___might_sleep+0x23a/0x241 [107118.243586] [<ffffffff8107070a>] __might_sleep+0x9f/0xa6 [107118.244532] [<ffffffff8115af70>] cache_alloc_debugcheck_before+0x25/0x36 [107118.245939] [<ffffffff8115d52b>] kmem_cache_alloc+0x50/0x215 [107118.246930] [<ffffffffa05e627e>] __alloc_extent_buffer+0x2a/0x11f [btrfs] [107118.248121] [<ffffffffa05ecb1a>] btrfs_clone_extent_buffer+0x3d/0xdd [btrfs] [107118.249451] [<ffffffffa06239ea>] btrfs_qgroup_rescan_worker+0x16d/0x434 [btrfs] [107118.250755] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [107118.251754] [<ffffffffa05f7952>] normal_work_helper+0x14c/0x32a [btrfs] [107118.252899] [<ffffffffa05f7952>] ? normal_work_helper+0x14c/0x32a [btrfs] [107118.254195] [<ffffffffa05f7c82>] btrfs_qgroup_rescan_helper+0x12/0x14 [btrfs] [107118.255436] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [107118.263690] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [107118.264888] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [107118.267413] [<ffffffff8106904d>] kthread+0xef/0xf7 [107118.268417] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [107118.269505] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [107118.270491] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 So just use blocking locks for our path to solve this. This fixes the patch titled: "btrfs: qgroup: Don't copy extent buffer to do qgroup rescan" Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * Btrfs: fix race waiting for qgroup rescan workerFilipe Manana2015-11-051-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were initializing the completion (fs_info->qgroup_rescan_completion) object after releasing the qgroup rescan lock, which gives a small time window for a rescan waiter to not actually wait for the rescan worker to finish. Example: CPU 1 CPU 2 fs_info->qgroup_rescan_completion->done is 0 btrfs_qgroup_rescan_worker() complete_all(&fs_info->qgroup_rescan_completion) sets fs_info->qgroup_rescan_completion->done to UINT_MAX / 2 ... do some other stuff .... qgroup_rescan_init() mutex_lock(&fs_info->qgroup_rescan_lock) set flag BTRFS_QGROUP_STATUS_FLAG_RESCAN in fs_info->qgroup_flags mutex_unlock(&fs_info->qgroup_rescan_lock) btrfs_qgroup_wait_for_completion() mutex_lock(&fs_info->qgroup_rescan_lock) sees flag BTRFS_QGROUP_STATUS_FLAG_RESCAN in fs_info->qgroup_flags mutex_unlock(&fs_info->qgroup_rescan_lock) wait_for_completion_interruptible( &fs_info->qgroup_rescan_completion) fs_info->qgroup_rescan_completion->done is > 0 so it returns immediately init_completion(&fs_info->qgroup_rescan_completion) sets fs_info->qgroup_rescan_completion->done to 0 So fix this by initializing the completion object while holding the mutex fs_info->qgroup_rescan_lock. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * btrfs: qgroup: exit the rescan worker during umountJustin Maggard2015-11-052-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I was hitting a consistent NULL pointer dereference during shutdown that showed the trace running through end_workqueue_bio(). I traced it back to the endio_meta_workers workqueue being poked after it had already been destroyed. Eventually I found that the root cause was a qgroup rescan that was still in progress while we were stopping all the btrfs workers. Currently we explicitly pause balance and scrub operations in close_ctree(), but we do nothing to stop the qgroup rescan. We should probably be doing the same for qgroup rescan, but that's a much larger change. This small change is good enough to allow me to unmount without crashing. Signed-off-by: Justin Maggard <jmaggard@netgear.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
| * Btrfs: fix extent accounting for partial direct IO writesFilipe Manana2015-11-051-15/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing a write using direct IO we can end up not doing the whole write operation using the direct IO path, in that case we fallback to a buffered write to do the remaining IO. This happens for example if the range we are writing to contains a compressed extent. When we do a partial write and fallback to buffered IO, due to the existence of a compressed extent for example, we end up not adjusting the outstanding extents counter of our inode which ends up getting decremented twice, once by the DIO ordered extent for the partial write and once again by btrfs_direct_IO(), resulting in an arithmetic underflow at extent-tree.c:drop_outstanding_extent(). For example if we have: extents [ prealloc extent ] [ compressed extent ] offsets A B C D E and at the moment our inode's outstanding extents counter is 0, if we do a direct IO write against the range [B, D[ (which has a length smaller than 128Mb), we end up bumping our inode's outstanding extents counter to 1, we create a DIO ordered extent for the range [B, C[ and then fallback to a buffered write for the range [C, D[. The direct IO handler (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by 1, leaving it with a value of 0, through a call to btrfs_delalloc_release_space() and then shortly after the DIO ordered extent finishes and calls btrfs_delalloc_release_metadata() which ends up to attempt to decrement the inode's outstanding extents counter by 1, resulting in an assertion failure at drop_outstanding_extent() because the operation would result in an arithmetic underflow (0 - 1). This produces the following trace: [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526 [125471.338844] ------------[ cut here ]------------ [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173! [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs] [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000 [125471.340745] RIP: 0010:[<ffffffffa0550da1>] [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs] [125471.340745] RSP: 0018:ffff88040a11bc78 EFLAGS: 00010296 [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000 [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000 [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000 [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88 [125471.340745] FS: 0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000 [125471.340745] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0 [125471.340745] Stack: [125471.340745] ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1 [125471.340745] ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000 [125471.340745] 0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa [125471.340745] Call Trace: [125471.340745] [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs] [125471.340745] [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs] [125471.340745] [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs] [125471.340745] [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs] [125471.340745] [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs] [125471.340745] [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs] [125471.340745] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [125471.340745] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [125471.340745] [<ffffffff8106904d>] kthread+0xef/0xf7 [125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [125471.340745] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41 [125471.340745] RIP [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs] [125471.340745] RSP <ffff88040a11bc78> [125471.539620] ---[ end trace 144259f7838b4aa4 ]--- So fix this by ensuring we adjust the outstanding extents counter when we do the fallback just like we do for the case where the whole write can be done through the direct IO path. We were also adjusting the outstanding extents counter by a constant value of 1, which is incorrect because we were ignorning that we account extents in BTRFS_MAX_EXTENT_SIZE units, o fix that as well. The following test case for fstests reproduces this issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_xfs_io_command "falloc" rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create a compressed extent covering the range [700K, 800K[. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Create prealloc extent covering the range [600K, 700K[. $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo # Write 80K of data to the range [640K, 720K[ using direct IO. This # range covers both the prealloc extent and the compressed extent. # Because there's a compressed extent in the range we are writing to, # the DIO write code path ends up only writing the first 60k of data, # which goes to the prealloc extent, and then falls back to buffered IO # for writing the remaining 20K of data - because that remaining data # maps to a file range containing a compressed extent. # When falling back to buffered IO, we used to trigger an assertion when # releasing reserved space due to bad accounting of the inode's # outstanding extents counter, which was set to 1 but we ended up # decrementing it by 1 twice, once through the ordered extent for the # 60K of data we wrote using direct IO, and once through the main direct # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write # wrote less than 80K of data (60K). $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now similar test as above but for very large write operations. This # triggers special cases for an inode's outstanding extents accounting, # as internally btrfs logically splits extents into 128Mb units. $XFS_IO_PROG -f -s \ -c "pwrite -S 0xaa -b 128M 258M 128M" \ -c "falloc 0 258M" \ $SCRATCH_MNT/bar | _filter_xfs_io $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \ | _filter_xfs_io # Now verify the file contents are correct and that they are the same # even after unmounting and mounting the fs again (or evicting the page # cache). # # For file foo, all bytes in the range [0, 640K[ must have a value of # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb # and all bytes in the range [720K, 800K[ must have a value of 0xaa. # # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00, # all bytes in the range [3M, 259M[ must have a value of 0xbb and all # bytes in the range [259M, 386M[ must have a value of 0xaa. # echo "File digests before remounting the file system:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch _scratch_remount echo "File digests after remounting the file system:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch status=0 exit Fixes: e1cbbfa5f5aa ("Btrfs: fix outstanding_extents accounting in DIO") Fixes: 3e05bde8c3c2 ("Btrfs: only adjust outstanding_extents when we do a short write") Signed-off-by: Filipe Manana <fdmanana@suse.com>
* | fs/btrfs/inode.c: remove unnecessary new_valid_dev() checkYaowei Bai2015-11-091-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | new_valid_dev() always returns 1, so the !new_valid_dev() check is not needed. Remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2015-11-076-17/+16
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...
| * | mm, fs: introduce mapping_gfp_constraint()Michal Hocko2015-11-063-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are many places which use mapping_gfp_mask to restrict a more generic gfp mask which would be used for allocations which are not directly related to the page cache but they are performed in the same context. Let's introduce a helper function which makes the restriction explicit and easier to track. This patch doesn't introduce any functional changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman2015-11-063-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for-linus-4.4' of ↵Linus Torvalds2015-11-0642-1065/+2627
|\ \ \ | |/ / |/| / | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "We have a lot of subvolume quota improvements in here, along with big piles of cleanups from Dave Sterba and Anand Jain and others. Josef pitched in a batch of allocator fixes based on production use here at FB. We found that mount -o ssd_spread greatly improved our performance on hardware raid5/6, but it exposed some CPU bottlenecks in the allocator. These patches make a huge difference" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (100 commits) Btrfs: fix hole punching when using the no-holes feature Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state btrfs: Fix a data space underflow warning btrfs: qgroup: Fix a rebase bug which will cause qgroup double free btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans btrfs: clear PF_NOFREEZE in cleaner_kthread() btrfs: qgroup: Don't copy extent buffer to do qgroup rescan btrfs: add balance filters limits, stripes and usage to supported mask btrfs: extend balance filter usage to take minimum and maximum btrfs: add balance filter for stripes btrfs: extend balance filter limit to take minimum and maximum btrfs: fix use after free iterating extrefs btrfs: check unsupported filters in balance arguments Btrfs: fix regression running delayed references when using qgroups Btrfs: fix regression when running delayed references Btrfs: don't do extra bitmap search in one bit case Btrfs: keep track of largest extent in bitmaps Btrfs: don't keep trying to build clusters if we are fragmented Btrfs: cut down on loops through the allocator Btrfs: don't continue setting up space cache when enospc ...
| * Btrfs: fix hole punching when using the no-holes featureFilipe Manana2015-11-031-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we are using the no-holes feature, if we punch a hole into a file range that already contains a hole which overlaps the range we are passing to fallocate(), we end up removing the extent map that represents the existing hole without adding a new one. This happens because with the no-holes feature we do not have explicit extent items to represent holes and therefore the call to __btrfs_drop_extents(), made from btrfs_punch_hole(), returns an end offset to the variable drop_end that is smaller than the end of the range passed to fallocate(), while it drops all existing extent maps in that range. Normally having a missing extent map is not a problem, for example for a readpages() operation we just end up building the extent map by looking at the fs/subvol tree for a matching extent item (or a lack of one for implicit holes). However for an fsync that uses the fast path, which needs to look at the list of modified extent maps, this means the fsync will not record information about the complete hole we had before the fallocate() call into the log tree, resulting in a file with content/layout that does not match what we had neither before nor after the hole punch operation. The following test case for fstests reproduces the issue. It fails without this change because we get a file with a different digest after the fsync log replay and also with a different extent/hole layout. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/punch . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_xfs_io_command "fpunch" _require_xfs_io_command "fiemap" _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV # This test was motivated by an issue found in btrfs when the btrfs # no-holes feature is enabled (introduced in kernel 3.14). So enable # the feature if the fs being tested is btrfs. if [ $FSTYP == "btrfs" ]; then _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes" fi rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create out test file with some data and then fsync it. # We do the fsync only to make sure the last fsync we do in this test # triggers the fast code path of btrfs' fsync implementation, a # condition necessary to trigger the bug btrfs had. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \ -c "fsync" \ $SCRATCH_MNT/foobar | _filter_xfs_io # Now punch a hole against the range [96K, 128K[. $XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar # Punch another hole against a range that overlaps the previous range # and ends beyond eof. $XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar # Punch another hole against a range that overlaps the first range # ([96K, 128K[) and ends at eof. $XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar # Fsync our file. We want to verify that, after a power failure and # mounting the filesystem again, the file content reflects all the hole # punch operations. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar echo "File digest before power failure:" md5sum $SCRATCH_MNT/foobar | _filter_scratch echo "Fiemap before power failure:" $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap # Silently drop all writes and umount to simulate a crash/power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again, mount to trigger log replay and validate file # contents. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey echo "File digest after log replay:" # Must match the same digest we got before the power failure. md5sum $SCRATCH_MNT/foobar | _filter_scratch echo "Fiemap after log replay:" # Must match the same extent listing we got before the power failure. $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap _unmount_flakey status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT statechandan2015-11-031-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When executing generic/001 in a loop on a ppc64 machine (with both sectorsize and nodesize set to 64k), the following call trace is observed, WARNING: at /root/repos/linux/fs/btrfs/locking.c:253 Modules linked in: CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d #54 task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000 NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000 REGS: c0000000f600a820 TRAP: 0700 Not tainted (4.3.0-rc5-13676-ga5e681d) MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI> CR: 24444884 XER: 00000000 CFAR: c0000000004a3b30 SOFTE: 1 GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0 GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005 GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030 GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800 GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049 GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000 GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000 GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0 NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0 LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80 Call Trace: [c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable) [c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80 [c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00 [c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120 [c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620 [c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0 [c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420 [c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030 [c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250 [c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590 [c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780 [c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250 [c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00 [c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0 [c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570 [c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0 [c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30 [c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0 [c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30 [c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0 [c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0 [c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0 [c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100 [c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80 [c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68 Instruction dump: fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040 812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c The above call trace is seen even on x86_64; albeit very rarely and that too with nodesize set to 64k and with nospace_cache mount option being used. The reason for the above call trace is, btrfs_remove_chunk check_system_chunk Allocate chunk if required For each physical stripe on underlying device, btrfs_free_dev_extent ... Take lock on Device tree's root node btrfs_cow_block("dev tree's root node"); btrfs_reserve_extent find_free_extent index = BTRFS_RAID_DUP; have_caching_bg = false; When in LOOP_CACHING_NOWAIT state, Assume we find a block group which is being cached; Hence have_caching_bg is set to true When repeating the search for the next RAID index, we set have_caching_bg to false. Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we allocate a chunk and try to add entries corresponding to the chunk's physical stripe into the device tree. When doing so the task deadlocks itself waiting for the blocking lock on the root node of the device tree. This commit fixes the issue by introducing a new local variable to help indicate as to whether a block group of any RAID type is being cached. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: Fix a data space underflow warningQu Wenruo2015-11-031-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even with quota disabled, generic/127 will trigger a kernel warning by underflow data space info. The bug is caused by buffered write, which in case of short copy, the start parameter for btrfs_delalloc_release_space() is wrong, and round_up/down() in btrfs_delalloc_release() extents the range to page aligned, decreasing one more page than expected. This patch will fix it by passing correct start. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: qgroup: Fix a rebase bug which will cause qgroup double freeQu Wenruo2015-10-261-4/+0
| | | | | | | | | | | | | | | | | | | | | | When rebasing my patchset, I forgot to pick up a cleanup patch to remove old hotfix in 4.2 release. Witouth the cleanup, it will screw up new qgroup reserve framework and always cause minus reserved number. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: qgroup: Fix a race in delayed_ref which leads to abort transQu Wenruo2015-10-265-21/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Between btrfs_allocerved_file_extent() and btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs are run and delayed ref head maybe freed before btrfs_add_delayed_qgroup_reserve(). This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT, and cause transaction to be aborted. This patch will record qgroup reserve space info into delayed_ref_head at btrfs_add_delayed_ref(), to eliminate the race window. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: clear PF_NOFREEZE in cleaner_kthread()Jiri Kosina2015-10-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | cleaner_kthread() kthread calls try_to_freeze() at the beginning of every cleanup attempt. This operation can't ever succeed though, as the kthread hasn't marked itself as freezable. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark cleaner_kthread() freezable (as my understanding is that it can generate filesystem I/O during suspend). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: qgroup: Don't copy extent buffer to do qgroup rescanQu Wenruo2015-10-261-10/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ancient qgroup code call memcpy() on a extent buffer and use it for leaf iteration. As extent buffer contains lock, pointers to pages, it's never sane to do such copy. The following bug may be caused by this insane operation: [92098.841309] general protection fault: 0000 [#1] SMP [92098.841338] Modules linked in: ... [92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted 4.3.0-rc1 #1 [92098.841868] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs] [92098.842261] Call Trace: [92098.842277] [<ffffffffc035a5d8>] ? read_extent_buffer+0xb8/0x110 [btrfs] [92098.842304] [<ffffffffc0396d00>] ? btrfs_find_all_roots+0x60/0x70 [btrfs] [92098.842329] [<ffffffffc039af3d>] btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs] Where btrfs_qgroup_rescan_worker+0x28d is btrfs_disk_key_to_cpu(), called in reading key from the copied extent_buffer. This patch will use btrfs_clone_extent_buffer() to a better copy of extent buffer to deal such case. Reported-by: Stephane Lesimple <stephane_btrfs@lesimple.fr> Suggested-by: Filipe Manana <fdmanana@kernel.org> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: add balance filters limits, stripes and usage to supported maskDavid Sterba2015-10-261-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable the extended 'limit' syntax (a range), the new 'stripes' and extended 'usage' syntax (a range) filters in the filters mask. The patch comes separate and not within the series that introduced the new filters because the patch adding the mask was merged in a late rc. The integration branch was based on an older rc and could not merge the patch due to the missing changes. Prerequisities: * btrfs: check unsupported filters in balance arguments * btrfs: extend balance filter limit to take minimum and maximum * btrfs: add balance filter for stripes * btrfs: extend balance filter usage to take minimum and maximum Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: extend balance filter usage to take minimum and maximumDavid Sterba2015-10-263-3/+53
| | | | | | | | | | | | | | | | | | | | | | | | Similar to the 'limit' filter, we can enhance the 'usage' filter to accept a range. The change is backward compatible, the range is applied only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag. We don't have a usecase yet, the current syntax has been sufficient. The enhancement should provide parity with other range-like filters. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: add balance filter for stripesGabríel Arthúr Pétursson2015-10-263-1/+28
| | | | | | | | | | | | | | | | | | | | | | | | Balance block groups which have the given number of stripes, defined by a range min..max. This is useful to selectively rebalance only chunks that do not span enough devices, applies to RAID0/10/5/6. Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is> [ renamed bargs members, added to the UAPI, wrote the changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: extend balance filter limit to take minimum and maximumDavid Sterba2015-10-263-2/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'limit' filter is underdesigned, it should have been a range for [min,max], with some relaxed semantics when one of the bounds is missing. Besides that, using a full u64 for a single value is a waste of bytes. Let's fix both by extending the use of the u64 bytes for the [min,max] range. This can be done in a backward compatible way, the range will be interpreted only if the appropriate flag is set (BTRFS_BALANCE_ARGS_LIMIT_RANGE). Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: fix use after free iterating extrefsChris Mason2015-10-261-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code for btrfs inode-resolve has never worked properly for files with enough hard links to trigger extrefs. It was trying to get the leaf out of a path after freeing the path: btrfs_release_path(path); leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, slot); The fix here is to use the extent buffer we cloned just a little higher up to avoid deadlocks caused by using the leaf in the path. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v3.7+ cc: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: check unsupported filters in balance argumentsDavid Sterba2015-10-262-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
OpenPOWER on IntegriCloud