summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | | btrfs: Remove redundant checks from btrfs_alloc_data_chunk_ondemandNikolay Borisov2017-08-161-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many commits ago the data space_info in alloc_data_chunk_ondemand used to be acquired from the inode. At that point commit 33b4d47f5e24 ("Btrfs: deal with NULL space info") got introduced to deal with spurios cases where the space info could be null, following a rebalance. Nowadays, however, the space info is referenced directly from the btrfs_fs_info struct which is initialised at filesystem mount time. This makes the null checks redundant, so remove them. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: Remove redundant argument of flush_spaceNikolay Borisov2017-08-161-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All callers of flush_space pass the same number for orig/num_bytes arguments. Let's remove one of the numbers and also modify the trace point to show only a single number - bytes requested. Seems that last point where the two parameters were treated differently is before the ticketed enospc rework. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: resume qgroup rescan on rw remountAleksa Sarai2017-08-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several distributions mount the "proper root" as ro during initrd and then remount it as rw before pivot_root(2). Thus, if a rescan had been aborted by a previous shutdown, the rescan would never be resumed. This issue would manifest itself as several btrfs ioctl(2)s causing the entire machine to hang when btrfs_qgroup_wait_for_completion was hit (due to the fs_info->qgroup_rescan_running flag being set but the rescan itself not being resumed). Notably, Docker's btrfs storage driver makes regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT (causing this problem to be manifested on boot for some machines). Cc: <stable@vger.kernel.org> # v3.11+ Cc: Jeff Mahoney <jeffm@suse.com> Fixes: b382a324b60f ("Btrfs: fix qgroup rescan resume on mount") Signed-off-by: Aleksa Sarai <asarai@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: clean up extraneous computations in add_delayed_refsEdmund Nadolski2017-08-161-17/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Repeating the same computation in multiple places is not necessary. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: allow backref search checks for shared extentsEdmund Nadolski2017-08-161-49/+115
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When called with a struct share_check, find_parent_nodes() will detect a shared extent and immediately return with BACKREF_SHARED_FOUND. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: add cond_resched() calls when resolving backrefsEdmund Nadolski2017-08-161-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since backref resolution is CPU-intensive, the cond_resched calls should help alleviate soft lockup occurences. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: backref, add tracepoints for prelim_ref insertion and mergingJeff Mahoney2017-08-163-58/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a tracepoint event for prelim_ref insertion and merging. For each, the ref being inserted or merged and the count of tree nodes is issued. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: add a node counter to each of the rbtreesJeff Mahoney2017-08-161-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds counters to each of the rbtrees so that we can tell how large they are growing for a given workload. These counters will be exported by tracepoints in the next patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: convert prelimary reference tracking to use rbtreesEdmund Nadolski2017-08-161-157/+285
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's been known for a while that the use of multiple lists that are periodically merged was an algorithmic problem within btrfs. There are several workloads that don't complete in any reasonable amount of time (e.g. btrfs/130) and others that cause soft lockups. The solution is to use a set of rbtrees that do insertion merging for both indirect and direct refs, with the former converting refs into the latter. The result is a btrfs/130 workload that used to take several hours now takes about half of that. This runtime still isn't acceptable and a future patch will address that by moving the rbtrees higher in the stack so the lookups can be shared across multiple calls to find_parent_nodes. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: remove ref_tree implementation from backref.cEdmund Nadolski2017-08-161-348/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit afce772e87c3 ("btrfs: fix check_shared for fiemap ioctl") added the ref_tree code in backref.c to reduce backref searching for shared extents under the FIEMAP ioctl. This code will not be compatible with the upcoming rbtree changes for improved backref searching, so this patch removes the ref_tree code. The rbtree changes will provide the equivalent functionality for FIEMAP. The above commit also introduced transaction semantics around calls to btrfs_check_shared() in order to accurately account for delayed refs. This functionality needs to be retained, so a complete revert of the above commit is not desirable. This patch therefore removes the ref_tree portion of the commit as above, however it does not remove the transaction portion. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: btrfs_check_shared should manage its own transactionEdmund Nadolski2017-08-163-33/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit afce772e87c3 ("btrfs: fix check_shared for fiemap ioctl") added transaction semantics around calls to btrfs_check_shared() in order to provide accurate accounting of delayed refs. The transaction management should be done inside btrfs_check_shared(), so that callers do not need to manage transactions individually. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: backref, cleanup __ namespace abuseJeff Mahoney2017-08-161-116/+109
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We typically use __ to indicate a helper routine that shouldn't be called directly without understanding the proper context required to do so. We use static functions to indicate that a function is private to a particular C file. The backref code uses static function and __ prefixes on nearly everything, which makes the code difficult to read and establishes a pattern for future code that shouldn't be followed. This patch drops all the unnecessary prefixes. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: backref, add unode_aux_to_inode_list helperJeff Mahoney2017-08-161-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replacing the double cast and ternary conditional with a helper makes the code easier on the eyes. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: backref, constify some argumentsJeff Mahoney2017-08-161-13/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This constifies a few buffers used in the backref code. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: constify tracepoint argumentsJeff Mahoney2017-08-163-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tracepoint arguments are all read-only. If we mark the arguments as const, we're able to keep or convert those arguments to const where appropriate. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: struct-funcs, constify readersJeff Mahoney2017-08-164-89/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have reader helpers for most of the on-disk structures that use an extent_buffer and pointer as offset into the buffer that are read-only. We should mark them as const and, in turn, allow consumers of these interfaces to mark the buffers const as well. No impact on code, but serves as documentation that a buffer is intended not to be modified. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: remove unused sectorsize memberNikolay Borisov2017-08-164-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sectorsize member of btrfs_block_group_cache is unused. So remove it, this reduces the number of holes in the struct. With patch: /* size: 856, cachelines: 14, members: 40 */ /* sum members: 837, holes: 4, sum holes: 19 */ /* bit holes: 1, sum bit holes: 29 bits */ /* last cacheline: 24 bytes */ Without patch: /* size: 864, cachelines: 14, members: 41 */ /* sum members: 841, holes: 5, sum holes: 23 */ /* bit holes: 1, sum bit holes: 29 bits */ /* last cacheline: 32 bytes */ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: Be explicit about usage of min()Nikolay Borisov2017-08-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __btrfs_alloc_chunk contains code which boils down to: ndevs = min(ndevs, devs_max) It's conditional upon devs_max not being 0. However, it cannot really be 0 since it's always set to either BTRFS_MAX_DEVS_SYS_CHUNK or BTRFS_MAX_DEVS(fs_info->chunk_root). So eliminate the condition check and use min explicitly. This has no functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: Use explicit round_down call rather than open-coding itNikolay Borisov2017-08-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: convert while loop to list_for_each_entryNikolay Borisov2017-08-161-9/+2
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | No functional changes, just make the loop a bit more readable Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds2017-09-076-37/+34
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
| * | | block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig2017-08-236-20/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | btrfs: index check-integrity state hash by a dev_tChristoph Hellwig2017-08-231-18/+13
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | We won't have the struct block_device available in the bio soon, so switch to the numerical dev_t instead of the block_device pointer for looking up the check-integrity state. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* / / Btrfs: fix blk_status_t/errno confusionOmar Sandoval2017-08-245-60/+64
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'for-4.13-part3' of ↵Linus Torvalds2017-07-283-10/+8
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes addressing problems reported by users, and there's one more regression fix" * 'for-4.13-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: round down size diff when shrinking/growing device Btrfs: fix early ENOSPC due to delalloc btrfs: fix lockup in find_free_extent with read-only block groups Btrfs: fix dir item validation when replaying xattr deletes
| * btrfs: round down size diff when shrinking/growing deviceNikolay Borisov2017-07-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Further testing showed that the fix introduced in 7dfb8be11b5d ("btrfs: Round down values which are written for total_bytes_size") was insufficient and it could still lead to discrepancies between the total_bytes in the super block and the device total bytes. So this patch also ensures that the difference between old/new sizes when shrinking/growing is also rounded down. This ensure that we won't be subtracting/adding a non-sectorsize multiples to the superblock/device total sizees. Fixes: 7dfb8be11b5d ("btrfs: Round down values which are written for total_bytes_size") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix early ENOSPC due to delallocOmar Sandoval2017-07-241-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a lot of metadata is reserved for outstanding delayed allocations, we rely on shrink_delalloc() to reclaim metadata space in order to fulfill reservation tickets. However, shrink_delalloc() has a shortcut where if it determines that space can be overcommitted, it will stop early. This made sense before the ticketed enospc system, but now it means that shrink_delalloc() will often not reclaim enough space to fulfill any tickets, leading to an early ENOSPC. (Reservation tickets don't care about being able to overcommit, they need every byte accounted for.) Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims all of the metadata it is supposed to. This fixes early ENOSPCs we were seeing when doing a btrfs receive to populate a new filesystem, as well as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs. Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure") Tested-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name> Cc: stable@vger.kernel.org Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix lockup in find_free_extent with read-only block groupsJeff Mahoney2017-07-241-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have a block group that is all of the following: 1) uncached in memory 2) is read-only 3) has a disk cache state that indicates we need to recreate the cache AND the file system has enough free space fragmentation such that the request for an extent of a given size can't be honored; AND have a single CPU core; AND it's the block group with the highest starting offset such that there are no opportunities (like reading from disk) for the loop to yield the CPU; We can end up with a lockup. The root cause is simple. Once we're in the position that we've read in all of the other block groups directly and none of those block groups can honor the request, there are no more opportunities to sleep. We end up trying to start a caching thread which never gets run if we only have one core. This *should* present as a hung task waiting on the caching thread to make some progress, but it doesn't. Instead, it degrades into a busy loop because of the placement of the read-only check. During the first pass through the loop, block_group->cached will be set to BTRFS_CACHE_STARTED and have_caching_bg will be set. Then we hit the read-only check and short circuit the loop. We're not yet in LOOP_CACHING_WAIT, so we skip that loop back before going through the loop again for other raid groups. Then we move to LOOP_CACHING_WAIT state. During the this pass through the loop, ->cached will still be BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter cache_block_group, do a lot of nothing, and return, and also set have_caching_bg again. Then we hit the read-only check and short circuit the loop. The same thing happens as before except now we DO trigger the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the top. We do this forever. There are two fixes in this patch since they address the same underlying bug. The first is to add a cond_resched to the end of the loop to ensure that the caching thread always has an opportunity to run. This will fix the soft lockup issue, but find_free_extent will still loop doing nothing until the thread has completed. The second is to move the read-only check to the top of the loop. We're never going to return an allocation within a read-only block group so we may as well skip it early. The check for ->cached == BTRFS_CACHE_ERROR would cause the same problem except that BTRFS_CACHE_ERROR is considered a "done" state and we won't re-set have_caching_bg again. Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in the testing process. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix dir item validation when replaying xattr deletesFilipe Manana2017-07-191-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We were passing an incorrect slot number to the function that validates directory items when we are replaying xattr deletes from a log tree. The correct slot is stored at variable 'i' and not at 'path->slots[0]', so the call to the validation function was only correct for the first iteration of the loop, when 'i == path->slots[0]'. After this fix, the fstest generic/066 passes again. Fixes: 8ee8c2d62d5f ("btrfs: Verify dir_item in replay_xattr_deletes") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'work.mount' of ↵Linus Torvalds2017-07-151-1/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ->s_options removal from Al Viro: "Preparations for fsmount/fsopen stuff (coming next cycle). Everything gets moved to explicit ->show_options(), killing ->s_options off + some cosmetic bits around fs/namespace.c and friends. Basically, the stuff needed to work with fsmount series with minimum of conflicts with other work. It's not strictly required for this merge window, but it would reduce the PITA during the coming cycle, so it would be nice to have those bits and pieces out of the way" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: isofs: Fix isofs_show_options() VFS: Kill off s_options and helpers orangefs: Implement show_options 9p: Implement show_options isofs: Implement show_options afs: Implement show_options affs: Implement show_options befs: Implement show_options spufs: Implement show_options bpf: Implement show_options ramfs: Implement show_options pstore: Implement show_options omfs: Implement show_options hugetlbfs: Implement show_options VFS: Don't use save/replace_mount_options if not using generic_show_options VFS: Provide empty name qstr VFS: Make get_filesystem() return the affected filesystem VFS: Clean up whitespace in fs/namespace.c and fs/super.c Provide a function to create a NUL-terminated string from unterminated data
| * | VFS: Don't use save/replace_mount_options if not using generic_show_optionsDavid Howells2017-07-061-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs, debugfs, reiserfs and tracefs call save_mount_options() and reiserfs calls replace_mount_options(), but they then implement their own ->show_options() methods and don't touch s_options, rendering the saved options unnecessary. I'm trying to eliminate s_options to make it easier to implement a context-based mount where the mount options can be passed individually over a file descriptor. Remove the calls to save/replace_mount_options() call in these cases. Signed-off-by: David Howells <dhowells@redhat.com> cc: Chris Mason <clm@fb.com> cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> cc: Steven Rostedt <rostedt@goodmis.org> cc: linux-btrfs@vger.kernel.org cc: reiserfs-devel@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'for-4.13-part2' of ↵Linus Torvalds2017-07-147-57/+88
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've identified and fixed a silent corruption (introduced by code in the first pull), a fixup after the blk_status_t merge and two fixes to incremental send that Filipe has been hunting for some time" * 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix unexpected return value of bio_readpage_error btrfs: btrfs_create_repair_bio never fails, skip error handling btrfs: cloned bios must not be iterated by bio_for_each_segment_all Btrfs: fix write corruption due to bio cloning on raid5/6 Btrfs: incremental send, fix invalid memory access Btrfs: incremental send, fix invalid path for link commands
| * | Btrfs: fix unexpected return value of bio_readpage_errorLiu Bo2017-07-142-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With blk_status_t conversion (that are now present in master), bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set %ret if it runs without problems. This fixes that unexpected return value by changing btrfs_check_repairable() to return a bool instead of updating %ret, and patch is applicable to both codebases with and without blk_status_t. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: btrfs_create_repair_bio never fails, skip error handlingDavid Sterba2017-07-142-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | As the function uses the non-failing bio allocation, we can remove error handling from the callers as well. Signed-off-by: David Sterba <dsterba@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: cloned bios must not be iterated by bio_for_each_segment_allDavid Sterba2017-07-144-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've started using cloned bios more in 4.13, there are some specifics regarding the iteration. Filipe found [1] that the raid56 iterated a cloned bio using bio_for_each_segment_all, which is incorrect. The cloned bios have wrong bi_vcnt and this could lead to silent corruptions. This patch adds assertions to all remaining bio_for_each_segment_all cases. [1] https://patchwork.kernel.org/patch/9838535/ Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix write corruption due to bio cloning on raid5/6Filipe Manana2017-07-131-8/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent changes to make bio cloning faster (added in the 4.13 merge window) by using the bio_clone_fast() API introduced a regression on raid5/6 modes, because cloned bios have an invalid bi_vcnt field (therefore it can not be used) and the raid5/6 code uses the bio_for_each_segment_all() API to iterate the segments of a bio, and this API uses a bio's bi_vcnt field. The issue is very simple to trigger by doing for example a direct IO write against a raid5 or raid6 filesystem and then attempting to read what we wrote before: $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf $ mount /dev/sdc /mnt $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar $ od -t x1 /mnt/foobar od: /mnt/foobar: read error: Input/output error For that example, the following is also reported in dmesg/syslog: [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2 [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3 [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1 A scrub will also fail correcting bad copies, mentioning the following in dmesg/syslog: [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $ [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$ [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$ [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$ [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$ [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$ [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0 [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$ [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0 [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0 [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $ [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0 [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$ [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0 [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde So fix this by using the bio_for_each_segment() API and setting before the bio's bi_iter field to the value of the corresponding btrfs bio container's saved iterator if we are processing a cloned bio in the raid5/6 code (the same code processes both cloned and non-cloned bios). This incorrect iteration of cloned bios was also causing some occasional BUG_ONs when running fstest btrfs/064, which have a trace like the following: [ 6674.416156] ------------[ cut here ]------------ [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897! [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1 [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs] [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000 [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs] [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217 [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001 [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004 [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001 [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002 [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000 [ 6674.416235] FS: 0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000 [ 6674.416235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0 [ 6674.416239] Call Trace: [ 6674.416259] __raid56_parity_recover+0xfc/0x16e [btrfs] [ 6674.416276] raid56_parity_recover+0x157/0x16b [btrfs] [ 6674.416293] btrfs_map_bio+0xe0/0x259 [btrfs] [ 6674.416310] btrfs_submit_bio_hook+0xbf/0x147 [btrfs] [ 6674.416327] end_bio_extent_readpage+0x27b/0x4a0 [btrfs] [ 6674.416331] bio_endio+0x17d/0x1b3 [ 6674.416346] end_workqueue_fn+0x3c/0x3f [btrfs] [ 6674.416362] btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs] [ 6674.416379] btrfs_endio_helper+0xe/0x10 [btrfs] [ 6674.416381] process_one_work+0x276/0x4b6 [ 6674.416384] worker_thread+0x1ac/0x266 [ 6674.416386] ? rescuer_thread+0x278/0x278 [ 6674.416387] kthread+0x106/0x10e [ 6674.416389] ? __list_del_entry+0x22/0x22 [ 6674.416391] ret_from_fork+0x27/0x40 [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97 [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90 [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]--- Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| * | Btrfs: incremental send, fix invalid memory accessFilipe Manana2017-07-061-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing an incremental send, while processing an extent that changed between the parent and send snapshots and that extent was an inline extent in the parent snapshot, it's possible to access a memory region beyond the end of leaf if the inline extent is very small and it is the first item in a leaf. An example scenario is described below. The send snapshot has the following leaf: leaf 33865728 items 33 free space 773 generation 46 owner 5 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f (...) item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53 generation 36 type 1 (regular) extent data disk byte 12791808 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53 generation 36 type 1 (regular) extent data disk byte 138170368 nr 225280 extent data offset 0 nr 225280 ram 225280 extent compression 0 (none) (...) And the parent snapshot has the following leaf: leaf 31272960 items 17 free space 17 generation 31 owner 5 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44 generation 31 type 0 (inline) inline extent data size 23 ram_bytes 613 compression 1 (zlib) (...) When computing the send stream, it is detected that the extent of inode 335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we grab the leaf from the parent snapshot and access the inline extent item. However, before jumping to the 'out' label, we access the 'offset' and 'disk_bytenr' fields of the extent item, which should not be done for inline extents since the inlined data starts at the offset of the 'disk_bytenr' field and can be very small. For example accessing the 'offset' field of the file extent item results in the following trace: [ 599.705368] general protection fault: 0000 [#1] PREEMPT SMP [ 599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$ [ 599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1 [ 599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000 [ 599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs] [ 599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286 [ 599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001 [ 599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f [ 599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098 [ 599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7 [ 599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088 [ 599.709340] FS: 00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000 [ 599.709340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0 [ 599.709340] Call Trace: [ 599.709340] btrfs_get_token_64+0x93/0xce [btrfs] [ 599.709340] ? printk+0x48/0x50 [ 599.709340] btrfs_get_64+0xb/0xd [btrfs] [ 599.709340] process_extent+0x3a1/0x1106 [btrfs] [ 599.709340] ? btree_read_extent_buffer_pages+0x5/0xef [btrfs] [ 599.709340] changed_cb+0xb03/0xb3d [btrfs] [ 599.709340] ? btrfs_get_token_32+0x7a/0xcc [btrfs] [ 599.709340] btrfs_compare_trees+0x432/0x53d [btrfs] [ 599.709340] ? process_extent+0x1106/0x1106 [btrfs] [ 599.709340] btrfs_ioctl_send+0x960/0xe26 [btrfs] [ 599.709340] btrfs_ioctl+0x181b/0x1fed [btrfs] [ 599.709340] ? trace_hardirqs_on_caller+0x150/0x1ac [ 599.709340] vfs_ioctl+0x21/0x38 [ 599.709340] ? vfs_ioctl+0x21/0x38 [ 599.709340] do_vfs_ioctl+0x611/0x645 [ 599.709340] ? rcu_read_unlock+0x5b/0x5d [ 599.709340] ? __fget+0x6d/0x79 [ 599.709340] SyS_ioctl+0x57/0x7b [ 599.709340] entry_SYSCALL_64_fastpath+0x18/0xad [ 599.709340] RIP: 0033:0x7f51945eec47 [ 599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47 [ 599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004 [ 599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700 [ 599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046 [ 599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040 [ 599.709340] ? trace_hardirqs_off_caller+0x43/0xb1 [ 599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$ [ 599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00 [ 599.762057] ---[ end trace fe00d7af61b9f49e ]--- This is because the 'offset' field starts at an offset of 37 bytes (offsetof(struct btrfs_file_extent_item, offset)), has a length of 8 bytes and therefore attemping to read it causes a 1 byte access beyond the end of the leaf, as the first item's content in a leaf is located at the tail of the leaf, the item size is 44 bytes and the offset of that field plus its length (37 + 8 = 45) goes beyond the item's size by 1 byte. So fix this by accessing the 'offset' and 'disk_bytenr' fields after jumping to the 'out' label if we are processing an inline extent. We move the reading operation of the 'disk_bytenr' field too because we have the same problem as for the 'offset' field explained above when the inline data is less then 8 bytes. The access to the 'generation' field is also moved but just for the sake of grouping access to all the fields. Fixes: e1cbfd7bf6da ("Btrfs: send, fix file hole not being preserved due to inline extent") Cc: <stable@vger.kernel.org> # v4.12+ Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | Btrfs: incremental send, fix invalid path for link commandsFilipe Manana2017-07-061-30/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In some scenarios an incremental send stream can contain link commands with an invalid target path. Such scenarios happen after moving some directory inode A, renaming a regular file inode B into the old name of inode A and finally creating a new hard link for inode B at directory inode A. Consider the following example scenario where this issue happens. Parent snapshot: . (ino 256) | |--- dir1/ (ino 257) | |--- dir2/ (ino 258) | |--- dir3/ (ino 259) | |--- file1 (ino 261) | |--- dir4/ (ino 262) | |--- dir5/ (ino 260) Send snapshot: . (ino 256) | |--- dir1/ (ino 257) |--- dir2/ (ino 258) | |--- dir3/ (ino 259) | |--- dir4 (ino 261) | |--- dir6/ (ino 263) |--- dir44/ (ino 262) |--- file11 (ino 261) |--- dir55/ (ino 260) When attempting to apply the corresponding incremental send stream, a link command contains an invalid target path which makes the receiver fail. The following is the verbose output of the btrfs receive command: receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...) utimes utimes dir1 utimes dir1/dir2/dir3 utimes rename dir1/dir2/dir3/dir4 -> o262-7-0 link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1 link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory The following steps happen during the computation of the incremental send stream the lead to this issue: 1) When processing inode 261, we orphanize inode 262 due to a name/location collision with one of the new hard links for inode 261 (created in the second step below). 2) We create one of the 2 new hard links for inode 261, the one whose location is at "dir1/dir2/dir3/dir4". 3) We then attempt to create the other new hard link for inode 261, which has inode 262 as its parent directory. Because the path for this new hard link was computed before we started processing the new references (hard links), it reflects the old name/location of inode 262, that is, it does not account for the orphanization step that happened when we started processing the new references for inode 261, whence it is no longer valid, causing the receiver to fail. So fix this issue by recomputing the full path of new references if we ended up orphanizing other inodes which are directories. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com>
* | | Merge branch 'nowait-aio-btrfs-fixup' of ↵Linus Torvalds2017-07-101-12/+14
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "This fixes a user-visible bug introduced by the nowait-aio patches merged in this cycle" * 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: nowait aio: Correct assignment of pos
| * | | btrfs: nowait aio: Correct assignment of posGoldwyn Rodrigues2017-07-101-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Assigning pos for usage early messes up in append mode, where the pos is re-assigned in generic_write_checks(). Assign pos later to get the correct position to write from iocb->ki_pos. Since check_can_nocow also uses the value of pos, we shift generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT are present in generic_write_checks(), so checking for IOCB_NOWAIT is enough. Also, put locking sequence in the fast path. This fixes a user visible bug, as reported: "apparently breaks several shell related features on my system. In zsh history stopped working, because no new entries are added anymore. I fist noticed the issue when I tried to build mplayer. It uses a shell script to generate a help_mp.h file: [...] Here is a simple testcase: % echo "foo" >> test % echo "foo" >> test % cat test foo % " Fixes: edf064e7c6fe ("btrfs: nowait aio support") CC: Jens Axboe <axboe@kernel.dk> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Link: https://lkml.kernel.org/r/20170704042306.GA274@x4 Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | Merge tag 'for-linus-v4.13-2' of ↵Linus Torvalds2017-07-071-5/+8
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling updates from Jeff Layton: "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series. The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store. For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday). This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible. But wait...it gets worse! The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0. This pile aims to do three things: 1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing 2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it. 3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation. To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland. Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here: 1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can in via individual filesystem trees. 2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too. If you're interested in the gory details, LWN has some good articles about this: https://lwn.net/Articles/718734/ https://lwn.net/Articles/724307/" * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: btrfs: minimal conversion to errseq_t writeback error reporting on fsync xfs: minimal conversion to errseq_t writeback error reporting ext4: use errseq_t based error handling for reporting data writeback errors fs: convert __generic_file_fsync to use errseq_t based reporting block: convert to errseq_t based writeback error tracking dax: set errors in mapping when writeback fails Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: new infrastructure for writeback error handling and reporting lib: add errseq_t type and infrastructure for handling it mm: don't TestClearPageError in __filemap_fdatawait_range mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails jbd2: don't clear and reset errors after waiting on writeback buffer: set errors in mapping at the time that the error occurs fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: use mapping_set_error instead of setting the flag mm: fix mapping_set_error call in me_pagecache_dirty
| * | | | btrfs: minimal conversion to errseq_t writeback error reporting on fsyncJeff Layton2017-07-061-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just check and advance the errseq_t in the file before returning, and use an errseq_t based check for writeback errors. Other internal callers of filemap_* functions are left as-is. Signed-off-by: Jeff Layton <jlayton@redhat.com>
* | | | | Merge branch 'for-4.13' of ↵Linus Torvalds2017-07-063-13/+13
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "These are the percpu changes for the v4.13-rc1 merge window. There are a couple visibility related changes - tracepoints and allocator stats through debugfs, along with __ro_after_init markings and a cosmetic rename in percpu_counter. Please note that the simple O(#elements_in_the_chunk) area allocator used by percpu allocator is again showing scalability issues, primarily with bpf allocating and freeing large number of counters. Dennis is working on the replacement allocator and the percpu allocator will be seeing increased churns in the coming cycles" * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: fix static checker warnings in pcpu_destroy_chunk percpu: fix early calls for spinlock in pcpu_stats percpu: resolve err may not be initialized in pcpu_alloc percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch percpu: add tracepoint support for percpu memory percpu: expose statistics about percpu memory via debugfs percpu: migrate percpu data structures to internal header percpu: add missing lockdep_assert_held to func pcpu_free_area mark most percpu globals as __ro_after_init
| * | | | | percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batchNikolay Borisov2017-06-203-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, percpu_counter_add is a wrapper around __percpu_counter_add which is preempt safe due to explicit calls to preempt_disable. Given how __ prefix is used in percpu related interfaces, the naming unfortunately creates the false sense that __percpu_counter_add is less safe than percpu_counter_add. In terms of context-safety, they're equivalent. The only difference is that the __ version takes a batch parameter. Make this a bit more explicit by just renaming __percpu_counter_add to percpu_counter_add_batch. This patch doesn't cause any functional changes. tj: Minor updates to patch description for clarity. Cosmetic indentation updates. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@fb.com> Cc: linux-mm@kvack.org Cc: "David S. Miller" <davem@davemloft.net>
* | | | | | Merge branch 'for-4.13-part1' of ↵Linus Torvalds2017-07-0545-1359/+1680
|\ \ \ \ \ \ | | |_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The core updates improve error handling (mostly related to bios), with the usual incremental work on the GFP_NOFS (mis)use removal, refactoring or cleanups. Except the two top patches, all have been in for-next for an extensive amount of time. User visible changes: - statx support - quota override tunable - improved compression thresholds - obsoleted mount option alloc_start Core updates: - bio-related updates: - faster bio cloning - no allocation failures - preallocated flush bios - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates - prep work for btree_inode removal - dir-item validation - qgoup fixes and updates - cleanups: - removed unused struct members, unused code, refactoring - argument refactoring (fs_info/root, caller -> callee sink) - SEARCH_TREE ioctl docs" * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits) btrfs: Remove false alert when fiemap range is smaller than on-disk extent btrfs: Don't clear SGID when inheriting ACLs btrfs: fix integer overflow in calc_reclaim_items_nr btrfs: scrub: fix target device intialization while setting up scrub context btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges btrfs: qgroup: Introduce extent changeset for qgroup reserve functions btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled btrfs: qgroup: Return actually freed bytes for qgroup release or free data btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function btrfs: qgroup: Add quick exit for non-fs extents Btrfs: rework delayed ref total_bytes_pinned accounting Btrfs: return old and new total ref mods when adding delayed refs Btrfs: always account pinned bytes when dropping a tree block ref Btrfs: update total_bytes_pinned when pinning down extents Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 btrfs: fix validation of XATTR_ITEM dir items btrfs: Verify dir_item in iterate_object_props btrfs: Check name_len before in btrfs_del_root_ref btrfs: Check name_len before reading btrfs_get_name ...
| * | | | | btrfs: Remove false alert when fiemap range is smaller than on-disk extentQu Wenruo2017-06-291-16/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4751832da990 ("btrfs: fiemap: Cache and merge fiemap extent before submit it to user") introduced a warning to catch unemitted cached fiemap extent. However such warning doesn't take the following case into consideration: 0 4K 8K |<---- fiemap range --->| |<----------- On-disk extent ------------------>| In this case, the whole 0~8K is cached, and since it's larger than fiemap range, it break the fiemap extent emit loop. This leaves the fiemap extent cached but not emitted, and caught by the final fiemap extent sanity check, causing kernel warning. This patch removes the kernel warning and renames the sanity check to emit_last_fiemap_cache() since it's possible and valid to have cached fiemap extent. Reported-by: David Sterba <dsterba@suse.cz> Reported-by: Adam Borowski <kilobyte@angband.pl> Fixes: 4751832da990 ("btrfs: fiemap: Cache and merge fiemap extent ...") Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: Don't clear SGID when inheriting ACLsJan Kara2017-06-291-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: stable@vger.kernel.org CC: linux-btrfs@vger.kernel.org CC: David Sterba <dsterba@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: fix integer overflow in calc_reclaim_items_nrChris Mason2017-06-2910-25/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with v4.12-rc6. This was because commit 70e7af244 made it possible for calc_reclaim_items_nr() to return a negative number. It's not really a bug in that commit, it just didn't go far enough down the stack to find all the possible 64->32 bit overflows. This switches calc_reclaim_items_nr() to return a u64 and changes everyone that uses the results of that math to u64 as well. Reported-by: Dave Jones <davej@codemonkey.org.uk> Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow") Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: scrub: fix target device intialization while setting up scrub contextDavid Sterba2017-06-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a helper but wrongly sets up the target device. Incidentally there's a local variable with the same name as a parameter in the previous function, so this got caught during runtime as crash in test btrfs/027. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ↵Qu Wenruo2017-06-298-46/+117
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ranges [BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
OpenPOWER on IntegriCloud