summaryrefslogtreecommitdiffstats
path: root/fs/xfs
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | xfs: defer agfl frees from inode inactivationBrian Foster2018-05-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS inode chunks are already freed via deferred operations (which now also defer AGFL block frees), but inode btree blocks are freed directly in the associated context. This has been known to lead to log reservation overruns in particular workloads where an inobt block free may require several AGFL block frees (and thus several allocation btree modifications) before the inobt block itself is actually freed. To avoid this problem, defer the frees of any AGFL blocks before the inobt block free takes place. This requires passing the dfops from xfs_inactive_ifree() down through the inobt ->[alloc|free]_block() callouts, which essentially only requires to attach the dfops to the transaction since it is already carried all the way through to the inobt update and allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: defer agfl block frees from deferred ops processing contextBrian Foster2018-05-091-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that AGFL block frees are deferred when dfops is set in the transaction, start deferring AGFL block frees from contexts that are known to push the limits of existing log reservations. The first such context is deferred operation processing itself. This primarily targets deferred extent frees (such as file extents and inode chunks), but in doing so covers all allocation operations that occur in deferred operation processing context. Update xfs_defer_finish() to set and reset ->t_agfl_dfops across the processing sequence. This means that any AGFL block frees due to allocation events result in the addition of new EFIs to the dfops rather than being processed immediately. xfs_defer_finish() rolls the transaction at least once more to process the frees of the AGFL blocks back to the allocation btrees and returns once the AGFL is rectified. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: defer agfl block frees when dfops is availableBrian Foster2018-05-096-7/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The AGFL fixup code executes before every block allocation/free and rectifies the AGFL based on the current, dynamic allocation requirements of the fs. The AGFL must hold a minimum number of blocks to satisfy a worst case split of the free space btrees caused by the impending allocation operation. The AGFL is also updated to maintain the implicit requirement for a minimum number of free slots to satisfy a worst case join of the free space btrees. Since the AGFL caches individual blocks, AGFL reduction typically involves multiple, single block frees. We've had reports of transaction overrun problems during certain workloads that boil down to AGFL reduction freeing multiple blocks and consuming more space in the log than was reserved for the transaction. Since the objective of freeing AGFL blocks is to ensure free AGFL free slots are available for the upcoming allocation, one way to address this problem is to release surplus blocks from the AGFL immediately but defer the free of those blocks (similar to how file-mapped blocks are unmapped from the file in one transaction and freed via a deferred operation) until the transaction is rolled. This turns AGFL reduction into an operation with predictable log reservation consumption. Add the capability to defer AGFL block frees when a deferred ops list is available to the AGFL fixup code. Add a dfops pointer to the transaction to carry dfops through various contexts to the allocator context. Deferring AGFL frees is conditional behavior based on whether the transaction pointer is populated. The long term objective is to reuse the transaction pointer to clean up all unrelated callchains that pass dfops on the stack along with a transaction and in doing so, consistently defer AGFL blocks from the allocator. A bit of customization is required to handle deferred completion processing because AGFL blocks are accounted against a per-ag reservation pool and AGFL blocks are not inserted into the extent busy list when freed (they are inserted when used and released back to the AGFL). Reuse the majority of the existing deferred extent free infrastructure and customize it appropriately to handle AGFL blocks. Note that this patch only adds infrastructure. It does not change behavior because no callers have been updated to pass ->t_agfl_dfops into the allocation code. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: create agfl block free helper functionBrian Foster2018-05-092-10/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor the AGFL block free code into a new helper such that it can be invoked from deferred context. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: print specific dqblk that failed verifiersEric Sandeen2018-05-091-19/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than printing the top of the buffer that held a corrupted dqblk, restructure things to print out the specific one that failed by pushing the calls to the verifier_error function down into the verifier which iterates over the buffer and detects the error. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: add full xfs_dqblk verifierEric Sandeen2018-05-094-11/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an xfs_dqblk verifier so that it can check the uuid on V5 filesystems; it calls the existing xfs_dquot_verify verifier to validate the xfs_disk_dquot_t contained inside it. This lets us move the uuid verification out of the crc verifier, which makes little sense. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: pass full xfs_dqblk to repair during quotacheckEric Sandeen2018-05-093-14/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's a bit dicey to pass in the smaller xfs_disk_dquot and then cast it to something larger; pass in the full xfs_dqblk so we know the caller has sent us the right thing. Rename the function to xfs_dqblk_repair for clarity. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: check type in quota verifier during quotacheckEric Sandeen2018-05-091-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During quotacheck we send in the quota type, so verify that as well. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: remove unused flags arg from xfs_dquot_verifyEric Sandeen2018-05-095-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Long ago the flags argument was used to determine whether to issue warnings about corruptions, but that's done elsewhere now and the flag is unused here, so remove it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: clean up locking in xfs_file_iomap_beginDave Chinner2018-05-091-37/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than checking what kind of locking is needed in a helper function and then jumping through hoops to do the locking in line, move the locking to the helper function that does all the checks and rename it to xfs_ilock_for_iomap(). This also allows us to hoist all the nonblocking checks up into the locking helper, further simplifier the code flow in xfs_file_iomap_begin() and making it easier to understand. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: simplify xfs_file_iomap_begin() logicDave Chinner2018-05-091-36/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current logic that determines whether allocation should be done has grown somewhat spaghetti like with the addition of IOMAP_NOWAIT functionality. Separate out each of the different cases into single, obvious checks to get rid most of the nested IOMAP_NOWAIT checks in the allocation logic. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | iomap: iomap_dio_rw() handles all sync writesDave Chinner2018-05-091-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently iomap_dio_rw() only handles (data)sync write completions for AIO. This means we can't optimised non-AIO IO to minimise device flushes as we can't tell the caller whether a flush is required or not. To solve this problem and enable further optimisations, make iomap_dio_rw responsible for data sync behaviour for all IO, not just AIO. In doing so, the sync operation is now accounted as part of the DIO IO by inode_dio_end(), hence post-IO data stability updates will no long race against operations that serialise via inode_dio_wait() such as truncate or hole punch. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: move generic_write_sync calls inwardsDave Chinner2018-05-091-15/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To prepare for iomap iinfrastructure based DSYNC optimisations. While moving the code araound, move the XFS write bytes metric update for direct IO into xfs_dio_write_end_io callback so that we always capture the amount of data written via AIO+DIO. This fixes the problem where queued AIO+DIO writes are not accounted to this metric. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: don't retry xfs_buf_find on XBF_TRYLOCK failureDave Chinner2018-05-091-28/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When looking at an event trace recently, I noticed that non-blocking buffer lookup attempts would fail on cached locked buffers and then run the slow cache-miss path. This means we are doing an xfs_buf allocation, lookup and free unnecessarily every time we avoid blocking on a locked buffer. Fix this by changing _xfs_buf_find() to return an error status to the caller to indicate that we failed the lock attempt rather than just returning a NULL. This allows the higher level code to discriminate between a cache miss and an cache hit that we failed to lock. This also allows us to return a -EFSCORRUPTED state if we are asked to look up a block number outside the range of the filesystem in _xfs_buf_find(), which moves us one step closer to being able to handle such errors in a more graceful manner at the higher levels. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: make xfs_buf_incore out of lineDave Chinner2018-05-094-23/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move xfs_buf_incore out of line and make it the only way to look up a buffer in the buffer cache from outside the buffer cache. Convert the external users of _xfs_buf_find() to xfs_buf_incore() and make _xfs_buf_find() static. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: actually rename xfs_incore -> xfs_buf_incore] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: trace ATTR flags in xattr tracepointsEric Sandeen2018-05-091-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This will trace i.e. the ATTR_SECURE/ATTR_CREATE/ATTR_REPLACE flags as well as the OP_FLAGS. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: validate allocated inode numberDave Chinner2018-05-091-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we have corrupted free inode btrees, we can attempt to allocate inodes that we know are already allocated. Catch allocation of these inodes and report corruption as early as possible to prevent corruption propagation or deadlocks. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: validate cached inodes are free when allocatedDave Chinner2018-05-091-25/+48
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent fuzzed filesystem image cached random dcache corruption when the reproducer was run. This often showed up as panics in lookup_slow() on a null inode->i_ops pointer when doing pathwalks. BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 .... Call Trace: lookup_slow+0x44/0x60 walk_component+0x3dd/0x9f0 link_path_walk+0x4a7/0x830 path_lookupat+0xc1/0x470 filename_lookup+0x129/0x270 user_path_at_empty+0x36/0x40 path_listxattr+0x98/0x110 SyS_listxattr+0x13/0x20 do_syscall_64+0xf5/0x280 entry_SYSCALL_64_after_hwframe+0x42/0xb7 but had many different failure modes including deadlocks trying to lock the inode that was just allocated or KASAN reports of use-after-free violations. The cause of the problem was a corrupt INOBT on a v4 fs where the root inode was marked as free in the inobt record. Hence when we allocated an inode, it chose the root inode to allocate, found it in the cache and re-initialised it. We recently fixed a similar inode allocation issue caused by inobt record corruption problem in xfs_iget_cache_miss() in commit ee457001ed6c ("xfs: catch inode allocation state mismatch corruption"). This change adds similar checks to the cache-hit path to catch it, and turns the reproducer into a corruption shutdown situation. Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix typos in comment] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | Merge branch 'work.lookup' of ↵Linus Torvalds2018-06-041-8/+8
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache lookup cleanups from Al Viro: "Cleaning ->lookup() instances up - mostly d_splice_alias() conversions" * 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (29 commits) switch the rest of procfs lookups to d_splice_alias() procfs: switch instantiate_t to d_splice_alias() don't bother with tid_fd_revalidate() in lookups proc_lookupfd_common(): don't bother with instantiate unless the file is open procfs: get rid of ancient BS in pid_revalidate() uses cifs_lookup(): switch to d_splice_alias() cifs_lookup(): cifs_get_inode_...() never returns 0 with *inode left NULL 9p: unify paths in v9fs_vfs_lookup() ncp_lookup(): use d_splice_alias() hfsplus: switch to d_splice_alias() hfs: don't allow mounting over .../rsrc hfs: use d_splice_alias() omfs_lookup(): report IO errors, use d_splice_alias() orangefs_lookup: simplify openpromfs: switch to d_splice_alias() xfs_vn_lookup: simplify a bit adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias() adfs_lookup_byname: .. *is* taken care of in fs/namei.c romfs_lookup: switch to d_splice_alias() qnx6_lookup: switch to d_splice_alias() ...
| * | | xfs_vn_lookup: simplify a bitAl Viro2018-05-221-8/+8
| | |/ | |/| | | | | | | | | | | | | | | | have all post-xfs_lookup() branches converge on d_splice_alias() Cc: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'hch.procfs' of ↵Linus Torvalds2018-06-041-30/+3
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull procfs updates from Al Viro: "Christoph's proc_create_... cleanups series" * 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits) xfs, proc: hide unused xfs procfs helpers isdn/gigaset: add back gigaset_procinfo assignment proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields tty: replace ->proc_fops with ->proc_show ide: replace ->proc_fops with ->proc_show ide: remove ide_driver_proc_write isdn: replace ->proc_fops with ->proc_show atm: switch to proc_create_seq_private atm: simplify procfs code bluetooth: switch to proc_create_seq_data netfilter/x_tables: switch to proc_create_seq_private netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data neigh: switch to proc_create_seq_data hostap: switch to proc_create_{seq,single}_data bonding: switch to proc_create_seq_data rtc/proc: switch to proc_create_single_data drbd: switch to proc_create_single resource: switch to proc_create_seq_data staging/rtl8192u: simplify procfs code jfs: simplify procfs code ...
| * | | xfs, proc: hide unused xfs procfs helpersArnd Bergmann2018-05-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These two functions now trigger a warning when CONFIG_PROC_FS is disabled: fs/xfs/xfs_stats.c:128:12: error: 'xqmstat_proc_show' defined but not used [-Werror=unused-function] static int xqmstat_proc_show(struct seq_file *m, void *v) ^~~~~~~~~~~~~~~~~ fs/xfs/xfs_stats.c:118:12: error: 'xqm_proc_show' defined but not used [-Werror=unused-function] static int xqm_proc_show(struct seq_file *m, void *v) ^~~~~~~~~~~~~ Previously, they were referenced from an unused 'static const' structure, which is silently dropped by gcc. We can address the warning by adding the same #ifdef around them that hides the reference. Fixes: 3f3942aca6da ("proc: introduce proc_create_single{,_data}") Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | proc: introduce proc_create_single{,_data}Christoph Hellwig2018-05-161-29/+2
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | Variants of proc_create{,_data} that directly take a seq_file show callback and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>
* / | xfs: convert to bioset_init()/mempool_init()Kent Overstreet2018-05-303-8/+7
|/ / | | | | | | | | | | | | | | | | Convert XFS to embedded bio sets. Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | xfs: cap the length of deduplication requestsDarrick J. Wong2018-05-021-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | Since deduplication potentially has to read in all the pages in both files in order to compare the contents, cap the deduplication request length at MAX_RW_COUNT/2 (roughly 1GB) so that we have /some/ upper bound on the request length and can't just lock up the kernel forever. Found by running generic/304 after commit 1ddae54555b62 ("common/rc: add missing 'local' keywords"). Reported-by: matorola@gmail.com Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
* | xfs: don't fail when converting shortform attr to long form during ATTR_REPLACEDarrick J. Wong2018-04-171-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kanda Motohiro reported that expanding a tiny xattr into a large xattr fails on XFS because we remove the tiny xattr from a shortform fork and then try to re-add it after converting the fork to extents format having not removed the ATTR_REPLACE flag. This fails because the attr is no longer present, causing a fs shutdown. This is derived from the patch in his bug report, but we really shouldn't ignore a nonzero retval from the remove call. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119 Reported-by: kanda.motohiro@gmail.com Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | xfs: prevent creating negative-sized file via INSERT_RANGEDarrick J. Wong2018-04-171-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During the "insert range" fallocate operation, i_size grows by the specified 'len' bytes. XFS verifies that i_size + len < s_maxbytes, as it should. But this comparison is done using the signed 'loff_t', and 'i_size + len' can wrap around to a negative value, causing the check to incorrectly pass, resulting in an inode with "negative" i_size. This is possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX. ext4 and f2fs don't run into this because they set a smaller s_maxbytes. Fix it by using subtraction instead. Reproducer: xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096" Fixes: a904b1ca5751 ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate") Cc: <stable@vger.kernel.org> # v4.1+ Originally-From: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix signed integer addition overflow too] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | xfs: set format back to extents if xfs_bmap_extents_to_btreeEric Sandeen2018-04-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If xfs_bmap_extents_to_btree fails in a mode where we call xfs_iroot_realloc(-1) to de-allocate the root, set the format back to extents. Otherwise we can assume we can dereference ifp->if_broot based on the XFS_DINODE_FMT_BTREE format, and crash. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | xfs: enhance dinode verifierEric Sandeen2018-04-171-0/+21
|/ | | | | | | | | | | | | | | | | | Add several more validations to xfs_dinode_verify: - For LOCAL data fork formats, di_nextents must be 0. - For LOCAL attr fork formats, di_anextents must be 0. - For inodes with no attr fork offset, - format must be XFS_DINODE_FMT_EXTENTS if set at all - di_anextents must be 0. Thanks to dchinner for pointing out a couple related checks I had forgotten to add. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199377 Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* Merge tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2018-04-1237-205/+172
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more xfs updates from Darrick Wong: "Most of these are code cleanups, but there are a couple of notable use-after-free bug fixes. This series has been run through a full xfstests run over the week and through a quick xfstests run against this morning's master, with no major failures reported. - clean up unnecessary function call parameters - fix a use-after-free bug when aborting logging intents - refactor filestreams state data to avoid use-after-free bug - fix incorrect removal of cow extents when truncating extended attributes. - refactor open-coded __set_page_dirty in favor of using vfs function. - fix a deadlock when fstrim and fs shutdown race" * tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: Force log to disk before reading the AGF during a fstrim Export __set_page_dirty xfs: only cancel cow blocks when truncating the data fork xfs: non-scrub - remove unused function parameters xfs: remove filestream item xfs_inode reference xfs: fix intent use-after-free on abort xfs: Remove "committed" argument of xfs_dir_ialloc
| * Force log to disk before reading the AGF during a fstrimCarlos Maiolino2018-04-101-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Forcing the log to disk after reading the agf is wrong, we might be calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held. This can cause a deadlock when racing a fstrim with a filesystem shutdown. The deadlock has been identified due a miscalculation bug in device-mapper dm-thin, which returns lack of space to its users earlier than the device itself really runs out of space, changing the device-mapper volume into an error state. The problem happened while filling the filesystem with a single file, triggering the bug in device-mapper, consequently causing an IO error and shutting down the filesystem. If such file is removed, and fstrim executed before the XFS finishes the shut down process, the fstrim process will end up holding the buffer lock, and going to sleep on the cil wait queue. At this point, the shut down process will try to wake up all the threads waiting on the cil wait queue, but for this, it will try to hold the same buffer log already held my the fstrim, locking up the filesystem. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * Export __set_page_dirtyMatthew Wilcox2018-04-101-13/+2
| | | | | | | | | | | | | | | | | | | | XFS currently contains a copy-and-paste of __set_page_dirty(). Export it from buffer.c instead. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: only cancel cow blocks when truncating the data forkDarrick J. Wong2018-04-101-6/+8
| | | | | | | | | | | | | | | | | | In xfs_itruncate_extents, only cancel cow blocks and clear the reflink flag if we were asked to truncate the data fork. Attr fork blocks cannot be shared, so this makes no sense. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * xfs: non-scrub - remove unused function parametersEric Sandeen2018-04-0927-76/+48
| | | | | | | | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: remove filestream item xfs_inode referenceChristoph Hellwig2018-04-094-24/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The filestreams allocator stores an xfs_fstrm_item structure in the MRU to cache inode number to agno mappings for a particular length of time. Each xfs_fstrm_item contains the internal MRU structure, an inode pointer and agno value. The inode pointer stored in the xfs_fstrm_item is not referenced, however, which means the inode itself can be removed and reclaimed before the MRU item is freed. If this occurs, xfs_fstrm_free_func() can access freed or unrelated memory through xfs_fstrm_item->ip and crash. The obvious solution is to grab an inode reference for xfs_fstrm_item. The filestream mechanism only actually uses the inode pointer as a means to access the xfs_mount, however. Rather than add unnecessary complexity, simplify the implementation to store an xfs_mount pointer in struct xfs_mru_cache, and pass it to the free callback. This also requires updates to the tracepoint class to provide the associated data via parameters rather than the inode and a minor hack to peek at the MRU key to establish the inode number at free time. Based on debugging work and an earlier patch from Brian Foster, who also wrote most of this changelog. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: fix intent use-after-free on abortDave Chinner2018-04-024-76/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an intent is aborted during it's initial commit through xfs_defer_trans_abort(), there is a use after free. The current report is for a RUI through this path in generic/388: Freed by task 6274: __kasan_slab_free+0x136/0x180 kmem_cache_free+0xe7/0x4b0 xfs_trans_free_items+0x198/0x2e0 __xfs_trans_commit+0x27f/0xcc0 xfs_trans_roll+0x17b/0x2a0 xfs_defer_trans_roll+0x6ad/0xe60 xfs_defer_finish+0x2a6/0x2140 xfs_alloc_file_space+0x53a/0xf90 xfs_file_fallocate+0x5c6/0xac0 vfs_fallocate+0x2f5/0x930 ioctl_preallocate+0x1dc/0x320 do_vfs_ioctl+0xfe4/0x1690 The problem is that the RUI has two active references - one in the current transaction, and another held by the defer_ops structure that is passed to the RUD (intent done) so that both the intent and the intent done structures are freed on commit of the intent done. Hence during abort, we need to release the intent item, because the defer_ops reference is released separately via ->abort_intent callback. Fix all the intent code to do this correctly. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: Remove "committed" argument of xfs_dir_iallocChandan Rajendra2018-04-024-16/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_dir_ialloc() rolls the current transaction when allocation of a new inode required the space manager to perform an allocation and replinish the Inode btree. None of the callers of xfs_dir_ialloc() need to know if the transaction was committed. Hence this commit removes the "committed" argument of xfs_dir_ialloc. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | export __set_page_dirtyMatthew Wilcox2018-04-111-13/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS currently contains a copy-and-paste of __set_page_dirty(). Export it from buffer.c instead. Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Dave Chinner <david@fromorbit.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'libnvdimm-for-4.17' of ↵Linus Torvalds2018-04-103-17/+23
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "This cycle was was not something I ever want to repeat as there were several late changes that have only now just settled. Half of the branch up to commit d2c997c0f145 ("fs, dax: use page->mapping to warn...") have been in -next for several releases. The of_pmem driver and the address range scrub rework were late arrivals, and the dax work was scaled back at the last moment. The of_pmem driver missed a previous merge window due to an oversight. A sense of obligation to rectify that miss is why it is included for 4.17. It has acks from PowerPC folks. Stephen reported a build failure that only occurs when merging it with your latest tree, for now I have fixed that up by disabling modular builds of of_pmem. A test merge with your tree has received a build success report from the 0day robot over 156 configs. An initial version of the ARS rework was submitted before the merge window. It is self contained to libnvdimm, a net code reduction, and passing all unit tests. The filesystem-dax changes are based on the wait_var_event() functionality from tip/sched/core. However, late review feedback showed that those changes regressed truncate performance to a large degree. The branch was rewound to drop the truncate behavior change and now only includes preparation patches and cleanups (with full acks and reviews). The finalization of this dax-dma-vs-trnucate work will need to wait for 4.18. Summary: - A rework of the filesytem-dax implementation provides for detection of unmap operations (truncate / hole punch) colliding with in-progress device-DMA. A fix for these collisions remains a work-in-progress pending resolution of truncate latency and starvation regressions. - The of_pmem driver expands the users of libnvdimm outside of x86 and ACPI to describe an implementation of persistent memory on PowerPC with Open Firmware / Device tree. - Address Range Scrub (ARS) handling is completely rewritten to account for the fact that ARS may run for 100s of seconds and there is no platform defined way to cancel it. ARS will now no longer block namespace initialization. - The NVDIMM Namespace Label implementation is updated to handle label areas as small as 1K, down from 128K. - Miscellaneous cleanups and updates to unit test infrastructure" * tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits) libnvdimm, of_pmem: workaround OF_NUMA=n build error nfit, address-range-scrub: add module option to skip initial ars nfit, address-range-scrub: rework and simplify ARS state machine nfit, address-range-scrub: determine one platform max_ars value powerpc/powernv: Create platform devs for nvdimm buses doc/devicetree: Persistent memory region bindings libnvdimm: Add device-tree based driver libnvdimm: Add of_node to region and bus descriptors libnvdimm, region: quiet region probe libnvdimm, namespace: use a safe lookup for dimm device name libnvdimm, dimm: fix dpa reservation vs uninitialized label area libnvdimm, testing: update the default smart ctrl_temperature libnvdimm, testing: Add emulation for smart injection commands nfit, address-range-scrub: introduce nfit_spa->ars_state libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device' nfit, address-range-scrub: fix scrub in-progress reporting dax, dm: allow device-mapper to operate without dax support dax: introduce CONFIG_DAX_DRIVER fs, dax: use page->mapping to warn if truncate collides with a busy page ext2, dax: introduce ext2_dax_aops ...
| * \ Merge branch 'for-4.17/dax' into libnvdimm-for-nextDan Williams2018-04-093-17/+23
| |\ \
| | * | xfs, dax: introduce xfs_dax_aopsDan Williams2018-03-303-17/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for the dax implementation to start associating dax pages to inodes via page->mapping, we need to provide a 'struct address_space_operations' instance for dax. Otherwise, direct-I/O triggers incorrect page cache assumptions and warnings like the following: WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs] [..] CPU: 27 PID: 1783 Comm: dma-collision Tainted: G O 4.15.0-rc2+ #984 [..] Call Trace: set_page_dirty_lock+0x40/0x60 bio_set_pages_dirty+0x37/0x50 iomap_dio_actor+0x2b7/0x3b0 ? iomap_dio_zero+0x110/0x110 iomap_apply+0xa4/0x110 iomap_dio_rw+0x29e/0x3b0 ? iomap_dio_zero+0x110/0x110 ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs] xfs_file_dio_aio_read+0x7c/0x1a0 [xfs] xfs_file_read_iter+0xa0/0xc0 [xfs] __vfs_read+0xf9/0x170 vfs_read+0xa6/0x150 SyS_pread64+0x93/0xb0 entry_SYSCALL_64_fastpath+0x1f/0x96 ...where the default set_page_dirty() handler assumes that dirty state is being tracked in 'struct page' flags. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | | | Merge branch 'work.misc' of ↵Linus Torvalds2018-04-061-1/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff, including Christoph's I_DIRTY patches" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: move I_DIRTY_INODE to fs.h ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls fs: fold open_check_o_direct into do_dentry_open vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents vfs: make sure struct filename->iname is word-aligned get rid of pointless includes of fs_struct.h [poll] annotate SAA6588_CMD_POLL users
| * | | | vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalentsIngo Molnar2018-03-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | Merge branch 'for-linus' of ↵Linus Torvalds2018-04-051-1/+1
|\ \ \ \ \ | |_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: kfifo: fix inaccurate comment tools/thermal: tmon: fix for segfault net: Spelling s/stucture/structure/ edd: don't spam log if no EDD information is present Documentation: Fix early-microcode.txt references after file rename tracing: Block comments should align the * on each line treewide: Fix typos in printk GenWQE: Fix a typo in two comments treewide: Align function definition open/close braces
| * | | | treewide: Align function definition open/close bracesJoe Perches2018-03-261-1/+1
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some functions definitions have either the initial open brace and/or the closing brace outside of column 1. Move those braces to column 1. This allows various function analyzers like gnu complexity to work properly for these modified functions. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Nicolin Chen <nicoleotsuka@gmail.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | | xfs: do not log/recover swapext extent owner changes for deleted inodesEric Sandeen2018-03-292-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today if we run xfs_fsr and crash[1], log replay can fail because the recovery code tries to instantiate the donor inode from disk to replay the swapext, but it's been deleted and we get verifier failures when we try to read the inode off disk with i_mode == 0. This fixes both sides: We don't log the swapext change if the inode has been deleted, and we don't try to recover it either. [1] or if systemd doesn't cleanly unmount root, as it is wont to do ... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | | xfs: clean up xfs_mount allocation and dynamic initializersBrian Foster2018-03-263-13/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the generic data structures embedded in xfs_mount are dynamically initialized immediately after mp is allocated. A few fields are left out and initialized during the xfs_mountfs() sequence, after mp has been attached to the superblock. To clean this up and help prevent premature access of associated fields, refactor xfs_mount allocation and all dependent init calls into a new helper. This self-documents that all low level data structures (i.e., locks, trees, etc.) should be initialized before xfs_mount is attached to the superblock. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | | xfs: remove dead inode version setting codeDave Chinner2018-03-231-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can only get into the branch if CRCs are enabled, so there's no need to check inside the branch for CRCs being enabled.... Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | | xfs: catch inode allocation state mismatch corruptionDave Chinner2018-03-231-1/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We recently came across a V4 filesystem causing memory corruption due to a newly allocated inode being setup twice and being added to the superblock inode list twice. From code inspection, the only way this could happen is if a newly allocated inode was not marked as free on disk (i.e. di_mode wasn't zero). Running the metadump on an upstream debug kernel fails during inode allocation like so: XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inod= e.c, line: 838 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:114! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/0= 1/2014 RIP: 0010:assfail+0x28/0x30 RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202 RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30 R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10 FS: 00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000= 000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0 Call Trace: xfs_ialloc+0x383/0x570 xfs_dir_ialloc+0x6a/0x2a0 xfs_create+0x412/0x670 xfs_generic_create+0x1f7/0x2c0 ? capable_wrt_inode_uidgid+0x3f/0x50 vfs_mkdir+0xfb/0x1b0 SyS_mkdir+0xcf/0xf0 do_syscall_64+0x73/0x1a0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Extracting the inode number we crashed on from an event trace and looking at it with xfs_db: xfs_db> inode 184452204 xfs_db> p core.magic = 0x494e core.mode = 0100644 core.version = 2 core.format = 2 (extents) core.nlinkv2 = 1 core.onlink = 0 ..... Confirms that it is not a free inode on disk. xfs_repair also trips over this inode: ..... zero length extent (off = 0, fsbno = 0) in ino 184452204 correcting nextents for inode 184452204 bad attribute fork in inode 184452204, would clear attr fork bad nblocks 1 for inode 184452204, would reset to 0 bad anextents 1 for inode 184452204, would reset to 0 imap claims in-use inode 184452204 is free, would correct imap would have cleared inode 184452204 ..... disconnected inode 184452204, would move to lost+found And so we have a situation where the directory structure and the inobt thinks the inode is free, but the inode on disk thinks it is still in use. Where this corruption came from is not possible to diagnose, but we can detect it and prevent the kernel from oopsing on lookup. The reproducer now results in: $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5} mkdir: cannot create directory =E2=80=98/mnt/scratch/00=E2=80=99: File ex= ists mkdir: cannot create directory =E2=80=98/mnt/scratch/01=E2=80=99: File ex= ists mkdir: cannot create directory =E2=80=98/mnt/scratch/03=E2=80=99: Structu= re needs cleaning mkdir: cannot create directory =E2=80=98/mnt/scratch/04=E2=80=99: Input/o= utput error mkdir: cannot create directory =E2=80=98/mnt/scratch/05=E2=80=99: Input/o= utput error .... And this corruption shutdown: [ 54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not= marked free on disk [ 54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 = of file fs/xfs/xfs_trans.c. Caller xfs_create+0x425/0x670 [ 54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #= 443 [ 54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO= S 1.10.2-1 04/01/2014 [ 54.852859] Call Trace: [ 54.853531] dump_stack+0x85/0xc5 [ 54.854385] xfs_trans_cancel+0x197/0x1c0 [ 54.855421] xfs_create+0x425/0x670 [ 54.856314] xfs_generic_create+0x1f7/0x2c0 [ 54.857390] ? capable_wrt_inode_uidgid+0x3f/0x50 [ 54.858586] vfs_mkdir+0xfb/0x1b0 [ 54.859458] SyS_mkdir+0xcf/0xf0 [ 54.860254] do_syscall_64+0x73/0x1a0 [ 54.861193] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 54.862492] RIP: 0033:0x7fb73bddf547 [ 54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000= 000000000053 [ 54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73= bddf547 [ 54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffda= a55449a [ 54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a= 8670dd0 [ 54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 000000000= 00001ff [ 54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffda= a553500 [ 54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1= 024 of file fs/xfs/xfs_trans.c. Return address = ffffffff814cd050 [ 54.882790] XFS (loop0): Corruption of in-memory data detected. Shutt= ing down filesystem [ 54.884597] XFS (loop0): Please umount the filesystem and rectify the = problem(s) Note that this crash is only possible on v4 filesystemsi or v5 filesystems mounted with the ikeep mount option. For all other V5 filesystems, this problem cannot occur because we don't read inodes we are allocating from disk - we simply overwrite them with the new inode information. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Tested-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | | xfs: xfs_scrub_iallocbt_xref_rmap_inodes should use xref_set_corruptDarrick J. Wong2018-03-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In xfs_scrub_iallocbt_xref_rmap_inodes we're checking inodes against rmap records, so we should use xfs_scrub_btree_xref_set_corrupt if we encounter discrepancies here so that we know that it's a cross referencing error, not necessarily a corruption in the inobt itself. The userspace xfs_scrub program will try to repair outright corruptions in the agi/inobt prior to phase 3 so that the inode scan will proceed. If only a cross-referencing error is noted, the repair program defers the repair attempt until it can check the other space metadata at least once. It is therefore essential that the inobt scrubber can correctly distinguish between corruptions and "unable to cross-reference something else with this inobt". The same reasoning applies to "xfs: record inode buf errors as a xref error in inobt scrubber". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
OpenPOWER on IntegriCloud