summaryrefslogtreecommitdiffstats
path: root/fs/xfs/libxfs
Commit message (Collapse)AuthorAgeFilesLines
...
* xfs: report fs and rt health via geometry structureDarrick J. Wong2019-04-142-1/+13
| | | | | | | | Use our newly expanded geometry structure to report the overall fs and realtime health status. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: add a new ioctl to describe allocation group geometryDarrick J. Wong2019-04-143-0/+68
| | | | | | | Add a new ioctl to describe an allocation group's geometry. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: bump XFS_IOC_FSGEOMETRY to v5 structuresDave Chinner2019-04-142-28/+64
| | | | | | | | | | | | | | | Unfortunately, the V4 XFS_IOC_FSGEOMETRY structure is out of space so we can't just add a new field to it. Hence we need to bump the definition to V5 and and treat the V4 ioctl and structure similar to v1 to v3. While doing this, clean up all the definitions associated with the XFS_IOC_FSGEOMETRY ioctl. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: forward port to 5.1, expand structure size to 256 bytes] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: clear BAD_SUMMARY if unmounting an unhealthy filesystemDarrick J. Wong2019-04-141-0/+2
| | | | | | | | | | If we know the filesystem metadata isn't healthy during unmount, we want to encourage the administrator to run xfs_repair right away. We can't do this if BAD_SUMMARY will cause an unclean log unmount to force summary recalculation, so turn it off if the fs is bad. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: replace the BAD_SUMMARY mount flag with the equivalent health codeDarrick J. Wong2019-04-141-2/+3
| | | | | | | | Replace the BAD_SUMMARY mount flag with calls to the equivalent health tracking code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: track metadata health statusDarrick J. Wong2019-04-141-0/+183
| | | | | | | | | | Add the necessary in-core metadata fields to keep track of which parts of the filesystem have been observed and which parts were observed to be unhealthy, and print a warning at unmount time if we have unfixed problems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: don't account extra agfl blocks as availableBrian Foster2019-04-141-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The block allocation AG selection code has parameters that allow a caller to perform multiple allocations from a single AG and transaction (under certain conditions). The parameters specify the total block allocation count required by the transaction and the AG selection code selects and locks an AG that will be able to satisfy the overall requirement. If the available block accounting calculation turns out to be inaccurate and a subsequent allocation call fails with -ENOSPC, the resulting transaction cancel leads to filesystem shutdown because the transaction is dirty. This exact problem can be reproduced with a highly parallel space consumer and fsstress workload running long enough to a large filesystem against -ENOSPC conditions. A bmbt block allocation request made for inode extent to bmap format conversion after an extent allocation is expected to be satisfied by the same AG and the same transaction as the extent allocation. The bmbt block allocation fails, however, because the block availability of the AG has changed since the AG was selected (outside of the blocks used for the extent itself). The inconsistent block availability calculation is caused by the deferred block freeing behavior of the AGFL. This immediately removes extra blocks from the AGFL to free up AGFL slots, but rather than immediately freeing such blocks as was done in the past, the block free is deferred such that said blocks are not available for allocation until the current transaction commits. The AG selection logic currently considers all AGFL blocks as available and executes shortly before any extra AGFL blocks are freed. This means the block availability of the current AG can change before the first allocation even occurs, but in practice a failure is more likely to manifest via a subsequent allocation because extent allocation usually has a contiguity requirement larger than a single block that can't be satisfied from the AGFL. In general, XFS prefers operational robustness to absolute allocation efficiency. In other words, we prefer to return -ENOSPC slightly earlier at the expense of not being able to allocate every last block in an AG to avoid this kind of problem. As such, update the AG block availability calculation to consider extra AGFL blocks as unavailable since they are immediately removed following the calculation and will not become available until the current transaction commits. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: always init bma in xfs_bmapi_writeDarrick J. Wong2019-03-191-5/+5
| | | | | | | | | | Always init the tp/ip fields of bma in xfs_bmapi_write so that the bmapi_finish at the bottom never trips over null transaction or inode pointers. Coverity-id: 1443964 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: don't trip over uninitialized buffer on extent read of corrupted inodeBrian Foster2019-03-171-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've had rather rare reports of bmap btree block corruption where the bmap root block has a level count of zero. The root cause of the corruption is so far unknown. We do have verifier checks to detect this form of on-disk corruption, but this doesn't cover a memory corruption variant of the problem. The latter is a reasonable possibility because the root block is part of the inode fork and can reside in-core for some time before inode extents are read. If this occurs, it leads to a system crash such as the following: BUG: unable to handle kernel paging request at ffffffff00000221 PF error: [normal kernel read fault] ... RIP: 0010:xfs_trans_brelse+0xf/0x200 [xfs] ... Call Trace: xfs_iread_extents+0x379/0x540 [xfs] xfs_file_iomap_begin_delay+0x11a/0xb40 [xfs] ? xfs_attr_get+0xd1/0x120 [xfs] ? iomap_write_begin.constprop.40+0x2d0/0x2d0 xfs_file_iomap_begin+0x4c4/0x6d0 [xfs] ? __vfs_getxattr+0x53/0x70 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 iomap_apply+0x63/0x130 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 iomap_file_buffered_write+0x62/0x90 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 xfs_file_buffered_aio_write+0xe4/0x3b0 [xfs] __vfs_write+0x150/0x1b0 vfs_write+0xba/0x1c0 ksys_pwrite64+0x64/0xa0 do_syscall_64+0x5a/0x1d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe The crash occurs because xfs_iread_extents() attempts to release an uninitialized buffer pointer as the level == 0 value prevented the buffer from ever being allocated or read. Change the level > 0 assert to an explicit error check in xfs_iread_extents() to avoid crashing the kernel in the event of localized, in-core inode corruption. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: clean up xfs_dir2_leaf_addnameDarrick J. Wong2019-03-121-18/+15
| | | | | | | | Remove typedefs and consolidate local variable initialization. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
* xfs: zero initialize highstale and lowstale in xfs_dir2_leaf_addnameDarrick J. Wong2019-03-101-2/+2
| | | | | | | | | | | | | | | | | | | | Smatch complains about the following: fs/xfs/libxfs/xfs_dir2_leaf.c:848 xfs_dir2_leaf_addname() error: uninitialized symbol 'lowstale'. fs/xfs/libxfs/xfs_dir2_leaf.c:849 xfs_dir2_leaf_addname() error: uninitialized symbol 'highstale'. I don't think there's any incorrect behavior associated with the uninitialized variable, but as the author of the previous zero-init patch points out, it's best not to be passing around pointers to uninitialized stack areas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
* xfs: clean up xfs_dir2_leafn_addDarrick J. Wong2019-03-081-12/+8
| | | | | | | Remove typedefs and consolidate local variable initialization. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
* xfs: Zero initialize highstale and lowstale in xfs_dir2_leafn_addNathan Chancellor2019-03-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | When building with -Wsometimes-uninitialized, Clang warns: fs/xfs/libxfs/xfs_dir2_node.c:481:6: warning: variable 'lowstale' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] fs/xfs/libxfs/xfs_dir2_node.c:481:6: warning: variable 'highstale' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] While it isn't technically wrong, it isn't a problem in practice because highstale and lowstale are only initialized in xfs_dir2_leafn_add when compact is not zero then they are passed to xfs_dir3_leaf_find_entry, where they are initialized before use when compact is zero. Regardless, it's better not to be passing around uninitialized stack memory so zero initialize these variables, which silences this warning. Link: https://github.com/ClangBuiltLinux/linux/issues/393 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: fix uninitialized error variablesDarrick J. Wong2019-02-251-1/+1
| | | | | | | smatch complained about some uninitialized error returns, so fix those. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
* xfs: make COW fork unwritten extent conversions more robustChristoph Hellwig2019-02-212-10/+7
| | | | | | | | | | | | | | | | | | | If we have racing buffered and direct I/O COW fork extents under writeback can have been moved to the data fork by the time we call xfs_reflink_convert_cow from xfs_submit_ioend. This would be mostly harmless as the block numbers don't change by this move, except for the fact that xfs_bmapi_write will crash or trigger asserts when not finding existing extents, even despite trying to paper over this with the XFS_BMAPI_CONVERT_ONLY flag. Instead of special casing non-transaction conversions in the already way too complicated xfs_bmapi_write just add a new helper for the much simpler non-transactional COW fork case, which simplify ignores not found extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: fix xfs_buf magic number endian checksDarrick J. Wong2019-02-185-19/+19
| | | | | | | | Create a separate magic16 check function so that we don't run afoul of static checkers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: move stat accounting to xfs_bmapi_convert_delallocChristoph Hellwig2019-02-171-0/+3
| | | | | | | | | | | | | | | | This way we can actually count how many bytes got converted and how many calls we need, unlike in the caller which doesn't have the detailed view. Note that this includes a slight change in behavior as the xs_xstrat_quick is now bumped for every allocation instead of just the one covering the requested writeback offset, which makes a lot more sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: move transaction handling to xfs_bmapi_convert_delallocChristoph Hellwig2019-02-172-7/+36
| | | | | | | | | | No need to deal with the transaction and the inode locking in the caller. Note that we also switch to passing whichfork as the second paramter, matching what most related functions do. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: split XFS_BMAPI_DELALLOC handling from xfs_bmapi_writeChristoph Hellwig2019-02-172-52/+58
| | | | | | | | | | | | | | Delalloc conversion has traditionally been part of our function to allocate blocks on disk (first xfs_bmapi, then xfs_bmapi_write), but delalloc conversion is a little special as we really do not want to allocate blocks over holes, for which we don't have reservations. Split the delalloc conversions into a separate helper to keep the code simple and structured. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: factor out two helpers from xfs_bmapi_writeChristoph Hellwig2019-02-171-34/+44
| | | | | | | | | | We want to be able to reuse them for the upcoming dedidcated delalloc convert routine. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: simplify the xfs_bmap_btree_to_extents calling conventionsChristoph Hellwig2019-02-171-52/+26
| | | | | | | | | | | | | | Move boilerplate code from the callers into xfs_bmap_btree_to_extents: - exit early without failure if we don't need to convert to the extent format - assert that we have a btree cursor - don't reinitialize the passed in logflags argument Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: rename m_inotbt_nores to m_finobt_noresDarrick J. Wong2019-02-142-3/+3
| | | | | | | | | Rename this flag variable to imply more strongly that it's related to the free inode btree (finobt) operation. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: add magic numbers to dquot buffer opsDarrick J. Wong2019-02-111-0/+4
| | | | | | | | Add dquot magic numbers to the buffer ops type, in case we ever want to use them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: add inode magic to inode verifierDarrick J. Wong2019-02-111-1/+5
| | | | | | | Use xfs_verify_magic to check the magic numbers of inodes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: factor xfs_da3_blkinfo verification into common helperBrian Foster2019-02-114-37/+43
| | | | | | | | | | | | | | | | | With the verifier magic value helper in place, we've left a bit more duplicate code across the verifiers that involve struct xfs_da3_blkinfo. This includes the da node, xattr leaf and dir leaf verifiers, all of which perform similar checks for v4 and v5 filesystems. Create a common helper to verify an xfs_da3_blkinfo structure, taking care to only access v5 fields where appropriate, and refactor the aforementioned verifiers to use the helper. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: miscellaneous verifier magic value fixupsBrian Foster2019-02-1113-46/+57
| | | | | | | | | | | | Most buffer verifiers have hardcoded magic value checks conditionalized on the version of the filesystem. The magic value field of the verifier structure facilitates abstraction of some of this code. Populate the ->magic field of various verifiers to take advantage of this abstraction. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: use verifier magic field in dir2 leaf verifiersBrian Foster2019-02-111-68/+19
| | | | | | | | | | | | | | | The dir2 leaf verifiers share the same underlying structure verification code, but implement six accessor functions to multiplex the code across the two verifiers. Further, the magic value isn't sufficiently abstracted such that the common helper has to manually fix up the magic from the caller on v5 filesystems. Use the magic field in the verifier structure to eliminate the duplicate code and clean this all up. No functional change. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: distinguish between bnobt and cntbt magic valuesBrian Foster2019-02-111-35/+25
| | | | | | | | | | | | | | | | | | The allocation btree verifiers share code that is unable to detect cross-tree magic value corruptions such as a bnobt block with a cntbt magic value. Populate the b_ops->magic field of the associated verifier structures such that the structure verifier can check the magic value against the expected value based on tree type. The btree level check requires knowledge of the tree type to determine the appropriate maximum value. This was previously part of the hardcoded magic value checks. With that code removed, peek at the first magic value in the verifier to determine the expected tree type of the current block. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: split up allocation btree verifierBrian Foster2019-02-113-7/+14
| | | | | | | | | | | | Similar to the inode btree verifier, the same allocation btree verifier structure is shared between the by-bno (bnobt) and by-size (cntbt) btrees. This prevents the ability to distinguish magic values between them. Separate the verifier into two, one for each tree, and assign them appropriately. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: distinguish between inobt and finobt magic valuesBrian Foster2019-02-111-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | The inode btree verifier code is shared between the inode btree and free inode btree because the underlying metadata formats are essentially equivalent. A side effect of this is that the verifier cannot determine whether a particular btree block should have an inobt or finobt magic value. This logic allows an unfortunate xfs_repair bug to escape detection where certain level > 0 nodes of the finobt are stamped with inobt magic by xfs_repair finobt reconstruction. This is fortunately not a severe problem since the inode btree magic values do not contribute to any changes in kernel behavior, but we do need a means to detect and prevent this problem in the future. Add a field to xfs_buf_ops to store the v4 and v5 superblock magic values expected by a particular verifier. Add a helper to check an on-disk magic value against the value expected by the verifier. Call the helper from the shared [f]inobt verifier code for magic value verification. This ensures that the inode btree blocks each have the appropriate magic value based on specific tree type and superblock version. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: create a separate finobt verifierBrian Foster2019-02-113-2/+10
| | | | | | | | | | | | The inobt verifier is reused for the inobt and finobt, which prevents the ability to distinguish between magic values on a per-tree basis. Create a separate finobt structure in preparation for changes to enforce the appropriate magic value for the associated tree. This patch has no functional change. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: always check magic values in on-disk byte orderBrian Foster2019-02-115-7/+8
| | | | | | | | | | | | | | Most verifiers that check on-disk magic values convert the CPU endian magic value constant to disk endian to facilitate compile time optimization of the byte swap and reduce the need for runtime byte swaps in buffer verifiers. Several buffer verifiers do not follow this pattern. Update those verifiers for consistency. Also fix up a random typo in the inode readahead verifier name. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: cache unlinked pointers in an rhashtableDarrick J. Wong2019-02-111-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a rhashtable to cache the unlinked list incore. This should speed up unlinked processing considerably when there are a lot of inodes on the unlinked list because iunlink_remove no longer has to traverse an entire bucket list to find which inode points to the one being removed. The incore list structure records "X.next_unlinked = Y" relations, with the rhashtable using Y to index the records. This makes finding the inode X that points to a inode Y very quick. If our cache fails to find anything we can always fall back on the old method. FWIW this drastically reduces the amount of time it takes to remove inodes from the unlinked list. I wrote a program to open a lot of O_TMPFILE files and then close them in the same order, which takes a very long time if we have to traverse the unlinked lists. With the ptach, I see: + /d/t/tmpfile/tmpfile Opened 193531 files in 6.33s. Closed 193531 files in 5.86s real 0m12.192s user 0m0.064s sys 0m11.619s + cd / + umount /mnt real 0m0.050s user 0m0.004s sys 0m0.030s And without the patch: + /d/t/tmpfile/tmpfile Opened 193588 files in 6.35s. Closed 193588 files in 751.61s real 12m38.853s user 0m0.084s sys 12m34.470s + cd / + umount /mnt real 0m0.086s user 0m0.000s sys 0m0.060s Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: add xfs_verify_agino_or_null helperDarrick J. Wong2019-02-113-2/+16
| | | | | | | | | Add a new helper to check that a per-AG inode pointer is either null or points somewhere valid within that AG. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: use the latest extent at writeback delalloc conversion timeBrian Foster2019-02-112-11/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The writeback delalloc conversion code is racy with respect to changes in the currently cached file mapping outside of the current page. This is because the ilock is cycled between the time the caller originally looked up the mapping and across each real allocation of the provided file range. This code has collected various hacks over the years to help combat the symptoms of these races (i.e., truncate race detection, allocation into hole detection, etc.), but none address the fundamental problem that the imap may not be valid at allocation time. Rather than continue to use race detection hacks, update writeback delalloc conversion to a model that explicitly converts the delalloc extent backing the current file offset being processed. The current file offset is the only block we can trust to remain once the ilock is dropped because any operation that can remove the block (truncate, hole punch, etc.) must flush and discard pagecache pages first. Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc() mechanism to request allocation of the entire delalloc extent backing the current offset instead of assuming the extent passed by the caller is unchanged. Record the range specified by the caller and apply it to the resulting allocated extent so previous checks by the caller for COW fork overlap are not lost. Finally, overload the bmapi delalloc flag with the range reval flag behavior since this is the only use case for both. This ensures that writeback always picks up the correct and current extent associated with the page, regardless of races with other extent modifying operations. If operating on a data fork and the COW overlap state has changed since the ilock was cycled, the caller revalidates against the COW fork sequence number before using the imap for the next block. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: create delalloc bmapi wrapper for full extent allocationBrian Foster2019-02-112-4/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | The writeback delalloc conversion code is racy with respect to changes in the currently cached file mapping. This stems from the fact that the bmapi allocation code requires a file range to allocate and the writeback conversion code assumes the range of the currently cached mapping is still valid with respect to the fork. It may not be valid, however, because the ilock is cycled (potentially multiple times) between the time the cached mapping was populated and the delalloc conversion occurs. To facilitate a solution to this problem, create a new xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file (FSB) offset and attempts to allocate whatever delalloc extent backs the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set the range based on the extent backing the bno parameter unless bno lands in a hole. If bno does land in a hole, fall back to the current behavior (which may result in an error or quietly skipping holes in the specified range depending on other parameters). This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: remove superfluous writeback mapping eof trimmingBrian Foster2019-02-112-12/+0
| | | | | | | | | | | Now that the cached writeback mapping is explicitly invalidated on data fork changes, the EOF trimming band-aid is no longer necessary. Remove xfs_trim_extent_eof() as well since it has no other users. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: update fork seq counter on data fork changesBrian Foster2019-02-112-8/+7
| | | | | | | | | | | | | | | | The sequence counter in the xfs_ifork structure is only updated on COW forks. This is because the counter is currently only used to optimize out repetitive COW fork checks at writeback time. Tweak the extent code to update the seq counter regardless of the fork type in preparation for using this counter on data forks as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: check attribute name validityDarrick J. Wong2019-02-112-1/+18
| | | | | | | Check extended attribute entry names for invalid characters. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: check directory name validityDarrick J. Wong2019-02-112-0/+18
| | | | | | | Check directory entry names for invalid characters. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: scrub should flag dir/attr offsets that aren't mappable with xfs_dablk_tDarrick J. Wong2019-02-112-0/+12
| | | | | | | | Teach scrub to flag extent maps that exceed the range that can be mapped with a xfs_dablk_t. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: stringify scrub types in ftrace outputDarrick J. Wong2018-12-191-2/+2
| | | | | | | Use __print_symbolic to print the scrub type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
* xfs: stringify btree cursor types in ftrace outputDarrick J. Wong2018-12-191-0/+13
| | | | | | | Use __print_symbolic to print the btree type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
* xfs: move XFS_INODE_FORMAT_STR mappings to libxfsDarrick J. Wong2018-12-191-0/+10
| | | | | | | | Move XFS_INODE_FORMAT_STR to libxfs so that we don't forget to keep it updated, and add necessary TRACE_DEFINE_ENUM. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
* xfs: move XFS_AG_BTREE_CMP_FORMAT_STR mappings to libxfsDarrick J. Wong2018-12-191-0/+5
| | | | | | | | Move XFS_AG_BTREE_CMP_FORMAT_STR to libxfs so that we don't forget to keep it updated, and TRACE_DEFINE_ENUM the values while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
* xfs: fix symbolic enum printing in ftrace outputDarrick J. Wong2018-12-191-0/+4
| | | | | | | | | | ftrace's __print_symbolic() has a (very poorly documented) requirement that any enum values used in the symbol to string translation table be wrapped in a TRACE_DEFINE_ENUM so that the enum value can be encoded in the ftrace ring buffer. Fix this unsatisfied requirement. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
* xfs: cache minimum realtime summary levelOmar Sandoval2018-12-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The realtime summary is a two-dimensional array on disk, effectively: u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap] rsum[log][bbno] is the number of extents of size 2**log which start in bitmap block bbno. xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether rsum[log][bbno] != 0 for any log level. However, the summary array is stored in row-major order (i.e., like an array in C), so all of these entries are not adjacent, but rather spread across the entire summary file. In the worst case (a full bitmap block), xfs_rtany_summary() has to check every level. This means that on a moderately-used realtime device, an allocation will waste a lot of time finding, reading, and releasing buffers for the realtime summary. In particular, one of our storage services (which runs on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems) spends almost 5% of its CPU cycles in xfs_rtbuf_get() and xfs_trans_brelse() called from xfs_rtany_summary(). One solution would be to also store the summary with the dimensions swapped. However, this would require a disk format change to a very old component of XFS. Instead, we can cache the minimum size which contains any extents. We do so lazily; rather than guaranteeing that the cache contains the precise minimum, it always contains a loose lower bound which we tighten when we read or update a summary block. This only uses a few kilobytes of memory and is already serialized via the realtime bitmap and summary inode locks, so the cost is minimal. With this change, the same workload only spends 0.2% of its CPU cycles in the realtime allocator. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: precalculate cluster alignment in inodes and blocksDarrick J. Wong2018-12-122-7/+6
| | | | | | | | | Store the inode cluster alignment information in units of inodes and blocks in the mount data so that we don't have to keep recalculating them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: precalculate inodes and blocks per inode clusterDarrick J. Wong2018-12-121-13/+10
| | | | | | | | Store the number of inodes and blocks per inode cluster in the mount data so that we don't have to keep recalculating them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: add a block to inode count converterDarrick J. Wong2018-12-123-11/+12
| | | | | | | | | | Add new helpers to convert units of fs blocks into inodes, and AG blocks into AG inodes, respectively. Convert all the open-coded conversions and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate. The OFFBNO_TO_AGINO macro is retained for xfs_repair. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
OpenPOWER on IntegriCloud