summaryrefslogtreecommitdiffstats
path: root/fs/xfs/libxfs
Commit message (Collapse)AuthorAgeFilesLines
...
| * xfs: refactor inode verifier corruption error printingDarrick J. Wong2018-01-291-4/+2
| | | | | | | | | | | | | | | | | | | | | | Refactor inode verifier error reporting into a non-libxfs function so that we aren't encoding the message format in libxfs. This also changes the kernel dmesg output to resemble buffer verifier errors more closely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * xfs: bmap code cleanupShan Hai2018-01-291-24/+8
| | | | | | | | | | | | | | | | | | | | Remove the extent size hint and realtime inode relevant code from the xfs_bmapi_reserve_delalloc since it is not called on the inode with extent size hint set or on a realtime inode. Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * Split buffer's b_fspriv fieldCarlos Maiolino2018-01-2911-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By splitting the b_fspriv field into two different fields (b_log_item and b_li_list). It's possible to get rid of an old ABI workaround, by using the new b_log_item field to store xfs_buf_log_item separated from the log items attached to the buffer, which will be linked in the new b_li_list field. This way, there is no more need to reorder the log items list to place the buf_log_item at the beginning of the list, simplifying a bit the logic to handle buffer IO. This also opens the possibility to change buffer's log items list into a proper list_head. b_log_item field is still defined as a void *, because it is still used by the log buffers to store xlog_in_core structures, and there is no need to add an extra field on xfs_buf just for xlog_in_core. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: minor style changes] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: check sb_agblocks and sb_agblklog when validating superblockDarrick J. Wong2018-01-172-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we don't check sb_agblocks or sb_agblklog when we validate the superblock, which means that we can fuzz garbage values into those values and the mount succeeds. This leads to all sorts of UBSAN warnings in xfs/350 since we can then coerce other parts of xfs into shifting by ridiculously large values. Once we've validated agblocks, make sure the agcount makes sense. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * xfs: recheck reflink / dirty page status before freeing CoW reservationsDarrick J. Wong2018-01-171-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eryu Guan reported seeing occasional hangs when running generic/269 with a new fsstress that supports clonerange/deduperange. The cause of this hang is an infinite loop when we convert the CoW fork extents from unwritten to real just prior to writing the pages out; the infinite loop happens because there's nothing in the CoW fork to convert, and so it spins forever. The fundamental issue here is that when we go to perform these CoW fork conversions, we're supposed to have an extent waiting for us, but the low space CoW reaper has snuck in and blown them away! There are four conditions that can dissuade the reaper from touching our file -- no reflink iflag; dirty page cache; writeback in progress; or directio in progress. We check the four conditions prior to taking the locks, but we neglect to recheck them once we have the locks, which is how we end up whacking the writeback that's in progress. Therefore, refactor the four checks into a helper function and call it once again once we have the locks to make sure we really want to reap the inode. While we're at it, add an ASSERT for this weird condition so that we'll fail noisily if we ever screw this up again. Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Tested-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * xfs: btree format ifork loader should check for zero numrecsDarrick J. Wong2018-01-171-0/+1
| | | | | | | | | | | | | | | | | | A btree format inode fork with zero records makes no sense, so reject it if we see it, or else we can miscalculate memory allocations. Found by zeroes fuzzing {a,u3}.bmbt.numrecs in xfs/{374,378,412} with KASAN. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * xfs: attr leaf verifier needs to check for obviously bad countDarrick J. Wong2018-01-171-5/+21
| | | | | | | | | | | | | | | | | | | | In the attribute leaf verifier, we can check for obviously bad values of firstused and count so that later attempts at lasthash don't run off the end of the memory buffer. Found by ones fuzzing hdr.count in xfs/400 with KASAN. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * xfs: directory scrubber must walk through data block to offsetDarrick J. Wong2018-01-173-22/+27
| | | | | | | | | | | | | | | | | | | | | | In xfs_scrub_dir_rec, we must walk through the directory block entries to arrive at the offset given by the hash structure. If we blindly trust the hash address, we can end up midway into a directory entry and stray outside the block. Found by lastbit fuzzing lents[3].address in xfs/390 with KASAN enabled. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: cross-reference the realtime bitmapDarrick J. Wong2018-01-171-0/+21
| | | | | | | | | | | | | | | | While we're scrubbing various btrees, cross-reference the records with the other metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: add scrub cross-referencing helpers for the refcount btreesDarrick J. Wong2018-01-172-0/+22
| | | | | | | | | | | | | | | | Add a couple of functions to the refcount btrees that will be used to cross-reference metadata against the refcountbt. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: add scrub cross-referencing helpers for the rmap btreesDarrick J. Wong2018-01-172-0/+72
| | | | | | | | | | | | | | | | Add a couple of functions to the rmap btrees that will be used to cross-reference metadata against the rmapbt. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: add scrub cross-referencing helpers for the inode btreesDarrick J. Wong2018-01-172-0/+105
| | | | | | | | | | | | | | | | Add a couple of functions to the inode btrees that will be used to cross-reference metadata against the inobt. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: add scrub cross-referencing helpers for the free space btreesDarrick J. Wong2018-01-174-1/+62
| | | | | | | | | | | | | | | | | | Add a couple of functions to the free space btrees that will be used to cross-reference metadata against the bnobt/cntbt, and a generic btree function that provides the real implementation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: cancel tx on xfs_defer_finish() error during xattr set/removeBrian Foster2018-01-161-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Chris Dunlop reports a problem where an xattr operation fails, reports the following error to syslog and hangs during unmount: ================================================ [ BUG: lock held when returning to user space! ] ... ------------------------------------------------ <PID> is leaving the kernel with locks still held! 1 lock held by <PID>: #0: (sb_internal){......}, at: [<ffffffffa07692a3>] xfs_trans_alloc+0xe3/0x130 [xfs] The failure/shutdown occurs during deferred ops processing which leads to an error return from xfs_defer_finish() via xfs_attr_leaf_addname(). While the root cause of the failure is unknown corruption, the cause of the subsequent BUG above and unmount hang is failure to cancel the transaction before returning to userspace. The transaction is not cancelled because the out_defer_cancel error handling paths in the xfs_attr_[leaf|node]_[add|remove]name() functions clear args.trans without releasing the transaction. The callers therefore lose the reference to the transaction and fail to cancel it. Since xfs_attr_[set|remove]() always cancel args.trans when != NULL and xfs_defer_finish()->...->xfs_trans_roll() should always return with a valid transaction, update the leaf/node xattr functions to not reset args.trans in the error path responsible for cancelling deferred ops. Reported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: account finobt blocks properly in perag reservationBrian Foster2018-01-121-4/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS started using the perag metadata reservation pool for free inode btree blocks in commit 76d771b4cbe33 ("xfs: use per-AG reservations for the finobt"). To handle backwards compatibility, finobt blocks are accounted against the pool so long as the full reservation is available at mount time. Otherwise the ->m_inotbt_nores flag is set and the filesystem falls back to the traditional per-transaction finobt reservation. This commit has two problems: - finobt blocks are always accounted against the metadata reservation on allocation, regardless of ->m_inotbt_nores state - finobt blocks are never returned to the reservation pool on free The first problem affects reflink+finobt filesystems where the full finobt reservation is not available at mount time. finobt blocks are essentially stolen from the reflink reservation, putting refcountbt management at risk of allocation failure. The second problem is an unconditional leak of metadata reservation whenever finobt is enabled. Update the finobt block allocation callouts to consider ->m_inotbt_nores and account blocks appropriately. Blocks should be consistently accounted against the metadata pool when ->m_inotbt_nores is false and otherwise tagged as RESV_NONE. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: fix check on struct_version for versions 4 or greaterColin Ian King2018-01-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | It appears that the check for versions 4 or more is incorrect and is off-by-one. Fix this. Detected by CoverityScan, CID#1463775 ("Logically dead code") Fixes: ac503a4cc9e8 ("xfs: refactor the geometry structure filling function") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: use %px for data pointers when debuggingDarrick J. Wong2018-01-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Starting with commit 57e734423ad ("vsprintf: refactor %pK code out of pointer"), the behavior of the raw '%p' printk format specifier was changed to print a 32-bit hash of the pointer value to avoid leaking kernel pointers into dmesg. For most situations that's good. This is /undesirable/ behavior when we're trying to debug XFS, however, so define a PTR_FMT that prints the actual pointer when we're in debug mode. Note that %p for tracepoints still prints the raw pointer, so in the long run we could consider rewriting some of these messages as tracepoints. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: change 0x%p -> %p in print messagesDarrick J. Wong2018-01-121-1/+1
| | | | | | | | | | | | | | Since %p prepends "0x" to the outputted string, we can drop the prefix. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: harden directory integrity checks some moreDarrick J. Wong2018-01-091-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a malicious filesystem image contains a block+ format directory wherein the directory inode's core.mode is set such that S_ISDIR(core.mode) == 0, and if there are subdirectories of the corrupted directory, an attempt to traverse up the directory tree will crash the kernel in __xfs_dir3_data_check. Running the online scrub's parent checks will tend to do this. The crash occurs because the directory inode's d_ops get set to xfs_dir[23]_nondir_ops (it's not a directory) but the parent pointer scrubber's indiscriminate call to xfs_readdir proceeds past the ASSERT if we have non fatal asserts configured. Fix the null pointer dereference crash in __xfs_dir3_data_check by looking for S_ISDIR or wrong d_ops; and teach the parent scrubber to bail out if it is fed a non-directory "parent". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * xfs: refactor the geometry structure filling functionDarrick J. Wong2018-01-084-75/+89
| | | | | | | | | | | | | | | | | | Refactor the geometry structure filling function to use the superblock to fill the fields. While we're at it, make the function less indenty and use some whitespace to make the function easier to read. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * xfs: hoist xfs_fs_geometry to libxfsDarrick J. Wong2018-01-082-0/+82
| | | | | | | | | | | | | | | | Move xfs_fs_geometry to libxfs so that we can clean up the fs geometry reporting in xfsprogs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * xfs: trace log reservations at mount timeDarrick J. Wong2018-01-082-1/+4
| | | | | | | | | | | | | | | | | | | | At each mount, emit the transaction reservation type information via tracepoints. This makes it easier to compare the log reservation info calculated by the kernel and xfsprogs so that we can more easily diagnose minimum log size failures on freshly formatted filesystems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * xfs: standardize quota verification function outputsDarrick J. Wong2018-01-082-94/+54
| | | | | | | | | | | | | | | | | | | | | | Rename xfs_dqcheck to xfs_dquot_verify and make it return an xfs_failaddr_t like every other structure verifier function. This enables us to check on-disk quotas in the same way that we check everything else. Callers are now responsible for logging errors, as XFS_QMOPT_DOWARN goes away. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: separate dquot repair into a separate functionDarrick J. Wong2018-01-082-8/+17
| | | | | | | | | | | | | | | | | | Move the dquot repair code into a separate function and remove XFS_QMOPT_DQREPAIR in favor of calling the helper directly. Remove other dead code because quotacheck is the only caller of DQREPAIR. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: create a new buf_ops pointer to verify structure metadataDarrick J. Wong2018-01-0816-21/+124
| | | | | | | | | | | | | | | | | | Expose all metadata structure buffer verifier functions via buf_ops. These will be used by the online scrub mechanism to look for problems with buffers that are already sitting around in memory. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: fail out of xfs_attr3_leaf_lookup_int if it looks corruptDarrick J. Wong2018-01-081-3/+6
| | | | | | | | | | | | | | | | | | If the xattr leaf block looks corrupt, return -EFSCORRUPTED to userspace instead of ASSERTing on debug kernels or running off the end of the buffer on regular kernels. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: provide a centralized method for verifying inline fork dataDarrick J. Wong2018-01-082-20/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | Replace the current haphazard dir2 shortform verifier callsites with a centralized verifier function that can be called either with the default verifier functions or with a custom set. This helps us strengthen integrity checking while providing us with flexibility for repair tools. xfs_repair wants this to be able to supply its own verifier functions when trying to fix possibly corrupt metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: refactor short form directory structure verifier functionDarrick J. Wong2018-01-083-17/+16
| | | | | | | | | | | | | | | | Change the short form directory structure verifier function to return the instruction pointer of a failing check or NULL if everything's ok. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: create structure verifier function for short form symlinksDarrick J. Wong2018-01-082-0/+35
| | | | | | | | | | | | | | Create a function to check the structure of short form symlink targets. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: create structure verifier function for shortform xattrsDarrick J. Wong2018-01-082-0/+75
| | | | | | | | | | | | | | | | Create a function to perform structure verification for short form extended attributes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: move inode fork verifiers to xfs_dinode_verifyDarrick J. Wong2018-01-082-89/+69
| | | | | | | | | | | | | | | | Consolidate the fork size and format verifiers to xfs_dinode_verify so that we can reject bad inodes earlier and in a single place. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: verify dinode header firstDarrick J. Wong2018-01-081-10/+13
| | | | | | | | | | | | | | | | | | | | Move the v3 inode integrity information (crc, owner, metauuid) before we look at anything else in the inode so that we don't waste time on a torn write or a totally garbled block. This makes xfs_dinode_verify more consistent with the other verifiers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: refactor verifier callers to print address of failing checkDarrick J. Wong2018-01-0818-97/+200
| | | | | | | | | | | | | | | | Refactor the callers of verifiers to print the instruction address of a failing check. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: have buffer verifier functions report failing addressDarrick J. Wong2018-01-0820-274/+322
| | | | | | | | | | | | | | | | | | Modify each function that checks the contents of a metadata buffer to return the instruction address of the failing test so that we can report more precise failure errors to the log. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: refactor xfs_verifier_error and xfs_buf_ioerrorDarrick J. Wong2018-01-0818-142/+74
| | | | | | | | | | | | | | | | | | | | Since all verification errors also mark the buffer as having an error, we can combine these two calls. Later we'll add a xfs_failaddr_t parameter to promote the idea of reporting corruption errors and the address of the failing check to enable better debugging reports. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: remove XFS_WANT_CORRUPTED_RETURN from dir3 data verifiersDarrick J. Wong2018-01-083-53/+61
| | | | | | | | | | | | | | | | | | | | | | Since __xfs_dir3_data_check verifies on-disk metadata, we can't have it noisily blowing asserts and hanging the system on corrupt data coming in off the disk. Instead, have it return a boolean like all the other checker functions, and only have it noisily fail if we fail in debug mode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: refactor short form btree pointer verificationDarrick J. Wong2018-01-081-6/+6
| | | | | | | | | | | | | | | | Now that we have xfs_verify_agbno, use it to verify short form btree pointers instead of open-coding them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: refactor long-format btree header verification routinesDarrick J. Wong2018-01-083-20/+50
| | | | | | | | | | | | | | | | Create two helper functions to verify the headers of a long format btree block. We'll use this later for the realtime rmapbt. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: remove XFS_FSB_SANITY_CHECKDarrick J. Wong2018-01-084-9/+5
| | | | | | | | | | | | | | | | We already have a function to verify fsb pointers, so get rid of the last users of the (less robust) macro. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: eliminate duplicate icreate tx reservation functionsBrian Foster2018-01-081-46/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The create transaction reservation calculation has two different branches of code depending on whether the filesystem is a v5 format fs or older. Each branch considers the max reservation between the allocation case (new chunk allocation + record insert) and the modify case (chunk exists, record modification) of inode allocation. The modify case is the same for both superblock versions with the exception of the finobt. The finobt helper checks the feature bit, however, and so the modify case already shares the same code. Now that inode chunk allocation has been refactored into a helper that checks the superblock version to calculate the appropriate reservation for the create transaction, the only remaining difference between the create and icreate branches is the call to the finobt helper. As noted above, the finobt helper is a no-op when the feature is not enabled. Therefore, these branches are effectively duplicate and can be condensed. Remove the xfs_calc_create_*() branch of functions and update the various callers to use the xfs_calc_icreate_*() variant. The latter creates the same reservation size for v4 create transactions as the removed branch. As such, this patch does not result in transaction reservation changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: refactor inode chunk alloc/free tx reservationBrian Foster2018-01-081-15/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The reservation for the various forms of inode allocation is scattered across several different functions. This includes two variants of chunk allocation (v5 icreate transactions vs. older create transactions) and the inode free transaction. To clean up some of this code and clarify the purpose of specific allocfree reservations, continue the pattern of defining helper functions for smaller operational units of broader transactions. Refactor the reservation into an inode chunk alloc/free helper that considers the various conditions based on filesystem format. An inode chunk free involves an extent free and buffer invalidations. The latter requires reservation for log headers only. An inode chunk allocation modifies the free space btrees and logs the chunk on v4 supers. v5 supers initialize the inode chunk using ordered buffers and so do not log the chunk. As a side effect of this refactoring, add one more allocfree res to the ifree transaction. Technically this does not serve a specific purpose because inode chunks are freed via deferred operations and thus occur after a transaction roll. tr_ifree has a bit of a history of tx overruns caused by too many agfl fixups during sustained file deletion workloads, so add this extra reservation as a form of padding nonetheless. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: include an allocfree res for inobt modificationsBrian Foster2018-01-081-41/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Analysis of recent reports of log reservation overruns and code inspection has uncovered that the reservations associated with inode operations may not cover the worst case scenarios. In particular, many cases only include one allocfree res. for a particular operation even though said operations may also entail AGFL fixups and inode btree block allocations in addition to the actual inode chunk allocation. This can easily turn into two or three block allocations (or frees) per operation. In theory, the only way to define the worst case reservation is to include an allocfree res for each individual allocation in a transaction. Since that is impractical (we can perform multiple agfl fixups per tx and not every allocation results in a full tree operation), we need to find a reasonable compromise that addresses the deficiency in practice without blowing out the size of the transactions. Since the inode btrees are not filled by the AGFL, record insertion and removal can directly result in block allocations and frees depending on the shape of the tree. These allocations and frees occur in the same transaction context as the inobt update itself, but are separate from the allocation/free that might be required for an inode chunk. Therefore, it makes sense to assume that an [f]inobt insert/remove can directly result in one or more block allocations on behalf of the tree. Refactor the inode transaction reservations to include one allocfree res. per inode btree modification to cover allocations required by the tree itself. This separates the reservation required to allocate the inode chunk from the reservation required for inobt record insertion/removal. Apply the same logic to the finobt. This results in killing off the finobt modify condition because we no longer assume that the broader transaction reservation will cover finobt block allocations and finobt shape changes can occur in either of the inobt allocation or modify situations. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: truncate transaction does not modify the inobtBrian Foster2018-01-081-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | The truncate transaction does not ever modify the inode btree, but includes an associated log reservation. Update xfs_calc_itruncate_reservation() to remove the reservation associated with inobt updates. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: fix up agi unlinked list reservationsBrian Foster2018-01-081-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current AGI unlinked list addition and removal reservations do not reflect the worst case log usage. An unlinked list removal can log up to two on-disk inode clusters but only includes reservation for one. An unlinked list addition logs the on-disk cluster but includes reservation for an in-core inode. Update the AGI unlinked list reservation helpers to calculate the correct worst case reservation for the associated operations. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * xfs: include inobt buffers in ifree tx log reservationBrian Foster2018-01-081-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tr_ifree transaction handles inode unlinks and inode chunk frees. The current transaction calculation does not accurately reflect worst case changes to the inode btree, however. The inobt portion of the current transaction reservation only covers modification of a single inobt buffer (for the particular inode record). This is a historical artifact from the days before XFS supported full inode chunk removal. When support for inode chunk removal was added in commit 254f6311ed1b ("Implement deletion of inode clusters in XFS."), the additional log reservation required for chunk removal was not added correctly. The new reservation only considered the header overhead of associated buffers rather than the full contents of the btrees and AGF and AGFL buffers affected by the transaction. The reservation for the free space btrees was subsequently fixed up in commit 5fe6abb82f76 ("Add space for inode and allocation btrees to ITRUNCATE log reservation"), but the res. for full inobt joins has never been added. Further review of the ifree reservation uncovered a couple more problems: - The undocumented +2 blocks are intended for the AGF and AGFL, but are also not sized correctly and should be logged as full sectors (not FSBs). - The additional single block header is undocumented and serves no apparent purpose. Update xfs_calc_ifree_reservation() to include a full inobt join in the reservation calculation. Refactor the undocumented blocks appropriately and fix up the comments to reflect the current calculation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | Merge tag 'iversion-v4.16-1' of ↵Linus Torvalds2018-01-291-2/+5
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull inode->i_version rework from Jeff Layton: "This pile of patches is a rework of the inode->i_version field. We have traditionally incremented that field on every inode data or metadata change. Typically this increment needs to be logged on disk even when nothing else has changed, which is rather expensive. It turns out though that none of the consumers of that field actually require this behavior. The only real requirement for all of them is that it be different iff the inode has changed since the last time the field was checked. Given that, we can optimize away most of the i_version increments and avoid dirtying inode metadata when the only change is to the i_version and no one is querying it. Queries of the i_version field are rather rare, so we can help write performance under many common workloads. This patch series converts existing accesses of the i_version field to a new API, and then converts all of the in-kernel filesystems to use it. The last patch in the series then converts the backend implementation to a scheme that optimizes away a large portion of the metadata updates when no one is looking at it. In my own testing this series significantly helps performance with small I/O sizes. I also got this email for Christmas this year from the kernel test robot (a 244% r/w bandwidth improvement with XFS over DAX, with 4k writes): https://lkml.org/lkml/2017/12/25/8 A few of the earlier patches in this pile are also flowing to you via other trees (mm, integrity, and nfsd trees in particular)". * tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits) fs: handle inode->i_version more efficiently btrfs: only dirty the inode in btrfs_update_time if something was changed xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing fs: only set S_VERSION when updating times if necessary IMA: switch IMA over to new i_version API xfs: convert to new i_version API ufs: use new i_version API ocfs2: convert to new i_version API nfsd: convert to new i_version API nfs: convert to new i_version API ext4: convert to new i_version API ext2: convert to new i_version API exofs: switch to new i_version API btrfs: convert to new i_version API afs: convert to new i_version API affs: convert to new i_version API fat: convert to new i_version API fs: don't take the i_lock in inode_inc_iversion fs: new API for handling inode->i_version ntfs: remove i_version handling ...
| * xfs: convert to new i_version APIJeff Layton2018-01-291-2/+5
| | | | | | | | | | | | Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com>
* | xfs: only skip rmap owner checks for unknown-owner rmap removalDarrick J. Wong2017-12-211-24/+52
| | | | | | | | | | | | | | | | | | For rmap removal, refactor the rmap owner checks into a separate function, then skip the checks if we are performing an unknown-owner removal. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* | xfs: always honor OWN_UNKNOWN rmap removal requestsDarrick J. Wong2017-12-213-3/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling xfs_rmap_free with an unknown owner is supposed to remove any rmaps covering that range regardless of owner. This is used by the EFI recovery code to say "we're freeing this, it mustn't be owned by anything anymore", but for whatever reason xfs_free_ag_extent filters them out. Therefore, remove the filter and make xfs_rmap_unmap actually treat it as a wildcard owner -- free anything that's already there, and if there's no owner at all then that's fine too. There are two existing callers of bmap_add_free that take care the rmap deferred ops themselves and use OWN_UNKNOWN to skip the EFI-based rmap cleanup; convert these to use OWN_NULL (via helpers), and now we really require that an RUI (if any) gets added to the defer ops before any EFI. Lastly, now that xfs_free_extent filters out OWN_NULL rmap free requests, growfs will have to consult directly with the rmap to ensure that there aren't any rmaps in the grown region. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* | xfs: queue deferred rmap ops for cow staging extent alloc/free in the right ↵Darrick J. Wong2017-12-211-33/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | order Under the deferred rmap operation scheme, there's a certain order in which the rmap deferred ops have to be queued to maintain integrity during log replay. For alloc/map operations that order is cui -> rui; for free/unmap operations that order is cui -> rui -> efi. However, the initial refcount code got the ordering wrong in the free side of things because it queued refcount free op and an EFI and the refcount free op queued a rmap free op, resulting in the order cui -> efi -> rui. If we fail before the efd finishes, the efi recovery will try to do a wildcard rmap removal and the subsequent rui will fail to find the rmap and blow up. This didn't ever happen due to other screws up in handling unknown owner rmap removals, but those other screw ups broke recovery in other ways, so fix the ordering to follow the intended rules. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
OpenPOWER on IntegriCloud