summaryrefslogtreecommitdiffstats
path: root/fs/xfs
Commit message (Collapse)AuthorAgeFilesLines
* xfs: prevent dropping ioend completions during buftarg waitBrian Foster2016-09-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 800b2694f890cc35a1bda63501fc71c94389d517 upstream. xfs_wait_buftarg() waits for all pending I/O, drains the ioend completion workqueue and walks the LRU until all buffers in the cache have been released. This is traditionally an unmount operation` but the mechanism is also reused during filesystem freeze. xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce, which is intended more for a shutdown sequence in that it indicates to the queue that new operations are not expected once the drain has begun. New work jobs after this point result in a WARN_ON_ONCE() and are otherwise dropped. With filesystem freeze, however, read operations are allowed and can proceed during or after the workqueue drain. If such a read occurs during the drain sequence, the workqueue infrastructure complains about the queued ioend completion work item and drops it on the floor. As a result, the buffer remains on the LRU and the freeze never completes. Despite the fact that the overall buffer cache cleanup is not necessary during freeze, fix up this operation such that it is safe to invoke during non-unmount quiesce operations. Replace the drain_workqueue() call with flush_workqueue(), which runs a similar serialization on pending workqueue jobs without causing new jobs to be dropped. This is safe for unmount as unmount independently locks out new operations by the time xfs_wait_buftarg() is invoked. cc: <stable@vger.kernel.org> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix superblock inprogress checkDave Chinner2016-09-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | commit f3d7ebdeb2c297bd26272384e955033493ca291c upstream. From inspection, the superblock sb_inprogress check is done in the verifier and triggered only for the primary superblock via a "bp->b_bn == XFS_SB_DADDR" check. Unfortunately, the primary superblock is an uncached buffer, and hence it is configured by xfs_buf_read_uncached() with: bp->b_bn = XFS_BUF_DADDR_NULL; /* always null for uncached buffers */ And so this check never triggers. Fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: handle dquot buffer readahead in log recovery correctlyDave Chinner2016-06-075-9/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7d6a13f023567d573ac362502bb702eda716e654 upstream. When we do dquot readahead in log recovery, we do not use a verifier as the underlying buffer may not have dquots in it. e.g. the allocation operation hasn't yet been replayed. Hence we do not want to fail recovery because we detect an operation to be replayed has not been run yet. This problem was addressed for inodes in commit d891400 ("xfs: inode buffers may not be valid during recovery readahead") but the problem was not recognised to exist for dquots and their buffers as the dquot readahead did not have a verifier. The result of not using a verifier is that when the buffer is then next read to replay a dquot modification, the dquot buffer verifier will only be attached to the buffer if *readahead is not complete*. Hence we can read the buffer, replay the dquot changes and then add it to the delwri submission list without it having a verifier attached to it. This then generates warnings in xfs_buf_ioapply(), which catches and warns about this case. Fix this and make it handle the same readahead verifier error cases as for inode buffers by adding a new readahead verifier that has a write operation as well as a read operation that marks the buffer as not done if any corruption is detected. Also make sure we don't run readahead if the dquot buffer has been marked as cancelled by recovery. This will result in readahead either succeeding and the buffer having a valid write verifier, or readahead failing and the buffer state requiring the subsequent read to resubmit the IO with the new verifier. In either case, this will result in the buffer always ending up with a valid write verifier on it. Note: we also need to fix the inode buffer readahead error handling to mark the buffer with EIO. Brian noticed the code I copied from there wrong during review, so fix it at the same time. Add comments linking the two functions that handle readahead verifier errors together so we don't forget this behavioural link in future. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: print name of verifier if it failsEric Sandeen2016-06-0718-2/+24
| | | | | | | | | | | | | | | commit 233135b763db7c64d07b728a9c66745fb0376275 upstream. This adds a name to each buf_ops structure, so that if a verifier fails we can print the type of verifier that failed it. Should be a slight debugging aid, I hope. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: skip stale inodes in xfs_iflush_clusterDave Chinner2016-06-071-0/+1
| | | | | | | | | | | | | | commit 7d3aa7fe970791f1a674b14572a411accf2f4d4e upstream. We don't write back stale inodes so we should skip them in xfs_iflush_cluster, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix inode validity check in xfs_iflush_clusterDave Chinner2016-06-071-4/+4
| | | | | | | | | | | | | | | | | | | commit 51b07f30a71c27405259a0248206ed4e22adbee2 upstream. Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs: convert inode cache lookups to use RCU locking") back in late 2010, and so xfs_iflush_cluster checks the wrong inode for whether it is still valid under RCU protection. Fix it to lock and check the correct inode. (*) Careless-idiot: Dave Chinner <dchinner@redhat.com> Discovered-by: Brain Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: xfs_iflush_cluster fails to abort on errorDave Chinner2016-06-071-4/+13
| | | | | | | | | | | | | | | | | | | | | | commit b1438f477934f5a4d5a44df26f3079a7575d5946 upstream. When a failure due to an inode buffer occurs, the error handling fails to abort the inode writeback correctly. This can result in the inode being reclaimed whilst still in the AIL, leading to use-after-free situations as well as filesystems that cannot be unmounted as the inode log items left in the AIL never get removed. Fix this by ensuring fatal errors from xfs_imap_to_bp() result in the inode flush being aborted correctly. Reported-by: Shyam Kaushik <shyam@zadarastorage.com> Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com> Tested-by: Shyam Kaushik <shyam@zadarastorage.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: Don't wrap growfs AGFL indexesDave Chinner2016-06-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | commit ad747e3b299671e1a53db74963cc6c5f6cdb9f6d upstream. Commit 96f859d ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct") allowed the freelist to use the empty slot at the end of the freelist on 64 bit systems that was not being used due to sizeof() rounding up the structure size. This has caused versions of xfs_repair prior to 4.5.0 (which also has the fix) to report this as a corruption once the filesystem has been grown. Older kernels can also have problems (seen from a whacky container/vm management environment) mounting filesystems grown on a system with a newer kernel than the vm/container it is deployed on. To avoid this problem, change the initial free list indexes not to wrap across the end of the AGFL, hence avoiding the initialisation of agf_fllast to the last index in the AGFL. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: disallow rw remount on fs with unknown ro-compat featuresEric Sandeen2016-06-071-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | commit d0a58e833931234c44e515b5b8bede32bd4e6eed upstream. Today, a kernel which refuses to mount a filesystem read-write due to unknown ro-compat features can still transition to read-write via the remount path. The old kernel is most likely none the wiser, because it's unaware of the new feature, and isn't using it. However, writing to the filesystem may well corrupt metadata related to that new feature, and moving to a newer kernel which understand the feature will have problems. Right now the only ro-compat feature we have is the free inode btree, which showed up in v3.16. It would be good to push this back to all the active stable kernels, I think, so that if anyone is using newer mkfs (which enables the finobt feature) with older kernel releases, they'll be protected. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix two memory leaks in xfs_attr_list.c error pathsMateusz Guzik2016-04-121-9/+10
| | | | | | | | | | | | | commit 2e83b79b2d6c78bf1b4aa227938a214dcbddc83f upstream. This plugs 2 trivial leaks in xfs_attr_shortform_list and xfs_attr3_leaf_list_int. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: log mount failures don't wait for buffers to be releasedDave Chinner2016-02-251-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 85bec5460ad8e05e0a8d70fb0f6750eb719ad092 upstream. Recently I've been seeing xfs/051 fail on 1k block size filesystems. Trying to trace the events during the test lead to the problem going away, indicating that it was a race condition that lead to this ASSERT failure: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156 ..... [<ffffffff814e1257>] xfs_free_perag+0x87/0xb0 [<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900 [<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0 [<ffffffff811d8800>] mount_bdev+0x180/0x1b0 [<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20 [<ffffffff811d90a8>] mount_fs+0x38/0x170 [<ffffffff811f4347>] vfs_kern_mount+0x67/0x120 [<ffffffff811f7018>] do_mount+0x218/0xd60 [<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0 When I finally caught it with tracing enabled, I saw that AG 2 had an elevated reference count and a buffer was responsible for it. I tracked down the specific buffer, and found that it was missing the final reference count release that would put it back on the LRU and hence be found by xfs_wait_buftarg() calls in the log mount failure handling. The last four traces for the buffer before the assert were (trimmed for relevance) kworker/0:1-5259 xfs_buf_iodone: hold 2 lock 0 flags ASYNC kworker/0:1-5259 xfs_buf_ioerror: hold 2 lock 0 error -5 mount-7163 xfs_buf_lock_done: hold 2 lock 0 flags ASYNC mount-7163 xfs_buf_unlock: hold 2 lock 1 flags ASYNC This is an async write that is completing, so there's nobody waiting for it directly. Hence we call xfs_buf_relse() once all the processing is complete. That does: static inline void xfs_buf_relse(xfs_buf_t *bp) { xfs_buf_unlock(bp); xfs_buf_rele(bp); } Now, it's clear that mount is waiting on the buffer lock, and that it has been released by xfs_buf_relse() and gained by mount. This is expected, because at this point the mount process is in xfs_buf_delwri_submit() waiting for all the IO it submitted to complete. The mount process, however, is waiting on the lock for the buffer because it is in xfs_buf_delwri_submit(). This waits for IO completion, but it doesn't wait for the buffer reference owned by the IO to go away. The mount process collects all the completions, fails the log recovery, and the higher level code then calls xfs_wait_buftarg() to free all the remaining buffers in the filesystem. The issue is that on unlocking the buffer, the scheduler has decided that the mount process has higher priority than the the kworker thread that is running the IO completion, and so immediately switched contexts to the mount process from the semaphore unlock code, hence preventing the kworker thread from finishing the IO completion and releasing the IO reference to the buffer. Hence by the time that xfs_wait_buftarg() is run, the buffer still has an active reference and so isn't on the LRU list that the function walks to free the remaining buffers. Hence we miss that buffer and continue onwards to tear down the mount structures, at which time we get find a stray reference count on the perag structure. On a non-debug kernel, this will be ignored and the structure torn down and freed. Hence when the kworker thread is then rescheduled and the buffer released and freed, it will access a freed perag structure. The problem here is that when the log mount fails, we still need to quiesce the log to ensure that the IO workqueues have returned to idle before we run xfs_wait_buftarg(). By synchronising the workqueues, we ensure that all IO completions are fully processed, not just to the point where buffers have been unlocked. This ensures we don't end up in the situation above. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"Dave Chinner2016-02-251-1/+0
| | | | | | | | | | | | | | | | | | | | | commit 3e85286e75224fa3f08bdad20e78c8327742634e upstream. This reverts commit 24ba16bb3d499c49974669cd8429c3e4138ab102 as it prevents machines from suspending. This regression occurs when the xfsaild is idle on entry to suspend, and so there s no activity to wake it from it's idle sleep and hence see that it is supposed to freeze. Hence the freezer times out waiting for it and suspend is cancelled. There is no obvious fix for this short of freezing the filesystem properly, so revert this change for now. Signed-off-by: Dave Chinner <david@fromorbit.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: inode recovery readahead can race with inode buffer creationDave Chinner2016-02-252-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b79f4a1c68bb99152d0785ee4ea3ab4396cdacc6 upstream. When we do inode readahead in log recovery, we do can do the readahead before we've replayed the icreate transaction that stamps the buffer with inode cores. The inode readahead verifier catches this and marks the buffer as !done to indicate that it doesn't yet contain valid inodes. In adding buffer error notification (i.e. setting b_error = -EIO at the same time as as we clear the done flag) to such a readahead verifier failure, we can then get subsequent inode recovery failing with this error: XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32 This occurs when readahead completion races with icreate item replay such as: inode readahead find buffer lock buffer submit RA io .... icreate recovery xfs_trans_get_buffer find buffer lock buffer <blocks on RA completion> ..... <ra completion> fails verifier clear XBF_DONE set bp->b_error = -EIO release and unlock buffer <icreate gains lock> icreate initialises buffer marks buffer as done adds buffer to delayed write queue releases buffer At this point, we have an initialised inode buffer that is up to date but has an -EIO state registered against it. When we finally get to recovering an inode in that buffer: inode item recovery xfs_trans_read_buffer find buffer lock buffer sees XBF_DONE is set, returns buffer sees bp->b_error is set fail log recovery! Essentially, we need xfs_trans_get_buf_map() to clear the error status of the buffer when doing a lookup. This function returns uninitialised buffers, so the buffer returned can not be in an error state and none of the code that uses this function expects b_error to be set on return. Indeed, there is an ASSERT(!bp->b_error); in the transaction case in xfs_trans_get_buf_map() that would have caught this if log recovery used transactions.... This patch firstly changes the inode readahead failure to set -EIO on the buffer, and secondly changes xfs_buf_get_map() to never return a buffer with an error state set so this first change doesn't cause unexpected log recovery failures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correctDarrick J. Wong2016-02-251-1/+1
| | | | | | | | | | | | | | | | | | | | commit 96f859d52bcb1c6ea6f3388d39862bf7143e2f30 upstream. Because struct xfs_agfl is 36 bytes long and has a 64-bit integer inside it, gcc will quietly round the structure size up to the nearest 64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE macro returning incorrect results for v5 filesystems on 64-bit machines (118 items instead of 119). As a result, a 32-bit xfs_repair will see garbage in AGFL item 119 and complain. Therefore, tell gcc not to pad the structure so that the AGFL size calculation is correct. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge branch 'for-linus-3' of ↵Linus Torvalds2015-11-131-4/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs xattr cleanups from Al Viro. * 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: f2fs: xattr simplifications squashfs: xattr simplifications 9p: xattr simplifications xattr handlers: Pass handler to operations instead of flags jffs2: Add missing capability check for listing trusted xattrs hfsplus: Remove unused xattr handler list operations ubifs: Remove unused security xattr handler vfs: Fix the posix_acl_xattr_list return value vfs: Check attribute names in posix acl xattr handers
| * xattr handlers: Pass handler to operations instead of flagsAndreas Gruenbacher2015-11-131-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | The xattr_handler operations are currently all passed a file system specific flags value which the operations can use to disambiguate between different handlers; some file systems use that to distinguish the xattr namespace, for example. In some oprations, it would be useful to also have access to the handler prefix. To allow that, pass a pointer to the handler to operations instead of the flags value alone. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'xfs-for-linus-4.4' of ↵Linus Torvalds2015-11-1161-411/+989
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "There is nothing really major here - the only significant addition is the per-mount operation statistics infrastructure. Otherwises there's various ACL, xattr, DAX, AIO and logging fixes, and a smattering of small cleanups and fixes elsewhere. Summary: - per-mount operational statistics in sysfs - fixes for concurrent aio append write submission - various logging fixes - detection of zeroed logs and invalid log sequence numbers on v5 filesystems - memory allocation failure message improvements - a bunch of xattr/ACL fixes - fdatasync optimisation - miscellaneous other fixes and cleanups" * tag 'xfs-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits) xfs: give all workqueues rescuer threads xfs: fix log recovery op header validation assert xfs: Fix error path in xfs_get_acl xfs: optimise away log forces on timestamp updates for fdatasync xfs: don't leak uuid table on rmmod xfs: invalidate cached acl if set via ioctl xfs: Plug memory leak in xfs_attrmulti_attr_set xfs: Validate the length of on-disk ACLs xfs: invalidate cached acl if set directly via xattr xfs: xfs_filemap_pmd_fault treats read faults as write faults xfs: add ->pfn_mkwrite support for DAX xfs: DAX does not use IO completion callbacks xfs: Don't use unwritten extents for DAX xfs: introduce BMAPI_ZERO for allocating zeroed extents xfs: fix inode size update overflow in xfs_map_direct() xfs: clear PF_NOFREEZE for xfsaild kthread xfs: fix an error code in xfs_fs_fill_super() xfs: stats are no longer dependent on CONFIG_PROC_FS xfs: simplify /proc teardown & error handling xfs: per-filesystem stats counter implementation ...
| * Merge branch 'xfs-misc-fixes-for-4.4-3' into for-nextDave Chinner2015-11-104-5/+6
| |\
| | * xfs: give all workqueues rescuer threadsChris Mason2015-11-101-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're consistently hitting deadlocks here with XFS on recent kernels. After some digging through the crash files, it looks like everyone in the system is waiting for XFS to reclaim memory. Something like this: PID: 2733434 TASK: ffff8808cd242800 CPU: 19 COMMAND: "java" #0 [ffff880019c53588] __schedule at ffffffff818c4df2 #1 [ffff880019c535d8] schedule at ffffffff818c5517 #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348 #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453 #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7 #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433 #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9 #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73 #10 [ffff880019c539c8] shrink_slab at ffffffff811460e6 #11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f #12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba #13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a #14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8 #15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671 #16 [ffff880019c53dd8] copy_process at ffffffff8104f781 #17 [ffff880019c53ec8] do_fork at ffffffff8105129c #18 [ffff880019c53f38] sys_clone at ffffffff810515b6 #19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting for IO, which is waiting for workers to complete the IO which is waiting for worker threads that don't exist yet: PID: 2752451 TASK: ffff880bd6bdda00 CPU: 37 COMMAND: "kworker/37:1" #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2 #1 [ffff8808d20abc00] schedule at ffffffff818c5517 #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495 #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82 #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f #6 [ffff8808d20abe40] worker_thread at ffffffff810699be #7 [ffff8808d20abec0] kthread at ffffffff8106ef59 #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8 I think we should be using WQ_MEM_RECLAIM to make sure this thread pool makes progress when we're not able to allocate new workers. [dchinner: make all workqueues WQ_MEM_RECLAIM] Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: fix log recovery op header validation assertBrian Foster2015-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 89cebc84 ("xfs: validate transaction header length on log recovery") added additional validation of the on-disk op header length to protect from buffer overflow during log recovery. It accounts for the fact that the transaction header can be split across multiple op headers. It added an assert for when this occurs that verifies the length of the second part of a split transaction header is less than a full transaction header. In other words, it expects that the first op header of a split transaction header includes at least some portion of the transaction header. This expectation is not always valid as a zero-length op header can exist for the first op header of a split transaction header (see xlog_recover_add_to_trans() for details). This means that the second op header can have a valid, full length transaction header and thus the full header is copied in xlog_recover_add_to_cont_trans(). Fix the assert in xlog_recover_add_to_cont_trans() to handle this case correctly and require that the op header length is less than or equal to a full transaction header. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: Fix error path in xfs_get_aclAndreas Gruenbacher2015-11-102-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Error codes from xfs_attr_get other than -ENOATTR were not properly reported. Fix that. In addition, the declaration of struct xfs_inode in xfs_acl.h isn't needed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | Merge branch 'xfs-dax-updates' into for-nextDave Chinner2015-11-0311-65/+225
| |\ \
| | * | xfs: xfs_filemap_pmd_fault treats read faults as write faultsDave Chinner2015-11-031-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code initially committed didn't have the same checks for write faults as the dax_pmd_fault code and hence treats all faults as write faults. We can get read faults through this path because they is no pmd_mkwrite path for write faults similar to the normal page fault path. Hence we need to ensure that we only do c/mtime updates on write faults, and freeze protection is unnecessary for read faults. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: add ->pfn_mkwrite support for DAXDave Chinner2015-11-032-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ->pfn_mkwrite support is needed so that when a page with allocated backing store takes a write fault we can check that the fault has not raced with a truncate and is pointing to a region beyond the current end of file. This also allows us to update the timestamp on the inode, too, which fixes a generic/080 failure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: DAX does not use IO completion callbacksDave Chinner2015-11-033-43/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For DAX, we are now doing block zeroing during allocation. This means we no longer need a special DAX fault IO completion callback to do unwritten extent conversion. Because mmap never extends the file size (it SEGVs the process) we don't need a callback to update the file size, either. Hence we can remove the completion callbacks from the __dax_fault and __dax_mkwrite calls. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: Don't use unwritten extents for DAXDave Chinner2015-11-032-6/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DAX has a page fault serialisation problem with block allocation. Because it allows concurrent page faults and does not have a page lock to serialise faults to the same page, it can get two concurrent faults to the page that race. When two read faults race, this isn't a huge problem as the data underlying the page is not changing and so "detect and drop" works just fine. The issues are to do with write faults. When two write faults occur, we serialise block allocation in get_blocks() so only one faul will allocate the extent. It will, however, be marked as an unwritten extent, and that is where the problem lies - the DAX fault code cannot differentiate between a block that was just allocated and a block that was preallocated and needs zeroing. The result is that both write faults end up zeroing the block and attempting to convert it back to written. The problem is that the first fault can zero and convert before the second fault starts zeroing, resulting in the zeroing for the second fault overwriting the data that the first fault wrote with zeros. The second fault then attempts to convert the unwritten extent, which is then a no-op because it's already written. Data loss occurs as a result of this race. Because there is no sane locking construct in the page fault code that we can use for serialisation across the page faults, we need to ensure block allocation and zeroing occurs atomically in the filesystem. This means we can still take concurrent page faults and the only time they will serialise is in the filesystem mapping/allocation callback. The page fault code will always see written, initialised extents, so we will be able to remove the unwritten extent handling from the DAX code when all filesystems are converted. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: introduce BMAPI_ZERO for allocating zeroed extentsDave Chinner2015-11-036-8/+97
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To enable DAX to do atomic allocation of zeroed extents, we need to drive the block zeroing deep into the allocator. Because xfs_bmapi_write() can return merged extents on allocation that were only partially allocated (i.e. requested range spans allocated and hole regions, allocation into the hole was contiguous), we cannot zero the extent returned from xfs_bmapi_write() as that can overwrite existing data with zeros. Hence we have to drive the extent zeroing into the allocation code, prior to where we merge the extents into the BMBT and return the resultant map. This means we need to propagate this need down to the xfs_alloc_vextent() and issue the block zeroing at this point. While this functionality is being introduced for DAX, there is no reason why it is specific to DAX - we can per-zero blocks during the allocation transaction on any type of device. It's just slow (and usually slower than unwritten allocation and conversion) on traditional block devices so doesn't tend to get used. We can, however, hook hardware zeroing optimisations via sb_issue_zeroout() to this operation, so it may be useful in future and hence the "allocate zeroed blocks" API needs to be implementation neutral. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: fix inode size update overflow in xfs_map_direct()Dave Chinner2015-11-033-9/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both direct IO and DAX pass an offset and count into get_blocks that will overflow a s64 variable when an IO goes into the last supported block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen from the tracing: xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096 0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that overflows the s64 offset and we fail to detect the need for a filesize update and an ioend is not allocated. This is *mostly* avoided for direct IO because such extending IOs occur with full block allocation, and so the "IS_UNWRITTEN()" check still evaluates as true and we get an ioend that way. However, doing single sector extending IOs to this last block will expose the fact that file size updates will not occur after the first allocating direct IO as the overflow will then be exposed. There is one further complexity: the DAX page fault path also exposes the same issue in block allocation. However, page faults cannot extend the file size, so in this case we want to allocate the block but do not want to allocate an ioend to enable file size update at IO completion. Hence we now need to distinguish between the direct IO patch allocation and dax fault path allocation to avoid leaking ioend structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | Merge branch 'xfs-misc-fixes-for-4.4-2' into for-nextDave Chinner2015-11-0314-16/+101
| |\ \ \ | | | |/ | | |/|
| | * | xfs: optimise away log forces on timestamp updates for fdatasyncDave Chinner2015-11-035-5/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs: timestamp updates cause excessive fdatasync log traffic Sage Weil reported that a ceph test workload was writing to the log on every fdatasync during an overwrite workload. Event tracing showed that the only metadata modification being made was the timestamp updates during the write(2) syscall, but fdatasync(2) is supposed to ignore them. The key observation was that the transactions in the log all looked like this: INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 And contained a flags field of 0x45 or 0x85, and had data and attribute forks following the inode core. This means that the timestamp updates were triggering dirty relogging of previously logged parts of the inode that hadn't yet been flushed back to disk. There are two parts to this problem. The first is that XFS relogs dirty regions in subsequent transactions, so it carries around the fields that have been dirtied since the last time the inode was written back to disk, not since the last time the inode was forced into the log. The second part is that on v5 filesystems, the inode change count update during inode dirtying also sets the XFS_ILOG_CORE flag, so on v5 filesystems this makes a timestamp update dirty the entire inode. As a result when fdatasync is run, it looks at the dirty fields in the inode, and sees more than just the timestamp flag, even though the only metadata change since the last fdatasync was just the timestamps. Hence we force the log on every subsequent fdatasync even though it is not needed. To fix this, add a new field to the inode log item that tracks changes since the last time fsync/fdatasync forced the log to flush the changes to the journal. This flag is updated when we dirty the inode, but we do it before updating the change count so it does not carry the "core dirty" flag from timestamp updates. The fields are zeroed when the inode is marked clean (due to writeback/freeing) or when an fsync/datasync forces the log. Hence if we only dirty the timestamps on the inode between fsync/fdatasync calls, the fdatasync will not trigger another log force. Over 100 runs of the test program: Ext4 baseline: runtime: 1.63s +/- 0.24s avg lat: 1.59ms +/- 0.24ms iops: ~2000 XFS, vanilla kernel: runtime: 2.45s +/- 0.18s avg lat: 2.39ms +/- 0.18ms log forces: ~400/s iops: ~1000 XFS, patched kernel: runtime: 1.49s +/- 0.26s avg lat: 1.46ms +/- 0.25ms log forces: ~30/s iops: ~1500 Reported-by: Sage Weil <sage@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: don't leak uuid table on rmmodDarrick J. Wong2015-11-033-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't leak the UUID table when the module is unloaded. (Found with kmemleak.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: invalidate cached acl if set via ioctlAndreas Gruenbacher2015-11-033-15/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setting or removing the "SGI_ACL_[FILE|DEFAULT]" attributes via the XFS_IOC_ATTRMULTI_BY_HANDLE ioctl completely bypasses the POSIX ACL infrastructure, like setting the "trusted.SGI_ACL_[FILE|DEFAULT]" xattrs did until commit 6caa1056. Similar to that commit, invalidate cached acls when setting/removing them via the ioctl as well. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: Plug memory leak in xfs_attrmulti_attr_setAndreas Gruenbacher2015-11-031-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When setting attributes via XFS_IOC_ATTRMULTI_BY_HANDLE, the user-space buffer is copied into a new kernel-space buffer via memdup_user; that buffer then isn't freed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: Validate the length of on-disk ACLsAndreas Gruenbacher2015-11-032-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct, make sure that the length of the attributes is correct as well. Also, turn the aclp parameter into a const pointer. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: invalidate cached acl if set directly via xattrBrian Foster2015-11-031-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ACLs are stored as extended attributes of the inode to which they apply. XFS converts the standard "system.posix_acl_[access|default]" attribute names used to control ACLs to "trusted.SGI_ACL_[FILE|DEFAULT]" as stored on-disk. These xattrs are directly exposed in on-disk format via getxattr/setxattr, without any ACL aware code in the path to perform validation, etc. This is partly historical and supports backup/restore applications such as xfsdump to back up and restore the binary blob that represents ACLs as-is. Andreas reports that the ACLs observed via the getfacl interface is not consistent when ACLs are set directly via the setxattr path. This occurs because the ACLs are cached in-core against the inode and the xattr path has no knowledge that the operation relates to ACLs. Update the xattr set codepath to trap writes of the special XFS ACL attributes and invalidate the associated cached ACL when this occurs. This ensures that the correct ACLs are used on a subsequent operation through the actual ACL interface. Note that this does not update or add support for setting the ACL xattrs directly beyond the restore use case that requires a correctly formatted binary blob and to restore a consistent i_mode at the same time. It is still possible for a root user to set an invalid or inconsistent (with i_mode) ACL blob on-disk and potentially cause corruption. [ With fixes from Andreas Gruenbacher. ] Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: clear PF_NOFREEZE for xfsaild kthreadJiri Kosina2015-11-021-0/+1
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since xfsaild has been converted to kthread in 0030807c, it calls try_to_freeze() during every AIL push iteration. It however doesn't set itself as freezable, and therefore this try_to_freeze() will never do anything. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark xfsaild freezable (as it can generate I/O during suspend). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | Merge branch 'xfs-stats-fixes' into for-nextDave Chinner2015-10-193-2/+4
| |\ \
| | * | xfs: fix an error code in xfs_fs_fill_super()Dan Carpenter2015-10-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If alloc_percpu() fails, we accidentally return PTR_ERR(NULL), which means success, but we intended to return -ENOMEM. Fixes: 225e4635580c ('xfs: per-filesystem stats in sysfs') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: stats are no longer dependent on CONFIG_PROC_FSDave Chinner2015-10-192-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So we need to fix the makefile to understand this, otherwise build errors with CONFIG_PROC_FS=n occur. Reported-and-tested-by: Jim Davis <jim.epost@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | Merge branch 'xfs-misc-fixes-for-4.4-1' into for-nextDave Chinner2015-10-129-13/+41
| |\ \ \
| | * | | xfs: more info from kmem deadlocks and high-level error msgsEric Sandeen2015-10-122-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In an effort to get more useful out of "possible memory allocation deadlock" messages, print the size of the requested allocation, and dump the stack if the xfs error level is tuned high. The stack dump is implemented in define_xfs_printk_level() for error levels >= LOGLEVEL_ERR, partly because it seems generically useful, and also because kmem.c has no knowledge of xfs error level tunables or other such bits, it's very kmem-specific. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | xfs: avoid dependency on Linux XATTR_SIZE_MAXJan Tulak2015-10-123-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we depends on Linux XATTR value for on disk definition. Which causes trouble on other platforms and maybe also if this value was to change. Fix it by creating a custom definition independent from those in Linux (although with the same values), so it is OK with the be16 fields used for holding these attributes. This patch reflects a change in xfsprogs. Signed-off-by: Jan Tulak <jtulak@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | xfs: prefix XATTR_LIST_MAX with XFS_Jan Tulak2015-10-123-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove a hard dependency of Linux XATTR_LIST_MAX value by using a prefixed version. This patch reflects the same change in xfsprogs. Signed-off-by: Jan Tulak <jtulak@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | libxfs: fix two comment typosGeliang Tang2015-10-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just fix two typos in code comments. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | xfs: Print name and pid when memory allocation loopsTetsuo Handa2015-10-122-5/+8
| | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds comm name and pid to warning messages printed by kmem_alloc(), kmem_zone_alloc() and xfs_buf_allocate_memory(). This will help telling which memory allocations (e.g. kernel worker threads, OOM victim tasks, neither) are stalling because these functions are passing __GFP_NOWARN which suppresses not only backtrace but comm name and pid. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | Merge branch 'xfs-io-fixes' into for-nextDave Chinner2015-10-125-20/+59
| |\ \ \
| | * | | xfs: add an xfs_zero_eof() tracepointBrian Foster2015-10-122-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a tracepoint in xfs_zero_eof() to facilitate tracking and debugging EOF zeroing events. This has proven useful in the context of other direct I/O tracepoints to ensure EOF zeroing occurs within appropriate file ranges. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | xfs: always drain dio before extending aio write submissionBrian Foster2015-10-121-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS supports and typically allows concurrent asynchronous direct I/O submission to a single file. One exception to the rule is that file extending dio writes that start beyond the current EOF (e.g., potentially create a hole at EOF) require exclusive I/O access to the file. This is because such writes must zero any pre-existing blocks beyond EOF that are exposed by virtue of now residing within EOF as a result of the write about to be submitted. Before EOF zeroing can occur, the current file i_size must be stabilized to avoid data corruption. In this scenario, XFS upgrades the iolock to exclude any further I/O submission, waits on in-flight I/O to complete to ensure i_size is up to date (i_size is updated on dio write completion) and restarts the various checks against the state of the file. The problem is that this protection sequence is triggered only when the iolock is currently held shared. While this is true for async dio in most cases, the caller may upgrade the lock in advance based on arbitrary circumstances with respect to EOF zeroing. For example, the iolock is always acquired exclusively if the start offset is not block aligned. This means that even though the iolock is already held exclusive for such I/Os, pending I/O is not drained and thus EOF zeroing can occur based on an unstable i_size. This problem has been reproduced as guest data corruption in virtual machines with file-backed qcow2 virtual disks hosted on an XFS filesystem. The virtual disks must be configured with aio=native mode and the must not be truncated out to the maximum file size (as some virt managers will do). Update xfs_file_aio_write_checks() to unconditionally drain in-flight dio before EOF zeroing can occur. Rather than trigger the wait based on iolock state, use a new flag and upgrade the iolock when necessary. Note that this results in a full restart of the inode checks even when the iolock was already held exclusive when technically it is only required to recheck i_size. This should be a rare enough occurrence that it is preferable to keep the code simple rather than create an alternate restart jump target. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | xfs: add missing ilock around dio write last extent alignmentBrian Foster2015-10-123-12/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The iomap codepath (via get_blocks()) acquires and release the inode lock in the case of a direct write that requires block allocation. This is because xfs_iomap_write_direct() allocates a transaction, which means the ilock must be dropped and reacquired after the transaction is allocated and reserved. xfs_iomap_write_direct() invokes xfs_iomap_eof_align_last_fsb() before the transaction is created and thus before the ilock is reacquired. This can lead to calls to xfs_iread_extents() and reads of the in-core extent list without any synchronization (via xfs_bmap_eof() and xfs_bmap_last_extent()). xfs_iread_extents() assert fails if the ilock is not held, but this is not currently seen in practice as the current callers had already invoked xfs_bmapi_read(). What has been seen in practice are reports of crashes down in the xfs_bmap_eof() codepath on direct writes due to seemingly bogus pointer references from xfs_iext_get_ext(). While an explicit reproducer is not currently available to confirm the cause of the problem, crash analysis and code inspection from David Jeffrey had identified the insufficient locking. xfs_iomap_eof_align_last_fsb() is called from other contexts with the inode lock already held, so we cannot acquire it therein. __xfs_get_blocks() acquires and drops the ilock with variable flags to cover the event that the extent list must be read in. The common case is that __xfs_get_blocks() acquires the shared ilock. To provide locking around the last extent alignment call without adding more lock cycles to the dio path, update xfs_iomap_write_direct() to expect the shared ilock held on entry and do the extent alignment under its protection. Demote the lock, if necessary, from __xfs_get_blocks() and push the xfs_qm_dqattach() call outside of the shared lock critical section. Also, add an assert to document that the extent list is always expected to be present in this path. Otherwise, we risk a call to xfs_iread_extents() while under the shared ilock. This is safe as all current callers have executed an xfs_bmapi_read() call under the current iolock context. Reported-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | cancel the setfilesize transation when io error happenZhaohongjiang2015-10-121-2/+11
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I ran xfstest/073 case, the remount process was blocked to wait transactions to be zero. I found there was a io error happened, and the setfilesize transaction was not released properly. We should add the changes to cancel the io error in this case. Reproduction steps: 1. dd if=/dev/zero of=xfs1.img bs=1M count=2048 2. mkfs.xfs xfs1.img 3. losetup -f ./xfs1.img /dev/loop0 4. mount -t xfs /dev/loop0 /home/test_dir/ 5. mkdir /home/test_dir/test 6. mkfs.xfs -dfile,name=image,size=2g 7. mount -t xfs -o loop image /home/test_dir/test 8. cp a file bigger than 2g to /home/test_dir/test 9. mount -t xfs -o remount,ro /home/test_dir/test [ dchinner: moved io error detection to xfs_setfilesize_ioend() after transaction context restoration. ] Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
OpenPOWER on IntegriCloud