summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
...
* NFS: Fix I/O request leakagesTrond Myklebust2019-03-231-5/+21
| | | | | | | | | | | | | | | | | | | | | commit f57dcf4c72113c745d83f1c65f7291299f65c14f upstream. When we fail to add the request to the I/O queue, we currently leave it to the caller to free the failed request. However since some of the requests that fail are actually created by nfs_pageio_add_request() itself, and are not passed back the caller, this leads to a leakage issue, which can again cause page locks to leak. This commit addresses the leakage by freeing the created requests on error, using desc->pg_completion_ops->error_cleanup() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Fixes: a7d42ddb30997 ("nfs: add mirroring support to pgio layer") Cc: stable@vger.kernel.org # v4.0: c18b96a1b862: nfs: clean up rest of reqs Cc: stable@vger.kernel.org # v4.0: d600ad1f2bdb: NFS41: pop some layoutget Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* jbd2: fix compile warning when using JBUFFER_TRACEzhangyi (F)2019-03-231-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 01215d3edb0f384ddeaa5e4a22c1ae5ff634149f upstream. The jh pointer may be used uninitialized in the two cases below and the compiler complain about it when enabling JBUFFER_TRACE macro, fix them. In file included from fs/jbd2/transaction.c:19:0: fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’: ./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized] #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0) ^ fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here struct journal_head *jh; ^ In file included from fs/jbd2/transaction.c:19:0: fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’: ./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized] #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0) ^ fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here struct journal_head *jh; ^ Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* jbd2: clear dirty flag when revoking a buffer from an older transactionzhangyi (F)2019-03-231-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 904cdbd41d749a476863a0ca41f6f396774f26e4 upstream. Now, we capture a data corruption problem on ext4 while we're truncating an extent index block. Imaging that if we are revoking a buffer which has been journaled by the committing transaction, the buffer's jbddirty flag will not be cleared in jbd2_journal_forget(), so the commit code will set the buffer dirty flag again after refile the buffer. fsx kjournald2 jbd2_journal_commit_transaction jbd2_journal_revoke commit phase 1~5... jbd2_journal_forget belongs to older transaction commit phase 6 jbddirty not clear __jbd2_journal_refile_buffer __jbd2_journal_unfile_buffer test_clear_buffer_jbddirty mark_buffer_dirty Finally, if the freed extent index block was allocated again as data block by some other files, it may corrupt the file data after writing cached pages later, such as during unmount time. (In general, clean_bdev_aliases() related helpers should be invoked after re-allocation to prevent the above corruption, but unfortunately we missed it when zeroout the head of extra extent blocks in ext4_ext_handle_unwritten_extents()). This patch mark buffer as freed and set j_next_transaction to the new transaction when it already belongs to the committing transaction in jbd2_journal_forget(), so that commit code knows it should clear dirty bits when it is done with the buffer. This problem can be reproduced by xfstests generic/455 easily with seeds (3246 3247 3248 3249). Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext2: Fix underflow in ext2_max_size()Jan Kara2019-03-231-14/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1c2d14212b15a60300a2d4f6364753e87394c521 upstream. When ext2 filesystem is created with 64k block size, ext2_max_size() will return value less than 0. Also, we cannot write any file in this fs since the sb->maxbytes is less than 0. The core of the problem is that the size of block index tree for such large block size is more than i_blocks can carry. So fix the computation to count with this possibility. File size limits computed with the new function for the full range of possible block sizes look like: bits file_size 10 17247252480 11 275415851008 12 2196873666560 13 2197948973056 14 2198486220800 15 2198754754560 16 2198888906752 CC: stable@vger.kernel.org Reported-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix crash during online resizingJan Kara2019-03-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f96c3ac8dfc24b4e38fc4c2eba5fea2107b929d1 upstream. When computing maximum size of filesystem possible with given number of group descriptor blocks, we forget to include s_first_data_block into the number of blocks. Thus for filesystems with non-zero s_first_data_block it can happen that computed maximum filesystem size is actually lower than current filesystem size which confuses the code and eventually leads to a BUG_ON in ext4_alloc_group_tables() hitting on flex_gd->count == 0. The problem can be reproduced like: truncate -s 100g /tmp/image mkfs.ext4 -b 1024 -E resize=262144 /tmp/image 32768 mount -t ext4 -o loop /tmp/image /mnt resize2fs /dev/loop0 262145 resize2fs /dev/loop0 300000 Fix the problem by properly including s_first_data_block into the computed number of filesystem blocks. Fixes: 1c6bd7173d66 "ext4: convert file system to meta_bg if needed..." Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: add mask of ext4 flags to swapyangerkun2019-03-232-1/+8
| | | | | | | | | | | | | | | | commit abdc644e8cbac2e9b19763680e5a7cf9bab2bee7 upstream. The reason is that while swapping two inode, we swap the flags too. Some flags such as EXT4_JOURNAL_DATA_FL can really confuse the things since we're not resetting the address operations structure. The simplest way to keep things sane is to restrict the flags that can be swapped. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: update quota information while swapping boot loader inodeyangerkun2019-03-231-13/+43
| | | | | | | | | | | | | | commit aa507b5faf38784defe49f5e64605ac3c4425e26 upstream. While do swap between two inode, they swap i_data without update quota information. Also, swap_inode_boot_loader can do "revert" somtimes, so update the quota while all operations has been finished. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: cleanup pagecache before swap i_datayangerkun2019-03-231-4/+12
| | | | | | | | | | | | | | | | | commit a46c68a318b08f819047843abf349aeee5d10ac2 upstream. While do swap, we should make sure there has no new dirty page since we should swap i_data between two inode: 1.We should lock i_mmap_sem with write to avoid new pagecache from mmap read/write; 2.Change filemap_flush to filemap_write_and_wait and move them to the space protected by inode lock to avoid new pagecache from buffer read/write. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix check of inode in swap_inode_boot_loaderyangerkun2019-03-231-9/+13
| | | | | | | | | | | | | | | | | commit 67a11611e1a5211f6569044fbf8150875764d1d0 upstream. Before really do swap between inode and boot inode, something need to check to avoid invalid or not permitted operation, like does this inode has inline data. But the condition check should be protected by inode lock to avoid change while swapping. Also some other condition will not change between swapping, but there has no problem to do this under inode lock. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix deadlock between clone/dedupe and renameFilipe Manana2019-03-231-18/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4ea748e1d2c9f8a27332b949e8210dbbf392987e upstream. Reflinking (clone/dedupe) and rename are operations that operate on two inodes and therefore need to lock them in the same order to avoid ABBA deadlocks. It happens that Btrfs' reflink implementation always locked them in a different order from VFS's lock_two_nondirectories() helper, which is used by the rename code in VFS, resulting in ABBA type deadlocks. Btrfs' locking order: static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2) { if (inode1 < inode2) swap(inode1, inode2); inode_lock_nested(inode1, I_MUTEX_PARENT); inode_lock_nested(inode2, I_MUTEX_CHILD); } VFS's locking order: void lock_two_nondirectories(struct inode *inode1, struct inode *inode2) { if (inode1 > inode2) swap(inode1, inode2); if (inode1 && !S_ISDIR(inode1->i_mode)) inode_lock(inode1); if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1) inode_lock_nested(inode2, I_MUTEX_NONDIR2); } Fix this by killing the btrfs helper function that does the double inode locking and replace it with VFS's helper lock_two_nondirectories(). Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 416161db9b63e3 ("btrfs: offline dedupe") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix corruption reading shared and compressed extents after hole punchingFilipe Manana2019-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream. In the past we had data corruption when reading compressed extents that are shared within the same file and they are consecutive, this got fixed by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents"). However there was a case that was missing in those fixes, which is when the shared and compressed extents are referenced with a non-zero offset. The following shell script creates a reproducer for this issue: #!/bin/bash mkfs.btrfs -f /dev/sdc &> /dev/null mount -o compress /dev/sdc /mnt/sdc # Create a file with 3 consecutive compressed extents, each has an # uncompressed size of 128Kb and a compressed size of 4Kb. for ((i = 1; i <= 3; i++)); do head -c 4096 /dev/zero for ((j = 1; j <= 31; j++)); do head -c 4096 /dev/zero | tr '\0' "\377" done done > /mnt/sdc/foobar sync echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)" # Clone the first extent into offsets 128K and 256K. xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar sync echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)" # Punch holes into the regions that are already full of zeroes. xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar sync echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" echo "Dropping page cache..." sysctl -q vm.drop_caches=1 echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" umount /dev/sdc When running the script we get the following output: Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar linked 131072/131072 bytes at offset 131072 128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec) linked 131072/131072 bytes at offset 262144 128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec) Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Dropping page cache... Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar This happens because after reading all the pages of the extent in the range from 128K to 256K for example, we read the hole at offset 256K and then when reading the page at offset 260K we don't submit the existing bio, which is responsible for filling all the page in the range 128K to 256K only, therefore adding the pages from range 260K to 384K to the existing bio and submitting it after iterating over the entire range. Once the bio completes, the uncompressed data fills only the pages in the range 128K to 256K because there's no more data read from disk, leaving the pages in the range 260K to 384K unfilled. It is just a slightly different variant of what was solved by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents"). Fix this by forcing a bio submit, during readpages(), whenever we find a compressed extent map for a page that is different from the extent map for the previous page or has a different starting offset (in case it's the same compressed extent), instead of the extent map's original start offset. A test case for fstests follows soon. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents") Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") Cc: stable@vger.kernel.org # 4.3+ Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: init csum_list before possible freeDan Robertson2019-03-231-1/+1
| | | | | | | | | | | | | | | | | | commit e49be14b8d80e23bb7c53d78c21717a474ade76b upstream. The scrub_ctx csum_list member must be initialized before scrub_free_ctx is called. If the csum_list is not initialized beforehand, the list_empty call in scrub_free_csums will result in a null deref if the allocation fails in the for loop. Fixes: a2de733c78fa ("btrfs: scrub") CC: stable@vger.kernel.org # 3.0+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Dan Robertson <dan@dlrobertson.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: ensure that a DUP or RAID1 block group has exactly two stripesJohannes Thumshirn2019-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 349ae63f40638a28c6fce52e8447c2d14b84cc0c upstream. We recently had a customer issue with a corrupted filesystem. When trying to mount this image btrfs panicked with a division by zero in calc_stripe_length(). The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length() takes this value and divides it by the number of copies the RAID profile is expected to have to calculate the amount of data stripes. As a DUP profile is expected to have 2 copies this division resulted in 1/2 = 0. Later then the 'data_stripes' variable is used as a divisor in the stripe length calculation which results in a division by 0 and thus a kernel panic. When encountering a filesystem with a DUP block group and a 'num_stripes' value unequal to 2, refuse mounting as the image is corrupted and will lead to unexpected behaviour. Code inspection showed a RAID1 block group has the same issues. Fixes: e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: drop the lock on error in btrfs_dev_replace_cancelDan Carpenter2019-03-231-0/+1
| | | | | | | | | | | | | | | | | | | | | commit 669e859b5ea7c6f4fce0149d3907c64e550c294b upstream. We should drop the lock on this error path. This has been found by a static tool. The lock needs to be released, it's there to protect access to the dev_replace members and is not supposed to be left locked. The value of state that's being switched would need to be artifically changed to an invalid value so the default: branch is taken. Fixes: d189dd70e255 ("btrfs: fix use-after-free due to race between replace start and cancel") CC: stable@vger.kernel.org # 5.0+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: scrub: fix circular locking dependency warningAnand Jain2019-03-231-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1cec3f27168d7835ff3d23ab371cd548440131bb upstream. This fixes a longstanding lockdep warning triggered by fstests/btrfs/011. Circular locking dependency check reports warning[1], that's because the btrfs_scrub_dev() calls the stack #0 below with, the fs_info::scrub_lock held. The test case leading to this warning: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /btrfs $ btrfs scrub start -B /btrfs In fact we have fs_info::scrub_workers_refcnt to track if the init and destroy of the scrub workers are needed. So once we have incremented and decremented the fs_info::scrub_workers_refcnt value in the thread, its ok to drop the scrub_lock, and then actually do the btrfs_destroy_workqueue() part. So this patch drops the scrub_lock before calling btrfs_destroy_workqueue(). [359.258534] ====================================================== [359.260305] WARNING: possible circular locking dependency detected [359.261938] 5.0.0-rc6-default #461 Not tainted [359.263135] ------------------------------------------------------ [359.264672] btrfs/20975 is trying to acquire lock: [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540 [359.268416] [359.268416] but task is already holding lock: [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs] [359.272418] [359.272418] which lock already depends on the new lock. [359.272418] [359.274692] [359.274692] the existing dependency chain (in reverse order) is: [359.276671] [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}: [359.278187] __mutex_lock+0x86/0x9c0 [359.279086] btrfs_scrub_pause+0x31/0x100 [btrfs] [359.280421] btrfs_commit_transaction+0x1e4/0x9e0 [btrfs] [359.281931] close_ctree+0x30b/0x350 [btrfs] [359.283208] generic_shutdown_super+0x64/0x100 [359.284516] kill_anon_super+0x14/0x30 [359.285658] btrfs_kill_super+0x12/0xa0 [btrfs] [359.286964] deactivate_locked_super+0x29/0x60 [359.288242] cleanup_mnt+0x3b/0x70 [359.289310] task_work_run+0x98/0xc0 [359.290428] exit_to_usermode_loop+0x83/0x90 [359.291445] do_syscall_64+0x15b/0x180 [359.292598] entry_SYSCALL_64_after_hwframe+0x49/0xbe [359.294011] [359.294011] -> #2 (sb_internal#2){.+.+}: [359.295432] __sb_start_write+0x113/0x1d0 [359.296394] start_transaction+0x369/0x500 [btrfs] [359.297471] btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs] [359.298629] normal_work_helper+0xcd/0x530 [btrfs] [359.299698] process_one_work+0x246/0x610 [359.300898] worker_thread+0x3c/0x390 [359.302020] kthread+0x116/0x130 [359.303053] ret_from_fork+0x24/0x30 [359.304152] [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}: [359.306100] process_one_work+0x21f/0x610 [359.307302] worker_thread+0x3c/0x390 [359.308465] kthread+0x116/0x130 [359.309357] ret_from_fork+0x24/0x30 [359.310229] [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}: [359.311812] lock_acquire+0x90/0x180 [359.312929] flush_workqueue+0xaa/0x540 [359.313845] drain_workqueue+0xa1/0x180 [359.314761] destroy_workqueue+0x17/0x240 [359.315754] btrfs_destroy_workqueue+0x57/0x200 [btrfs] [359.317245] scrub_workers_put+0x2c/0x60 [btrfs] [359.318585] btrfs_scrub_dev+0x336/0x590 [btrfs] [359.319944] btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs] [359.321622] btrfs_ioctl+0x28a4/0x2e40 [btrfs] [359.322908] do_vfs_ioctl+0xa2/0x6d0 [359.324021] ksys_ioctl+0x3a/0x70 [359.325066] __x64_sys_ioctl+0x16/0x20 [359.326236] do_syscall_64+0x54/0x180 [359.327379] entry_SYSCALL_64_after_hwframe+0x49/0xbe [359.328772] [359.328772] other info that might help us debug this: [359.328772] [359.330990] Chain exists of: [359.330990] (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock [359.330990] [359.334376] Possible unsafe locking scenario: [359.334376] [359.336020] CPU0 CPU1 [359.337070] ---- ---- [359.337821] lock(&fs_info->scrub_lock); [359.338506] lock(sb_internal#2); [359.339506] lock(&fs_info->scrub_lock); [359.341461] lock((wq_completion)"%s-%s""btrfs", name); [359.342437] [359.342437] *** DEADLOCK *** [359.342437] [359.343745] 1 lock held by btrfs/20975: [359.344788] #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs] [359.346778] [359.346778] stack backtrace: [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461 [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [359.350501] Call Trace: [359.350931] dump_stack+0x67/0x90 [359.351676] print_circular_bug.isra.37.cold.56+0x15c/0x195 [359.353569] check_prev_add.constprop.44+0x4f9/0x750 [359.354849] ? check_prev_add.constprop.44+0x286/0x750 [359.356505] __lock_acquire+0xb84/0xf10 [359.357505] lock_acquire+0x90/0x180 [359.358271] ? flush_workqueue+0x87/0x540 [359.359098] flush_workqueue+0xaa/0x540 [359.359912] ? flush_workqueue+0x87/0x540 [359.360740] ? drain_workqueue+0x1e/0x180 [359.361565] ? drain_workqueue+0xa1/0x180 [359.362391] drain_workqueue+0xa1/0x180 [359.363193] destroy_workqueue+0x17/0x240 [359.364539] btrfs_destroy_workqueue+0x57/0x200 [btrfs] [359.365673] scrub_workers_put+0x2c/0x60 [btrfs] [359.366618] btrfs_scrub_dev+0x336/0x590 [btrfs] [359.367594] ? start_transaction+0xa1/0x500 [btrfs] [359.368679] btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs] [359.369545] btrfs_ioctl+0x28a4/0x2e40 [btrfs] [359.370186] ? __lock_acquire+0x263/0xf10 [359.370777] ? kvm_clock_read+0x14/0x30 [359.371392] ? kvm_sched_clock_read+0x5/0x10 [359.372248] ? sched_clock+0x5/0x10 [359.372786] ? sched_clock_cpu+0xc/0xc0 [359.373662] ? do_vfs_ioctl+0xa2/0x6d0 [359.374552] do_vfs_ioctl+0xa2/0x6d0 [359.375378] ? do_sigaction+0xff/0x250 [359.376233] ksys_ioctl+0x3a/0x70 [359.376954] __x64_sys_ioctl+0x16/0x20 [359.377772] do_syscall_64+0x54/0x180 [359.378841] entry_SYSCALL_64_after_hwframe+0x49/0xbe [359.380422] RIP: 0033:0x7f5429296a97 Backporting to older kernels: scrub_nocow_workers must be freed the same way as the others. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: setup a nofs context for memory allocation at __btrfs_set_aclFilipe Manana2019-03-231-0/+9
| | | | | | | | | | | | | | | | | commit a0873490660246db587849a9e172f2b7b21fa88a upstream. We are holding a transaction handle when setting an acl, therefore we can not allocate the xattr value buffer using GFP_KERNEL, as we could deadlock if reclaim is triggered by the allocation, therefore setup a nofs context. Fixes: 39a27ec1004e8 ("btrfs: use GFP_KERNEL for xattr and acl allocations") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()Filipe Manana2019-03-231-0/+8
| | | | | | | | | | | | | | | | | commit b89f6d1fcb30a8cbdc18ce00c7d93792076af453 upstream. We are holding a transaction handle when creating a tree, therefore we can not allocate the root using GFP_KERNEL, as we could deadlock if reclaim is triggered by the allocation, therefore setup a nofs context. Fixes: 74e4d82757f74 ("btrfs: let callers of btrfs_alloc_root pass gfp flags") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ovl: Do not lose security.capability xattr over metadata file copy-upVivek Goyal2019-03-233-22/+63
| | | | | | | | | | | | | | | | | | | | | commit 993a0b2aec52754f0897b1dab4c453be8217cae5 upstream. If a file has been copied up metadata only, and later data is copied up, upper loses any security.capability xattr it has (underlying filesystem clears it as upon file write). From a user's point of view, this is just a file copy-up and that should not result in losing security.capability xattr. Hence, before data copy up, save security.capability xattr (if any) and restore it on upper after data copy up is complete. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Fixes: 0c2888749363 ("ovl: A new xattr OVL_XATTR_METACOPY for file on upper") Cc: <stable@vger.kernel.org> # v4.19+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ovl: During copy up, first copy up data and then xattrsVivek Goyal2019-03-231-13/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5f32879ea35523b9842bdbdc0065e13635caada2 upstream. If a file with capability set (and hence security.capability xattr) is written kernel clears security.capability xattr. For overlay, during file copy up if xattrs are copied up first and then data is, copied up. This means data copy up will result in clearing of security.capability xattr file on lower has. And this can result into surprises. If a lower file has CAP_SETUID, then it should not be cleared over copy up (if nothing was actually written to file). This also creates problems with chown logic where it first copies up file and then tries to clear setuid bit. But by that time security.capability xattr is already gone (due to data copy up), and caller gets -ENODATA. This has been reported by Giuseppe here. https://github.com/containers/libpod/issues/2015#issuecomment-447824842 Fix this by copying up data first and then metadta. This is a regression which has been introduced by my commit as part of metadata only copy up patches. TODO: There will be some corner cases where a file is copied up metadata only and later data copy up happens and that will clear security.capability xattr. Something needs to be done about that too. Fixes: bd64e57586d3 ("ovl: During copy up, first copy up metadata and then data") Cc: <stable@vger.kernel.org> # v4.19+ Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* splice: don't merge into linked buffersJann Horn2019-03-232-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a0ce2f0aa6ad97c3d4927bf2ca54bcebdf062d55 upstream. Before this patch, it was possible for two pipes to affect each other after data had been transferred between them with tee(): ============ $ cat tee_test.c int main(void) { int pipe_a[2]; if (pipe(pipe_a)) err(1, "pipe"); int pipe_b[2]; if (pipe(pipe_b)) err(1, "pipe"); if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write"); if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee"); if (write(pipe_b[1], "xx", 2) != 2) err(1, "write"); char buf[5]; if (read(pipe_a[0], buf, 4) != 4) err(1, "read"); buf[4] = 0; printf("got back: '%s'\n", buf); } $ gcc -o tee_test tee_test.c $ ./tee_test got back: 'abxx' $ ============ As suggested by Al Viro, fix it by creating a separate type for non-mergeable pipe buffers, then changing the types of buffers in splice_pipe_to_pipe() and link_pipe(). Cc: <stable@vger.kernel.org> Fixes: 7c77f0b3f920 ("splice: implement pipe to pipe splicing") Fixes: 70524490ee2e ("[PATCH] splice: add support for sys_tee()") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs/devpts: always delete dcache dentry-s in dput()Varad Gautam2019-03-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 73052b0daee0b750b39af18460dfec683e4f5887 upstream. d_delete only unhashes an entry if it is reached with dentry->d_lockref.count != 1. Prior to commit 8ead9dd54716 ("devpts: more pty driver interface cleanups"), d_delete was called on a dentry from devpts_pty_kill with two references held, which would trigger the unhashing, and the subsequent dputs would release it. Commit 8ead9dd54716 reworked devpts_pty_kill to stop acquiring the second reference from d_find_alias, and the d_delete call left the dentries still on the hashed list without actually ever being dropped from dcache before explicit cleanup. This causes the number of negative dentries for devpts to pile up, and an `ls /dev/pts` invocation can take seconds to return. Provide always_delete_dentry() from simple_dentry_operations as .d_delete for devpts, to make the dentry be dropped from dcache. Without this cleanup, the number of dentries in /dev/pts/ can be grown arbitrarily as: `python -c 'import pty; pty.spawn(["ls", "/dev/pts"])'` A systemtap probe on dcache_readdir to count d_subdirs shows this count to increase with each pty spawn invocation above: probe kernel.function("dcache_readdir") { subdirs = &@cast($file->f_path->dentry, "dentry")->d_subdirs; p = subdirs; p = @cast(p, "list_head")->next; i = 0 while (p != subdirs) { p = @cast(p, "list_head")->next; i = i+1; } printf("number of dentries: %d\n", i); } Fixes: 8ead9dd54716 ("devpts: more pty driver interface cleanups") Signed-off-by: Varad Gautam <vrd@amazon.de> Reported-by: Zheng Wang <wanz@amazon.de> Reported-by: Brandon Schwartz <bsschwar@amazon.de> Root-caused-by: Maximilian Heyne <mheyne@amazon.de> Root-caused-by: Nicolas Pernas Maradei <npernas@amazon.de> CC: David Woodhouse <dwmw@amazon.co.uk> CC: Maximilian Heyne <mheyne@amazon.de> CC: Stefan Nuernberger <snu@amazon.de> CC: Amit Shah <aams@amazon.de> CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: Al Viro <viro@ZenIV.linux.org.uk> CC: Christian Brauner <christian.brauner@ubuntu.com> CC: Eric W. Biederman <ebiederm@xmission.com> CC: Matthew Wilcox <willy@infradead.org> CC: Eric Biggers <ebiggers@google.com> CC: <stable@vger.kernel.org> # 4.9+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* smb3: make default i/o size for smb3 mounts largerSteve French2019-03-235-3/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit e8506d25f740fd058791cc12a6dfa9386ada6b96 upstream. We negotiate rsize mounts (and it can be overridden by user) to typically 4MB, so using larger default I/O sizes from userspace (changing to 1MB default i/o size returned by stat) the performance is much better (and not just for long latency network connections) in most use cases for SMB3 than the default I/O size (which ends up being 128K for cp and can be even smaller for cp). This can be 4x slower or worse depending on network latency. By changing inode->blocksize from 32K (which was perhaps ok for very old SMB1/CIFS) to a larger value, 1MB (but still less than max size negotiated with the server which is 4MB, in order to minimize risk) it significantly increases performance for the noncached case, and slightly increases it for the cached case. This can be changed by the user on mount (specifying bsize= values from 16K to 16MB) to tune better for performance for applications that depend on blocksize. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* CIFS: Fix read after write for files with read cachingPavel Shilovsky2019-03-231-5/+7
| | | | | | | | | | | | | | | | | | | | | | commit 6dfbd84684700cb58b34e8602c01c12f3d2595c8 upstream. When we have a READ lease for a file and have just issued a write operation to the server we need to purge the cache and set oplock/lease level to NONE to avoid reading stale data. Currently we do that only if a write operation succedeed thus not covering cases when a request was sent to the server but a negative error code was returned later for some other reasons (e.g. -EIOCBQUEUED or -EINTR). Fix this by turning off caching regardless of the error code being returned. The patches fixes generic tests 075 and 112 from the xfs-tests. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* CIFS: Do not skip SMB2 message IDs on send failuresPavel Shilovsky2019-03-234-3/+49
| | | | | | | | | | | | | | | | | | | | commit c781af7e0c1fed9f1d0e0ec31b86f5b21a8dca17 upstream. When we hit failures during constructing MIDs or sending PDUs through the network, we end up not using message IDs assigned to the packet. The next SMB packet will skip those message IDs and continue with the next one. This behavior may lead to a server not granting us credits until we use the skipped IDs. Fix this by reverting the current ID to the original value if any errors occur before we push the packet through the network stack. This patch fixes the generic/310 test from the xfs-tests. Cc: <stable@vger.kernel.org> # 4.19.x Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* CIFS: Do not reset lease state to NONE on lease breakPavel Shilovsky2019-03-232-6/+26
| | | | | | | | | | | | | | | | | | | | | | commit 7b9b9edb49ad377b1e06abf14354c227e9ac4b06 upstream. Currently on lease break the client sets a caching level twice: when oplock is detected and when oplock is processed. While the 1st attempt sets the level to the value provided by the server, the 2nd one resets the level to None unconditionally. This happens because the oplock/lease processing code was changed to avoid races between page cache flushes and oplock breaks. The commit c11f1df5003d534 ("cifs: Wait for writebacks to complete before attempting write.") fixed the races for oplocks but didn't apply the same changes for leases resulting in overwriting the server granted value to None. Fix this by properly processing lease breaks. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* CIFS: Fix leaking locked VFS cache pages in writeback retryPavel Shilovsky2019-03-231-4/+13
| | | | | | | | | | | | | | | commit 165df9a080b6863ae286fa01780c13d87cd81076 upstream. If we don't find a writable file handle when retrying writepages we break of the loop and do not unlock and put pages neither from wdata2 nor from the original wdata. Fix this by walking through all the remaining pages and cleanup them properly. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fix cgroup_do_mount() handling of failure exitsAl Viro2019-03-231-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | commit 399504e21a10be16dd1408ba0147367d9d82a10c upstream. same story as with last May fixes in sysfs (7b745a4e4051 "unfuck sysfs_mount()"); new_sb is left uninitialized in case of early errors in kernfs_mount_ns() and papering over it by treating any error from kernfs_mount_ns() as equivalent to !new_ns ends up conflating the cases when objects had never been transferred to a superblock with ones when that has happened and resulting new superblock had been dropped. Easily fixed (same way as in sysfs case). Additionally, there's a superblock leak on kernfs_node_dentry() failure *and* a dentry leak inside kernfs_node_dentry() itself - the latter on probably impossible errors, but the former not impossible to trigger (as the matter of fact, injecting allocation failures at that point *does* trigger it). Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dax: Flush partial PMDs correctlyMatthew Wilcox2019-03-231-10/+9
| | | | | | | | | | | | | | | | | | | | commit e4b3448bc346fedf36db64124a664a959995b085 upstream. The radix tree would rewind the index in an iterator to the lowest index of a multi-slot entry. The XArray iterators instead leave the index unchanged, but I overlooked that when converting DAX from the radix tree to the XArray. Adjust the index that we use for flushing to the start of the PMD range. Fixes: c1901cd33cf4 ("page cache: Convert find_get_entries_tag to XArray") Cc: <stable@vger.kernel.org> Reported-by: Piotr Balcer <piotr.balcer@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* 9p: use inode->i_lock to protect i_size_write() under 32-bitHou Tao2019-03-235-30/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5e3cc1ee1405a7eb3487ed24f786dec01b4cbe1f upstream. Use inode->i_lock to protect i_size_write(), else i_size_read() in generic_fillattr() may loop infinitely in read_seqcount_begin() when multiple processes invoke v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under 32-bit SMP environment, and a soft lockup will be triggered as show below: watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:2217] Modules linked in: CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4 Hardware name: Generic DT based system PC is at generic_fillattr+0x104/0x108 LR is at 0xec497f00 pc : [<802b8898>] lr : [<ec497f00>] psr: 200c0013 sp : ec497e20 ip : ed608030 fp : ec497e3c r10: 00000000 r9 : ec497f00 r8 : ed608030 r7 : ec497ebc r6 : ec497f00 r5 : ee5c1550 r4 : ee005780 r3 : 0000052d r2 : 00000000 r1 : ec497f00 r0 : ed608030 Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: ac48006a DAC: 00000051 CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4 Hardware name: Generic DT based system Backtrace: [<8010d974>] (dump_backtrace) from [<8010dc88>] (show_stack+0x20/0x24) [<8010dc68>] (show_stack) from [<80a1d194>] (dump_stack+0xb0/0xdc) [<80a1d0e4>] (dump_stack) from [<80109f34>] (show_regs+0x1c/0x20) [<80109f18>] (show_regs) from [<801d0a80>] (watchdog_timer_fn+0x280/0x2f8) [<801d0800>] (watchdog_timer_fn) from [<80198658>] (__hrtimer_run_queues+0x18c/0x380) [<801984cc>] (__hrtimer_run_queues) from [<80198e60>] (hrtimer_run_queues+0xb8/0xf0) [<80198da8>] (hrtimer_run_queues) from [<801973e8>] (run_local_timers+0x28/0x64) [<801973c0>] (run_local_timers) from [<80197460>] (update_process_times+0x3c/0x6c) [<80197424>] (update_process_times) from [<801ab2b8>] (tick_nohz_handler+0xe0/0x1bc) [<801ab1d8>] (tick_nohz_handler) from [<80843050>] (arch_timer_handler_virt+0x38/0x48) [<80843018>] (arch_timer_handler_virt) from [<80180a64>] (handle_percpu_devid_irq+0x8c/0x240) [<801809d8>] (handle_percpu_devid_irq) from [<8017ac20>] (generic_handle_irq+0x34/0x44) [<8017abec>] (generic_handle_irq) from [<8017b344>] (__handle_domain_irq+0x6c/0xc4) [<8017b2d8>] (__handle_domain_irq) from [<801022e0>] (gic_handle_irq+0x4c/0x88) [<80102294>] (gic_handle_irq) from [<80101a30>] (__irq_svc+0x70/0x98) [<802b8794>] (generic_fillattr) from [<8056b284>] (v9fs_vfs_getattr_dotl+0x74/0xa4) [<8056b210>] (v9fs_vfs_getattr_dotl) from [<802b8904>] (vfs_getattr_nosec+0x68/0x7c) [<802b889c>] (vfs_getattr_nosec) from [<802b895c>] (vfs_getattr+0x44/0x48) [<802b8918>] (vfs_getattr) from [<802b8a74>] (vfs_statx+0x9c/0xec) [<802b89d8>] (vfs_statx) from [<802b9428>] (sys_lstat64+0x48/0x78) [<802b93e0>] (sys_lstat64) from [<80101000>] (ret_fast_syscall+0x0/0x28) [dominique.martinet@cea.fr: updated comment to not refer to a function in another subsystem] Link: http://lkml.kernel.org/r/20190124063514.8571-2-houtao1@huawei.com Cc: stable@vger.kernel.org Fixes: 7549ae3e81cc ("9p: Use the i_size_[read, write]() macros instead of using inode->i_size directly.") Reported-by: Xing Gaopeng <xingaopeng@huawei.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* f2fs: wait on atomic writes to count F2FS_CP_WB_DATAJaegeuk Kim2019-03-191-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 31867b23d7d1ee3535136c6a410a6cf56f666bfc upstream. Otherwise, we can get wrong counts incurring checkpoint hang. IO_W (CP: -24, Data: 24, Flush: ( 0 0 1), Discard: ( 0 0)) Thread A Thread B - f2fs_write_data_pages - __write_data_page - f2fs_submit_page_write - inc_page_count(F2FS_WB_DATA) type is F2FS_WB_DATA due to file is non-atomic one - f2fs_ioc_start_atomic_write - set_inode_flag(FI_ATOMIC_FILE) - f2fs_write_end_io - dec_page_count(F2FS_WB_CP_DATA) type is F2FS_WB_DATA due to file becomes atomic one Cc: <stable@vger.kernel.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* gfs2: Fix missed wakeups in find_insert_glockAndreas Gruenbacher2019-03-131-1/+1
| | | | | | | | | | | | | | | | | | | commit 605b0487f0bc1ae9963bf52ece0f5c8055186f81 upstream. Mark Syms has reported seeing tasks that are stuck waiting in find_insert_glock. It turns out that struct lm_lockname contains four padding bytes on 64-bit architectures that function glock_waitqueue doesn't skip when hashing the glock name. As a result, we can end up waking up the wrong waitqueue, and the waiting tasks may be stuck forever. Fix that by using ht_parms.key_len instead of sizeof(struct lm_lockname) for the key length. Reported-by: Mark Syms <mark.syms@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* exec: Fix mem leak in kernel_read_fileYueHaibing2019-03-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f612acfae86af7ecad754ae6a46019be9da05b8e upstream. syzkaller report this: BUG: memory leak unreferenced object 0xffffc9000488d000 (size 9195520): comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s) hex dump (first 32 bytes): ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00 ................ 02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff ..........z..... backtrace: [<000000000863775c>] __vmalloc_node mm/vmalloc.c:1795 [inline] [<000000000863775c>] __vmalloc_node_flags mm/vmalloc.c:1809 [inline] [<000000000863775c>] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831 [<000000003f668111>] kernel_read_file+0x58f/0x7d0 fs/exec.c:924 [<000000002385813f>] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993 [<0000000011953ff1>] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895 [<000000006f58491f>] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 [<00000000ee78baf4>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<00000000241f889b>] 0xffffffffffffffff It should goto 'out_free' lable to free allocated buf while kernel_read fails. Fixes: 39d637af5aa7 ("vfs: forbid write access when reading a file into memory") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Thibaut Sautereau <thibaut@sautereau.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* aio: Fix locking in aio_poll()Bart Van Assche2019-03-101-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d3d6a18d7d351cbcc9b33dbedf710e65f8ce1595 upstream. wake_up_locked() may but does not have to be called with interrupts disabled. Since the fuse filesystem calls wake_up_locked() without disabling interrupts aio_poll_wake() may be called with interrupts enabled. Since the kioctx.ctx_lock may be acquired from IRQ context, all code that acquires that lock from thread context must disable interrupts. Hence change the spin_trylock() call in aio_poll_wake() into a spin_trylock_irqsave() call. This patch fixes the following lockdep complaint: ===================================================== WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected 5.0.0-rc4-next-20190131 #23 Not tainted ----------------------------------------------------- syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: 0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline] 0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline] 0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline] 0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908 and this task is already holding: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline] 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline] 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline] 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908 which would create a new lock dependency: (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.} but this new dependency connects a SOFTIRQ-irq-safe lock: (&(&ctx->ctx_lock)->rlock){..-.} ... which became SOFTIRQ-irq-safe at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline] _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160 spin_lock_irq include/linux/spinlock.h:354 [inline] free_ioctx_users+0x2d/0x4a0 fs/aio.c:610 percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline] percpu_ref_put include/linux/percpu-refcount.h:301 [inline] percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline] percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2486 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline] rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780 __do_softirq+0x266/0x95a kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:654 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646 smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164 kthread+0x357/0x430 kernel/kthread.c:247 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 to a SOFTIRQ-irq-unsafe lock: (&fiq->waitq){+.+.} ... which became SOFTIRQ-irq-unsafe at: ... lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415 fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676 fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687 fuse_send_init fs/fuse/inode.c:989 [inline] fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214 mount_nodev+0x68/0x110 fs/super.c:1392 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239 legacy_get_tree+0xf2/0x200 fs/fs_context.c:590 vfs_get_tree+0x123/0x450 fs/super.c:1481 do_new_mount fs/namespace.c:2610 [inline] do_mount+0x1436/0x2c40 fs/namespace.c:2932 ksys_mount+0xdb/0x150 fs/namespace.c:3148 __do_sys_mount fs/namespace.c:3162 [inline] __se_sys_mount fs/namespace.c:3159 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fiq->waitq); local_irq_disable(); lock(&(&ctx->ctx_lock)->rlock); lock(&fiq->waitq); <Interrupt> lock(&(&ctx->ctx_lock)->rlock); *** DEADLOCK *** 1 lock held by syz-executor2/13779: #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline] #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline] #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline] #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908 the dependencies between SOFTIRQ-irq-safe lock and the holding lock: -> (&(&ctx->ctx_lock)->rlock){..-.} { IN-SOFTIRQ-W at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline] _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160 spin_lock_irq include/linux/spinlock.h:354 [inline] free_ioctx_users+0x2d/0x4a0 fs/aio.c:610 percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline] percpu_ref_put include/linux/percpu-refcount.h:301 [inline] percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline] percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2486 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline] rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780 __do_softirq+0x266/0x95a kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:654 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646 smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164 kthread+0x357/0x430 kernel/kthread.c:247 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 INITIAL USE at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline] _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160 spin_lock_irq include/linux/spinlock.h:354 [inline] __do_sys_io_cancel fs/aio.c:2052 [inline] __se_sys_io_cancel fs/aio.c:2035 [inline] __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe } ... key at: [<ffffffff8a574140>] __key.52370+0x0/0x40 ... acquired at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] aio_poll fs/aio.c:1772 [inline] __io_submit_one fs/aio.c:1875 [inline] io_submit_one+0xedf/0x1cf0 fs/aio.c:1908 __do_sys_io_submit fs/aio.c:1953 [inline] __se_sys_io_submit fs/aio.c:1923 [inline] __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: -> (&fiq->waitq){+.+.} { HARDIRQ-ON-W at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415 fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676 fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687 fuse_send_init fs/fuse/inode.c:989 [inline] fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214 mount_nodev+0x68/0x110 fs/super.c:1392 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239 legacy_get_tree+0xf2/0x200 fs/fs_context.c:590 vfs_get_tree+0x123/0x450 fs/super.c:1481 do_new_mount fs/namespace.c:2610 [inline] do_mount+0x1436/0x2c40 fs/namespace.c:2932 ksys_mount+0xdb/0x150 fs/namespace.c:3148 __do_sys_mount fs/namespace.c:3162 [inline] __se_sys_mount fs/namespace.c:3159 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe SOFTIRQ-ON-W at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415 fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676 fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687 fuse_send_init fs/fuse/inode.c:989 [inline] fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214 mount_nodev+0x68/0x110 fs/super.c:1392 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239 legacy_get_tree+0xf2/0x200 fs/fs_context.c:590 vfs_get_tree+0x123/0x450 fs/super.c:1481 do_new_mount fs/namespace.c:2610 [inline] do_mount+0x1436/0x2c40 fs/namespace.c:2932 ksys_mount+0xdb/0x150 fs/namespace.c:3148 __do_sys_mount fs/namespace.c:3162 [inline] __se_sys_mount fs/namespace.c:3159 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe INITIAL USE at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415 fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676 fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687 fuse_send_init fs/fuse/inode.c:989 [inline] fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214 mount_nodev+0x68/0x110 fs/super.c:1392 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239 legacy_get_tree+0xf2/0x200 fs/fs_context.c:590 vfs_get_tree+0x123/0x450 fs/super.c:1481 do_new_mount fs/namespace.c:2610 [inline] do_mount+0x1436/0x2c40 fs/namespace.c:2932 ksys_mount+0xdb/0x150 fs/namespace.c:3148 __do_sys_mount fs/namespace.c:3162 [inline] __se_sys_mount fs/namespace.c:3159 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe } ... key at: [<ffffffff8a60dec0>] __key.43450+0x0/0x40 ... acquired at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] aio_poll fs/aio.c:1772 [inline] __io_submit_one fs/aio.c:1875 [inline] io_submit_one+0xedf/0x1cf0 fs/aio.c:1908 __do_sys_io_submit fs/aio.c:1953 [inline] __se_sys_io_submit fs/aio.c:1923 [inline] __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe stack backtrace: CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline] check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605 check_irq_usage kernel/locking/lockdep.c:1650 [inline] check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline] check_prev_add kernel/locking/lockdep.c:1860 [inline] check_prevs_add kernel/locking/lockdep.c:1968 [inline] validate_chain kernel/locking/lockdep.c:2339 [inline] __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320 lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] aio_poll fs/aio.c:1772 [inline] __io_submit_one fs/aio.c:1875 [inline] io_submit_one+0xedf/0x1cf0 fs/aio.c:1908 __do_sys_io_submit fs/aio.c:1953 [inline] __se_sys_io_submit fs/aio.c:1923 [inline] __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org> Fixes: e8693bcfa0b4 ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19 Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> [ bvanassche: added a comment ] Reluctantly-Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2019-03-011-0/+12
|\ | | | | | | | | | | | | | | | | Merge misc fixes from Andrew Morton: "2 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: hugetlbfs: fix races and page leaks during migration kasan: turn off asan-stack for clang-8 and earlier
| * hugetlbfs: fix races and page leaks during migrationMike Kravetz2019-03-011-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hugetlb pages should only be migrated if they are 'active'. The routines set/clear_page_huge_active() modify the active state of hugetlb pages. When a new hugetlb page is allocated at fault time, set_page_huge_active is called before the page is locked. Therefore, another thread could race and migrate the page while it is being added to page table by the fault code. This race is somewhat hard to trigger, but can be seen by strategically adding udelay to simulate worst case scheduling behavior. Depending on 'how' the code races, various BUG()s could be triggered. To address this issue, simply delay the set_page_huge_active call until after the page is successfully added to the page table. Hugetlb pages can also be leaked at migration time if the pages are associated with a file in an explicitly mounted hugetlbfs filesystem. For example, consider a two node system with 4GB worth of huge pages available. A program mmaps a 2G file in a hugetlbfs filesystem. It then migrates the pages associated with the file from one node to another. When the program exits, huge page counts are as follows: node0 1024 free_hugepages 1024 nr_hugepages node1 0 free_hugepages 1024 nr_hugepages Filesystem Size Used Avail Use% Mounted on nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool That is as expected. 2G of huge pages are taken from the free_hugepages counts, and 2G is the size of the file in the explicitly mounted filesystem. If the file is then removed, the counts become: node0 1024 free_hugepages 1024 nr_hugepages node1 1024 free_hugepages 1024 nr_hugepages Filesystem Size Used Avail Use% Mounted on nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool Note that the filesystem still shows 2G of pages used, while there actually are no huge pages in use. The only way to 'fix' the filesystem accounting is to unmount the filesystem If a hugetlb page is associated with an explicitly mounted filesystem, this information in contained in the page_private field. At migration time, this information is not preserved. To fix, simply transfer page_private from old to new page at migration time if necessary. There is a related race with removing a huge page from a file and migration. When a huge page is removed from the pagecache, the page_mapping() field is cleared, yet page_private remains set until the page is actually freed by free_huge_page(). A page could be migrated while in this state. However, since page_mapping() is not set the hugetlbfs specific routine to transfer page_private is not called and we leak the page count in the filesystem. To fix that, check for this condition before migrating a huge page. If the condition is detected, return EBUSY for the page. Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> [mike.kravetz@oracle.com: v2] Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com [mike.kravetz@oracle.com: update comment and changelog] Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'for-linus-5.0-ofs1' of ↵Linus Torvalds2019-02-281-4/+0
|\ \ | |/ |/| | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs fixlet from Mike Marshall: "Remove two un-needed BUG_ONs" * tag 'for-linus-5.0-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: remove two un-needed BUG_ONs...
| * orangefs: remove two un-needed BUG_ONs...Mike Marshall2019-02-201-4/+0
| | | | | | | | Signed-off-by: Mike Marshall <hubcap@omnibond.com>
* | afs: Fix manually set volume location server listDavid Howells2019-02-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a cell with a volume location server list is added manually by echoing the details into /proc/net/afs/cells, a record is added but the flag saying it has been looked up isn't set. This causes the VL server rotation code to wait forever, with the top of /proc/pid/stack looking like: afs_select_vlserver+0x3a6/0x6f3 afs_vl_lookup_vldb+0x4b/0x92 afs_create_volume+0x25/0x1b9 ... with the thread stuck in afs_start_vl_iteration() waiting for AFS_CELL_FL_NO_LOOKUP_YET to be cleared. Fix this by clearing AFS_CELL_FL_NO_LOOKUP_YET when setting up a record if that record's details were supplied manually. Fixes: 0a5143f2f89c ("afs: Implement VL server rotation") Reported-by: Dave Botsch <dwb7@cornell.edu> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses"Linus Torvalds2019-02-251-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 9da3f2b74054406f87dff7101a569217ffceb29b. It was well-intentioned, but wrong. Overriding the exception tables for instructions for random reasons is just wrong, and that is what the new code did. It caused problems for tracing, and it caused problems for strncpy_from_user(), because the new checks made perfectly valid use cases break, rather than catch things that did bad things. Unchecked user space accesses are a problem, but that's not a reason to add invalid checks that then people have to work around with silly flags (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an odd way to say "this commit was wrong" and was sprinked into random places to hide the wrongness). The real fix to unchecked user space accesses is to get rid of the special "let's not check __get_user() and __put_user() at all" logic. Make __{get|put}_user() be just aliases to the regular {get|put}_user() functions, and make it impossible to access user space without having the proper checks in places. The raison d'être of the special double-underscore versions used to be that the range check was expensive, and if you did multiple user accesses, you'd do the range check up front (like the signal frame handling code, for example). But SMAP (on x86) and PAN (on ARM) have made that optimization pointless, because the _real_ expense is the "set CPU flag to allow user space access". Do let's not break the valid cases to catch invalid cases that shouldn't even exist. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@chromium.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'ceph-for-5.0-rc8' of git://github.com/ceph/ceph-clientLinus Torvalds2019-02-211-1/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph fixes from Ilya Dryomov: "Two bug fixes for old issues, both marked for stable" * tag 'ceph-for-5.0-rc8' of git://github.com/ceph/ceph-client: ceph: avoid repeatedly adding inode to mdsc->snap_flush_list libceph: handle an empty authorize reply
| * | ceph: avoid repeatedly adding inode to mdsc->snap_flush_listYan, Zheng2019-02-181-1/+2
| |/ | | | | | | | | | | | | | | | | Otherwise, mdsc->snap_flush_list may get corrupted. Cc: stable@vger.kernel.org Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | proc, oom: do not report alien mms when setting oom_score_adjMichal Hocko2019-02-211-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tetsuo has reported that creating a thousands of processes sharing MM without SIGHAND (aka alien threads) and setting /proc/<pid>/oom_score_adj will swamp the kernel log and takes ages [1] to finish. This is especially worrisome that all that printing is done under RCU lock and this can potentially trigger RCU stall or softlockup detector. The primary reason for the printk was to catch potential users who might depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj") but after more than 2 years without a single report I guess it is safe to simply remove the printk altogether. The next step should be moving oom_score_adj over to the mm struct and remove all the tasks crawling as suggested by [2] [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'fixes-v5.1-rc6' of ↵Linus Torvalds2019-02-201-14/+17
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull keys fixes from James Morris: - Handle quotas better, allowing full quota to be reached. - Fix the creation of shortcuts in the assoc_array internal representation when the index key needs to be an exact multiple of the machine word size. - Fix a dependency loop between the request_key contruction record and the request_key authentication key. The construction record isn't really necessary and can be dispensed with. - Set the timestamp on a new key rather than leaving it as 0. This would ordinarily be fine - provided the system clock is never set to a time before 1970 * 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: keys: Timestamp new keys keys: Fix dependency loop between construction record and auth key assoc_array: Fix shortcut creation KEYS: allow reaching the keys quotas exactly
| * | keys: Fix dependency loop between construction record and auth keyDavid Howells2019-02-151-14/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the request_key() upcall mechanism there's a dependency loop by which if a key type driver overrides the ->request_key hook and the userspace side manages to lose the authorisation key, the auth key and the internal construction record (struct key_construction) can keep each other pinned. Fix this by the following changes: (1) Killing off the construction record and using the auth key instead. (2) Including the operation name in the auth key payload and making the payload available outside of security/keys/. (3) The ->request_key hook is given the authkey instead of the cons record and operation name. Changes (2) and (3) allow the auth key to naturally be cleaned up if the keyring it is in is destroyed or cleared or the auth key is unlinked. Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.morris@microsoft.com>
* | | exec: load_script: Do not exec truncated interpreter pathKees Cook2019-02-181-9/+48
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 8099b047ecc4 ("exec: load_script: don't blindly truncate shebang string") was trying to protect against a confused exec of a truncated interpreter path. However, it was overeager and also refused to truncate arguments as well, which broke userspace, and it was reverted. This attempts the protection again, but allows arguments to remain truncated. In an effort to improve readability, helper functions and comments have been added. Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Samuel Dionne-Riel <samuel@dionne-riel.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Graham Christensen <graham@grahamc.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2019-02-161-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more nfsd fixes from Bruce Fields: "Two small fixes, one for crashes using nfs/krb5 with older enctypes, one that could prevent clients from reclaiming state after a kernel upgrade" * tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linux: sunrpc: fix 4 more call sites that were using stack memory with a scatterlist Revert "nfsd4: return default lease period"
| * | Revert "nfsd4: return default lease period"J. Bruce Fields2019-02-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit d6ebf5088f09472c1136cd506bdc27034a6763f8. I forgot that the kernel's default lease period should never be decreased! After a kernel upgrade, the kernel has no way of knowing on its own what the previous lease time was. Unless userspace tells it otherwise, it will assume the previous lease period was the same. So if we decrease this value in a kernel upgrade, we end up enforcing a grace period that's too short, and clients will fail to reclaim state in time. Symptoms may include EIO and log messages like "NFS: nfs4_reclaim_open_state: Lock reclaim failed!" There was no real justification for the lease period decrease anyway. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Fixes: d6ebf5088f09 "nfsd4: return default lease period" Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | | Merge tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds2019-02-161-5/+6
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more NFS client fixes from Anna Schumaker: "Three fixes this time. Nicolas's is for xprtrdma completion vector allocation on single-core systems. Greg's adds an error check when allocating a debugfs dentry. And Ben's is an additional fix for nfs_page_async_flush() to prevent pages from accidentally getting truncated. Summary: - Make sure Send CQ is allocated on an existing compvec - Properly check debugfs dentry before using it - Don't use page_file_mapping() after removing a page" * tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: Don't use page_file_mapping after removing the page rpc: properly check debugfs dentry before using it xprtrdma: Make sure Send CQ is allocated on an existing compvec
| * | NFS: Don't use page_file_mapping after removing the pageBenjamin Coddington2019-02-121-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If nfs_page_async_flush() removes the page from the mapping, then we can't use page_file_mapping() on it as nfs_updatepate() is wont to do when receiving an error. Instead, push the mapping to the stack before the page is possibly truncated. Fixes: 8fc75bed96bb ("NFS: Fix up return value on fatal errors in nfs_page_async_flush()") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | | Revert "exec: load_script: don't blindly truncate shebang string"Linus Torvalds2019-02-141-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 8099b047ecc431518b9bb6bdbba3549bbecdc343. It turns out that people do actually depend on the shebang string being truncated, and on the fact that an interpreter (like perl) will often just re-interpret it entirely to get the full argument list. Reported-by: Samuel Dionne-Riel <samuel@dionne-riel.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud