| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 4328daa2e79ed904a42ce00a9f38b9c36b44b21a upstream.
Using request-based DM mpath configured with the following stacking
(.request_fn DM mpath ontop of scsi-mq paths):
echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq
'struct dm_rq_target_io' would leak if a request is requeued before a
blk-mq clone is allocated (or fails to allocate). free_rq_tio()
wasn't being called.
kmemleak reported:
unreferenced object 0xffff8800b90b98c0 (size 112):
comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s)
hex dump (first 32 bytes):
00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff ..\,....@.......
e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00 ...4............
backtrace:
[<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0
[<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0
[<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20
[<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170
[<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod]
[<ffffffff812fbd78>] blk_peek_request+0x168/0x290
[<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod]
[<ffffffff812f66e3>] __blk_run_queue+0x33/0x40
[<ffffffff812f9585>] blk_delay_work+0x25/0x40
[<ffffffff81096fff>] process_one_work+0x14f/0x3d0
[<ffffffff81097715>] worker_thread+0x125/0x4b0
[<ffffffff8109ce88>] kthread+0xd8/0xf0
[<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70
[<ffffffffffffffff>] 0xffffffffffffffff
crash> struct -o dm_rq_target_io
struct dm_rq_target_io {
...
}
SIZE: 112
Fixes: e5863d9ad7 ("dm: allocate requests in target when stacking on blk-mq devices")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 385277bfb57faac44e92497104ba542cdd82d5fe upstream.
When there is an error copying a chunk dm-snapshot can incorrectly hold
associated bios indefinitely, resulting in hung IO.
The function copy_callback sets pe->error if there was error copying the
chunk, and then calls complete_exception. complete_exception calls
pending_complete on error, otherwise it calls commit_exception with
commit_callback (and commit_callback calls complete_exception).
The persistent exception store (dm-snap-persistent.c) assumes that calls
to prepare_exception and commit_exception are paired.
persistent_prepare_exception increases ps->pending_count and
persistent_commit_exception decreases it.
If there is a copy error, persistent_prepare_exception is called but
persistent_commit_exception is not. This results in the variable
ps->pending_count never returning to zero and that causes some pending
exceptions (and their associated bios) to be held forever.
Fix this by unconditionally calling commit_exception regardless of
whether the copy was successful. A new "valid" parameter is added to
commit_exception -- when the copy fails this parameter is set to zero so
that the chunk that failed to copy (and all following chunks) is not
recorded in the snapshot store. Also, remove commit_callback now that
it is merely a wrapper around pending_complete.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 512167788a6fe9481a33a3cce5f80b684631a1bb upstream.
Remove the unused struct block_op pointer that was inadvertantly
introduced, via cut-and-paste of previous brb_op() code, as part of
commit 50dd842ad.
(Cc'ing stable@ because commit 50dd842ad did)
Fixes: 50dd842ad ("dm space map metadata: fix ref counting bug when bootstrapping a new space map")
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 18d03e8c25f173f4107a40d0b8c24defb6ed69f3 upstream.
When a thin pool is being destroyed delayed work items are
cancelled using cancel_delayed_work(), which doesn't guarantee that on
return the delayed item isn't running. This can cause the work item to
requeue itself on an already destroyed workqueue. Fix this by using
cancel_delayed_work_sync() which guarantees that on return the work item
is not running anymore.
Fixes: 905e51b39a555 ("dm thin: commit outstanding data every second")
Fixes: 85ad643b7e7e5 ("dm thin: add timeout to stop out-of-data-space mode holding IO forever")
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 627ccd20b4ad3ba836472468208e2ac4dfadbf03 upstream.
Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.
This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty was never triggering the searched_from_start path.
But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 8d16ce540c94c9d366eb36fc91b7154d92d6397b upstream.
Added a safeguard in the shutdown case. At least while not being
attached it is also possible to trigger a kernel bug by writing into
writeback_running. This change adds the same check before trying to
wake up the thread for that case.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit d7076f21629f8f329bca4a44dc408d94670f49e2 upstream.
Allows to use register, not register_quiet in udev to avoid "device_busy" error.
The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
<g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.
See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion.
Cc: Denis Bychkov <manover@gmail.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 2ecf0cdb2b437402110ab57546e02abfa68a716b upstream.
In bcache_init() function it forgot to unregister reboot notifier if
bcache fails to unregister a block device. This commit fixes this.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 4d4d8573a8451acc9f01cbea24b7e55f04a252fe upstream.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit fecaee6f20ee122ad75402c53d8278f9bb142ddc upstream.
This bug can be reproduced by the following script:
#!/bin/bash
bcache_sysfs="/sys/fs/bcache"
function clear_cache()
{
if [ ! -e $bcache_sysfs ]; then
echo "no bcache sysfs"
exit
fi
cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
sleep 5
sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
}
for ((i=0;i<10;i++)); do
clear_cache
done
The warning messages look like below:
[ 275.948611] ------------[ cut here ]------------
[ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W
--------------- )
[ 275.979253] Hardware name: Tecal RH2285
[ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
[ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
[ 276.089315] Call Trace:
[ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
[ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
[ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
[ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
[ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90
[ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
[ 276.338241] ------------[ cut here ]------------
[ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720
bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- )
[ 276.386017] Hardware name: Tecal RH2285
[ 276.401430] Couldn't create device <-> cache set symlinks
[ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
[ 276.482169] Call Trace:
[ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
[ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90
[ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---
We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach()
function when we attach a backing device first time. After detaching this
backing device, this flag will be true and sysfs_remove_link() isn't called in
bcache_device_unlink(). Then when we attach this backing device again,
sysfs_create_link() will return EEXIST error in bcache_device_link().
So the fix is trival and we clear this flag in bcache_device_link().
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
commit c5f1e5adf956e3ba82d204c7c141a75da9fa449a upstream.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 2ef9ccbfcb90cf84bdba320a571b18b05c41101b upstream.
Subject : [PATCH v2] bcache: fix a livelock in btree lock
Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)
This commit tries to fix a livelock in bcache. This livelock might
happen when we causes a huge number of cache misses simultaneously.
When we get a cache miss, bcache will execute the following path.
->cached_dev_make_request()
->cached_dev_read()
->cached_lookup()
->bch->btree_map_keys()
->btree_root() <------------------------
->bch_btree_map_keys_recurse() |
->cache_lookup_fn() |
->cached_dev_cache_miss() |
->bch_btree_insert_check_key() -|
[If btree->seq is not equal to seq + 1, we should return
EINTR and traverse btree again.]
In bch_btree_insert_check_key() function we first need to check upgrade
flag (op->lock == -1), and when this flag is true we need to release
read btree->lock and try to take write btree->lock. During taking and
releasing this write lock, btree->seq will be monotone increased in
order to prevent other threads modify this in cache miss (see btree.h:74).
But if there are some cache misses caused by some requested, we could
meet a livelock because btree->seq is always changed by others. Thus no
one can make progress.
This commit will try to take write btree->lock if it encounters a race
when we traverse btree. Although it sacrifice the scalability but we
can ensure that only one can modify the btree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Joshua Schmid <jschmid@suse.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 1501efadc524a0c99494b576923091589a52d2a4 upstream.
It is not safe for an integrity profile to be changed while i/o is
in-flight in the queue. Prevent adding new disks or otherwise online
spares to an array if the device has an incompatible integrity profile.
The original change to the blk_integrity_unregister implementation in
md, commmit c7bfced9a671 "md: suspend i/o during runtime
blk_integrity_unregister" introduced an immediate hang regression.
This policy of disallowing changes the integrity profile once one has
been established is shared with DM.
Here is an abbreviated log from a test run that:
1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127]
2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209]
3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277]
[ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors
[ 59.078302] md: data integrity enabled on md0
[..]
[ 90.489209] md0: incompatible integrity profile for pmem1m
[..]
[ 205.671277] md: super_written gets error=-5
[ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
[ 205.677386] md/raid1:md0: Operation continuing on 1 devices.
[ 205.683037] RAID1 conf printout:
[ 205.684699] --- wd:1 rd:2
[ 205.685972] disk 0, wo:0, o:1, dev:pmem0s
[ 205.687562] disk 1, wo:1, o:1, dev:pmem1s
[ 205.691717] md: recovery of RAID array md0
Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
md currently doesn't allow a 'sync_action' such as 'reshape' to be set
while MD_RECOVERY_NEEDED is set.
This s a problem, particularly since commit 738a273806ee as that can
cause ->check_shape to call mddev_resume() which sets
MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is
very likely that MD_RECOVERY_NEEDED is still set.
Testing for this flag is not really needed and is in any case very
racy as it can be set at any moment - asynchronously. Any race
between setting a sync_action and setting MD_RECOVERY_NEEDED must
already be handled properly in some locked code, probably
md_check_recovery(), so remove the test here.
The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case
so we should test it again after getting mddev_lock().
As this fixes a race and a regression which can cause 'reshape' to
fail, it is suitable for -stable kernels since 4.1
Reported-by: Xiao Ni <xni@redhat.com>
Fixes: 738a273806ee ("md/raid5: fix allocation of 'scribble' array.")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 2910ff17d154baa5eb50e362a91104e831eb2bb6
introduced a regression which would remove a recently added spare via
slot_store. Revert part of the patch which touches slot_store() and add
the disk directly using pers->hot_add_disk()
Fixes: 2910ff17d154 ("md: remove_and_add_spares() to activate specific
rdev")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The patch c7bfced9a6716ff66c9d61f934bb60af08d4688c committed to 4.4-rc
causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with
this BUG, the reason is that we attempt to suspend a device that is
already suspended. See also
https://bugzilla.redhat.com/show_bug.cgi?id=1283491
This patch fixes the bug by changing functions mddev_suspend and
mddev_resume to always nest.
The number of nested calls to mddev_nested_suspend is kept in the
variable mddev->suspended.
[neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend]
kernel BUG at drivers/md/md.c:317!
CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1
task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001000000000000001111 Not tainted
r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810
r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810
r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff
r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4
r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001
r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1
r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000
r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001
sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800
sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390
IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748
CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000
ORIG_R28: 00000000000000b0
IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod]
IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod]
RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1]
Backtrace:
[<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1]
[<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid]
[<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod]
[<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod]
[<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod]
[<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod]
[<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod]
[<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558
Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
|
|
|
|
|
|
|
|
| |
Neil pointed out setting journal disk role to raid_disks will confuse
reshape if we support reshape eventually. Switching the role to 0 (we
should be fine as long as the value >=0) and skip sysfs file creation to
avoid error.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The commit c31df25f20e3 ("md/raid10: make sync_request_write() call
bio_copy_data()") replaced manual data copying with bio_copy_data() but
it doesn't work as intended. The source bio (fbio) is already processed,
so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt. Because of
this, bio_copy_data() either does not copy anything, or worse, copies
data from the ->bi_next bio if it is set. This causes wrong data to be
written to drives during resync and sometimes lockups/crashes in
bio_copy_data():
[ 517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
[ 517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
[ 517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
[ 517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
[ 517.468529] RIP: 0010:[<ffffffff812e1888>] [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
[ 517.478164] RSP: 0018:ffff880150dfbc98 EFLAGS: 00000246
[ 517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
[ 517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
[ 517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
[ 517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
[ 517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
[ 517.525412] FS: 0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
[ 517.534844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
[ 517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 517.566144] Stack:
[ 517.568626] ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
[ 517.577659] 0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
[ 517.586715] ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
[ 517.595773] Call Trace:
[ 517.598747] [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
[ 517.605610] [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
[ 517.611987] [<ffffffff814ff206>] md_thread+0x106/0x140
[ 517.618072] [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
[ 517.624252] [<ffffffff814ff100>] ? super_1_load+0x520/0x520
[ 517.630817] [<ffffffff8109ef89>] kthread+0xc9/0xe0
[ 517.636506] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
[ 517.643653] [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
[ 517.649929] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: Shaohua Li <shli@kernel.org>
Cc: stable@vger.kernel.org (v4.2+)
Fixes: c31df25f20e3 ("md/raid10: make sync_request_write() call bio_copy_data()")
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If dm_btree_del()'s call to push_frame() fails, e.g. due to
btree_node_validator finding invalid metadata, the dm_btree_del() error
path must unlock all frames (which have active dm-bufio buffers) that
were pushed onto the del_stack.
Otherwise, dm_bufio_client_destroy() will BUG_ON() because dm-bufio
buffers have leaked, e.g.:
device-mapper: bufio: leaked buffer 3, hold count 1, list 0
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When applying block operations (BOPs) do not remove them from the
uncommitted BOP ring-buffer until after they've been applied -- in case
we recurse.
Also, perform BOP_INC operation, in dm_sm_metadata_create() and
sm_metadata_extend(), in terms of the uncommitted BOP ring-buffer rather
than using direct calls to sm_ll_inc().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When you take a metadata snapshot the btree roots for the mapping and
details tree need to have their reference counts incremented so they
persist for the lifetime of the metadata snap.
The roots being incremented were those currently written in the
superblock, which could possibly be out of date if concurrent IO is
triggering new mappings, breaking of sharing, etc.
Fix this by performing a commit with the metadata lock held while taking
a metadata snapshot.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
dm_btree_remove_leaves() only unmaps a contiguous region so we need a
loop, in __remove_range(), to handle ranges that contain multiple
regions.
A new btree function, dm_btree_lookup_next(), is introduced which is
more efficiently able to skip over regions of the thin device which
aren't mapped. __remove_range() uses dm_btree_lookup_next() for each
iteration of __remove_range()'s loop.
Also, improve description of dm_btree_remove_leaves().
Fixes: 6550f075 ("dm thin metadata: add dm_thin_remove_range()")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
|
|
|
|
|
|
|
|
|
|
|
|
| |
The block allocated at the start of btree_split_sibling() is never
released if later insert_at() fails.
Fix this by releasing the previously allocated bufio block using
unlock_block().
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When establishing a thin device's discard limits we cannot rely on the
underlying thin-pool device's discard capabilities (which are inherited
from the thin-pool's underlying data device) given that DM thin devices
must provide discard support even when the thin-pool's underlying data
device doesn't support discards.
Users were exposed to this thin device discard limits regression if
their thin-pool's underlying data device does _not_ support discards.
This regression caused all upper-layers that called the
blkdev_issue_discard() interface to not be able to issue discards to
thin devices (because discard_granularity was 0). This regression
wasn't caught earlier because the device-mapper-test-suite's extensive
'thin-provisioning' discard tests are only ever performed against
thin-pool's with data devices that support discards.
Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature
rather than inferring whether or not a thin device's discard support
should be enabled by looking at the thin-pool's discard_granularity.
Fixes: 216076705 ("dm thin: disable discard support for thin devices if pool's is disabled")
Reported-by: Mike Gerber <mike@sprachgewalt.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
__add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
It is possible that the processor reorders memory accesses so that
kthread_should_stop() is executed before __set_current_state(). If such
reordering happens, there is a possible race on thread termination:
CPU 0:
calls kthread_should_stop()
it tests KTHREAD_SHOULD_STOP bit, returns false
CPU 1:
calls kthread_stop(cc->write_thread)
sets the KTHREAD_SHOULD_STOP bit
calls wake_up_process on the kernel thread, that sets the thread
state to TASK_RUNNING
CPU 0:
sets __set_current_state(TASK_INTERRUPTIBLE)
spin_unlock_irq(&cc->write_thread_wait.lock)
schedule() - and the process is stuck and never terminates, because the
state is TASK_INTERRUPTIBLE and wake_up_process on CPU 1 already
terminated
Fix this race condition by using a new flag DM_CRYPT_EXIT_THREAD to
signal that the kernel thread should exit. The flag is set and tested
while holding cc->write_thread_wait.lock, so there is no possibility of
racy access to the flag.
Also, remove the unnecessary set_task_state(current, TASK_RUNNING)
following the schedule() call. When the process was woken up, its state
was already set to TASK_RUNNING. Other kernel code also doesn't set the
state to TASK_RUNNING following schedule() (for example,
do_wait_for_common in completion.c doesn't do it).
Fixes: dc2676210c42 ("dm crypt: offload writes to thread")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In multipath_prepare_ioctl(),
- pgpath is a path selected from available paths
- m->queue_io is true if we cannot send a request immediately to
paths, either because:
* there is no available path
* the path group needs activation (pg_init)
- pg_init is not started
- pg_init is still running
- m->queue_if_no_path is true if the device is configured to queue
I/O if there are no available paths
If !pgpath && !m->queue_if_no_path, the handler should return -EIO.
However in the course of refactoring the condition check has broken
and returns success in that case. Since bdev points to the dm device
itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and
recurses until crash.
You could reproduce the problem like this:
# dmsetup create mp --table '0 1024 multipath 0 0 0 0'
# sg_inq /dev/mapper/mp
<crash>
[ 172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268
[ 172.662843] PGD 19dd067 PUD 0
[ 172.666269] Thread overran stack, or stack corrupted
[ 172.671808] Oops: 0000 [#1] SMP
...
Fix the condition check with some clarifications.
Fixes: e56f81e0b01e ("dm: refactor ioctl handling")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
(Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for
targets' .prepare_ioctl to fail if they go on to check the bdev for
!NULL.
Fixes: e56f81e0b01e ("dm: refactor ioctl handling")
Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
dm-mpath retries ioctl, when no path is readily available and the device
is configured to queue I/O in such a case. If you want to stop the retry
before multipathd decides to turn off queueing mode, you could send
signal for the process to exit from the loop.
However the check of fatal signal has not carried along when commit
6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths") moved the
loop from dm-mpath to dm core. As a result, we can't terminate such
a process in the retry loop.
Easy reproducer of the situation is:
# dmsetup create mp --table '0 1024 multipath 0 0 0 0'
# dmsetup message mp 0 'queue_if_no_path'
# sg_inq /dev/mapper/mp
then you should be able to terminate sg_inq by pressing Ctrl+C.
Fixes: 6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
transition
A thin-pool that is in out-of-data-space (OODS) mode may transition back
to write mode -- without the admin adding more space to the thin-pool --
if/when blocks are released (either by deleting thin devices or
discarding provisioned blocks).
But as part of the thin-pool's earlier transition to out-of-data-space
mode the thin-pool may have set the 'error_if_no_space' flag to true if
the no_space_timeout expires without more space having been made
available. That implementation detail, of changing the pool's
error_if_no_space setting, needs to be reset back to the default that
the user specified when the thin-pool's table was loaded.
Otherwise we'll drop the user requested behaviour on the floor when this
out-of-data-space to write mode transition occurs.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Fixes: 2c43fd26e4 ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released")
Cc: stable@vger.kernel.org
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull block IO poll support from Jens Axboe:
"Various groups have been doing experimentation around IO polling for
(really) fast devices. The code has been reviewed and has been
sitting on the side for a few releases, but this is now good enough
for coordinated benchmarking and further experimentation.
Currently O_DIRECT sync read/write are supported. A framework is in
the works that allows scalable stats tracking so we can auto-tune
this. And we'll add libaio support as well soon. Fow now, it's an
opt-in feature for test purposes"
* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
direct-io: be sure to assign dio->bio_bdev for both paths
directio: add block polling support
NVMe: add blk polling support
block: add block polling support
blk-mq: return tag/queue combo in the make_request_fn handlers
block: change ->make_request_fn() and users to return a queue cookie
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull config fix for md from Neil Brown:
"New config dependency needed as md/raid5 now uses crc32c"
* tag 'md/4.4-rc0-fix' of git://neil.brown.name/md:
raid5-cache: add crc32c Kconfig dependency
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The recent change of the raid5-cache code to use crc32c instead
of crc32 causes link errors when CONFIG_LIBCRC32C is disabled:
drivers/built-in.o: In function crc32c'
core.c:(.text+0x1c6060): undefined reference to `crc32c'
This adds an explicit 'select' statement like all other users
of this function do.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5cb2fbd6ea0d ("raid5-cache: use crc32c checksum")
Signed-off-by: NeilBrown <neilb@suse.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
"Trivial stuff from trivial tree that can be trivially summed up as:
- treewide drop of spurious unlikely() before IS_ERR() from Viresh
Kumar
- cosmetic fixes (that don't really affect basic functionality of the
driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek
- various comment / printk fixes and updates all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
bcache: Really show state of work pending bit
hwmon: applesmc: fix comment typos
Kconfig: remove comment about scsi_wait_scan module
class_find_device: fix reference to argument "match"
debugfs: document that debugfs_remove*() accepts NULL and error values
net: Drop unlikely before IS_ERR(_OR_NULL)
mm: Drop unlikely before IS_ERR(_OR_NULL)
fs: Drop unlikely before IS_ERR(_OR_NULL)
drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
UBI: Update comments to reflect UBI_METAONLY flag
pktcdvd: drop null test before destroy functions
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
WORK_STRUCT_PENDING is a mask for testing the pending bit.
test_bit() expects the number of the bit and we need to
use WORK_STRUCT_PENDING_BIT there.
Also work_data_bits() is defined in workqueues.h now.
I have noticed this just by chance when looking how
WORK_STRUCT_PENDING_BIT is used. The change is compile
tested.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
"Smaller set of DM changes for this merge. I've based these changes on
Jens' for-4.4/reservations branch because the associated DM changes
required it.
- Revert a dm-multipath change that caused a regression for
unprivledged users (e.g. kvm guests) that issued ioctls when a
multipath device had no available paths.
- Include Christoph's refactoring of DM's ioctl handling and add
support for passing through persistent reservations with DM
multipath.
- All other changes are very simple cleanups"
* tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm switch: simplify conditional in alloc_region_table()
dm delay: document that offsets are specified in sectors
dm delay: capitalize the start of an delay_ctr() error message
dm delay: Use DM_MAPIO macros instead of open-coded equivalents
dm linear: remove redundant target name from error messages
dm persistent data: eliminate unnecessary return values
dm: eliminate unused "bioset" process for each bio-based DM device
dm: convert ffs to __ffs
dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()
dm: add support for passing through persistent reservations
dm: refactor ioctl handling
Revert "dm mpath: fix stalls when handling invalid ioctls"
dm: initialize non-blk-mq queue data before queue is used
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The variable sctx->nr_regions has type unsigned long and the variable
nr_regions has type sector_t.
Thus the variables may be different when overflow happens.
Changed the conditional to "if (nr_regions >= ULONG_MAX)".
Also move the assignment of nr_regions after sector_div()
and the sanity check which looks more sane.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Only delay params are mentioned in delay.txt.
Mention offsets just like documents for linear and flakey do.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
All other error messages start capitalized.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
.map function of dm-delay returns return value of delay_bio(),
hence it's supposed to return using a defined DM_MAPIO macro.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Acked-By: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit 72d94861 back in 2006 should have consistently removed
"dm-linear: " from all error messages.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
dm_bm_unlock and dm_tm_unlock return an integer value but the returned
value is always 0. The calling code sometimes checks the return value
and sometimes doesn't.
Eliminate these unnecessary return values and also the checks for them.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit 54efd50bfd873e2dbf784e0b21a8027ba4299a3e ("block: make
generic_make_request handle arbitrarily sized bios") makes it possible
for block devices to process large bios. In doing so that commit
allocates a new queue->bio_split bioset for each block device, this
bioset is used for allocating bios when the driver needs to split large
bios.
Each bioset allocates a workqueue process, thus the above commit
increases the number of processes allocated per block device.
DM doesn't need the queue->bio_split bioset, thus we can deallocate it.
This reduces the number of allocated processes per bio-based DM device
from 3 to 2. Also remove the call to blk_queue_split(), it is not
needed because DM does its own splitting.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
ffs counts bit starting with 1 (for the least significant bit), __ffs
counts bits starting with 0. This patch changes various occurrences of ffs
to __ffs and removes subtraction of 1 from the result.
Note that __ffs (unlike ffs) is not defined when called with zero
argument, but it is not called with zero argument in any of these cases.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Remove DM's unneeded NULL tests before calling these destroy functions,
now that they check for NULL, thanks to these v4.3 commits:
3942d2991 ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()")
4e3ca3e03 ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()")
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@ expression x; @@
-if (x != NULL)
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This adds support to pass through persistent reservation requests
similar to the existing ioctl handling, and with the same limitations,
e.g. devices may only have a single target attached.
This is mostly intended for multipathing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This moves the call to blkdev_ioctl and the argument checking to DM core
code, and only leaves a callout to find the block device to operate on
in the targets. This simplifies the code and allows us to pass through
ioctl-like command using other methods in the next patch.
Also split out a helper around calling the prepare_ioctl method that
will be reused for persistent reservation handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This reverts commit a1989b330093578ea5470bea0a00f940c444c466.
That commit introduced a regression at least for the case of the SG_IO ioctl()
running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
than blocking due to queue_if_no_path until a path becomes active, for example.
That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
from multipath devices; which leads to SCSI/filesystem errors in such a guest.
More general scenarios can hit that regression too. The following demonstration
employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
(some output & user changes omitted for brevity and comments added for clarity).
Reverting that commit restores normal operation (queueing) in failing scenarios;
tested on linux-next (next-20151022).
1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)
$ cat sg_simple0.c
... see [3] ...
$ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
$ gcc sgio_inquiry.c -o sgio_inquiry
2) The ioctl() works fine with active paths present.
# multipath -l 85ag56
85ag56 (...) dm-19 IBM ,2145
size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 8:0:11:0 sdz 65:144 active undef running
| `- 9:0:9:0 sdbf 67:144 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 8:0:12:0 sdae 65:224 active undef running
`- 9:0:12:0 sdbo 68:32 active undef running
$ ./sgio_inquiry /dev/mapper/85ag56
Some of the INQUIRY command's response:
IBM 2145 0000
INQUIRY duration=0 millisecs, resid=0
3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
for unprivileged users (rather than blocking due to queue_if_no_path).
# for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
do multipathd -k"fail path $path"; done
# multipath -l 85ag56
85ag56 (...) dm-19 IBM ,2145
size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 8:0:11:0 sdz 65:144 failed undef running
| `- 9:0:9:0 sdbf 67:144 failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 8:0:12:0 sdae 65:224 failed undef running
`- 9:0:12:0 sdbo 68:32 failed undef running
$ ./sgio_inquiry /dev/mapper/85ag56
sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().
$ dmesg
<...>
[] device-mapper: multipath: Failing path 65:144.
[] device-mapper: multipath: Failing path 67:144.
[] device-mapper: multipath: Failing path 65:224.
[] device-mapper: multipath: Failing path 68:32.
[] sgio_inquiry: sending ioctl 2285 to a partition!
5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
(then queueing happens -- in this example, queue_if_no_path is set);
this is due to a conditional check in scsi_verify_blk_ioctl().
# capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
# ./sgio_inquiry /dev/mapper/85ag56 &
[1] 72830
# cat /proc/72830/stack
[<c00000171c0df700>] 0xc00000171c0df700
[<c000000000015934>] __switch_to+0x204/0x350
[<c000000000152d4c>] msleep+0x5c/0x80
[<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
[<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
[<c0000000003128e4>] block_ioctl+0x64/0xd0
[<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
[<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
[<c000000000009358>] system_call+0x38/0xd0
6) This is the function call chain exercised in this analysis:
SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
-> do_vfs_ioctl()
-> vfs_ioctl()
...
error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
...
-> dm_blk_ioctl() @ drivers/md/dm.c
-> multipath_ioctl() @ drivers/md/dm-mpath.c
...
(bdev = NULL, due to no active paths)
...
if (!bdev || <...>) {
int err = scsi_verify_blk_ioctl(NULL, cmd);
if (err)
r = err;
}
...
-> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
...
if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
return 0;
...
if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
return 0;
...
printk_ratelimited(KERN_WARNING
"%s: sending ioctl %x to a partition!\n" <...>);
return -ENOIOCTLCMD;
<-
...
return r ? : <...>
<-
...
if (error == -ENOIOCTLCMD)
error = -ENOTTY;
out:
return error;
...
Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
[3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|