| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull MD fixes from Shaohua Li:
"There are several bug fixes queued:
- fix raid5-cache recovery bugs
- fix discard IO error handling for raid1/10
- fix array sync writes bogus position to superblock
- fix IO error handling for raid array with external metadata"
* tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md: be careful not lot leak internal curr_resync value into metadata. -- (all)
raid1: handle read error also in readonly mode
raid5-cache: correct condition for empty metadata write
md: report 'write_pending' state when array in sync
md/raid5: write an empty meta-block when creating log super-block
md/raid5: initialize next_checkpoint field before use
RAID10: ignore discard error
RAID1: ignore discard error
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
mddev->curr_resync usually records where the current resync is up to,
but during the starting phase it has some "magic" values.
1 - means that the array is trying to start a resync, but has yielded
to another array which shares physical devices, and also needs to
start a resync
2 - means the array is trying to start resync, but has found another
array which shares physical devices and has already started resync.
3 - means that resync has commensed, but it is possible that nothing
has actually been resynced yet.
It is important that this value not be visible to user-space and
particularly that it doesn't get written to the metadata, as the
resync or recovery checkpoint. In part, this is because it may be
slightly higher than the correct value, though this is very rare.
In part, because it is not a multiple of 4K, and some devices only
support 4K aligned accesses.
There are two places where this value is propagates into either
->curr_resync_completed or ->recovery_cp or ->recovery_offset.
These currently avoid the propagation of values 1 and 3, but will
allow 3 to leak through.
Change them to only propagate the value if it is > 3.
As this can cause an array to fail, the patch is suitable for -stable.
Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Viswesh <viswesh.vichu@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If write is the first operation on a disk and it happens not to be
aligned to page size, block layer sends read request first. If read
operation fails, the disk is set as failed as no attempt to fix the
error is made because array is in auto-readonly mode. Similarily, the
disk is set as failed for read-only array.
Take the same approach as in raid10. Don't fail the disk if array is in
readonly or auto-readonly mode. Try to redirect the request first and if
unsuccessful, return a read error.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
As long as we recover one metadata block, we should write the empty metadata
write. The original code could make recovery corrupted if only one meta is
valid.
Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If there is a bad block on a disk and there is a recovery performed from
this disk, the same bad block is reported for a new disk. It involves
setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external
metadata this flag is not being cleared as array state is reported as
'clean'. The read request to bad block in RAID5 array gets stuck as it
is waiting for a flag to be cleared - as per commit c3cce6cda162
("md/raid5: ensure device failure recorded before write request
returns.").
The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been
clarified in commit 070dc6dd7103 ("md: resolve confusion of
MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in
personality error handlers since and it doesn't fully comply with
initial purpose. It was supposed to notify that write request is about
to start, however now it is also used to request metadata update.
Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has
been set and in_sync has been set to 0 at the same time. Error handlers
just set the flag without modifying in_sync value. Sysfs array state is
a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is
set and in_sync is set to 1. Userspace has no idea it is expected to
take some action.
Swap the order that array state is checked so 'write_pending' is
reported ahead of 'clean' ('write_pending' is a misleading name but it
is too late to rename it now).
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If superblock points to an invalid meta block, r5l_load_log will set
create_super with true and create an new superblock, this runtime path
would always happen if we do no writing I/O to this array since it was
created. Writing an empty meta block could avoid this unnecessary
action at the first time we created log superblock.
Another reason is for the corretness of log recovery. Currently we have
bellow code to guarantee log revocery to be correct.
if (ctx.seq > log->last_cp_seq + 1) {
int ret;
ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
if (ret)
return ret;
log->seq = ctx.seq + 11;
log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
r5l_write_super(log, ctx.pos);
} else {
log->log_start = ctx.pos;
log->seq = ctx.seq;
}
If we just created a array with a journal device, log->log_start and
log->last_checkpoint should all be 0, then we write three meta block
which are valid except mid one and supposed crash happened. The ctx.seq
would equal to log->last_cp_seq + 1 and log->log_start would be set to
position of mid invalid meta block after we did a recovery, this will
lead to problems which could be avoided with this patch.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
No initial operation was done to this field when we
load/recovery the log, it got assignment only when IO
to raid disk was finished. So r5l_quiesce may use wrong
next_checkpoint to reclaim log space, that would make
reclaimable space calculation confused.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This is the counterpart of raid10 fix. If a write error occurs, raid10
will try to rewrite the bio in small chunk size. If the rewrite fails,
raid10 will record the error in bad block. narrow_write_error will
always use WRITE for the bio, but actually it could be a discard. Since
discard bio hasn't payload, write the bio will cause different issues.
But discard error isn't fatal, we can safely ignore it. This is what
this patch does.
This issue should exist since discard is added, but only exposed with
recent arbitrary bio size feature.
Cc: Sitsofe Wheeler <sitsofe@gmail.com>
Cc: stable@vger.kernel.org (v3.6)
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If a write error occurs, raid1 will try to rewrite the bio in small
chunk size. If the rewrite fails, raid1 will record the error in bad
block. narrow_write_error will always use WRITE for the bio, but
actually it could be a discard. Since discard bio hasn't payload, write
the bio will cause different issues. But discard error isn't fatal, we
can safely ignore it. This is what this patch does.
This issue should exist since discard is added, but only exposed with
recent arbitrary bio size feature.
Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com>
Cc: stable@vger.kernel.org (v3.6)
Signed-off-by: Shaohua Li <shli@fb.com>
|
|\ \
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- a couple DM raid and DM mirror fixes
- a couple .request_fn request-based DM NULL pointer fixes
- a fix for a DM target reference count leak, on target load error,
that prevented associated DM target kernel module(s) from being
removed
* tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm table: fix missing dm_put_target_type() in dm_table_add_target()
dm rq: clear kworker_task if kthread_run() returned an error
dm: free io_barrier after blk_cleanup_queue call
dm raid: fix activation of existing raid4/10 devices
dm mirror: use all available legs on multiple failures
dm mirror: fix read error on recovery after default leg failure
dm raid: fix compat_features validation
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
dm_get_target_type() was previously called so any error returned from
dm_table_add_target() must first call dm_put_target_type(). Otherwise
the DM target module's reference count will leak and the associated
kernel module will be unable to be removed.
Also, leverage the fact that r is already -EINVAL and remove an extra
newline.
Fixes: 36a0456 ("dm table: add immutable feature")
Fixes: cc6cbe1 ("dm table: add always writeable feature")
Fixes: 3791e2f ("dm table: add singleton feature")
Cc: stable@vger.kernel.org # 3.2+
Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
cleanup_mapped_device() calls kthread_stop() if kworker_task is
non-NULL. Currently the assigned value could be a valid task struct or
an error code (e.g -ENOMEM). Reset md->kworker_task to NULL if
kthread_run() returned an erorr.
Fixes: 7193a9defc ("dm rq: check kthread_run return for .request_fn request-based DM")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
dm_old_request_fn() has paths that access md->io_barrier. The party
destroying io_barrier should ensure that no future execution of
dm_old_request_fn() is possible. Move io_barrier destruction to below
blk_cleanup_queue() to ensure this and avoid a NULL pointer crash during
request-based DM device shutdown.
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
dm-raid 1.9.0 fails to activate existing RAID4/10 devices that have the
old superblock format (which does not have takeover/reshaping support
that was added via commit 33e53f06850f).
Fix validation path for old superblocks by reverting to the old raid4
layout and basing checks on mddev->new_{level,layout,...} members in
super_init_validation().
Cc: stable@vger.kernel.org # 4.8
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When any leg(s) have failed, any read will cause a new operational
default leg to be selected and the read is resubmitted to it. If that
new default leg fails the read too, no other still accessible legs are
used to resubmit the read again -- thus failing the io.
Fix by allowing the read to get resubmitted until all operational legs
have been exhausted. Also, remove any details.bi_dev use as a flag.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If a default leg has failed, any read will cause a new operational
default leg to be selected and the read is resubmitted. But until now
the read will return failure even though it was successful due to
resubmission. The reason for this is bio->bi_error was not being
cleared before resubmitting the bio.
Fix by clearing bio->bi_error before resubmission.
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In ecbfb9f118bce4 ("dm raid: add raid level takeover support") a new
compatible feature flag was added. Validation for these compat_features
was added but this only passes for new raid mappings with this feature
flag. This causes previously created raid mappings to be failed at
import.
Check compat_features for the only valid combination.
Fixes: ecbfb9f118bce4 ("dm raid: add raid level takeover support")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A good practice is to prefix the names of functions by the name
of the subsystem.
The kthread worker API is a mix of classic kthreads and workqueues. Each
worker has a dedicated kthread. It runs a generic function that process
queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:
__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.
Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:
+ "init" in the function names stands for the verb "initialize"
aka "initialize worker". While "INIT" in the macro names
stands for the noun "INITIALIZER" aka "worker initializer".
+ INIT() macros are used only in DEFINE() macros
+ init() functions are used close to the other kthread()
functions. It looks much better if all the functions
use the same scheme.
+ There will be also kthread_destroy_worker() that will
be used close to kthread_cancel_work(). It is related
to the init() function. Again it looks better if all
functions use the same naming scheme.
+ there are several precedents for such init() function
names, e.g. amd_iommu_init_device(), free_area_init_node(),
jump_label_init_type(), regmap_init_mmio_clk(),
+ It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"
* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
All drivers use the default, so provide an inline version of it. If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.
This provides better code generation, and better debugability as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |\
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-4.9/msi-irq
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- various fixes and cleanups for request-based DM core
- add support for delaying the requeue of requests; used by DM
multipath when all paths have failed and 'queue_if_no_path' is
enabled
- DM cache improvements to speedup the loading metadata and the writing
of the hint array
- fix potential for a dm-crypt crash on device teardown
- remove dm_bufio_cond_resched() and just using cond_resched()
- change DM multipath to return a reservation conflict error
immediately; rather than failing the path and retrying (potentially
indefinitely)
* tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
dm mpath: always return reservation conflict without failing over
dm bufio: remove dm_bufio_cond_resched()
dm crypt: fix crash on exit
dm cache metadata: switch to using the new cursor api for loading metadata
dm array: introduce cursor api
dm btree: introduce cursor api
dm cache policy smq: distribute entries to random levels when switching to smq
dm cache: speed up writing of the hint array
dm array: add dm_array_new()
dm mpath: delay the requeue of blk-mq requests while all paths down
dm mpath: use dm_mq_kick_requeue_list()
dm rq: introduce dm_mq_kick_requeue_list()
dm rq: reduce arguments passed to map_request() and dm_requeue_original_request()
dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests
dm: convert wait loops to use autoremove_wake_function()
dm: use signal_pending_state() in dm_wait_for_completion()
dm: rename task state function arguments
dm: add two lockdep_assert_held() statements
dm rq: simplify dm_old_stop_queue()
dm mpath: check if path's request_queue is dying in activate_path()
...
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If dm-mpath encounters an reservation conflict it should not fail the
path (as communication with the target is not affected) but should
rather retry on another path. However, in doing so we might be inducing
a ping-pong between paths, with no guarantee of any forward progress.
And arguably a reservation conflict is an unexpected error, so we should
be passing it upwards to allow the application to take appropriate
steps.
This change resolves a show-stopper problem seen with the pNFS SCSI
layout because it is trivial to hit reservation conflict based failover
loops without it.
Doubts were raised about the implications of this change relative to
products like IBM's SVC. But there is little point withholding a fix
for Linux because a proprietary product may or may not have some issues
in its implementation of how it interfaces with Linux. In the future,
if there is glaring evidence that this change is certainly problematic
we can revisit it.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Use cond_resched() like everybody else.
Mikulas explained why dm_bufio_cond_resched() was introduced to begin
with (hopefully cond_resched can be improved accordingly) here:
https://www.redhat.com/archives/dm-devel/2016-September/msg00112.html
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com> # added last comment in header
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
As the documentation for kthread_stop() says, "if threadfn() may call
do_exit() itself, the caller must ensure task_struct can't go away".
dm-crypt does not ensure this and therefore crashes when crypt_dtr()
calls kthread_stop(). The crash is trivially reproducible by adding a
delay before the call to kthread_stop() and just opening and closing a
dm-crypt device.
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7
task: ffff88003bd0df40 task.stack: ffff8800375b4000
RIP: 0010: kthread_stop+0x52/0x300
Call Trace:
crypt_dtr+0x77/0x120
dm_table_destroy+0x6f/0x120
__dm_destroy+0x130/0x250
dm_destroy+0x13/0x20
dev_remove+0xe6/0x120
? dev_suspend+0x250/0x250
ctl_ioctl+0x1fc/0x530
? __lock_acquire+0x24f/0x1b10
dm_ctl_ioctl+0x13/0x20
do_vfs_ioctl+0x91/0x6a0
? ____fput+0xe/0x10
? entry_SYSCALL_64_fastpath+0x5/0xbd
? trace_hardirqs_on_caller+0x151/0x1e0
SyS_ioctl+0x41/0x70
entry_SYSCALL_64_fastpath+0x1f/0xbd
This problem was introduced by bcbd94ff481e ("dm crypt: fix a possible
hang due to race condition on exit").
Looking at the description of that patch (excerpted below), it seems
like the problem it addresses can be solved by just using
set_current_state instead of __set_current_state, since we obviously
need the memory barrier.
| dm crypt: fix a possible hang due to race condition on exit
|
| A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
| __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
| It is possible that the processor reorders memory accesses so that
| kthread_should_stop() is executed before __set_current_state(). If
| such reordering happens, there is a possible race on thread
| termination: [...]
So this patch just reverts the aforementioned patch and changes the
__set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...). This
fixes the crash and should also fix the potential hang.
Fixes: bcbd94ff481e ("dm crypt: fix a possible hang due to race condition on exit")
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This change offers a pretty significant performance improvement.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
More efficient way to iterate an array due to prefetching (makes use of
the new dm_btree_cursor_* api).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This uses prefetching to speed up iteration through a btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
For smq the 32 bit 'hint' stores the multiqueue level that the entry
should be stored in. If a different policy has been used previously,
and then switched to smq, the hints will be invalid. In which case we
used to put all entries in the bottom level of the multiqueue, and then
redistribute. Redistribution is faster if we put entries with invalid
hints in random levels initially.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
It's far quicker to always delete the hint array and recreate with
dm_array_new() because we avoid the copying caused by mutation.
Also simplifies the policy interface, replacing the walk_hints() with
the simpler get_hint().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
dm_array_new() creates a new, populated array more efficiently than
starting with an empty one and resizing.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Return DM_MAPIO_DELAY_REQUEUE from .clone_and_map_rq. Also, return
false from .busy, if all paths are down, so that blk-mq requests get
mapped via .clone_and_map_rq -- which results in DM_MAPIO_DELAY_REQUEUE
being returned to dm-rq.
This change allows for a noticeable reduction in cpu utilization
(reduced kworker load) while all paths are down, e.g.:
system CPU idleness (as measured by fio's --idle-prof=system):
before: system: 86.58%
after: system: 98.60%
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When reinstating a path the blk-mq request_queue's requeue_list should
get kicked. It makes sense to kick the requeue_list as part of the
existing hook (previously only used by bio-based support).
Rename process_queued_bios_list to process_queued_io_list.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Make it possible for a request-based target to kick the DM device's
blk-mq request_queue's requeue_list.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
dm_requeue_original_request()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Otherwise blk-mq will immediately dispatch requests that are requeued
via a BLK_MQ_RQ_QUEUE_BUSY return from blk_mq_ops .queue_rq.
Delayed requeue is implemented using blk_mq_delay_kick_requeue_list()
with a delay of 5 secs. In the context of DM multipath (all paths down)
it doesn't make any sense to requeue more quickly.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Use autoremove_wake_function() instead of default_wake_function()
to make the dm wait loops more similar to other wait loops in the
kernel. This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Use signal_pending_state() instead of open-coding it. This patch does
not change any functionality but makes it possible to pass TASK_KILLABLE
as the second argument of dm_wait_for_completion(). See also commit
16882c1e962b ("sched: fix TASK_WAKEKILL vs SIGKILL race").
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Rename 'interruptible' into 'task_state' to make it clear that this
argument is a task state instead of a boolean. Also, change type from
int to long.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Document the locking assumptions for the __bind() and __dm_suspend()
functions.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If pg_init_retries is set and a request is queued against a multipath
device with all underlying block device request_queues in the "dying"
state then an infinite loop is triggered because activate_path() never
succeeds and hence never calls pg_init_done().
This change avoids that device removal triggers an infinite loop by
failing the activate_path() which causes the "dying" path to be failed.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Every call of queue_flag_clear_unlocked() after block device
initialization has finished is wrong if blk_cleanup_queue() can be
called concurrently. Convert queue_flag_clear_unlocked() into
queue_flag_clear() and protect it by the block layer queue lock.
Also, factor out dm_mq_start_queue().
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Also, check that the blk-mq request_queue isn't already stopped.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This avoids that new requests are queued while __dm_destroy() is in
progress.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
dm_resume() will return success (0) rather than -EINVAL if
!dm_suspended_md() upon retry within dm_resume().
Reset the error code at the start of dm_resume()'s retry loop.
Also, remove a useless assignment at the end of dm_resume().
Fixes: ffcc393641 ("dm: enhance internal suspend and resume interface")
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull block layer updates from Jens Axboe:
"This is the main pull request for block layer changes in 4.9.
As mentioned at the last merge window, I've changed things up and now
do just one branch for core block layer changes, and driver changes.
This avoids dependencies between the two branches. Outside of this
main pull request, there are two topical branches coming as well.
This pull request contains:
- A set of fixes, and a conversion to blk-mq, of nbd. From Josef.
- Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
Followup dependency fix from Geert.
- General fixes from Bart, Baoyou, Guoqing, and Linus W.
- CFQ async write starvation fix from Glauber.
- Add supprot for delayed kick of the requeue list, from Mike.
- Pull out the scalable bitmap code from blk-mq-tag.c and make it
generally available under the name of sbitmap. Only blk-mq-tag uses
it for now, but the blk-mq scheduling bits will use it as well.
From Omar.
- bdev thaw error progagation from Pierre.
- Improve the blk polling statistics, and allow the user to clear
them. From Stephen.
- Set of minor cleanups from Christoph in block/blk-mq.
- Set of cleanups and optimizations from me for block/blk-mq.
- Various nvme/nvmet/nvmeof fixes from the various folks"
* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
fs/block_dev.c: return the right error in thaw_bdev()
nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
nvme/scsi: Remove power management support
nvmet: Make dsm number of ranges zero based
nvmet: Use direct IO for writes
admin-cmd: Added smart-log command support.
nvme-fabrics: Add host_traddr options field to host infrastructure
nvme-fabrics: revise host transport option descriptions
nvme-fabrics: rework nvmf_get_address() for variable options
nbd: use BLK_MQ_F_BLOCKING
blkcg: Annotate blkg_hint correctly
cfq: fix starvation of asynchronous writes
blk-mq: add flag for drivers wanting blocking ->queue_rq()
blk-mq: remove non-blocking pass in blk_mq_map_request
blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
block: export bio_free_pages to other modules
lightnvm: propagate device_add() error code
lightnvm: expose device geometry through sysfs
lightnvm: control life of nvm_dev in driver
blk-mq: register device instead of disk
...
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
bio_free_pages is introduced in commit 1dfa0f68c040
("block: add a helper to free bio bounce buffer pages"),
we can reuse the func in other modules after it was
imported.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Enable devices without a gendisk instance to register itself with blk-mq
and expose the associated multi-queue sysfs entries.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Introduce the bio_flags() macro. Ensure that the second argument of
bio_set_op_attrs() only contains flags and no operation. This patch
does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|