summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* md/raid5: add thread_group worker async_tx_issue_pending_allOfer Heifetz2017-08-061-0/+2
| | | | | | | | | | | | | | | | | | | | commit 7e96d559634b73a8158ee99a7abece2eacec2668 upstream. Since thread_group worker and raid5d kthread are not in sync, if worker writes stripe before raid5d then requests will be waiting for issue_pendig. Issue observed when building raid5 with ext4, in some build runs jbd2 would get hung and requests were waiting in the HW engine waiting to be issued. Fix this by adding a call to async_tx_issue_pending_all in the raid5_do_work. Signed-off-by: Ofer Heifetz <oferh@marvell.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md/raid1: fix writebehind bio cloneShaohua Li2017-08-061-21/+13
| | | | | | | | | | | | | | | | | | commit 16d56e2fcc1fc15b981369653c3b41d7ff0b443d upstream. After bio is submitted, we should not clone it as its bi_iter might be invalid by driver. This is the case of behind_master_bio. In certain situration, we could dispatch behind_master_bio immediately for the first disk and then clone it for other disks. https://bugzilla.kernel.org/show_bug.cgi?id=196383 Reported-and-tested-by: Markus <m4rkusxxl@web.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Fix: 841c1316c7da(md: raid1: improve write behind) Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md: remove 'idx' from 'struct resync_pages'Ming Lei2017-08-063-7/+6
| | | | | | | | | | | | | | | | | | | | | commit 022e510fcbda79183fd2cdc01abb01b4be80d03f upstream. bio_add_page() won't fail for resync bio, and the page index for each bio is same, so remove it. More importantly the 'idx' of 'struct resync_pages' is initialized in mempool allocator function, the current way is wrong since mempool is only responsible for allocation, we can't use that for initialization. Suggested-by: NeilBrown <neilb@suse.com> Reported-by: NeilBrown <neilb@suse.com> Reported-and-tested-by: Patrick <dto@gmx.net> Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync pages) Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync pages) Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dm integrity: test for corrupted disk format during table loadMikulas Patocka2017-08-061-0/+5
| | | | | | | | | | | | | | | | commit bc86a41e96c5b6f07453c405e036d95acc673389 upstream. If the dm-integrity superblock was corrupted in such a way that the journal_sections field was zero, the integrity target would deadlock because it would wait forever for free space in the journal. Detect this situation and refuse to activate the device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 7eada909bfd7 ("dm: add integrity target") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dm integrity: fix inefficient allocation of journal spaceMikulas Patocka2017-08-061-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | commit 9dd59727dbc21d33a9add4c5b308a5775cd5a6ef upstream. When using a block size greater than 512 bytes, the dm-integrity target allocates journal space inefficiently. It allocates one journal entry for each 512-byte chunk of data, fills an entry for each block of data and leaves the remaining entries unused. This issue doesn't cause data corruption, but all the unused journal entries degrade performance severely. For example, with 4k blocks and an 8k bio, it would allocate 16 journal entries but only use 2 entries. The remaining 14 entries were left unused. Fix this by adding the missing 'log2_sectors_per_block' shifts that are required to have each journal entry map to a full block. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 7eada909bfd7 ("dm: add integrity target") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Raid5 should update rdev->sectors after reshapeXiao Ni2017-07-271-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b5d27718f38843a74552e9a93d32e2391fd3999f upstream. The raid5 md device is created by the disks which we don't use the total size. For example, the size of the device is 5G and it just uses 3G of the devices to create one raid5 device. Then change the chunksize and wait reshape to finish. After reshape finishing stop the raid and assemble it again. It fails. mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean mdadm /dev/md0 --grow --chunk=64 wait reshape to finish mdadm -S /dev/md0 mdadm -As The error messages: [197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing! [197519.821686] md: md_import_device returned -22 After reshape the data offset is changed. It selects backwards direction in this condition. In function super_1_load it compares the available space of the underlying device with sb->data_size. The new data offset gets bigger after reshape. So super_1_load returns -EINVAL. rdev->sectors is updated in md_finish_reshape. Then sb->data_size is set in super_1_sync based on rdev->sectors. So add md_finish_reshape in end_reshape. Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dm raid: stop using BUG() in __rdev_sectors()Heinz Mauelshagen2017-07-271-3/+10
| | | | | | | | | | | | | commit 4d49f1b4a1fcab16b6dd1c79ef14f2b6531d50a6 upstream. Return 0 rather than BUG() if __rdev_sectors() fails and catch invalid rdev size in the constructor. Reported-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md: fix deadlock between mddev_suspend() and md_write_start()NeilBrown2017-07-279-35/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit cc27b0c78c79680d128dbac79de0d40556d041bb upstream. If mddev_suspend() races with md_write_start() we can deadlock with mddev_suspend() waiting for the request that is currently in md_write_start() to complete the ->make_request() call, and md_write_start() waiting for the metadata to be updated to mark the array as 'dirty'. As metadata updates done by md_check_recovery() only happen then the mddev_lock() can be claimed, and as mddev_suspend() is often called with the lock held, these threads wait indefinitely for each other. We fix this by having md_write_start() abort if mddev_suspend() is happening, and ->make_request() aborts if md_write_start() aborted. md_make_request() can detect this abort, decrease the ->active_io count, and wait for mddev_suspend(). Reported-by: Nix <nix@esperi.org.uk> Fix: 68866e425be2(MD: no sync IO while suspended) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md: don't use flush_signals in userspace processesMikulas Patocka2017-07-272-2/+8
| | | | | | | | | | | | | | | | | | | | | | commit f9c79bc05a2a91f4fba8bfd653579e066714b1ec upstream. The function flush_signals clears all pending signals for the process. It may be used by kernel threads when we need to prepare a kernel thread for responding to signals. However using this function for an userspaces processes is incorrect - clearing signals without the program expecting it can cause misbehavior. The raid1 and raid5 code uses flush_signals in its request routine because it wants to prepare for an interruptible wait. This patch drops flush_signals and uses sigprocmask instead to block all signals (including SIGKILL) around the schedule() call. The signals are not lost, but the schedule() call won't respond to them. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dm thin: do not queue freed thin mapping for next stage processingVallish Vaidyeshwara2017-06-271-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | process_prepared_discard_passdown_pt1() should cleanup dm_thin_new_mapping in cases of error. dm_pool_inc_data_range() can fail trying to get a block reference: metadata operation 'dm_pool_inc_data_range' failed: error = -61 When dm_pool_inc_data_range() fails, dm thin aborts current metadata transaction and marks pool as PM_READ_ONLY. Memory for thin mapping is released as well. However, current thin mapping will be queued onto next stage as part of queue_passdown_pt2() or passdown_endio(). This dangling thin mapping memory when processed and accessed in next stage will lead to device mapper crashing. Code flow without fix: -> process_prepared_discard_passdown_pt1(m) -> dm_thin_remove_range() -> discard passdown --> passdown_endio(m) queues m onto next stage -> dm_pool_inc_data_range() fails, frees memory m but does not remove it from next stage queue -> process_prepared_discard_passdown_pt2(m) -> processes freed memory m and crashes One such stack: Call Trace: [<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison] [<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool] [<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool] [<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool] [<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool] [<ffffffff8152bf54>] ? __schedule+0x244/0x680 [<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0 [<ffffffff81089f53>] process_one_work+0x153/0x3f0 [<ffffffff8108a71b>] worker_thread+0x12b/0x4b0 [<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350 [<ffffffff8108fd6a>] kthread+0xca/0xe0 [<ffffffff8108fca0>] ? kthread_park+0x60/0x60 [<ffffffff81530b45>] ret_from_fork+0x25/0x30 The fix is to first take the block ref count for discarded block and then do a passdown discard of this block. If block ref count fails, then bail out aborting current metadata transaction, mark pool as PM_READ_ONLY and also free current thin mapping memory (existing error handling code) without queueing this thin mapping onto next stage of processing. If block ref count succeeds, then passdown discard of this block. Discard callback of passdown_endio() will queue this thin mapping onto next stage of processing. Code flow with fix: -> process_prepared_discard_passdown_pt1(m) -> dm_thin_remove_range() -> dm_pool_inc_data_range() --> if fails, free memory m and bail out -> discard passdown --> passdown_endio(m) queues m onto next stage Cc: stable <stable@vger.kernel.org> # v4.9+ Reviewed-by: Eduardo Valentin <eduval@amazon.com> Reviewed-by: Cristian Gafton <gafton@amazon.com> Reviewed-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Vallish Vaidyeshwara <vallish@amazon.com> Reviewed-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm raid: fix oops on upgrading to extended superblock formatHeinz Mauelshagen2017-06-231-3/+14
| | | | | | | | | | | | | | | | | When a RAID set was created on dm-raid version < 1.9.0 (old RAID superblock format), all of the new 1.9.0 members of the superblock are uninitialized (zero) -- including the device sectors member needed to support shrinking. All the other accesses to superblock fields new in 1.9.0 were reviewed and verified to be properly guarded against invalid use. The 'sectors' member was the only one used when the superblock version is < 1.9. Don't access the superblock's >= 1.9.0 'sectors' member unconditionally. Also add respective comments. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm io: fix duplicate bio completion due to missing ref countMike Snitzer2017-06-211-2/+2
| | | | | | | | | | | | | | | If only a subset of the devices associated with multiple regions support a given special operation (eg. DISCARD) then the dec_count() that is used to set error for the region must increment the io->count. Otherwise, when the dec_count() is called it can cause the dm-io caller's bio to be completed multiple times. As was reported against the dm-mirror target that had mirror legs with a mix of discard capabilities. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=196077 Reported-by: Zhang Yi <yizhan@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm integrity: fix to not disable/enable interrupts from interrupt contextMike Snitzer2017-06-211-2/+5
| | | | | | | | | | | | | | Use spin_lock_irqsave and spin_unlock_irqrestore rather than spin_{lock,unlock}_irq in submit_flush_bio(). Otherwise lockdep issues the following warning: DEBUG_LOCKS_WARN_ON(current->hardirq_context) WARNING: CPU: 1 PID: 0 at kernel/locking/lockdep.c:2748 trace_hardirqs_on_caller+0x107/0x180 Reported-by: Ondrej Kozina <okozina@redhat.com> Tested-by: Ondrej Kozina <okozina@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com>
* Revert "dm mirror: use all available legs on multiple failures"Mike Snitzer2017-06-151-2/+19
| | | | | | | | | | | | This reverts commit 12a7cf5ba6c776a2621d8972c7d42e8d3d959d20. This commit apparently attempted to fix an issue that didn't really exist, furthermore: this commit is the source of deadlocks and crashes seen in multiple cases related to failing the primary mirror dev while syncing. Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm integrity: reject mappings too large for deviceOndrej Mosnáček2017-06-121-0/+5
| | | | | | | | | | | | | | | dm-integrity would successfully create mappings with the number of sectors greater than the provided data sector count. Attempts to read sectors of this mapping that were beyond the provided data sector count would then yield run-time messages of the form "device-mapper: integrity: Too big sector number: ...". Fix this by emitting an error when the requested mapping size is bigger than the provided data sector count. Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* md: initialise ->writes_pending in personality modules.NeilBrown2017-06-055-4/+21
| | | | | | | | | | | | | | | | The new per-cpu counter for writes_pending is initialised in md_alloc(), which is not called by dm-raid. So dm-raid fails when md_write_start() is called. Move the initialization to the personality modules that need it. This way it is always initialised when needed, but isn't unnecessarily initialized (requiring memory allocation) when the personality doesn't use writes_pending. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* Merge tag 'for-4.12/dm-fixes-3' of ↵Linus Torvalds2017-06-027-30/+18
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - a DM verity fix for a mode when no salt is used - a fix to DM to account for the possibility that PREFLUSH or FUA are used without the SYNC flag if the underlying storage doesn't have a volatile write-cache - a DM ioctl memory allocation flag fix to use __GFP_HIGH to allow emergency forward progress (by using memory reserves as last resort) - a small DM integrity cleanup to use kvmalloc() instead of duplicating the same * tag 'for-4.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: make flush bios explicitly sync dm ioctl: restore __GFP_HIGH in copy_params() dm integrity: use kvmalloc() instead of dm_integrity_kvmalloc() dm verity: fix no salt use case
| * dm: make flush bios explicitly syncJan Kara2017-05-315-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b685d3d65ac7 ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous") removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions. Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. Fixes: b685d3d65ac7 ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous") Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm ioctl: restore __GFP_HIGH in copy_params()Junaid Shahid2017-05-221-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit d224e9381897 ("drivers/md/dm-ioctl.c: use kvmalloc rather than opencoded variant") left out the __GFP_HIGH flag when converting from __vmalloc to kvmalloc. This can cause the DM ioctl to fail in some low memory situations where it wouldn't have failed earlier. Add __GFP_HIGH back to avoid any potential regression. Fixes: d224e9381897 ("drivers/md/dm-ioctl.c: use kvmalloc rather than opencoded variant") Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm integrity: use kvmalloc() instead of dm_integrity_kvmalloc()Mikulas Patocka2017-05-221-21/+6
| | | | | | | | | | Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm verity: fix no salt use caseGilad Ben-Yossef2017-05-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DM-Verity has an (undocumented) mode where no salt is used. This was never handled directly by the DM-Verity code, instead working due to the fact that calling crypto_shash_update() with a zero length data is an implicit noop. This is no longer the case now that we have switched to crypto_ahash_update(). Fix the issue by introducing explicit handling of the no salt use case to DM-Verity. Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> Reported-by: Marian Csontos <mcsontos@redhat.com> Fixes: d1ac3ff ("dm verity: switch to using asynchronous hash crypto API") Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | md: Make flush bios explicitely syncJan Kara2017-05-313-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. CC: linux-raid@vger.kernel.org CC: Shaohua Li <shli@kernel.org> Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Shaohua Li <shli@fb.com>
* | md: report sector of stripes with check mismatchesNix2017-05-241-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This makes it possible, with appropriate filesystem support, for a sysadmin to tell what is affected by the mismatch, and whether it should be ignored (if it's inside a swap partition, for instance). We ratelimit to prevent log flooding: if there are so many mismatches that ratelimiting is necessary, the individual messages are relatively unlikely to be important (either the machine is swapping like crazy or something is very wrong with the disk). Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | md: uuid debug statement now in processor byte order.Kyungchan Koh2017-05-241-4/+4
| | | | | | | | | | | | | | | | | | | | Previously, the uuid debug statements were printed in little-endian format, which wasn't consistent in machines that might not be in little-endian byte order. With this change, the output will be consistent for all machines with different byte-ordering. Signed-off-by: Kyungchan Koh <kkc6196@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | md-cluster: fix potential lock issue in add_new_diskGuoqing Jiang2017-05-211-1/+3
|/ | | | | | | | | | | The add_new_disk returns with communication locked if __sendmsg returns failure, fix it with call unlock_comm before return. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> CC: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* Merge tag 'md/4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds2017-05-188-86/+209
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull MD fixes from Shaohua Li: - Several bug fixes for raid5-cache from Song Liu, mainly handle journal disk error - Fix bad block handling in choosing raid1 disk from Tomasz Majchrzak - Simplify external metadata array sysfs handling from Artur Paszkiewicz - Optimize raid0 discard handling from me, now raid0 will dispatch large discard IO directly to underlayer disks. * tag 'md/4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid1: prefer disk without bad blocks md/r5cache: handle sync with data in write back cache md/r5cache: gracefully handle journal device errors for writeback mode md/raid1/10: avoid unnecessary locking md/raid5-cache: in r5l_do_submit_io(), submit io->split_bio first md/md0: optimize raid0 discard handling md: don't return -EAGAIN in md_allow_write for external metadata arrays md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
| * raid1: prefer disk without bad blocksTomasz Majchrzak2017-05-121-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an array consists of two drives and the first drive has the bad block, the read request to the region overlapping the bad block chooses the same disk (with bad block) as device to read from over and over and the request gets stuck. If the first disk only partially overlaps with bad block, it becomes a candidate ("best disk") for shorter range of sectors. The second disk is capable of reading the entire requested range and it is updated accordingly, however it is not recorded as a best device for the request. In the end the request is sent to the first disk to read entire range of sectors. It fails and is re-tried in a moment but with the same outcome. Actually it is quite likely scenario but it had little exposure in my test until commit 715d40b93b10 ("md/raid1: add failfast handling for reads.") removed preference for idle disk. Such scenario had been passing as second disk was always chosen when idle. Reset a candidate ("best disk") to read from if disk can read entire range. Do it only if other disk has already been chosen as a candidate for a smaller range. The head position / disk type logic will select the best disk to read from - it is fine as disk with bad block won't be considered for it. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/r5cache: handle sync with data in write back cacheSong Liu2017-05-112-7/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, sync of raid456 array cannot make progress when hitting data in writeback r5cache. This patch fixes this issue by flushing cached data of the stripe before processing the sync request. This is achived by: 1. In handle_stripe(), do not set STRIPE_SYNCING if the stripe is in write back cache; 2. In r5c_try_caching_write(), handle the stripe in sync with write through; 3. In do_release_stripe(), make stripe in sync write out and send it to the state machine. Shaohua: explictly set STRIPE_HANDLE after write out completed Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/r5cache: gracefully handle journal device errors for writeback modeSong Liu2017-05-113-9/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the raid456 with writeback cache, when journal device failed during normal operation, it is still possible to persist all data, as all pending data is still in stripe cache. However, it is necessary to handle journal failure gracefully. During journal failures, the following logic handles the graceful shutdown of journal: 1. raid5_error() marks the device as Faulty and schedules async work log->disable_writeback_work; 2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is suspended, set to write through, and then resumed. mddev_suspend() flushes all cached stripes; 3. All cached stripes need to be flushed carefully to the RAID array. This patch fixes issues within the process above: 1. In r5c_update_on_rdev_error() schedule disable_writeback_work for journal failures; 2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING, since raid5_error() updates superblock. 3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0) to make progress during log_failed; 4. In delay_towrite(), if log failed only process data in the cache (skip new writes in dev->towrite); 5. In __get_priority_stripe(), process loprio_list during journal device failures. 6. In raid5_remove_disk(), wait for all cached stripes are flushed before calling log_exit(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1/10: avoid unnecessary lockingShaohua Li2017-05-112-8/+6
| | | | | | | | | | | | | | | | If we add bios to block plugging list, locking is unnecessry, since the block unplug is guaranteed not to run at that time. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid5-cache: in r5l_do_submit_io(), submit io->split_bio firstSong Liu2017-05-101-9/+19
| | | | | | | | | | | | | | | | | | In r5l_do_submit_io(), it is necessary to check io->split_bio before submit io->current_bio. This is because, endio of current_bio may free the whole IO unit, and thus change io->split_bio. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/md0: optimize raid0 discard handlingShaohua Li2017-05-081-14/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are complaints that raid0 discard handling is slow. Currently we divide discard request into chunks and dispatch to underlayer disks. The block layer will do merge to form big requests. This causes a lot of request split/merge and uses significant CPU time. A simple idea is to calculate the range for each raid disk for an IO request and send a discard request to raid disks, which will avoid the split/merge completely. Previously Coly tried the approach, but the implementation was too complex because of raid0 zones. This patch always split bio in zone boundary and handle bio within one zone. It simplifies the implementation a lot. Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: don't return -EAGAIN in md_allow_write for external metadata arraysArtur Paszkiewicz2017-05-084-28/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This essentially reverts commit b5470dc5fc18 ("md: resolve external metadata handling deadlock in md_allow_write") with some adjustments. Since commit 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") changing array_state to 'active' does not use mddev_lock() and will not cause a deadlock with md_allow_write(). This revert simplifies userspace tools that write to sysfs attributes like "stripe_cache_size" or "consistency_policy" because it removes the need for special handling for external metadata arrays, checking for EAGAIN and retrying the write. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lockJulia Cartwright2017-05-041-10/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On mainline, there is no functional difference, just less code, and symmetric lock/unlock paths. On PREEMPT_RT builds, this fixes the following warning, seen by Alexander GQ Gerasiov, due to the sleeping nature of spinlocks. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993 in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1 CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G W 4.9.20-rt16-stand6-686 #1 Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015 Workqueue: writeback wb_workfn (flush-253:0) Call Trace: dump_stack+0x47/0x68 ? migrate_enable+0x4a/0xf0 ___might_sleep+0x101/0x180 rt_spin_lock+0x17/0x40 add_stripe_bio+0x4e3/0x6c0 [raid456] ? preempt_count_add+0x42/0xb0 raid5_make_request+0x737/0xdd0 [raid456] Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru> Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru> Signed-off-by: Julia Cartwright <julia@ni.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | dm cache: handle kmalloc failure allocating background_tracker structColin Ian King2017-05-171-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | Currently there is no kmalloc failure check on the allocation of the background_tracker struct in btracker_create(), and so a NULL return will lead to a NULL pointer dereference. Add a NULL check. Detected by CoverityScan, CID#1416587 ("Dereference null return value") Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm bufio: make the parameter "retain_bytes" unsigned longMikulas Patocka2017-05-161-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the type of the parameter "retain_bytes" from unsigned to unsigned long, so that on 64-bit machines the user can set more than 4GiB of data to be retained. Also, change the type of the variable "count" in the function "__evict_old_buffers" to unsigned long. The assignment "count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];" could result in unsigned long to unsigned overflow and that could result in buffers not being freed when they should. While at it, avoid division in get_retain_buffers(). Division is slow, we can change it to shift because we have precalculated the log2 of block size. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm mpath: multipath_clone_and_map must not return -EIOChristoph Hellwig2017-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | Since 412445ac ("dm: introduce a new DM_MAPIO_KILL return value"), the clone_and_map_rq methods must not return errno values, so fix it up to properly return DM_MAPIO_KILL, instead of the -EIO value that snuck in due to a conflict between two patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm mpath: don't return -EIO from dm_report_EIOChristoph Hellwig2017-05-151-8/+11
| | | | | | | | | | | | | | | | | | Instead just turn the macro into a helper for the warning message. This removes an unnecessary assignment and will allow the next commit to fix a place where -EIO is the wrong return value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm rq: add a missing break to map_requestChristoph Hellwig2017-05-151-0/+1
| | | | | | | | | | | | | | | | We don't want to bug when receiving a DM_MAPIO_KILL value.. Fixes: 412445ac ("dm: introduce a new DM_MAPIO_KILL return value") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm space map disk: fix some book keeping in the disk space mapJoe Thornber2017-05-151-1/+14
| | | | | | | | | | | | | | | | | | When decrementing the reference count for a block, the free count wasn't being updated if the reference count went to zero. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm thin metadata: call precommit before saving the rootsJoe Thornber2017-05-151-2/+2
| | | | | | | | | | | | | | | | These calls were the wrong way round in __write_initial_superblock. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm cache policy smq: don't do any writebacks unless IDLEJoe Thornber2017-05-141-5/+4
| | | | | | | | | | | | | | | | | | If there are no clean blocks to be demoted the writeback will be triggered at that point. Preemptively writing back can hurt high IO load scenarios. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm cache: simplify the IDLE vs BUSY state calculationJoe Thornber2017-05-141-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | Drop the MODERATE state since it wasn't buying us much. Also, in check_migrations(), prepare for the next commit ("dm cache policy smq: don't do any writebacks unless IDLE") by deferring to the policy to make the final decision on whether writebacks can be serviced. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm cache: track all IO to the cache rather than just the origin device's IOJoe Thornber2017-05-141-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | IO tracking used to throttle writebacks when the origin device is busy. Even if all the IO is going to the fast device, writebacks can significantly degrade performance. So track all IO to gauge whether the cache is busy or not. Otherwise, synthetic IO tests (e.g. fio) that might send all IO to the fast device wouldn't cause writebacks to get throttled. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm cache policy smq: stop preemptively demoting blocksJoe Thornber2017-05-141-12/+5
| | | | | | | | | | | | | | | | It causes a lot of churn if the working set's size is close to the fast device's size. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm cache policy smq: put newly promoted entries at the top of the multiqueueJoe Thornber2017-05-141-0/+1
| | | | | | | | | | | | | | This stops entries bouncing in and out of the cache quickly. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm cache policy smq: be more aggressive about triggering a writebackJoe Thornber2017-05-141-1/+1
| | | | | | | | | | | | | | | | If there are no clean entries to demote we really want to writeback immediately. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm cache policy smq: only demote entries in bottom half of the clean multiqueueJoe Thornber2017-05-141-1/+1
| | | | | | | | | | | | | | | | Heavy IO load may mean there are very few clean blocks in the cache, and we risk demoting entries that get hit a lot. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm cache: fix incorrect 'idle_time' reset in IO trackerJoe Thornber2017-05-141-0/+3
| | | | | | | | | | | | | | | | Some bios have no payload (eg, a FLUSH), don't reset the idle_time when these come in. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | mm, vmalloc: use __GFP_HIGHMEM implicitlyMichal Hocko2017-05-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __vmalloc* allows users to provide gfp flags for the underlying allocation. This API is quite popular $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l 77 The only problem is that many people are not aware that they really want to give __GFP_HIGHMEM along with other flags because there is really no reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages which are mapped to the kernel vmalloc space. About half of users don't use this flag, though. This signals that we make the API unnecessarily too complex. This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM are simplified and drop the flag. Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Cristopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud