path: root/drivers/md
Commit message [Author, Age, Files, Lines]
...
| * | | | md-cluster: update document for raid10 [Guoqing Jiang, 2017-11-01, 2 files, -3/+4]
    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: remove redundant variable q [Colin Ian King, 2017-11-01, 1 file, -2/+0]
    The pointer q is assigned but never read; it is redundant and can be
    removed.

    Cleans up clang warning:
    drivers/md/md-multipath.c:260:4: warning: Value stored to 'q' is never read

    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | raid1: remove obsolete code in raid1_write_request [Guoqing Jiang, 2017-11-01, 1 file, -13/+7]
    Some lines can be removed following recent raid1 changes such as commit
    3956df15d634 ("md: move suspend_hi/lo handling into core md code").

    Also, some comments appear to be in the wrong place; move them before
    wait_barrier.

    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md-cluster: Use a small window for raid10 resync [Guoqing Jiang, 2017-11-01, 2 files, -1/+118]
    Suspending the entire device for resync could take too long. Resync in
    small chunks.

    The cluster's resync window is maintained in r10conf as cluster_sync_low
    and cluster_sync_high, and processed in raid10's sync_request(). If the
    current resync is outside the cluster resync window:

    1. Set the cluster_sync_low to curr_resync_completed.
    2. Set cluster_sync_high to cluster_sync_low + stripe size.
    3. Send a message to all nodes so they may add it in their suspension
       list.

    Note: We only support "near" raid10 so far; resyncing a far or offset
    raid10 array could cause trouble. So raid10_run checks the layout of a
    clustered raid10 and refuses to run if the layout is not correct.

    With the "near" layout we process one stripe at a time, progressing
    monotonically through the address space. So we can have a sliding window
    of whole stripes which moves through the array suspending IO on other
    nodes, and both resync (which uses array addresses) and recovery (which
    uses device addresses) can stay within this window.

    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
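The three-step window update in the commit message above can be modelled in a few lines. This is only a userspace Python sketch, not the kernel code: the class `ResyncWindow`, its `update` method and the `sent` list (which stands in for the message sent to the other nodes) are hypothetical names, and "outside the window" is assumed to mean that the sector does not fall in the half-open range [low, high).

```python
from dataclasses import dataclass, field


@dataclass
class ResyncWindow:
    """Toy model of the cluster resync window (hypothetical, not kernel code)."""
    stripe_sectors: int                        # assumed window size: one stripe
    low: int = 0                               # models cluster_sync_low
    high: int = 0                              # models cluster_sync_high
    sent: list = field(default_factory=list)   # stands in for cluster messages

    def update(self, sector_nr, curr_resync_completed):
        """Slide the window if sector_nr lies outside [low, high).

        Returns True when the window moved and a message was recorded.
        """
        if self.low <= sector_nr < self.high:
            return False                       # still inside the window
        # Step 1: restart the window at the last completed position.
        self.low = curr_resync_completed
        # Step 2: the window covers one stripe.
        self.high = self.low + self.stripe_sectors
        # Step 3: tell the other nodes which range to suspend.
        self.sent.append((self.low, self.high))
        return True
```

For example, with `stripe_sectors=128` the first call `update(0, 0)` opens the window (0, 128), later sectors below 128 leave it unchanged, and `update(128, 128)` slides it to (128, 256).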
| * | | | md-cluster: Suspend writes in RAID10 if within range [Guoqing Jiang, 2017-11-01, 1 file, -0/+16]
    If there is a resync going on, all nodes must suspend writes to the
    range. This is recorded in suspend_info and suspend_list.

    If there is an I/O within the ranges of any of the suspend_info,
    area_resyncing will return 1.

    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
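The range test described above amounts to an interval-overlap check. The following Python sketch illustrates the idea under stated assumptions: the function signature, the representation of the suspend list as `(lo, hi)` tuples and the half-open interval convention are all chosen here for illustration and are not taken from the kernel source.

```python
def area_resyncing(suspend_list, lo, hi):
    """Return 1 if the I/O range [lo, hi) overlaps any suspended range.

    suspend_list is an iterable of (s_lo, s_hi) tuples, one per node that
    is currently resyncing (a stand-in for the suspend_info entries).
    """
    for s_lo, s_hi in suspend_list:
        # Two half-open ranges overlap when each starts before the other ends.
        if hi > s_lo and lo < s_hi:
            return 1
    return 0
```

A write to sectors 150..160 against a suspended range (100, 200) would report 1 and have to wait, while a write that ends exactly at sector 100 would report 0.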
| * | | | md-cluster/raid10: set "do_balance = 0" if area is resyncing [Guoqing Jiang, 2017-11-01, 1 file, -2/+5]
    Just like clustered raid1, it is impossible for clustered raid10 to
    choose the best device for read balance while the area of the array is
    resyncing. Because we cannot trust the data to be the same on all
    devices at that time, we just use the first one, so set do_balance to 0.

    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: use lockdep_assert_held [Shaohua Li, 2017-11-01, 3 files, -13/+8]
    lockdep_assert_held is a better way to assert lock held, and it works
    for UP.

    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | raid1: prevent freeze_array/wait_all_barriers deadlock [Nate Dailey, 2017-11-01, 1 file, -18/+6]
    If freeze_array is attempted in the middle of
    close_sync/wait_all_barriers, deadlock can occur.

    freeze_array will wait for nr_pending and nr_queued to line up.
    wait_all_barriers increments nr_pending for each barrier bucket, one at
    a time, but doesn't actually issue IO that could be counted in
    nr_queued. So freeze_array is blocked until wait_all_barriers completes
    and allow_all_barriers runs. At the same time, when _wait_barrier sees
    array_frozen == 1, it stops and waits for freeze_array to complete.

    Prevent the deadlock by making close_sync call _wait_barrier and
    _allow_barrier for one bucket at a time, instead of deferring the
    _allow_barrier calls until after all _wait_barriers are complete.

    Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
    Fix: fd76863e37fe(RAID1: a new I/O barrier implementation to remove resync window)
    Reviewed-by: Coly Li <colyli@suse.de>
    Cc: stable@vger.kernel.org (v4.11)
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: use TASK_IDLE instead of blocking signals [Mikulas Patocka, 2017-11-01, 2 files, -7/+1]
    Hi - I submit this patch for the next merge window:

    Some time ago, I made a patch f9c79bc05a2a that blocks signals around
    the schedule() calls in MD. The MD subsystem needs to do an
    uninterruptible sleep that is not accounted in load average - so we
    block signals and use interruptible sleep.

    The kernel has a special TASK_IDLE state for this purpose, so we can use
    it instead of blocking signals. This patch doesn't fix any bug, it just
    makes the code simpler.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Acked-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: remove special meaning of ->quiesce(.., 2) [NeilBrown, 2017-11-01, 9 files, -69/+37]
    The '2' argument means "wake up anything that is waiting". This is an
    inelegant part of the design and was added to help support management
    of suspend_lo/suspend_hi setting. Now that suspend_lo/hi is managed in
    mddev_suspend/resume, that need is gone.

    There are still a couple of places where we call 'quiesce' with an
    argument of '2', but they can safely be changed to call
    ->quiesce(.., 1); ->quiesce(.., 0), which achieves the same result at
    the small cost of pausing IO briefly.

    This removes a small "optimization" from suspend_{hi,lo}_store, but it
    isn't clear that optimization served a useful purpose. The code now is
    a lot clearer.

    Suggested-by: Shaohua Li <shli@kernel.org>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: allow metadata update while suspending. [NeilBrown, 2017-11-01, 2 files, -0/+20]
    There are various deadlocks that can occur when a thread holds
    reconfig_mutex and calls ->quiesce(mddev, 1). As some write requests
    block waiting for metadata to be updated (e.g. to record device
    failure), and as the md thread updates the metadata while the reconfig
    mutex is held, holding the mutex can stop write requests completing,
    and this prevents ->quiesce(mddev, 1) from completing.

    ->quiesce() is now usually called from mddev_suspend(), and it is
    always called with reconfig_mutex held. So at this time it is safe for
    the thread to update metadata without explicitly taking the lock.

    So add 2 new flags, one which says unlocked updates are allowed, and
    one which says it is happening. Then allow it while the quiesce
    completes, and then wait for it to finish.

    Reported-and-tested-by: Xiao Ni <xni@redhat.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: use mddev_suspend/resume instead of ->quiesce() [NeilBrown, 2017-11-01, 1 file, -12/+12]
    mddev_suspend() is a more general interface than calling ->quiesce()
    and so is more extensible. A future patch will make use of this.

    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: move suspend_hi/lo handling into core md code [NeilBrown, 2017-11-01, 3 files, -37/+28]
    Responding to ->suspend_lo and ->suspend_hi is similar to responding to
    ->suspended. It is best to wait in the common core code without
    incrementing ->active_io. This allows mddev_suspend()/mddev_resume() to
    work while requests are waiting for suspend_lo/hi to change. This will
    be important after a subsequent patch which uses mddev_suspend() to
    synchronize updating for suspend_lo/hi.

    So move the code for testing suspend_lo/hi out of raid1.c and raid5.c,
    and place it in md.c.

    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: don't call bitmap_create() while array is quiesced. [NeilBrown, 2017-11-01, 1 file, -6/+10]
    bitmap_create() allocates memory with GFP_KERNEL and so can wait for
    IO. If called while the array is quiesced, it could wait indefinitely
    for write out to the array - deadlock. So call bitmap_create() before
    quiescing the array.

    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: always hold reconfig_mutex when calling mddev_suspend() [NeilBrown, 2017-11-01, 3 files, -7/+24]
    Most often mddev_suspend() is called with reconfig_mutex held. Make
    this a requirement in preparation for a subsequent patch. Also require
    reconfig_mutex to be held for mddev_resume(), partly for symmetry and
    partly to guarantee no races with incr/decr of mddev->suspend.

    Taking the mutex in r5c_disable_writeback_async() is a little tricky as
    this is called from a work queue via log->disable_writeback_work, and
    flush_work() is called on that while holding ->reconfig_mutex. If the
    work item hasn't run before flush_work() is called, the work function
    will not be able to get the mutex.

    So we use mddev_trylock() inside the wait_event() call, and have that
    abort when conf->log is set to NULL, which happens before flush_work()
    is called. We wait in mddev->sb_wait and ensure this is woken when any
    of the conditions change. This requires waking mddev->sb_wait in
    mddev_unlock(). This is only likely to trigger extra wake-ups of
    threads that needn't be woken when metadata is being written, and that
    doesn't happen often enough that the cost would be noticeable.

    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: forbid a RAID5 from having both a bitmap and a journal. [NeilBrown, 2017-11-01, 3 files, -1/+14]
    Having both a bitmap and a journal is pointless. Attempting to do so
    can corrupt the bitmap if the journal replay happens before the bitmap
    is initialized. Rather than try to avoid this corruption, simply refuse
    to allow arrays with both a bitmap and a journal. So:

    - if raid5_run sees both are present, fail.
    - if adding a bitmap finds a journal is present, fail
    - if adding a journal finds a bitmap is present, fail.

    Cc: stable@vger.kernel.org (4.10+)
    Signed-off-by: NeilBrown <neilb@suse.com>
    Tested-by: Joshua Kinard <kumba@gentoo.org>
    Acked-by: Joshua Kinard <kumba@gentoo.org>
    Signed-off-by: Shaohua Li <shli@fb.com>
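The three refusal rules listed in the commit message above can be illustrated with a small Python model. It is a sketch only: `Raid5Array`, `ArrayConfigError` and the method names are hypothetical stand-ins for the checks the message describes, not the md driver's real interfaces.

```python
class ArrayConfigError(Exception):
    """Raised when an array would end up with both a bitmap and a journal."""


class Raid5Array:
    """Toy model of the bitmap/journal exclusion (hypothetical names)."""

    def __init__(self, has_bitmap=False, has_journal=False):
        self.has_bitmap = has_bitmap
        self.has_journal = has_journal

    def run(self):
        # Rule 1: refuse to start if both are already present.
        if self.has_bitmap and self.has_journal:
            raise ArrayConfigError("array has both a bitmap and a journal")

    def add_bitmap(self):
        # Rule 2: adding a bitmap fails if a journal is present.
        if self.has_journal:
            raise ArrayConfigError("cannot add a bitmap: journal present")
        self.has_bitmap = True

    def add_journal(self):
        # Rule 3: adding a journal fails if a bitmap is present.
        if self.has_bitmap:
            raise ArrayConfigError("cannot add a journal: bitmap present")
        self.has_journal = True
```

Checking on all three paths means the invalid combination is rejected whether it comes from existing metadata at start-up or from a later reconfiguration.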
| * | | | raid5: Set R5_Expanded on parity devices as well as data. [NeilBrown, 2017-10-18, 1 file, -1/+4]
    When reshaping a fully degraded raid5/raid6 to a larger number of
    devices, the new device(s) are not in-sync and so that can make the
    newly grown stripe appear to be "failed". To avoid this, we set the
    R5_Expanded flag to say "Even though this device is not fully in-sync,
    this block is safe so don't treat the device as failed for this
    stripe". This flag is set for data devices, but not for parity devices.

    Consequently, if you have a RAID6 with two devices that are partly
    recovered and a spare, and start a reshape to include the spare, then
    when the reshape gets past the point where the recovery was up to, it
    will think the stripes are failed and will get into an infinite loop,
    failing to make progress.

    So when constructing parity on an EXPAND_READY stripe, set R5_Expanded.

    Reported-by: Curt <lightspd@gmail.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: raid10: remove a couple of redundant variables and initializations [Colin Ian King, 2017-10-16, 1 file, -5/+1]
    Variables dev and bio_last_sector are assigned values that are never
    read and hence these are redundant variables and can be removed. Also
    remove the duplicated initialization of sectors, the latter assignment
    is identical to the first and can be removed.

    Cleans up 3 clang build warnings:
    Value stored to 'dev' is never read
    Value stored to 'bio_last_sector' is never read
    Value stored to 'sectors' during its initialization is never read

    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: rename some drivers/md/ files to have an "md-" prefix [Mike Snitzer, 2017-10-16, 15 files, -11/+14]
    Motivated by the desire to eliminate the imprecise nature of
    DM-specific patches being unnecessarily sent to both the MD maintainer
    and mailing list, which is born out of the fact that DM files also
    reside in drivers/md/.

    Now all MD-specific files in drivers/md/ start with either "raid" or
    "md-" and the MAINTAINERS file has been updated accordingly.

    Shaohua: don't change module name

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: raid10: remove VLAIS [Matthias Kaehlcke, 2017-10-16, 1 file, -5/+10]
    The raid10 driver can't be built with clang since it uses a variable
    length array in a structure (VLAIS):

    drivers/md/raid10.c:4583:17: error: fields must have a constant size:
      'variable length array in structure' extension will never be supported

    Allocate the r10bio struct with kmalloc instead of using the VLAIS
    construct.

    Shaohua: set the MD_RECOVERY_INTR bit
    Neil Brown: use GFP_NOIO

    Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
    Reviewed-by: Guenter Roeck <groeck@chromium.org>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md-cluster: make function cluster_check_sync_size static [Colin Ian King, 2017-10-16, 1 file, -1/+1]
    The function cluster_check_sync_size is local to the source and does
    not need to be in global scope, so make it static.

    Cleans up sparse warning:
    symbol 'cluster_check_sync_size' was not declared. Should it be static?

    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | raid5-ppl: check recovery_offset when performing ppl recovery [Artur Paszkiewicz, 2017-10-16, 1 file, -1/+2]
    If starting an array that is undergoing rebuild, make ppl recovery
    honor the recovery_offset of a member disk and don't read data that is
    not yet in-sync.

    Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | raid5-ppl: don't resync after rebuild [Artur Paszkiewicz, 2017-10-16, 1 file, -2/+1]
    The check for degraded array is unnecessary and causes a resync to be
    performed after ppl recovery and rebuild when restarting an array
    during rebuilding after unclean shutdown.

    Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md-cluster: fix wrong condition check in raid1_write_request [Guoqing Jiang, 2017-10-16, 1 file, -5/+5]
    The check used here is to avoid conflict between write and resync;
    however, we used the wrong logic: it should be the inverse of the
    check inside the "if".

    Fixes: 589a1c4 ("Suspend writes in RAID1 if within range")
    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md/bitmap: revert a patch [Shaohua Li, 2017-10-16, 1 file, -2/+2]
    This reverts commit 8031c3ddc70a. That patch doesn't work well if
    PAGE_SIZE > 4k. We will fix the original problem with a different
    approach.

    Fix: 8031c3ddc70a(md/bitmap: copy correct data for bitmap super)
    Reported-by: Joshua Kinard <kumba@gentoo.org>
    Cc: stable@vger.kernel.org (4.10+)
    Suggested-by: Neil Brown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: always set THREAD_WAKEUP and wake up wqueue if thread existed [Guoqing Jiang, 2017-10-08, 1 file, -2/+2]
    Since commit 4ad23a976413 ("MD: use per-cpu counter for
    writes_pending"), the wait queue is only woken if THREAD_WAKEUP was not
    previously set.

    With that change, process_metadata_update can hang on the wait queue
    indefinitely, because mddev->thread can stay in 'D' state and the
    THREAD_WAKEUP flag is not cleared, since there are lots of places that
    wake up mddev->thread. A deadlock then happens as follows:

    linux175:~ # ps aux|grep md|grep D
    root 20117 0.0 0.0 0 0 ? D 03:45 0:00 [md0_raid1]
    root 20125 0.0 0.0 0 0 ? D 03:45 0:00 [md0_cluster_rec]
    linux175:~ # cat /proc/20117/stack
    [<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster]
    [<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster]
    [<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster]
    [<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod]
    [<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod]
    [<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod]
    [<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1]
    [<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
    [<ffffffff81091741>] kthread+0x101/0x140
    linux175:~ # cat /proc/20125/stack
    [<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster]
    [<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
    [<ffffffff81091741>] kthread+0x101/0x140

    So revert that part of the commit to resolve the problem, since the
    previous change does not bring much benefit.

    Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
| * | | | md: fix deadlock error in recent patch. [NeilBrown, 2017-10-05, 1 file, -1/+2]
    A recent patch aimed to cause md_write_start() to fail (rather than
    block) when the mddev was suspending, so as to avoid deadlocks.
    Unfortunately the test in wait_event() was wrong, and it didn't change
    behaviour at all.

    wait_event() must wait until the metadata is written OR the array is
    suspending.

    Fixes: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()")
    Cc: stable@vger.kernel.org
    Reported-by: Xiao Ni <xni@redhat.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
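The corrected wait condition ("metadata written OR array suspending") can be sketched with a condition variable. This is a userspace Python model under stated assumptions, not the md code: the class `Mddev`, its boolean fields and the `timeout` parameter are hypothetical, and the return convention (False when the metadata update is still pending after waking) is an assumption made for the illustration.

```python
import threading


class Mddev:
    """Toy model of the wait in md_write_start() (hypothetical names)."""

    def __init__(self, sb_change_pending=True, suspended=False):
        self.sb_wait = threading.Condition()
        self.sb_change_pending = sb_change_pending
        self.suspended = suspended

    def md_write_start(self, timeout=1.0):
        """Wait until the metadata is written OR the array is suspending.

        Returns True if the write may proceed, False if the caller should
        fail the write because the metadata update is still pending.
        """
        with self.sb_wait:
            woke = self.sb_wait.wait_for(
                lambda: not self.sb_change_pending or self.suspended,
                timeout=timeout,
            )
            if not woke:
                # Only in this model: avoid blocking a demo forever.
                raise TimeoutError("neither condition became true")
            return not self.sb_change_pending
```

With only the first half of the predicate, a writer would keep sleeping while the array suspends; adding the second half lets it wake up and fail instead of blocking.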
* | | | | | Merge tag 'for-4.15/dm' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm [Linus Torvalds, 2017-11-14, 14 files, -262/+408]
    Pull device mapper updates from Mike Snitzer:

    - a few conversions from atomic_t to refcount_t
    - a DM core fix for a race during device destruction that could result
      in a BUG_ON
    - a stable@ fix for a DM cache race condition that could lead to data
      corruption when operating in writeback mode (writethrough is default)
    - various DM cache cleanups and improvements
    - add DAX support to the DM log-writes target
    - a fix for the DM zoned target's ability to deal with the last zone of
      the drive being smaller than all others
    - a stable@ DM crypt and DM integrity fix for a negative check that was
      too restrictive (prevented slab debug with XFS on top of DM crypt
      from working)
    - a DM raid target fix for a panic that can occur when forcing a raid
      to sync

    * tag 'for-4.15/dm' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (25 commits)
      dm cache: lift common migration preparation code to alloc_migration()
      dm cache: remove usused deferred_cells member from struct cache
      dm cache policy smq: allocate cache blocks in order
      dm cache policy smq: change max background work from 10240 to 4096 blocks
      dm cache background tracker: limit amount of background work that may be issued at once
      dm cache policy smq: take origin idle status into account when queuing writebacks
      dm cache policy smq: handle races with queuing background_work
      dm raid: fix panic when attempting to force a raid to sync
      dm integrity: allow unaligned bv_offset
      dm crypt: allow unaligned bv_offset
      dm: small cleanup in dm_get_md()
      dm: fix race between dm_get_from_kobject() and __dm_destroy()
      dm: allocate struct mapped_device with kvzalloc
      dm zoned: ignore last smaller runt zone
      dm space map metadata: use ARRAY_SIZE
      dm log writes: add support for DAX
      dm log writes: add support for inline data buffers
      dm cache: simplify get_per_bio_data() by removing data_size argument
      dm cache: remove all obsolete writethrough-specific code
      dm cache: submit writethrough writes in parallel to origin and cache
      ...
| * | | | dm cache: lift common migration preparation code to alloc_migration() [Mike Snitzer, 2017-11-10, 1 file, -10/+7]
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache: remove usused deferred_cells member from struct cache [Joe Thornber, 2017-11-10, 1 file, -2/+0]
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache policy smq: allocate cache blocks in order [Joe Thornber, 2017-11-10, 1 file, -1/+14]
    Previously, cache blocks were being allocated in reverse order. Fix
    this by pulling the block off the head of the free list.

    Shouldn't have any impact on performance or latency but it is more
    correct to have the cache blocks allocated/mapped in ascending order.
    This fix will slightly increase the chances of two adjacent oblocks
    being in adjacent cblocks.

    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
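The difference between taking blocks from the tail and from the head of a free list is easy to see in a toy model. The Python sketch below assumes the free list is populated in ascending cblock order; the function names are hypothetical and the `deque` merely stands in for the policy's internal list.

```python
from collections import deque


def make_free_list(nr_cblocks):
    """Free list holding cblocks 0..nr_cblocks-1 in ascending order (assumed)."""
    return deque(range(nr_cblocks))


def alloc_from_tail(free_list, count):
    """Behaviour described as the old one: blocks come out in reverse order."""
    return [free_list.pop() for _ in range(count)]


def alloc_from_head(free_list, count):
    """Behaviour described as the fix: blocks come out in ascending order."""
    return [free_list.popleft() for _ in range(count)]
```

With eight free blocks, three allocations from the tail give 7, 6, 5, whereas three allocations from the head give 0, 1, 2, so consecutively promoted origin blocks are more likely to land in adjacent cache blocks.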
| * | | | dm cache policy smq: change max background work from 10240 to 4096 blocks [Joe Thornber, 2017-11-10, 1 file, -1/+1]
    10240 blocks was too much, lowering this reduces the latency of
    copying and consumes less memory.

    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache background tracker: limit amount of background work that may be issued at once [Joe Thornber, 2017-11-10, 1 file, -6/+12]
    On large systems the cache policy can be over enthusiastic and queue
    far too much dirty data to be written back. This consumes memory.

    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache policy smq: take origin idle status into account when queuing writebacks [Joe Thornber, 2017-11-10, 1 file, -4/+4]
    If the origin device is idle try and writeback more data.

    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache policy smq: handle races with queuing background_work [Joe Thornber, 2017-11-10, 1 file, -3/+14]
    The background_tracker holds a set of promotions/demotions that the
    cache policy wishes the core target to implement.

    When adding a new operation to the tracker it's possible that an
    operation on the same block is already present (but in practise this
    doesn't appear to be happening). Catch these situations and do the
    appropriate cleanup.

    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm raid: fix panic when attempting to force a raid to sync [Heinz Mauelshagen, 2017-11-10, 1 file, -10/+11]
    Requesting a sync on an active raid device via a table reload (see
    'sync' parameter in Documentation/device-mapper/dm-raid.txt) skips the
    super_load() call that defines the superblock size (rdev->sb_size) --
    resulting in an oops if/when super_sync()->memset() is called.

    Fix by moving the initialization of the superblock start and size out
    of super_load() to the caller (analyse_superblocks).

    Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm integrity: allow unaligned bv_offset [Mikulas Patocka, 2017-11-10, 1 file, -1/+1]
    When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
    this unaligned memory for its buffers (if an unaligned buffer crosses a
    page, XFS frees it and allocates a full page instead - see the function
    xfs_buf_allocate_memory).

    dm-integrity checks if bv_offset is aligned on page size and this check
    fails with slub_debug and XFS.

    Fix this bug by removing the bv_offset check, leaving only the check
    for bv_len.

    Fixes: 7eada909bfd7 ("dm: add integrity target")
    Cc: stable@vger.kernel.org # v4.12+
    Reported-by: Bruno Prémont <bonbons@sysophe.eu>
    Reviewed-by: Milan Broz <gmazyland@gmail.com>
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm crypt: allow unaligned bv_offset [Mikulas Patocka, 2017-11-10, 1 file, -2/+2]
    When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
    this unaligned memory for its buffers (if an unaligned buffer crosses a
    page, XFS frees it and allocates a full page instead - see the function
    xfs_buf_allocate_memory).

    dm-crypt checks if bv_offset is aligned on page size and these checks
    fail with slub_debug and XFS.

    Fix this bug by removing the bv_offset checks. Switch to checking if
    bv_len is aligned instead of bv_offset (this check should be sufficient
    to prevent overruns if a bio with too small bv_len is received).

    Fixes: 8f0009a22517 ("dm crypt: optionally support larger encryption sector size")
    Cc: stable@vger.kernel.org # v4.12+
    Reported-by: Bruno Prémont <bonbons@sysophe.eu>
    Tested-by: Bruno Prémont <bonbons@sysophe.eu>
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Reviewed-by: Milan Broz <gmazyland@gmail.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
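The switch from checking bv_offset to checking bv_len can be shown with a small predicate. The Python sketch below is an illustration only: the function name, the representation of a bio as a list of `(bv_offset, bv_len)` pairs and the power-of-two requirement on the sector size are assumptions made here, not details taken from the patch.

```python
def segments_acceptable(segments, sector_size):
    """Accept a bio if every segment length is a multiple of sector_size.

    segments is a list of (bv_offset, bv_len) pairs. The offset is
    deliberately ignored, modelling the removal of the bv_offset check.
    sector_size is assumed to be a power of two.
    """
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector_size must be a power of two")
    mask = sector_size - 1
    return all((bv_len & mask) == 0 for _bv_offset, bv_len in segments)
```

A buffer that starts at an odd offset but spans a whole number of sectors passes, while a segment shorter than one sector is still rejected.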
| * | | | dm: small cleanup in dm_get_md() [Mike Snitzer, 2017-11-10, 1 file, -10/+5]
    Makes dm_get_md() and dm_get_from_kobject() have similar code.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm: fix race between dm_get_from_kobject() and __dm_destroy() [Hou Tao, 2017-11-10, 1 file, -4/+8]
    The following BUG_ON was hit when testing repeated creation and removal
    of DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
    [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
    [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
    [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
    [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
    [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
    [<ffffffff81199118>] seq_read+0x16f/0x325
    [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
    [<ffffffff8117b625>] __vfs_read+0x26/0x9d
    [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
    [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
    [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
    [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
    [<ffffffff8117c686>] SyS_read+0x4b/0x76
    [<ffffffff817b606e>] system_call_fastpath+0x12/0x71

    The bug can be easily triggered if an extra delay (e.g. 10ms) is added
    between the test of DMF_FREEING & DMF_DELETING and dm_get() in
    dm_get_from_kobject().

    To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
    dm_get() are done in an atomic way, so _minor_lock is used.

    The other callers of dm_get() have also been checked to be OK: some
    callers invoke dm_get() under _minor_lock, some callers invoke it under
    _hash_lock, and dm_start_request() invokes it after increasing
    md->open_count.

    Cc: stable@vger.kernel.org
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
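The fix pattern described above, doing the flag test and the reference grab as one atomic step under a lock, can be modelled in userspace. In the Python sketch below `threading.Lock` stands in for `_minor_lock`; the class `MappedDevice`, its fields and the simplified `dm_destroy` are hypothetical and only mirror the names used in the commit message.

```python
import threading

_minor_lock = threading.Lock()   # stand-in for the kernel's _minor_lock


class MappedDevice:
    """Toy model of a mapped device with lifetime flags (hypothetical)."""

    def __init__(self):
        self.freeing = False     # models DMF_FREEING
        self.deleting = False    # models DMF_DELETING
        self.holders = 0         # models the reference count taken by dm_get()


def dm_get_from_kobject(md):
    """Test the flags and take the reference under one lock.

    Returns md with an extra reference, or None if it is going away.
    """
    with _minor_lock:
        if md.freeing or md.deleting:
            return None
        md.holders += 1          # the dm_get() step, now inside the lock
        return md


def dm_destroy(md):
    """Mark the device as being freed, under the same lock."""
    with _minor_lock:
        md.freeing = True
```

Because both sides take the same lock, a reader can no longer pass the flag test and then grab a reference after destruction has begun, which is the window the extra 10ms delay exposed.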
| * | | | dm: allocate struct mapped_device with kvzalloc  Mikulas Patocka  2017-11-10  2  -4/+5
| | | | | The structure srcu_struct can be very big, its size is proportional to the value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field io_barrier in the struct mapped_device has 84kB in the debugging kernel and 50kB in the non-debugging kernel. The large size may result in failure of the function kzalloc_node.
| | | | | In order to avoid the allocation failure, we use the function kvzalloc_node, this function falls back to vmalloc if a large contiguous chunk of memory is not available. This patch also moves the field io_barrier to the last position of struct mapped_device - the reason is that on many processor architectures, short memory offsets result in smaller code than long memory offsets - on x86-64 it reduces code size by 320 bytes.
| | | | | Note to stable kernel maintainers - the kernels 4.11 and older don't have the function kvzalloc_node, you can use the function vzalloc_node instead.
| | | | | Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
| | | | | Cc: stable@vger.kernel.org
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
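The point about short memory offsets can be illustrated with `offsetof`. The two structs below are hypothetical and far smaller than struct mapped_device; they only show that a large field placed ahead of other members pushes those members to large offsets, while placing it last keeps their offsets small. The 50000-byte size is an arbitrary stand-in for a large io_barrier.

```c
#include <stddef.h>

#define BIG_FIELD_BYTES 50000  /* arbitrary stand-in for a large srcu_struct */

/* Hypothetical layout with the large field ahead of the hot fields. */
struct big_first {
    char io_barrier[BIG_FIELD_BYTES];
    int flags;
    int open_count;
};

/* Hypothetical layout with the large field moved to the end. */
struct big_last {
    int flags;
    int open_count;
    char io_barrier[BIG_FIELD_BYTES];
};

/* Offset of 'flags' when the big field comes first: at least BIG_FIELD_BYTES. */
static size_t flags_offset_big_first(void)
{
    return offsetof(struct big_first, flags);
}

/* Offset of 'flags' when the big field comes last: zero. */
static size_t flags_offset_big_last(void)
{
    return offsetof(struct big_last, flags);
}
```

On x86-64 a member at a small offset can be addressed with a one-byte displacement, while a member beyond a large field needs a four-byte displacement, which is consistent with the code-size reduction the commit message reports.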
| * | | | dm zoned: ignore last smaller runt zone  Damien Le Moal  2017-11-10  1  -4/+9
| | | | | The SCSI layer allows ZBC drives to have a smaller last runt zone. For such a device, specifying the entire capacity for a dm-zoned target table entry fails because the specified capacity is not aligned on a device zone size indicated in the request queue structure of the device.
| | | | | Fix this problem by ignoring the last runt zone in the entry length when setting up the dm-zoned target (ctr method) and when iterating table entries of the target (iterate_devices method). This allows dm-zoned users to still easily set up a target using the entire device capacity (as mandated by dm-zoned) or the aligned capacity excluding the last runt zone.
| | | | | While at it, replace direct references to the device queue chunk_sectors limit with calls to the accessor blk_queue_zone_sectors().
| | | | | Reported-by: Peter Desnoyers <pjd@ccs.neu.edu>
| | | | | Cc: stable@vger.kernel.org
| | | | | Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
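Ignoring the runt zone amounts to rounding the capacity down to a multiple of the zone size. A minimal sketch, assuming the zone size is a power of two; `capacity_without_runt()` is a hypothetical helper, not a function from the patch:

```c
#include <stdint.h>

/*
 * Round a capacity (in sectors) down to a whole number of zones.
 * Assumption: zone_sectors is a non-zero power of two, so the
 * remainder can be masked off.
 */
static uint64_t capacity_without_runt(uint64_t capacity_sectors,
                                      uint64_t zone_sectors)
{
    return capacity_sectors & ~(zone_sectors - 1);
}
```

For example, with 256-sector zones a 1000-sector device is treated as 768 sectors (three whole zones), and a 1024-sector device is left unchanged.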
| * | | | dm space map metadata: use ARRAY_SIZE  Jérémy Lefaure  2017-11-10  1  -1/+2
| | | | | Using the ARRAY_SIZE macro improves the readability of the code.
| | | | | Found with Coccinelle with the following semantic patch:
| | | | |   @r depends on (org || report)@
| | | | |   type T;
| | | | |   T[] E;
| | | | |   position p;
| | | | |   @@
| | | | |   (
| | | | |    (sizeof(E)@p /sizeof(*E))
| | | | |   |
| | | | |    (sizeof(E)@p /sizeof(E[...]))
| | | | |   |
| | | | |    (sizeof(E)@p /sizeof(T))
| | | | |   )
| | | | | Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
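For readers outside the kernel tree, the pattern being replaced looks roughly like the sketch below. The macro here is a simplified userspace version (the kernel's ARRAY_SIZE additionally rejects non-array arguments at compile time), and `thresholds` and `sum_thresholds()` are made-up names for illustration.

```c
#include <stddef.h>

/* Simplified userspace version; the kernel macro also checks that
 * its argument really is an array. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical table, standing in for the array touched by the patch. */
static const int thresholds[] = { 4, 8, 16, 32 };

/* Iterate with ARRAY_SIZE instead of an open-coded
 * sizeof(thresholds) / sizeof(*thresholds). */
static int sum_thresholds(void)
{
    int sum = 0;

    for (size_t i = 0; i < ARRAY_SIZE(thresholds); i++)
        sum += thresholds[i];
    return sum;
}
```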
| * | | | dm log writes: add support for DAX  Ross Zwisler  2017-11-10  1  -1/+87
| | | | | Now that we have the ability to log filesystem writes using a flat buffer, add support for DAX.
| | | | | The motivation for this support is the need for an xfstest that can test the new MAP_SYNC DAX flag. By logging the filesystem activity with dm-log-writes we can show that the MAP_SYNC page faults are writing out their metadata as they happen, instead of requiring an explicit msync/fsync.
| | | | | Unfortunately we can't easily track data that has been written via mmap() now that the dax_flush() abstraction was removed by commit c3ca015fab6d ("dax: remove the pmem_dax_ops->flush abstraction"). Otherwise we could just treat each flush as a big write, and store the data that is being synced to media. It may be worthwhile to add the dax_flush() entry point back, just as a notifier so we can do this logging.
| | | | | Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm log writes: add support for inline data buffers  Ross Zwisler  2017-11-10  1  -3/+84
| | | | | Currently dm-log-writes supports writing filesystem data via BIOs, and writing internal metadata from a flat buffer via write_metadata().
| | | | | For DAX writes, though, we won't have a BIO, but will instead have an iterator that we'll want to use to fill a flat data buffer.
| | | | | So, create write_inline_data() which allows us to write filesystem data using a flat buffer as a source, and wire it up in log_one_block().
| | | | | Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache: simplify get_per_bio_data() by removing data_size argument  Mike Snitzer  2017-11-10  1  -39/+22
| | | | | There is only one per_bio_data size now that writethrough-specific data was removed from the per_bio_data structure.
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache: remove all obsolete writethrough-specific code  Mike Snitzer  2017-11-10  1  -81/+1
| | | | | Now that the writethrough code is much simpler there is no need to track so much state or cascade bio submission (as was done, via writethrough_endio(), to issue origin then cache IO in series).
| | | | | As such the obsolete writethrough list and workqueue are also removed.
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache: submit writethrough writes in parallel to origin and cache  Mike Snitzer  2017-11-10  1  -17/+37
| | | | | Discontinue issuing writethrough write IO in series to the origin and then cache.
| | | | | Use bio_clone_fast() to create a new origin clone bio that will be mapped to the origin device and then bio_chain() it to the bio that gets remapped to the cache device. The origin clone bio does _not_ have a copy of the per_bio_data -- as such check_if_tick_bio_needed() will not be called.
| | | | | The cache bio (parent bio) will not complete until the origin bio has completed -- this fulfills bio_clone_fast()'s requirements as well as the requirement to not complete the original IO until the write IO has completed to both the origin and cache device.
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
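The completion ordering described above (the parent does not complete until the chained child has) can be modelled with a pending counter. This is a toy userspace model, not the block layer: `struct io`, `io_init()`, `io_chain()` and `io_end()` are hypothetical names, and the counter only approximates how a chained bio holds its parent open.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of an I/O request. */
struct io {
    int remaining;      /* completions still outstanding */
    struct io *parent;  /* request to notify when this one finishes */
    bool done;          /* set once the request has fully completed */
};

static void io_init(struct io *io)
{
    io->remaining = 1;  /* the request's own completion */
    io->parent = NULL;
    io->done = false;
}

/* Chain child to parent: the parent now also waits for the child. */
static void io_chain(struct io *child, struct io *parent)
{
    child->parent = parent;
    parent->remaining++;
}

/* Signal one completion; propagate to the parent once fully done. */
static void io_end(struct io *io)
{
    while (io != NULL && --io->remaining == 0) {
        io->done = true;
        io = io->parent;
    }
}
```

In this model the cache request plays the parent and the origin clone the child: ending the cache request first leaves it pending, and it is marked done only when the origin request also ends, in whichever order the two finish.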
| * | | | dm cache: pass cache structure to mode functions  Mike Snitzer  2017-11-10  1  -16/+16
| | | | | No functional changes, just a bit cleaner than passing cache_features structure.
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache: fix race condition in the writeback mode overwrite_bio optimisation  Joe Thornber  2017-11-10  1  -33/+53
| | | | | When a DM cache in writeback mode moves data between the slow and fast device it can often avoid a copy if the triggering bio either:
| | | | |   i) covers the whole block (no point copying if we're about to overwrite it)
| | | | |   ii) the migration is a promotion and the origin block is currently discarded
| | | | | Prior to this fix there was a race with case (ii). The discard status was checked with a shared lock held (rather than exclusive). This meant another bio could run in parallel and write data to the origin, removing the discard state. After the promotion the parallel write would have been lost.
| | | | | With this fix the discard status is re-checked once the exclusive lock has been acquired. If the block is no longer discarded it falls back to the slower full copy path.
| | | | | Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
| | | | | Cc: stable@vger.kernel.org # v4.12+
| | | | | Signed-off-by: Joe Thornber <ejt@redhat.com>
| | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
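The re-check pattern in this fix can be sketched as follows. It is a simplified userspace model under stated assumptions: `struct block`, `peek_discarded()`, `write_origin()` and `choose_promotion_path()` are hypothetical names, and a single pthread mutex stands in for the shared/exclusive locks of dm-bio-prison-v2, so only the ordering of the checks is modelled.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of one origin block. */
struct block {
    pthread_mutex_t lock;  /* stands in for the exclusive lock */
    bool discarded;        /* discard state of the origin block */
};

enum promote_path { PROMOTE_SKIP_COPY, PROMOTE_FULL_COPY };

/* Early check: the result is only a hint, it may be stale by the
 * time the exclusive lock is taken. */
static bool peek_discarded(struct block *b)
{
    bool hint;

    pthread_mutex_lock(&b->lock);
    hint = b->discarded;
    pthread_mutex_unlock(&b->lock);
    return hint;
}

/* A parallel write to the origin clears the discard state. */
static void write_origin(struct block *b)
{
    pthread_mutex_lock(&b->lock);
    b->discarded = false;
    pthread_mutex_unlock(&b->lock);
}

/* Decide with the lock held: skip the copy only if the hint said so
 * AND the block is still discarded now; otherwise do the full copy. */
static enum promote_path choose_promotion_path(struct block *b, bool hint)
{
    enum promote_path path;

    pthread_mutex_lock(&b->lock);
    path = (hint && b->discarded) ? PROMOTE_SKIP_COPY : PROMOTE_FULL_COPY;
    pthread_mutex_unlock(&b->lock);
    return path;
}
```

If a write lands between the early check and the decision, the hint is stale and the re-check selects the full copy, so the written data is carried into the cache instead of being lost.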