summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'modules-for-v4.15' of ↵Linus Torvalds2017-11-151-3/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull module updates from Jessica Yu: "Summary of modules changes for the 4.15 merge window: - treewide module_param_call() cleanup, fix up set/get function prototype mismatches, from Kees Cook - minor code cleanups" * tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: Do not paper over type mismatches in module_param_call() treewide: Fix function prototypes for module_param_call() module: Prepare to convert all module_param_call() prototypes kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
| * treewide: Fix function prototypes for module_param_call()Kees Cook2017-10-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several function prototypes for the set/get functions defined by module_param_call() have a slightly wrong argument types. This fixes those in an effort to clean up the calls when running under type-enforced compiler instrumentation for CFI. This is the result of running the following semantic patch: @match_module_param_call_function@ declarer name module_param_call; identifier _name, _set_func, _get_func; expression _arg, _mode; @@ module_param_call(_name, _set_func, _get_func, _arg, _mode); @fix_set_prototype depends on match_module_param_call_function@ identifier match_module_param_call_function._set_func; identifier _val, _param; type _val_type, _param_type; @@ int _set_func( -_val_type _val +const char * _val , -_param_type _param +const struct kernel_param * _param ) { ... } @fix_get_prototype depends on match_module_param_call_function@ identifier match_module_param_call_function._get_func; identifier _val, _param; type _val_type, _param_type; @@ int _get_func( -_val_type _val +char * _val , -_param_type _param +const struct kernel_param * _param ) { ... } Two additional by-hand changes are included for places where the above Coccinelle script didn't notice them: drivers/platform/x86/thinkpad_acpi.c fs/lockd/svc.c Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jessica Yu <jeyu@kernel.org>
* | Merge tag 'gpio-v4.15-1' of ↵Linus Torvalds2017-11-141-15/+7
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO updates from Linus Walleij: "This is the bulk of GPIO changes for the v4.15 kernel cycle: Core: - Fix the semantics of raw GPIO to actually be raw. No inversion semantics as before, but also no open draining, and allow the raw operations to affect lines used for interrupts as the caller supposedly knows what they are doing if they are getting the big hammer. - Rewrote the __inner_function() notation calls to names that make more sense. I just find this kind of code disturbing. - Drop the .irq_base() field from the gpiochip since now all IRQs are mapped dynamically. This is nice. - Support for .get_multiple() in the core driver API. This allows us to read several GPIO lines with a single register read. This has high value for some usecases: it can be used to create oscilloscopes and signal analyzers and other things that rely on reading several lines at exactly the same instant. Also a generally nice optimization. This uses the new assign_bit() macro from the bitops lib that was ACKed by Andrew Morton and is implemented for two drivers, one of them being the generic MMIO driver so everyone using that will be able to benefit from this. - Do not allow requests of Open Drain and Open Source setting of a GPIO line simultaneously. If the hardware actually supports enabling both at the same time the electrical result would be disastrous. - A new interrupt chip core helper. This will be helpful to deal with "banked" GPIOs, which means GPIO controllers with several logical blocks of GPIO inside them. This is several gpiochips per device in the device model, in contrast to the case when there is a 1-to-1 relationship between a device and a gpiochip. New drivers: - Maxim MAX3191x industrial serializer, a very interesting piece of professional I/O hardware. - Uniphier GPIO driver. This is the GPIO block from the recent Socionext (ex Fujitsu and Panasonic) platform. - Tegra 186 driver. This is based on the new banked GPIO infrastructure. Other improvements: - Some documentation improvements. - Wakeup support for the DesignWare DWAPB GPIO controller. - Reset line support on the DesignWare DWAPB GPIO controller. - Several non-critical bug fixes and improvements for the Broadcom BRCMSTB driver. - Misc non-critical bug fixes like exotic errorpaths, removal of dead code etc. - Explicit comments on fall-through switch() statements" * tag 'gpio-v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (65 commits) gpio: tegra186: Remove tegra186_gpio_lock_class gpio: rcar: Add r8a77995 (R-Car D3) support pinctrl: bcm2835: Fix some merge fallout gpio: Fix undefined lock_dep_class gpio: Automatically add lockdep keys gpio: Introduce struct gpio_irq_chip.first gpio: Disambiguate struct gpio_irq_chip.nested gpio: Add Tegra186 support gpio: Export gpiochip_irq_{map,unmap}() gpio: Implement tighter IRQ chip integration gpio: Move lock_key into struct gpio_irq_chip gpio: Move irq_valid_mask into struct gpio_irq_chip gpio: Move irq_nested into struct gpio_irq_chip gpio: Move irq_chained_parent to struct gpio_irq_chip gpio: Move irq_default_type to struct gpio_irq_chip gpio: Move irq_handler to struct gpio_irq_chip gpio: Move irqdomain into struct gpio_irq_chip gpio: Move irqchip into struct gpio_irq_chip gpio: Introduce struct gpio_irq_chip pinctrl: armada-37xx: remove unused variable ...
| * | bitops: Introduce assign_bit()Lukas Wunner2017-10-191-15/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A common idiom is to assign a value to a bit with: if (value) set_bit(nr, addr); else clear_bit(nr, addr); Likewise common is the one-line expression variant: value ? set_bit(nr, addr) : clear_bit(nr, addr); Commit 9a8ac3ae682e ("dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit()") introduced assign_bit() to the md subsystem for brevity. Make it available to others, specifically gpiolib and the upcoming driver for Maxim MAX3191x industrial serializer chips. As requested by Peter Zijlstra, change the argument order to reflect traditional "dst = src" in C, hence "assign_bit(nr, addr, value)". Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Neil Brown <neilb@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
* | | Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds2017-11-1421-220/+400
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull MD update from Shaohua Li: "This update mostly includes bug fixes: - md-cluster now supports raid10 from Guoqing - raid5 PPL fixes from Artur - badblock regression fix from Bo - suspend hang related fixes from Neil - raid5 reshape fixes from Neil - raid1 freeze deadlock fix from Nate - memleak fixes from Zdenek - bitmap related fixes from Me and Tao - other fixes and cleanups" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (33 commits) md: free unused memory after bitmap resize md: release allocated bitset sync_set md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb md: be cautious about using ->curr_resync_completed for ->recovery_offset badblocks: fix wrong return value in badblocks_set if badblocks are disabled md: don't check MD_SB_CHANGE_CLEAN in md_allow_write md-cluster: update document for raid10 md: remove redundant variable q raid1: remove obsolete code in raid1_write_request md-cluster: Use a small window for raid10 resync md-cluster: Suspend writes in RAID10 if within range md-cluster/raid10: set "do_balance = 0" if area is resyncing md: use lockdep_assert_held raid1: prevent freeze_array/wait_all_barriers deadlock md: use TASK_IDLE instead of blocking signals md: remove special meaning of ->quiesce(.., 2) md: allow metadata update while suspending. md: use mddev_suspend/resume instead of ->quiesce() md: move suspend_hi/lo handling into core md code md: don't call bitmap_create() while array is quiesced. ...
| * | | md: free unused memory after bitmap resizeZdenek Kabelac2017-11-101-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When bitmap is resized, the old kalloced chunks just are not released once the resized bitmap starts to use new space. This fixes in particular kmemleak reports like this one: unreferenced object 0xffff8f4311e9c000 (size 4096): comm "lvm", pid 19333, jiffies 4295263268 (age 528.265s) hex dump (first 32 bytes): 02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................ 02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................ backtrace: [<ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0 [<ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0 [<ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110 [<ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0 [<ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0 [<ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530 [<ffffffffa676f1ae>] bitmap_load+0xbe/0x160 [<ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid] [<ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0 [<ffffffffa6774b52>] dm_resume+0x122/0x140 [<ffffffffa6779b9f>] dev_suspend+0x18f/0x290 [<ffffffffa677a3a7>] ctl_ioctl+0x287/0x560 [<ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20 [<ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750 [<ffffffffa62d7269>] SyS_ioctl+0x79/0x90 [<ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2 Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: release allocated bitset sync_setZdenek Kabelac2017-11-101-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch fixes kmemleak on md_stop() path used likely only by dm-raid wrapper. Code of md is using mddev_put() where both bitsets are released however this freeing is not shared. Also set NULL to bio_set and sync_set pointers just like mddev_put is doing. Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sbHou Tao2017-11-091-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a RAID1 device using a file-based bitmap, if a bitmap write error occurs but the later writes succeed, it's possible both BITMAP_STALE and BITMAP_WRITE_ERROR bits will be written to the bitmap super block, the BITMAP_STALE bit will be handled properly and be cleared, but the BITMAP_WRITE_ERROR bit in sb->flags will make bitmap_create() to fail. So clear it to protect against the write failure-and-then-recovery case. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: be cautious about using ->curr_resync_completed for ->recovery_offsetNeilBrown2017-11-092-11/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ->recovery_offset shows how much of a non-InSync device is actually in sync - how much has been recoveryed. When performing a recovery, ->curr_resync and ->curr_resync_completed follow the device address being recovered and so can be used to update ->recovery_offset. When performing a reshape, ->curr_resync* might follow the device addresses (raid5) or might follow array addresses (raid10), so cannot in general be used to set ->recovery_offset. When reshaping backwards, ->curre_resync* measures from the *end* of the array-or-device, so is particularly unhelpful. So change the common code in md.c to only use ->curr_resync_complete for the simple recovery case, and add code to raid5.c to update ->recovery_offset during a forwards reshape. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: don't check MD_SB_CHANGE_CLEAN in md_allow_writeArtur Paszkiewicz2017-11-011-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only MD_SB_CHANGE_PENDING should be used to wait for transition from clean to dirty. Checking also MD_SB_CHANGE_CLEAN is unnecessary and can race with e.g. md_do_sync(). This sporadically causes a hang when changing consistency policy during resync: INFO: task mdadm:6183 blocked for more than 30 seconds. Not tainted 4.14.0-rc3+ #391 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mdadm D12752 6183 6022 0x00000000 Call Trace: __schedule+0x93f/0x990 schedule+0x6b/0x90 md_allow_write+0x100/0x130 [md_mod] ? do_wait_intr_irq+0x90/0x90 resize_stripes+0x3a/0x5b0 [raid456] ? kernfs_fop_write+0xbe/0x180 raid5_change_consistency_policy+0xa6/0x200 [raid456] consistency_policy_store+0x2e/0x70 [md_mod] md_attr_store+0x90/0xc0 [md_mod] sysfs_kf_write+0x42/0x50 kernfs_fop_write+0x119/0x180 __vfs_write+0x28/0x110 ? rcu_sync_lockdep_assert+0x12/0x60 ? __sb_start_write+0x15a/0x1c0 ? vfs_write+0xa3/0x1a0 vfs_write+0xb4/0x1a0 SyS_write+0x49/0xa0 entry_SYSCALL_64_fastpath+0x18/0xad Fixes: 2214c260c72b ("md: don't return -EAGAIN in md_allow_write for external metadata arrays") Cc: <stable@vger.kernel.org> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md-cluster: update document for raid10Guoqing Jiang2017-11-012-3/+4
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: remove redundant variable qColin Ian King2017-11-011-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pointer q is assigned but never read; it is redundant and can be removed. Cleans up clang warning: drivers/md/md-multipath.c:260:4: warning: Value stored to 'q' is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | raid1: remove obsolete code in raid1_write_requestGuoqing Jiang2017-11-011-13/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are some lines could be removed due to recent change for raid1 such as commit 3956df15d634 ("md: move suspend_hi/lo handling into core md code"). Also, seems some comments are put to wrong place, move them before wait_barrier. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md-cluster: Use a small window for raid10 resyncGuoqing Jiang2017-11-012-1/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window is maintained in r10conf as cluster_sync_low and cluster_sync_high, and processed in raid10's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Set cluster_sync_high to cluster_sync_low + stripe size. 3. Send a message to all nodes so they may add it in their suspension list. Note: We only support "near" raid10 so far, resync a far or offset raid10 array could have trouble. So raid10_run checks the layout of clustered raid10, it will refuse to run if the layout is not correct. With the "near" layout we process one stripe at a time progressing monotonically through the address space. So we can have a sliding window of whole-stripes which moves through the array suspending IO on other nodes, and both resync which uses array addresses and recovery which uses device addresses can stay within this window. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md-cluster: Suspend writes in RAID10 if within rangeGuoqing Jiang2017-11-011-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If there is a resync going on, all nodes must suspend writes to the range. This is recorded in suspend_info and suspend_list. If there is an I/O within the ranges of any of the suspend_info, area_resyncing will return 1. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md-cluster/raid10: set "do_balance = 0" if area is resyncingGuoqing Jiang2017-11-011-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like clustered raid1, it is impossible for cluster raid10 to choose the best device for read balance when the area of array is resyncing. Because we cannot trust the data to be the same on all devices at that time, so we choose just the first one to use, so set do_balance to 0. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: use lockdep_assert_heldShaohua Li2017-11-013-13/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | lockdep_assert_held is a better way to assert lock held, and it works for UP. Signed-off-by: Shaohua Li <shli@fb.com>
| * | | raid1: prevent freeze_array/wait_all_barriers deadlockNate Dailey2017-11-011-18/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If freeze_array is attempted in the middle of close_sync/ wait_all_barriers, deadlock can occur. freeze_array will wait for nr_pending and nr_queued to line up. wait_all_barriers increments nr_pending for each barrier bucket, one at a time, but doesn't actually issue IO that could be counted in nr_queued. So freeze_array is blocked until wait_all_barriers completes and allow_all_barriers runs. At the same time, when _wait_barrier sees array_frozen == 1, it stops and waits for freeze_array to complete. Prevent the deadlock by making close_sync call _wait_barrier and _allow_barrier for one bucket at a time, instead of deferring the _allow_barrier calls until after all _wait_barriers are complete. Signed-off-by: Nate Dailey <nate.dailey@stratus.com> Fix: fd76863e37fe(RAID1: a new I/O barrier implementation to remove resync window) Reviewed-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org (v4.11) Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: use TASK_IDLE instead of blocking signalsMikulas Patocka2017-11-012-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hi - I submit this patch for the next merge window: Some times ago, I made a patch f9c79bc05a2a that blocks signals around the schedule() calls in MD. The MD subsystem needs to do an uninterruptible sleep that is not accounted in load average - so we block signals and use interruptible sleep. The kernel has a special TASK_IDLE state for this purpose, so we can use it instead of blocking signals. This patch doesn't fix any bug, it just makes the code simpler. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: remove special meaning of ->quiesce(.., 2)NeilBrown2017-11-019-69/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The '2' argument means "wake up anything that is waiting". This is an inelegant part of the design and was added to help support management of suspend_lo/suspend_hi setting. Now that suspend_lo/hi is managed in mddev_suspend/resume, that need is gone. These is still a couple of places where we call 'quiesce' with an argument of '2', but they can safely be changed to call ->quiesce(.., 1); ->quiesce(.., 0) which achieve the same result at the small cost of pausing IO briefly. This removes a small "optimization" from suspend_{hi,lo}_store, but it isn't clear that optimization served a useful purpose. The code now is a lot clearer. Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: allow metadata update while suspending.NeilBrown2017-11-012-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are various deadlocks that can occur when a thread holds reconfig_mutex and calls ->quiesce(mddev, 1). As some write request block waiting for metadata to be updated (e.g. to record device failure), and as the md thread updates the metadata while the reconfig mutex is held, holding the mutex can stop write requests completing, and this prevents ->quiesce(mddev, 1) from completing. ->quiesce() is now usually called from mddev_suspend(), and it is always called with reconfig_mutex held. So at this time it is safe for the thread to update metadata without explicitly taking the lock. So add 2 new flags, one which says the unlocked updates is allowed, and one which ways it is happening. Then allow it while the quiesce completes, and then wait for it to finish. Reported-and-tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: use mddev_suspend/resume instead of ->quiesce()NeilBrown2017-11-011-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mddev_suspend() is a more general interface than calling ->quiesce() and is so more extensible. A future patch will make use of this. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: move suspend_hi/lo handling into core md codeNeilBrown2017-11-013-37/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | responding to ->suspend_lo and ->suspend_hi is similar to responding to ->suspended. It is best to wait in the common core code without incrementing ->active_io. This allows mddev_suspend()/mddev_resume() to work while requests are waiting for suspend_lo/hi to change. This is will be important after a subsequent patch which uses mddev_suspend() to synchronize updating for suspend_lo/hi. So move the code for testing suspend_lo/hi out of raid1.c and raid5.c, and place it in md.c Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: don't call bitmap_create() while array is quiesced.NeilBrown2017-11-011-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bitmap_create() allocates memory with GFP_KERNEL and so can wait for IO. If called while the array is quiesced, it could wait indefinitely for write out to the array - deadlock. So call bitmap_create() before quiescing the array. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: always hold reconfig_mutex when calling mddev_suspend()NeilBrown2017-11-013-7/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most often mddev_suspend() is called with reconfig_mutex held. Make this a requirement in preparation a subsequent patch. Also require reconfig_mutex to be held for mddev_resume(), partly for symmetry and partly to guarantee no races with incr/decr of mddev->suspend. Taking the mutex in r5c_disable_writeback_async() is a little tricky as this is called from a work queue via log->disable_writeback_work, and flush_work() is called on that while holding ->reconfig_mutex. If the work item hasn't run before flush_work() is called, the work function will not be able to get the mutex. So we use mddev_trylock() inside the wait_event() call, and have that abort when conf->log is set to NULL, which happens before flush_work() is called. We wait in mddev->sb_wait and ensure this is woken when any of the conditions change. This requires waking mddev->sb_wait in mddev_unlock(). This is only like to trigger extra wake_ups of threads that needn't be woken when metadata is being written, and that doesn't happen often enough that the cost would be noticeable. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: forbid a RAID5 from having both a bitmap and a journal.NeilBrown2017-11-013-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Having both a bitmap and a journal is pointless. Attempting to do so can corrupt the bitmap if the journal replay happens before the bitmap is initialized. Rather than try to avoid this corruption, simply refuse to allow arrays with both a bitmap and a journal. So: - if raid5_run sees both are present, fail. - if adding a bitmap finds a journal is present, fail - if adding a journal finds a bitmap is present, fail. Cc: stable@vger.kernel.org (4.10+) Signed-off-by: NeilBrown <neilb@suse.com> Tested-by: Joshua Kinard <kumba@gentoo.org> Acked-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | raid5: Set R5_Expanded on parity devices as well as data.NeilBrown2017-10-181-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reshaping a fully degraded raid5/raid6 to a larger nubmer of devices, the new device(s) are not in-sync and so that can make the newly grown stripe appear to be "failed". To avoid this, we set the R5_Expanded flag to say "Even though this device is not fully in-sync, this block is safe so don't treat the device as failed for this stripe". This flag is set for data devices, not not for parity devices. Consequently, if you have a RAID6 with two devices that are partly recovered and a spare, and start a reshape to include the spare, then when the reshape gets past the point where the recovery was up to, it will think the stripes are failed and will get into an infinite loop, failing to make progress. So when contructing parity on an EXPAND_READY stripe, set R5_Expanded. Reported-by: Curt <lightspd@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: raid10: remove a couple of redundant variables and initializationsColin Ian King2017-10-161-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Variables dev and bio_last_sector are assigned values that are never read and hence these are redundant variables and can be removed. Also remove the duplicated initialization of sectors, the latter assignment is identical to the first and can be removed. Cleans up 3 clang build warnings: Value stored to 'dev' is never read Value stored to 'bio_last_sector' is never read Value stored to 'sectors' during its initialization is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: rename some drivers/md/ files to have an "md-" prefixMike Snitzer2017-10-1615-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Motivated by the desire to illiminate the imprecise nature of DM-specific patches being unnecessarily sent to both the MD maintainer and mailing-list. Which is born out of the fact that DM files also reside in drivers/md/ Now all MD-specific files in drivers/md/ start with either "raid" or "md-" and the MAINTAINERS file has been updated accordingly. Shaohua: don't change module name Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: raid10: remove VLAISMatthias Kaehlcke2017-10-161-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The raid10 driver can't be built with clang since it uses a variable length array in a structure (VLAIS): drivers/md/raid10.c:4583:17: error: fields must have a constant size: 'variable length array in structure' extension will never be supported Allocate the r10bio struct with kmalloc instead of using the VLAIS construct. Shaohua: set the MD_RECOVERY_INTR bit Neil Brown: use GFP_NOIO Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md-cluster: make function cluster_check_sync_size staticColin Ian King2017-10-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function cluster_check_sync_size is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: symbol 'cluster_check_sync_size' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | raid5-ppl: check recovery_offset when performing ppl recoveryArtur Paszkiewicz2017-10-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If starting an array that is undergoing rebuild, make ppl recovery honor the recovery_offset of a member disk and don't read data that is not yet in-sync. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | raid5-ppl: don't resync after rebuildArtur Paszkiewicz2017-10-161-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check for degraded array is unnecessary and causes a resync to be performed after ppl recovery and rebuild when restarting an array during rebuilding after unclean shutdown. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md-cluster: fix wrong condition check in raid1_write_requestGuoqing Jiang2017-10-161-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check used here is to avoid conflict between write and resync, however we used the wrong logic, it should be the inverse of the checking inside "if". Fixes: 589a1c4 ("Suspend writes in RAID1 if within range") Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md/bitmap: revert a patchShaohua Li2017-10-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 8031c3ddc70a. That patches doesn't work well if PAGE_SIZE > 4k. We will fix the original problem with a different approach. Fix: 8031c3ddc70a(md/bitmap: copy correct data for bitmap super) Reported-by: Joshua Kinard <kumba@gentoo.org> Cc: stable@vger.kernel.org (4.10+) Suggested-by: Neil Brown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: always set THREAD_WAKEUP and wake up wqueue if thread existedGuoqing Jiang2017-10-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 4ad23a976413 ("MD: use per-cpu counter for writes_pending"), the wait_queue is only got invoked if THREAD_WAKEUP is not set previously. With above change, I can see process_metadata_update could always hang on the wait queue, because mddev->thread could stay on 'D' status and the THREAD_WAKEUP flag is not cleared since there are lots of place to wake up mddev->thread. Then deadlock happened as follows: linux175:~ # ps aux|grep md|grep D root 20117 0.0 0.0 0 0 ? D 03:45 0:00 [md0_raid1] root 20125 0.0 0.0 0 0 ? D 03:45 0:00 [md0_cluster_rec] linux175:~ # cat /proc/20117/stack [<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster] [<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster] [<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster] [<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod] [<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod] [<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod] [<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1] [<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod] [<ffffffff81091741>] kthread+0x101/0x140 linux175:~ # cat /proc/20125/stack [<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster] [<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod] [<ffffffff81091741>] kthread+0x101/0x140 So let's revert the part of code in the commit to resovle the problem since we can't get lots of benefits of previous change. Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: fix deadlock error in recent patch.NeilBrown2017-10-051-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent patch aimed to cause md_write_start() to fail (rather than block) when the mddev was suspending, so as to avoid deadlocks. Unfortunately the test in wait_event() was wrong, and it didn't change behaviour at all. We wait_event() must wait until the metadata is written OR the array is suspending. Fixes: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()") Cc: stable@vger.kernel.org Reported-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | | | Merge tag 'for-4.15/dm' of ↵Linus Torvalds2017-11-1414-262/+408
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - a few conversions from atomic_t to ref_count_t - a DM core fix for a race during device destruction that could result in a BUG_ON - a stable@ fix for a DM cache race condition that could lead to data corruption when operating in writeback mode (writethrough is default) - various DM cache cleanups and improvements - add DAX support to the DM log-writes target - a fix for the DM zoned target's ability to deal with the last zone of the drive being smaller than all others - a stable@ DM crypt and DM integrity fix for a negative check that was to restrictive (prevented slab debug with XFS ontop of DM crypt from working) - a DM raid target fix for a panic that can occur when forcing a raid to sync * tag 'for-4.15/dm' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (25 commits) dm cache: lift common migration preparation code to alloc_migration() dm cache: remove usused deferred_cells member from struct cache dm cache policy smq: allocate cache blocks in order dm cache policy smq: change max background work from 10240 to 4096 blocks dm cache background tracker: limit amount of background work that may be issued at once dm cache policy smq: take origin idle status into account when queuing writebacks dm cache policy smq: handle races with queuing background_work dm raid: fix panic when attempting to force a raid to sync dm integrity: allow unaligned bv_offset dm crypt: allow unaligned bv_offset dm: small cleanup in dm_get_md() dm: fix race between dm_get_from_kobject() and __dm_destroy() dm: allocate struct mapped_device with kvzalloc dm zoned: ignore last smaller runt zone dm space map metadata: use ARRAY_SIZE dm log writes: add support for DAX dm log writes: add support for inline data buffers dm cache: simplify get_per_bio_data() by removing data_size argument dm cache: remove all obsolete writethrough-specific code dm cache: submit writethrough writes in parallel to origin and cache ...
| * | | | dm cache: lift common migration preparation code to alloc_migration()Mike Snitzer2017-11-101-10/+7
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache: remove usused deferred_cells member from struct cacheJoe Thornber2017-11-101-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache policy smq: allocate cache blocks in orderJoe Thornber2017-11-101-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, cache blocks were being allocated in reverse order. Fix this by pulling the block off the head of the free list. Shouldn't have any impact on performance or latency but it is more correct to have the cache blocks allocated/mapped in ascending order. This fix will slightly increase the chances of two adjacent oblocks being in adjacent cblocks. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache policy smq: change max background work from 10240 to 4096 blocksJoe Thornber2017-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 10240 blocks was too much, lowering this reduces the latency of copying and consumes less memory. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache background tracker: limit amount of background work that may be ↵Joe Thornber2017-11-101-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | issued at once On large systems the cache policy can be over enthusiastic and queue far too much dirty data to be written back. This consumes memory. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache policy smq: take origin idle status into account when queuing ↵Joe Thornber2017-11-101-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | writebacks If the origin device is idle try and writeback more data. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm cache policy smq: handle races with queuing background_workJoe Thornber2017-11-101-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The background_tracker holds a set of promotions/demotions that the cache policy wishes the core target to implement. When adding a new operation to the tracker it's possible that an operation on the same block is already present (but in practise this doesn't appear to be happening). Catch these situations and do the appropriate cleanup. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm raid: fix panic when attempting to force a raid to syncHeinz Mauelshagen2017-11-101-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Requesting a sync on an active raid device via a table reload (see 'sync' parameter in Documentation/device-mapper/dm-raid.txt) skips the super_load() call that defines the superblock size (rdev->sb_size) -- resulting in an oops if/when super_sync()->memset() is called. Fix by moving the initialization of the superblock start and size out of super_load() to the caller (analyse_superblocks). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm integrity: allow unaligned bv_offsetMikulas Patocka2017-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When slub_debug is enabled kmalloc returns unaligned memory. XFS uses this unaligned memory for its buffers (if an unaligned buffer crosses a page, XFS frees it and allocates a full page instead - see the function xfs_buf_allocate_memory). dm-integrity checks if bv_offset is aligned on page size and this check fail with slub_debug and XFS. Fix this bug by removing the bv_offset check, leaving only the check for bv_len. Fixes: 7eada909bfd7 ("dm: add integrity target") Cc: stable@vger.kernel.org # v4.12+ Reported-by: Bruno Prémont <bonbons@sysophe.eu> Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm crypt: allow unaligned bv_offsetMikulas Patocka2017-11-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When slub_debug is enabled kmalloc returns unaligned memory. XFS uses this unaligned memory for its buffers (if an unaligned buffer crosses a page, XFS frees it and allocates a full page instead - see the function xfs_buf_allocate_memory). dm-crypt checks if bv_offset is aligned on page size and these checks fail with slub_debug and XFS. Fix this bug by removing the bv_offset checks. Switch to checking if bv_len is aligned instead of bv_offset (this check should be sufficient to prevent overruns if a bio with too small bv_len is received). Fixes: 8f0009a22517 ("dm crypt: optionally support larger encryption sector size") Cc: stable@vger.kernel.org # v4.12+ Reported-by: Bruno Prémont <bonbons@sysophe.eu> Tested-by: Bruno Prémont <bonbons@sysophe.eu> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm: small cleanup in dm_get_md()Mike Snitzer2017-11-101-10/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Makes dm_get_md() and dm_get_from_kobject() have similar code. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | | | dm: fix race between dm_get_from_kobject() and __dm_destroy()Hou Tao2017-11-101-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following BUG_ON was hit when testing repeat creation and removal of DM devices: kernel BUG at drivers/md/dm.c:2919! CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44 Call Trace: [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44 [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf [<ffffffff811de257>] kernfs_seq_show+0x23/0x25 [<ffffffff81199118>] seq_read+0x16f/0x325 [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f [<ffffffff8117b625>] __vfs_read+0x26/0x9d [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44 [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9 [<ffffffff8117be9d>] vfs_read+0x8f/0xcf [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41 [<ffffffff8117c686>] SyS_read+0x4b/0x76 [<ffffffff817b606e>] system_call_fastpath+0x12/0x71 The bug can be easily triggered, if an extra delay (e.g. 10ms) is added between the test of DMF_FREEING & DMF_DELETING and dm_get() in dm_get_from_kobject(). To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and dm_get() are done in an atomic way, so _minor_lock is used. The other callers of dm_get() have also been checked to be OK: some callers invoke dm_get() under _minor_lock, some callers invoke it under _hash_lock, and dm_start_request() invoke it after increasing md->open_count. Cc: stable@vger.kernel.org Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
OpenPOWER on IntegriCloud