path: root/drivers/md
Commit message | Author | Age | Files | Lines
* dm crypt: fix memory leak in crypt_ctr_cipher_old() | Jeffy Chen | 2017-10-12 | 1 file | -0/+1
  commit bd86e32059526e2d0d13ca1e4447dfbbddb6e5cc upstream.

  Fix memory leak of cipher_api.

  Fixes: 33d2f09fcb35 (dm crypt: introduce new format of cipher with "capi:" prefix)
  Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dm ioctl: fix alignment of event number in the device list | Mikulas Patocka | 2017-10-12 | 3 files | -15/+33
  commit 62e082430ea4bb5b28909ca4375bb683931e22aa upstream.

  The size of struct dm_name_list is different on 32-bit and 64-bit
  kernels (so "(nl + 1)" differs between 32-bit and 64-bit kernels).

  This mismatch caused a harmless difference in padding between 32-bit
  and 64-bit kernels.

  Commit 23d70c5e52dd ("dm ioctl: report event number in
  DM_LIST_DEVICES") added reporting of the event number in the output of
  DM_LIST_DEVICES_CMD. This difference in padding makes it impossible for
  userspace to determine the location of the event number (the location
  would be different when running on 32-bit and 64-bit kernels).

  Fix the padding by using offsetof(struct dm_name_list, name) instead of
  sizeof(struct dm_name_list) to determine the location of entries.

  Also, the ioctl version number is incremented to 37 so that userspace
  can use the version number to determine that the event number is
  present and correctly located.

  In addition, a global event is now raised when a DM device is created,
  removed, renamed or when a table is swapped, so that the user can
  monitor for device changes.

  Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
  Fixes: 23d70c5e52dd ("dm ioctl: report event number in DM_LIST_DEVICES")
  Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
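The padding problem described above can be illustrated with a small userspace sketch. The struct below is a hypothetical mirror of a list-entry header (a 64-bit field, a 32-bit field, then a variable-length name); it is not the kernel's definition, and the helper names and the 8-byte alignment of the trailing field are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of a list-entry header; NOT the kernel definition. */
struct name_list_sketch {
    uint64_t dev;
    uint32_t next;
    char name[];            /* NUL-terminated name follows the header */
};

/* Round an offset up to the next multiple of 8 (assumed alignment). */
size_t align8_sketch(size_t off)
{
    return (off + 7) & ~(size_t)7;
}

/*
 * Old-style computation: sizeof() includes trailing struct padding,
 * which depends on the ABI's alignment of uint64_t, so the result can
 * differ between 32-bit and 64-bit builds.
 */
size_t trailer_offset_by_sizeof(size_t name_len)
{
    return align8_sketch(sizeof(struct name_list_sketch) + name_len + 1);
}

/*
 * Fixed-style computation: offsetof(name) ignores trailing padding and
 * is the same on both ABIs, so userspace can locate the trailing field.
 */
size_t trailer_offset_by_offsetof(size_t name_len)
{
    return align8_sketch(offsetof(struct name_list_sketch, name) + name_len + 1);
}
```

With a 3-character name, the offsetof-based helper yields 16 regardless of ABI, whereas the sizeof-based helper depends on whether the compiler pads the struct to 12 or 16 bytes.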
* dm crypt: reject sector_size feature if device length is not aligned to it | Milan Broz | 2017-10-12 | 1 file | -0/+4
  commit 783874b050768d361239e444ba0fa396bb6d463f upstream.

  If a crypt mapping uses the optional sector_size feature, additional
  restrictions on the mapped device segment size must be applied in the
  constructor, otherwise the device activation will fail later.

  Fixes: 8f0009a225 ("dm crypt: optionally support larger encryption sector size")
  Signed-off-by: Milan Broz <gmazyland@gmail.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
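A minimal sketch of the kind of alignment check the message describes, assuming the device length is expressed in 512-byte sectors and the encryption sector size is a power of two of at least 512 bytes. The function and macro names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT_SKETCH 9   /* 512-byte sectors (assumption) */

/*
 * Returns 1 if a device length (in 512-byte sectors) is a whole multiple
 * of the encryption sector size (in bytes, power of two, >= 512),
 * 0 otherwise. A constructor could reject the mapping when this is 0.
 */
int length_aligned_to_sector_size(uint64_t len_sectors,
                                  unsigned int sector_size_bytes)
{
    uint64_t sectors_per_unit = sector_size_bytes >> SECTOR_SHIFT_SKETCH;

    /* Power-of-two size, so a mask test is equivalent to a modulo test. */
    return (len_sectors & (sectors_per_unit - 1)) == 0;
}
```

For example, a 9-sector device is acceptable with 512-byte encryption sectors but not with 4096-byte ones, since 4096 bytes span 8 sectors.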
* md: separate request handling | Shaohua Li | 2017-10-05 | 2 files | -25/+34
  commit 393debc23c7820211d1c8253dd6a8408a7628fe7 upstream.

  With commit cc27b0c78c79, pers->make_request could bail out without
  handling the bio. If that happens, we should retry. The commit fixes
  md_make_request but not other call sites. Separate the request handling
  part, so other call sites can use it.

  Reported-by: Nate Dailey <nate.dailey@stratus.com>
  Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
  Reviewed-by: NeilBrown <neilb@suse.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md: fix a race condition for flush request handling | Shaohua Li | 2017-10-05 | 1 file | -4/+10
  commit 79bf31a3b2a7ca467cfec8ff97d359a77065d01f upstream.

  md_submit_flush_data calls pers->make_request, which missed the suspend
  check. Fix it with the new md_handle_request API.

  Reported-by: Nate Dailey <nate.dailey@stratus.com>
  Tested-by: Nate Dailey <nate.dailey@stratus.com>
  Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
  Reviewed-by: NeilBrown <neilb@suse.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list | Dennis Yang | 2017-10-05 | 1 file | -1/+2
  commit 184a09eb9a2fe425e49c9538f1604b05ed33cfef upstream.

  In release_stripe_plug(), if a stripe_head has its
  STRIPE_ON_UNPLUG_LIST set, it indicates that this stripe_head is
  already in the raid5_plug_cb list and release_stripe() would be called
  instead to drop a reference count. Otherwise, the STRIPE_ON_UNPLUG_LIST
  bit would be set for this stripe_head and it will get queued into the
  raid5_plug_cb list.

  Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
  a stripe could be re-added to the plug list while it is still on that
  list in the following situation. If stripe_head A is added to another
  stripe_head B's batch list, A will have its batch_head != NULL and be
  added into the plug list. After that, stripe_head B gets handled and
  calls break_stripe_batch_list() to reset the state of all the batched
  stripe_heads (including A, which is still on the plug list) and reset
  their batch_head to NULL. Before the plug list gets processed, if
  another write request comes in and gets stripe_head A, A will have its
  batch_head == NULL (cleared by calling break_stripe_batch_list() on B)
  and be added to the plug list once again.

  Signed-off-by: Dennis Yang <dennisyang@qnap.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md/raid5: fix a race condition in stripe batch | Shaohua Li | 2017-10-05 | 1 file | -2/+8
  commit 3664847d95e60a9a943858b7800f8484669740fc upstream.

  We have a race condition in the scenario below. Say we have 3
  continuous stripes, sh1, sh2 and sh3, and sh1 is the stripe_head of
  sh2 and sh3:

  CPU1                      CPU2                               CPU3
  handle_stripe(sh3)
                            stripe_add_to_batch_list(sh3)
                            -> lock(sh2, sh3)
                            -> lock batch_lock(sh1)
                            -> add sh3 to batch_list of sh1
                            -> unlock batch_lock(sh1)
                                                               clear_batch_ready(sh1)
                                                               -> lock(sh1) and batch_lock(sh1)
                                                               -> clear STRIPE_BATCH_READY for
                                                                  all stripes in batch_list
                                                               -> unlock(sh1) and batch_lock(sh1)
  ->clear_batch_ready(sh3)
  -->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
  --->return 0 as sh->batch == NULL
                            -> sh3->batch_head = sh1
                            -> unlock (sh2, sh3)

  On CPU1, handle_stripe will continue to handle sh3 even though it is
  in the batch stripe list of sh1. By moving the sh3->batch_head
  assignment inside batch_lock, we make it impossible to clear
  STRIPE_BATCH_READY before batch_head is set.

  Thanks Stephane for helping debug this tricky issue.

  Reported-and-tested-by: Stephane Thiell <sthiell@stanford.edu>
  Signed-off-by: Shaohua Li <shli@fb.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dm integrity: do not check integrity for failed read operations | Hyunchul Lee | 2017-10-05 | 1 file | -1/+5
  commit b7e326f7b7375392d06f9cfbc27a7c63181f69d7 upstream.

  Even though read operations fail, dm_integrity_map_continue() calls
  integrity_metadata() to check integrity. In this case, just complete
  these.

  This also makes it so read I/O errors do not generate integrity
  warnings in the kernel log.

  Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
  Acked-by: Milan Broz <gmazyland@gmail.com>
  Acked-by: Mikulas Patocka <mpatocka@redhat.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dax: remove the pmem_dax_ops->flush abstraction | Mikulas Patocka | 2017-10-05 | 3 files | -54/+0
  commit c3ca015fab6df124c933b91902f3f2a3473f9da5 upstream.

  Commit abebfbe2f731 ("dm: add ->flush() dax operation support") is
  buggy. A DM device may be composed of multiple underlying devices and
  all of them need to be flushed. That commit just routes the flush
  request to the first device and ignores the other devices.

  It could be fixed by adding more complex logic to the device mapper.
  But there is only one implementation of the method
  pmem_dax_ops->flush - that is pmem_dax_flush() - and it calls
  arch_wb_cache_pmem(). Consequently, we don't need the
  pmem_dax_ops->flush abstraction at all, we can call
  arch_wb_cache_pmem() directly from dax_flush() because
  dax_dev->ops->flush can't ever reach anything different from
  arch_wb_cache_pmem().

  It should be also pointed out that for some uses of persistent memory
  it is needed to flush only a very small amount of data (such as 1
  cacheline), and it would be overkill if we go through that device
  mapper machinery for a single flushed cache line.

  Fix this by removing the pmem_dax_ops->flush abstraction and call
  arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
  mapper code that forwards the flushes.

  Fixes: abebfbe2f731 ("dm: add ->flush() dax operation support")
  Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcache: fix bch_hprint crash and improve output | Michael Lyle | 2017-09-27 | 1 file | -15/+35
  commit 9276717b9e297a62d1151a43d1cd286213f68eb7 upstream.

  Most importantly, solve a crash where %llu was used to format signed
  numbers. This would cause a buffer overflow when reading sysfs
  writeback_rate_debug, as only 20 bytes were allocated for this and
  %llu writes 20 characters plus a null.

  Always use the units mechanism rather than having different output
  paths for simplicity.

  Also, correct problems with display output where 1.10 was a larger
  number than 1.09, by multiplying by 10 and then dividing by 1024
  instead of dividing by 100. (Remainders of >= 1000 would print as
  .10).

  Minor changes: Always display the decimal point instead of trying to
  omit it based on number of digits shown. Decide what units to use
  based on 1000 as a threshold, not 1024 (in other words, always print
  at most 3 digits before the decimal point).

  Signed-off-by: Michael Lyle <mlyle@lyle.org>
  Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru>
  Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
  Reviewed-by: Coly Li <colyli@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
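The formatting rules listed above (handle the sign separately, scale by 1024 until at most three digits remain, always print one decimal digit computed as remainder * 10 / 1024) can be sketched in userspace C as follows. This is an illustrative approximation, not the kernel's bch_hprint(); the function name, the unit table and the use of snprintf() with an explicit buffer length are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch: format v as "<up to 3 digits>.<1 digit><unit>".
 * Returns what snprintf() returns (the length that was or would have
 * been written, excluding the terminating NUL).
 */
int hprint_sketch(char *buf, size_t len, int64_t v)
{
    static const char units[] = "?kMGTPEZY";
    int u = 0;
    unsigned int rem;
    /* Take the magnitude in unsigned arithmetic so INT64_MIN is safe. */
    uint64_t q = v < 0 ? (uint64_t)0 - (uint64_t)v : (uint64_t)v;

    /* Divide by 1024 at least once, and until fewer than 1000 remain. */
    do {
        u++;
        rem = (unsigned int)(q & 1023);
        q >>= 10;
    } while (q >= 1000);

    /* rem * 10 / 1024 is always in 0..9, so ".10" can never be printed. */
    return snprintf(buf, len, "%s%llu.%u%c", v < 0 ? "-" : "",
                    (unsigned long long)q, rem * 10 / 1024, units[u]);
}
```

Because the sign is emitted as a separate string and the magnitude never exceeds three digits, the worst-case output is short and bounded, which is the property the crash fix relies on.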
* bcache: fix for gc and write-back race | Tang Junhui | 2017-09-27 | 3 files | -2/+10
  commit 9baf30972b5568d8b5bc8b3c46a6ec5b58100463 upstream.

  gc and write-back can race (see the email "bcache get stucked" I sent
  before):

  gc thread                               write-back thread
  |                                       |bch_writeback_thread()
  |bch_gc_thread()                        |
  |                                       |==>read_dirty()
  |==>bch_btree_gc()                      |
  |==>btree_root() //get btree root       |
  |                //node write locker    |
  |==>bch_btree_gc_root()                 |
  |                                       |==>read_dirty_submit()
  |                                       |==>write_dirty()
  |                                       |==>continue_at(cl,
  |                                       |               write_dirty_finish,
  |                                       |               system_wq);
  |                                       |==>write_dirty_finish()//execute
  |                                       |               //in system_wq
  |                                       |==>bch_btree_insert()
  |                                       |==>bch_btree_map_leaf_nodes()
  |                                       |==>__bch_btree_map_nodes()
  |                                       |==>btree_root //try to get btree
  |                                       |              //root node read
  |                                       |              //lock
  |                                       |-----stuck here
  |==>bch_btree_set_root()
  |==>bch_journal_meta()
  |==>bch_journal()
  |==>journal_try_write()
  |==>journal_write_unlocked() //journal_full(&c->journal)
  |                            //condition satisfied
  |==>continue_at(cl, journal_write, system_wq); //try to execute
  |                               //journal_write in system_wq
  |                               //but work queue is executing
  |                               //write_dirty_finish()
  |==>closure_sync(); //wait journal_write execute
  |                   //over and wake up gc,
  |-------------stuck here
  |==>release root node write locker

  This patch allocates a separate work-queue for the write-back thread
  to avoid such a race.

  (Commit log re-organized by Coly Li to pass checkpatch.pl checking)

  Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
  Acked-by: Coly Li <colyli@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcache: fix sequential large write IO bypass | Tang Junhui | 2017-09-27 | 1 file | -6/+0
  commit c81ffa32a214c84b08900fbc9d432187bd948eba upstream.

  Sequential write IOs were tested with bs=1M by FIO in writeback cache
  mode. These IOs were expected to be bypassed, but actually they were
  not. We debugged the code and found this in check_should_bypass():

      if (!congested &&
          mode == CACHE_MODE_WRITEBACK &&
          op_is_write(bio_op(bio)) &&
          (bio->bi_opf & REQ_SYNC))
              goto rescale;

  This means that, in writeback mode, a write IO with the REQ_SYNC flag
  will not be bypassed even though it is a sequential large IO. That is
  not the correct thing to do, so this patch removes this code.

  Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
  Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
  Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcache: Correct return value for sysfs attach errors | Tony Asleson | 2017-09-27 | 1 file | -2/+2
  commit 77fa100f27475d08a569b9d51c17722130f089e7 upstream.

  If you encounter any errors in bch_cached_dev_attach it will return a
  negative error code. The variable 'v' which stores the result is
  unsigned, thus user space sees a very large value returned for bytes
  written, which can cause incorrect user space behavior. Use one signed
  variable throughout the function to preserve error return capability.

  Signed-off-by: Tony Asleson <tasleson@redhat.com>
  Acked-by: Coly Li <colyli@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
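The signedness problem can be reproduced in a few lines of userspace C. The sketch below is illustrative only: attach_sketch() is a hypothetical stand-in for an attach step that returns a negative errno, and the two store functions are simplified assumptions about the shape of a sysfs store handler, not the bcache code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical attach step: 0 on success, negative errno on failure. */
int attach_sketch(int should_fail)
{
    return should_fail ? -EINVAL : 0;
}

/* Buggy pattern: the result passes through an unsigned variable. */
long long store_with_unsigned(long long size, int should_fail)
{
    unsigned int v = (unsigned int)attach_sketch(should_fail);

    if (!v)
        return size;        /* success: report bytes consumed */
    return (long long)v;    /* -EINVAL has become a huge positive value */
}

/* Fixed pattern: one signed variable preserves the error code. */
long long store_with_signed(long long size, int should_fail)
{
    int v = attach_sketch(should_fail);

    if (!v)
        return size;
    return v;               /* still negative, so the caller sees an error */
}
```

A caller that treats a positive return as "bytes written" is misled by the first variant and correctly sees -EINVAL from the second.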
* bcache: correct cache_dirty_target in __update_writeback_rate() | Tang Junhui | 2017-09-27 | 2 files | -1/+21
  commit a8394090a9129b40f9d90dcb7f4a49d60c727ca6 upstream.

  __update_writeback_rate() uses a Proportional-Derivative (PD)
  controller algorithm to control the writeback rate. A dirty target
  number is used in this PD controller. A larger target number makes the
  writeback rate smaller; conversely, a smaller target number makes the
  writeback rate larger.

  bcache uses the following steps to calculate the target number:
  1) cache_sectors = all-buckets-of-cache-set * buckets-size
  2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
  3) target = cache_dirty_target *
     (sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)

  The calculation at step 1) for cache_sectors is incorrect, because it
  does not consider dirty blocks occupied by flash only volumes.

  A flash only volume can be regarded as a bcache device without a cached
  device. All data sectors allocated for it are persistent on the cache
  device and marked dirty; they are not touched by bcache writeback and
  garbage collection code. So data blocks of flash only volumes should be
  ignored when calculating cache_sectors of the cache set.

  The current code does not subtract dirty sectors of flash only volumes,
  which results in a larger target number from the above 3 steps. As a
  consequence the cache device's writeback rate is smaller than the
  correct value, and writeback speed is slower on all cached devices.

  This patch fixes the incorrect slower writeback rate by subtracting
  dirty sectors of flash only volumes in __update_writeback_rate().

  (Commit log composed by Coly Li to pass checkpatch.pl checking)

  Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
  Reviewed-by: Coly Li <colyli@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
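The three steps above, with the correction applied to step 1, can be written out as a small calculation. The sketch below is a userspace illustration under simplifying assumptions: the function and parameter names are hypothetical, writeback_percent is taken as an integer percentage, and plain 64-bit integer arithmetic is used without guarding against overflow for very large inputs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the dirty-target calculation.
 * All sizes are in sectors; writeback_percent is 0..100.
 */
uint64_t dirty_target_sketch(uint64_t cache_sectors,
                             uint64_t flash_only_dirty_sectors,
                             unsigned int writeback_percent,
                             uint64_t dev_sectors,
                             uint64_t all_cached_dev_sectors)
{
    uint64_t usable, cache_dirty_target;

    /* Step 1 (corrected): exclude sectors pinned by flash only volumes. */
    usable = cache_sectors > flash_only_dirty_sectors
           ? cache_sectors - flash_only_dirty_sectors : 0;

    /* Step 2: share of the usable cache allowed to be dirty. */
    cache_dirty_target = usable * writeback_percent / 100;

    /* Step 3: this device's share, proportional to its size. */
    if (all_cached_dev_sectors == 0)
        return 0;
    return cache_dirty_target * dev_sectors / all_cached_dev_sectors;
}
```

With 1,000,000 cache sectors, 200,000 of them held by flash only volumes, a 10% writeback_percent and a device that accounts for half of all cached sectors, the target is 40,000 sectors; without the subtraction it would be 50,000, which is the inflated target (and hence slower writeback) the message describes.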
* bcache: do not subtract sectors_to_gc for bypassed IO | Tang Junhui | 2017-09-27 | 1 file | -3/+3
  commit 69daf03adef5f7bc13e0ac86b4b8007df1767aab upstream.

  Since bypassed IOs use no bucket, do not subtract sectors_to_gc to
  trigger the gc thread.

  Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
  Acked-by: Coly Li <colyli@suse.de>
  Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcache: Fix leak of bdev reference | Jan Kara | 2017-09-27 | 1 file | -0/+2
  commit 4b758df21ee7081ab41448d21d60367efaa625b3 upstream.

  If blkdev_get_by_path() in register_bcache() fails, we try to look up
  the block device using lookup_bdev() to detect which situation we are
  in so that we can properly report the error. However we never drop the
  reference returned to us from lookup_bdev(). Fix that.

  Signed-off-by: Jan Kara <jack@suse.cz>
  Acked-by: Coly Li <colyli@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcache: initialize dirty stripes in flash_dev_run() | Tang Junhui | 2017-09-27 | 3 files | -6/+7
  commit 175206cf9ab63161dec74d9cd7f9992e062491f5 upstream.

  bcache uses a Proportional-Derivative (PD) controller algorithm to
  control the writeback rate to cached devices. In the PD controller
  algorithm, dirty stripes of a thin flash device should not be counted,
  because flash only volumes never write back dirty data.

  Currently the dirty stripe counter for a thin flash device is not
  initialized when the thin flash device starts. This means the
  following calculation in the PD controller will reference an undefined
  dirty stripes number, and all cached devices attached to the same
  cache set as the thin flash device may have an inaccurate writeback
  rate.

  This patch calls bch_sectors_dirty_init() in flash_dev_run(), to
  correctly initialize the dirty stripe counter when the thin flash
  device starts to run. This patch also makes the following parameter
  data type change,
  -void bch_sectors_dirty_init(struct cached_dev *dc);
  +void bch_sectors_dirty_init(struct bcache_device *);
  to call this function conveniently in flash_dev_run().

  (Commit log is composed by Coly Li)

  Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
  Reviewed-by: Coly Li <colyli@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md/bitmap: disable bitmap_resize for file-backed bitmaps. | NeilBrown | 2017-09-27 | 1 file | -0/+5
  commit e8a27f836f165c26f867ece7f31eb5c811692319 upstream.

  bitmap_resize() does not work for file-backed bitmaps. The
  buffer_heads are allocated and initialized when the bitmap is read
  from the file, but resize doesn't read from the file, it loads from
  the internal bitmap. When it comes time to write the new bitmap, the
  bh is non-existent and we crash.

  The common case when growing an array involves making the array
  larger, and that normally means making the bitmap larger. Doing that
  inside the kernel is possible, but would need more code. It is
  probably easier to require people who use file-backed bitmaps to
  remove them and re-add after a reshape.

  So this patch disables the resizing of arrays which have file-backed
  bitmaps. This is better than crashing.

  Reported-by: Zhilong Liu <zlliu@suse.com>
  Fixes: d60b479d177a ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
  Signed-off-by: NeilBrown <neilb@suse.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md/bitmap: copy correct data for bitmap super | Shaohua Li | 2017-09-27 | 1 file | -2/+2
  commit 8031c3ddc70ab93099e7d1814382dba39f57b43e upstream.

  raid5 cache could write the bitmap superblock before the bitmap
  superblock is initialized. The bitmap superblock is less than 512B.
  The current code will only copy the superblock to a new page and write
  the whole 512B, which will zero the data after the superblock.
  Unfortunately the data could include the bitmap, which we should
  preserve. The patch makes the superblock read use a 4k chunk, and we
  always copy the 4k data to the new page, so the superblock write will
  write the old data to disk and we don't change the bitmap.

  Reported-by: Song Liu <songliubraving@fb.com>
  Reviewed-by: Song Liu <songliubraving@fb.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md/raid5: release/flush io in raid5_do_work() | Song Liu | 2017-09-20 | 1 file | -0/+4
  commit 9c72a18e46ebe0f09484cce8ebf847abdab58498 upstream.

  In raid5, there are scenarios where some IOs are deferred to a later
  time, and some IOs need a flush to complete. To make sure we make
  progress with these IOs, we need to call the following functions:

      flush_deferred_bios(conf);
      r5l_flush_stripe_to_raid(conf->log);

  Both of these functions are called in raid5d(), but missing in
  raid5_do_work(). As a result, these functions are not called when
  multi-threading (group_thread_cnt > 0) is enabled. This patch adds
  calls to these functions to raid5_do_work().

  Note for stable branches:

      r5l_flush_stripe_to_raid(conf->log) is needed for 4.4+
      flush_deferred_bios(conf) is only needed for 4.11+

  Signed-off-by: Song Liu <songliubraving@fb.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md/raid1/10: reset bio allocated from mempool | Shaohua Li | 2017-09-20 | 2 files | -4/+50
  commit 208410b546207cfc4c832635fa46419cfa86b4cd upstream.

  Data allocated from a mempool doesn't always get initialized. This
  happens when the data is reused instead of freshly allocated. In the
  raid1/10 case, we must reinitialize the bios.

  Reported-by: Jonathan G. Underwood <jonathan.underwood@gmail.com>
  Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync pages)
  Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync pages)
  Cc: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* dm mpath: do not lock up a CPU with requeuing activity | Bart Van Assche | 2017-08-28 | 1 file | -1/+0
  When using the block layer in single queue mode, get_request()
  returns ERR_PTR(-EAGAIN) if the queue is dying and the REQ_NOWAIT flag
  has been passed to get_request(). Avoid that the kernel reports soft
  lockup complaints in this case due to continuous requeuing activity.

  Fixes: 7083abbbf ("dm mpath: avoid that path removal can trigger an infinite loop")
  Cc: stable@vger.kernel.org
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Tested-by: Laurence Oberman <loberman@redhat.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: fix printk() rate limiting code | Bart Van Assche | 2017-08-28 | 1 file | -10/+0
  Using the same rate limiting state for different kinds of messages is
  wrong because this can cause a high frequency message to suppress a
  report of a low frequency message. Hence use a unique rate limiting
  state per message type.

  Fixes: 71a16736a15e ("dm: use local printk ratelimit")
  Cc: stable@vger.kernel.org
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
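The suppression effect described above is easy to demonstrate with a toy rate limiter. The sketch below is a userspace illustration under stated assumptions: the struct, its fields and ratelimit_allow() are hypothetical, time is passed in explicitly as an integer tick, and none of this is the kernel's ratelimit implementation.

```c
#include <assert.h>

/* Hypothetical rate limiter: at most 'burst' messages per 'interval'. */
struct ratelimit_sketch {
    long interval;      /* window length in arbitrary ticks */
    int burst;          /* messages allowed per window */
    long window_start;  /* tick at which the current window began */
    int printed;        /* messages already allowed in this window */
};

/* Returns 1 if a message may be printed at time 'now', 0 if suppressed. */
int ratelimit_allow(struct ratelimit_sketch *rs, long now)
{
    if (now - rs->window_start >= rs->interval) {
        rs->window_start = now;   /* start a new window */
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    return 0;
}
```

If a frequent message and a rare message share one state, the frequent one can exhaust the burst and the rare one is dropped; giving each message type its own state lets the rare message through.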
* dm mpath: retry BLK_STS_RESOURCE errors | Bart Van Assche | 2017-08-28 | 1 file | -1/+0
  Retry requests instead of failing them if an out-of-memory error
  occurs or the block driver below dm-mpath is busy. This restores the
  v4.12 behavior of noretry_error(), namely that -ENOMEM results in a
  retry.

  Fixes: 2a842acab109 ("block: introduce new block status code type")
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: fix the second dec_pending() argument in __split_and_process_bio() | Bart Van Assche | 2017-08-28 | 1 file | -1/+1
  Detected by sparse.

  Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t")
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Tested-by: Laurence Oberman <loberman@redhat.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* MD: not clear ->safemode for external metadata array | Shaohua Li | 2017-08-11 | 1 file | -1/+1
  ->safemode should be triggered by mdadm for an external metadata
  array, otherwise the array's state confuses mdadm.

  Fixes: 33182d15c6bf(md: always clear ->safemode when md_check_recovery gets the mddev lock.)
  Cc: NeilBrown <neilb@suse.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
* md/r5cache: fix io_unit handling in r5l_log_endio() | Song Liu | 2017-08-08 | 1 file | -9/+31
  In r5l_log_endio(), once log->io_list_lock is released, the io unit
  may be accessed (or even freed) by other threads. The current code
  doesn't handle the io_unit properly, which leads to potential race
  conditions.

  This patch solves this race condition by:

  1. Add a pending_stripe count for flush_payload. Multiple
     flush_payloads are counted as only one pending_stripe. Flag
     has_flush_payload is added to show whether the io unit has
     flush_payload;
  2. In r5l_log_endio(), check flags has_null_flush and
     has_flush_payload with log->io_list_lock held. After the lock is
     released, this IO unit is only accessed when we know the
     pending_stripe counter cannot be zeroed by other threads.

  Signed-off-by: Song Liu <songliubraving@fb.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
* md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_set | Song Liu | 2017-08-08 | 1 file | -6/+15
  In r5c_journal_mode_set(), it is necessary to call mddev_lock() before
  accessing conf and conf->log. Otherwise, the conf->log may change (and
  become NULL).

  Shaohua: fix unlock in failure cases

  Signed-off-by: Song Liu <songliubraving@fb.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
* md: fix test in md_write_start() | NeilBrown | 2017-08-08 | 1 file | -1/+1
  md_write_start() needs to clear the in_sync flag if it is set, or if
  there might be a race with set_in_sync() such that the latter will set
  it very soon. In the latter case it is sufficient to take the spinlock
  to synchronize with set_in_sync(), and then set the flag if needed.

  The current test is incorrect. It should be:

      if "flag is set" or "race is possible"

  "flag is set" is trivially "mddev->in_sync".
  "race is possible" should be tested by "mddev->sync_checkers".

  If sync_checkers is 0, then there can be no race. set_in_sync() will
  wait in percpu_ref_switch_to_atomic_sync() for an RCU grace period,
  and as md_write_start() holds the rcu_read_lock(), set_in_sync() will
  be sure to see the update to writes_pending.

  If sync_checkers is > 0, there could be a race. If md_write_start()
  happened entirely between

      if (!mddev->in_sync &&
          percpu_ref_is_zero(&mddev->writes_pending)) {

  and

      mddev->in_sync = 1;

  in set_in_sync(), then it would not see that in_sync had been set, and
  set_in_sync() would not see that writes_pending had been incremented.

  This bug means that in_sync is sometimes not set when it should be.
  Consequently there is a small chance that the array will be marked as
  "clean" when in fact it is inconsistent.

  Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
  cc: stable@vger.kernel.org (v4.12+)
  Signed-off-by: NeilBrown <neilb@suse.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
* md: always clear ->safemode when md_check_recovery gets the mddev lock. | NeilBrown | 2017-08-08 | 1 file | -0/+3
  If ->safemode == 1, md_check_recovery() will try to get the mddev lock
  and perform various other checks. If mddev->in_sync is zero, it will
  call set_in_sync, and clear ->safemode. However if mddev->in_sync is
  not zero, ->safemode will not be cleared.

  When md_check_recovery() drops the mddev lock, the thread is woken up
  again. Normally it would just check if there was anything else to do,
  find nothing, and go to sleep. However as ->safemode was not cleared,
  it will take the mddev lock again, then wake itself up when unlocking.

  This results in an infinite loop, repeatedly calling
  md_check_recovery(), which RCU or the soft-lockup detector will
  eventually complain about.

  Prior to commit 4ad23a976413 ("MD: use per-cpu counter for
  writes_pending"), safemode would only be set to one when the
  writes_pending counter reached zero, and would be cleared again when
  writes_pending is incremented. Since that patch, safemode is set more
  freely, but is not reliably cleared.

  So in md_check_recovery() clear ->safemode before checking ->in_sync.

  Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
  Cc: stable@vger.kernel.org (4.12+)
  Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
  Reported-by: David R <david@unsolicited.net>
  Signed-off-by: NeilBrown <neilb@suse.com>
  Signed-off-by: Shaohua Li <shli@fb.com>
* Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md | Linus Torvalds | 2017-07-28 | 6 files | -126/+115
  Pull MD fixes from Shaohua Li:
   "This fixes several bugs, three of them are marked for stable:

     - an initialization issue fixed by Ming
     - a bio clone race issue fixed by me
     - an async tx flush issue fixed by Ofer
     - other cleanups"

  * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
    MD: fix warnning for UP case
    md/raid5: add thread_group worker async_tx_issue_pending_all
    md: simplify code with bio_io_error
    md/raid1: fix writebehind bio clone
    md: raid1-10: move raid1/raid10 common code into raid1-10.c
    md: raid1/raid10: initialize bvec table via bio_add_page()
    md: remove 'idx' from 'struct resync_pages'
| * MD: fix warnning for UP case | Shaohua Li | 2017-07-25 | 1 file | -1/+1
|   spin_is_locked always returns 0 for the UP case, so ignore it.
|
|   Reported-by: Joshua Kinard <kumba@gentoo.org>
|   Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid5: add thread_group worker async_tx_issue_pending_all | Ofer Heifetz | 2017-07-24 | 1 file | -0/+2
|   Since the thread_group worker and the raid5d kthread are not in
|   sync, if the worker writes a stripe before raid5d then requests
|   will be waiting for issue_pending.
|
|   The issue was observed when building raid5 with ext4: in some build
|   runs jbd2 would hang and requests were waiting in the HW engine to
|   be issued.
|
|   Fix this by adding a call to async_tx_issue_pending_all in
|   raid5_do_work.
|
|   Signed-off-by: Ofer Heifetz <oferh@marvell.com>
|   Cc: stable@vger.kernel.org
|   Signed-off-by: Shaohua Li <shli@fb.com>
| * md: simplify code with bio_io_error | Guoqing Jiang | 2017-07-21 | 3 files | -12/+6
|   Since bio_io_error sets bi_status to BLK_STS_IOERR and calls
|   bio_endio, we can use it directly.
|
|   As mentioned by Shaohua, there are also two places in raid5.c that
|   can use bio_io_error.
|
|   Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
|   Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: fix writebehind bio clone | Shaohua Li | 2017-07-21 | 1 file | -21/+13
|   After a bio is submitted, we should not clone it as its bi_iter
|   might be invalidated by the driver. This is the case for
|   behind_master_bio. In certain situations, we could dispatch
|   behind_master_bio immediately for the first disk and then clone it
|   for other disks.
|
|   https://bugzilla.kernel.org/show_bug.cgi?id=196383
|
|   Reported-and-tested-by: Markus <m4rkusxxl@web.de>
|   Reviewed-by: Ming Lei <ming.lei@redhat.com>
|   Fix: 841c1316c7da(md: raid1: improve write behind)
|   Cc: stable@vger.kernel.org (4.12+)
|   Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1-10: move raid1/raid10 common code into raid1-10.c | Ming Lei | 2017-07-21 | 4 files | -71/+62
|   No function change, just move 'struct resync_pages' and related
|   helpers into raid1-10.c
|
|   Signed-off-by: Ming Lei <ming.lei@redhat.com>
|   Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1/raid10: initialize bvec table via bio_add_page() | Ming Lei | 2017-07-21 | 3 files | -16/+27
|   We will support multipage bvec soon, so initialize the bvec table
|   using the standard way instead of writing the table directly.
|   Otherwise it won't work any more once multipage bvec is enabled.
|
|   Acked-by: Guoqing Jiang <gqjiang@suse.com>
|   Signed-off-by: Ming Lei <ming.lei@redhat.com>
|   Signed-off-by: Shaohua Li <shli@fb.com>
| * md: remove 'idx' from 'struct resync_pages'  Ming Lei  2017-07-21  3  -7/+6
| | bio_add_page() won't fail for a resync bio, and the page index for each bio is the same, so remove it.
| |
| | More importantly, the 'idx' of 'struct resync_pages' is initialized in the mempool allocator function. This is wrong: mempool is only responsible for allocation, so we can't use it for initialization.
| |
| | Suggested-by: NeilBrown <neilb@suse.com>
| | Reported-by: NeilBrown <neilb@suse.com>
| | Reported-and-tested-by: Patrick <dto@gmx.net>
| | Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync pages)
| | Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync pages)
| | Cc: stable@vger.kernel.org (4.12+)
| | Signed-off-by: Ming Lei <ming.lei@redhat.com>
| | Signed-off-by: Shaohua Li <shli@fb.com>
* | Merge tag 'for-4.13/dm-fixes' of ↵  Linus Torvalds  2017-07-28  8  -46/+86
|\ \
| |/
|/|
| | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
| |
| | Pull device mapper fixes from Mike Snitzer:
| |
| |  - a few DM integrity fixes that improve performance. One addresses inefficiencies in the on-disk journal device layout. Another makes use of the block layer's on-stack plugging when writing the journal.
| |
| |  - a dm-bufio fix for the blk_status_t conversion that went in during the merge window.
| |
| |  - a few DM raid fixes that address correctness when suspending the device, and a fix for the validation that occurs during device activation.
| |
| |  - a couple of DM zoned target fixes. The important one is the fix to not use GFP_KERNEL in the IO path due to concerns about deadlock in low-memory conditions (e.g. swap over a DM zoned device, etc).
| |
| |  - a DM DAX device fix to make sure dm_dax_flush() is called if the underlying DAX device is operating as a write cache.
| |
| | * tag 'for-4.13/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
| |   dm, dax: Make sure dm_dax_flush() is called if device supports it
| |   dm verity fec: fix GFP flags used with mempool_alloc()
| |   dm zoned: use GFP_NOIO in I/O path
| |   dm zoned: remove test for impossible REQ_OP_FLUSH conditions
| |   dm raid: bump target version
| |   dm raid: avoid mddev->suspended access
| |   dm raid: fix activation check in validate_raid_redundancy()
| |   dm raid: remove WARN_ON() in raid10_md_layout_to_format()
| |   dm bufio: fix error code in dm_bufio_write_dirty_buffers()
| |   dm integrity: test for corrupted disk format during table load
| |   dm integrity: WARN_ON if variables representing journal usage get out of sync
| |   dm integrity: use plugging when writing the journal
| |   dm integrity: fix inefficient allocation of journal space
| * dm, dax: Make sure dm_dax_flush() is called if device supports it  Vivek Goyal  2017-07-26  1  -0/+35
| | Currently dm_dax_flush() is not being called, even if the underlying dax device supports write cache, because DAXDEV_WRITE_CACHE is not being propagated up to the DM dax device.
| |
| | If the underlying dax device supports write cache, set DAXDEV_WRITE_CACHE on the DM dax device. This will cause dm_dax_flush() to be called.
| |
| | Fixes: abebfbe2f7 ("dm: add ->flush() dax operation support")
| | Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
| | Acked-by: Dan Williams <dan.j.williams@intel.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm verity fec: fix GFP flags used with mempool_alloc()  NeilBrown  2017-07-26  1  -16/+5
| | mempool_alloc() cannot fail for a GFP_NOIO allocation, so there is no point testing for failure.
| |
| | One place where the code tested for failure was passing "0" as the GFP flags. This is most unusual and was probably meant to be GFP_NOIO, so that is changed.
| |
| | Also, allocations from ->extra_pool and ->prealloc_pool are repeated before releasing the previous allocation. This can deadlock if the code is servicing a write under high memory pressure. To avoid deadlocks, change these to use GFP_NOWAIT and leave the error handling in place.
| |
| | Signed-off-by: NeilBrown <neilb@suse.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm zoned: use GFP_NOIO in I/O path  Damien Le Moal  2017-07-26  3  -9/+9
| | Use GFP_NOIO for memory allocations in the I/O path. Other memory allocations in the initialization path can use GFP_KERNEL.
| |
| | Reported-by: Mikulas Patocka <mpatocka@redhat.com>
| | Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
| | Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm zoned: remove test for impossible REQ_OP_FLUSH conditions  Mikulas Patocka  2017-07-25  1  -2/+2
| | The value REQ_OP_FLUSH is only used by the block code for request-based devices.
| |
| | Remove the tests for REQ_OP_FLUSH from the bio-based dm-zoned-target.
| |
| | Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
| | Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm raid: bump target version  Heinz Mauelshagen  2017-07-25  1  -1/+1
| | Bump dm-raid target version to 1.12.1 to reflect that commit cc27b0c78c ("md: fix deadlock between mddev_suspend() and md_write_start()") is available.
| |
| | This version change allows userspace to detect that the MD fix is available.
| |
| | Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm raid: avoid mddev->suspended access  Heinz Mauelshagen  2017-07-25  1  -5/+7
| | Use runtime flag to ensure that an mddev gets suspended/resumed just once.
| |
| | Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm raid: fix activation check in validate_raid_redundancy()  Heinz Mauelshagen  2017-07-25  1  -5/+5
| | During growing reshapes (i.e. stripes being added to a raid set), the new stripe images are not in-sync and not part of the raid set until the reshape is started. LVM2 has to request multiple table reloads involving superblock updates in order to reflect the proper size of SubLVs in the cluster.
| |
| | Because of that, validate_raid_redundancy() fails before a stripe-adding reshape starts: it checks the total number of devices against the number of rebuild ones rather than against the actual ones in the raid set (as retrieved from the superblock), thus resulting in failed raid4/5/6/10 redundancy checks.
| |
| | E.g. converting a raid5 (which only allows for a maximum of 1 device to fail) from 3 stripes to 7 stripes requests +4 delta disks, causing 4 devices to rebuild during reshaping and thus failing activation.
| |
| | To fix this, move validate_raid_redundancy() to get access to the current raid_set members.
| |
| | Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
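The arithmetic in the raid5 example above can be illustrated with a small userspace sketch. This is not the kernel's validate_raid_redundancy(); the helper names `count_rebuilds` and `redundancy_ok` are hypothetical, and the real check also has to handle raid10 layouts and other details that are omitted here.

```c
#include <stdbool.h>

/* Hypothetical helper: count devices flagged for rebuild among the first n. */
static unsigned int count_rebuilds(const bool *rebuild, unsigned int n)
{
    unsigned int count = 0;

    for (unsigned int i = 0; i < n; i++)
        if (rebuild[i])
            count++;
    return count;
}

/* Hypothetical helper: a set survives if rebuilds do not exceed its tolerance. */
static bool redundancy_ok(unsigned int rebuilds, unsigned int max_failures)
{
    return rebuilds <= max_failures;
}

/* raid5 tolerates one failed device; 3 existing stripes plus 4 new ones. */
static const unsigned int raid5_max_failures = 1;
static const bool rebuild_flags[7] = {
    false, false, false,      /* current raid set members, in sync */
    true, true, true, true    /* new stripe images, not yet in the set */
};
```

Counting over all 7 devices yields 4 rebuilds, which exceeds the raid5 tolerance of 1, while counting only the 3 current members yields 0 and passes.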
| * dm raid: remove WARN_ON() in raid10_md_layout_to_format()  Heinz Mauelshagen  2017-07-25  1  -2/+3
| | Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm bufio: fix error code in dm_bufio_write_dirty_buffers()  Dan Carpenter  2017-07-25  1  -2/+1
| | We should be returning normal negative error codes here. The "a" variable comes from &c->async_write_error, which is a blk_status_t converted to a regular error code.
| |
| | In the current code, the blk_status_t gets propagated back to pool_create() and eventually results in an Oops.
| |
| | Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t")
| | Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
| | Reviewed-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
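The difference between a block status and a negative error code can be sketched in userspace as follows. The typedef, the status values and `sketch_status_to_errno` are illustrative stand-ins, not the kernel definitions; in the kernel the conversion is done by blk_status_to_errno().

```c
#include <errno.h>

/* Illustrative stand-ins for the kernel's blk_status_t values (assumed here). */
typedef unsigned char sketch_blk_status_t;
#define SKETCH_STS_OK     0
#define SKETCH_STS_NOSPC  3
#define SKETCH_STS_IOERR  10

/* Simplified analogue of a status-to-errno conversion. */
static int sketch_status_to_errno(sketch_blk_status_t status)
{
    switch (status) {
    case SKETCH_STS_OK:
        return 0;
    case SKETCH_STS_NOSPC:
        return -ENOSPC;
    default:
        return -EIO;    /* any other status maps to a generic I/O error */
    }
}
```

A raw status such as `SKETCH_STS_IOERR` is a small positive number, so a caller that only treats negative values as errors would not recognise it as one; the converted value is negative, as such callers expect.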
| * dm integrity: test for corrupted disk format during table load  Mikulas Patocka  2017-07-25  1  -0/+5
| | If the dm-integrity superblock was corrupted in such a way that the journal_sections field was zero, the integrity target would deadlock because it would wait forever for free space in the journal.
| |
| | Detect this situation and refuse to activate the device.
| |
| | Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
| | Cc: stable@vger.kernel.org
| | Fixes: 7eada909bfd7 ("dm: add integrity target")
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
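A minimal userspace sketch of this kind of load-time sanity check is shown below. The struct is a hypothetical, trimmed-down stand-in containing only the journal_sections field mentioned above, and `sketch_check_superblock` and the choice of -EINVAL are assumptions for illustration, not the actual dm-integrity code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, trimmed-down superblock: only the field discussed above. */
struct sketch_superblock {
    uint32_t journal_sections;
};

/*
 * Refuse a superblock whose journal has no sections: with zero sections
 * there would never be free journal space to wait for.
 */
static int sketch_check_superblock(const struct sketch_superblock *sb)
{
    if (sb->journal_sections == 0)
        return -EINVAL;
    return 0;
}
```

Rejecting the device at table load time turns a later, hard-to-diagnose hang into an immediate activation error.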
| * dm integrity: WARN_ON if variables representing journal usage get out of sync  Mikulas Patocka  2017-07-25  1  -0/+2
| | If this WARN_ON triggers it speaks to programmer error, and likely implies corruption, but no released kernel should trigger it. This WARN_ON serves to assist DM integrity developers as changes are made/tested in the future.
| |
| | BUG_ON is excessive for catching programmer error; if a user or developer would like warnings to trigger a panic, they can enable that via /proc/sys/kernel/panic_on_warn.
| |
| | Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
| | Signed-off-by: Mike Snitzer <snitzer@redhat.com>