summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
...
| | * | | | md/raid5: don't test ->writes_pending in raid5_remove_diskNeilBrown2017-03-221-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This test on ->writes_pending cannot be safe as the counter can be incremented at any moment and cannot be locked against. Change it to test conf->active_stripes, which at least can be locked against. More changes are still needed. A future patch will change ->writes_pending, and testing it here will be very inconvenient. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid1: stop using bi_phys_segmentNeilBrown2017-03-221-66/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change to use bio->__bi_remaining to count number of r1bio attached to a bio. See precious raid10 patch for more details. Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending used to measure different things, but were being compared. This patch fixes another bug in that nr_pending previously did not could write-behind requests, so behind writes could continue while resync was happening. How that nr_pending counts all r1_bio, the resync cannot commence until the behind writes have completed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid10: stop using bi_phys_segmentsNeilBrown2017-03-221-51/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | raid10 currently repurposes bi_phys_segments on each incoming bio to count how many r10bio was used to encode the request. We need to know when the number of attached r10bio reaches zero to: 1/ call bio_endio() when all IO on the bio is finished 2/ decrement ->nr_pending so that resync IO can proceed. Now that the bio has its own __bi_remaining counter, that can be used instead. We can call bio_inc_remaining to increment the counter and call bio_endio() every time an r10bio completes, rather than only when bi_phys_segments reaches zero. This addresses point 1, but not point 2. bio_endio() doesn't (and cannot) report when the last r10bio has finished, so a different approach is needed. So: instead of counting bios in ->nr_pending, count r10bios. i.e. every time we attach a bio, increment nr_pending. Every time an r10bio completes, decrement nr_pending. Normally we only increment nr_pending after first checking that ->barrier is zero, or some other non-trivial tests and possible waiting. When attaching multiple r10bios to a bio, we only need the tests and the waiting once. After the first increment, subsequent increments can happen unconditionally as they are really all part of the one request. So introduce inc_pending() which can be used when we know that nr_pending is already elevated. Note that this fixes a bug. freeze_array() contains the line atomic_read(&conf->nr_pending) == conf->nr_queued+extra, which implies that the units for ->nr_pending, ->nr_queued and extra are the same. ->nr_queue and extra count r10_bios, but prior to this patch, ->nr_pending counted bios. If a bio ever resulted in multiple r10_bios (due to bad blocks), freeze_array() would not work correctly. Now it does. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid1, raid10: move rXbio accounting closer to allocation.NeilBrown2017-03-222-26/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When raid1 or raid10 find they will need to allocate a new r1bio/r10bio, in order to work around a known bad block, they account for the allocation well before the allocation is made. This separation makes the correctness less obvious and requires comments. The accounting needs to be a little before: before the first rXbio is submitted, but that is all. So move the accounting down to where it makes more sense. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | Revert "md/raid5: limit request size according to implementation limits"NeilBrown2017-03-221-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5. Now that raid5 doesn't abuse bi_phys_segments any more, we no longer need to impose these limits. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid5: remove over-loading of ->bi_phys_segments.NeilBrown2017-03-222-41/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a read request, which bypassed the cache, fails, we need to retry it through the cache. This involves attaching it to a sequence of stripe_heads, and it may not be possible to get all the stripe_heads we need at once. We do what we can, and record how far we got in ->bi_phys_segments so we can pick up again later. There is only ever one bio which may have a non-zero offset stored in ->bi_phys_segments, the one that is either active in the single thread which calls retry_aligned_read(), or is in conf->retry_read_aligned waiting for retry_aligned_read() to be called again. So we only need to store one offset value. This can be in a local variable passed between remove_bio_from_retry() and retry_aligned_read(), or in the r5conf structure next to the ->retry_read_aligned pointer. Storing it there allows the last usage of ->bi_phys_segments to be removed from md/raid5.c. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as ↵NeilBrown2017-03-223-62/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | a counter md/raid5 needs to keep track of how many stripe_heads are processing a bio so that it can delay calling bio_endio() until all stripe_heads have completed. It currently uses 16 bits of ->bi_phys_segments for this purpose. 16 bits is only enough for 256M requests, and it is possible for a single bio to be larger than this, which causes problems. Also, the bio struct contains a larger counter, __bi_remaining, which has a purpose very similar to the purpose of our counter. So stop using ->bi_phys_segments, and instead use __bi_remaining. This means we don't need to initialize the counter, as our caller initializes it to '1'. It also means we can call bio_endio() directly as it tests this counter internally. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid5: call bio_endio() directly rather than queueing for later.NeilBrown2017-03-224-38/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently gather bios that need to be returned into a bio_list and call bio_endio() on them all together. The original reason for this was to avoid making the calls while holding a spinlock. Locking has changed a lot since then, and that reason is no longer valid. So discard return_io() and various return_bi lists, and just call bio_endio() directly as needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid5: simplfy delaying of writes while metadata is updated.NeilBrown2017-03-222-26/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a device fails during a write, we must ensure the failure is recorded in the metadata before the completion of the write is acknowleged. Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.") added code for this, but it was unnecessarily complicated. We already had similar functionality for handling updates to the bad-block-list, thanks to Commit de393cdea66c ("md: make it easier to wait for bad blocks to be acknowledged.") So revert most of the former commit, and instead avoid collecting completed writes if MD_CHANGE_PENDING is set. raid5d() will then flush the metadata and retry the stripe_head. As this change can leave a stripe_head ready for handling immediately after handle_active_stripes() returns, we change raid5_do_work() to pause when MD_CHANGE_PENDING is set, so that it doesn't spin. We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set asynchronously. After analyse_stripe(), we have collected stable data about the state of devices, which will be used to make decisions. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid5: use md_write_start to count stripes, not biosNeilBrown2017-03-224-15/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We use md_write_start() to increase the count of pending writes, and md_write_end() to decrement the count. We currently count bios submitted to md/raid5. Change it count stripe_heads that a WRITE bio has been attached to. So now, raid5_make_request() calls md_write_start() and then md_write_end() to keep the count elevated during the setup of the request. add_stripe_bio() calls md_write_start() for each stripe_head, and the completion routines always call md_write_end(), instead of only calling it when raid5_dec_bi_active_stripes() returns 0. make_discard_request also calls md_write_start/end(). The parallel between md_write_{start,end} and use of bi_phys_segments can be seen in that: Whenever we set bi_phys_segments to 1, we now call md_write_start. Whenever we increment it on non-read requests with raid5_inc_bi_active_stripes(), we now call md_write_start(). Whenever we decrement bi_phys_segments on non-read requsts with raid5_dec_bi_active_stripes(), we now call md_write_end(). This reduces our dependence on keeping a per-bio count of active stripes in bi_phys_segments. md_write_inc() is added which parallels md_write_start(), but requires that a write has already been started, and is certain never to sleep. This can be used inside a spinlocked region when adding to a write request. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md: move bitmap_destroy to the beginning of __md_stopGuoqing Jiang2017-03-163-11/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we have switched to sync way to handle METADATA_UPDATED msg for md-cluster, then process_metadata_update is depended on mddev->thread->wqueue. With the new change, clustered raid could possible hang if array received a METADATA_UPDATED msg after array unregistered mddev->thread, so we need to stop clustered raid (bitmap_destroy -> bitmap_free -> md_cluster_stop) earlier than unregister thread (mddev_detach -> md_unregister_thread). And this change should be safe for non-clustered raid since all writes are stopped before the destroy. Also in md_run, we activate the personality (pers->run()) before activating the bitmap (bitmap_create()). So it is pleasingly symmetric to stop the bitmap (bitmap_destroy()) before stopping the personality (__md_stop() calls pers->free()), we achieve this by move bitmap_destroy to the beginning of __md_stop. But we don't want to break the codes for waiting behind IO as Shaohua mentioned, so introduce bitmap_wait_behind_writes to call the codes, and call the new fun in both mddev_detach and bitmap_destroy, then we will not break original behind IO code and also fit the new condition well. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/r5cache: generate R5LOG_PAYLOAD_FLUSHSong Liu2017-03-161-3/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to log->current_io. Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in quiesce. Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload. However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH, which is simpler. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recoverySong Liu2017-03-161-6/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery. Next patch will add logic that generate R5LOG_PAYLOAD_FLUSH on flush finish. When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity will be dropped from recovery. This will reduce the number of stripes to replay, and thus accelerate the recovery process. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | raid5-ppl: runtime PPL enabling or disablingArtur Paszkiewicz2017-03-164-3/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow writing to 'consistency_policy' attribute when the array is active. Add a new function 'change_consistency_policy' to the md_personality operations structure to handle the change in the personality code. Values "ppl" and "resync" are accepted and turn PPL on and off respectively. When enabling PPL its location and size should first be set using 'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be written at this location on each member device. Enabling or disabling PPL is performed under a suspended array. The raid5_reset_stripe_cache function frees the stripe cache and allocates it again in order to allocate or free the ppl_pages for the stripes in the stripe cache. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | raid5-ppl: support disk hot add/remove with PPLArtur Paszkiewicz2017-03-163-2/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a function to modify the log by removing an rdev when a drive fails or adding when a spare/replacement is activated as a raid member. Removing a disk just clears the child log rdev pointer. No new stripes will be accepted for this child log in ppl_write_stripe() and running io units will be processed without writing PPL to the device. Adding a disk sets the child log rdev pointer and writes an empty PPL header. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | raid5-ppl: load and recover the logArtur Paszkiewicz2017-03-162-1/+493
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Load the log from each disk when starting the array and recover if the array is dirty. The initial empty PPL is written by mdadm. When loading the log we verify the header checksum and signature. For external metadata arrays the signature is verified in userspace, so here we read it from the header, verifying only if it matches on all disks, and use it later when writing PPL. In addition to the header checksum, each header entry also contains a checksum of its partial parity data. If the header is valid, recovery is performed for each entry until an invalid entry is found. If the array is not degraded and recovery using PPL fully succeeds, there is no need to resync the array because data and parity will be consistent, so in this case resync will be disabled. Due to compatibility with IMSM implementations on other systems, we can't assume that the recovery data block size is always 4K. Writes generated by MD raid5 don't have this issue, but when recovering PPL written in other environments it is possible to have entries with 512-byte sector granularity. The recovery code takes this into account and also the logical sector size of the underlying drives. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md: add sysfs entries for PPLArtur Paszkiewicz2017-03-161-0/+115
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add 'consistency_policy' attribute for array. It indicates how the array maintains consistency in case of unexpected shutdown. Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location and size of the PPL space on the device. They can't be changed for active members if the array is started and PPL is enabled, so in the setter functions only basic checks are performed. More checks are done in ppl_validate_rdev() when starting the log. These attributes are writable to allow enabling PPL for external metadata arrays and (later) to enable/disable PPL for a running array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | raid5-ppl: Partial Parity Log write logging implementationArtur Paszkiewicz2017-03-165-5/+798
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the calculation of partial parity for a stripe and PPL write logging functionality. The description of PPL is added to the documentation. More details can be found in the comments in raid5-ppl.c. Attach a page for holding the partial parity data to stripe_head. Allocate it only if mddev has the MD_HAS_PPL flag set. Partial parity is the xor of not modified data chunks of a stripe and is calculated as follows: - reconstruct-write case: xor data from all not updated disks in a stripe - read-modify-write case: xor old data and parity from all updated disks in a stripe Implement it using the async_tx API and integrate into raid_run_ops(). It must be called when we still have access to old data, so do it when STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is stored into sh->ppl_page. Partial parity is not meaningful for full stripe write and is not stored in the log or used for recovery, so don't attempt to calculate it when stripe has STRIPE_FULL_WRITE. Put the PPL metadata structures to md_p.h because userspace tools (mdadm) will also need to read/write PPL. Warn about using PPL with enabled disk volatile write-back cache for now. It can be removed once disk cache flushing before writing PPL is implemented. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | raid5: separate header for log functionsArtur Paszkiewicz2017-03-164-69/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move raid5-cache declarations from raid5.h to raid5-log.h, add inline wrappers for functions which will be shared with ppl and use them in raid5 core instead of direct calls to raid5-cache. Remove unused parameter from r5c_cache_data(), move two duplicated pr_debug() calls to r5l_init_log(). Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md: superblock changes for PPLArtur Paszkiewicz2017-03-164-2/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Include information about PPL location and size into mdp_superblock_1 and copy it to/from rdev. Because PPL is mutually exclusive with bitmap, put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for 'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to mddev->flags to indicate that PPL is enabled on an array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/r5cache: improve recovery with read ahead page poolSong Liu2017-03-161-46/+175
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In r5cache recovery, the journal device is scanned page by page. Currently, we use sync_page_io() to read journal device. This is not efficient when we have to recovery many stripes from the journal. To improve the speed of recovery, this patch introduces a read ahead page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive pages are read in one IO. Then the recovery code read the journal from ra_pool. With ra_pool, r5l_recovery_ctx has become much bigger. Therefore, r5l_recovery_log() is refactored so r5l_recovery_ctx is not using stack space. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid5: sort biosShaohua Li2017-03-162-26/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previous patch (raid5: only dispatch IO from raid5d for harddisk raid) defers IO dispatching. The goal is to create better IO pattern. At that time, we don't sort the deffered IO and hope the block layer can do IO merge and sort. Now the raid5-cache writeback could create large amount of bios. And if we enable muti-thread for stripe handling, we can't control when to dispatch IO to raid disks. In a lot of time, we are dispatching IO which block layer can't do merge effectively. This patch moves further for the IO dispatching defer. We accumulate bios, but we don't dispatch all the bios after a threshold is met. This 'dispatch partial portion of bios' stragety allows bios coming in a large time window are sent to disks together. At the dispatching time, there is large chance the block layer can merge the bios. To make this more effective, we dispatch IO in ascending order. This increases request merge chance and reduces disk seek. Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid5-cache: bump flush stripe batch sizeShaohua Li2017-03-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bump the flush stripe batch size to 2048. For my 12 disks raid array, the stripes takes: 12 * 4k * 2048 = 96MB This is still quite small. A hardware raid card generally has 1GB size, which we suggest the raid5-cache has similar cache size. The advantage of a big batch size is we can dispatch a lot of IO in the same time, then we can do some scheduling to make better IO pattern. Last patch prioritizes stripes, so we don't worry about a big flush stripe batch will starve normal stripes. Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md/raid5: prioritize stripes for writebackShaohua Li2017-03-162-9/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In raid5-cache writeback mode, we have two types of stripes to handle. - stripes which aren't cached yet - stripes which are cached and flushing out to raid disks Upperlayer is more sensistive to latency of the first type of stripes generally. But we only one handle list for all these stripes, where the two types of stripes are mixed together. When reclaim flushes a lot of stripes, the first type of stripes could be noticeably delayed. On the other hand, if the log space is tight, we'd like to handle the second type of stripes faster and free log space. This patch destinguishes the two types stripes. They are added into different handle list. When we try to get a stripe to handl, we prefer the first type of stripes unless log space is tight. This should have no impact for !writeback case. Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md-cluster: add the support for resizeGuoqing Jiang2017-03-163-5/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To update size for cluster raid, we need to make sure all nodes can perform the change successfully. However, it is possible that some of them can't do it due to failure (bitmap_resize could fail). So we need to consider the issue before we set the capacity unconditionally, and we use below steps to perform sanity check. 1. A change the size, then broadcast METADATA_UPDATED msg. 2. B and C receive METADATA_UPDATED change the size excepts call set_capacity, sync_size is not update if the change failed. Also call bitmap_update_sb to sync sb to disk. 3. A checks other node's sync_size, if sync_size has been updated in all nodes, then send CHANGE_CAPACITY msg otherwise send msg to revert previous change. 4. B and C call set_capacity if receive CHANGE_CAPACITY msg, otherwise pers->resize will be called to restore the old value. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md-cluster: introduce cluster_check_sync_sizeGuoqing Jiang2017-03-163-10/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support resize is a little complex for clustered raid, since we need to ensure all the nodes share the same knowledge about the size of raid. We achieve the goal by check the sync_size which is in each node's bitmap, we can only change the capacity after cluster_check_sync_size returns 0. Also, get_bitmap_from_slot is added to get a slot's bitmap. And we exported some funcs since they are used in cluster_check_sync_size(). We can also reuse get_bitmap_from_slot to remove redundant code existed in bitmap_copy_from_slot. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md-cluster: add CHANGE_CAPACITY message typeGuoqing Jiang2017-03-161-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The msg type CHANGE_CAPACITY is introduced to support resize clustered raid in later patch, and it is sent after all the nodes have the same sync_size, receiver node just need to set new capacity once received this msg. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| | * | | | md-cluster: use sync way to handle METADATA_UPDATED msgGuoqing Jiang2017-03-163-23/+66
| | | |_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, when node received METADATA_UPDATED msg, it just need to wakeup mddev->thread, then md_reload_sb will be called eventually. We taken the asynchronous way to avoid a deadlock issue, the deadlock issue could happen when one node is receiving the METADATA_UPDATED msg (wants reconfig_mutex) and trying to run the path: md_check_recovery -> mddev_trylock(hold reconfig_mutex) -> md_update_sb-metadata_update_start (want EX on token however token is got by the sending node) Since we will support resizing for clustered raid, and we need the metadata update handling to be synchronous so that the initiating node can detect failure, so we need to change the way for handling METADATA_UPDATED msg. But, we obviously need to avoid above deadlock with the sync way. To make this happen, we considered to not hold reconfig_mutex to call md_reload_sb, if some other thread has already taken reconfig_mutex and waiting for the 'token', then process_recvd_msg() can safely call md_reload_sb() without taking the mutex. This is because we can be certain that no other thread will take the mutex, and we also certain that the actions performed by md_reload_sb() won't interfere with anything that the other thread is in the middle of. To make this more concrete, we added a new cinfo->state bit MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD Which is set in lock_token() just before dlm_lock_sync() is called, and cleared just after. As lock_token() is always called with reconfig_mutex() held (the specific case is the resync_info_update which is distinguished well in previous patch), if process_recvd_msg() finds that the new bit is set, then the mutex must be held by some other thread, and it will keep waiting. So process_metadata_update() can call md_reload_sb() if either mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set. The tricky bit is what to do if neither of these apply. We need to wait. Fortunately mddev_unlock() always calls wake_up() on mddev->thread->wqueue. So we can get lock_token() to call wake_up() on that when it sets the bit. There are also some related changes inside this commit: 1. remove RELOAD_SB related codes since there are not valid anymore. 2. mddev is added into md_cluster_info then we can get mddev inside lock_token. 3. add new parameter for lock_token to distinguish reconfig_mutex is held or not. And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below: 1. set it before unregister thread, otherwise a deadlock could appear if stop a resyncing array. This is because md_unregister_thread(&cinfo->recv_thread) is blocked by recv_daemon -> process_recvd_msg -> process_metadata_update. To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is also need to be set before unregister thread. 2. set it in metadata_update_start to fix another deadlock. a. Node A sends METADATA_UPDATED msg (held Token lock). b. Node B wants to do resync, and is blocked since it can't get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is not set since the callchain (md_do_sync -> sync_request -> resync_info_update -> sendmsg -> lock_comm -> lock_token) doesn't hold reconfig_mutex. c. Node B trys to update sb (held reconfig_mutex), but stopped at wait_event() in metadata_update_start since we have set MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2). d. Then Node B receives METADATA_UPDATED msg from A, of course recv_daemon is blocked forever. Since metadata_update_start always calls lock_token with reconfig_mutex, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and lock_token don't need to set it twice unless lock_token is invoked from lock_comm. Finally, thanks to Neil for his great idea and help! Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | | | | Merge branch 'for-linus' of ↵Linus Torvalds2017-05-021-1/+1
|\ \ \ \ \ | |/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: tty: fix comment for __tty_alloc_driver() init/main: properly align the multi-line comment init/main: Fix double "the" in comment Fix dead URLs to ftp.kernel.org drivers: Clean up duplicated email address treewide: Fix typo in xml/driver-api/basics.xml tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall" selftests/timers: Spelling s/privledges/privileges/ HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/ net: phy: dp83848: Fix Typo UBI: Fix typos Documentation: ftrace.txt: Correct nice value of 120 priority net: fec: Fix typo in error msg and comment treewide: Fix typos in printk
| * | | | Fix dead URLs to ftp.kernel.orgSeongJae Park2017-03-281-1/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | URLs to ftp.kernel.org are still exist though the service is closed [0]. This commit fixes the URLs to use www.kernel.org instead. [0] https://www.kernel.org/shutting-down-ftp-services.html Signed-off-by: SeongJae Park <sj38.park@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | | Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-blockLinus Torvalds2017-05-0121-81/+126
|\ \ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer updates from Jens Axboe: - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ was initially a fork of CFQ, but subsequently changed to implement fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant to be used on desktop type single drives, providing good fairness. From Paolo. - Add Kyber IO scheduler. This is a full multiqueue aware scheduler, using a scalable token based algorithm that throttles IO based on live completion IO stats, similary to blk-wbt. From Omar. - A series from Jan, moving users to separately allocated backing devices. This continues the work of separating backing device life times, solving various problems with hot removal. - A series of updates for lightnvm, mostly from Javier. Includes a 'pblk' target that exposes an open channel SSD as a physical block device. - A series of fixes and improvements for nbd from Josef. - A series from Omar, removing queue sharing between devices on mostly legacy drivers. This helps us clean up other bits, if we know that a queue only has a single device backing. This has been overdue for more than a decade. - Fixes for the blk-stats, and improvements to unify the stats and user windows. This both improves blk-wbt, and enables other users to register a need to receive IO stats for a device. From Omar. - blk-throttle improvements from Shaohua. This provides a scalable framework for implementing scalable priotization - particularly for blk-mq, but applicable to any type of block device. The interface is marked experimental for now. - Bucketized IO stats for IO polling from Stephen Bates. This improves efficiency of polled workloads in the presence of mixed block size IO. - A few fixes for opal, from Scott. - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics. From a variety of folks, mostly Sagi and James Smart. - A series from Bart, improving our exposed info and capabilities from the blk-mq debugfs support. - A series from Christoph, cleaning up how handle WRITE_ZEROES. - A series from Christoph, cleaning up the block layer handling of how we track errors in a request. On top of being a nice cleanup, it also shrinks the size of struct request a bit. - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was never used by platforms, and the latter has outlived it's usefulness. - Various little bug fixes and cleanups from a wide variety of folks. * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits) block: hide badblocks attribute by default blk-mq: unify hctx delay_work and run_work block: add kblock_mod_delayed_work_on() blk-mq: unify hctx delayed_run_work and run_work nbd: fix use after free on module unload MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler blk-mq-sched: alloate reserved tags out of normal pool mtip32xx: use runtime tag to initialize command header scsi: Implement blk_mq_ops.show_rq() blk-mq: Add blk_mq_ops.show_rq() blk-mq: Show operation, cmd_flags and rq_flags names blk-mq: Make blk_flags_show() callers append a newline character blk-mq: Move the "state" debugfs attribute one level down blk-mq: Unregister debugfs attributes earlier blk-mq: Only unregister hctxs for which registration succeeded blk-mq-debugfs: Rename functions for registering and unregistering the mq directory blk-mq: Let blk_mq_debugfs_register() look up the queue name blk-mq: Register <dev>/queue/mq after having registered <dev>/queue ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset ide-pm: always pass 0 error to __blk_end_request_all ..
| * | | blk-mq: remove the error argument to blk_mq_complete_requestChristoph Hellwig2017-04-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that all drivers that call blk_mq_complete_requests have a ->complete callback we can remove the direct call to blk_mq_end_request, as well as the error argument to blk_mq_complete_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | dm mpath: don't check for req->errorsChristoph Hellwig2017-04-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We'll get all proper errors reported through ->end_io and ->errors will go away soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | dm rq: don't pass irrelevant error code to blk_mq_complete_requestChristoph Hellwig2017-04-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dm never uses rq->errors, so there is no need to pass an error argument to blk_mq_complete_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: remove the discard_zeroes_data flagChristoph Hellwig2017-04-087-61/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | dm kcopyd: switch to use REQ_OP_WRITE_ZEROESChristoph Hellwig2017-04-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems like the code currently passes whatever it was using for writes to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't need any payload. Untested, and confused by the code, maybe someone who understands it better than me can help.. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | dm: support REQ_OP_WRITE_ZEROESChristoph Hellwig2017-04-088-8/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | dm io: discards don't take a payloadChristoph Hellwig2017-04-081-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix up do_region to not allocate a bio_vec for discards. We've got rid of the discard payload allocated by the caller years ago. Obviously this wasn't actually harmful given how long it's been there, but it's still good to avoid the pointless allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | md: support REQ_OP_WRITE_ZEROESChristoph Hellwig2017-04-087-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | Merge branch 'for-linus' into for-4.12/blockJens Axboe2017-04-071-0/+1
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've added a considerable amount of fixes for stalls and issues with the blk-mq scheduling in the 4.11 series since forking off the for-4.12/block branch. We need to do improvements on top of that for 4.12, so pull in the previous fixes to make our lives easier going forward. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | block: trace completion of all bios.NeilBrown2017-04-072-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently only dm and md/raid5 bios trigger trace_block_bio_complete(). Now that we have bio_chain() and bio_inc_remaining(), it is not possible, in general, for a driver to know when the bio is really complete. Only bio_endio() knows that. So move the trace_block_bio_complete() call to bio_endio(). Now trace_block_bio_complete() pairs with trace_block_bio_queue(). Any bio for which a 'queue' event is traced, will subsequently generate a 'complete' event. There are a few cases where completion tracing is not wanted. 1/ If blk_update_request() has already generated a completion trace event at the 'request' level, there is no point generating one at the bio level too. In this case the bi_sector and bi_size will have changed, so the bio level event would be wrong 2/ If the bio hasn't actually been queued yet, but is being aborted early, then a trace event could be confusing. Some filesystems call bio_endio() but do not want tracing. 3/ The bio_integrity code interposes itself by replacing bi_end_io, then restoring it and calling bio_endio() again. This would produce two identical trace events if left like that. To handle these, we introduce a flag BIO_TRACE_COMPLETION and only produce the trace event when this is set. We address point 1 above by clearing the flag in blk_update_request(). We address point 2 above by only setting the flag when generic_make_request() is called. We address point 3 above by clearing the flag after generating a completion event. When bio_split() is used on a bio, particularly in blk_queue_split(), there is an extra complication. A new bio is split off the front, and may be handle directly without going through generic_make_request(). The old bio, which has been advanced, is passed to generic_make_request(), so it will trigger a trace event a second time. Probably the best result when a split happens is to see a single 'queue' event for the whole bio, then multiple 'complete' events - one for each component. To achieve this was can: - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split() - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set. This way, the split-off bio won't create a queue event, the original won't either even if it re-submitted to generic_make_request(), but both will produce completion events, each for their own range. So if generic_make_request() is called (which generates a QUEUED event), then bi_endio() will create a single COMPLETE event for each range that the bio is split into, unless the driver has explicitly requested it not to. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: constify struct blk_mq_opsEric Biggers2017-03-311-1/+1
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | Constify all instances of blk_mq_ops, as they are never modified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2017-04-081-0/+1
|\ \ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "Here's a pull request for 4.11-rc, fixing a set of issues mostly centered around the new scheduling framework. These have been brewing for a while, but split up into what we absolutely need in 4.11, and what we can defer until 4.12. These are well tested, on both single queue and multiqueue setups, and with and without shared tags. They fix several hangs that have happened in testing. This is obviously larger than I would have preferred at this point in time, but I don't think we can shave much off this and still get the desired results. In detail, this pull request contains: - a set of five fixes for NVMe, mostly from Christoph and one from Roland. - a series from Bart, fixing issues with dm-mq and SCSI shared tags and scheduling. Note that one of those patches commit messages may read like an optimization, but it is in fact an important fix for queue restarts in particular. - a series from Omar, most importantly fixing a hang with multiple hardware queues when we fail to get a driver tag. Another important fix in there is for resizing hardware queues, which nbd does when handling multiple sockets for one connection. - fixing an imbalance in putting the ctx for hctx request allocations from Minchan" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: Restart a single queue if tag sets are shared dm rq: Avoid that request processing stalls sporadically scsi: Avoid that SCSI queues get stuck blk-mq: Introduce blk_mq_delay_run_hw_queue() blk-mq: remap queues when adding/removing hardware queues blk-mq-sched: fix crash in switch error path blk-mq-sched: set up scheduler tags when bringing up new queues blk-mq-sched: refactor scheduler initialization blk-mq: use the right hctx when getting a driver tag fails nvmet: fix byte swap in nvmet_parse_io_cmd nvmet: fix byte swap in nvmet_execute_write_zeroes nvmet: add missing byte swap in nvmet_get_smart_log nvme: add missing byte swap in nvme_setup_discard nvme: Correct NVMF enum values to match NVMe-oF rev 1.0 block: do not put mq context in blk_mq_alloc_request_hctx
| * | | dm rq: Avoid that request processing stalls sporadicallyBart Van Assche2017-04-071-0/+1
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running the srp-test software I noticed that request processing stalls sporadically at the beginning of a test, namely when mkfs is run against a dm-mpath device. Every time when that happened the following command was sufficient to resume request processing: echo run >/sys/kernel/debug/block/dm-0/state This patch avoids that such request processing stalls occur. The test I ran is as follows: while srp-test/run_tests -d -r 30 -t 02-mq; do :; done Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Jens Axboe <axboe@fb.com>
* | | Merge tag 'dm-4.11-fixes-2' of ↵Linus Torvalds2017-04-074-8/+24
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - two stable fixes for the verity target's FEC support - a stable fix for raid target's raid1 support (when no bitmap is used) - a 4.11 cache metadata v2 format fix to properly test blocks are clean * tag 'dm-4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity fec: fix bufio leaks dm raid: fix NULL pointer dereference for raid1 without bitmap dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty dm verity fec: limit error correction recursion
| * | dm verity fec: fix bufio leaksSami Tolvanen2017-03-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Buffers read through dm_bufio_read() were not released in all code paths. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | dm raid: fix NULL pointer dereference for raid1 without bitmapDmitry Bilunov2017-03-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4257e08 ("dm raid: support to change bitmap region size") introduced a bitmap resize call during preresume phase. User can create a DM device with "raid" target configured as raid1 with no metadata devices to hold superblock/bitmap info. It can be achieved using the following sequence: truncate -s 32M /dev/shm/raid-test LOOP=$(losetup --show -f /dev/shm/raid-test) dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0" dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024" dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1" This results in the following crash: [ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1 [ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB [ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors [ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 [ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod] [ 4029.114816] PGD 0 … [ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015 [ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000 [ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod] [ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246 [ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000 [ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000 [ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff [ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058 [ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 4029.115223] FS: 00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000 [ 4029.115245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0 [ 4029.115281] Call Trace: [ 4029.115291] ? raid_iterate_devices+0x63/0x80 [dm_raid] [ 4029.115309] ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod] [ 4029.115329] ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod] [ 4029.115346] raid_preresume+0x81/0x2e0 [dm_raid] [ 4029.115361] dm_table_resume_targets+0x47/0xe0 [dm_mod] [ 4029.115378] dm_resume+0xa8/0xd0 [dm_mod] [ 4029.115391] dev_suspend+0x123/0x250 [dm_mod] [ 4029.115405] ? table_load+0x350/0x350 [dm_mod] [ 4029.115419] ctl_ioctl+0x1c2/0x490 [dm_mod] [ 4029.115433] dm_ctl_ioctl+0xe/0x20 [dm_mod] [ 4029.115447] do_vfs_ioctl+0x8d/0x5a0 [ 4029.115459] ? ____fput+0x9/0x10 [ 4029.115470] ? task_work_run+0x79/0xa0 [ 4029.115481] SyS_ioctl+0x3c/0x70 [ 4029.115493] entry_SYSCALL_64_fastpath+0x13/0x94 The raid_preresume() function incorrectly assumes that the raid_set has a bitmap enabled if RT_FLAG_RS_BITMAP_LOADED is set. But RT_FLAG_RS_BITMAP_LOADED is getting set in __load_dirty_region_bitmap() even if there is no bitmap present (and bitmap_load() happily returns 0 even if a bitmap isn't present). So the only way forward in the near-term is to check if the bitmap is present by seeing if mddev->bitmap is not NULL after bitmap_load() has been called. By doing so the above NULL pointer is avoided. Fixes: 4257e08 ("dm raid: support to change bitmap region size") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru> Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirtyJoe Thornber2017-03-201-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The dm_bitset_cursor_begin() call was using the incorrect nr_entries. Also, the last dm_bitset_cursor_next() must be avoided if we're at the end of the cursor. Fixes: 7f1b21591a6 ("dm cache metadata: use cursor api in blocks_are_clean_separate_dirty()") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | dm verity fec: limit error correction recursionSami Tolvanen2017-03-162-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the hash tree itself is sufficiently corrupt in addition to data blocks, it's possible for error correction to end up in a deep recursive loop, which eventually causes a kernel panic. This change limits the recursion to a reasonable level during a single I/O operation. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.5+
* | | Merge branch 'for-linus' of ↵Linus Torvalds2017-03-166-38/+72
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: - fix a parity calculation bug of raid5 cache by Song - fix a potential deadlock issue by me - fix two endian issues by Jason - fix a disk limitation issue by Neil - other small fixes and cleanup * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/raid1: fix a trivial typo in comments md/r5cache: fix set_syndrome_sources() for data in cache md: fix incorrect use of lexx_to_cpu in does_sb_need_changing md: fix super_offset endianness in super_1_rdev_size_change md/raid1/10: fix potential deadlock md: don't impose the MD_SB_DISKS limit on arrays without metadata. md: move funcs from pers->resize to update_size md-cluster: remove useless memset from gather_all_resync_info md-cluster: free md_cluster_info if node leave cluster md: delete dead code md/raid10: submit bio directly to replacement disk
OpenPOWER on IntegriCloud