summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
...
| * | block: keep established cmd_flags when cloning into a blk-mq requestKeith Busch2015-01-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_alloc_request() may establish REQ_MQ_INFLIGHT in addition to incrementing the hctx->nr_active count. Any cmd_flags that are established in the newly allocated clone request must be preserved in addition to the cmd_flags that are later copied over from the original request as part of blk_rq_prep_clone(). Otherwise, if REQ_MQ_INFLIGHT isn't set in the clone request the hctx->nr_active count won't get decremented via blk_mq_free_request(). The only consumer of blk_rq_prep_clone() is request-based DM, which uses blk_rq_init() prior to calling blk_rq_prep_clone() for the non-blk-mq case. Given the cloned request's cmd_flags will be 0 it is safe to OR them with the original request's cmd_flags for both the non-blk-mq and blk-mq cases. Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: add blk-mq support to blk_insert_cloned_request()Keith Busch2015-01-281-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | If the request passed to blk_insert_cloned_request() was allocated by a blk-mq device it must be submitted using blk_mq_insert_request(). Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: require blk_rq_prep_clone() be given an initialized clone requestKeith Busch2015-01-281-2/+0
| |/ | | | | | | | | | | | | | | | | | | Prepare to allow blk_rq_prep_clone() to accept clone requests that were allocated from blk-mq request queues. As such the blk_rq_prep_clone() caller must first initialize the clone request. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: add tag allocation policyShaohua Li2015-01-233-17/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the blk-mq part to support tag allocation policy. The default allocation policy isn't changed (though it's not a strict FIFO). The new policy is round-robin for libata. But it's a try-best implementation. If multiple tasks are competing, the tags returned will be mixed (which is unavoidable even with !mq, as requests from different tasks can be mixed in queue) Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: support different tag allocation policyShaohua Li2015-01-231-8/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The libata tag allocation is using a round-robin policy. Next patch will make libata use block generic tag allocation, so let's add a policy to tag allocation. Currently two policies: FIFO (default) and round-robin. Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Remove annoying "unknown partition table" messageBoaz Harrosh2015-01-221-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | As Christoph put it: Can we just get rid of the warnings? It's fairly annoying as devices without partitions are perfectly fine and very useful. Me too I see this message every VM boot for ages on all my devices. Would love to just remove it. For me a partition-table is only needed for a booting BIOS, grub, and stuff. CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Add discard flag to blkdev_issue_zeroout() functionMartin K. Petersen2015-01-212-5/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkdev_issue_discard() will zero a given block range. This is done by way of explicit writing, thus provisioning or allocating the blocks on disk. There are use cases where the desired behavior is to zero the blocks but unprovision them if possible. The blocks must deterministically contain zeroes when they are subsequently read back. This patch adds a flag to blkdev_issue_zeroout() that provides this variant. If the discard flag is set and a block device guarantees discard_zeroes_data we will use REQ_DISCARD to clear the block range. If the device does not support discard_zeroes_data or if the discard request fails we will fall back to first REQ_WRITE_SAME and then a regular REQ_WRITE. Also update the callers of blkdev_issue_zero() to reflect the new flag and make sb_issue_zeroout() prefer the discard approach. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * cfq-iosched: fix incorrect filing of rt async cfqqJeff Moyer2015-01-211-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hi, If you can manage to submit an async write as the first async I/O from the context of a process with realtime scheduling priority, then a cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It ends up in the best effort array, but actually has realtime I/O scheduling priority set in cfqq->ioprio. The reason is that cfq_get_queue assumes the default scheduling class and priority when there is no information present (i.e. when the async cfqq is created): static struct cfq_queue * cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic, struct bio *bio, gfp_t gfp_mask) { const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio); const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio); cic->ioprio starts out as 0, which is "invalid". So, class of 0 (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so: async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio); static struct cfq_queue ** cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio) { switch (ioprio_class) { case IOPRIO_CLASS_RT: return &cfqd->async_cfqq[0][ioprio]; case IOPRIO_CLASS_NONE: ioprio = IOPRIO_NORM; /* fall through */ case IOPRIO_CLASS_BE: return &cfqd->async_cfqq[1][ioprio]; case IOPRIO_CLASS_IDLE: return &cfqd->async_idle_cfqq; default: BUG(); } } Here, instead of returning a class mapped from the process' scheduling priority, we get back the bucket associated with IOPRIO_CLASS_BE. Now, there is no queue allocated there yet, so we create it: cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask); That function ends up doing this: cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync); cfq_init_prio_data(cfqq, cic); cfq_init_cfqq marks the priority as having changed. Then, cfq_init_prio data does this: ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio); switch (ioprio_class) { default: printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class); case IOPRIO_CLASS_NONE: /* * no prio set, inherit CPU scheduling settings */ cfqq->ioprio = task_nice_ioprio(tsk); cfqq->ioprio_class = task_nice_ioclass(tsk); break; So we basically have two code paths that treat IOPRIO_CLASS_NONE differently, which results in an RT async cfqq filed into a best effort bucket. Attached is a patch which fixes the problem. I'm not sure how to make it cleaner. Suggestions would be welcome. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: fix false negative out-of-tags conditionJens Axboe2015-01-141-17/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The blk-mq tagging tries to maintain some locality between CPUs and the tags issued. The tags are split into groups of words, and the words may not be fully populated. When searching for a new free tag, blk-mq may look at partial words, hence it passes in an offset/size to find_next_zero_bit(). However, it does that wrong, the size must always be the full length of the number of tags in that word, otherwise we'll potentially miss some near the end. Another issue is when __bt_get() goes from one word set to the next. It bumps the index, but not the last_tag associated with the previous index. Bump that to be in the range of the new word. Finally, clean up __bt_get() and __bt_get_word() a bit and get rid of the goto in there, and the unnecessary 'wrap' variable. Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: export blk_mq_freeze_queue()Jens Axboe2015-01-021-0/+1
| | | | | | | | | | | | | | Commit b4c6a028774b exported the start and unfreeze, but we need the regular blk_mq_freeze_queue() for the loop conversion. Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-blockLinus Torvalds2015-02-121-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull backing device changes from Jens Axboe: "This contains a cleanup of how the backing device is handled, in preparation for a rework of the life time rules. In this part, the most important change is to split the unrelated nommu mmap flags from it, but also removing a backing_dev_info pointer from the address_space (and inode), and a cleanup of other various minor bits. Christoph did all the work here, I just fixed an oops with pages that have a swap backing. Arnd fixed a missing export, and Oleg killed the lustre backing_dev_info from staging. Last patch was from Al, unexporting parts that are now no longer needed outside" * 'for-3.20/bdi' of git://git.kernel.dk/linux-block: Make super_blocks and sb_lock static mtd: export new mtd_mmap_capabilities fs: make inode_to_bdi() handle NULL inode staging/lustre/llite: get rid of backing_dev_info fs: remove default_backing_dev_info fs: don't reassign dirty inodes to default_backing_dev_info nfs: don't call bdi_unregister ceph: remove call to bdi_unregister fs: remove mapping->backing_dev_info fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info nilfs2: set up s_bdi like the generic mount_bdev code block_dev: get bdev inode bdi directly from the block device block_dev: only write bdev inode on close fs: introduce f_op->mmap_capabilities for nommu mmap support fs: kill BDI_CAP_SWAP_BACKED fs: deduplicate noop_backing_dev_info
| * | fs: introduce f_op->mmap_capabilities for nommu mmap supportChristoph Hellwig2015-01-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since "BDI: Provide backing device capability information [try #3]" the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | Merge branch 'x86-efi-for-linus' of ↵Linus Torvalds2015-02-091-1/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "Main changes: - Move efivarfs from the misc filesystem section to pseudo filesystem - Expose firmware platform size in sysfs - Improve robustness of get_memory_map() by removing assumptions on the size of efi_memory_desc_t. - various cleanups and fixes The biggest risk is the get_memory_map() change, which changes the way that both the arm64 and x86 EFI boot stub build the early memory map. There are no known regressions with it at the moment, BYMMV" * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi: Don't look for chosen@0 node on DT platforms firmware: efi: Remove unneeded guid unparse efi/libstub: Call get_memory_map() to obtain map and desc sizes efi: Small leak on error in runtime map code efi: rtc-efi: Mark UIE as unsupported arm64/efi: efistub: Apply __init annotation efi: Expose underlying UEFI firmware platform size to userland efi: Rename efi_guid_unparse to efi_guid_to_str efi: Update the URLs for efibootmgr fs: Make efivarfs a pseudo filesystem, built by default with EFI
| * \ \ Merge tag 'efi-next' of ↵Ingo Molnar2015-01-291-1/+1
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi Pull EFI updates from Matt Fleming: " - Move efivarfs from the misc filesystem section to pseudo filesystem, since that's a more logical and accurate place - Leif Lindholm - Update efibootmgr URL in Kconfig help - Peter Jones - Improve accuracy of EFI guid function names - Borislav Petkov - Expose firmware platform size in sysfs for the benefit of EFI boot loader installers and other utilities - Steve McIntyre - Cleanup __init annotations for arm64/efi code - Ard Biesheuvel - Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel - Fix memory leak in error code path of runtime map code - Dan Carpenter - Improve robustness of get_memory_map() by removing assumptions on the size of efi_memory_desc_t (which could change in future spec versions) and querying the firmware instead of guessing about the memmap size - Ard Biesheuvel - Remove superfluous guid unparse calls - Ivan Khoronzhuk - Delete unnecessary chosen@0 DT node FDT code since was duplicated from code in drivers/of and is entirely unnecessary - Leif Lindholm There's nothing super scary, mainly cleanups, and a merge from Ricardo who kindly picked up some patches from the linux-efi mailing list while I was out on annual leave in December. Perhaps the biggest risk is the get_memory_map() change from Ard, which changes the way that both the arm64 and x86 EFI boot stub build the early memory map. It would be good to have it bake in linux-next for a while. " Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | efi: Rename efi_guid_unparse to efi_guid_to_strBorislav Petkov2015-01-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Call it what it does - "unparse" is plain-misleading. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
* | | | | blk-mq: release mq's kobjects in blk_release_queue()Ming Lei2015-01-293-7/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kobject memory inside blk-mq hctx/ctx shouldn't have been freed before the kobject is released because driver core can access it freely before its release. We can't do that in all ctx/hctx/mq_kobj's release handler because it can be run before blk_cleanup_queue(). Given mq_kobj shouldn't have been introduced, this patch simply moves mq's release into blk_release_queue(). Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | Revert "blk-mq: fix hctx/ctx kobject use-after-free"Ming Lei2015-01-292-24/+7
|/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 76d697d10769048e5721510100bf3a9413a56385. The commit 76d697d10769048 causes general protection fault reported from Bart Van Assche: https://lkml.org/lkml/2015/1/28/334 Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | blk-mq: fix hctx/ctx kobject use-after-freeMing Lei2015-01-202-7/+24
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kobject memory shouldn't have been freed before the kobject is released because driver core can access it freely before its release. This patch frees hctx in its release callback. For ctx, they share one single per-cpu variable which is associated with the request queue, so free ctx in q->mq_kobj's release handler. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> (fix ctx kobjects) Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blk-mq: End unstarted requests on a dying queueKeith Busch2015-01-081-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Requests that haven't been started prior to a queue dying can be ended in error without waiting for them to start and time out. Signed-off-by: Keith Busch <keith.busch@intel.com> Added code comment to explain why this is done. Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blk-mq: Allow requests to never expireKeith Busch2015-01-082-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | Some types of requests may be started that are not gauranteed to ever complete. This adds a request flag that a driver can use so mark the request as such. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blk-mq: Add helper to abort requeued requestsJens Axboe2015-01-081-0/+20
| | | | | | | | | | | | | | | | | | | | | Adds a helper function a driver can use to abort requeued requests in case any are pending when h/w queues are being removed. Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blk-mq: Let drivers cancel requeue_workKeith Busch2015-01-081-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kicking requeued requests will start h/w queues in a work_queue, which may alter the driver's requested state to temporarily stop them. This patch exports a method to cancel the q->requeue_work so a driver can be assured stopped h/w queues won't be started up before it is ready. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blk-mq: Export if requests were startedKeith Busch2015-01-081-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | Drivers can iterate over all allocated request tags, but their callback needs a way to know if the driver started the request in the first place. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blk-mq: Wake tasks entering queue on dyingKeith Busch2015-01-081-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the queue is set to dying, wake up tasks that are waiting on frozen queue so they realize it is dying and abandon their request. Signed-off-by: Keith Busch <keith.busch@intel.com> Modified by me to add a code comment on the need for the wakeup. Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blk-mq: get rid of ->cmd_size in the hardware queueJens Axboe2015-01-071-1/+0
| |/ |/| | | | | | | | | | | | | | | We store it in the tag set, we don't need it in the hardware queue. While removing cmd_size, place ->queue_num further down to avoid a hole on 64-bit archs. It's not used in any fast paths, so we can safely move it. Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: wake up waiters when a queue is marked dyingJens Axboe2014-12-315-5/+42
| | | | | | | | | | | | | | | | | | | | | | If it's dying, we can't expect new request to complete and come in an wake up other tasks waiting for requests. So after we have marked it as dying, wake up everybody currently waiting for a request. Once they wake, they will retry their allocation and fail appropriately due to the state of the queue. Tested-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: Export freeze/unfreeze functionsKeith Busch2014-12-201-2/+4
| | | | | | | | | | | | | | Let drivers prevent entering a queue that isn't available. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: Exit queue on alloc failureKeith Busch2014-12-201-1/+3
| | | | | | | | | | | | | | Fixes usage counter when a request could not be allocated. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Revert "blk-mq: Micro-optimize bt_get()"Jens Axboe2014-12-151-1/+3
| | | | | | | | | | | | | | | | | | | | | | This reverts commit 52f7eb945f2ba62b324bb9ae16d945326a961dcf. The optimization is only really safe for a single queue, otherwise 'bs' and 'bt' can indeed change, and if we don't do a finish_wait() for each loop, we'll potentially change the wait structure and corrupt task wait list. Reported-by: Jan Kara <jack@suse.cz>
* | Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-blockLinus Torvalds2014-12-1310-119/+197
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver core update from Jens Axboe: "This is the pull request for the core block IO changes for 3.19. Not a huge round this time, mostly lots of little good fixes: - Fix a bug in sysfs blktrace interface causing a NULL pointer dereference, when enabled/disabled through that API. From Arianna Avanzini. - Various updates/fixes/improvements for blk-mq: - A set of updates from Bart, mostly fixing buts in the tag handling. - Cleanup/code consolidation from Christoph. - Extend queue_rq API to be able to handle batching issues of IO requests. NVMe will utilize this shortly. From me. - A few tag and request handling updates from me. - Cleanup of the preempt handling for running queues from Paolo. - Prevent running of unmapped hardware queues from Ming Lei. - Move the kdump memory limiting check to be in the correct location, from Shaohua. - Initialize all software queues at init time from Takashi. This prevents a kobject warning when CPUs are brought online that weren't online when a queue was registered. - Single writeback fix for I_DIRTY clearing from Tejun. Queued with the core IO changes, since it's just a single fix. - Version X of the __bio_add_page() segment addition retry from Maurizio. Hope the Xth time is the charm. - Documentation fixup for IO scheduler merging from Jan. - Introduce (and use) generic IO stat accounting helpers for non-rq drivers, from Gu Zheng. - Kill off artificial limiting of max sectors in a request from Christoph" * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits) bio: modify __bio_add_page() to accept pages that don't start a new segment blk-mq: Fix uninitialized kobject at CPU hotplugging blktrace: don't let the sysfs interface remove trace from running list blk-mq: Use all available hardware queues blk-mq: Micro-optimize bt_get() blk-mq: Fix a race between bt_clear_tag() and bt_get() blk-mq: Avoid that __bt_get_word() wraps multiple times blk-mq: Fix a use-after-free blk-mq: prevent unmapped hw queue from being scheduled blk-mq: re-check for available tags after running the hardware queue blk-mq: fix hang in bt_get() blk-mq: move the kdump check to blk_mq_alloc_tag_set blk-mq: cleanup tag free handling blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map blk: introduce generic io stat accounting help function blk-mq: handle the single queue case in blk_mq_hctx_next_cpu genhd: check for int overflow in disk_expand_part_tbl() blk-mq: add blk_mq_free_hctx_request() blk-mq: export blk_mq_free_request() blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable ...
| * | bio: modify __bio_add_page() to accept pages that don't start a new segmentMaurizio Lombardi2014-12-111-24/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original behaviour is to refuse to add a new page if the maximum number of segments has been reached, regardless of the fact the page we are going to add can be merged into the last segment or not. Unfortunately, when the system runs under heavy memory fragmentation conditions, a driver may try to add multiple pages to the last segment. The original code won't accept them and EBUSY will be reported to userspace. This patch modifies the function so it refuses to add a page only in case the latter starts a new segment and the maximum number of segments has already been reached. The bug can be easily reproduced with the st driver: 1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16 2) modprobe st buffer_kbs=1024 3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10 dd: error writing `/dev/st0': Device or resource busy Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jet Chen <jet.chen@intel.com> Cc: Tomas Henzl <thenzl@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: Fix uninitialized kobject at CPU hotpluggingTakashi Iwai2014-12-101-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a CPU is hotplugged, the current blk-mq spews a warning like: kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong. CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014 0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8 ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58 ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007 Call Trace: [<ffffffff81005306>] dump_trace+0x86/0x330 [<ffffffff81005644>] show_stack_log_lvl+0x94/0x170 [<ffffffff81006d21>] show_stack+0x21/0x50 [<ffffffff81605f07>] dump_stack+0x41/0x51 [<ffffffff8132c7a0>] kobject_add+0xa0/0xb0 [<ffffffff8130aee1>] blk_mq_register_hctx+0x91/0xb0 [<ffffffff8130b82e>] blk_mq_sysfs_register+0x3e/0x60 [<ffffffff81309298>] blk_mq_queue_reinit_notify+0xf8/0x190 [<ffffffff8107cfdc>] notifier_call_chain+0x4c/0x70 [<ffffffff8105fd23>] cpu_notify+0x23/0x50 [<ffffffff81060037>] _cpu_up+0x157/0x170 [<ffffffff810600d9>] cpu_up+0x89/0xb0 [<ffffffff815fa5b5>] cpu_subsys_online+0x35/0x80 [<ffffffff814323cd>] device_online+0x5d/0xa0 [<ffffffff81432485>] online_store+0x75/0x80 [<ffffffff81236a5a>] kernfs_fop_write+0xda/0x150 [<ffffffff811c5532>] vfs_write+0xb2/0x1f0 [<ffffffff811c5f42>] SyS_write+0x42/0xb0 [<ffffffff8160c4ed>] system_call_fastpath+0x16/0x1b [<00007f0132fb24e0>] 0x7f0132fb24e0 This is indeed because of an uninitialized kobject for blk_mq_ctx. The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but it goes loop over hctx_for_each_ctx(), i.e. it initializes only for online CPUs. Thus, when a CPU is hotplugged, the ctx for the newly onlined CPU is registered without initialization. This patch fixes the issue by initializing the all ctx kobjects belonging to each queue. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794 Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: Use all available hardware queuesBart Van Assche2014-12-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Suppose that a system has two CPU sockets, three cores per socket, that it does not support hyperthreading and that four hardware queues are provided by a block driver. With the current algorithm this will lead to the following assignment of CPU cores to hardware queues: HWQ 0: 0 1 HWQ 1: 2 3 HWQ 2: 4 5 HWQ 3: (none) This patch changes the queue assignment into: HWQ 0: 0 1 HWQ 1: 2 HWQ 2: 3 4 HWQ 3: 5 In other words, this patch has the following three effects: - All four hardware queues are used instead of only three. - CPU cores are spread more evenly over hardware queues. For the above example the range of the number of CPU cores associated with a single HWQ is reduced from [0..2] to [1..2]. - If the number of HWQ's is a multiple of the number of CPU sockets it is now guaranteed that all CPU cores associated with a single HWQ reside on the same CPU socket. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: Micro-optimize bt_get()Bart Van Assche2014-12-091-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr() calls into a single call. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: Fix a race between bt_clear_tag() and bt_get()Bart Van Assche2014-12-091-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | What we need is the following two guarantees: * Any thread that observes the effect of the test_and_set_bit() by __bt_get_word() also observes the preceding addition of 'current' to the appropriate wait list. This is guaranteed by the semantics of the spin_unlock() operation performed by prepare_and_wait(). Hence the conversion of test_and_set_bit_lock() into test_and_set_bit(). * The wait lists are examined by bt_clear() after the tag bit has been cleared. clear_bit_unlock() guarantees that any thread that observes that the bit has been cleared also observes the store operations preceding clear_bit_unlock(). However, clear_bit_unlock() does not prevent that the wait lists are examined before that the tag bit is cleared. Hence the addition of a memory barrier between clear_bit() and the wait list examination. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: Avoid that __bt_get_word() wraps multiple timesBart Van Assche2014-12-091-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If __bt_get_word() is called with last_tag != 0, if the first find_next_zero_bit() fails, if after wrap-around the test_and_set_bit() call fails and find_next_zero_bit() succeeds, if the next test_and_set_bit() call fails and subsequently find_next_zero_bit() does not find a zero bit, then another wrap-around will occur. Avoid this by introducing an additional local variable. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: Fix a use-after-freeBart Van Assche2014-12-092-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk-mq users are allowed to free the memory request_queue.tag_set points at after blk_cleanup_queue() has finished but before blk_release_queue() has started. This can happen e.g. in the SCSI core. The SCSI core namely embeds the tag_set structure in a SCSI host structure. The SCSI host structure is freed by scsi_host_dev_release(). This function is called after blk_cleanup_queue() finished but can be called before blk_release_queue(). This means that it is not safe to access request_queue.tag_set from inside blk_release_queue(). Hence remove the blk_sync_queue() call from blk_release_queue(). This call is not necessary - outstanding requests must have finished before blk_release_queue() is called. Additionally, move the blk_mq_free_queue() call from blk_release_queue() to blk_cleanup_queue() to avoid that struct request_queue.tag_set gets accessed after it has been freed. This patch avoids that the following kernel oops can be triggered when deleting a SCSI host for which scsi-mq was enabled: Call Trace: [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270 [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380 [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180 [<ffffffff8124d654>] blk_release_queue+0x84/0xd0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff81245895>] blk_put_queue+0x15/0x20 [<ffffffff8125c409>] disk_release+0x99/0xd0 [<ffffffff8133d056>] device_release+0x36/0xb0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff8125a78a>] put_disk+0x1a/0x20 [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0 [<ffffffff811d56a0>] blkdev_put+0x50/0x160 [<ffffffff81199eb4>] kill_block_super+0x44/0x70 [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60 [<ffffffff8119a87e>] deactivate_super+0x4e/0x70 [<ffffffff811b9833>] cleanup_mnt+0x43/0x90 [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20 [<ffffffff8107252c>] task_work_run+0xac/0xe0 [<ffffffff81002c01>] do_notify_resume+0x61/0xa0 [<ffffffff814d2c58>] int_signal+0x12/0x17 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: prevent unmapped hw queue from being scheduledMing Lei2014-12-082-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When one hardware queue has no mapped software queues, it shouldn't have been scheduled. Otherwise WARNING or OOPS can triggered. blk_mq_hw_queue_mapped() helper is introduce for fixing the problem. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: re-check for available tags after running the hardware queueJens Axboe2014-12-081-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | If we run out of tags and have to sleep, we run the hardware queue to kick pending IO into gear. During that run, we may have completed requests, so re-check if we have free tags before going to sleep. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: fix hang in bt_get()Bart Van Assche2014-12-081-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid that if there are fewer hardware queues than CPU threads that bt_get() can hang. The symptoms of the hang were as follows: * All tags allocated for a particular hardware queue. * (nr_tags) pending commands for that hardware queue. * No pending commands for the software queues associated with that hardware queue. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: move the kdump check to blk_mq_alloc_tag_setShaohua Li2014-11-301-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | We call blk_mq_alloc_tag_set() first then blk_mq_init_queue(). The requests are allocated in the former function. So the kdump check should be moved to there to really save memory. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: cleanup tag free handlingJens Axboe2014-11-241-18/+6
| | | | | | | | | | | | | | | | | | | | | | | | We only call __blk_mq_put_tag() and __blk_mq_put_reserved_tag() from blk_mq_put_tag(), so just inline the two calls instead of having them as separate functions. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu mapJens Axboe2014-11-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | We currently use num_possible_cpus(), but that breaks on sparc64 where the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID instead, so we don't end up reading from invalid memory. Cc: stable@kernel.org # 3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk: introduce generic io stat accounting help functionGu Zheng2014-11-241-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many block drivers accounting io stat based on bio (e.g. NVMe...), the blk_account_io_start/end() which is based on request does not make sense to them, so here we introduce the similar help function named generic_start/end_io_acct base on raw sectors, and it can simplify some driver's open io accounting code. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: handle the single queue case in blk_mq_hctx_next_cpuChristoph Hellwig2014-11-241-21/+10
| | | | | | | | | | | | | | | | | | | | | | | | Don't duplicate the code to handle the not cpu bounce case in the caller, do it inside blk_mq_hctx_next_cpu instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | genhd: check for int overflow in disk_expand_part_tbl()Jens Axboe2014-11-191-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition() with a user passed in partno value. If we pass in 0x7fffffff, the new target in disk_expand_part_tbl() overflows the 'int' and we access beyond the end of ptbl->part[] and even write to it when we do the rcu_assign_pointer() to assign the new partition. Reported-by: David Ramos <daramos@stanford.edu> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: add blk_mq_free_hctx_request()Jens Axboe2014-11-171-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's silly to use blk_mq_free_request() which in turn maps the request to the hardware queue, for places where we already know what the hardware queue is. This saves us an extra mapping of a hardware queue on request completion, if the caller knows this information already. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: export blk_mq_free_request()Jens Axboe2014-11-171-0/+1
| | | | | | | | | | | | | | | | | | | | | Drivers that know they are blk-mq should just use this function instead of calling through blk_put_request(). Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enablePaolo Bonzini2014-11-111-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk-mq is using preempt_disable/enable in order to ensure that the queue runners are placed on the right CPU. This does not work with the RT patches, because __blk_mq_run_hw_queue takes a non-raw spinlock with the preemption-disabled region. If there is contention on the lock, this violates the rules for preemption-disabled regions. While this should be easily fixable within the RT patches just by doing migrate_disable/enable, we can do better and document _why_ this particular region runs with disabled preemption. After the previous patch, it is trivial to switch it to get/put_cpu; the RT patches then can change it to get_cpu_light, which lets virtio-blk run under RT kernels. Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Clark Williams <williams@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk_mq: call preempt_disable/enable in blk_mq_run_hw_queue, and only if neededPaolo Bonzini2014-11-111-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | preempt_disable/enable surrounds every call to blk_mq_run_hw_queue, except the one in blk-flush.c. In fact that one is always asynchronous, and it does not need smp_processor_id(). We can do the same for all other calls, avoiding preempt_disable when async is true. This avoids peppering blk-mq.c with preemption-disabled regions. Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Clark Williams <williams@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
OpenPOWER on IntegriCloud