summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2012-08-0119-177/+238
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver changes from Jens Axboe: - Making the plugging support for drivers a bit more sane from Neil. This supersedes the plugging change from Shaohua as well. - The usual round of drbd updates. - Using a tail add instead of a head add in the request completion for ndb, making us find the most completed request more quickly. - A few floppy changes, getting rid of a duplicated flag and also running the floppy init async (since it takes forever in boot terms) from Andi. * 'for-3.6/drivers' of git://git.kernel.dk/linux-block: floppy: remove duplicated flag FD_RAW_NEED_DISK blk: pass from_schedule to non-request unplug functions. block: stack unplug blk: centralize non-request unplug handling. md: remove plug_cnt feature of plugging. block/nbd: micro-optimization in nbd request completion drbd: announce FLUSH/FUA capability to upper layers drbd: fix max_bio_size to be unsigned drbd: flush drbd work queue before invalidate/invalidate remote drbd: fix potential access after free drbd: call local-io-error handler early drbd: do not reset rs_pending_cnt too early drbd: reset congestion information before reporting it in /proc/drbd drbd: report congestion if we are waiting for some userland callback drbd: differentiate between normal and forced detach drbd: cleanup, remove two unused global flags floppy: Run floppy initialization asynchronous
| * Merge branch 'upstream' of ↵Jens Axboe2012-07-311-2/+1
| |\ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/floppy into for-3.6/drivers
| | * floppy: remove duplicated flag FD_RAW_NEED_DISKFengguang Wu2012-07-311-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Fix coccinelle warning (without behavior change): drivers/block/floppy.c:2518:32-48: duplicated argument to & or | Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
| * | blk: pass from_schedule to non-request unplug functions.NeilBrown2012-07-315-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | This will allow md/raid to know why the unplug was called, and will be able to act according - if !from_schedule it is safe to perform tasks which could themselves schedule. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: stack unplugShaohua Li2012-07-311-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD raid1 prepares to dispatch request in unplug callback. If make_request in low level queue also uses unplug callback to dispatch request, the low level queue's unplug callback will not be called. Recheck the callback list helps this case. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk: centralize non-request unplug handling.NeilBrown2012-07-315-83/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both md and umem has similar code for getting notified on an blk_finish_plug event. Centralize this code in block/ and allow each driver to provide its distinctive difference. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | md: remove plug_cnt feature of plugging.NeilBrown2012-07-315-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally and testing to try to exercise the case it is most likely to help did not show any performance difference by removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the other for other unplug enhancements which have a measurable effect. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block/nbd: micro-optimization in nbd request completionChetan Loke2012-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add in-flight cmds to the tail. That way while searching (during request completion),we will always get a hit on the first element. Signed-off-by: Chetan Loke <loke.chetan@gmail.com> Acked-by: Paul.Clements@steeleye.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-3.6/driversJens Axboe2012-07-301401-8277/+16420
| |\ \
| | * | drbd: announce FLUSH/FUA capability to upper layersLars Ellenberg2012-07-244-8/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unconditionally announce FLUSH/FUA to upper layers. If the lower layers on either node do not actually support this, generic_make_request() will deal with it. If this causes performance regressions on your setup, make sure there are no volatile caches involved, and mount -o nobarrier or equivalent. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: fix max_bio_size to be unsignedLars Ellenberg2012-07-243-15/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We capped our max_bio_size respectively max_hw_sectors with min_t(int, lower level limit, our limit); unfortunately, some drivers, e.g. the kvm virtio block driver, initialize their limits to "-1U", and that is of course a smaller "int" value than our limit. Impact: we started to request 16 MB resync requests, which lead to protocol error and a reconnect loop. Fix all relevant constants and parameters to be unsigned int. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: flush drbd work queue before invalidate/invalidate remoteLars Ellenberg2012-07-242-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If you do back to back wait-sync/invalidate on a Primary in a tight loop, during application IO load, you could trigger a race: kernel: block drbd6: FIXME going to queue 'set_n_write from StartingSync' but 'write from resync_finished' still pending? Fix this by changing the order of the drbd_queue_work() and the wake_up() in dec_ap_pending(), and adding the additional drbd_flush_workqueue() before requesting the full sync. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: fix potential access after freeLars Ellenberg2012-07-241-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Occasionally, if we disconnect, we triggered this assert: block drbd7: ASSERT FAILED tl_hash[27] == c30b0f04, expected NULL hlist_del() happens only on master bio completion. We used to wait for pending IO to complete before freeing tl_hash on disconnect. We no longer do so, since we learned to "freeze" IO on disconnect. If the local disk is too slow, we may reach C_STANDALONE early, and there are still some requests pending locally when we call drbd_free_tl_hash(). If we now free the tl_hash, and later the local IO completion completes the master bio, which then does hlist_del() and clobbers freed memory. Do hlist_del_init() and hlist_add_fake() before kfree(tl_hash), so the hlist_del() on master bio completion is harmless. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: call local-io-error handler earlyLars Ellenberg2012-07-241-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case we want to hard-reset from the local-io-error handler, we need to call it before notifying the peer or aborting local IO. Otherwise the peer will advance its data generation UUIDs even if secondary. This way, local io error looks like a "regular" node crash, which reduces the number of different failure cases. This may be useful in a bigger picture where crashed or otherwise "misbehaving" nodes are automatically re-deployed. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: do not reset rs_pending_cnt too earlyLars Ellenberg2012-07-243-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix asserts like block drbd0: in got_BlockAck:4634: rs_pending_cnt = -35 < 0 ! We reset the resync lru cache and related information (rs_pending_cnt), once we successfully finished a resync or online verify, or if the replication connection is lost. We also need to reset it if a resync or online verify is aborted because a lower level disk failed. In that case the replication link is still established, and we may still have packets queued in the network buffers which want to touch rs_pending_cnt. We do not have any synchronization mechanism to know for sure when all such pending resync related packets have been drained. To avoid this counter to go negative (and violate the ASSERT that it will always be >= 0), just do not reset it when we lose a disk. It is good enough to make sure it is re-initialized before the next resync can start: reset it when we re-attach a disk. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: reset congestion information before reporting it in /proc/drbdLars Ellenberg2012-07-241-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cache the congestion status in mdev->congestion_reason whenever drbd_congested() was called. Reset this cached info before reporting it when reading /proc/drbd. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: report congestion if we are waiting for some userland callbackLars Ellenberg2012-07-243-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the drbd worker thread is synchronously waiting for some userland callback, we don't want some casual pageout to block on us. Have drbd_congested() report congestion in that case. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: differentiate between normal and forced detachLars Ellenberg2012-07-247-15/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Aborting local requests (not waiting for completion from the lower level disk) is dangerous: if the master bio has been completed to upper layers, data pages may be re-used for other things already. If local IO is still pending and later completes, this may cause crashes or corrupt unrelated data. Only abort local IO if explicitly requested. Intended use case is a lower level device that turned into a tarpit, not completing io requests, not even doing error completion. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| | * | drbd: cleanup, remove two unused global flagsLars Ellenberg2012-07-241-2/+0
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
| * | | Merge branch 'upstream' of ↵Jens Axboe2012-07-241-1/+20
| |\ \ \ | | | |/ | | |/| | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/floppy into for-3.6/drivers
| | * | floppy: Run floppy initialization asynchronousAndi Kleen2012-07-041-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | floppy_init is quite slow, 3s on my test system to determine that there is no floppy. Run it asynchronous to the other init calls to improve boot time. [jkosina@suse.cz: fix modular build] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | | Merge branch 'for-3.6/core' of git://git.kernel.dk/linux-blockLinus Torvalds2012-08-0121-301/+530
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block IO bits from Jens Axboe: "The most complicated part if this is the request allocation rework by Tejun, which has been queued up for a long time and has been in for-next ditto as well. There are a few commits from yesterday and today, mostly trivial and obvious fixes. So I'm pretty confident that it is sound. It's also smaller than usual." * 'for-3.6/core' of git://git.kernel.dk/linux-block: block: remove dead func declaration block: add partition resize function to blkpg ioctl block: uninitialized ioc->nr_tasks triggers WARN_ON block: do not artificially constrain max_sectors for stacking drivers blkcg: implement per-blkg request allocation block: prepare for multiple request_lists block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv blkcg: inline bio_blkcg() and friends block: allocate io_context upfront block: refactor get_request[_wait]() block: drop custom queue draining used by scsi_transport_{iscsi|fc} mempool: add @gfp_mask to mempool_create_node() blkcg: make root blkcg allocation use %GFP_KERNEL blkcg: __blkg_lookup_create() doesn't need radix preload
| * | | | block: remove dead func declarationYuanhan Liu2012-08-011-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __generic_unplug_device() function is removed with commit 7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the declaration at meantime. Here remove it. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block: add partition resize function to blkpg ioctlVivek Goyal2012-08-015-9/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that allows altering the size of an existing partition, even if it is currently in use. This patch converts hd_struct->nr_sects into sequence counter because One might extend a partition while IO is happening to it and update of nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This can lead to issues like reading inconsistent size of a partition. Sequence counter have been used so that readers don't have to take bdev mutex lock as we call sector_in_part() very frequently. Now all the access to hd_struct->nr_sects should happen using sequence counter read/update helper functions part_nr_sects_read/part_nr_sects_write. There is one exception though, set_capacity()/get_capacity(). I think theoritically race should exist there too but this patch does not modify set_capacity()/get_capacity() due to sheer number of call sites and I am afraid that change might break something. I have left that as a TODO item. We can handle it later if need be. This patch does not introduce any new races as such w.r.t set_capacity()/get_capacity(). v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Phillip Susi <psusi@ubuntu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block: uninitialized ioc->nr_tasks triggers WARN_ONOlof Johansson2012-08-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hi, I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the below warning as of 3.5-rc1 and later (3.4 is fine): [ 10.886893] ------------[ cut here ]------------ [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560() [ 10.886905] Hardware name: Bochs [ 10.886906] Modules linked in: [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27 [ 10.886908] Call Trace: [ 10.886911] [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0 [ 10.886912] [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20 [ 10.886913] [<ffffffff8107c088>] copy_process+0x1488/0x1560 [ 10.886914] [<ffffffff8107c244>] do_fork+0xb4/0x340 [ 10.886918] [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50 [ 10.886919] [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80 [ 10.886920] [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60 [ 10.886923] [<ffffffff81051db3>] sys_clone+0x23/0x30 [ 10.886925] [<ffffffff8179bd73>] stub_clone+0x13/0x20 [ 10.886927] [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b [ 10.886928] ---[ end trace 32a14af7ee6a590b ]--- Reproducing is easy, I can hit it on a KVM system with a very basic config (x86_64 make defconfig + enable the drivers needed). To hit it, just install dump (on debian/ubuntu, not sure what the package might be called on Fedora), and: dump -o -f /tmp/foo / You'll see the warning in dmesg once it forks off the I/O process and starts dumping filesystem contents. I bisected it down to the following commit: commit f6e8d01bee036460e03bd4f6a79d014f98ba712e Author: Tejun Heo <tj@kernel.org> Date: Mon Mar 5 13:15:26 2012 -0800 block: add io_context->active_ref Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> It seems like the init of ioc->nr_tasks was removed in that patch, so it starts out at 0 instead of 1. Tejun, is the right thing here to add back the init, or should something else be done? The below patch removes the warning, but I haven't done any more extensive testing on it. Signed-off-by: Olof Johansson <olof@lixom.net> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block: do not artificially constrain max_sectors for stacking driversMike Snitzer2012-08-011-2/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_set_stacking_limits is intended to allow stacking drivers to build up the limits of the stacked device based on the underlying devices' limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024) doesn't allow the stacking driver to inherit a max_sectors larger than 1024 -- due to blk_stack_limits' use of min_not_zero. It is now clear that this artificial limit is getting in the way so change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows stacking drivers like dm-multipath to inherit 'max_sectors' from the underlying paths). Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com> Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blkcg: implement per-blkg request allocationTejun Heo2012-06-266-30/+216
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: prepare for multiple request_listsTejun Heo2012-06-254-46/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Request allocation is about to be made per-blkg meaning that there'll be multiple request lists. * Make queue full state per request_list. blk_*queue_full() functions are renamed to blk_*rl_full() and takes @rl instead of @q. * Rename blk_init_free_list() to blk_init_rl() and make it take @rl instead of @q. Also add @gfp_mask parameter. * Add blk_exit_rl() instead of destroying rl directly from blk_release_queue(). * Add request_list->q and make request alloc/free functions - blk_free_request(), [__]freed_request(), __get_request() - take @rl instead of @q. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvprivTejun Heo2012-06-252-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated to use q->nr_rqs[] instead of q->rq.count[]. These counters separates queue-wide request statistics from the request list and allow implementation of per-queue request allocation. While at it, properly indent fields of struct request_list. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blkcg: inline bio_blkcg() and friendsTejun Heo2012-06-252-25/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make bio_blkcg() and friends inline. They all are very simple and used only in few places. This patch is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: allocate io_context upfrontTejun Heo2012-06-252-30/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block layer very lazy allocation of ioc. It waits until the moment ioc is absolutely necessary; unfortunately, that time could be inside queue lock and __get_request() performs unlock - try alloc - retry dancing. Just allocate it up-front on entry to block layer. We're not saving the rain forest by deferring it to the last possible moment and complicating things unnecessarily. This patch is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: refactor get_request[_wait]()Tejun Heo2012-06-251-39/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there are two request allocation functions - get_request() and get_request_wait(). The former tries to allocate a request once and the latter keeps retrying until it succeeds. The latter wraps the former and keeps retrying until allocation succeeds. The combination of two functions deliver fallible non-wait allocation, fallible wait allocation and unfailing wait allocation. However, given that forward progress is guaranteed, fallible wait allocation isn't all that useful and in fact nobody uses it. This patch simplifies the interface as follows. * get_request() is renamed to __get_request() and is only used by the wrapper function. * get_request_wait() is renamed to get_request(). It now takes @gfp_mask and retries iff it contains %__GFP_WAIT. This patch doesn't introduce any functional change and is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: drop custom queue draining used by scsi_transport_{iscsi|fc}Tejun Heo2012-06-254-93/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iscsi_remove_host() uses bsg_remove_queue() which implements custom queue draining. fc_bsg_remove() open-codes mostly identical logic. The draining logic isn't correct in that blk_stop_queue() doesn't prevent new requests from being queued - it just stops processing, so nothing prevents new requests to be queued after the logic determines that the queue is drained. blk_cleanup_queue() now implements proper queue draining and these custom draining logics aren't necessary. Drop them and use bsg_unregister_queue() + blk_cleanup_queue() instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: James Smart <james.smart@emulex.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | mempool: add @gfp_mask to mempool_create_node()Tejun Heo2012-06-253-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mempool_create_node() currently assumes %GFP_KERNEL. Its only user, blk_init_free_list(), is about to be updated to use other allocation flags - add @gfp_mask argument to the function. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blkcg: make root blkcg allocation use %GFP_KERNELTejun Heo2012-06-251-16/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation from __blkg_lookup_create() for root blkcg creation. This could make policy fail unnecessarily. Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an optional @new_blkg for preallocated blkg, and blkcg_activate_policy() preload radix tree and preallocate blkg with %GFP_KERNEL before trying to create the root blkg. v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure instead of ERR_PTR() value. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blkcg: __blkg_lookup_create() doesn't need radix preloadTejun Heo2012-06-251-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no point in calling radix_tree_preload() if preloading doesn't use more permissible GFP mask. Drop preloading from __blkg_lookup_create(). While at it, drop sparse locking annotation which no longer applies. v2: Vivek pointed out the odd preload usage. Instead of updating, just drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge branch 'for-next' of git://neil.brown.name/mdLinus Torvalds2012-08-019-219/+426
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md updates from NeilBrown. * 'for-next' of git://neil.brown.name/md: DM RAID: Add support for MD RAID10 md/RAID1: Add missing case for attempting to repair known bad blocks. md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. md/raid1: don't abort a resync on the first badblock. md: remove duplicated test on ->openers when calling do_md_stop() raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer md/raid1: prevent merging too large request md/raid1: read balance chooses idlest disk for SSD md/raid1: make sequential read detection per disk based MD RAID10: Export md_raid10_congested MD: Move macros from raid1*.h to raid1*.c MD RAID1: rename mirror_info structure MD RAID10: rename mirror_info structure MD RAID10: Fix compiler warning. raid5: add a per-stripe lock raid5: remove unnecessary bitmap write optimization raid5: lockless access raid5 overrided bi_phys_segments raid5: reduce chance release_stripe() taking device_lock
| * | | | DM RAID: Add support for MD RAID10Jonathan Brassow2012-08-012-5/+116
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support the MD RAID10 personality through dm-raid.c Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | Merge commit 'c039c332f23e794deb6d6f37b9f07ff3b27fb2cf' into mdNeilBrown2012-08-014433-104467/+180036
| |\ \ \ \ | | | | | | | | | | | | | | | | | | Pull in pre-requisites for adding raid10 support to dm-raid.
| * | | | | md/RAID1: Add missing case for attempting to repair known bad blocks.Alexander Lyakas2012-07-311-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing resync or repair, attempt to correct bad blocks, according to WriteErrorSeen policy Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.majianpeng2012-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'sync' writes set both REQ_SYNC and REQ_NOIDLE. O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE. We currently assume that a REQ_SYNC request will not be followed by more requests and so set STRIPE_PREREAD_ACTIVE to expedite the request. This is appropriate for sync requests, but not for O_DIRECT requests. So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE rather than REQ_SYNC. This is consistent with the documented meaning of REQ_NOIDLE: __REQ_NOIDLE, /* don't anticipate more IO after this one */ Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | md/raid1: don't abort a resync on the first badblock.NeilBrown2012-07-311-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a resync of a RAID1 array with 2 devices finds a known bad block one device it will neither read from, or write to, that device for this block offset. So there will be one read_target (The other device) and zero write targets. This condition causes md/raid1 to abort the resync assuming that it has finished - without known bad blocks this would be true. When there are no write targets because of the presence of bad blocks we should only skip over the area covered by the bad block. RAID10 already gets this right, raid1 doesn't. Or didn't. As this can cause a 'sync' to abort early and appear to have succeeded it could lead to some data corruption, so it suitable for -stable. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | md: remove duplicated test on ->openers when calling do_md_stop()NeilBrown2012-07-311-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_md_stop tests mddev->openers while holding ->open_mutex, and fails if this count is too high. So callers do not need to check mddev->openers and doing so isn't very meaningful as they don't hold ->open_mutex so the number could change. So remove the unnecessary tests on mddev->openers. These are not called often enough for there to be any gain in an early test on ->open_mutex to avoid the need for a slightly more costly mutex_lock call. Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layermajianpeng2012-07-312-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because bios will merge at block-layer,so bios-error may caused by other bio which be merged into to the same request. Using this flag,it will find exactly error-sector and not do redundant operation like re-write and re-read. V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block layer. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | md/raid1: prevent merging too large requestShaohua Li2012-07-312-7/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For SSD, if request size exceeds specific value (optimal io size), request size isn't important for bandwidth. In such condition, if making request size bigger will cause some disks idle, the total throughput will actually drop. A good example is doing a readahead in a two-disk raid1 setup. So when should we split big requests? We absolutly don't want to split big request to very small requests. Even in SSD, big request transfer is more efficient. This patch only considers request with size above optimal io size. If all disks are busy, is it worth doing a split? Say optimal io size is 16k, two requests 32k and two disks. We can let each disk run one 32k request, or split the requests to 4 16k requests and each disk runs two. It's hard to say which case is better, depending on hardware. So only consider case where there are idle disks. For readahead, split is always better in this case. And in my test, below patch can improve > 30% thoughput. Hmm, not 100%, because disk isn't 100% busy. Such case can happen not just in readahead, for example, in directio. But I suppose directio usually will have bigger IO depth and make all disks busy, so I ignored it. Note: if the raid uses any hard disk, we don't prevent merging. That will make performace worse. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | md/raid1: read balance chooses idlest disk for SSDShaohua Li2012-07-311-3/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SSD hasn't spindle, distance between requests means nothing. And the original distance based algorithm sometimes can cause severe performance issue for SSD raid. Considering two thread groups, one accesses file A, the other access file B. The first group will access one disk and the second will access the other disk, because requests are near from one group and far between groups. In this case, read balance might keep one disk very busy but the other relative idle. For SSD, we should try best to distribute requests to as many disks as possible. There isn't spindle move penality anyway. With below patch, I can see more than 50% throughput improvement sometimes depending on workloads. The only exception is small requests can be merged to a big request which typically can drive higher throughput for SSD too. Such small requests are sequential reads. Unlike hard disk, sequential read which can't be merged (for example direct IO, or read without readahead) can be ignored for SSD. Again there is no spindle move penality. readahead dispatches small requests and such requests can be merged. Last patch can help detect sequential read well, at least if concurrent read number isn't greater than raid disk number. In that case, distance based algorithm doesn't work well too. V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for random IO too. This makes the algorithm generic for raid with SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | md/raid1: make sequential read detection per disk basedShaohua Li2012-07-312-34/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the sequential read detection is global wide. It's natural to make it per disk based, which can improve the detection for concurrent multiple sequential reads. And next patch will make SSD read balance not use distance based algorithm, where this change help detect truly sequential read for SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | MD RAID10: Export md_raid10_congestedJonathan Brassow2012-07-312-22/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md/raid10: Export is_congested test. In similar fashion to commits 11d8a6e3719519fbc0e2c9d61b6fa931b84bf813 1ed7242e591af7e233234d483f12d33818b189d9 we export the RAID10 congestion checking function so that dm-raid.c can make use of it and make use of the personality. The 'queue' and 'gendisk' structures will not be available to the MD code when device-mapper sets up the device, so we conditionalize access to these fields also. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | MD: Move macros from raid1*.h to raid1*.cJonathan Brassow2012-07-314-29/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD RAID1/RAID10: Move some macros from .h file to .c file There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in there respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other. The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | | MD RAID1: rename mirror_info structureJonathan Brassow2012-07-312-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD RAID1: Rename the structure 'mirror_info' to 'raid1_info' The same structure name ('mirror_info') is used by raid10. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. While only one of these structure names needs to change, this patch adds consistency to the naming of the structure. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
OpenPOWER on IntegriCloud