| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In commit 27a7c642174e ("partitions/efi: account for pmbr size in lba")
we started treating bad sizes in lba field of the partition that has the
0xEE (GPT protective) as errors.
However, we may run into these "bad sizes" in the real world if someone
uses dd to copy an image from a smaller disk to a bigger disk. Since
this case used to work (even without using force_gpt), keep it working
and treat the size mismatch as a warning instead of an error.
Reported-by: Josh Triplett <josh@joshtriplett.org>
Reported-by: Sean Paul <seanpaul@chromium.org>
Signed-off-by: Doug Anderson <dianders@chromium.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Recently commit bab55417b10c ("block: support embedded device command
line partition") introduced CONFIG_CMDLINE_PARSER. However, that name
is too generic and sounds like it enables/disables generic kernel boot
arg processing, when it really is block specific.
Before this option becomes a part of a full/final release, add the BLK_
prefix to it so that it is clear in absence of any other context that it
is block specific.
In addition, fix up the following less critical items:
- help text was not really at all helpful.
- index file for Documentation was not updated
- add the new arg to Documentation/kernel-parameters.txt
- clarify wording in source comments
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Cai Zhiyong <caizhiyong@huawei.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull block IO fixes from Jens Axboe:
"After merge window, no new stuff this time only a collection of neatly
confined and simple fixes"
* 'for-3.12/core' of git://git.kernel.dk/linux-block:
cfq: explicitly use 64bit divide operation for 64bit arguments
block: Add nr_bios to block_rq_remap tracepoint
If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up.
bio-integrity: Fix use of bs->bio_integrity_pool after free
blkcg: relocate root_blkg setting and clearing
block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
block: trace all devices plug operation
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
'samples' is 64bit operant, but do_div() second parameter is 32.
do_div silently truncates high 32 bits and calculated result
is invalid.
In case if low 32bit of 'samples' are zeros then do_div() produces
kernel crash.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This leaves bios setup on the request, because the caller assumes when
the blk_execute_rq_nowait/blk_execute_rq call has completed that
the rq->bios have been cleaned up.
This patch has blk_execute_rq_nowait use __blk_end_request_all
to free bios and also call rq->end_io.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Hello, Jens.
The original thread can be read from
http://thread.gmane.org/gmane.linux.kernel.cgroups/8937
While it leads to oops, given that it only triggers under specific
configurations which aren't common. I don't think it's necessary to
backport it through -stable and merging it during the coming merge
window should be enough.
Thanks!
----- 8< -----
Currently, q->root_blkg and q->root_rl.blkg are set from
blkcg_activate_policy() and cleared from blkg_destroy_all(). This
doesn't necessarily coincide with the lifetime of the root blkcg_gq
leading to the following oops when blkcg is enabled but no policy is
activated because __blk_queue_next_rl() malfunctions expecting the
root_blkg pointers to be set.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90
PGD 60f7a9067 PUD 60f4c9067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
gsmi: Log Shutdown Reason 0x03
Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d
CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19
Hardware name: Intel XXX
task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000
RIP: 0010:[<ffffffff810c58cb>] [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90
RSP: 0000:ffff88060d171818 EFLAGS: 00010096
RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60
RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
FS: 00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0
Stack:
0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60
0000000000000082 0000000000000003 0000000000000000 0000000000000000
ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8
Call Trace:
[<ffffffff810c7848>] __wake_up+0x48/0x70
[<ffffffff8131da53>] __blk_drain_queue+0x123/0x190
[<ffffffff8131dbb5>] blk_cleanup_queue+0xf5/0x210
[<ffffffff8141877a>] __scsi_remove_device+0x5a/0xd0
[<ffffffff81418824>] scsi_remove_device+0x34/0x50
[<ffffffff814189cb>] scsi_remove_target+0x16b/0x220
[<ffffffff814210f1>] __iscsi_unbind_session+0xd1/0x1b0
[<ffffffff814212b2>] iscsi_remove_session+0xe2/0x1c0
[<ffffffff814213a6>] iscsi_destroy_session+0x16/0x60
[<ffffffff81423a59>] iscsi_session_teardown+0xd9/0x100
[<ffffffff8142b75a>] iscsi_sw_tcp_session_destroy+0x5a/0xb0
[<ffffffff81420948>] iscsi_if_rx+0x10e8/0x1560
[<ffffffff81573335>] netlink_unicast+0x145/0x200
[<ffffffff815736f3>] netlink_sendmsg+0x303/0x410
[<ffffffff81528196>] sock_sendmsg+0xa6/0xd0
[<ffffffff815294bc>] ___sys_sendmsg+0x38c/0x3a0
[<ffffffff811ea840>] ? fget_light+0x40/0x160
[<ffffffff811ea899>] ? fget_light+0x99/0x160
[<ffffffff811ea840>] ? fget_light+0x40/0x160
[<ffffffff8152bc79>] __sys_sendmsg+0x49/0x90
[<ffffffff8152bcd2>] SyS_sendmsg+0x12/0x20
[<ffffffff815fb642>] system_call_fastpath+0x16/0x1b
Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 <4c> 8b 2a 48 8d 42 e8 49
Fix it by moving r->root_blkg and q->root_rl.blkg setting to
blkg_create() and clearing to blkg_destroy() so that they area
initialized when a root blkg is created and cleared when destroyed.
Reported-and-tested-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| | |
Use the helper function instead of __GFP_ZERO.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In func blk_queue_bio, if list of plug is empty,it will call
blk_trace_plug.
If process deal with a single device,it't ok.But if process deal with
multi devices,it only trace the first device.
Using request_count to judge, it can soleve this problem.
In addition, i modify the comment.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Matt found that commit 27a7c642174e ("partitions/efi: account for pmbr
size in lba") caused his GPT formatted eMMC device not to boot. The
reason is that this commit enforced Linux to always check the lesser of
the whole disk or 2Tib for the pMBR size in LBA. While most disk
partitioning tools out there create a pMBR with these characteristics,
Microsoft does not, as it always sets the entry to the maximum 32-bit
limitation - even though a drive may be smaller than that[1].
Loosen this check and only verify that the size is either the whole disk
or 0xFFFFFFFF. No tool in its right mind would set it to any value
other than these.
[1] http://thestarman.pcministry.com/asm/mbr/GPT.htm#GPTPT
Reported-and-tested-by: Matt Porter <matt.porter@linaro.org>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.
We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt. Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.
Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes. However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed. To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Karel Zak <kzak@redhat.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Trivial coding style cleanups - still plenty left.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Karel Zak <kzak@redhat.com>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
I love emacs, but these settings for coding style are annoying when trying
to open the efi.h file. More important, we already have checkpatch for
that.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Karel Zak <kzak@redhat.com>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When verifying GPT header integrity, make sure that first usable LBA is
smaller than last usable LBA.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Karel Zak <kzak@redhat.com>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The partition that has the 0xEE (GPT protective), must have the size in
lba field set to the lesser of the size of the disk minus one or
0xFFFFFFFF for larger disks.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Karel Zak <kzak@redhat.com>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
One of the biggest problems with GPT is compatibility with older, non-GPT
systems. The problem is addressed by creating hybrid mbrs, an extension,
or variant, of the traditional protective mbr. This contains, apart from
the 0xEE partition, up three additional primary partitions that point to
the same space marked by up to three GPT partitions. The result is that
legacy OSs can see the three required MBR partitions and at the same time
ignore the GPT-aware partitions that protect the GPT structures.
While hybrid MBRs are hacks, workarounds and simply not part of the GPT
standard, they do exist and we have no way around them. For instance, by
default, OSX creates a hybrid scheme when using multi-OS booting.
In order for Linux to properly discover protective MBRs, it must be made
aware of devices that have hybrid MBRs. No functionality is changed by
this patch, just a debug message informing the user of the MBR scheme that
is being used.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Karel Zak <kzak@redhat.com>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When detecting a valid protective MBR, the Linux kernel isn't picky about
the partition (1-4) the 0xEE is at, but, unlike other operating systems,
it does require it to begin at the second sector (sector 1). This check,
apart from it not being enforced by UEFI, and causing Linux to potentially
fail to detect any *valid* partitions on the disk, can present problems
when dealing with hybrid MBRs[1].
For compatibility reasons, if the first partition is hybridized, the 0xEE
partition must be small enough to ensure that it only protects the GPT
data structures - as opposed to the the whole disk in a protective MBR.
This problem is very well described by Rod Smith[1]: where MBR-only
partitioning programs (such as older versions of fdisk) can see some of
the disk space as unallocated, thus loosing the purpose of the 0xEE
partition's protection of GPT data structures.
By dropping this check, this patch enables Linux to be more flexible when
probing for GPT disklabels.
[1] http://www.rodsbooks.com/gdisk/hybrid.html#reactions
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Karel Zak <kzak@redhat.com>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Per the UEFI Specs 2.4, June 2013, the starting lba of the partition that
has the EFI GPT (0xEE) must be set to 0x00000001 - this is obviously the
LBA of the GPT Partition Header.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Karel Zak <kzak@redhat.com>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The kernel's GPT implementation currently uses the generic 'struct
partition' type for dealing with legacy MBR partition records. While this
is is useful for disklabels that we designed for CHS addressing, such as
msdos, it doesn't adapt well to newer standards that use LBA instead, such
as GUID partition tables. Furthermore, these generic partition structures
do not have all the required fields to properly follow the UEFI specs.
While a CHS address can be translated to LBA, it's much simpler and
cleaner to just replace the partition type. This patch adds a new
'gpt_record' type that is fully compliant with EFI and will allow, in the
next patches, to add more checks to properly verify a protective MBR,
which is paramount to probing a device that makes use of GPT.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Karel Zak <kzak@redhat.com>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
I found the following pattern that leads in to interesting findings:
grep -r "ret.*|=.*__put_user" *
grep -r "ret.*|=.*__get_user" *
grep -r "ret.*|=.*__copy" *
The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat,
since those appear in compat code, we could probably expect the kernel
addresses not to be reachable in the lower 32-bit range, so I think they
might not be exploitable.
For the "__get_user" cases, I don't think those are exploitable: the worse
that can happen is that the kernel will copy kernel memory into in-kernel
buffers, and will fail immediately afterward.
The alpha csum_partial_copy_from_user() seems to be missing the
access_ok() check entirely. The fix is inspired from x86. This could
lead to information leak on alpha. I also noticed that many architectures
map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
wonder if the latter is performing the access checks on every
architectures.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Read block device partition table from command line. The partition used
for fixed block device (eMMC) embedded device. It is no MBR, save
storage space. Bootloader can be easily accessed by absolute address of
data on the block device. Users can easily change the partition.
This code reference MTD partition, source "drivers/mtd/cmdlinepart.c"
About the partition verbose reference
"Documentation/block/cmdline-partition.txt"
[akpm@linux-foundation.org: fix printk text]
[yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()]
Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com>
Cc: Karel Zak <kzak@redhat.com>
Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com>
Cc: Marius Groeger <mag@sysgo.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The usage of strict_strtoul() is not preferred, because strict_strtoul()
is obsolete. Thus, kstrtoul() should be used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"A lot of activities on the cgroup front. Most changes aren't visible
to userland at all at this point and are laying foundation for the
planned unified hierarchy.
- The biggest change is decoupling the lifetime management of css
(cgroup_subsys_state) from that of cgroup's. Because controllers
(cpu, memory, block and so on) will need to be dynamically enabled
and disabled, css which is the association point between a cgroup
and a controller may come and go dynamically across the lifetime of
a cgroup. Till now, css's were created when the associated cgroup
was created and stayed till the cgroup got destroyed.
Assumptions around this tight coupling permeated through cgroup
core and controllers. These assumptions are gradually removed,
which consists bulk of patches, and css destruction path is
completely decoupled from cgroup destruction path. Note that
decoupling of creation path is relatively easy on top of these
changes and the patchset is pending for the next window.
- cgroup has its own event mechanism cgroup.event_control, which is
only used by memcg. It is overly complex trying to achieve high
flexibility whose benefits seem dubious at best. Going forward,
new events will simply generate file modified event and the
existing mechanism is being made specific to memcg. This pull
request contains prepatory patches for such change.
- Various fixes and cleanups"
Fixed up conflict in kernel/cgroup.c as per Tejun.
* 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
cgroup: fix cgroup_css() invocation in css_from_id()
cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
cgroup: implement CFTYPE_NO_PREFIX
cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
cgroup: fix cgroup_write_event_control()
cgroup: fix subsystem file accesses on the root cgroup
cgroup: change cgroup_from_id() to css_from_id()
cgroup: use css_get() in cgroup_create() to check CSS_ROOT
cpuset: remove an unncessary forward declaration
cgroup: RCU protect each cgroup_subsys_state release
cgroup: move subsys file removal to kill_css()
cgroup: factor out kill_css()
cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
cgroup: replace cgroup->css_kill_cnt with ->nr_css
cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
cgroup: move cgroup->subsys[] assignment to online_css()
cgroup: reorganize css init / exit paths
cgroup: add __rcu modifier to cgroup->subsys[]
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
the iteration
Previously, all css descendant iterators didn't include the origin
(root of subtree) css in the iteration. The reasons were maintaining
consistency with css_for_each_child() and that at the time of
introduction more use cases needed skipping the origin anyway;
however, given that css_is_descendant() considers self to be a
descendant, omitting the origin css has become more confusing and
looking at the accumulated use cases rather clearly indicates that
including origin would result in simpler code overall.
While this is a change which can easily lead to subtle bugs, cgroup
API including the iterators has recently gone through major
restructuring and no out-of-tree changes will be applicable without
adjustments making this a relatively acceptable opportunity for this
type of change.
The conversions are mostly straight-forward. If the iteration block
had explicit origin handling before or after, it's moved inside the
iteration. If not, if (pos == origin) continue; is added. Some
conversions add extra reference get/put around origin handling by
consolidating origin handling and the rest. While the extra ref
operations aren't strictly necessary, this shouldn't cause any
noticeable difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
cgroup is in the process of converting to css (cgroup_subsys_state)
from cgroup as the principal subsystem interface handle. This is
mostly to prepare for the unified hierarchy support where css's will
be created and destroyed dynamically but also helps cleaning up
subsystem implementations as css is usually what they are interested
in anyway.
cgroup_taskset which is used by the subsystem attach methods is the
last cgroup subsystem API which isn't using css as the handle. Update
cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.
The conversions are pretty mechanical. One exception is
cpuset::cgroup_cs(), which lost its last user and got removed.
This patch shouldn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
cgroup is currently in the process of transitioning to using css
(cgroup_subsys_state) as the primary handle instead of cgroup in
subsystem API. For hierarchy iterators, this is beneficial because
* In most cases, css is the only thing subsystems care about anyway.
* On the planned unified hierarchy, iterations for different
subsystems will need to skip over different subtrees of the
hierarchy depending on which subsystems are enabled on each cgroup.
Passing around css makes it unnecessary to explicitly specify the
subsystem in question as css is intersection between cgroup and
subsystem
* For the planned unified hierarchy, css's would need to be created
and destroyed dynamically independent from cgroup hierarchy. Having
cgroup core manage css iteration makes enforcing deref rules a lot
easier.
Most subsystem conversions are straight-forward. Noteworthy changes
are
* blkio: cgroup_to_blkcg() is no longer used. Removed.
* freezer: cgroup_freezer() is no longer used. Removed.
* devices: cgroup_to_devcgroup() is no longer used. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup.
Please see the previous commit which converts the subsystem methods
for rationale.
This patch converts all cftype file operations to take @css instead of
@cgroup. cftypes for the cgroup core files don't have their subsytem
pointer set. These will automatically use the dummy_css added by the
previous patch and can be converted the same way.
Most subsystem conversions are straight forwards but there are some
interesting ones.
* freezer: update_if_frozen() is also converted to take @css instead
of @cgroup for consistency. This will make the code look simpler
too once iterators are converted to use css.
* memory/vmpressure: mem_cgroup_from_css() needs to be exported to
vmpressure while mem_cgroup_from_cont() can be made static.
Updated accordingly.
* cpu: cgroup_tg() doesn't have any user left. Removed.
* cpuacct: cgroup_ca() doesn't have any user left. Removed.
* hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
Removed.
* net_cls: cgrp_cls_state() doesn't have any user left. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
cgroup is transitioning to using css (cgroup_subsys_state) instead of
cgroup as the primary subsystem handle. The cgroupfs file interface
will be converted to use css's which requires finding out the
subsystem from cftype so that the matching css can be determined from
the cgroup.
This patch adds cftype->ss which points to the subsystem the file
belongs to. The field is initialized while a cftype is being
registered. This makes it unnecessary to explicitly specify the
subsystem for other cftype handling functions. @ss argument dropped
from various cftype handling functions.
This patch shouldn't introduce any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup *
in subsystem implementations for the following reasons.
* With unified hierarchy, subsystems will be dynamically bound and
unbound from cgroups and thus css's (cgroup_subsys_state) may be
created and destroyed dynamically over the lifetime of a cgroup,
which is different from the current state where all css's are
allocated and destroyed together with the associated cgroup. This
in turn means that cgroup_css() should be synchronized and may
return NULL, making it more cumbersome to use.
* Differing levels of per-subsystem granularity in the unified
hierarchy means that the task and descendant iterators should behave
differently depending on the specific subsystem the iteration is
being performed for.
* In majority of the cases, subsystems only care about its part in the
cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
often obtain the matching css pointer from the cgroup and don't
bother with the cgroup pointer itself. Passing around css fits
much better.
This patch converts all cgroup_subsys methods to take @css instead of
@cgroup. The conversions are mostly straight-forward. A few
noteworthy changes are
* ->css_alloc() now takes css of the parent cgroup rather than the
pointer to the new cgroup as the css for the new cgroup doesn't
exist yet. Knowing the parent css is enough for all the existing
subsystems.
* In kernel/cgroup.c::offline_css(), unnecessary open coded css
dereference is replaced with local variable access.
This patch shouldn't cause any behavior differences.
v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
with local variable @css as suggested by Li Zefan.
Rebased on top of new for-3.12 which includes for-3.11-fixes so
that ->css_free() invocation added by da0a12caff ("cgroup: fix a
leak when percpu_ref_init() fails") is converted too. Suggested
by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Currently, controllers have to explicitly follow the cgroup hierarchy
to find the parent of a given css. cgroup is moving towards using
cgroup_subsys_state as the main controller interface construct, so
let's provide a way to climb the hierarchy using just csses.
This patch implements css_parent() which, given a css, returns its
parent. The function is guarnateed to valid non-NULL parent css as
long as the target css is not at the top of the hierarchy.
freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
are converted to use css_parent() instead of accessing cgroup->parent
directly.
* __parent_ca() is dropped from cpuacct and its usage is replaced with
parent_ca(). The only difference between the two was NULL test on
cgroup->parent which is now embedded in css_parent() making the
distinction moot. Note that eventually a css->parent field will be
added to css and the NULL check in css_parent() will go away.
This patch shouldn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
css (cgroup_subsys_state) is usually embedded in a subsys specific
data structure. Subsystems either use container_of() directly to cast
from css to such data structure or has an accessor function wrapping
such cast. As cgroup as whole is moving towards using css as the main
interface handle, add and update such accessors to ease dealing with
css's.
All accessors explicitly handle NULL input and return NULL in those
cases. While this looks like an extra branch in the code, as all
controllers specific data structures have css as the first field, the
casting doesn't involve any offsetting and the compiler can trivially
optimize out the branch.
* blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
accessor. Added.
* memory, hugetlb and devices already had one but didn't explicitly
handle NULL input. Updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The names of the two struct cgroup_subsys_state accessors -
cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
The former clashes with the type name and the latter doesn't even
indicate it's somehow related to cgroup.
We're about to revamp large portion of cgroup API, so, let's rename
them so that they're less awkward. Most per-controller usages of the
accessors are localized in accessor wrappers and given the amount of
scheduled changes, this isn't gonna add any noticeable headache.
Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
to task_css(). This patch is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When a medium error is detected the SCSI stack should return
ENODATA to the upper layers.
[jejb: fix whitespace error]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
| |/
|/|
| |
| |
| |
| |
| |
| | |
When the thin provisioning hard threshold is reached we
should return ENOSPC to inform upper layers about this fact.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
This removes all the drivers/block uses of the __cpuinit macros
from all C files.
[1] https://lkml.org/lkml/2013/5/20/589
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull core block IO updates from Jens Axboe:
"Here are the core IO block bits for 3.11. It contains:
- A tweak to the reserved tag logic from Jan, for weirdo devices with
just 3 free tags. But for those it improves things substantially
for random writes.
- Periodic writeback fix from Jan. Marked for stable as well.
- Fix for a race condition in IO scheduler switching from Jianpeng.
- The hierarchical blk-cgroup support from Tejun. This is the grunt
of the series.
- blk-throttle fix from Vivek.
Just a note that I'm in the middle of a relocation, whole family is
flying out tomorrow. Hence I will be awal the remainder of this week,
but back at work again on Monday the 15th. CC'ing Tejun, since any
potential "surprises" will most likely be from the blk-cgroup work.
But it's been brewing for a while and sitting in my tree and
linux-next for a long time, so should be solid."
* 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
elevator: Fix a race in elevator switching
block: Reserve only one queue tag for sync IO if only 3 tags are available
writeback: Fix periodic writeback after fs mount
blk-throttle: implement proper hierarchy support
blk-throttle: implement throtl_grp->has_rules[]
blk-throttle: Account for child group's start time in parent while bio climbs up
blk-throttle: add throtl_qnode for dispatch fairness
blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
blk-throttle: make blk_throtl_bio() ready for hierarchy
blk-throttle: make blk_throtl_drain() ready for hierarchy
blk-throttle: dispatch from throtl_pending_timer_fn()
blk-throttle: implement dispatch looping
blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
blk-throttle: add throtl_service_queue->parent_sq
blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
blk-throttle: move bio_lists[] and friends to throtl_service_queue
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There's a race between elevator switching and normal io operation.
Because the allocation of struct elevator_queue and struct elevator_data
don't in a atomic operation.So there are have chance to use NULL
->elevator_data.
For example:
Thread A: Thread B
blk_queu_bio elevator_switch
spin_lock_irq(q->queue_block) elevator_alloc
elv_merge elevator_init_fn
Because call elevator_alloc, it can't hold queue_lock and the
->elevator_data is NULL.So at the same time, threadA call elv_merge and
nedd some info of elevator_data.So the crash happened.
Move the elevator_alloc into func elevator_init_fn, it make the
operations in a atomic operation.
Using the follow method can easy reproduce this bug
1:dd if=/dev/sdb of=/dev/null
2:while true;do echo noop > scheduler;echo deadline > scheduler;done
The test method also use this method.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In case a device has three tags available we still reserve two of them
for sync IO. That leaves only a single tag for async IO such as
writeback from flusher thread which results in poor performance.
Allow async IO to consume two tags in case queue has three tag availabe
to get a decent async write performance.
This patch improves streaming write performance on a machine with such disk
from ~21 MB/s to ~52 MB/s. Also postmark throughput in presence of
streaming writer improves from 8 to 12 transactions per second so sync
IO doesn't seem to be harmed in presence of heavy async writer.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
With the recent updates, blk-throttle is finally ready for proper
hierarchy support. Dispatching now honors service_queue->parent_sq
and propagates correctly. The only thing missing is setting
->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
hierarchy.
This patch updates throtl_pd_init() such that service_queues form the
same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
As this concludes proper hierarchy support for blkcg, the shameful
.broken_hierarchy tag is removed from blkio_subsys.
v2: Updated blkio-controller.txt as suggested by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
blk_throtl_bio() has a quick exit path for throtl_grps without limits
configured. It looks at the bps and iops limits and if both are not
configured, the bio is issued immediately. While this is correct in
the current flat hierarchy as each throtl_grp behaves completely
independently, it would become wrong in proper hierarchy mode. A
group without any limits could still be limited by one of its
ancestors and bio's queued for such group should not bypass
blk-throtl.
As having a quick bypass mechanism is beneficial, this patch
reimplements the mechanism such that it's correct even with proper
hierarchy. throtl_grp->has_rules[] is added. These booleans are
updated for the whole subtree whenever a config is updated so that
has_rules[] of the whole subtree stays synchronized. They're also
updated when a new throtl_grp comes online so that it can't escape the
limits of its ancestors.
As no throtl_grp has another throtl_grp as parent now, this patch
doesn't yet make any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
With the planned proper hierarchy support, a bio will climb up the
tree before actually being dispatched. This makes sure bio is also
subjected to parent's throttling limits, if any.
It might happen that parent is idle and when bio is transferred to
parent, a new slice starts fresh. But that is incorrect as parents
wait time should have started when bio was queued in child group and
causes IOs to be throttled more than configured as they climb the
hierarchy.
Given the fact that we have not written hierarchical algorithm in a
way where child's and parents time slices are synchronized, we
transfer the child's start time to parent if parent was idling. If
parent was busy doing dispatch of other bios all this while, this is
not an issue.
Child's slice start time is passed to parent. Parent looks at its
last expired slice start time. If child's start time is after parents
old start time, that means parent had been idle and after parent
went idle, child had an IO queued. So use child's start time as
parent start time.
If parent's start time is after child's start time, that means,
when IO got queued in child group, parent was not idle. But later
it dispatched some IO, its slice got trimmed and then it went idle.
After a while child's request got shifted in parent group. In this
case use parent's old start time as new start time as that's the
duration of slice we did not use.
This logic is far from perfect as if there are multiple childs
then first child transferring the bio decides the start time while
a bio might have queued up even earlier in other child, which is
yet to be transferred up to parent. In that case we will lose
time and bandwidth in parent. This patch is just an approximation
to make situation somewhat better.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
With flat hierarchy, there's only single level of dispatching
happening and fairness beyond that point is the responsibility of the
rest of the block layer and driver, which usually works out okay;
however, with the planned hierarchy support,
service_queue->bio_lists[] can be filled up by bios from a single
source. While the limits would still be honored, it'd be very easy to
starve IOs from siblings or children.
To avoid such starvation, this patch implements throtl_qnode and
converts service_queue->bio_lists[] to lists of per-source qnodes
which in turn contains the bio's. For example, when a bio is
dispatched from a child group, the bio doesn't get queued on
->bio_lists[] directly but it first gets queued on the group's qnode
which in turn gets queued on service_queue->queued[]. When
dispatching for the upper level, the ->queued[] list is consumed in
round-robing order so that the dispatch windows is consumed fairly by
all IO sources.
There are two ways a bio can come to a throtl_grp - directly queued to
the group or dispatched from a child. For the former
throtl_grp->qnode_on_self[rw] is used. For the latter, the child's
->qnode_on_parent[rw].
Note that this means that the child which is contributing a bio to its
parent should stay pinned until all its bios are dispatched to its
grand-parent. This patch moves blkg refcnting from bio add/remove
spots to qnode activation/deactivation so that the blkg containing an
active qnode is always pinned. As child pins the parent, this is
sufficient for keeping the relevant sub-tree pinned while bios are in
flight.
The starvation issue was spotted by Vivek Goyal.
v2: The original patch used the same throtl_grp->qnode_on_self/parent
for reads and writes causing RWs to be queued incorrectly if there
already are outstanding IOs in the other direction. They should
be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
can use different qnodes. Spotted by Vivek Goyal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
throtl_pending_timer_fn() currently assumes that the parent_sq is the
top level one and the bio's dispatched are ready to be issued;
however, this assumption will be wrong with proper hierarchy support.
This patch makes the following changes to make
throtl_pending_timer_fn() ready for hiearchy.
* If the parent_sq isn't the top-level one, update the parent
throtl_grp's dispatch time and schedule the next dispatch as
necessary. If the parent's dispatch time is now, repeat the
function for the parent throtl_grp.
* If the parent_sq is the top-level one, kick issue work_item as
before.
* The debug message printed by throtl_log() now prints out the
service_queue's nr_queued[] instead of the total nr_queued as the
latter becomes uninteresting and misleading with hierarchical
dispatch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
tg_dispatch_one_bio() currently assumes that the parent_sq is the top
level one and the bio being dispatched is ready to be issued; however,
this assumption will be wrong with proper hierarchy support. This
patch makes the following changes to make tg_dispatch_on_bio() ready
for hiearchy.
* throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
transfer a bio from a child tg to its parent.
* tg_dispatch_one_bio() is updated to distinguish whether its parent
is another throtl_grp or the throtl_data. If former, the bio is
transferred to the parent throtl_grp using throtl_add_bio_tg(). If
latter, the bio is ready to be issued and put on the top-level
service_queue's bio_lists[] and throtl_data->nr_queued is
decremented.
As all throtl_grps currently have the top level service_queue as their
->parent_sq, this patch in itself doesn't make any behavior
difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, blk_throtl_bio() issues the passed in bio directly if it's
within limits of its associated tg (throtl_grp). This behavior
becomes incorrect with hierarchy support as the bio should be
accounted to and throttled by the ancestor throtl_grps too.
This patch makes the direct issue path of blk_throtl_bio() to loop
until it reaches the top-level service_queue or gets throttled. If
the former, the bio can be issued directly; otherwise, it gets queued
at the first layer it was above limits.
As tg->parent_sq is always the top-level service queue currently, this
patch in itself doesn't make any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The current blk_throtl_drain() assumes that all active throtl_grps are
queued on throtl_data->service_queue, which won't be true once
hierarchy support is implemented.
This patch makes blk_throtl_drain() perform post-order walk of the
blkg hierarchy draining each associated throtl_grp, which guarantees
that all bios will eventually be pushed to the top-level service_queue
in throtl_data.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, blk_throtl_dispatch_work_fn() is responsible for both
dispatching bio's from throtl_grp's according to their limits and then
issuing the dispatched bios.
This patch moves the dispatch part to throtl_pending_timer_fn() so
that the work item is kicked iff there are bio's to issue. This is to
avoid work item execution at each step when hierarchy support is
enabled. bio's will be dispatched towards the top-level service_queue
from the timers at each layer and the work item will only be used to
issue the bio's which reached the top-level service_queue.
While fetching bio's to issue from bio_lists[],
blk_throtl_dispatch_work_fn() fetches all READs before WRITEs. While
the original code also dispatched READs first, if multiple throtl_grps
are dispatched on the same run, WRITEs from throtl_grp which is
dispatched first would precede READs from throtl_grps which are
dispatched later. While this is a behavior change, given that the
previous code already prioritized READs and block layer generally
prioritizes and segregates READs from WRITEs, this isn't likely to
make any noticeable differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
throtl_select_dispatch() only dispatches throtl_quantum bios on each
invocation. blk_throtl_dispatch_work_fn() in turn depends on
throtl_schedule_next_dispatch() scheduling the next dispatch window
immediately so that undue delays aren't incurred. This effectively
chains multiple dispatch work item executions back-to-back when there
are more than throtl_quantum bios to dispatch on a given tick.
There is no reason to finish the current work item just to repeat it
immediately. This patch makes throtl_schedule_next_dispatch() return
%false without doing anything if the current dispatch window is still
open and updates blk_throtl_dispatch_work_fn() repeat dispatching
after cpu_relax() on %false return.
This change will help implementing hierarchy support as dispatching
will be done from pending_timer and immediate reschedule of timer
function isn't supported and doesn't make much sense.
While this patch changes how dispatch behaves when there are more than
throtl_quantum bios to dispatch on a single tick, the behavior change
is immaterial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
throtl_data->dispatch_work
Currently, throtl_data->dispatch_work is a delayed_work item which
handles both delayed dispatch and issuing bios. The two tasks will be
separated to support proper hierarchy. To prepare for that, this
patch separates out the timer into throtl_service_queue->pending_timer
from throtl_data->dispatch_work and make the latter a work_struct.
* As the timer is now per-service_queue, it's initialized and
del_sync'd as its corresponding service_queue is created and
destroyed. The timer, when triggered, simply schedules
throtl_data->dispathc_work for execution.
* throtl_schedule_delayed_work() is renamed to
throtl_schedule_pending_timer() and takes @sq and @expires now.
* Simiarly, throtl_schedule_next_dispatch() now takes @sq, which
should be the parent_sq of the service_queue which just got a new
bio or updated. As the parent_sq is always the top-level
service_queue now, this doesn't change anything at this point.
This patch doesn't introduce any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
update with it
With proper hierarchy support, a bio can be dispatched multiple times
until it reaches the top-level service_queue and we don't want to
update dispatch stats at each step. They are local stats and will be
kept local. If recursive stats are necessary, they should be
implemented separately and definitely not by updating counters
recursively on each dispatch.
This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gate
stats update with it so that dispatch stats are updated only on the
first time the bio is charged to a throtl_grp, which will always be
the throtl_grp the bio was originally queued to.
This means that REQ_THROTTLED would be set even for bios which don't
get throttled. As we don't want bios to leave blk-throtl with the
flag set, move REQ_THROTLLED clearing to the end of blk_throtl_bio()
and clear if the bio is being issued directly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|