| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
cic is association between io_context and request_queue. A cic is
linked from both ioc and q and should be destroyed when either one
goes away. As ioc and q both have their own locks, locking becomes a
bit complex - both orders work for removal from one but not from the
other.
Currently, cfq tries to circumvent this locking order issue with RCU.
ioc->lock nests inside queue_lock but the radix tree and cic's are
also protected by RCU allowing either side to walk their lists without
grabbing lock.
This rather unconventional use of RCU quickly devolves into extremely
fragile convolution. e.g. The following is from cfqd going away too
soon after ioc and q exits raced.
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
[ 88.503444]
Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
RIP: 0010:[<ffffffff81397628>] [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
...
Call Trace:
[<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
[<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
[<ffffffff81389130>] exit_io_context+0x100/0x140
[<ffffffff81098a29>] do_exit+0x579/0x850
[<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
[<ffffffff81098de7>] sys_exit_group+0x17/0x20
[<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
The only real hot path here is cic lookup during request
initialization and avoiding extra locking requires very confined use
of RCU. This patch makes cic removal from both ioc and request_queue
perform double-locking and unlink immediately.
* From q side, the change is almost trivial as ioc->lock nests inside
queue_lock. It just needs to grab each ioc->lock as it walks
cic_list and unlink it.
* From ioc side, it's a bit more difficult because of inversed lock
order. ioc needs its lock to walk its cic_list but can't grab the
matching queue_lock and needs to perform unlock-relock dancing.
Unlinking is now wholly done from put_io_context() and fast path is
optimized by using the queue_lock the caller already holds, which is
by far the most common case. If the ioc accessed multiple devices,
it tries with trylock. In unlikely cases of fast path failure, it
falls back to full double-locking dance from workqueue.
Double-locking isn't the prettiest thing in the world but it's *far*
simpler and more understandable than RCU trick without adding any
meaningful overhead.
This still leaves a lot of now unnecessary RCU logics. Future patches
will trim them.
-v2: Vivek pointed out that cic->q was being dereferenced after
cic->release() was called. Updated to use local variable @this_q
instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* cfq_cic_lookup() may be called without queue_lock and multiple tasks
can execute it simultaneously for the same shared ioc. Nothing
prevents them racing each other and trying to drop the same dead cic
entry multiple times.
* smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing
prevents cfq_cic_lookup() seeing stale cic->key. This usually
doesn't blow up because by the time cic is exited, all requests have
been drained and new requests are terminated before going through
elevator. However, it can still be triggered by plug merge path
which doesn't grab queue_lock and thus can't check DEAD state
reliably.
This patch updates lookup locking such that,
* Lookup is always performed under queue_lock. This doesn't add any
more locking. The only issue is cfq_allow_merge() which can be
called from plug merge path without holding any lock. For now, this
is worked around by using cic of the request to merge into, which is
guaranteed to have the same ioc. For longer term, I think it would
be best to separate out plug merge method from regular one.
* Spurious ioc->lock locking around cic lookup hint assignment
dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
cfq_get_io_context() would fail if multiple tasks race to insert cic's
for the same association. This patch restructures
cfq_get_io_context() such that slow path insertion race is handled
properly.
Note that the restructuring also makes cfq_get_io_context() called
under queue_lock and performs both ioc and cfqd insertions while
holding both ioc and queue locks. This is part of on-going locking
tightening and will be used to simplify synchronization rules.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
ioprio/cgroup change was handled by marking the changed state in ioc
and, on the following access to the ioc, performing RCU-protected
iteration through all cic's grabbing the matching queue_lock.
This patch moves the changed state to each cic. When ioprio or cgroup
changes, the respective bit is set on all cic's of the ioc and when
each of those cic (not ioc) is accessed, change is applied for that
specific ioc-queue pair.
This also fixes the following two race conditions between setting and
clearing of changed states.
* Missing barrier between assign/load of ioprio and ioprio_changed
allowed applying old ioprio.
* Change requests could happen between application of change and
clearing of changed variables.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Make the following changes to prepare for ioc/cic management cleanup.
* Add cic->q so that ioc can determine the associated queue without
querying cfq. This will eventually replace ->key.
* Factor out cfq_release_cic() from cic_free_func(). This function
assumes that the caller handled locking.
* Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
take only @cic.
* Restructure cfq_cic_link() for future updates.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* blk_get_queue() is peculiar in that it returns 0 on success and 1 on
failure instead of 0 / -errno or boolean. Update it such that it
returns %true on success and %false on failure.
* Make sure the caller checks for the return value.
* Separate out __blk_get_queue() which doesn't check whether @q is
dead and put it in blk.h. This will be used later.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Ignoring copy_io() during fork, io_context can be allocated from two
places - current_io_context() and set_task_ioprio(). The former is
always called from local task while the latter can be called from
different task. The synchornization between them are peculiar and
dubious.
* current_io_context() doesn't grab task_lock() and assumes that if it
saw %NULL ->io_context, it would stay that way until allocation and
assignment is complete. It has smp_wmb() between alloc/init and
assignment.
* set_task_ioprio() grabs task_lock() for assignment and does
smp_read_barrier_depends() between "ioc = task->io_context" and "if
(ioc)". Unfortunately, this doesn't achieve anything - the latter
is not a dependent load of the former. ie, if ioc itself were being
dereferenced "ioc->xxx", it would mean something (not sure what tho)
but as the code currently stands, the dependent read barrier is
noop.
As only one of the the two test-assignment sequences is task_lock()
protected, the task_lock() can't do much about race between the two.
Nothing prevents current_io_context() and set_task_ioprio() allocating
its own ioc for the same task and overwriting the other's.
Also, set_task_ioprio() can race with exiting task and create a new
ioc after exit_io_context() is finished.
ioc get/put doesn't have any reason to be complex. The only hot path
is accessing the existing ioc of %current, which is simple to achieve
given that ->io_context is never destroyed as long as the task is
alive. All other paths can happily go through task_lock() like all
other task sub structures without impacting anything.
This patch updates ioc get/put so that it becomes more conventional.
* alloc_io_context() is replaced with get_task_io_context(). This is
the only interface which can acquire access to ioc of another task.
On return, the caller has an explicit reference to the object which
should be put using put_io_context() afterwards.
* The functionality of current_io_context() remains the same but when
creating a new ioc, it shares the code path with
get_task_io_context() and always goes through task_lock().
* get_io_context() now means incrementing ref on an ioc which the
caller already has access to (be that an explicit refcnt or implicit
%current one).
* PF_EXITING inhibits creation of new io_context and once
exit_io_context() is finished, it's guaranteed that both ioc
acquisition functions return %NULL.
* All users are updated. Most are trivial but
smp_read_barrier_depends() removal from cfq_get_io_context() needs a
bit of explanation. I suppose the original intention was to ensure
ioc->ioprio is visible when set_task_ioprio() allocates new
io_context and installs it; however, this wouldn't have worked
because set_task_ioprio() doesn't have wmb between init and install.
There are other problems with this which will be fixed in another
patch.
* While at it, use NUMA_NO_NODE instead of -1 for wildcard node
specification.
-v2: Vivek spotted contamination from debug patch. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* int return from put_io_context() wasn't used by anybody. Make it
return void like other put functions and docbook-fy the function
comment.
* Reorder dummy declarations for !CONFIG_BLOCK case a bit.
* Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of
if block and drop 0'ing.
* Docbook-fy current_io_context() comment.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
cfq allocates per-queue id using ida and uses it to index cic radix
tree from io_context. Move it to q->id and allocate on queue init and
free on queue release. This simplifies cfq a bit and will allow for
further improvements of io context life-cycle management.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
blk_insert_cloned_request(), blk_execute_rq_nowait() and
blk_flush_plug_list() either didn't check whether the queue was dead
or did it without holding queue_lock. Update them so that dead state
is checked while holding queue_lock.
AFAICS, this plugs all holes (requeue doesn't matter as the request is
transitioning atomically from in_flight to queued).
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When trying to drain all requests, blk_drain_queue() checked only
q->rq.count[]; however, this only tracks REQ_ALLOCED requests. This
patch updates blk_drain_queue() such that it looks at all the counters
and queues so that request_queue is actually empty on completion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
macro and use it.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The only user left for blk_insert_request() is sx8 and it can be
trivially switched to use blk_execute_rq_nowait() - special requests
aren't included in io stat and sx8 doesn't use block layer tagging.
Switch sx8 and kill blk_insert_requeset().
This patch doesn't introduce any functional difference.
Only compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
will pass the command to the underlying block device. This is
well-known, but it is also a large security problem when (via Unix
permissions, ACLs, SELinux or a combination thereof) a program or user
needs to be granted access only to part of the disk.
This patch lets partitions forward a small set of harmless ioctls;
others are logged with printk so that we can see which ioctls are
actually sent. In my tests only CDROM_GET_CAPABILITY actually occurred.
Of course it was being sent to a (partition on a) hard disk, so it would
have failed with ENOTTY and the patch isn't changing anything in
practice. Still, I'm treating it specially to avoid spamming the logs.
In principle, this restriction should include programs running with
CAP_SYS_RAWIO. If for example I let a program access /dev/sda2 and
/dev/sdb, it still should not be able to read/write outside the
boundaries of /dev/sda2 independent of the capabilities. However, for
now programs with CAP_SYS_RAWIO will still be allowed to send the
ioctls. Their actions will still be logged.
This patch does not affect the non-libata IDE driver. That driver
however already tests for bd != bd->bd_contains before issuing some
ioctl; it could be restricted further to forbid these ioctls even for
programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.
Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Make it also print the command name when warning - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Introduce a wrapper around scsi_cmd_ioctl that takes a block device.
The function will then be enhanced to detect partition block devices
and, in that case, subject the ioctls to whitelisting.
Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cgroup: fix to allow mounting a hierarchy by name
cgroup: move assignement out of condition in cgroup_attach_proc()
cgroup: Remove task_lock() from cgroup_post_fork()
cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
cgroup: only need to check oldcgrp==newgrp once
cgroup: remove redundant get/put of task struct
cgroup: remove redundant get/put of old css_set from migrate
cgroup: Remove unnecessary task_lock before fetching css_set on migration
cgroup: Drop task_lock(parent) on cgroup_fork()
cgroups: remove redundant get/put of css_set from css_set_check_fetched()
resource cgroups: remove bogus cast
cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
cgroup, cpuset: don't use ss->pre_attach()
cgroup: don't use subsys->can_attach_task() or ->attach_task()
cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
cgroup: improve old cgroup handling in cgroup_attach_proc()
cgroup: always lock threadgroup during migration
threadgroup: extend threadgroup_lock() to cover exit and exec
threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
...
Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
fix a css_set not found bug in cgroup_attach_proc" that already
mentioned that the bug is fixed (differently) in Tejun's cgroup
patchset. This one, in other words.
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations. Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead. Most converions are straight-forward.
Noteworthy changes are,
* In cgroup_freezer, remove unnecessary NULL assignments to unused
methods. It's useless and very prone to get out of sync, which
already happened.
* In cpuset, PF_THREAD_BOUND test is checked for each task. This
doesn't make any practical difference but is conceptually cleaner.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
|
| | \ | |
| | \ | |
| |\ \ \ |
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.
Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| | | | |
| | | | |
| | | | |
| | | | | |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| | | | |
| | | | |
| | | | |
| | | | | |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| | |/ /
| |/| |
| | | |
| | | | |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | | |
both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We're doing some odd things there, which already messes up various users
(see the net/socket.c code that this removes), and it was going to add
yet more crud to the block layer because of the incorrect error code
translation.
ENOIOCTLCMD is not an error return that should be returned to user mode
from the "ioctl()" system call, but it should *not* be translated as
EINVAL ("Invalid argument"). It should be translated as ENOTTY
("Inappropriate ioctl for device").
That EINVAL confusion has apparently so permeated some code that the
block layer actually checks for it, which is sad. We continue to do so
for now, but add a big comment about how wrong that is, and we should
remove it entirely eventually. In the meantime, this tries to keep the
changes localized to just the EINVAL -> ENOTTY fix, and removing code
that makes it harder to do the right thing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 5e081591 "block: warn if tag is greater than real_max_depth"
cleaned up blk_queue_end_tag() to warn when the tag is truly invalid
(greater than real_max_depth). However, it changed behavior in the tag <
max_depth case to not end the request. Leading to triggering of
BUG_ON(blk_queued_rq(rq)) in the request completion path:
http://marc.info/?l=linux-kernel&m=132204370518629&w=2
In order to allow blk_queue_resize_tags() to shrink the tag space
blk_queue_end_tag() must always complete tags with a value less than
real_max_depth regardless of the current max_depth. The comment about
"handling the shrink case" seems to be what prompted changes in this
space, so remove it and BUG on all invalid tags (made even simpler by
Matthew's suggestion to use an unsigned compare).
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Reported-by: Meelis Roos <mroos@ut.ee>
Reported-by: Ed Nadolski <edmund.nadolski@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| | |
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
All requests of a queue could be merged to other requests of other queue.
Such queue will not have request in it, but it's in service tree. This
will cause kernel oops.
I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the
issue should exist without the patch.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
While probing, fd sets up queue, probes hardware and tears down the
queue if probing fails. In the process, blk_drain_queue() kicks the
queue which failed to finish initialization and fd is unhappy about
that.
floppy0: no floppy controllers found
------------[ cut here ]------------
WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
Hardware name: To Be Filled By O.E.M.
VFS: do_fd_request called on non-open device
Modules linked in:
Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2
Call Trace:
[<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0
[<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50
[<ffffffff813d657f>] do_fd_request+0xbf/0xd0
[<ffffffff81322b95>] blk_drain_queue+0x65/0x80
[<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0
[<ffffffff818a809d>] floppy_init+0xdeb/0xe28
[<ffffffff818a72b2>] ? daring+0x6b/0x6b
[<ffffffff810002af>] do_one_initcall+0x3f/0x170
[<ffffffff81884b34>] kernel_init+0x9d/0x11e
[<ffffffff810317c2>] ? schedule_tail+0x22/0xa0
[<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10
[<ffffffff81884a97>] ? start_kernel+0x2be/0x2be
[<ffffffff815dbb10>] ? gs_change+0xb/0xb
Avoid it by making blk_drain_queue() kick queue iff dispatch queue has
something on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Sergei Trofimovich <slyich@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
cfq_cic_link() has race condition. When some processes which shared ioc
issue I/O to same block device simultaneously, cfq_cic_link() returns -EEXIST
sometimes. The race condition might stop I/O by following steps:
step 1: Process A: Issue an I/O to /dev/sda
step 2: Process A: Get an ioc (iocA here) in get_io_context() which does not
linked with a cic for the device
step 3: Process A: Get a new cic for the device (cicA here) in
cfq_alloc_io_context()
step 4: Process B: Issue an I/O to /dev/sda
step 5: Process B: Get iocA in get_io_context() since process A and B share the
same ioc
step 6: Process B: Get a new cic for the device (cicB here) in
cfq_alloc_io_context() since iocA has not been linked with a
cic for the device yet
step 7: Process A: Link cicA to iocA in cfq_cic_link()
step 8: Process A: Dispatch I/O to driver and finish it
step 9: Process B: Try to link cicB to iocA in cfq_cic_link()
But it fails with showing "cfq: cic link failed!" kernel
message, since iocA has already linked with cicA at step 7.
step 10: Process B: Wait for finishig I/O in get_request_wait()
The function does not wake up, when there is no I/O to the
device.
When cfq_cic_link() returns -EEXIST, it means ioc has already linked with cic.
So when cfq_cic_link() return -EEXIST, retry cfq_cic_lookup().
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If we fail allocating the blkpg stats, we free cfqd and cfgq.
But we need to free the IDA cfqd->cic_index as well.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
struct request_queue is allocated with __GFP_ZERO so its "node" field is
zero before initialization. This causes an oops if node 0 is offline in
the page allocator because its zonelists are not initialized. From Dave
Young's dmesg:
SRAT: Node 1 PXM 2 0-d0000000
SRAT: Node 1 PXM 2 100000000-330000000
SRAT: Node 0 PXM 1 330000000-630000000
Initmem setup node 1 0000000000000000-000000000affb000
...
Built 1 zonelists in Node order, mobility grouping on.
...
BUG: unable to handle kernel paging request at 0000000000001c08
IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870
and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
zonelist->_zonerefs.
The fix is to initialize q->node at the time of allocation so the correct
node is passed to the slab allocator later.
Since blk_init_allocated_queue_node() is no longer needed, merge it with
blk_init_allocated_queue().
[rientjes@google.com: changelog, initializing q->node]
Cc: stable@vger.kernel.org [2.6.37+]
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Tested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
| |
After flush plug list, the list has no request, so we need to add a
trace_block_plug().
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
| |
get_request_wait() could sleep and flush the plug list. If the list is
already flushed, don't flush again.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
| |
Even after commit 5478755616ae2ef1ce144dded589b62b2a50d575
("block: check for proper length of iov entries earlier ...")
we still won't check for zero-length entries after an unaligned
entry. Remove the break-statement, so all entries are checked.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit a72c5e5eb738033938ab30d6a634b74d1d060f10.
The commit introduced alias for block devices which is intended to be
used during logging although actual usage hasn't been committed yet.
This approach adds very limited benefit (raw log might be easier to
follow) which can be trivially implemented in userland but has a lot
of problems.
It is much worse than netif renames because it doesn't rename the
actual device but just adds conveninence name which isn't used
universally or enforced. Everything internal including device lookup
and sysfs still uses the internal name and nothing prevents two
devices from using conflicting alias - ie. sda can have sdb as its
alias.
This has been nacked by people working on device driver core, block
layer and kernel-userland interface and shouldn't have been
upstreamed. Revert it.
http://thread.gmane.org/gmane.linux.kernel/1155104
http://thread.gmane.org/gmane.linux.scsi/68632
http://thread.gmane.org/gmane.linux.scsi/69776
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nao Nishijima <nao.nishijima.xt@hitachi.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
|
| |
| |
| |
| |
| |
| |
| |
| | |
This file isn't using full modular functionality, and hence
can be "downgraded" to just using the export.h header.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
virtio-blk: use ida to allocate disk index
hpsa: add small delay when using PCI Power Management to reset for kump
cciss: add small delay when using PCI Power Management to reset for kump
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
drivers/block/loop.c: emit uevent on auto release
drivers/block/cpqarray.c: use pci_dev->revision
loop: always allow userspace partitions and optionally support automatic scanning
...
Fic up trivial header file includsion conflict in drivers/block/loop.c
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There are cases where suppressing partition scan is useful - e.g. for
lo devices and pseudo SATA devices which advertise to be a disk but
get upset on partition scan (some port multiplier control devices show
such behavior).
This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
regardless of the number of possible partitions. disk_partitionable()
is renamed to disk_part_scan_enabled() as suppressing partition scan
doesn't imply the device can't be partitioned using
BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
directly tests disk_max_parts() to maintain backward-compatibility.
-v2: Updated to make it clear that only partition scan is suppressed
not partitioning itself as suggested by Kay Sievers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|\ \ \
| |_|/
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...
Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
blk_cleanup_queue() may be called before elevator is set up on a
queue which triggers the following oops.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
...
Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
Bochs Bochs
RIP: 0010:[<ffffffff8125a69c>] [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
...
Call Trace:
[<ffffffff8125da92>] blk_drain_queue+0x42/0x70
[<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0
[<ffffffff81469640>] md_free+0x50/0x70
[<ffffffff8126f43b>] kobject_release+0x8b/0x1d0
[<ffffffff81270d56>] kref_put+0x36/0xa0
[<ffffffff8126f2b7>] kobject_put+0x27/0x60
[<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40
[<ffffffff81083450>] process_one_work+0x100/0x3b0
[<ffffffff8108527f>] worker_thread+0x15f/0x3a0
[<ffffffff81089937>] kthread+0x87/0x90
[<ffffffff81621834>] kernel_thread_helper+0x4/0x10
Fix it by making blk_cleanup_queue() check whether q->elevator is set
up before invoking blk_drain_queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We can't use the latter if !CONFIG_LOCKDEP.
Reported-by: Sedat Dilek <sedat.dilek@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
blkcg->policy_list is protected by blkcg->lock. Its not rcu protected
list. So even for readers, they need to take blkcg->lock. There are
few functions which were reading the list without taking lock. Fix it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If a rule is being deleted, free up associated policy node. Otherwise
that memory is leaked.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In case tag depth is reduced, it is max_depth not real_max_depth.
So we should allow a request with tag >= max_depth, but for a
tag >= real_max_depth, there really should be some problem.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |\ \ |
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The following command sequence triggers an oops.
# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
...
Call Trace:
[<ffffffff810d2845>] lock_acquire+0x95/0x140
[<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
[<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
[<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
[<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
[<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
[<ffffffff811c40df>] blkdev_put+0x5f/0x190
[<ffffffff8118f18d>] kill_block_super+0x4d/0x80
[<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
[<ffffffff8119003a>] deactivate_super+0x4a/0x70
[<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
[<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
[<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
This is because bdev holds on to disk but disk doesn't pin the
associated queue. If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.
Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.
Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
A dm-multipath user reported[1] a problem when trying to boot
a kernel with commit 4853abaae7e4a2af938115ce9071ef8684fb7af4
(block: fix flush machinery for stacking drivers with differring
flush flags) applied. It turns out that an empty flush request
can be sent into blk_insert_flush. When the BUG_ON was fixed
to allow for this, I/O on the underlying device would stall. The
reason is that blk_insert_cloned_request does not kick the queue.
In the aforementioned commit, I had added a special case to
kick the queue if data was sent down but the queue flags did
not require a flush. A better solution is to push the queue
kick up into blk_insert_cloned_request.
This patch, along with a follow-on which fixes the BUG_ON, fixes
the issue reported.
[1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html
Reported-by: Christophe Saout <christophe@saout.de>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Stable note: 3.1
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|