| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
With the machinery in place to throttle and unthrottle entities, as well as
handle their participation (or lack there of) we can now enable throttling.
There are 2 points that we must check whether it's time to set throttled state:
put_prev_entity() and enqueue_entity().
- put_prev_entity() is the typical throttle path, we reach it by exceeding our
allocated run-time within update_curr()->account_cfs_rq_runtime() and going
through a reschedule.
- enqueue_entity() covers the case of a wake-up into an already throttled
group. In this case we know the group cannot be on_rq and can throttle
immediately. Checks are added at time of put_prev_entity() and
enqueue_entity()
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Throttled tasks are invisisble to cpu-offline since they are not eligible for
selection by pick_next_task(). The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq, however this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.
Resolve this by unthrottling offline cpus so that threads can be migrated.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Buddies allow us to select "on-rq" entities without actually selecting them
from a cfs_rq's rb_tree. As a result we must ensure that throttled entities
are not falsely nominated as buddies. The fact that entities are dequeued
within throttle_entity is not sufficient for clearing buddy status as the
nomination may occur after throttling.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
From the perspective of load-balance and shares distribution, throttled
entities should be invisible.
However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present. In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.
Instead, track hierarchicaal throttled state at time of transition. This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.
Also, when an entity leaves a throttled hierarchy we need to advance its
time averaging for shares averaging so that the elapsed throttled time is not
considered as part of the cfs_rq's operation.
We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Extend walk_tg_tree to accept a positional argument
static int walk_tg_tree_from(struct task_group *from,
tg_visitor down, tg_visitor up, void *data)
Existing semantics are preserved, caller must hold rcu_lock() or sufficient
analogue.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
At the start of each period we refresh the global bandwidth pool. At this time
we must also unthrottle any cfs_rq entities who are now within bandwidth once
more (as quota permits).
Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
and their entities re-enqueued.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where this is no
run-time remaining.
Throttled entities are dequeued to prevent scheduling, additionally we mark
them as throttled (using cfs_rq->throttled) to prevent them from becoming
re-enqueued until they are unthrottled. A list of a task_group's throttled
entities are maintained on the cfs_bandwidth structure.
Note: While the machinery for throttling is added in this patch the act of
throttling an entity exceeding its bandwidth is deferred until later within
the series.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Since quota is managed using a global state but consumed on a per-cpu basis
we need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is state (from a previous period) should not be
locally consumable.
We take advantage of existing sched_clock synchronization about the jiffy to
efficiently detect whether we have (globally) crossed a quota boundary above.
One catch is that the direction of spread on sched_clock is undefined,
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.
Fortunately we can differentiate these by considering whether the
global deadline has advanced. If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.
Since the RT pool is using a similar timer there's some small refactoring to
share this support.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Account bandwidth usage on the cfs_rq level versus the task_groups to which
they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
under cfs_rq->runtime_enabled.
cfs_rq's which belong to a bandwidth constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as desired. Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local runtime is protected by rq->lock.
This patch only assigns and tracks quota, no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Add constraints validation for CFS bandwidth hierarchies.
Validate that:
max(child bandwidth) <= parent_bandwidth
In a quota limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
This constraint is chosen over sum(child_bandwidth) as notion of over-commit is
valuable within SCHED_OTHER. Some basic code from the RT case is re-factored
for reuse.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth, and locally claimed bandwidth.
- The global bandwidth is per task_group, it represents a pool of unclaimed
bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per-cfs_rq, this represents allotments from
the global pool bandwidth assigned to a specific cpu.
Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
to consume over period above.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promoting the responsibility for maintaining rq->nr_running to the
scheduling classes.
The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root visible changes in rq->nr_running across activate
and de-activate operations. This in turn leads to incorrect idle and
weight-per-task load balance decisions.
This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.
Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Since [sched/cpupri: Remove the vec->lock], member pri_active
of struct cpupri is not needed any more, just remove it. Also
clean stuff related to it.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
[ This patch actually compiles. Thanks to Mike Galbraith for pointing
that out. I compiled and booted this patch with no issues. ]
Re-examining the cpupri patch, I see there's a possible race because the
update of the two priorities vec->counts are not protected by a memory
barrier.
When a RT runqueue is overloaded and wants to push an RT task to another
runqueue, it scans the RT priority vectors in a loop from lowest
priority to highest.
When we queue or dequeue an RT task that changes a runqueue's highest
priority task, we update the vectors to show that a runqueue is rated at
a different priority. To do this, we first set the new priority mask,
and increment the vec->count, and then set the old priority mask by
decrementing the vec->count.
If we are lowering the runqueue's RT priority rating, it will trigger a
RT pull, and we do not care if we miss pushing to this runqueue or not.
But if we raise the priority, but the priority is still lower than an RT
task that is looking to be pushed, we must make sure that this runqueue
is still seen by the push algorithm (the loop).
Because the loop reads from lowest to highest, and the new priority is
set before the old one is cleared, we will either see the new or old
priority set and the vector will be checked.
But! Since there's no memory barrier between the updates of the two, the
old count may be decremented first before the new count is incremented.
This means the loop may see the old count of zero and skip it, and also
the new count of zero before it was updated. A possible runqueue that
the RT task could move to could be missed.
A conditional memory barrier is placed between the vec->count updates
and is only called when both updates are done.
The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(),
as they are not needed by archs that already synchronize
atomic_inc/dec().
The smp_rmb() has been moved to be called at every iteration of the loop
so that the race between seeing the two updates is visible by each
iteration of the loop, as an arch is free to optimize the reading of
memory of the counters in the loop.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1312547269.18583.194.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
sched/cpupri: Remove the vec->lock
The cpupri vec->lock has been showing up as a top contention
lately. This is because of the RT push/pull logic takes an
agressive approach for migrating RT tasks. The cpupri logic is
in place to improve the performance of the push/pull when dealing
with large number CPU machines.
The problem though is a vec->lock is required, where a vec is a
global per RT priority structure. That is, if there are lots of
RT tasks at the same priority, every time they are added or removed
from the RT queue, this global vec->lock is taken. Now that more
kernel threads are becoming RT (RCU boost and threaded interrupts)
this is becoming much more of an issue.
There are two variables that are being synced by the vec->lock.
The cpupri bitmask, and the vec->counter. The cpupri bitmask
is one bit per priority. If a RT priority vec has a process queued,
then the vec->count is > 0 and the cpupri bitmask is set for that
RT priority.
If the cpupri bitmask gets out of sync with the vec->counter, we could
end up pushing a low proirity RT task to a high priority queue.
That RT task that could have run immediately could be queued on a
run queue with a higher priority task indefinitely.
The solution is not to use the cpupri bitmask and just look at the
vec->count directly when doing a pull. The cpupri bitmask is just
a fast way to scan the RT priorities when a pull is made. Instead
of using the bitmask, and just examine all RT priorities, and
look at the vec->counts, we could eliminate the vec->lock. The
scan of RT tasks is to find a run queue that we can push an RT task
to, and we do not push to a high priority queue, thus the scan only
needs to go from 1 to RT task->prio, and not all 100 RT priorities.
The push algorithm, which does the scan of RT priorities (and
scan of the bitmask) only happens when we have an overloaded RT run
queue (more than one RT task queued). The grabbing of the vec->lock
happens every time any RT task is queued or dequeued on the run
queue for that priority. The slowing down of the scan by not using
a bitmask is negligible by the speed up of removing the vec->lock
contention, and replacing it with an atomic counter and memory barrier.
To prove this, I wrote a patch that times both the loop and the code
that grabs the vec->locks. I passed the patches to various people
(and companies) to test and show the results. I let everyone choose
their own load to test, giving different loads on the system,
for various different setups.
Here's some of the results: (snipping to a few CPUs to not make
this change log huge, but the results were consistent across
the entire system).
System 1 (24 CPUs)
Before patch:
CPU: Name Count Max Min Average Total
---- ---- ----- --- --- ------- -----
[...]
cpu 20: loop 3057 1.766 0.061 0.642 1963.170
vec 6782949 90.469 0.089 0.414 2811760.503
cpu 21: loop 2617 1.723 0.062 0.641 1679.074
vec 6782810 90.499 0.089 0.291 1978499.900
cpu 22: loop 2212 1.863 0.063 0.699 1547.160
vec 6767244 85.685 0.089 0.435 2949676.898
cpu 23: loop 2320 2.013 0.062 0.594 1380.265
vec 6781694 87.923 0.088 0.431 2928538.224
After patch:
cpu 20: loop 2078 1.579 0.061 0.533 1108.006
vec 6164555 5.704 0.060 0.143 885185.809
cpu 21: loop 2268 1.712 0.065 0.575 1305.248
vec 6153376 5.558 0.060 0.187 1154960.469
cpu 22: loop 1542 1.639 0.095 0.533 823.249
vec 6156510 5.720 0.060 0.190 1172727.232
cpu 23: loop 1650 1.733 0.068 0.545 900.781
vec 6170784 5.533 0.060 0.167 1034287.953
All times are in microseconds. The 'loop' is the amount of time spent
doing the loop across the priorities (before patch uses bitmask).
the 'vec' is the amount of time in the code that requires grabbing
the vec->lock. The second patch just does not have the vec lock, but
encompasses the same code.
Amazingly the loop code even went down on average. The vec code went
from .5 down to .18, that's more than half the time spent!
Note, more than one test was run, but they all had the same results.
System 2 (64 CPUs)
Before patch:
CPU: Name Count Max Min Average Total
---- ---- ----- --- --- ------- -----
cpu 60: loop 0 0 0 0 0
vec 5410840 277.954 0.084 0.782 4232895.727
cpu 61: loop 0 0 0 0 0
vec 4915648 188.399 0.084 0.570 2803220.301
cpu 62: loop 0 0 0 0 0
vec 5356076 276.417 0.085 0.786 4214544.548
cpu 63: loop 0 0 0 0 0
vec 4891837 170.531 0.085 0.799 3910948.833
After patch:
cpu 60: loop 0 0 0 0 0
vec 5365118 5.080 0.021 0.063 340490.267
cpu 61: loop 0 0 0 0 0
vec 4898590 1.757 0.019 0.071 347903.615
cpu 62: loop 0 0 0 0 0
vec 5737130 3.067 0.021 0.119 687108.734
cpu 63: loop 0 0 0 0 0
vec 4903228 1.822 0.021 0.071 348506.477
The test run during the measurement did not have any (very few,
from other CPUs) RT tasks pushing. But this shows that it helped
out tremendously with the contention, as the contention happens
because the vec->lock is taken only on queuing at an RT priority,
and different CPUs that queue tasks at the same priority will
have contention.
I tested on my own 4 CPU machine with the following results:
Before patch:
CPU: Name Count Max Min Average Total
---- ---- ----- --- --- ------- -----
cpu 0: loop 2377 1.489 0.158 0.588 1398.395
vec 4484 770.146 2.301 4.396 19711.755
cpu 1: loop 2169 1.962 0.160 0.576 1250.110
vec 4425 152.769 2.297 4.030 17834.228
cpu 2: loop 2324 1.749 0.155 0.559 1299.799
vec 4368 779.632 2.325 4.665 20379.268
cpu 3: loop 2325 1.629 0.157 0.561 1306.113
vec 4650 408.782 2.394 4.348 20222.577
After patch:
CPU: Name Count Max Min Average Total
---- ---- ----- --- --- ------- -----
cpu 0: loop 2121 1.616 0.113 0.636 1349.189
vec 4303 1.151 0.225 0.421 1811.966
cpu 1: loop 2130 1.638 0.178 0.644 1372.927
vec 4627 1.379 0.235 0.428 1983.648
cpu 2: loop 2056 1.464 0.165 0.637 1310.141
vec 4471 1.311 0.217 0.433 1937.927
cpu 3: loop 2154 1.481 0.162 0.601 1295.083
vec 4236 1.253 0.230 0.425 1803.008
This was running my migrate.c code that can be found at:
http://lwn.net/Articles/425763/
The migrate code does stress the RT tasks a bit. This shows that
the loop did increase a little after the patch, but not by much.
The vec code dropped dramatically. From 4.3us down to .42us.
That's a 10x improvement!
Tested-by: Mike Galbraith <mgalbraith@suse.de>
Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
Tested-by: Matthew Hank Sabins<msabins@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chris Mason <chris.mason@oracle.com>
Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Hillf Danton proposed a patch (see link) that cleaned up the
sched_rt code that calculates the priority of the next highest priority
task to be used in finding run queues to pull from.
His patch removed the calculating of the next prio to just use the current
prio when deteriming if we should examine a run queue to pull from. The problem
with his patch was that it caused more false checks. Because we check a run
queue for pushable tasks if the current priority of that run queue is higher
in priority than the task about to run on our run queue. But after grabbing
the locks and doing the real check, we find that there may not be a task
that has a higher prio task to pull. Thus the locks were taken with nothing to
do.
I added some trace_printks() to record when and how many times the run queue
locks were taken to check for pullable tasks, compared to how many times we
pulled a task.
With the current method, it was:
3806 locks taken vs 2812 pulled tasks
With Hillf's patch:
6728 locks taken vs 2804 pulled tasks
The number of times locks were taken to pull a task went up almost double with
no more success rate.
But his patch did get me thinking. When we look at the priority of the highest
task to consider taking the locks to do a pull, a failure to pull can be one
of the following: (in order of most likely)
o RT task was pushed off already between the check and taking the lock
o Waiting RT task can not be migrated
o RT task's CPU affinity does not include the target run queue's CPU
o RT task's priority changed between the check and taking the lock
And with Hillf's patch, the thing that caused most of the failures, is
the RT task to pull was not at the right priority to pull (not greater than
the current RT task priority on the target run queue).
Most of the above cases we can't help. But the current method does not check
if the next highest prio RT task can be migrated or not, and if it can not,
we still grab the locks to do the test (we don't find out about this fact until
after we have the locks). I thought about this case, and realized that the
pushable task plist that is maintained only holds RT tasks that can migrate.
If we move the calculating of the next highest prio task from the inc/dec_rt_task()
functions into the queuing of the pushable tasks, then we only measure the
priorities of those tasks that we push, and we get this basically for free.
Not only does this patch make the code a little more efficient, it cleans it
up and makes it a little simpler.
Thanks to Hillf Danton for inspiring me on this patch.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Gregory Haskins <ghaskins@novell.com>
Link: http://lkml.kernel.org/r/BANLkTimQ67180HxCx5vgMqumqw1EkFh3qg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When a new task is woken, the code to balance the RT task is currently
skipped in the select_task_rq() call. But it will be pushed if the rq
is currently overloaded with RT tasks anyway. The issue is that we
already queued the task, and if it does get pushed, it will have to
be dequeued and requeued on the new run queue. The advantage with
pushing it first is that we avoid this requeuing as we are pushing it
off before the task is ever queued.
See commit 318e0893ce3f524 ("sched: pre-route RT tasks on wakeup")
for more details.
The return of select_task_rq() when it is not a wake up has also been
changed to return task_cpu() instead of smp_processor_id(). This is more
of a sanity because the current only other user of select_task_rq()
besides wake ups, is an exec, where task_cpu() should also be the same
as smp_processor_id(). But if it is used for other purposes, lets keep
the task on the same CPU. Why would we mant to migrate it to the current
CPU?
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hillf Danton <dhillf@gmail.com>
Link: http://lkml.kernel.org/r/20110617015919.832743148@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
There's no reason to clean the exec_start in put_prev_task_rt() as it is reset
when the task gets back to the run queue. This saves us doing a store() in the
fast path.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Link: http://lkml.kernel.org/r/BANLkTimqWD=q6YnSDi-v9y=LMWecgEzEWg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Do not call dequeue_pushable_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mike Galbraith <mgalbraith@gmx.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Link: http://lkml.kernel.org/r/1306895385.4791.26.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Checking for the validity of sd is removed, since it is already
checked by the for_each_domain macro.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When computing the next priority for a given run-queue, the check for
RT priority of the task determined by the pick_next_highest_task_rt()
function could be removed, since only RT tasks are returned by the
function.
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTimxmWiof9s5AvS3v_0X+sMiE=0x5g@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has
been handled for an RT parent gives birth to a deranged mutant child with
non-RT policy, but RT prio and sched_class.
Move PI leakage protection up, always set priorities and weight, and if the
child is leaving RT class, reset rt_priority to the proper value.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Remove the WAKEUP_PREEMPT feature, disabling it doesn't make any sense
and its outlived its use by a long long while.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhy
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|/ / / /
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Since commit a2d47777 ("sched: fix stale value in average load per task")
the variable rq->avg_load_per_task is no longer required. Remove it.
Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
check in set_user() to check for NPROC exceeding via setuid() and
similar functions.
Before the check there was a possibility to greatly exceed the allowed
number of processes by an unprivileged user if the program relied on
rlimit only. But the check created new security threat: many poorly
written programs simply don't check setuid() return code and believe it
cannot fail if executed with root privileges. So, the check is removed
in this patch because of too often privilege escalations related to
buggy programs.
The NPROC can still be enforced in the common code flow of daemons
spawning user processes. Most of daemons do fork()+setuid()+execve().
The check introduced in execve() (1) enforces the same limit as in
setuid() and (2) doesn't create similar security issues.
Neil Brown suggested to track what specific process has exceeded the
limit by setting PF_NPROC_EXCEEDED process flag. With the change only
this process would fail on execve(), and other processes' execve()
behaviour is not changed.
Solar Designer suggested to re-check whether NPROC limit is still
exceeded at the moment of execve(). If the process was sleeping for
days between set*uid() and execve(), and the NPROC counter step down
under the limit, the defered execve() failure because NPROC limit was
exceeded days ago would be unexpected. If the limit is not exceeded
anymore, we clear the flag on successful calls to execve() and fork().
The flag is also cleared on successful calls to set_user() as the limit
was exceeded for the previous user, not the current one.
Similar check was introduced in -ow patches (without the process flag).
v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf symbols: Check '/tmp/perf-' symbol file ownership
perf sched: Usage leftover from trace -> script rename
perf sched: Do not delete session object prematurely
perf tools: Check $HOME/.perfconfig ownership
perf, x86: Add model 45 SandyBridge support
perf tools: Add support to install perf python extension
perf tools: do not look at ./config for configuration
perf tools: Make clean leaves some files
perf lock: Dropping unsupported ':r' modifier
perf probe: Fix coredump introduced by probe module option
jump label: Reduce the cycle count by changing the link order
perf report: Use ui__warning in some more places
perf python: Add PERF_RECORD_{LOST,READ,SAMPLE} routine tables
perf evlist: Introduce 'disable' method
trace events: Update version number reference to new 3.x scheme for EVENT_POWER_TRACING_DEPRECATED
perf buildid-cache: Zero out buffer of filenames when adding/removing buildid
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
In the course of testing jump labels for use with the CFS
bandwidth controller, Paul Turner, discovered that using jump
labels reduced the branch count and the instruction count, but
did not reduce the cycle count or wall time.
I noticed that having the jump_label.o included in the kernel
but not used in any way still caused this increase in cycle
count and wall time. Thus, I moved jump_label.o in the
kernel/Makefile, thus changing the link order, and presumably
moving it out of hot icache areas. This brought down the cycle
count/time as expected.
In addition to Paul's testing, I've tested the patch using a
single 'static_branch()' in the getppid() path, and basically
running tight loops of calls to getppid(). Here are my results
for the branch disabled case:
With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
Performance counter stats for 'bash -c /tmp/getppid;true' (50 runs):
3,969,510,217 instructions # 0.864 IPC ( +-0.000% )
4,592,334,954 cycles ( +- 0.046% )
751,634,470 branches ( +- 0.000% )
1.722635797 seconds time elapsed ( +- 0.046% )
Jump labels turned off (CONFIG_JUMP_LABEL not set), branch
disabled:
Performance counter stats for 'bash -c /tmp/getppid;true' (50 runs):
4,009,611,846 instructions # 0.867 IPC ( +-0.000% )
4,622,210,580 cycles ( +- 0.012% )
771,662,904 branches ( +- 0.000% )
1.734341454 seconds time elapsed ( +- 0.022% )
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: rth@redhat.com
Cc: a.p.zijlstra@chello.nl
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20110805204040.GG2522@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Paul Turner <pjt@google.com>
|
| |\ \ \ \
| | | |/ /
| | |/| |
| | | | |
| | | | |
| | | | | |
Merge reason: Include most of the merge window trees, to do fixes on top.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | |/ /
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
EVENT_POWER_TRACING_DEPRECATED
What was scheduled to be 2.6.41 is now going to be 3.1 .
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1107250929370.8080@swampdragon.chaosbits.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| |/ /
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
syslog-ng versions before 3.3.0beta1 (2011-05-12) assume that
CAP_SYS_ADMIN is sufficient to access syslog, so ever since CAP_SYSLOG
was introduced (2010-11-25) they have triggered a warning.
Commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
improved matters a little by making syslog-ng work again, just keeping
the WARN_ONCE(). But still, this is a warning that writes a stack trace
we don't care about to syslog, sets a taint flag, and alarms sysadmins
when nothing worse has happened than use of an old userspace with a
recent kernel.
Convert the WARN_ONCE to a printk_once to avoid that while continuing to
give userspace developers a hint that this is an unwanted
backward-compatibility feature and won't be around forever.
Reported-by: Ralf Hildebrandt <ralf.hildebrandt@charite.de>
Reported-by: Niels <zorglub_olsen@hotmail.com>
Reported-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Liked-by: Gergely Nagy <algernon@madhouse-project.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
slab, lockdep: Annotate the locks before using them
lockdep: Clear whole lockdep_map on initialization
slab, lockdep: Annotate slab -> rcu -> debug_object -> slab
lockdep: Fix up warning
lockdep: Fix trace_hardirqs_on_caller()
futex: Fix regression with read only mappings
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
lockdep_init_map() only initializes parts of lockdep_map and triggers
kmemcheck warning when it is copied as a whole. There isn't anything
to be gained by clearing selectively. memset() the whole structure
and remove loop for ->class_cache[] clearing.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=35532
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Casteyde <casteyde.christian@free.fr>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=35532
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110714131909.GJ3455@htj.dyndns.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
On Sun, 2011-07-24 at 21:06 -0400, Arnaud Lacombe wrote:
> /src/linux/linux/kernel/lockdep.c: In function 'mark_held_locks':
> /src/linux/linux/kernel/lockdep.c:2471:31: warning: comparison of
> distinct pointer types lacks a cast
The warning is harmless in this case, but the below makes it go away.
Reported-by: Arnaud Lacombe <lacombar@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311588599.2617.56.camel@laptop
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Commit dd4e5d3ac4a ("lockdep: Fix trace_[soft,hard]irqs_[on,off]()
recursion") made a bit of a mess of the various checks and error
conditions.
In particular it moved the check for !irqs_disabled() before the
spurious enable test, resulting in some warnings.
Reported-by: Arnaud Lacombe <lacombar@gmail.com>
Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311679697.24752.28.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
| |\ \ \ |
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 fixed two problems: First, It
prevented a loop when encountering a ZERO_PAGE. Second, it fixed RW
MAP_PRIVATE futex operations by forcing the COW to occur by
unconditionally performing a write access get_user_pages_fast() to get
the page. The commit also introduced a user-mode regression in that it
broke futex operations on read-only memory maps. For example, this
breaks workloads that have one or more reader processes doing a
FUTEX_WAIT on a futex within a read only shared file mapping, and a
writer processes that has a writable mapping issuing the FUTEX_WAKE.
This fixes the regression for valid futex operations on RO mappings by
trying a RO get_user_pages_fast() when the RW get_user_pages_fast()
fails. This change makes it necessary to also check for invalid use
cases, such as anonymous RO mappings (which can never change) and the
ZERO_PAGE which the commit referenced above was written to address.
This patch does restore the original behavior with RO MAP_PRIVATE
mappings, which have inherent user-mode usage problems and don't really
make sense. With this patch performing a FUTEX_WAIT within a RO
MAP_PRIVATE mapping will be successfully woken provided another process
updates the region of the underlying mapped file. However, the mmap()
man page states that for a MAP_PRIVATE mapping:
It is unspecified whether changes made to the file after
the mmap() call are visible in the mapped region.
So user-mode users attempting to use futex operations on RO MAP_PRIVATE
mappings are depending on unspecified behavior. Additionally a
RO MAP_PRIVATE mapping could fail to wake up in the following case.
Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() return inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
COW happen. This process's memory-region-A become related
to new COWed private (ie PageAnon=1) page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() return mm based key.
IOW, we fail to wake up Thread-A.
Once again doing something like this is just silly and users who do
something like this get what they deserve.
While RO MAP_PRIVATE mappings are nonsensical, checking for a private
mapping requires walking the vmas and was deemed too costly to avoid a
userspace hang.
This Patch is based on Peter Zijlstra's initial patch with modifications to
only allow RO mappings for futex operations that need VERIFY_READ access.
Reported-by: David Oliver <david@rgmadvisors.com>
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: peterz@infradead.org
Cc: eric.dumazet@gmail.com
Cc: zvonler@rgmadvisors.com
Cc: hughd@google.com
Link: http://lkml.kernel.org/r/1309450892-30676-1-git-send-email-sbohrer@rgmadvisors.com
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The core device layer sends tons of uevent notifications for each device
it finds, and if the kernel has been built with a non-empty
CONFIG_UEVENT_HELPER_PATH that will make us try to execute the usermode
helper binary for all these events very early in the boot.
Not only won't the root filesystem even be mounted at that point, we
literally won't have necessarily even initialized all the process
handling data structures at that point, which causes no end of silly
problems even when the usermode helper doesn't actually succeed in
executing.
So just use our existing infrastructure to disable the usermodehelpers
to make the kernel start out with them disabled. We enable them when
we've at least initialized stuff a bit.
Problems related to an uninitialized
init_ipc_ns.ids[IPC_SHM_IDS].rw_mutex
reported by various people.
Reported-by: Manuel Lauss <manuel.lauss@googlemail.com>
Reported-by: Richard Weinberger <richard@nod.at>
Reported-by: Marc Zyngier <maz@misterjones.org>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When send_cpu_listeners() finds the orphaned listener it marks it as
!valid and drops listeners->sem. Before it takes this sem for writing,
s->pid can be reused and add_del_listener() can wrongly try to re-use
this entry.
Change add_del_listener() to check ->valid = T.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |/ / /
|/| | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
1. Commit 26c4caea9d69 "don't allow duplicate entries in listener mode"
changed add_del_listener(REGISTER) so that "next_cpu:" can reuse the
listener allocated for the previous cpu, this doesn't look exactly
right even if minor.
Change the code to kfree() in the already-registered case, this case
is unlikely anyway so the extra kmalloc_node() shouldn't hurt but
looke more correct and clean.
2. use the plain list_for_each_entry() instead of _safe() to scan
listeners->list.
3. Remove the unneeded INIT_LIST_HEAD(&s->list), we are going to
list_add(&s->list).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kdb,kgdb: Allow arbitrary kgdb magic knock sequences
kdb: Remove all references to DOING_KGDB2
kdb,kgdb: Implement switch and pass buffer from kdb -> gdb
kdb: cleanup unused variables missed in the original kdb merge
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The first packet that gdb sends when the kernel is in kdb mode seems
to change with every release of gdb. Instead of continuing to add
many different gdb packets, change kdb to automatically look for any
thing that looks like a gdb packet.
Example 1 cold start test:
echo g > /proc/sysrq-trigger
$D#44+
Example 2 cold start test:
echo g > /proc/sysrq-trigger
$3#33
The second one should re-enter kdb's shell right away and is purely a
test.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The DOING_KGDB2 was originally a state variable for one of the two
ways to automatically transition from kdb to kgdb. Purge all these
variables and just use one single state for the transition.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When switching from kdb mode to kgdb mode packets were getting lost
depending on the size of the fifo queue of the serial chip. When gdb
initially connects if it is in kdb mode it should entirely send any
character buffer over to the gdbstub when switching connections.
Previously kdb was zero'ing out the character buffer and this could
lead to gdb failing to connect at all, or a lengthy pause could occur
on the initial connect.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
|
| | |/ /
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The BTARGS and BTSYMARG variables do not have any function in the
mainline version of kdb.
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
|
|\ \ \ \
| |_|_|/
|/| | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k/math-emu: Remove unnecessary code
m68k/math-emu: Remove commented out old code
m68k: Kill warning in setup_arch() when compiling for Sun3
m68k/atari: Prefix GPIO_{IN,OUT} with CODEC_
sparc: iounmap() and *_free_coherent() - Use lookup_resource()
m68k/atari: Reserve some ST-RAM early on for device buffer use
m68k/amiga: Chip RAM - Use lookup_resource()
resources: Add lookup_resource()
sparc: _sparc_find_resource() should check for exact matches
m68k/amiga: Chip RAM - Offset resource end by CHIP_PHYSADDR
m68k/amiga: Chip RAM - Use resource_size() to fix off-by-one error
m68k/amiga: Chip RAM - Change chipavail to an atomic_t
m68k/amiga: Chip RAM - Always allocate from the start of memory
m68k/amiga: Chip RAM - Convert from printk() to pr_*()
m68k/amiga: Chip RAM - Use tabs for indentation
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a function to find an existing resource by a resource start address.
This allows to implement simple allocators (with a malloc/free-alike API)
on top of the resource system.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (430 commits)
[media] ir-mce_kbd-decoder: include module.h for its facilities
[media] ov5642: include module.h for its facilities
[media] em28xx: Fix DVB-C maxsize for em2884
[media] tda18271c2dd: Fix saw filter configuration for DVB-C @6MHz
[media] v4l: mt9v032: Fix Bayer pattern
[media] V4L: mt9m111: rewrite set_pixfmt
[media] V4L: mt9m111: fix missing return value check mt9m111_reg_clear
[media] V4L: initial driver for ov5642 CMOS sensor
[media] V4L: sh_mobile_ceu_camera: fix Oops when USERPTR mapping fails
[media] V4L: soc-camera: remove soc-camera bus and devices on it
[media] V4L: soc-camera: un-export the soc-camera bus
[media] V4L: sh_mobile_csi2: switch away from using the soc-camera bus notifier
[media] V4L: add media bus configuration subdev operations
[media] V4L: soc-camera: group struct field initialisations together
[media] V4L: soc-camera: remove now unused soc-camera specific PM hooks
[media] V4L: pxa-camera: switch to using standard PM hooks
[media] NetUP Dual DVB-T/C CI RF: force card hardware revision by module param
[media] Don't OOPS if videobuf_dvb_get_frontend return NULL
[media] NetUP Dual DVB-T/C CI RF: load firmware according card revision
[media] omap3isp: Support configurable HS/VS polarities
...
Fix up conflicts:
- arch/arm/mach-omap2/board-rx51-peripherals.c:
cleanup regulator supply definitions in mach-omap2
vs
OMAP3: RX-51: define vdds_csib regulator supply
- drivers/staging/tm6000/tm6000-alsa.c (trivial)
|
| |/ /
| | |
| | |
| | |
| | | |
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc
* 'next/dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: (21 commits)
arm/dt: tegra devicetree support
arm/versatile: Add device tree support
dt/irq: add irq_domain_generate_simple() helper
irq: add irq_domain translation infrastructure
dmaengine: imx-sdma: add device tree probe support
dmaengine: imx-sdma: sdma_get_firmware does not need to copy fw_name
dmaengine: imx-sdma: use platform_device_id to identify sdma version
mmc: sdhci-esdhc-imx: add device tree probe support
mmc: sdhci-pltfm: dt device does not pass parent to sdhci_alloc_host
mmc: sdhci-esdhc-imx: get rid of the uses of cpu_is_mx()
mmc: sdhci-esdhc-imx: do not reference platform data after probe
mmc: sdhci-esdhc-imx: extend card_detect and write_protect support for mx5
net/fec: add device tree probe support
net: ibm_newemac: convert it to use of_get_phy_mode
dt/net: add helper function of_get_phy_mode
net/fec: gasket needs to be enabled for some i.mx
serial/imx: add device tree probe support
serial/imx: get rid of the uses of cpu_is_mx1()
arm/dt: Add dtb make rule
arm/dt: Add skeleton dtsi file
...
|