From b39c4fab259b216148e705344a892c96efe1946d Mon Sep 17 00:00:00 2001
From: Paul Jackson
Date: Fri, 20 May 2005 13:59:15 -0700
Subject: [PATCH] cpusets+hotplug+preempt broken

This patch removes the entwining of cpusets and hotplug code in the
"No more Mr. Nice Guy" case of sched.c move_task_off_dead_cpu().

Since the hotplug code is holding a spinlock at this point, we cannot
take the cpuset semaphore, cpuset_sem, as would seem to be required
either to update the task's cpuset or to scan up the nested cpuset
chain looking for the nearest cpuset ancestor that still has some CPUs
online.  So we just punt and blast the task's cpus_allowed with all
bits allowed.

This reverts these lines of code to what they were before the cpuset
patch, and it updates the cpuset documentation file to match.

The one known alternative to this that seems to work came from Dinakar
Guniguntala, and required the hotplug code to take the cpuset_sem
semaphore much earlier in its processing.  So far as we know, the
increased locking entanglement between cpusets and hotplug of this
alternative approach is not worth doing in this case.

Signed-off-by: Paul Jackson
Acked-by: Nathan Lynch
Acked-by: Dinakar Guniguntala
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/sched.c b/kernel/sched.c
index 0dc3158667a2..66b2ed784822 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4243,7 +4243,7 @@ static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *tsk)

         /* No more Mr. Nice Guy. */
         if (dest_cpu == NR_CPUS) {
-                tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+                cpus_setall(tsk->cpus_allowed);
                 dest_cpu = any_online_cpu(tsk->cpus_allowed);

                 /*
--
cgit v1.2.1


From 10f02d1c59e55f529140dda3a92f0099d748451c Mon Sep 17 00:00:00 2001
From: Samuel Thibault
Date: Sat, 21 May 2005 17:50:15 +0200
Subject: [PATCH] spin_unlock_bh() and preempt_check_resched()

In _spin_unlock_bh(lock):

        do { \
                _raw_spin_unlock(lock); \
                preempt_enable(); \
                local_bh_enable(); \
                __release(lock); \
        } while (0)

there is no reason to use preempt_enable() instead of a simple
preempt_enable_no_resched().

Since we know bottom halves are disabled, preempt_schedule() will
always return at once (preempt_count != 0), and hence the
preempt_check_resched() is useless here.

This fixes it by using preempt_enable_no_resched() instead of
preempt_enable(), and thus avoids the useless preempt_check_resched()
just before re-enabling bottom halves.
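As an illustrative aside, not part of the original changelog: the argument
above can be modelled in a few lines of stand-alone user-space C.  The names
preempt_count, PREEMPT_OFFSET and SOFTIRQ_OFFSET merely mirror the kernel's;
this is a simplified sketch of the bookkeeping, not kernel code.

    /*
     * Model: a reschedule check gated on the count reaching zero can never
     * fire while the bottom-half-disable contribution is still in the count.
     */
    #include <stdio.h>

    #define PREEMPT_OFFSET  1
    #define SOFTIRQ_OFFSET  0x100   /* "bottom halves disabled" contribution */

    static int preempt_count;       /* per-CPU in the real kernel */

    static void check_resched(void)
    {
        /* models preempt_check_resched(): bail out unless count is zero */
        if (preempt_count != 0) {
            printf("resched check: no-op (count=0x%x)\n", preempt_count);
            return;
        }
        printf("resched check: would call preempt_schedule()\n");
    }

    int main(void)
    {
        preempt_count += SOFTIRQ_OFFSET;   /* local_bh_disable()              */
        preempt_count += PREEMPT_OFFSET;   /* preempt_disable() in lock path  */

        /* _spin_unlock_bh(): the preempt count is dropped first ...         */
        preempt_count -= PREEMPT_OFFSET;
        check_resched();                   /* always a no-op here             */

        /* ... so skipping that check loses nothing; the reschedule
         * opportunity comes when bottom halves are re-enabled instead.      */
        preempt_count -= SOFTIRQ_OFFSET;   /* local_bh_enable()               */
        check_resched();
        return 0;
    }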
Signed-off-by: Samuel Thibault
Signed-off-by: Linus Torvalds
---
 kernel/spinlock.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'kernel')

diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index e15ed17863f1..0c3f9d8bbe17 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -294,7 +294,7 @@ EXPORT_SYMBOL(_spin_unlock_irq);
 void __lockfunc _spin_unlock_bh(spinlock_t *lock)
 {
         _raw_spin_unlock(lock);
-        preempt_enable();
+        preempt_enable_no_resched();
         local_bh_enable();
 }
 EXPORT_SYMBOL(_spin_unlock_bh);
@@ -318,7 +318,7 @@ EXPORT_SYMBOL(_read_unlock_irq);
 void __lockfunc _read_unlock_bh(rwlock_t *lock)
 {
         _raw_read_unlock(lock);
-        preempt_enable();
+        preempt_enable_no_resched();
         local_bh_enable();
 }
 EXPORT_SYMBOL(_read_unlock_bh);
@@ -342,7 +342,7 @@ EXPORT_SYMBOL(_write_unlock_irq);
 void __lockfunc _write_unlock_bh(rwlock_t *lock)
 {
         _raw_write_unlock(lock);
-        preempt_enable();
+        preempt_enable_no_resched();
         local_bh_enable();
 }
 EXPORT_SYMBOL(_write_unlock_bh);
@@ -354,7 +354,7 @@ int __lockfunc _spin_trylock_bh(spinlock_t *lock)
         if (_raw_spin_trylock(lock))
                 return 1;

-        preempt_enable();
+        preempt_enable_no_resched();
         local_bh_enable();
         return 0;
 }
--
cgit v1.2.1


From c33880aaddbbab1ccf36f4457ed1090621f2e39a Mon Sep 17 00:00:00 2001
From: Kirill Korotaev
Date: Tue, 24 May 2005 19:29:47 -0700
Subject: [PATCH] sigkill priority fix

If SIGKILL does not have priority, we cannot instantly kill a task
before it does some unexpected work.  This can be critical, but we were
unable to reproduce it easily until Heiko Carstens reported the problem
on LKML.

Signed-off-by: Kirill Korotaev
Signed-off-by: Alexey Kuznetsov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/signal.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/signal.c b/kernel/signal.c
index 8f3debc77c5b..b3c24c732c5a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -522,7 +522,16 @@ static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
 {
         int sig = 0;

-        sig = next_signal(pending, mask);
+        /* SIGKILL must have priority, otherwise it is quite easy
+         * to create an unkillable process, sending sig < SIGKILL
+         * to self */
+        if (unlikely(sigismember(&pending->signal, SIGKILL))) {
+                if (!sigismember(mask, SIGKILL))
+                        sig = SIGKILL;
+        }
+
+        if (likely(!sig))
+                sig = next_signal(pending, mask);
         if (sig) {
                 if (current->notifier) {
                         if (sigismember(current->notifier_mask, sig)) {
--
cgit v1.2.1


From 2efe86b809d97debaaf9fcc13b041aedf15bd3d2 Mon Sep 17 00:00:00 2001
From: Paul Jackson
Date: Fri, 27 May 2005 02:02:43 -0700
Subject: [PATCH] cpuset exit NULL dereference fix

There is a race in the kernel cpuset code, between the code to handle
notify_on_release, and the code to remove a cpuset.  The
notify_on_release code can end up trying to access a cpuset that has
been removed.  In the most common case, this causes a NULL pointer
dereference from the routine cpuset_path.  However, all manner of bad
things are possible, in theory at least.

The existing code decrements the cpuset use count, and if the count
goes to zero, processes the notify_on_release request, if appropriate.
However, once the count goes to zero, unless we are holding the global
cpuset_sem semaphore, there is nothing to stop another task from
immediately removing the cpuset entirely and recycling its memory.

The obvious fix would be to always hold the cpuset_sem semaphore while
decrementing the use count and dealing with notify_on_release.
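For illustration only, not part of the original changelog: the window
described above can be modelled with two user-space threads.  All names here
are invented for the sketch, and a "recycled" flag stands in for the rmdir
path actually freeing the cpuset; on some runs the exiting thread observes
the recycled object, on others it does not, which is exactly the race.

    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    struct fake_cpuset {
        atomic_int count;
        atomic_int recycled;            /* 1 once the remover has "freed" it */
    };

    static pthread_mutex_t fake_cpuset_sem = PTHREAD_MUTEX_INITIALIZER;
    static struct fake_cpuset cs = { 1, 0 };

    static void *exiter(void *arg)      /* old cpuset_exit() ordering */
    {
        /* the count can hit zero before we hold the semaphore */
        if (atomic_fetch_sub(&cs.count, 1) == 1) {
            sched_yield();              /* widen the window for the demo */
            pthread_mutex_lock(&fake_cpuset_sem);
            if (atomic_load(&cs.recycled))
                printf("BUG: notify path touching a recycled cpuset\n");
            pthread_mutex_unlock(&fake_cpuset_sem);
        }
        return NULL;
    }

    static void *remover(void *arg)     /* stands in for the rmdir path */
    {
        while (atomic_load(&cs.count) != 0)     /* zero count looks unused */
            sched_yield();
        pthread_mutex_lock(&fake_cpuset_sem);
        atomic_store(&cs.recycled, 1);
        pthread_mutex_unlock(&fake_cpuset_sem);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, exiter, NULL);
        pthread_create(&b, NULL, remover, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

The patch below closes this window for notify_on_release cpusets by moving
the decrement inside cpuset_sem.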
However, we don't want to force a global semaphore into the mainline
task exit path, as that might create a scaling problem.

The actual fix is almost as easy: since this is only an issue for
cpusets using notify_on_release, which the top-level big cpusets don't
normally need to use, only take cpuset_sem for cpusets using
notify_on_release.

This code has been run for hours without a hiccup, while running a
cpuset create/destroy stress test that could crash the existing kernel
in seconds.

This patch applies to the current -linus git kernel.

Signed-off-by: Paul Jackson
Acked-by: Simon Derr
Acked-by: Dinakar Guniguntala
Signed-off-by: Linus Torvalds
---
 kernel/cpuset.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

(limited to 'kernel')

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 961d74044deb..00e8f2575512 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -166,9 +166,8 @@ static struct super_block *cpuset_sb = NULL;
  * The hooks from fork and exit, cpuset_fork() and cpuset_exit(), don't
  * (usually) grab cpuset_sem.  These are the two most performance
  * critical pieces of code here.  The exception occurs on exit(),
- * if the last task using a cpuset exits, and the cpuset was marked
- * notify_on_release.  In that case, the cpuset_sem is taken, the
- * path to the released cpuset calculated, and a usermode call made
+ * when a task in a notify_on_release cpuset exits.  Then cpuset_sem
+ * is taken, and if the cpuset count is zero, a usermode call made
  * to /sbin/cpuset_release_agent with the name of the cpuset (path
  * relative to the root of cpuset file system) as the argument.
  *
@@ -1404,6 +1403,18 @@ void cpuset_fork(struct task_struct *tsk)
  *
  * Description: Detach cpuset from @tsk and release it.
  *
+ * Note that cpusets marked notify_on_release force every task
+ * in them to take the global cpuset_sem semaphore when exiting.
+ * This could impact scaling on very large systems.  Be reluctant
+ * to use notify_on_release cpusets where very high task exit
+ * scaling is required on large systems.
+ *
+ * Don't even think about derefencing 'cs' after the cpuset use
+ * count goes to zero, except inside a critical section guarded
+ * by the cpuset_sem semaphore.  If you don't hold cpuset_sem,
+ * then a zero cpuset use count is a license to any other task to
+ * nuke the cpuset immediately.
+ *
 **/

 void cpuset_exit(struct task_struct *tsk)
@@ -1415,10 +1426,13 @@ void cpuset_exit(struct task_struct *tsk)
         tsk->cpuset = NULL;
         task_unlock(tsk);

-        if (atomic_dec_and_test(&cs->count)) {
+        if (notify_on_release(cs)) {
                 down(&cpuset_sem);
-                check_for_release(cs);
+                if (atomic_dec_and_test(&cs->count))
+                        check_for_release(cs);
                 up(&cpuset_sem);
+        } else {
+                atomic_dec(&cs->count);
         }
 }

--
cgit v1.2.1


From b60c1f6ffd88850079ae419aa933ab0eddbd5535 Mon Sep 17 00:00:00 2001
From: John Hawkes
Date: Fri, 27 May 2005 12:53:00 -0700
Subject: [PATCH] drop note_interrupt() for per-CPU for proper scaling

The "unhandled interrupts" catcher, note_interrupt(), increments a
global desc->irq_count and grossly damages scaling of very large
systems, e.g., >192p ia64 Altix, because of this highly contended
cacheline, especially for timer interrupts.  384p is severely crippled,
and 512p is unusable.

All calls to note_interrupt() can be disabled by booting with
"noirqdebug", but this disables the useful interrupt checking for all
interrupts.

I propose eliminating note_interrupt() for all per-CPU interrupts.
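As an illustrative aside, not from the original changelog: the cost of
funnelling every CPU through one plain shared counter is easy to demonstrate
from user space.  The stand-alone sketch below uses invented names and
deliberately racy increments; it shows the lost-update half of the problem
directly, and the same everyone-hits-one-cacheline pattern is what destroys
scaling on a large machine.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS  8
    #define NLOOPS    1000000

    static unsigned long shared_irq_count;           /* one hot, shared counter */
    static unsigned long percpu_irq_count[NTHREADS]; /* one counter per "CPU"   */

    static void *irq_storm(void *arg)
    {
        long cpu = (long)arg;

        for (int i = 0; i < NLOOPS; i++) {
            shared_irq_count++;          /* non-atomic and contended: racy */
            percpu_irq_count[cpu]++;     /* private to this thread: exact  */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        unsigned long percpu_total = 0;

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, irq_storm, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < NTHREADS; i++)
            percpu_total += percpu_irq_count[i];

        /* the shared counter usually comes up short of NTHREADS * NLOOPS */
        printf("shared : %lu (expected %lu)\n", shared_irq_count,
               (unsigned long)NTHREADS * NLOOPS);
        printf("per-cpu: %lu\n", percpu_total);
        return 0;
    }

Per-thread counters in the sketch avoid both problems; the patch takes the
simpler route of not calling note_interrupt() for per-CPU interrupts at all,
as 2.6.10 and earlier did.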
This was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
restructuring added a call to note_interrupt() for per-CPU interrupts.
Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway,
as the desc->irq_count++ increment isn't atomic (which, if done, would
make scaling even worse).

Signed-off-by: John Hawkes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/irq/handle.c | 2 --
 1 file changed, 2 deletions(-)

(limited to 'kernel')

diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 06b5a6323998..436c7d93c00a 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -119,8 +119,6 @@ fastcall unsigned int __do_IRQ(unsigned int irq, struct pt_regs *regs)
                  */
                 desc->handler->ack(irq);
                 action_ret = handle_IRQ_event(irq, regs, desc->action);
-                if (!noirqdebug)
-                        note_interrupt(irq, desc, action_ret);
                 desc->handler->end(irq);
                 return 1;
         }
--
cgit v1.2.1


From ae92ef8a442421356950a0a8dfdc35e8e783000e Mon Sep 17 00:00:00 2001
From: Roman Zippel
Date: Tue, 31 May 2005 14:39:29 -0700
Subject: [PATCH] flush icache in correct context

flush_icache_range() is used in two different situations: in
binfmt_elf.c & co. for user-space mappings, and in module.c for kernel
modules.  On m68k, flush_icache_range() doesn't know which data to
flush, as it has separate address spaces and the pointer argument can
be valid in either address space.

First I considered splitting flush_icache_range(), but this patch is
simpler.  Setting the correct context gives flush_icache_range() enough
information to flush the correct data.

Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/module.c | 6 ++++++
 1 file changed, 6 insertions(+)

(limited to 'kernel')

diff --git a/kernel/module.c b/kernel/module.c
index 5734ab09d3f9..83b3d376708c 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1758,6 +1758,7 @@ sys_init_module(void __user *umod,
                 const char __user *uargs)
 {
         struct module *mod;
+        mm_segment_t old_fs = get_fs();
         int ret = 0;

         /* Must have permission */
@@ -1775,6 +1776,9 @@ sys_init_module(void __user *umod,
                 return PTR_ERR(mod);
         }

+        /* flush the icache in correct context */
+        set_fs(KERNEL_DS);
+
         /* Flush the instruction cache, since we've played with text */
         if (mod->module_init)
                 flush_icache_range((unsigned long)mod->module_init,
@@ -1783,6 +1787,8 @@ sys_init_module(void __user *umod,
         flush_icache_range((unsigned long)mod->module_core,
                            (unsigned long)mod->module_core + mod->core_size);

+        set_fs(old_fs);
+
         /* Now sew it into the lists.  They won't access us, since
            strong_try_module_get() will fail. */
         stop_machine_run(__link_module, mod, NR_CPUS);
--
cgit v1.2.1
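As a closing illustrative aside, not from the patch: the same
save/set/restore idiom, factored into a hypothetical helper for clarity.
The tree has no such helper; sys_init_module() open-codes the calls exactly
as the diff above shows, and this is kernel-style sketch code rather than a
stand-alone program.

    /* Hypothetical helper, for illustration only. */
    static void flush_module_icache(const struct module *mod)
    {
        mm_segment_t old_fs = get_fs();    /* remember the caller's context */

        /*
         * Module text lives at kernel addresses; on m68k and similar
         * architectures, flush_icache_range() picks the address space to
         * flush from the current segment, so switch to KERNEL_DS around
         * the calls and restore afterwards.
         */
        set_fs(KERNEL_DS);
        if (mod->module_init)
            flush_icache_range((unsigned long)mod->module_init,
                               (unsigned long)mod->module_init
                               + mod->init_size);
        flush_icache_range((unsigned long)mod->module_core,
                           (unsigned long)mod->module_core + mod->core_size);
        set_fs(old_fs);                    /* always restore what was saved */
    }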