summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* [PATCH] sched: remove lb_stopbalance counterChen, Kenneth W2006-12-101-7/+4
| | | | | | | | | | | | Remove scheduler stats lb_stopbalance counter. This counter can be calculated by: lb_balanced - lb_nobusyg - lb_nobusyq. There is no need to create gazillion counters while we can derive the value. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: decrease number of load balancesSiddha, Suresh B2006-12-101-12/+47
| | | | | | | | | | | | | | | | | | | | Currently at a particular domain, each cpu in the sched group will do a load balance at the frequency of balance_interval. More the cores and threads, more the cpus will be in each sched group at SMP and NUMA domain. And we endup spending quite a bit of time doing load balancing in those domains. Fix this by making only one cpu(first idle cpu or first cpu in the group if all the cpus are busy) in the sched group do the load balance at that particular sched domain and this load will slowly percolate down to the other cpus with in that group(when they do load balancing at lower domains). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: improve migration accuracyMike Galbraith2006-12-101-21/+20
| | | | | | | | | | | | | | | | | | | | | Co-opt rq->timestamp_last_tick to maintain a cache_hot_time evaluation reference timestamp at both tick and sched times to prevent said reference, formerly rq->timestamp_last_tick, from being behind task->last_ran at evaluation time, and to move said reference closer to current time on the remote processor, intent being to improve cache hot evaluation and timestamp adjustment accuracy for task migration. Fix minor sched_time double accounting error which occurs when a task passing through schedule() does not schedule off, and takes the next timer tick. [kenneth.w.chen@intel.com: cleanup] Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Ken Chen <kenneth.w.chen@intel.com> Cc: Don Mullis <dwm@meer.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: add option to serialize load balancingChristoph Lameter2006-12-101-0/+9
| | | | | | | | | | | | | | | | | | Large sched domains can be very expensive to scan. Add an option SD_SERIALIZE to the sched domain flags. If that flag is set then we make sure that no other such domain is being balanced. [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: call tasklet less frequentlyChristoph Lameter2006-12-101-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Trigger softirq less frequently We trigger the softirq before this patch using offset of sd->interval. However, if the queue is busy then it is sufficient to schedule the softirq with sd->interval * busy_factor. So we modify the calculation of the next time to balance by taking the interval added to last_balance again. This is only the right value if the idle/busy situation continues as is. There are two potential trouble spots: - If the queue was idle and now gets busy then we call rebalance early. However, that is not a problem because we will then use the longer interval for the next period. - If the queue was busy and becomes idle then we potentially wait too long before rebalancing. However, when the task goes idle then idle_balance is called. We add another calculation of the next balance time based on sd->interval in idle_balance so that we will rebalance soon. V2->V3: - Calculate rebalance time based on current jiffies and not based on the jiffies at the last time we load balanced. We no longer rely on staggering and therefore we can affort to do this now. V3->V4: - Use functions to do jiffy comparisons. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: use softirq for load balancingChristoph Lameter2006-12-101-5/+17
| | | | | | | | | | | | | | | | | | | | Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced softirq. We calculate the earliest time for each layer of sched domains to be rescanned (this is the rescan time for idle) and use the earliest of those to schedule the softirq via a new field "next_balance" added to struct rq. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: move idle status calculation into rebalance_tick()Christoph Lameter2006-12-101-21/+16
| | | | | | | | | | | | | | | | | | | | | | Perform the idle state determination in rebalance_tick. If we separate balancing from sched_tick then we also need to determine the idle state in rebalance_tick. V2->V3 Remove useless idlle != 0 check. Checking nr_running seems to be sufficient. Thanks Suresh. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: extract load calculation from rebalance_tickChristoph Lameter2006-12-101-42/+54
| | | | | | | | | | | | | | | | | | | | | | A load calculation is always done in rebalance_tick() in addition to the real load balancing activities that only take place when certain jiffie counts have been reached. Move that processing into a separate function and call it directly from scheduler_tick(). Also extract the time slice handling from scheduler_tick and put it into a separate function. Then we can clean up scheduler_tick significantly. It will no longer have any gotos. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: disable interrupts for locking in load_balance()Christoph Lameter2006-12-101-5/+6
| | | | | | | | | | | | | | | | Interrupts must be disabled for request queue locks if we want to run load_balance() with interrupts enabled. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: remove staggering of load balancingChristoph Lameter2006-12-101-8/+2
| | | | | | | | | | | | | | | | | | | | | Timer interrupts already are staggered. We do not need an additional layer of time staggering for short load balancing actions that take a reasonably small portion of the time slice. For load balancing on large sched_domains we will add a serialization later that avoids concurrent load balance operations and thus has the same effect as load staggering. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched: avoid taking rq lock in wake_priority_sleeperChristoph Lameter2006-12-101-0/+3
| | | | | | | | | | | | | | | | Avoid taking the request queue lock in wake_priority_sleeper if there are no running processes. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] move_task_off_dead_cpu() should be called with disabled intsKirill Korotaev2006-12-101-3/+14
| | | | | | | | | | | | | move_task_off_dead_cpu() requires interrupts to be disabled, while migrate_dead() calls it with enabled interrupts. Added appropriate comments to functions and added BUG_ON(!irqs_disabled()) into double_rq_lock() and double_lock_balance() which are the origin sources of such bugs. Signed-off-by: Kirill Korotaev <dev@openvz.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ched domain: move sched group allocations to percpu areaSiddha, Suresh B2006-12-101-68/+64
| | | | | | | | | | | | Move the sched group allocations to percpu area. This will minimize cross node memory references and also cleans up the sched groups allocation for allnodes sched domain. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sched.c: correct comment for this_rq_lock()Robert P. J. Day2006-12-101-1/+1
| | | | | | | Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] io-accounting: via taskstatsAndrew Morton2006-12-101-0/+9
| | | | | | | | | | | | | | Deliver IO accounting via taskstats. Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] io-accounting: core statisticsAndrew Morton2006-12-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The present per-task IO accounting isn't very useful. It simply counts the number of bytes passed into read() and write(). So if a process reads 1MB from an already-cached file, it is accused of having performed 1MB of I/O, which is wrong. (David Wright had some comments on the applicability of the present logical IO accounting: For billing purposes it is useless but for workload analysis it is very useful read_bytes/read_calls average read request size write_bytes/write_calls average write request size read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing write_bytes/write_blocks ie logical/physical guess since pdflush writes can be missed I often look for logical larger than physical to see filesystem cache problems. And the bytes/cpusec can help find applications that are dominating the cache and causing slow interactive response from page cache contention. I want to find the IO intensive applications and make sure they are doing efficient IO. Thus the acctcms(sysV) or csacms command would give the high IO commands). This patchset adds new accounting which tries to be more accurate. We account for three things: reads: attempt to count the number of bytes which this process really did cause to be fetched from the storage layer. Done at the submit_bio() level, so it is accurate for block-backed filesystems. I also attempt to wire up NFS and CIFS. writes: attempt to count the number of bytes which this process caused to be sent to the storage layer. This is done at page-dirtying time. The big inaccuracy here is truncate. If a process writes 1MB to a file and then deletes the file, it will in fact perform no writeout. But it will have been accounted as having caused 1MB of write. So... cancelled_writes: account the number of bytes which this process caused to not happen, by truncating pagecache. We _could_ just subtract this from the process's `write' accounting. But that means that some processes would be reported to have done negative amounts of write IO, which is silly. So we just report the raw number and punt this decision up to userspace. Now, we _could_ account for writes at the physical I/O level. But - This would require that we track memory-dirtying tasks at the per-page level (would require a new pointer in struct page). - It would mean that IO statistics for a process are usually only available long after that process has exitted. Which means that we probably cannot communicate this info via taskstats. This patch: Wire up the kernel-private data structures and the accessor functions to manipulate them. Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sysctl: remove unused "context" paramAlexey Dobriyan2006-12-101-25/+22
| | | | | | | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: David Howells <dhowells@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sysctl: remove some OPsAlexey Dobriyan2006-12-101-10/+0
| | | | | | | | | | kernel.cap-bound uses only OP_SET and OP_AND Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ipc-procfs-sysctl mixupsRandy Dunlap2006-12-101-0/+11
| | | | | | | | | | | | | | | | | | | When CONFIG_PROC_FS=n and CONFIG_PROC_SYSCTL=n but CONFIG_SYSVIPC=y, we get this build error: kernel/built-in.o:(.data+0xc38): undefined reference to `proc_ipc_doulongvec_minmax' kernel/built-in.o:(.data+0xc88): undefined reference to `proc_ipc_doulongvec_minmax' kernel/built-in.o:(.data+0xcd8): undefined reference to `proc_ipc_dointvec' kernel/built-in.o:(.data+0xd28): undefined reference to `proc_ipc_dointvec' kernel/built-in.o:(.data+0xd78): undefined reference to `proc_ipc_dointvec' kernel/built-in.o:(.data+0xdc8): undefined reference to `proc_ipc_dointvec' kernel/built-in.o:(.data+0xe18): undefined reference to `proc_ipc_dointvec' make: *** [vmlinux] Error 1 Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] WorkStruct: Use direct assignment rather than cmpxchg()David Howells2006-12-091-12/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use direct assignment rather than cmpxchg() as the latter is unavailable and unimplementable on some platforms and is actually unnecessary. The use of cmpxchg() was to guard against two possibilities, neither of which can actually occur: (1) The pending flag may have been unset or may be cleared. However, given where it's called, the pending flag is _always_ set. I don't think it can be unset whilst we're in set_wq_data(). Once the work is enqueued to be actually run, the only way off the queue is for it to be actually run. If it's a delayed work item, then the bit can't be cleared by the timer because we haven't started the timer yet. Also, the pending bit can't be cleared by cancelling the delayed work _until_ the work item has had its timer started. (2) The workqueue pointer might change. This can only happen in two cases: (a) The work item has just been queued to actually run, and so we're protected by the appropriate workqueue spinlock. (b) A delayed work item is being queued, and so the timer hasn't been started yet, and so no one else knows about the work item or can access it (the pending bit protects us). Besides, set_wq_data() _sets_ the workqueue pointer unconditionally, so it can be assigned instead. So, replacing the set_wq_data() with a straight assignment would be okay in most cases. The problem is where we end up tangling with test_and_set_bit() emulated using spinlocks, and even then it's not a problem _provided_ test_and_set_bit() doesn't attempt to modify the word if the bit was set. If that's a problem, then a bitops-proofed assignment will be required - equivalent to atomic_set() vs other atomic_xxx() ops. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sysctl: fix sys_sysctl interface of ipc sysctlsEric W. Biederman2006-12-081-0/+60
| | | | | | | | | | | | | Currently there is a regression and the ipc sysctls don't show up in the binary sysctl namespace. This patch adds sysctl_ipc_data to read data/write from the appropriate namespace and deliver it in the expected manner. [akpm@osdl.org: warning fix] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sysctl: simplify ipc ns specific sysctlsEric W. Biederman2006-12-081-57/+49
| | | | | | | | | | | | | | | Refactor the ipc sysctl support so that it is simpler, more readable, and prepares for fixing the bug with the wrong values being returned in the sys_sysctl interface. The function proc_do_ipc_string() was misnamed as it never handled strings. It's magic of when to work with strings and when to work with longs belonged in the sysctl table. I couldn't tell if the code would work if you disabled the ipc namespace but it certainly looked like it would have problems. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sysctl: implement sysctl_uts_string()Eric W. Biederman2006-12-081-5/+32
| | | | | | | | | | | | The problem: When using sys_sysctl we don't read the proper values for the variables exported from the uts namespace, nor do we do the proper locking. This patch introduces sysctl_uts_string which properly fetches the values and does the proper locking. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sysctl: simplify sysctl_uts_stringEric W. Biederman2006-12-081-102/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The binary interface to the namespace sysctls was never implemented resulting in some really weird things if you attempted to use sys_sysctl to read your hostname for example. This patch series simples the code a little and implements the binary sysctl interface. In testing this patch series I discovered that our 32bit compatibility for the binary sysctl interface is imperfect. In particular KERN_SHMMAX and KERN_SMMALL are size_t sized quantities and are returned as 8 bytes on to 32bit binaries using a x86_64 kernel. However this has existing for a long time so it is not a new regression with the namespace work. Gads the whole sysctl thing needs work before it stops being easy to shoot yourself in the foot. Looking forward a little bit we need a better way to handle sysctls and namespaces as our current technique will not work for the network namespace. I think something based on the current overlapping sysctl trees will work but the proc side needs to be redone before we can use it. This patch: Introduce get_uts() and put_uts() (used later) and remove most of the special cases for when UTS namespace is compiled in. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] session_of_pgrp: kill unnecessary do_each_task_pid(PIDTYPE_PGID)Oleg Nesterov2006-12-081-11/+8
| | | | | | | | | | | | | | All members of the process group have the same sid and it can't be == 0. NOTE: this code (and a similar one in sys_setpgid) was needed because it was possibe to have ->session == 0. It's not possible any longer since [PATCH] pidhash: don't use zero pids Commit: c7c6464117a02b0d54feb4ebeca4db70fa493678 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sys_setpgid: eliminate unnecessary do_each_task_pid(PIDTYPE_PGID)Oleg Nesterov2006-12-081-7/+4
| | | | | | | | | | All tasks in the process group have the same sid, we don't need to iterate them all to check that the caller of sys_setpgid() doesn't change its session. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] add child reaper to pid_namespaceSukadev Bhattiprolu2006-12-083-11/+26
| | | | | | | | | | | | | | | | | Add a per pid_namespace child-reaper. This is needed so processes are reaped within the same pid space and do not spill over to the parent pid space. Its also needed so containers preserve existing semantic that pid == 1 would reap orphaned children. This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285 Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] use current->nsproxy->pid_nsCedric Le Goater2006-12-081-3/+3
| | | | | | | | | | Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] to nsproxyCedric Le Goater2006-12-082-7/+42
| | | | | | | | | | | | | | | | | Add the pid namespace framework to the nsproxy object. The copy of the pid namespace only increases the refcount on the global pid namespace, init_pid_ns, and unshare is not implemented. There is no configuration option to activate or deactivate this feature because this not relevant for the moment. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] rename struct pspace to struct pid_namespaceSukadev Bhattiprolu2006-12-081-24/+25
| | | | | | | | | | | | | | | Rename struct pspace to struct pid_namespace for consistency with other namespaces (uts_namespace and ipc_namespace). Also rename include/linux/pspace.h to include/linux/pid_namespace.h and variables from pspace to pid_ns. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] identifier to nsproxyCedric Le Goater2006-12-081-1/+3
| | | | | | | | | | | | | | | Add an identifier to nsproxy. The default init_ns_proxy has identifier 0 and allocated nsproxies are given -1. This identifier will be used by a new syscall sys_bind_ns. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] rename struct namespace to struct mnt_namespaceKirill Korotaev2006-12-084-20/+21
| | | | | | | | | | | | | | Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with other namespaces being developped for the containers : pid, uts, ipc, etc. 'namespace' variables and attributes are also renamed to 'mnt_ns' Signed-off-by: Kirill Korotaev <dev@sw.ru> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] add process_session() helper routine: deprecate old fieldCedric Le Goater2006-12-082-2/+2
| | | | | | | | | | | | Add an anonymous union and ((deprecated)) to catch direct usage of the session field. [akpm@osdl.org: fix various missed conversions] [jdike@addtoit.com: fix UML bug] Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] add process_session() helper routineCedric Le Goater2006-12-084-16/+17
| | | | | | | | | | | | | | | Replace occurences of task->signal->session by a new process_session() helper routine. It will be useful for pid namespaces to abstract the session pid number. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] struct path: convert kernelJosef Sipek2006-12-082-13/+13
| | | | | | Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kernel: change uses of f_{dentry, vfsmnt} to use f_pathJosef "Jeff" Sipek2006-12-085-16/+16
| | | | | | | | | Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/kernel/. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] lockdep: avoid lockdep warning in mdNeilBrown2006-12-081-0/+9
| | | | | | | | | | | | | | | | | | | | md_open takes ->reconfig_mutex which causes lockdep to complain. This (normally) doesn't have deadlock potential as the possible conflict is with a reconfig_mutex in a different device. I say "normally" because if a loop were created in the array->member hierarchy a deadlock could happen. However that causes bigger problems than a deadlock and should be fixed independently. So we flag the lock in md_open as a nested lock. This requires defining mutex_lock_interruptible_nested. Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sys_unshare: remove a broken CLONE_SIGHAND codeOleg Nesterov2006-12-081-12/+4
| | | | | | | | | | | | | | | sys_unshare(CLONE_SIGHAND) is broken, the code under 'if (new_sigh)' is never executed but very wrong. Just remove it to avoid a confusion, task_lock() has nothing to do with ->sighand changing. Also, change the comment in unshare_sighand(). Yes, CLONE_THREAD implies CLONE_SIGHAND, but still it looks confusing. Also, we don't need to check current->sighand != NULL. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] make set_special_pids() staticOleg Nesterov2006-12-081-1/+1
| | | | | | | | Make set_special_pids() static, the only caller is daemonize(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] do_acct_process(): don't take tty_mutexOleg Nesterov2006-12-081-5/+2
| | | | | | | | | | No need to take the global tty_mutex, signal->tty->driver can't go away while we are holding ->siglock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] tty: ->signal->tty lockingPeter Zijlstra2006-12-084-11/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the locking of signal->tty. Use ->sighand->siglock to protect ->signal->tty; this lock is already used by most other members of ->signal/->sighand. And unless we are 'current' or the tasklist_lock is held we need ->siglock to access ->signal anyway. (NOTE: sys_unshare() is broken wrt ->sighand locking rules) Note that tty_mutex is held over tty destruction, so while holding tty_mutex any tty pointer remains valid. Otherwise the lifetime of ttys are governed by their open file handles. This leaves some holes for tty access from signal->tty (or any other non file related tty access). It solves the tty SLAB scribbles we were seeing. (NOTE: the change from group_send_sig_info to __group_send_sig_info needs to be examined by someone familiar with the security framework, I think it is safe given the SEND_SIG_PRIV from other __group_send_sig_info invocations) [schwidefsky@de.ibm.com: 3270 fix] [akpm@osdl.org: various post-viro fixes] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Alan Cox <alan@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Roland McGrath <roland@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] CPEI gets warning at kernel/irq/migration.c:27/move_masked_irq()Hidetoshi Seto2006-12-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running my MCA test (hardware error injection) on 2.6.19, I got some warning like following: > BUG: warning at kernel/irq/migration.c:27/move_masked_irq() > > Call Trace: > [<a000000100013d20>] show_stack+0x40/0xa0 > sp=e00000006b2578d0 bsp=e00000006b2510b0 > [<a000000100013db0>] dump_stack+0x30/0x60 > sp=e00000006b257aa0 bsp=e00000006b251098 > [<a0000001000de430>] move_masked_irq+0xb0/0x240 > sp=e00000006b257aa0 bsp=e00000006b251070 > [<a0000001000de6a0>] move_native_irq+0xe0/0x180 > sp=e00000006b257aa0 bsp=e00000006b251040 > [<a00000010004ff50>] iosapic_end_level_irq+0x30/0xe0 > sp=e00000006b257aa0 bsp=e00000006b251020 > [<a0000001000d94d0>] __do_IRQ+0x170/0x400 > sp=e00000006b257aa0 bsp=e00000006b250fd8 > [<a0000001000116f0>] ia64_handle_irq+0x1b0/0x260 > sp=e00000006b257aa0 bsp=e00000006b250fa8 > [<a00000010000c3a0>] ia64_leave_kernel+0x0/0x280 > sp=e00000006b257aa0 bsp=e00000006b250fa8 > [<a000000100690cf0>] _spin_unlock_irqrestore+0x30/0x60 > sp=e00000006b257c70 bsp=e00000006b250f90 It comes from: [kernel/irq/migration.c] 26 if (CHECK_IRQ_PER_CPU(desc->status)) { 27 WARN_ON(1); 28 return; 29 } By putting some printk in kernel, I found that irqbalance is trying to move CPEI which is handled as PER_CPU irq. That's why. CPEI(Corrected Platform Error Interrupt) is ia64 specific irq, is allowed to pin to particular processor which selected by the platform, and even it is PER_CPU but it has set_affinity handler (=iosapic_set_affinity) as same as other IO-SAPIC-level interrupts. (I don't know why, but I guess that there would be typical situation where the handler for migration is needed, such as hotplug - the processor going to be offline/hot-removed.) To shut up this warning, there are 2 way at least: a) fix CPEI stuff b) prohibit setting affinity to PER_CPU irq I'm not sure what stuff of CPEI need to be fixed, but I think that returning error to attempting move PER_CPU irq is useful for all applications since it will never work. Following small patch takes b) style. It works, the warning disappeared and irqbalance still runs well. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Arjan van de Ven <arjan@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] move kallsyms data to .rodataJan Beulich2006-12-081-8/+8
| | | | | | | | | Kallsyms data is never written to, so it can as well benefit from CONFIG_DEBUG_RODATA. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Merge branch 'release' of ↵Linus Torvalds2006-12-071-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of master.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] replace kmalloc+memset with kzalloc [IA64] resolve name clash by renaming is_available_memory() [IA64] Need export for csum_ipv6_magic [IA64] Fix DISCONTIGMEM without VIRTUAL_MEM_MAP [PATCH] Add support for type argument in PAL_GET_PSTATE [IA64] tidy up return value of ip_fast_csum [IA64] implement csum_ipv6_magic for ia64. [IA64] More Itanium PAL spec updates [IA64] Update processor_info features [IA64] Add se bit to Processor State Parameter structure [IA64] Add dp bit to cache and bus check structs [IA64] SN: Correctly update smp_affinty mask [IA64] sparse cleanups [IA64] IA64 Kexec/kdump
| * [IA64] IA64 Kexec/kdumpZou Nan hai2006-12-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes and updates. 1. Remove fake rendz path and related code according to discuss with Khalid Aziz. 2. fc.i offset fix in relocate_kernel.S. 3. iospic shutdown code eoi and mask race fix from Fujitsu. 4. Warm boot hook in machine_kexec to SN SAL code from Jack Steiner. 5. Send slave to SAL slave loop patch from Jay Lan. 6. Kdump on non-recoverable MCA event patch from Jay Lan 7. Use CTL_UNNUMBERED in kdump_on_init sysctl. Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* | Add "run_scheduled_work()" workqueue functionLinus Torvalds2006-12-071-0/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows workqueue users to run just their own pending work, rather than wait for the whole workqueue to finish running. This solves the deadlock with networking libphy that was due to other workqueue entries possibly needing a lock that was held by the routine that wanted to flush its own work. It's not wonderful: if you absolutely need to synchronize with the work function having been executed, any user strictly speaking should have its own completion tracking logic, since when we run things explicitly by hand, the generic workqueue layer can no longer help us synchronize. Also, this is strictly only usable for work that has been scheduled without any delayed timers. You can not mix the new interface with schedule_delayed_work(). But it's better than what we had currently. Acked-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2006-12-071-1/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (48 commits) [NETFILTER]: Fix non-ANSI func. decl. [TG3]: Identify Serdes devices more clearly. [TG3]: Use msleep. [TG3]: Use netif_msg_*. [TG3]: Allow partial speed advertisement. [TG3]: Add TG3_FLG2_IS_NIC flag. [TG3]: Add 5787F device ID. [TG3]: Fix Phy loopback. [WANROUTER]: Kill kmalloc debugging code. [TCP] inet_twdr_hangman: Delete unnecessary memory barrier(). [NET]: Memory barrier cleanups [IPSEC]: Fix inetpeer leak in ipv4 xfrm dst entries. audit: disable ipsec auditing when CONFIG_AUDITSYSCALL=n audit: Add auditing to ipsec [IRDA] irlan: Fix compile warning when CONFIG_PROC_FS=n [IrDA]: Incorrect TTP header reservation [IrDA]: PXA FIR code device model conversion [GENETLINK]: Fix misplaced command flags. [NETLIK]: Add a pointer to the Generic Netlink wiki page. [IPV6] RAW: Don't release unlocked sock. ...
| * | audit: Add auditing to ipsecJoy Latten2006-12-061-1/+5
| |/ | | | | | | | | | | | | | | | | An audit message occurs when an ipsec SA or ipsec policy is created/deleted. Signed-off-by: Joy Latten <latten@austin.ibm.com> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6Linus Torvalds2006-12-074-45/+173
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (156 commits) [PATCH] x86-64: Export smp_call_function_single [PATCH] i386: Clean up smp_tune_scheduling() [PATCH] unwinder: move .eh_frame to RODATA [PATCH] unwinder: fully support linker generated .eh_frame_hdr section [PATCH] x86-64: don't use set_irq_regs() [PATCH] x86-64: check vector in setup_ioapic_dest to verify if need setup_IO_APIC_irq [PATCH] x86-64: Make ix86 default to HIGHMEM4G instead of NOHIGHMEM [PATCH] i386: replace kmalloc+memset with kzalloc [PATCH] x86-64: remove remaining pc98 code [PATCH] x86-64: remove unused variable [PATCH] x86-64: Fix constraints in atomic_add_return() [PATCH] x86-64: fix asm constraints in i386 atomic_add_return [PATCH] x86-64: Correct documentation for bzImage protocol v2.05 [PATCH] x86-64: replace kmalloc+memset with kzalloc in MTRR code [PATCH] x86-64: Fix numaq build error [PATCH] x86-64: include/asm-x86_64/cpufeature.h isn't a userspace header [PATCH] unwinder: Add debugging output to the Dwarf2 unwinder [PATCH] x86-64: Clarify error message in GART code [PATCH] x86-64: Fix interrupt race in idle callback (3rd try) [PATCH] x86-64: Remove unwind stack pointer alignment forcing again ... Fixed conflict in include/linux/uaccess.h manually Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] unwinder: move .eh_frame to RODATAJan Beulich2006-12-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The .eh_frame section contents is never written to, so it can as well benefit from CONFIG_DEBUG_RODATA. Diff-ed against firstfloor tree. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
OpenPOWER on IntegriCloud