summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm: extract exe_file handling from procfsJiri Slaby2011-05-266-82/+53
| | | | | | | | | | | | | | | | | Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/. This was because exe_file was needed only for /proc/<pid>/exe. Since we will need the exe_file functionality also for core dumps (so core name can contain full binary path), built this functionality always into the kernel. To achieve that move that out of proc FS to the kernel/ where in fact it should belong. By doing that we can make dup_mm_exe_file static. Also we can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kgdbts: unify/generalize gdb breakpoint adjustmentMike Frysinger2011-05-264-18/+14
| | | | | | | | | | | | | | | | | | | The Blackfin arch, like the x86 arch, needs to adjust the PC manually after a breakpoint is hit as normally this is handled by the remote gdb. However, rather than starting another arch ifdef mess, create a common GDB_ADJUSTS_BREAK_OFFSET define for any arch to opt-in via their kgdb.h. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Dongdong Deng <dongdong.deng@windriver.com> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sh: convert to asm-generic ptrace.hMike Frysinger2011-05-261-2/+4
| | | | | | | | | | | | | | Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Cc: Dongdong Deng <dongdong.deng@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86: convert to asm-generic ptrace.hMike Frysinger2011-05-261-13/+5
| | | | | | | | | | | | | | Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Cc: Dongdong Deng <dongdong.deng@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Blackfin: convert to asm-generic ptrace.hMike Frysinger2011-05-261-3/+2
| | | | | | | | | | | | | Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* asm-generic/ptrace.h: start a common low level ptrace helperMike Frysinger2011-05-261-0/+74
| | | | | | | | | | | | | | | | | | | | | | | | | This is a series of low level ptrace unification steps to make it easier for common code (like KGDB) to poke at register state. This also avoids having to duplicate higher level operations for most ports which don't have special needs for accessing things. This patch: This implements a bunch of helper funcs for poking the registers of a ptrace structure. Now common code should be able to portably update specific registers (like kgdb updating the PC). Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Cc: Dongdong Deng <dongdong.deng@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: add the pagefault count into memcg statsYing Han2011-05-266-5/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two new stats in per-memcg memory.stat which tracks the number of page faults and number of major page faults. "pgfault" "pgmajfault" They are different from "pgpgin"/"pgpgout" stat which count number of pages charged/discharged to the cgroup and have no meaning of reading/ writing page to disk. It is valuable to track the two stats for both measuring application's performance as well as the efficiency of the kernel page reclaim path. Counting pagefaults per process is useful, but we also need the aggregated value since processes are monitored and controlled in cgroup basis in memcg. Functional test: check the total number of pgfault/pgmajfault of all memcgs and compare with global vmstat value: $ cat /proc/vmstat | grep fault pgfault 1070751 pgmajfault 553 $ cat /dev/cgroup/memory.stat | grep fault pgfault 1071138 pgmajfault 553 total_pgfault 1071142 total_pgmajfault 553 $ cat /dev/cgroup/A/memory.stat | grep fault pgfault 199 pgmajfault 0 total_pgfault 199 total_pgmajfault 0 Performance test: run page fault test(pft) wit 16 thread on faulting in 15G anon pages in 16G container. There is no regression noticed on the "flt/cpu/s" Sample output from pft: TAG pft:anon-sys-default: Gb Thr CLine User System Wall flt/cpu/s fault/wsec 15 16 1 0.67s 233.41s 14.76s 16798.546 266356.260 +-------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 10 16682.962 17344.027 16913.524 16928.812 166.5362 + 10 16695.568 16923.896 16820.604 16824.652 84.816568 No difference proven at 95.0% confidence [akpm@linux-foundation.org: fix build] [hughd@google.com: shmem fix] Signed-off-by: Ying Han <yinghan@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: add memory.numastat api for numa statisticsYing Han2011-05-261-0/+155
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new API exports numa_maps per-memcg basis. This is a piece of useful information where it exports per-memcg page distribution across real numa nodes. One of the usecases is evaluating application performance by combining this information w/ the cpu allocation to the application. The output of the memory.numastat tries to follow w/ simiar format of numa_maps like: total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ... file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ... anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... And we have per-node: total = file + anon + unevictable $ cat /dev/cgroup/memory/memory.numa_stat total=250020 N0=87620 N1=52367 N2=45298 N3=64735 file=225232 N0=83402 N1=46160 N2=40522 N3=55148 anon=21053 N0=3424 N1=6207 N2=4776 N3=6646 unevictable=3735 N0=794 N1=0 N2=0 N3=2941 Signed-off-by: Ying Han <yinghan@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages()Ying Han2011-05-263-9/+9
| | | | | | | | | | | | | | | The caller of the function has been renamed to zone_nr_lru_pages(), and this is just fixing up in the memcg code. The current name is easily to be mis-read as zone's total number of pages. Signed-off-by: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: remove unused retry signal from reclaimJohannes Weiner2011-05-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | If the memcg reclaim code detects the target memcg below its limit it exits and returns a guaranteed non-zero value so that the charge is retried. Nowadays, the charge side checks the memcg limit itself and does not rely on this non-zero return value trick. This patch removes it. The reclaim code will now always return the true number of pages it reclaimed on its own. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel<riel@redhat.com> Acked-by: Ying Han<yinghan@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: fix get_scan_count() for small targetsKAMEZAWA Hiroyuki2011-05-263-35/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During memory reclaim we determine the number of pages to be scanned per zone as (anon + file) >> priority. Assume scan = (anon + file) >> priority. If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and priority gets higher. This has some problems. 1. This increases priority as 1 without any scan. To do scan in this priority, amount of pages should be larger than 512M. If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be batched, later. (But we lose 1 priority.) If memory size is below 16M, pages >> priority is 0 and no scan in DEF_PRIORITY forever. 2. If zone->all_unreclaimabe==true, it's scanned only when priority==0. So, x86's ZONE_DMA will never be recoverred until the user of pages frees memory by itself. 3. With memcg, the limit of memory can be small. When using small memcg, it gets priority < DEF_PRIORITY-2 very easily and need to call wait_iff_congested(). For doing scan before priorty=9, 64MB of memory should be used. Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when 1. the target is enough small. 2. it's kswapd or memcg reclaim. Then we can avoid rapid priority drop and may be able to recover all_unreclaimable in a small zones. And this patch removes nr_saved_scan. This will allow scanning in this priority even when pages >> priority is very small. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Ying Han <yinghan@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: reclaim memory from nodes in round-robin orderYing Han2011-05-263-7/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Presently, memory cgroup's direct reclaim frees memory from the current node. But this has some troubles. Usually when a set of threads works in a cooperative way, they tend to operate on the same node. So if they hit limits under memcg they will reclaim memory from themselves, damaging the active working set. For example, assume 2 node system which has Node 0 and Node 1 and a memcg which has 1G limit. After some work, file cache remains and the usages are Node 0: 1M Node 1: 998M. and run an application on Node 0, it will eat its foot before freeing unnecessary file caches. This patch adds round-robin for NUMA and adds equal pressure to each node. When using cpuset's spread memory feature, this will work very well. But yes, a better algorithm is needed. [akpm@linux-foundation.org: comment editing] [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons] Signed-off-by: Ying Han <yinghan@google.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* MAINTAINERS: add mm/page_cgroup.c into memcg subsystemNamhyung Kim2011-05-261-0/+1
| | | | | | | | | | | | | AFAICS mm/page_cgroup.c is for memcg subsystem, but it was directed only to generic cgroup maintainers. Fix it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: move page-freeing code out of lockNamhyung Kim2011-05-261-9/+13
| | | | | | | | | | | | | | | Move page-freeing code out of swap_cgroup_mutex in the hope that it could reduce few of theoretical contentions between swapons and/or swapoffs. This is just a cleanup, no functional changes. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: fix off-by-one when calculating swap cgroup map lengthNamhyung Kim2011-05-261-1/+1
| | | | | | | | | | | | | It allocated one more page than necessary if @max_pages was a multiple of SC_PER_PAGE. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: mark init_section_page_cgroup() properlyNamhyung Kim2011-05-261-2/+2
| | | | | | | | | | | | | | | | | Commit ca371c0d7e23 ("memcg: fix page_cgroup fatal error in FLATMEM") removes call to alloc_bootmem() in the function so that it can be marked as __meminit to reduce memory usage when MEMORY_HOTPLUG=n. Also as the new helper function alloc_page_cgroup() is called only in the function, it should be marked too. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: remove pointless next_mz nullification in mem_cgroup_soft_limit_reclaim()Michal Hocko2011-05-261-3/+2
| | | | | | | | | | | | | | | | next_mz is assigned to NULL if __mem_cgroup_largest_soft_limit_node selects the same mz. This doesn't make much sense as we assign to the variable right in the next loop. Compiler will probably optimize this out but it is little bit confusing for the code reading. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: add the soft_limit reclaim in global direct reclaim.Ying Han2011-05-261-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | We recently added the change in global background reclaim which counts the return value of soft_limit reclaim. Now this patch adds the similar logic on global direct reclaim. We should skip scanning global LRU on shrink_zone if soft_limit reclaim does enough work. This is the first step where we start with counting the nr_scanned and nr_reclaimed from soft_limit reclaim into global scan_control. Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: count the soft_limit reclaim in global background reclaimYing Han2011-05-264-15/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The global kswapd scans per-zone LRU and reclaims pages regardless of the cgroup. It breaks memory isolation since one cgroup can end up reclaiming pages from another cgroup. Instead we should rely on memcg-aware target reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under memory pressure. In the global background reclaim, we do soft reclaim before scanning the per-zone LRU. However, the return value is ignored. This patch is the first step to skip shrink_zone() if soft_limit reclaim does enough work. This is part of the effort which tries to reduce reclaiming pages in global LRU in memcg. The per-memcg background reclaim patchset further enhances the per-cgroup targetting reclaim, which I should have V4 posted shortly. Try running multiple memory intensive workloads within seperate memcgs. Watch the counters of soft_steal in memory.stat. $ cat /dev/cgroup/A/memory.stat | grep 'soft' soft_steal 240000 soft_scan 240000 total_soft_steal 240000 total_soft_scan 240000 This patch: In the global background reclaim, we do soft reclaim before scanning the per-zone LRU. However, the return value is ignored. We would like to skip shrink_zone() if soft_limit reclaim does enough work. Also, we need to make the memory pressure balanced across per-memcg zones, like the logic vm-core. This patch is the first step where we start with counting the nr_scanned and nr_reclaimed from soft_limit reclaim into the global scan_control. Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move enum vm_event_item into a standalone header fileAndrew Morton2011-05-262-61/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | enums are problematic because they cannot be forward-declared: akpm2:/home/akpm> cat t.c enum foo; static inline void bar(enum foo f) { } akpm2:/home/akpm> gcc -c t.c t.c:4: error: parameter 1 ('f') has incomplete type So move the enum's definition into a standalone header file which can be used wherever its definition is needed. Cc: Ying Han <yinghan@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cgroup: remove the ns_cgroupDaniel Lezcano2011-05-2622-287/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and leads to some problems: * cgroup creation is out-of-control * cgroup name can conflict when pids are looping * it is not possible to have a single process handling a lot of namespaces without falling in a exponential creation time * we may want to create a namespace without creating a cgroup The ns_cgroup was replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file. This patch removes the ns_cgroup as suggested in the following thread: https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html The 'cgroup_clone' function is removed because it is no longer used. This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a printk warning users that the feature is planned for removal. Since that time we have heard from XXX users who were affected by this. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cgroups: use flex_array in attach_procBen Blum2011-05-261-9/+24
| | | | | | | | | | | | | | | | | | | | | | | Convert cgroup_attach_proc to use flex_array. The cgroup_attach_proc implementation requires a pre-allocated array to store task pointers to atomically move a thread-group, but asking for a monolithic array with kmalloc() may be unreliable for very large groups. Using flex_array provides the same functionality with less risk of failure. This is a post-patch for cgroup-procs-write.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cgroups: make procs file writableBen Blum2011-05-262-47/+401
| | | | | | | | | | | | | | | | | | | | | Make procs file writable to move all threads by tgid at once. Add functionality that enables users to move all threads in a threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This current implementation makes use of a per-threadgroup rwsem that's taken for reading in the fork() path to prevent newly forking threads within the threadgroup from "escaping" while the move is in progress. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cgroups: add per-thread subsystem callbacksBen Blum2011-05-269-142/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cgroups: read-write lock CLONE_THREAD forking per threadgroupBen Blum2011-05-263-0/+55
| | | | | | | | | | | | | | | | | | | | | | | Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup Add an rwsem that lives in a threadgroup's signal_struct that's taken for reading in the fork path, under CONFIG_CGROUPS. If another part of the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and the other system would both depend on. This is a pre-patch for cgroup-procs-write.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Documentation: configfs examples crash fixJiri Slaby2011-05-262-8/+4
| | | | | | | | | | | | When configfs_register_subsystem() fails, we unregister too many subsystems in configfs_example_init. Decrement i by one to not unregister non-registered subsystem. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Joel Becker <joel.becker@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* getdelays: show average CPU/IO/SWAP/RECLAIM delaysWu Fengguang2011-05-261-13/+20
| | | | | | | | | | | | | | | | | | | | | | | | I find it very handy to show the average delays in milliseconds. Example output (on 100 concurrent dd reading sparse files): CPU count real total virtual total delay total delay average 986 3223509952 3207643301 38863410579 39.415ms IO count delay total delay average 0 0 0ms SWAP count delay total delay average 0 0 0ms RECLAIM count delay total delay average 1059 5131834899 4ms dd: read=0, write=0, cancelled_write=0 Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Mel Gorman <mel@linux.vnet.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Reviewed-by: Satoru Moriya <satoru.moriya@hds.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Documentation/accounting/getdelays.c: handle sendto() failuresAndrew Morton2011-05-261-0/+2
| | | | | | | | | | | | Fixes Documentation/accounting/getdelays.c: In function `get_family_id': Documentation/accounting/getdelays.c:172:14: warning: variable `rc' set but not used [-Wunused-but-set-variable] Reported-by: "Justin P. Mattock" <justinmattock@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Documentation/accounting/getdelays.c: fix unused var warningJustin P. Mattock2011-05-261-3/+0
| | | | | | | | | | | | Fixes Documentation/accounting/getdelays.c: In function `main': Documentation/accounting/getdelays.c:436:7: warning: variable `i' set but not used [-Wunused-but-set-variable] Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Documentation/atomic_ops.txt: avoid volatile in sample codeNikanth Karthikesan2011-05-261-1/+1
| | | | | | | | | As declaring counter as volatile is discouraged, it is best not to use it in sample code as well. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ufs: fix truncated values handling 64 bit metadataDan Carpenter2011-05-262-5/+6
| | | | | | | | | | | | | Originally i_lastfrag was 32 bits but then we added support for handling 64 bit metadata and it became a 64 bit variable. That was during 2007, in 54fb996ac15c "[PATCH] ufs2 write: block allocation update". Unfortunately these casts got left behind so the value got truncated to 32 bit again. [akpm@linux-foundation.org: remove now-unneeded min_t/max_t casting] Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/rtc/rtc-mxc.c: remove defines already included in rtc.hWolfram Sang2011-05-262-10/+4
| | | | | | | | | [akpm@linux-foundation.org: retain the code comments] Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Vladimir Zapolskiy <vzapolskiy@gmail.com> Cc: Alessandro Zummo <alessandro.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/rtc/rtc-pcf50633.c: don't request update IRQLars-Peter Clausen2011-05-261-20/+3
| | | | | | | | | | | | | | Commit 51ba60c5 ("RTC: Cleanup rtc_class_ops->update_irq_enable()") removed the only user of the update IRQ, so there is no need to manage it any more. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Marcelo Roberto Jimenez <mroberto@cpti.cetuc.puc-rio.br> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rtc: add support for spear rtcRajeev Kumar2011-05-263-0/+543
| | | | | | | | Signed-off-by: Rajeev Kumar <rajeev-dlh.kumar@st.com> Signed-off-by: Viresh Kumar <viresh.kumar@st.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/rtc/rtc-mrst.c: use release_mem_region after request_mem_regionJulia Lawall2011-05-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | The memory allocated using request_mem_region should be released using release_mem_region, not release_region. The semantic patch that fixes part of this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression E1,E2,E3; @@ request_mem_region(E1,E2,E3) ... ?- release_region(E1,E2) + release_mem_region(E1,E2) // </smpl> [akpm@linux-foundation.org: use resource_size()] Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rtc: add basic support for ST M41T93 SPI RTCVoss, Nikolaus2011-05-263-0/+235
| | | | | | | | | | | Add basic support for ST m41t93 SPI RTCs. Tested with factory-new and with "run-in" species with and without backup batteries. Signed-off-by: Nikolaus Voss <n.voss@weinmann.de> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rtc: add rv3029c2 RTC supportHeiko Schocher2011-05-263-0/+464
| | | | | | | | | | | | Add support for the Micro Crystal RV3029-C2 RTC chips. Signed-off-by: Heiko Schocher <hs@denx.de> Signed-off-by: Gregory Hermant <gregory.hermant@calao-systems.com> Cc: Wan ZongShun <mcuos.com@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Acked-by: Wolfram Sang <w.sang@pengutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rtc: add EM3027 rtc driverMike Rapoport2011-05-263-0/+171
| | | | | | | | | Add support for EM Microelectronic EM3027 RTC chip. Signed-off-by: Mike Rapoport <mike@compulab.co.il> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rtc: add support for the RTC in VIA VT8500 and compatiblesAlexey Charkov2011-05-263-0/+374
| | | | | | | | | | | | | This adds a driver for the RTC devices in VIA and WonderMedia Systems-on-Chip. Alarm, 1Hz interrupts, reading and setting time are supported. Signed-off-by: Alexey Charkov <alchark@gmail.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Alexey Charkov <alchark@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* flex_array: avoid divisions when accessing elementsJesse Gross2011-05-262-22/+31
| | | | | | | | | | | | | | | | | | On most architectures division is an expensive operation and accessing an element currently requires four of them. This performance penalty effectively precludes flex arrays from being used on any kind of fast path. However, two of these divisions can be handled at creation time and the others can be replaced by a reciprocal divide, completely avoiding real divisions on access. [eparis@redhat.com: rebase on top of changes to support 0 len elements] [eparis@redhat.com: initialize part_nr when array fits entirely in base] Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Eric Paris <eparis@redhat.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* m32r: remove redundant declarationKOSAKI Motohiro2011-05-261-11/+2
| | | | | | | | | They have no meaning. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* m32r: fix spin_lock_irqsave() misuseKOSAKI Motohiro2011-05-261-1/+2
| | | | | | | | | spin_lock_irqsave() requires unsigned long. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* m32r: convert cpumask apiKOSAKI Motohiro2011-05-263-52/+51
| | | | | | | | | | | | | We plan to remove cpus_xx() old cpumask APIs later. Also, we plan to change mm_cpu_mask() implementation, allocate only nr_cpu_ids, thus *mm_cpu_mask() is dangerous operation. Then, this patch convert them. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/bcma/host_pci.c needs slab.hAndrew Morton2011-05-261-0/+1
| | | | | | | | | | | | | alpha allmodconfig: drivers/bcma/host_pci.c: In function 'bcma_host_pci_probe': drivers/bcma/host_pci.c:102: error: implicit declaration of function 'kzalloc' drivers/bcma/host_pci.c:102: warning: assignment makes pointer from integer without a cast Cc: <zajec5@gmail.com> Cc: John W. Linville <linville@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/video/mb862xx/mb862xxfbdrv.c needs uaccess.hAndrew Morton2011-05-261-0/+1
| | | | | | | | | | | | | alpha allmodconfig: drivers/video/mb862xx/mb862xxfbdrv.c: In function 'mb862xxfb_ioctl': drivers/video/mb862xx/mb862xxfbdrv.c:323: error: implicit declaration of function 'copy_to_user' drivers/video/mb862xx/mb862xxfbdrv.c:327: error: implicit declaration of function 'copy_from_user' Cc: Paul Mundt <lethal@linux-sh.org> Cc: Anatolij Gustschin <agust@denx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Set cred->user_ns in key_replace_session_keyringSerge E. Hallyn2011-05-261-0/+1
| | | | | | | | | | | | | | Since this cred was not created with copy_creds(), it needs to get initialized. Otherwise use of syscall(__NR_keyctl, KEYCTL_SESSION_TO_PARENT); can lead to a NULL deref. Thanks to Robert for finding this. But introduced by commit 47a150edc2a ("Cache user_ns in struct cred"). Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Reported-by: Robert Święcki <robert@swiecki.net> Cc: David Howells <dhowells@redhat.com> Cc: stable@kernel.org (2.6.39) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'trivial' of ↵Linus Torvalds2011-05-2640-54/+44
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 * 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: gfs2: Drop __TIME__ usage isdn/diva: Drop __TIME__ usage atm: Drop __TIME__ usage dlm: Drop __TIME__ usage wan/pc300: Drop __TIME__ usage parport: Drop __TIME__ usage hdlcdrv: Drop __TIME__ usage baycom: Drop __TIME__ usage pmcraid: Drop __DATE__ usage edac: Drop __DATE__ usage rio: Drop __DATE__ usage scsi/wd33c93: Drop __TIME__ usage scsi/in2000: Drop __TIME__ usage aacraid: Drop __TIME__ usage media/cx231xx: Drop __TIME__ usage media/radio-maxiradio: Drop __TIME__ usage nozomi: Drop __TIME__ usage cyclades: Drop __TIME__ usage
| * gfs2: Drop __TIME__ usageMichal Marek2011-05-261-1/+1
| | | | | | | | | | | | | | | | | | | | The kernel already prints its build timestamp during boot, no need to repeat it in random drivers and produce different object files each time. Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: cluster-devel@redhat.com Signed-off-by: Michal Marek <mmarek@suse.cz>
| * isdn/diva: Drop __TIME__ usageMichal Marek2011-05-261-3/+2
| | | | | | | | | | | | | | | | | | | | The kernel already prints its build timestamp during boot, no need to repeat it in random drivers and produce different object files each time. Cc: Armin Schindler <mac@melware.de> Cc: netdev@vger.kernel.org Signed-off-by: Michal Marek <mmarek@suse.cz>
| * atm: Drop __TIME__ usageMichal Marek2011-05-262-2/+2
| | | | | | | | | | | | | | | | | | | | The kernel already prints its build timestamp during boot, no need to repeat it in random drivers and produce different object files each time. Acked-by: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Michal Marek <mmarek@suse.cz>
OpenPOWER on IntegriCloud