summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* rapidio: modify mport ID assignmentAlexandre Bounine2011-03-233-25/+27
| | | | | | | | | | | | | | | | | Changes mport ID and host destination ID assignment to implement unified method common to all mport drivers. Makes "riohdid=" kernel command line parameter common for all architectures with support for more that one host destination ID assignment. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Thomas Moll <thomas.moll@sysgo.com> Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rapidio: modify subsystem and driver initialization sequenceAlexandre Bounine2011-03-234-13/+6
| | | | | | | | | | | | | | | | Subsystem initialization sequence modified to support presence of multiple RapidIO controllers in the system. The new sequence is compatible with initialization of PCI devices. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Thomas Moll <thomas.moll@sysgo.com> Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rapidio: modify configuration to support PCI-SRIO controllerAlexandre Bounine2011-03-237-5/+34
| | | | | | | | | | | | | | | | 1. Add an option to include RapidIO support if the PCI is available. 2. Add FSL_RIO configuration option to enable controller selection. 3. Add RapidIO support option into x86 and MIPS architectures. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Acked-by: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Thomas Moll <thomas.moll@sysgo.com> Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rapidio: add architecture specific callbacksAlexandre Bounine2011-03-234-45/+75
| | | | | | | | | | | | | | | | | | | | | | This set of patches eliminates RapidIO dependency on PowerPC architecture and makes it available to other architectures (x86 and MIPS). It also enables support of new platform independent RapidIO controllers such as PCI-to-SRIO and PCI Express-to-SRIO. This patch: Extend number of mport callback functions to eliminate direct linking of architecture specific mport operations. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Thomas Moll <thomas.moll@sysgo.com> Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rapidio: add RapidIO documentationAlexandre Bounine2011-03-232-0/+263
| | | | | | | | | | | | Add RapidIO documentation files as it was discussed earlier (see thread http://marc.info/?l=linux-kernel&m=129202338918062&w=2) Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rapidio: add new sysfs attributesAlexandre Bounine2011-03-231-1/+40
| | | | | | | | | | | | | | | | | | | | | | | | | Add new sysfs attributes. 1. Routing information required to to reach the RIO device: destid - device destination ID (real for for endpoint, route for switch) hopcount - hopcount for maintenance requests (switches only) 2. device linking information: lprev - name of device that precedes the given device in the enumeration or discovery order (displayed along with of the port to which it is attached). lnext - names of devices (with corresponding port numbers) that are attached to the given device as next in the enumeration or discovery order (switches only) Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Thomas Moll <thomas.moll@sysgo.com> Cc: Micha Nelissen <micha@neli.hopto.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/char/mem.c: clean up the codeChangli Gao2011-03-231-4/+1
| | | | | | | | Reduce the lines of code and simplify the logic. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/staging/tty/specialix.c: convert func_enter to func_exitJulia Lawall2011-03-231-4/+4
| | | | | | | | | | | | | | | | | | | | | Convert calls to func_enter on leaving a function to func_exit. The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ @@ - func_enter(); + func_exit(); return...; // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Roger Wolff <R.E.Wolff@BitWizard.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/tty/bfin_jtag_comm.c: avoid calling put_tty_driver on NULLJulia Lawall2011-03-231-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | put_tty_driver calls tty_driver_kref_put on its argument, and then tty_driver_kref_put calls kref_put on the address of a field of this argument. kref_put checks for NULL, but in this case the field is likely to have some offset and so the result of taking its address will not be NULL. Labels are added to be able to skip over the call to put_tty_driver when the argument will be NULL. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression *x; @@ *if (x == NULL) { ... * put_tty_driver(x); ... return ...; } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Torben Hohn <torbenh@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/char: add MSM smd_pkt driverNiranjana Vishwanathapura2011-03-233-0/+475
| | | | | | | | | | | | Add smd_pkt driver which provides device interface to smd packet ports. Signed-off-by: Niranjana Vishwanathapura <nvishwan@codeaurora.org> Cc: Brian Swetland <swetland@google.com> Cc: Greg KH <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Brown <davidb@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/char/ipmi/ipmi_si_intf.c: fix cleanup_one_si section mismatchSergey Senozhatsky2011-03-231-1/+1
| | | | | | | | | | | | | commit d2478521afc2022 ("char/ipmi: fix OOPS caused by pnp_unregister_driver on unregistered driver") introduced a section mismatch by calling __exit cleanup_ipmi_si from __devinit init_ipmi_si. Remove __exit annotation from cleanup_ipmi_si. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: protect mm start_code/end_code in /proc/pid/statKees Cook2011-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | While mm->start_stack was protected from cross-uid viewing (commit f83ce3e6b02d5 ("proc: avoid information leaks to non-privileged processes")), the start_code and end_code values were not. This would allow the text location of a PIE binary to leak, defeating ASLR. Note that the value "1" is used instead of "0" for a protected value since "ps", "killall", and likely other readers of /proc/pid/stat, take start_code of "0" to mean a kernel thread and will misbehave. Thanks to Brad Spengler for pointing this out. Addresses CVE-2011-0726 Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: <stable@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Eugene Teo <eugeneteo@kernel.sg> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Brad Spengler <spender@grsecurity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: make struct proc_dir_entry::namelen unsigned intAlexey Dobriyan2011-03-232-5/+5
| | | | | | | | | | | | | | | | | | | | | | | 1. namelen is declared "unsigned short" which hints for "maybe space savings". Indeed in 2.4 struct proc_dir_entry looked like: struct proc_dir_entry { unsigned short low_ino; unsigned short namelen; Now, low_ino is "unsigned int", all savings were gone for a long time. "struct proc_dir_entry" is not that countless to worry about it's size, anyway. 2. converting from unsigned short to int/unsigned int can only create problems, we better play it safe. Space is not really conserved, because of natural alignment for the next field. sizeof(struct proc_dir_entry) remains the same. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* procfs: fix some wrong error code usageJovi Zhang2011-03-231-2/+5
| | | | | | | | | | | | | | [root@wei 1]# cat /proc/1/mem cat: /proc/1/mem: No such process error code -ESRCH is wrong in this situation. Return -EPERM instead. Signed-off-by: Jovi Zhang <bookjovi@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* procfs: fix /proc/<pid>/maps heap checkAaro Koskinen2011-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code fails to print the "[heap]" marking if the heap is split into multiple mappings. Fix the check so that the marking is displayed in all possible cases: 1. vma matches exactly the heap 2. the heap vma is merged e.g. with bss 3. the heap vma is splitted e.g. due to locked pages Test cases. In all cases, the process should have mapping(s) with [heap] marking: (1) vma matches exactly the heap #include <stdio.h> #include <unistd.h> #include <sys/types.h> int main (void) { if (sbrk(4096) != (void *)-1) { printf("check /proc/%d/maps\n", (int)getpid()); while (1) sleep(1); } return 0; } # ./test1 check /proc/553/maps [1] + Stopped ./test1 # cat /proc/553/maps | head -4 00008000-00009000 r-xp 00000000 01:00 3113640 /test1 00010000-00011000 rw-p 00000000 01:00 3113640 /test1 00011000-00012000 rw-p 00000000 00:00 0 [heap] 4006f000-40070000 rw-p 00000000 00:00 0 (2) the heap vma is merged #include <stdio.h> #include <unistd.h> #include <sys/types.h> char foo[4096] = "foo"; char bar[4096]; int main (void) { if (sbrk(4096) != (void *)-1) { printf("check /proc/%d/maps\n", (int)getpid()); while (1) sleep(1); } return 0; } # ./test2 check /proc/556/maps [2] + Stopped ./test2 # cat /proc/556/maps | head -4 00008000-00009000 r-xp 00000000 01:00 3116312 /test2 00010000-00012000 rw-p 00000000 01:00 3116312 /test2 00012000-00014000 rw-p 00000000 00:00 0 [heap] 4004a000-4004b000 rw-p 00000000 00:00 0 (3) the heap vma is splitted (this fails without the patch) #include <stdio.h> #include <unistd.h> #include <sys/mman.h> #include <sys/types.h> int main (void) { if ((sbrk(4096) != (void *)-1) && !mlockall(MCL_FUTURE) && (sbrk(4096) != (void *)-1)) { printf("check /proc/%d/maps\n", (int)getpid()); while (1) sleep(1); } return 0; } # ./test3 check /proc/559/maps [1] + Stopped ./test3 # cat /proc/559/maps|head -4 00008000-00009000 r-xp 00000000 01:00 3119108 /test3 00010000-00011000 rw-p 00000000 01:00 3119108 /test3 00011000-00012000 rw-p 00000000 00:00 0 [heap] 00012000-00013000 rw-p 00000000 00:00 0 [heap] It looks like the bug has been there forever, and since it only results in some information missing from a procfile, it does not fulfil the -stable "critical issue" criteria. Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: hide kernel addresses via %pK in /proc/<pid>/stackKonstantin Khlebnikov2011-03-231-1/+1
| | | | | | | | | | | This file is readable for the task owner. Hide kernel addresses from unprivileged users, leave them function names and offsets. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Kees Cook <kees.cook@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuset: hold callback_mutex in cpuset_post_clone()Li Zefan2011-03-231-0/+2
| | | | | | | | | | | | | | | | Chaning cpuset->mems/cpuset->cpus should be protected under callback_mutex. cpuset_clone() doesn't follow this rule. It's ok because it's called when creating and initializing a cgroup, but we'd better hold the lock to avoid subtil break in the future. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuset: fix unchecked calls to NODEMASK_ALLOC()Li Zefan2011-03-231-35/+16
| | | | | | | | | | | | | | | | Those functions that use NODEMASK_ALLOC() can't propagate errno to users, but will fail silently. Fix it by using a static nodemask_t variable for each function, and those variables are protected by cgroup_mutex; [akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_attach()Li Zefan2011-03-231-5/+2
| | | | | | | | | | | | | oldcs->mems_allowed is not modified during cpuset_attach(), so we don't have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it to cpuset_migrate_mm(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_sprintf_memlist()Li Zefan2011-03-231-16/+8
| | | | | | | | | | | | | | | | | It's not necessary to copy cpuset->mems_allowed to a buffer allocated by NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf(). As spotted by Paul, a side effect is we fix a bug that the function can return -ENOMEM but the caller doesn't expect negative return value. Therefore change the return value of cpuset_sprintf_cpulist() and cpuset_sprintf_memlist() from int to size_t. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: give current access to memory reserves if it's trying to dieDavid Rientjes2011-03-231-0/+11
| | | | | | | | | | | | | | | | | | | When a memcg is oom and current has already received a SIGKILL, then give it access to memory reserves with a higher scheduling priority so that it may quickly exit and free its memory. This is identical to the global oom killer and is done even before checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is selected is guaranteed to have come from userspace; the thread only needs access to memory reserves to exit and thus we don't unnecessarily panic the machine until the kernel has no last resort to free memory. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: fix leak on wrong LRU with FUSEKAMEZAWA Hiroyuki2011-03-231-18/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fs/fuse/dev.c::fuse_try_move_page() does (1) remove a page by ->steal() (2) re-add the page to page cache (3) link the page to LRU if it was not on LRU at (1) This implies the page is _on_ LRU when it's added to radix-tree. So, the page is added to memory cgroup while it's on LRU. because LRU is lazy and no one flushs it. This is the same behavior as SwapCache and needs special care as - remove page from LRU before overwrite pc->mem_cgroup. - add page to LRU after overwrite pc->mem_cgroup. And we need to taking care of pagevec. If PageLRU(page) is set before we add PCG_USED bit, the page will not be added to memcg's LRU (in short period). So, regardlress of PageLRU(page) value before commit_charge(), we need to check PageLRU(page) after commit_charge(). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Balbir Singh <balbir@in.ibm.com> Reported-by: Daniel Poelzleithner <poelzi@poelzi.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: page_cgroup array is never stored on reserved pagesMichal Hocko2011-03-231-5/+5
| | | | | | | | | | | | | | KAMEZAWA Hiroyuki noted that free_pages_cgroup doesn't have to check for PageReserved because we never store the array on reserved pages (neither alloc_pages_exact nor vmalloc use those pages). So we can replace the check by a BUG_ON. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page_cgroup: reduce allocation overhead for page_cgroup array for ↵Michal Hocko2011-03-231-22/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CONFIG_SPARSEMEM Currently we are allocating a single page_cgroup array per memory section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected. This is correct but memory inefficient solution because the allocated memory (unless we fall back to vmalloc) is not kmalloc friendly: - 32b - 16384 entries (20B per entry) fit into 327680B so the 524288B slab cache is used - 32b with PAE - 131072 entries with 2621440B fit into 4194304B - 64b - 32768 entries (40B per entry) fit into 2097152 cache This is ~37% wasted space per memory section and it sumps up for the whole memory. On a x86_64 machine it is something like 6MB per 1GB of RAM. We can reduce the internal fragmentation by using alloc_pages_exact which allocates PAGE_SIZE aligned blocks so we will get down to <4kB wasted memory per section which is much better. We still need a fallback to vmalloc because we have no guarantees that we will have a continuous memory of that size (order-10) later on during the hotplug events. [hannes@cmpxchg.org: do not define unused free_page_cgroup() without memory hotplug] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memcontrol.c: suppress uninitialized-var warning with older gcc'sAndrew Morton2011-03-231-1/+1
| | | | | | | | | | | | | | | mm/memcontrol.c: In function 'mem_cgroup_force_empty': mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function It's a false positive. Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: use native word page statistics countersJohannes Weiner2011-03-231-29/+59
| | | | | | | | | | | | | | | | | | The statistic counters are in units of pages, there is no reason to make them 64-bit wide on 32-bit machines. Make them native words. Since they are signed, this leaves 31 bit on 32-bit machines, which can represent roughly 8TB assuming a page size of 4k. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: break out event counters from other statsJohannes Weiner2011-03-231-12/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | For increasing and decreasing per-cpu cgroup usage counters it makes sense to use signed types, as single per-cpu values might go negative during updates. But this is not the case for only-ever-increasing event counters. All the counters have been signed 64-bit so far, which was enough to count events even with the sign bit wasted. This patch: - divides s64 counters into signed usage counters and unsigned monotonically increasing event counters. - converts unsigned event counters into 'unsigned long' rather than 'u64'. This matches the type used by the /proc/vmstat event counters. The next patch narrows the signed usage counters type (on 32-bit CPUs, that is). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: unify charge/uncharge quantities to units of pagesJohannes Weiner2011-03-231-70/+65
| | | | | | | | | | | | | | | There is no clear pattern when we pass a page count and when we pass a byte count that is a multiple of PAGE_SIZE. We never charge or uncharge subpage quantities, so convert it all to page counts. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: convert uncharge batching from bytes to page granularityJohannes Weiner2011-03-232-10/+12
| | | | | | | | | | | We never uncharge subpage quantities. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: convert per-cpu stock from bytes to page granularityJohannes Weiner2011-03-231-11/+13
| | | | | | | | | | | We never keep subpage quantities in the per-cpu stock. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: keep only one charge cancelling functionJohannes Weiner2011-03-231-13/+9
| | | | | | | | | | | | | We have two charge cancelling functions: one takes a page count, the other a page size. The second one just divides the parameter by PAGE_SIZE and then calls the first one. This is trivial, no need for an extra function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: remove memcg->reclaim_param_lockJohannes Weiner2011-03-231-17/+1
| | | | | | | | | | | | | The reclaim_param_lock is only taken around single reads and writes to integer variables and is thus superfluous. Drop it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: charged pages always have valid per-memcg zone infoJohannes Weiner2011-03-231-3/+0
| | | | | | | | | | | | | page_cgroup_zoneinfo() will never return NULL for a charged page, remove the check for it in mem_cgroup_get_reclaim_stat_from_page(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: remove direct page_cgroup-to-page pointerJohannes Weiner2011-03-234-55/+117
| | | | | | | | | | | | | | | | | | | In struct page_cgroup, we have a full word for flags but only a few are reserved. Use the remaining upper bits to encode, depending on configuration, the node or the section, to enable page_cgroup-to-page lookups without a direct pointer. This saves a full word for every page in a system with memory cgroups enabled. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: condense page_cgroup-to-page lookup pointsJohannes Weiner2011-03-231-15/+23
| | | | | | | | | | | | | | | | | | | | | The per-cgroup LRU lists string up 'struct page_cgroup's. To get from those structures to the page they represent, a lookup is required. Currently, the lookup is done through a direct pointer in struct page_cgroup, so a lot of functions down the callchain do this lookup by themselves instead of receiving the page pointer from their callers. The next patch removes this pointer, however, and the lookup is no longer that straight-forward. In preparation for that, this patch only leaves the non-optional lookups when coming directly from the LRU list and passes the page down the stack. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: fold __mem_cgroup_move_account into callerJohannes Weiner2011-03-232-42/+29
| | | | | | | | | | | | | | | It is one logical function, no need to have it split up. Also, get rid of some checks from the inner function that ensured the sanity of the outer function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: change page_cgroup_zoneinfo signatureJohannes Weiner2011-03-232-20/+9
| | | | | | | | | | | | | | | | | | | | Instead of passing a whole struct page_cgroup to this function, let it take only what it really needs from it: the struct mem_cgroup and the page. This has the advantage that reading pc->mem_cgroup is now done at the same place where the ordering rules for this pointer are enforced and explained. It is also in preparation for removing the pc->page backpointer. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: no uncharged pages reach page_cgroup_zoneinfoJohannes Weiner2011-03-231-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch series removes the direct page pointer from struct page_cgroup, which saves 20% of per-page memcg memory overhead (Fedora and Ubuntu enable memcg per default, openSUSE apparently too). The node id or section number is encoded in the remaining free bits of pc->flags which allows calculating the corresponding page without the extra pointer. I ran, what I think is, a worst-case microbenchmark that just cats a large sparse file to /dev/null, because it means that walking the LRU list on behalf of per-cgroup reclaim and looking up pages from page_cgroups is happening constantly and at a high rate. But it made no measurable difference. A profile reported a 0.11% share of the new lookup_cgroup_page() function in this benchmark. This patch: All callsites check PCG_USED before passing pc->mem_cgroup, so the latter is never NULL. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: add memcg sanity checks at allocating and freeing pagesDaisuke Nishimura2011-03-233-2/+69
| | | | | | | | | | | | | | | | | | | Add checks at allocating or freeing a page whether the page is used (iow, charged) from the view point of memcg. This check may be useful in debugging a problem and we did similar checks before the commit 52d4b9ac(memcg: allocate all page_cgroup at boot). This patch adds some overheads at allocating or freeing memory, so it's enabled only when CONFIG_DEBUG_VM is enabled. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: remove NULL check from lookup_page_cgroup() resultJohannes Weiner2011-03-231-4/+1
| | | | | | | | | | | | | The page_cgroup array is set up before even fork is initialized. I seriously doubt that this code executes before the array is alloc'd. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: remove impossible conditional when committingJohannes Weiner2011-03-231-4/+0
| | | | | | | | | | | | | No callsite ever passes a NULL pointer for a struct mem_cgroup * to the committing function. There is no need to check for it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: remove unused page flag bitfield definesJohannes Weiner2011-03-231-7/+0
| | | | | | | | | | | | | These definitions have been unused since '4b3bde4 memcg: remove the overhead associated with the root cgroup'. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: simplify the way memory limits are checkedJohannes Weiner2011-03-232-90/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | Since transparent huge pages, checking whether memory cgroups are below their limits is no longer enough, but the actual amount of chargeable space is important. To not have more than one limit-checking interface, replace memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a single memory_cgroup_margin() that returns the chargeable space and leaves the comparison to the callsite. Soft limits are now checked the other way round, by using the already existing function that returns the amount by which soft limits are exceeded: res_counter_soft_limit_excess(). Also remove all the corresponding functions on the res_counter side that are now no longer used. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: soft limit reclaim should end at limit not belowJohannes Weiner2011-03-232-3/+3
| | | | | | | | | | | | | | Soft limit reclaim continues until the usage is below the current soft limit, but the documented semantics are actually that soft limit reclaim will push usage back until the soft limits are met again. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: fix ugly initialization of return value is in callerKAMEZAWA Hiroyuki2011-03-234-5/+9
| | | | | | | | | | | | | | | | | | | | | | | Remove initialization of vaiable in caller of memory cgroup function. Actually, it's return value of memcg function but it's initialized in caller. Some memory cgroup uses following style to bring the result of start function to the end function for avoiding races. mem_cgroup_start_A(&(*ptr)) /* Something very complicated can happen here. */ mem_cgroup_end_A(*ptr) In some calls, *ptr should be initialized to NULL be caller. But it's ugly. This patch fixes that *ptr is initialized by _start function. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: res_counter_read_u64(): fix potential races on 32-bit machinesKAMEZAWA Hiroyuki2011-03-231-0/+14
| | | | | | | | | | | | | | res_counter_read_u64 reads u64 value without lock. It's dangerous in a 32bit environment. Add locking. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bitops: remove minix bitops from asm/bitops.hAkinobu Mita2011-03-2327-110/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | minix bit operations are only used by minix filesystem and useless by other modules. Because byte order of inode and block bitmaps is different on each architecture like below: m68k: big-endian 16bit indexed bitmaps h8300, microblaze, s390, sparc, m68knommu: big-endian 32 or 64bit indexed bitmaps m32r, mips, sh, xtensa: big-endian 32 or 64bit indexed bitmaps for big-endian mode little-endian bitmaps for little-endian mode Others: little-endian bitmaps In order to move minix bit operations from asm/bitops.h to architecture independent code in minix filesystem, this provides two config options. CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k. CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu, m32r, mips, sh, xtensa). The architectures which always use little-endian bitmaps do not select these options. Finally, we can remove minix bit operations from asm/bitops.h for all architectures. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Michal Simek <monstr@monstr.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Hirokazu Takata <takata@linux-m32r.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* m68k: remove inline asm from minix_find_first_zero_bitAkinobu Mita2011-03-231-7/+3
| | | | | | | | | | | | | As a preparation for moving minix bit operations from asm/bitops.h to architecture independent code in minix filesystem, this removes inline asm from minix_find_first_zero_bit() for m68k. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bitops: remove ext2 non-atomic bitops from asm/bitops.hAkinobu Mita2011-03-2325-78/+6
| | | | | | | | | | | | As the result of conversions, there are no users of ext2 non-atomic bit operations except for ext2 filesystem itself. Now we can put them into architecture independent code in ext2 filesystem, and remove from asm/bitops.h for all architectures. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dm: use little-endian bitopsAkinobu Mita2011-03-231-4/+4
| | | | | | | | | | | As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud