summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
* oom: replace PF_OOM_ORIGIN with toggling oom_score_adjDavid Rientjes2011-05-253-13/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/compaction: reverse the change that forbade sync migraton with ↵Andrea Arcangeli2011-05-251-1/+1
| | | | | | | | | | | | | | | | | | | __GFP_NO_KSWAPD It's uncertain this has been beneficial, so it's safer to undo it. All other compaction users would still go in synchronous mode if a first attempt at async compaction failed. Hopefully we don't need to force special behavior for THP (which is the only __GFP_NO_KSWAPD user so far and it's the easier to exercise and to be noticeable). This also make __GFP_NO_KSWAPD return to its original strict semantics specific to bypass kswapd, as THP allocations have khugepaged for the async THP allocations/compactions. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Alex Villacis Lasso <avillaci@fiec.espol.edu.ec> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occurKOSAKI Motohiro2011-05-252-2/+3
| | | | | | | | | | | | | Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug doesn't. There is no reason for this. [akpm@linux-foundation.org: fix CONFIG_SMP=n build] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occursKOSAKI Motohiro2011-05-252-6/+7
| | | | | | | | | | | | | | | | Currently, memory hotplug calls setup_per_zone_wmarks() and calculate_zone_inactive_ratio(), but doesn't call setup_per_zone_lowmem_reserve(). It means the number of reserved pages aren't updated even if memory hot plug occur. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, mem-hotplug: fix section mismatch. setup_per_zone_inactive_ratio() ↵KOSAKI Motohiro2011-05-252-4/+4
| | | | | | | | | | | | | | | | | | | | should be __meminit. Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens") introduced invalid section references. Now, setup_per_zone_inactive_ratio() is marked __init and then it can't be referenced from memory hotplug code. This patch marks it as __meminit and also marks caller as __ref. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86,mm: make pagefault killableKOSAKI Motohiro2011-05-251-7/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an oom killing occurs, almost all processes are getting stuck at the following two points. 1) __alloc_pages_nodemask 2) __lock_page_or_retry 1) is not very problematic because TIF_MEMDIE leads to an allocation failure and getting out from page allocator. 2) is more problematic. In an OOM situation, zones typically don't have page cache at all and memory starvation might lead to greatly reduced IO performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly, meaning that a fork bomb may create new process quickly rather than the oom-killer killing it. Then, the system may become livelocked. This patch makes the pagefault interruptible by SIGKILL. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce wait_on_page_locked_killable()KOSAKI Motohiro2011-05-251-0/+11
| | | | | | | | | | | | | | | | commit 2687a356 ("Add lock_page_killable") introduced killable lock_page(). Similarly this patch introdues killable wait_on_page_locked(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: per-node vmstat: show proper vmstatsKOSAKI Motohiro2011-05-251-129/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2ac390370a ("writeback: add /sys/devices/system/node/<node>/vmstat") added vmstat entry. But strangely it only show nr_written and nr_dirtied. # cat /sys/devices/system/node/node20/vmstat nr_written 0 nr_dirtied 0 Of course, It's not adequate. With this patch, the vmstat show all vm stastics as /proc/vmstat. # cat /sys/devices/system/node/node0/vmstat nr_free_pages 899224 nr_inactive_anon 201 nr_active_anon 17380 nr_inactive_file 31572 nr_active_file 28277 nr_unevictable 0 nr_mlock 0 nr_anon_pages 17321 nr_mapped 8640 nr_file_pages 60107 nr_dirty 33 nr_writeback 0 nr_slab_reclaimable 6850 nr_slab_unreclaimable 7604 nr_page_table_pages 3105 nr_kernel_stack 175 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 nr_writeback_temp 0 nr_isolated_anon 0 nr_isolated_file 0 nr_shmem 260 nr_dirtied 1050 nr_written 938 numa_hit 962872 numa_miss 0 numa_foreign 0 numa_interleave 8617 numa_local 962872 numa_other 0 nr_anon_transparent_hugepages 0 [akpm@linux-foundation.org: no externs in .c files] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Rubin <mrubin@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nommu: fix a compile warning in do_mmap_pgoff()Namhyung Kim2011-05-251-3/+3
| | | | | | | | | | | | | | | | | | Because 'ret' is declared as int, not unsigned long, no need to cast the error contants into unsigned long. If you compile this code on a 64-bit machine somehow, you'll see following warning: CC mm/nommu.o mm/nommu.c: In function `do_mmap_pgoff': mm/nommu.c:1411: warning: overflow in implicit constant conversion Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nommu: fix a potential memory leak in do_mmap_private()Namhyung Kim2011-05-251-1/+1
| | | | | | | | | | | | | If f_op->read() fails and sysctl_nr_trim_pages > 1, there could be a memory leak between @region->vm_end and @region->vm_top. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nommu: check the vma list when unmapping file-mapped vmaNamhyung Kim2011-05-251-4/+2
| | | | | | | | | | | | | Now we have the sorted vma list, use it in do_munmap() to check that we have an exact match. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nommu: find vma using the sorted vma listNamhyung Kim2011-05-251-8/+4
| | | | | | | | | | | | | Now we have the sorted vma list, use it in the find_vma[_exact]() rather than doing linear search on the rb-tree. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nommu: don't scan the vma list when deletingNamhyung Kim2011-05-251-7/+8
| | | | | | | | | | | | | | | | Since commit 297c5eee3724 ("mm: make the vma list be doubly linked") made it a doubly linked list, we don't need to scan the list when deleting @vma. And the original code didn't update the prev pointer. Fix it too. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nommu: sort mm->mmap list properlyNamhyung Kim2011-05-254-45/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I was reading nommu code, I found that it handles the vma list/tree in an unusual way. IIUC, because there can be more than one identical/overrapped vmas in the list/tree, it sorts the tree more strictly and does a linear search on the tree. But it doesn't applied to the list (i.e. the list could be constructed in a different order than the tree so that we can't use the list when finding the first vma in that order). Since inserting/sorting a vma in the tree and link is done at the same time, we can easily construct both of them in the same order. And linear searching on the tree could be more costly than doing it on the list, it can be converted to use the list. Also, after the commit 297c5eee3724 ("mm: make the vma list be doubly linked") made the list be doubly linked, there were a couple of code need to be fixed to construct the list properly. Patch 1/6 is a preparation. It maintains the list sorted same as the tree and construct doubly-linked list properly. Patch 2/6 is a simple optimization for the vma deletion. Patch 3/6 and 4/6 convert tree traversal to list traversal and the rest are simple fixes and cleanups. This patch: @vma added into @mm should be sorted by start addr, end addr and VMA struct addr in that order because we may get identical VMAs in the @mm. However this was true only for the rbtree, not for the list. This patch fixes this by remembering 'rb_prev' during the tree traversal like find_vma_prepare() does and linking the @vma via __vma_link_list(). After this patch, we can iterate the whole VMAs in correct order simply by using @mm->mmap list. [akpm@linux-foundation.org: avoid duplicating __vma_link_list()] Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove unused zone_idx variable from set_migratetype_isolateSergey Senozhatsky2011-05-251-2/+0
| | | | | | | | Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mmap: avoid merging cloned VMAsShaohua Li2011-05-251-5/+13
| | | | | | | | | | | | | | | | | Avoid merging a VMA with another VMA which is cloned from the parent process. The cloned VMA shares the anon_vma lock with the parent process's VMA. If we do the merge, more vmas (even the new range is only for current process) use the perent process's anon_vma lock. This introduces scalability issues. find_mergeable_anon_vma() already considers this case. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mmap: avoid unnecessary anon_vma lockShaohua Li2011-05-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | If we only change vma->vm_end, we can avoid taking anon_vma lock even if 'insert' isn't NULL, which is the case of split_vma. As I understand it, we need the lock before because rmap must get the 'insert' VMA when we adjust old VMA's vm_end (the 'insert' VMA is linked to anon_vma list in __insert_vm_struct before). But now this isn't true any more. The 'insert' VMA is already linked to anon_vma list in __split_vma(with anon_vma_clone()) instead of __insert_vm_struct. There is no race rmap can't get required VMAs. So the anon_vma lock is unnecessary, and this can reduce one locking in brk case and improve scalability. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mmap: add alignment for some variablesShaohua Li2011-05-251-3/+7
| | | | | | | | | | | | | Make some variables have correct alignment/section to avoid cache issue. In a workload which heavily does mmap/munmap, the variables will be used frequently. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* arch, mm: filter disallowed nodes from arch specific show_mem functionsDavid Rientjes2011-05-252-17/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | Architectures that implement their own show_mem() function did not pass the filter argument to show_free_areas() to appropriately avoid emitting the state of nodes that are disallowed in the current context. This patch now passes the filter argument to show_free_areas() so those nodes are now avoided. This patch also removes the show_free_areas() wrapper around __show_free_areas() and converts existing callers to pass an empty filter. ia64 emits additional information for each node, so skip_free_areas_zone() must be made global to filter disallowed nodes and it is converted to use a nid argument rather than a zone for this use case. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <jejb@parisc-linux.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: correctly check if reclaimer should schedule during shrink_slabMinchan Kim2011-05-251-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has been reported on some laptops that kswapd is consuming large amounts of CPU and not being scheduled when SLUB is enabled during large amounts of file copying. It is expected that this is due to kswapd missing every cond_resched() point because; shrink_page_list() calls cond_resched() if inactive pages were isolated which in turn may not happen if all_unreclaimable is set in shrink_zones(). If for whatver reason, all_unreclaimable is set on all zones, we can miss calling cond_resched(). balance_pgdat() only calls cond_resched if the zones are not balanced. For a high-order allocation that is balanced, it checks order-0 again. During that window, order-0 might have become unbalanced so it loops again for order-0 and returns that it was reclaiming for order-0 to kswapd(). It can then find that a caller has rewoken kswapd for a high-order and re-enters balance_pgdat() without ever calling cond_resched(). shrink_slab only calls cond_resched() if we are reclaiming slab pages. If there are a large number of direct reclaimers, the shrinker_rwsem can be contended and prevent kswapd calling cond_resched(). This patch modifies the shrink_slab() case. If the semaphore is contended, the caller will still check cond_resched(). After each successful call into a shrinker, the check for cond_resched() remains in case one shrinker is particularly slow. [mgorman@suse.de: preserve call to cond_resched after each call into shrinker] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Tested-by: Colin King <colin.king@canonical.com> Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Chris Mason <chris.mason@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@kernel.org> [2.6.38+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: correct use of pgdat_balanced in sleeping_prematurelyJohannes Weiner2011-05-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are a few reports of people experiencing hangs when copying large amounts of data with kswapd using a large amount of CPU which appear to be due to recent reclaim changes. SLUB using high orders is the trigger but not the root cause as SLUB has been using high orders for a while. The root cause was bugs introduced into reclaim which are addressed by the following two patches. Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") to allow kswapd to go to sleep when balanced for high orders. Patch 2 notes that it is possible for kswapd to miss every cond_resched() and updates shrink_slab() so it'll at least reach that scheduling point. Chris Wood reports that these two patches in isolation are sufficient to prevent the system hanging. AFAIK, they should also resolve similar hangs experienced by James Bottomley. This patch: Johannes Weiner poined out that the logic in commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") is backwards. Instead of allowing kswapd to go to sleep when balancing for high order allocations, it keeps it kswapd running uselessly. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Tested-by: Colin King <colin.king@canonical.com> Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Chris Mason <chris.mason@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> [2.6.38+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slub: Fix double bit unlock in debug modeChristoph Lameter2011-05-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | Commit 442b06bcea23 ("slub: Remove node check in slab_free") added a call to deactivate_slab() in the debug case in __slab_alloc(), which unlocks the current slab used for allocation. Going to the label 'unlock_out' then does it again. Also, in the debug case we do not need all the other processing that the 'unlock_out' path does. We always fall back to the slow path in the debug case. So the tid update is useless. Similarly, ALLOC_SLOWPATH would just be incremented for all allocations. Also a pretty useless thing. So simply restore irq flags and return the object. Signed-off-by: Christoph Lameter <cl@linux.com> Reported-and-bisected-by: James Morris <jmorris@namei.org> Reported-by: Ingo Molnar <mingo@elte.hu> Reported-by: Jens Axboe <jaxboe@fusionio.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6Linus Torvalds2011-05-241-7/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (29 commits) [S390] cpu hotplug: fix external interrupt subclass mask handling [S390] oprofile: dont access lowcore [S390] oprofile: add missing irq stats counter [S390] Ignore sendmmsg system call note wired up warning [S390] s390,oprofile: fix compile error for !CONFIG_SMP [S390] s390,oprofile: fix alert counter increment [S390] Remove unused includes in process.c [S390] get CPC image name [S390] sclp: event buffer dissection [S390] chsc: process channel-path-availability information [S390] refactor page table functions for better pgste support [S390] merge page_test_dirty and page_clear_dirty [S390] qdio: prevent compile warning [S390] sclp: remove unnecessary sendmask check [S390] convert old cpumask API into new one [S390] pfault: cleanup code [S390] pfault: cpu hotplug vs missing completion interrupts [S390] smp: add __noreturn attribute to cpu_die() [S390] percpu: implement arch specific irqsafe_cpu_ops [S390] vdso: disable gcov profiling ...
| * [S390] merge page_test_dirty and page_clear_dirtyMartin Schwidefsky2011-05-231-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The page_clear_dirty primitive always sets the default storage key which resets the access control bits and the fetch protection bit. That will surprise a KVM guest that sets non-zero access control bits or the fetch protection bit. Merge page_test_dirty and page_clear_dirty back to a single function and only clear the dirty bit from the storage key. In addition move the function page_test_and_clear_dirty and page_test_and_clear_young to page.h where they belong. This requires to change the parameter from a struct page * to a page frame number. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | Merge branch 'for-2.6.40' of ↵Linus Torvalds2011-05-241-2/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: Unify input section names percpu: Avoid extra NOP in percpu_cmpxchg16b_double percpu: Cast away printk format warning percpu: Always align percpu output section to PAGE_SIZE Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun
| * \ Merge branch 'fixes-2.6.39' into for-2.6.40Tejun Heo2011-05-248-18/+21
| |\ \
| | * | percpu: Cast away printk format warningMike Frysinger2011-03-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32-bit systems which don't happen to implicitly define or cast VMALLOC_START and/or VMALLOC_END to long in their arch headers, the printk in the percpu code will cause a warning to be emitted: mm/percpu.c: In function 'pcpu_embed_first_chunk': mm/percpu.c:1648: warning: format '%lx' expects type 'long unsigned int', but argument 3 has type 'unsigned int' So add an explicit cast to unsigned long here. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | percpu: Always align percpu output section to PAGE_SIZETejun Heo2011-03-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Percpu allocator honors alignment request upto PAGE_SIZE and both the percpu addresses in the percpu address space and the translated kernel addresses should be aligned accordingly. The calculation of the former depends on the alignment of percpu output section in the kernel image. The linker script macros PERCPU_VADDR() and PERCPU() are used to define this output section and the latter takes @align parameter. Several architectures are using @align smaller than PAGE_SIZE breaking percpu memory alignment. This patch removes @align parameter from PERCPU(), renames it to PERCPU_SECTION() and makes it always align to PAGE_SIZE. While at it, add PCPU_SETUP_BUG_ON() checks such that alignment problems are reliably detected and remove percpu alignment comment recently added in workqueue.c as the condition would trigger BUG way before reaching there. For um, this patch raises the alignment of percpu area. As the area is in .init, there shouldn't be any noticeable difference. This problem was discovered by David Howells while debugging boot failure on mn10300. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Mike Frysinger <vapier@gentoo.org> Cc: uclinux-dist-devel@blackfin.uclinux.org Cc: David Howells <dhowells@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: user-mode-linux-devel@lists.sourceforge.net
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2011-05-231-100/+65
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: Deal with hyperthetical case of PAGE_SIZE > 2M slub: Remove node check in slab_free slub: avoid label inside conditional slub: Make CONFIG_DEBUG_PAGE_ALLOC work with new fastpath slub: Avoid warning for !CONFIG_SLUB_DEBUG slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery slub: Move debug handlign in __slab_free slub: Move node determination out of hotpath slub: Eliminate repeated use of c->page through a new page variable slub: get_map() function to establish map of free objects in a slab slub: Use NUMA_NO_NODE in get_partial slub: Fix a typo in config name
| * \ \ \ Merge branch 'slab/next' into for-linusPekka Enberg2011-05-231-100/+65
| |\ \ \ \ | | |_|_|/ | |/| | | | | | | | | | | | | Conflicts: mm/slub.c
| | * | | slub: Remove node check in slab_freeChristoph Lameter2011-05-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can set the page pointing in the percpu structure to NULL to have the same effect as setting c->node to NUMA_NO_NODE. Gets rid of one check in slab_free() that was only used for forcing the slab_free to the slowpath for debugging. We still need to set c->node to NUMA_NO_NODE to force the slab_alloc() fastpath to the slowpath in case of debugging. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: avoid label inside conditionalDavid Rientjes2011-05-171-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jumping to a label inside a conditional is considered poor style, especially considering the current organization of __slab_alloc(). This removes the 'load_from_page' label and just duplicates the three lines of code that it uses: c->node = page_to_nid(page); c->page = page; goto load_freelist; since it's probably not worth making this a separate helper function. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Make CONFIG_DEBUG_PAGE_ALLOC work with new fastpathChristoph Lameter2011-05-171-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGE_ALLOC may have marked as invalid to retrieve the pointer to the next free object. Use probe_kernel_read in that case in order not to cause a page fault. Cc: <stable@kernel.org> # 38.x Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Avoid warning for !CONFIG_SLUB_DEBUGChristoph Lameter2011-05-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the #ifdef so that get_map is only defined if CONFIG_SLUB_DEBUG is defined. Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Remove CONFIG_CMPXCHG_LOCAL ifdefferyChristoph Lameter2011-05-071-56/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used everywhere. There may be performance implications since: A. We now have to manage a transaction ID for all arches B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced to a very short irqoff section. There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback case we only have to do one disable and enable like before. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Move debug handlign in __slab_freeChristoph Lameter2011-04-171-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Its easier to read if its with the check for debugging flags. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Move node determination out of hotpathChristoph Lameter2011-04-171-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the node does not change then there is no need to recalculate the node from the page struct. So move the node determination into the places where we acquire a new slab page. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Eliminate repeated use of c->page through a new page variableChristoph Lameter2011-04-171-19/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __slab_alloc is full of "c->page" repeats. Lets just use one local variable named "page" for this. Also avoids the need to a have another variable called "new". Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: get_map() function to establish map of free objects in a slabChristoph Lameter2011-04-171-12/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bit map of free objects in a slab page is determined in various functions if debugging is enabled. Provide a common function for that purpose. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Use NUMA_NO_NODE in get_partialChristoph Lameter2011-04-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A -1 was leftover during the conversion. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Fix a typo in config nameLi Zefan2011-04-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no config named SLAB_DEBUG, and it should be a typo of SLUB_DEBUG. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
* | | | | Merge branch 'for-linus' of ↵Linus Torvalds2011-05-232-2/+2
|\ \ \ \ \ | |/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) b43: fix comment typo reqest -> request Haavard Skinnemoen has left Atmel cris: typo in mach-fs Makefile Kconfig: fix copy/paste-ism for dell-wmi-aio driver doc: timers-howto: fix a typo ("unsgined") perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course'). treewide: fix a few typos in comments regulator: change debug statement be consistent with the style of the rest Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations" audit: acquire creds selectively to reduce atomic op overhead rtlwifi: don't touch with treewide double semicolon removal treewide: cleanup continuations and remove logging message whitespace ath9k_hw: don't touch with treewide double semicolon removal include/linux/leds-regulator.h: fix syntax in example code tty: fix typo in descripton of tty_termios_encode_baud_rate xtensa: remove obsolete BKL kernel option from defconfig m68k: fix comment typo 'occcured' arch:Kconfig.locks Remove unused config option. treewide: remove extra semicolons ...
| * | | | Merge branch 'master' into for-nextJiri Kosina2011-04-2645-1359/+2123
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | Fast-forwarded to current state of Linus' tree as there are patches to be applied for files that didn't exist on the old branch.
| * | | | | treewide: remove extra semicolonsJustin P. Mattock2011-04-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
| * | | | | mm: Fix section mismatch for setup_zone_pageset()Nikanth Karthikesan2011-04-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | build_all_zonelists() which is not __meminit, calls setup_zone_pageset(). Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | | | | tmpfs: fix highmem swapoff crash regressionHugh Dickins2011-05-201-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 778dd893ae78 ("tmpfs: fix race between umount and swapoff") forgot the new rules for strict atomic kmap nesting, causing WARNING: at arch/x86/mm/highmem_32.c:81 from __kunmap_atomic(), then BUG: unable to handle kernel paging request at fffb9000 from shmem_swp_set() when shmem_unuse_inode() is handling swapoff with highmem in use. My disgrace again. See https://bugzilla.kernel.org/show_bug.cgi?id=35352 Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | sanitize <linux/prefetch.h> usageLinus Torvalds2011-05-204-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e66eed651fd1 ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h, which uncovered several cases that had apparently relied on that rather obscure header file dependency. So this fixes things up a bit, using grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]') grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]') to guide us in finding files that either need <linux/prefetch.h> inclusion, or have it despite not needing it. There are more of them around (mostly network drivers), but this gets many core ones. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2011-05-191-2/+5
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm: kmemleak: Initialise kmemleak after debug_objects_mem_init() kmemleak: Select DEBUG_FS unconditionally in DEBUG_KMEMLEAK kmemleak: Do not return a pointer to an object that kmemleak did not get
| * | | | | | kmemleak: Do not return a pointer to an object that kmemleak did not getCatalin Marinas2011-05-191-2/+5
| | |/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kmemleak_seq_next() function tries to get an object (and increment its use count) before returning it. If it could not get the last object during list traversal (because it may have been freed), the function should return NULL rather than a pointer to such object that it did not get. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Acked-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: <stable@kernel.org>
* | | | | | memcg: fix zone congestionKAMEZAWA Hiroyuki2011-05-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ZONE_CONGESTED should be a state of global memory reclaim. If not, a busy memcg sets this and give unnecessary throttoling in wait_iff_congested() against memory recalim in other contexts. This makes system performance bad. I'll think about "memcg is congested!" flag is required or not, later. But this fix is required first. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Ying Han <yinghan@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud