path: root/mm/page_alloc.c
Commit message | Author | Age | Files | Lines
* [PATCH] printk() should not be called under zone->lock | Kirill Korotaev | 2006-06-23 | 1 | -4/+5
This patch fixes printk() being called under zone->lock in show_free_areas(). Calling printk() under this lock can be unsafe, since the console code it invokes can try to allocate or free memory and self-deadlock on this lock. I found memory allocation and freeing in both netconsole and the serial console. This issue was hit in practice when meminfo was periodically printed for debug purposes and netconsole was used.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
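The general pattern behind this kind of fix can be sketched in userspace C: copy the counters while the lock is held, release the lock, and only then format the output. This is a minimal sketch, not the kernel patch itself. The names `zone_sim`, `zone_lock_sim()`, `zone_unlock_sim()`, and `show_free_areas_sim()` are hypothetical, the lock is a plain flag standing in for zone->lock, and the 4 kB base page size is an assumption.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NR_ORDERS_SIM 11

/* Hypothetical stand-in for a zone: a lock flag plus per-order free counts. */
struct zone_sim {
    int lock_held;                          /* stand-in for zone->lock */
    unsigned long nr_free[NR_ORDERS_SIM];
};

static void zone_lock_sim(struct zone_sim *z)   { z->lock_held = 1; }
static void zone_unlock_sim(struct zone_sim *z) { z->lock_held = 0; }

/*
 * Snapshot the counters under the lock, then format them after the lock
 * has been dropped, so the (possibly allocating) output path never runs
 * with the lock held. Returns the number of characters written to buf.
 */
int show_free_areas_sim(struct zone_sim *z, char *buf, size_t len)
{
    unsigned long snap[NR_ORDERS_SIM];
    size_t used = 0;
    int order;

    zone_lock_sim(z);
    memcpy(snap, z->nr_free, sizeof(snap));
    zone_unlock_sim(z);

    /* Formatting happens strictly after the unlock. */
    for (order = 0; order < NR_ORDERS_SIM; order++) {
        int n = snprintf(buf + used, len - used, "%lu*%lukB ",
                         snap[order], 4UL << order);
        if (n < 0 || (size_t)n >= len - used)
            break;                          /* buffer full: stop early */
        used += (size_t)n;
    }
    return (int)used;
}
```

The cost of this approach is a small on-stack copy; in exchange, the printing path is free to allocate or free memory without any risk of re-taking the lock.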
* [PATCH] initialise total_memory() earlier | Andrew Morton | 2006-06-23 | 1 | -3/+3
Initialise total_memory earlier in boot, because if for some reason we run page reclaim early in boot, we don't want total_memory to be zero when we use it as a divisor.

Also rename total_memory to vm_total_pages to avoid naming clashes with architectures.

Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] squash duplicate page_to_pfn and pfn_to_page | Andy Whitcroft | 2006-06-23 | 1 | -30/+2
We have architectures where page_to_pfn and pfn_to_page contribute enough to overall image size that they wish to push them out of line. However, in the process we have grown a second copy of the implementation of each of these routines for each memory model. Share the implementation, exposing it either inline or out-of-line as required.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] wait_table and zonelist initializing for memory hotadd: update zonelists | Yasunori Goto | 2006-06-23 | 1 | -5/+21
In the current code, a zonelist is assumed to be built once and never modified. But memory hotplug can add a new zone/pgdat, so the zonelists must be updated.

This patch modifies build_all_zonelists() so that it can reconfigure a pgdat's zonelists. To update them safely, this patch uses stop_machine_run(), which keeps other CPUs from touching the zonelists while they are being updated.

In the old version (V2 of node hotadd), the kernel updated them after zone initialization. But present_pages of the new zone is still 0 at that point, because online_page() has not been called yet, and build_zonelists() checks present_pages to find populated zones. That was too early, so I moved the update to after online_pages().

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] wait_table and zonelist initializing for memory hotadd: wait_table initialization | Yasunori Goto | 2006-06-23 | 1 | -6/+53
The wait_table is sized according to zone size at boot time. But we cannot know the maximum zone size when memory hotplug is enabled, because it can change, and resizing the wait_table is hard. So the kernel allocates and initializes the wait_table at its maximum size.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] wait_table and zonelist initializing for memory hotadd: add return code for init_current_empty_zone | Yasunori Goto | 2006-06-23 | 1 | -3/+8
When add_zone() is called against an empty (unpopulated) zone, we have to initialize the zone, which was not initialized at boot time. But init_currently_empty_zone() may fail because it allocates the wait table. So this patch catches its error code.

Changes to the wait_table are in the next patch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] wait_table and zonelist initializing for memory hotadd: change to meminit for build_zonelist | Yasunori Goto | 2006-06-23 | 1 | -9/+9
Change the definitions of some functions and data from __init to __meminit. With this patch, these functions and data remain available after boot so that the hot-add code can use them.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] wait_table and zonelist initializing for memory hotadd: change name of wait_table_size() | Yasunori Goto | 2006-06-23 | 1 | -5/+7
This just renames wait_table_size() to wait_table_hash_nr_entries().

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] zone handle unaligned zone boundaries | Andy Whitcroft | 2006-06-23 | 1 | -6/+11
The buddy allocator has a requirement that boundaries between contiguous zones occur aligned with the MAX_ORDER ranges. Where they do not, we will incorrectly merge pages across zone boundaries. This can lead to pages from the wrong zone being handed out.

Originally the buddy allocator would check that buddies were in the same zone by referencing the zone start and end page frame numbers. This was removed as it became very expensive and the buddy allocator already made the assumption that zone boundaries were aligned.

It is clear that not all configurations and architectures are honouring this alignment requirement. Therefore it seems safest to reintroduce support for non-aligned zone boundaries. This patch introduces a new check when considering a page as a buddy: it compares the zone_table index for the two pages and refuses to merge the pages where they do not match. The zone_table index is unique for each node/zone combination when FLATMEM/DISCONTIGMEM is enabled and for each section/zone combination when SPARSEMEM is enabled (a SPARSEMEM section is at least a MAX_ORDER size).

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
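The check described above can be illustrated with a small userspace sketch. It is not the kernel code: `page_sim`, `find_buddy_index_sim()`, and `page_is_buddy_sim()` are hypothetical names, and the single `zone_id` field stands in for the zone_table index. The buddy of a block is found by flipping the bit for its order in the page index, and a merge is refused when the two pages carry different zone identifiers.

```c
/* Hypothetical per-page state; zone_id stands in for the zone_table index. */
struct page_sim {
    int zone_id;    /* which node/zone (or section/zone) the page belongs to */
    int is_free;    /* nonzero if the page heads a free block */
    int order;      /* order of that free block */
};

/* The buddy of the block starting at page_idx differs only in bit 'order'. */
unsigned long find_buddy_index_sim(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/*
 * A candidate is a valid buddy only if it is in the same zone as the page
 * being freed, is itself free, and heads a block of the same order.
 */
int page_is_buddy_sim(const struct page_sim *page,
                      const struct page_sim *buddy, int order)
{
    if (page->zone_id != buddy->zone_id)
        return 0;   /* never merge across a zone boundary */
    return buddy->is_free && buddy->order == order;
}
```

Comparing one small identifier per candidate is cheaper than the older range check against the zone's start and end page frame numbers, which is presumably why this form was chosen.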
* [PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary | Bob Picco | 2006-05-21 | 1 | -3/+11
Andy added code to the buddy allocator which does not require the zone's endpoints to be aligned to MAX_ORDER. An issue is that the buddy allocator requires the node_mem_map's endpoints to be MAX_ORDER aligned. Otherwise __page_find_buddy could compute a buddy not in node_mem_map for partial MAX_ORDER regions at the zone's endpoints. page_is_buddy will detect that these pages at the endpoints are not PG_buddy (they were zeroed out by the bootmem allocator and are not part of the zone). Of course the negative here is that we could waste a little memory, but the positive is eliminating all the old checks for zone boundary conditions.

SPARSEMEM won't encounter this issue because of the MAX_ORDER size constraint when SPARSEMEM is configured. ia64 VIRTUAL_MEM_MAP doesn't need the logic either because the holes and endpoints are handled differently. This leaves checking alloc_remap and other arches which privately allocate for node_mem_map.

Signed-off-by: Bob Picco <bob.picco@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
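The alignment itself is simple mask arithmetic, sketched below under stated assumptions: the block size of 1024 pages and the names `MAX_ORDER_NR_PAGES_SIM`, `align_start_down_sim()`, and `align_end_up_sim()` are illustrative and not taken from the patch. The start of the map is rounded down and the end rounded up, so every buddy index computed for a partial block still falls inside the map.

```c
/* Assumed number of pages in one maximal buddy block; must be a power of two. */
#define MAX_ORDER_NR_PAGES_SIM 1024UL

/* Round a starting page frame number down to a block boundary. */
unsigned long align_start_down_sim(unsigned long pfn)
{
    return pfn & ~(MAX_ORDER_NR_PAGES_SIM - 1);
}

/* Round an ending page frame number up to a block boundary. */
unsigned long align_end_up_sim(unsigned long pfn)
{
    return (pfn + MAX_ORDER_NR_PAGES_SIM - 1) & ~(MAX_ORDER_NR_PAGES_SIM - 1);
}
```

With these values, a node spanning page frames 1500 to 3000 would get a map covering 1024 to 3072; the extra entries are the "little memory" the message mentions wasting.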
* [PATCH] Cpuset: might sleep checking zones allowed fix | Paul Jackson | 2006-05-21 | 1 | -2/+3
Fix a couple of infrequently encountered 'sleeping function called from invalid context' warnings in the cpuset hooks in __alloc_pages. The code could sleep while interrupts were disabled.

The routine cpuset_zone_allowed() is called by code in mm/page_alloc.c __alloc_pages() to determine if a zone is allowed in the current task's cpuset. This routine can sleep, for certain GFP_KERNEL allocations, if the zone is on a memory node not allowed in the current cpuset, but might be allowed in a parent cpuset.

But we can't sleep in __alloc_pages() if in interrupt, nor if called for a GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).

The rule was intended to be:

    Don't call cpuset_zone_allowed() if you can't sleep, unless you pass in the __GFP_HARDWALL flag set in gfp_flag, which disables the code that might scan up ancestor cpusets and sleep.

This rule was being violated in a couple of places, due to a bogus change made (by myself, pj) to __alloc_pages() as part of the November 2005 effort to clean up its logic, and also due to a later fix to constrain which swap daemons were awoken.

The bogus change can be seen at:

    http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
    [PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags

This was first noticed on a tight memory system, in code that was disabling interrupts and doing allocation requests with __GFP_WAIT not set, which resulted in __might_sleep() writing complaints to the log "Debug: sleeping function called ...", when the code in cpuset_zone_allowed() tried to take the callback_sem cpuset semaphore.

We haven't seen a system hang on this 'might_sleep' yet, but we are at decent risk of seeing it fairly soon, especially since the additional cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in March 2006.

Special thanks to Dave Chinner, for figuring this out, and a tip of the hat to Nick Piggin who warned me of this back in Nov 2005, before I was ready to listen.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] setup_per_zone_pages_min() overflow fix | Andrew Morton | 2006-05-15 | 1 | -4/+7
As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=6490, this function can experience overflows on 32-bit machines, causing our response to changed values of min_free_kbytes to go whacky.

Fixing it efficiently is all too hard, so fix it with 64-bit math instead.

Cc: Ake Sandgren <ake.sandgren@hpc2n.umu.se>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
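The class of overflow described here can be reproduced with a short sketch. The exact expression in the kernel is not quoted in this message, so the proportional formula below (a zone's share of a global minimum, scaled by its page count) is an assumption for illustration, as are the names `zone_min_32_sim()` and `zone_min_64_sim()`. The point is only that the intermediate product must be computed in 64 bits before dividing.

```c
#include <stdint.h>

/* 32-bit arithmetic: the product can wrap before the division happens. */
uint32_t zone_min_32_sim(uint32_t pages_min, uint32_t zone_pages,
                         uint32_t lowmem_pages)
{
    return pages_min * zone_pages / lowmem_pages;
}

/* 64-bit intermediate: the product is exact, then narrowed after dividing. */
uint32_t zone_min_64_sim(uint32_t pages_min, uint32_t zone_pages,
                         uint32_t lowmem_pages)
{
    uint64_t tmp = (uint64_t)pages_min * zone_pages;

    return (uint32_t)(tmp / lowmem_pages);
}
```

For example, with pages_min = 16384 and a single zone of 1,000,000 pages out of 1,000,000 lowmem pages, the product is about 1.6e10 and exceeds 2^32, so the 32-bit version returns a wrapped value while the 64-bit version returns the expected 16384. Inside the kernel, 64-bit division on 32-bit machines typically goes through a helper rather than the plain `/` operator used in this userspace sketch.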
* [PATCH] Remove __devinit and __cpuinit from notifier_call definitions | Chandra Seetharaman | 2006-04-26 | 1 | -1/+1
A few of the notifier_chain_register() callers use __init in the definition of notifier_call. This is incorrect, as the function definition should remain available after initialization (the callers do not unregister the notifiers during initialization).

This patch fixes all such usages to _not_ have the notifier_call in the __init section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] page_alloc.c: buddy handling cleanup | Andrew Morton | 2006-04-19 | 1 | -4/+6
Fix up some whitespace damage.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] overcommit: add calculate_totalreserve_pages() | Hideo AOKI | 2006-04-11 | 1 | -0/+39
These patches are an enhancement of the OVERCOMMIT_GUESS algorithm in __vm_enough_memory().

- why the kernel needed patching

  When the kernel can't allocate anonymous pages in practice, the current OVERCOMMIT_GUESS could return success. This implementation might be the cause of OOM kills under memory pressure.

  If Linux runs with page reservation features like /proc/sys/vm/lowmem_reserve_ratio and without a swap region, I think OOM kills occur easily.

- the overall design approach in the patch

  When the OVERCOMMIT_GUESS algorithm calculates the number of free pages, the reserved free pages are regarded as non-free pages.

  This change helps to avoid the pitfall that the number of free pages becomes less than the number which the kernel tries to keep free.

- testing results

  I tested the patches using my test kernel module.

  If the patches aren't applied to the kernel, __vm_enough_memory() returns success in this situation but the actual page allocation fails.

  On the other hand, if the patches are applied to the kernel, the memory allocation failure is avoided since __vm_enough_memory() returns failure in this situation.

  I checked that on an i386 SMP machine with 16GB of memory. I haven't tested on a nommu environment so far.

This patch adds totalreserve_pages for __vm_enough_memory().

calculate_totalreserve_pages() checks the maximum lowmem_reserve pages and pages_high in each zone. Finally, the function stores the sum over all zones in totalreserve_pages.

totalreserve_pages is calculated when the VM is initialized, and the variable is updated when /proc/sys/vm/lowmem_reserve_ratio or /proc/sys/vm/min_free_kbytes is changed.

Signed-off-by: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
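One reading of the calculation described above is sketched below. It is an interpretation of the message, not the kernel function: the struct `reserve_zone_sim`, the fixed array size, and the choice to add the largest lowmem_reserve entry to pages_high per zone are assumptions, and any further adjustments the real function may apply are not modelled.

```c
#define MAX_NR_ZONES_SIM 4

/* Hypothetical per-zone reserve data. */
struct reserve_zone_sim {
    unsigned long pages_high;                          /* high watermark */
    unsigned long lowmem_reserve[MAX_NR_ZONES_SIM];    /* per-class reserve */
};

/*
 * For each zone, take the largest lowmem_reserve entry, add pages_high,
 * and sum the result over all zones.
 */
unsigned long calc_totalreserve_sim(const struct reserve_zone_sim *zones,
                                    int nr_zones)
{
    unsigned long total = 0;
    int i, j;

    for (i = 0; i < nr_zones; i++) {
        unsigned long max = 0;

        for (j = 0; j < MAX_NR_ZONES_SIM; j++) {
            if (zones[i].lowmem_reserve[j] > max)
                max = zones[i].lowmem_reserve[j];
        }
        total += max + zones[i].pages_high;
    }
    return total;
}
```

A caller estimating free memory would then subtract this total from the raw free-page count, which is the sense in which reserved pages are "regarded as non-free".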
* [PATCH] Fix buddy list race that could lead to page lru list corruptions | Nick Piggin | 2006-04-10 | 1 | -13/+18
Rohit found an obscure bug causing buddy list corruption.

page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0) to determine whether or not a free page's buddy is itself free and in the buddy lists.

Each of the conjuncts may be true at different times due to unrelated conditions, so the non-atomic page_is_buddy test may find each conjunct to be true even if they were not both true at the same time (i.e. the page was not on the buddy lists).

Signed-off-by: Martin Bligh <mbligh@google.com>
Signed-off-by: Rohit Seth <rohitseth@google.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] for_each_online_pgdat: remove pgdat_list | KAMEZAWA Hiroyuki | 2006-03-27 | 1 | -4/+4
By using for_each_online_pgdat(), pgdat_list is not necessary now. This patch removes it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] for_each_online_pgdat: renaming for_each_pgdat | KAMEZAWA Hiroyuki | 2006-03-27 | 1 | -3/+3
Replace for_each_pgdat() with for_each_online_pgdat().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove zone_mem_map | KAMEZAWA Hiroyuki | 2006-03-27 | 1 | -4/+2
This patch removes zone_mem_map.

pfn_to_page uses pgdat, page_to_pfn uses zone. page_to_pfn, which is the only user of zone_mem_map, can use pgdat instead of zone. By modifying it, we can remove zone_mem_map.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] unify pfn_to_page: generic functions | KAMEZAWA Hiroyuki | 2006-03-27 | 1 | -0/+42
There are 3 memory models: FLATMEM, DISCONTIGMEM, and SPARSEMEM. Each arch has its own page_to_pfn() and pfn_to_page() for each model. But most of them can use the same arithmetic.

This patch adds asm-generic/memory_model.h, which includes generic page_to_pfn() and pfn_to_page() definitions for each memory model.

When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are used instead of macros. This is enabled by some archs and reduces text size.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
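For the simplest of the three models, the shared arithmetic amounts to pointer offsetting within one flat array of page descriptors. The sketch below models only that flat case in userspace; `page_desc_sim`, `mem_map_sim`, `PFN_OFFSET_SIM`, and the two function names are hypothetical, and the offset of 16 is an arbitrary illustrative value for the first page frame number covered by the array.

```c
/* Hypothetical page descriptor; the contents do not matter here. */
struct page_desc_sim {
    unsigned long flags;
};

/* Assumed first page frame number covered by the flat array. */
#define PFN_OFFSET_SIM 16UL

/* One contiguous array of descriptors, one per page frame. */
static struct page_desc_sim mem_map_sim[64];

/* Page frame number to descriptor: index into the array. */
struct page_desc_sim *pfn_to_page_sim(unsigned long pfn)
{
    return mem_map_sim + (pfn - PFN_OFFSET_SIM);
}

/* Descriptor to page frame number: pointer difference plus the offset. */
unsigned long page_to_pfn_sim(const struct page_desc_sim *page)
{
    return (unsigned long)(page - mem_map_sim) + PFN_OFFSET_SIM;
}
```

The two functions are exact inverses for any page frame number inside the array. The other two models need an extra lookup (per node or per section) before the same kind of offsetting, which this sketch does not attempt.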
* [PATCH] fix alloc_large_system_hash() roundup | John Hawkes | 2006-03-25 | 1 | -2/+1
The "rounded up to nearest power of 2 in size" algorithm in alloc_large_system_hash is not correct. As coded, it takes an otherwise acceptable power-of-2 value and doubles it. For example, we see the error if we boot with thash_entries=2097152, which produces a hash table with 4194304 entries.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
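The symptom described above is what one gets from rounding up with "one more than the floor of log2". The original code is not quoted in this message, so `roundup_buggy_sim()` below is only a plausible reconstruction of that kind of mistake, and `roundup_pow2_sim()` is one straightforward correct alternative; both names and the helper `log2_floor_sim()` are hypothetical.

```c
/* Index of the highest set bit; n must be nonzero. */
static int log2_floor_sim(unsigned long n)
{
    int bit = 0;

    while (n >>= 1)
        bit++;
    return bit;
}

/* Doubles values that are already powers of two: 2^21 becomes 2^22. */
unsigned long roundup_buggy_sim(unsigned long n)
{
    return 1UL << (log2_floor_sim(n) + 1);
}

/*
 * Smallest power of two that is >= n. Assumes n is at most the largest
 * power of two representable in unsigned long, so the shift cannot wrap.
 */
unsigned long roundup_pow2_sim(unsigned long n)
{
    unsigned long p = 1;

    while (p < n)
        p <<= 1;
    return p;
}
```

With the value from the message, the buggy variant maps 2097152 to 4194304, while the corrected variant leaves 2097152 unchanged and still rounds 2097153 up to 4194304.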
* [PATCH] quieten zone_pcp_init | Anton Blanchard | 2006-03-25 | 1 | -2/+3
In zone_pcp_init we print out all zones even if they are empty:

    On node 0 totalpages: 245760
      DMA zone: 245760 pages, LIFO batch:31
      DMA32 zone: 0 pages, LIFO batch:0
      Normal zone: 0 pages, LIFO batch:0
      HighMem zone: 0 pages, LIFO batch:0

To conserve dmesg space, why not print only the non-zero zones.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpusets: only wakeup kswapd for zones in the current cpuset | Christoph Lameter | 2006-03-24 | 1 | -1/+2
If we get under some memory pressure in a cpuset (we only scan zones that are in the cpuset for memory) then kswapd is woken up for all zones. This patch only wakes up kswapd in zones that are part of the current cpuset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] drain_node_pages: interrupt latency reduction / optimization | Christoph Lameter | 2006-03-22 | 1 | -4/+8
1. Only disable interrupts if there is actually something to free.
2. Only dirty the pcp cacheline if we actually freed something.
3. Disable interrupts for each single pcp and not for cleaning all the pcps in all zones of a node.

drain_node_pages is called every 2 seconds from cache_reap. This fix should avoid most disabling of interrupts.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: prep_zero_page() in irq is a bug | Andrew Morton | 2006-03-22 | 1 | -0/+5
prep_zero_page() uses KM_USER0 and hence may not be used from IRQ context, at least for highmem pages.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: cleanup prep_ stuff | Nick Piggin | 2006-03-22 | 1 | -17/+18
Move the prep_ stuff into prep_new_page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove set_page_count() outside mm/ | Nick Piggin | 2006-03-22 | 1 | -8/+6
set_page_count usage outside mm/ is limited to setting the refcount to 1. Remove set_page_count from outside mm/, and replace those users with init_page_count() and set_page_refcounted().

This allows more debug checking, and tighter control on how code is allowed to play around with page->_count.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: nommu use compound pages | Nick Piggin | 2006-03-22 | 1 | -7/+0
Now that compound page handling is properly fixed in the VM, move nommu over to using compound pages rather than rolling their own refcounting.

nommu vm page refcounting is broken anyway, but there is no need to have divergent code in the core VM now, nor when it gets fixed.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: David Howells <dhowells@redhat.com> (Needs testing, please).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: cleanup bootmem | Nick Piggin | 2006-03-22 | 1 | -13/+7
The bootmem code added to page_alloc.c duplicated some page freeing code that it really doesn't need to because it is not so performance critical.

While we're here, make prefetching work properly by actually prefetching the page we're about to use before prefetching ahead to the next one (i.e. get the most important transaction started first). Also prefetch just a single page ahead rather than leaving a gap of 16.

Jack Steiner reported no problems with SGI's ia64 simulator.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: split highorder pages | Nick Piggin | 2006-03-22 | 1 | -0/+22
Have an explicit mm call to split higher order pages into individual pages. Should help to avoid bugs and be more explicit about the code's intention.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: page_alloc less atomics | Nick Piggin | 2006-03-22 | 1 | -2/+2
More atomic operation removal from the page allocator.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] __get_page_state() cpumask cleanup and fix | Andrew Morton | 2006-03-22 | 1 | -11/+9
__get_page_state() has an open-coded for_each_cpu_mask() loop in it. Tidy that up, then notice that the code was buggy:

    while (cpu < NR_CPUS) {
            unsigned long *in, *out, off;

            if (!cpu_isset(cpu, *cpumask))
                    continue;

an obvious infinite loop. I guess we just never call it with a holey cpu mask.

Even after my cpumask size-reduction work, this patch increases code size :(

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
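The bug comes from `continue` in a `while` loop skipping the statement that advances the counter. A `for` loop (which is what an iterator macro typically expands to) avoids this, because its increment expression runs even after `continue`. The sketch below shows the safe shape in plain C; the 8-CPU limit, the bitmask representation, and the name `sum_counters_sim()` are assumptions for illustration only.

```c
#define NR_CPUS_SIM 8

/* Sum the per-CPU counters of every CPU whose bit is set in cpumask. */
unsigned long sum_counters_sim(unsigned int cpumask,
                               const unsigned long counters[NR_CPUS_SIM])
{
    unsigned long total = 0;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS_SIM; cpu++) {
        if (!(cpumask & (1u << cpu)))
            continue;   /* the for-loop's cpu++ still runs, so no hang */
        total += counters[cpu];
    }
    return total;
}
```

A mask with holes, such as 0xA5 (CPUs 0, 2, 5, and 7), is exactly the input that would have hung the original `while` form and is handled normally here.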
* [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. | Christoph Lameter | 2006-03-09 | 1 | -9/+8
The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For machines with a large number of nodes (such as Altix) this may lead to sporadic delays of around 10ms. Interrupts are disabled while reclaiming, creating unacceptable delays.

This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spent in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for a while longer.

Details:

1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages.
2. Add additional functions init_reap_node and next_reap_node for NUMA that manage a per cpu reap_node counter.
3. Add a reap_alien function that reaps only from the current reap_node.

For us this seems to be a critical issue. Holdoffs of an average of about 7ms cause some HPC benchmarks to slow down significantly. For example, NAS parallel slows down dramatically: it has a 12-16 second runtime without the rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
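The rotor idea, visiting one node per pass and advancing round-robin, can be sketched as follows. This is not the kernel's next_reap_node(): the function name `next_reap_node_sim()`, the bitmask of online nodes, and the explicit `max_nodes` parameter are assumptions made to keep the example self-contained.

```c
/*
 * Return the next online node after 'current', wrapping around at
 * max_nodes. Returns 'current' itself if it is the only online node,
 * or -1 if no node is online. Assumes max_nodes <= 32.
 */
int next_reap_node_sim(int current, unsigned int online_mask, int max_nodes)
{
    int step;

    for (step = 1; step <= max_nodes; step++) {
        int node = (current + step) % max_nodes;

        if (online_mask & (1u << node))
            return node;
    }
    return -1;
}
```

Each reaping pass would drain only the node returned here and then store it as the new current node, so a full sweep over N nodes takes N passes. That is the trade-off the message describes: shorter interrupt holdoffs per pass, longer time to drain everything.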
* [PATCH] Terminate process that fails on a constrained allocation | Christoph Lameter | 2006-02-20 | 1 | -1/+1
Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down).

It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages.

This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
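The check described in the last paragraph can be sketched as a set comparison: collect the nodes that appear in the zonelist and see whether they cover every online node. The representation below (an array of node ids, one per zonelist entry, and a bitmask of online nodes) and the name `allocation_is_constrained_sim()` are assumptions for illustration, not the kernel's data structures.

```c
/*
 * zonelist_nodes[i] is the node that the i-th zone in the zonelist
 * belongs to. Returns 1 if some online node is missing from the
 * zonelist (a constrained allocation), 0 otherwise.
 * Assumes node ids are in the range 0..31.
 */
int allocation_is_constrained_sim(const int *zonelist_nodes, int nr_entries,
                                  unsigned int online_mask)
{
    unsigned int seen = 0;
    int i;

    for (i = 0; i < nr_entries; i++)
        seen |= 1u << zonelist_nodes[i];

    return (seen & online_mask) != online_mask;
}
```

When this returns 1, a shortage only says that the permitted nodes are exhausted, so terminating the requesting process is more targeted than selecting a victim system-wide.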
* [PATCH] Handle holes in node mask in node fallback list setup | Linus Torvalds | 2006-02-17 | 1 | -11/+11
Change the find_next_best_node algorithm to correctly skip over holes in the node online mask. Previously it would not handle missing nodes correctly and cause crashes at boot.

[Written by Linus, tested by AK]

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] compound page: default destructor | Hugh Dickins | 2006-02-14 | 1 | -1/+8
Somehow I imagined that calling a NULL destructor would free a compound page rather than oopsing. No, we must supply a default destructor, __free_pages_ok using the order noted by prep_compound_page. hugetlb can still replace this as before with its own free_huge_page pointer.

The case that needs this is not common: rarely does put_compound_page's put_page_testzero bring the count down to 0. But if get_user_pages is applied to some part of a compound page, without immediate release (e.g. AIO or Infiniband), then it's possible for its put_page to come after the containing vma has been unmapped and the driver done its free_pages.

That's just the kind of case compound pages are supposed to be guarding against (but Nick points out, nor did PageReserved handle this right).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] compound page: use page[1].lru | Hugh Dickins | 2006-02-14 | 1 | -9/+6
If a compound page has its own put_page_testzero destructor (the only current example is free_huge_page), that is noted in page[1].mapping of the compound page. But that's rather a poor place to keep it: functions which call set_page_dirty_lock after get_user_pages (e.g. Infiniband's __ib_umem_release) ought to be checking first, otherwise set_page_dirty is liable to crash on what's not the address of a struct address_space.

And now I'm about to make that worse: it turns out that every compound page needs a destructor, so we can no longer rely on hugetlb pages going their own special way, to avoid further problems of page->mapping reuse. For example, not many people know that: on 50% of i386 -Os builds, the first tail page of a compound page purports to be PageAnon (when its destructor has an odd address), which surprises page_add_file_rmap.

Keep the compound page destructor in page[1].lru.next instead. And to free up the common pairing of mapping and index, also move compound page order from index to lru.prev. Slab reuses page->lru too: but if we ever need slab to use compound pages, it can easily stack its use above this.

(akpm: decoded version of the above: the tail pages of a compound page now have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]() caller to check that they're not compound pages before doing the dirty).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] percpu data: only iterate over possible CPUs | Eric Dumazet | 2006-02-05 | 1 | -4/+6
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of cpudata, instead of allocating memory only for possible cpus.

As a preparation for changing that, we need to convert various 0 -> NR_CPUS loops to use for_each_cpu().

(The above only applies to users of asm-generic/percpu.h. powerpc has gone it alone and is presently only allocating memory for present CPUs, so it's currently corrupting memory).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] __cpuinit functions wrongly marked __meminit | Ashok Raj | 2006-02-01 | 1 | -3/+3
__meminit has been applied overzealously and has crept its way into marking cpuup callbacks as __meminit.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Zone reclaim: Reclaim logic | Christoph Lameter | 2006-01-18 | 1 | -3/+14
Some bits for zone reclaim exist in 2.6.15 but they are not usable. This patch fixes them up, removes unused code and makes zone reclaim usable.

Zone reclaim allows the reclaiming of pages from a zone if the number of free pages falls below the watermarks even if other zones still have enough pages available. Zone reclaim is of particular importance for NUMA machines. It can be more beneficial to reclaim a page than to take the performance penalties that come with allocating a page on a remote zone.

Zone reclaim is enabled if the maximum distance to another node is higher than RECLAIM_DISTANCE, which may be defined by an arch. By default RECLAIM_DISTANCE is 20. 20 is the distance to another node in the same component (enclosure or motherboard) on IA64. The meaning of the NUMA distance information seems to vary by arch.

If zone reclaim is not successful then no further reclaim attempts will occur for a certain time period (ZONE_RECLAIM_INTERVAL).

This patch was discussed before. See

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86_64: add __meminit for memory hotplugMatt Tolentino2006-01-161-7/+7
| | | | | | | | | | | Add __meminit to the __init lineup to ensure functions default to __init when memory hotplug is not enabled. Replace __devinit with __meminit on functions that were changed when the memory hotplug code was introduced. Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] memmap_init_zone(): remove uneccesary page++Greg Ungerer2006-01-121-1/+1
| | | | | | | | Remove unnecessary page++ from memmap_init_zone loop. Signed-off-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: gfp_atomic commentsPaul Jackson2006-01-111-1/+2
| | | | | | | | | Clarify in comments that GFP_ATOMIC means both "don't sleep" and "use emergency pools", hence both ALLOC_HARDER and ALLOC_HIGH. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Restore KERN_EMERG to each line printed by bad_pageHugh Dickins2006-01-111-3/+3
| | | | | | Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] fix/simplify mutex debugging codeDavid Woodhouse2006-01-111-1/+1
| | | | | | | | | | Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as arguments instead, since all its callers were just calculating the 'to' address for themselves anyway... (and sometimes doing so badly). Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
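A minimal sketch of why an (addr, len) interface is harder to misuse than (from, to): the end address is derived in one place instead of by every caller. The helper lock_in_freed_range() is hypothetical and only mimics the kind of range test such a check needs. It is not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if `lock` lies inside the freed region [addr, addr + len).
 * Comparing the offset against len avoids computing addr + len at all,
 * so no caller can get the end address wrong.
 */
static bool lock_in_freed_range(const void *lock, const void *addr, size_t len)
{
    uintptr_t l = (uintptr_t)lock;
    uintptr_t start = (uintptr_t)addr;

    return l >= start && l - start < len;
}
```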
* [PATCH] mutex subsystem, more debugging codeIngo Molnar2006-01-091-0/+3
| | | | | | | | more mutex debugging: check for held locks during memory freeing, task exit, enable sysrq printouts, etc. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org>
* [PATCH] cpuset: memory pressure meterPaul Jackson2006-01-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a simple per-cpuset metric of memory pressure, tracking the -rate- that the tasks in a cpuset call try_to_free_pages(), the synchronous (direct) memory reclaim code. This enables batch managers monitoring jobs running in dedicated cpusets to efficiently detect what level of memory pressure that job is causing. This is useful both on tightly managed systems running a wide mix of submitted jobs, which may choose to terminate or reprioritize jobs that are trying to use more memory than allowed on the nodes assigned them, and with tightly coupled, long running, massively parallel scientific computing jobs that will dramatically fail to meet required performance goals if they start to use more memory than allowed to them. This patch just provides a very economical way for the batch manager to monitor a cpuset for signs of memory pressure. It's up to the batch manager or other user code to decide what to do about it and take action. ==> Unless this feature is enabled by writing "1" to the special file /dev/cpuset/memory_pressure_enabled, the hook in the rebalance code of __alloc_pages() for this metric reduces to simply noticing that the cpuset_memory_pressure_enabled flag is zero. So only systems that enable this feature will compute the metric. Why a per-cpuset, running average: Because this meter is per-cpuset, rather than per-task or mm, the system load imposed by a batch scheduler monitoring this metric is sharply reduced on large systems, because a scan of the tasklist can be avoided on each set of queries. Because this meter is a running average, instead of an accumulating counter, a batch scheduler can detect memory pressure with a single read, instead of having to read and accumulate results for a period of time. 
Because this meter is per-cpuset rather than per-task or mm, the batch scheduler can obtain the key information, memory pressure in a cpuset, with a single read, rather than having to query and accumulate results over all the (dynamically changing) set of tasks in the cpuset. A per-cpuset simple digital filter (requires a spinlock and 3 words of data per-cpuset) is kept, and updated by any task attached to that cpuset, if it enters the synchronous (direct) page reclaim code. A per-cpuset file provides an integer number representing the recent (half-life of 10 seconds) rate of direct page reclaims caused by the tasks in the cpuset, in units of reclaims attempted per second, times 1000. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
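The "simple digital filter" with a 10-second half-life can be approximated by a per-second exponential decay. The sketch below is a simplified userspace illustration and not the patch's code: the struct, the function names, and the constants FM_SCALE and FM_COEF are assumptions chosen so that 0.933^10 is roughly 0.5, and the spinlock mentioned in the message is omitted.

```c
#include <assert.h>

#define FM_SCALE 1000 /* assumed scale: result is events per second times 1000 */
#define FM_COEF  933  /* assumed per-second decay factor, scaled by FM_SCALE */

/* Three words of state, matching the message's description of the filter. */
struct fmeter_sketch {
    long cnt;  /* events recorded since the last update */
    long val;  /* filtered event rate, scaled by FM_SCALE */
    long time; /* second at which val was last updated */
};

/* Record one direct-reclaim event. */
static void fmeter_mark(struct fmeter_sketch *m)
{
    m->cnt++;
}

/* Decay val once per elapsed second, then fold in the pending events. */
static void fmeter_update(struct fmeter_sketch *m, long now)
{
    long ticks = now - m->time;

    if (ticks <= 0)
        return;
    /* A production version would bound this loop for very long idle gaps. */
    while (ticks-- > 0)
        m->val = (FM_COEF * m->val) / FM_SCALE;
    m->time = now;
    m->val += (FM_SCALE - FM_COEF) * m->cnt;
    m->cnt = 0;
}

/* Single read of the current rate, as a batch manager would perform. */
static long fmeter_read(struct fmeter_sketch *m, long now)
{
    fmeter_update(m, now);
    return m->val;
}
```

Because the state is a running average, one call to fmeter_read() yields the recent rate, and after ten idle seconds the reported value has dropped to about half.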
* [PATCH] mm: free_pages optNick Piggin2006-01-081-28/+30
| | | | | | | | | | | | | | | Try to streamline free_pages_bulk by ensuring callers don't pass in a 'count' that exceeds the list size. Some cleanups: Rename __free_pages_bulk to __free_one_page. Put the page list manipulation from __free_pages_ok into free_one_page. Make __free_pages_ok static. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: cleanup zone_pcpNick Piggin2006-01-081-8/+8
| | | | | | | | | | | | | Use zone_pcp everywhere even though NUMA code "knows" the internal details of the zone. Stop other people trying to copy, and it looks nicer. Also, only print the pagesets of online cpus in zoneinfo. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Seth, Rohit" <rohit.seth@intel.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Make high and batch sizes of per_cpu_pagelists configurableRohit Seth2006-01-081-0/+49
| | | | | | | | | | | | | | | | | | | Recently there has been a lot of traffic on the right values for the batch and high water marks of per_cpu_pagelists. This patch makes these two variables configurable through the /proc interface. A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry controls the maximum fraction of pages in each zone that may be allocated to each per cpu page list. The min value for this is 8. It means that we don't allow more than 1/8th of pages in each zone to be allocated in any single per_cpu_pagelist. The batch value of each per cpu pagelist is also updated as a result. It is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8). Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
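The arithmetic behind percpu_pagelist_fraction can be worked through in a short userspace sketch. It assumes 4 KiB pages (PAGE_SHIFT of 12) and a floor of 1 on the batch value, neither of which is stated in the message. struct pcp_sketch and set_pagelist_fraction() are hypothetical names.

```c
#include <assert.h>

#define PAGE_SHIFT 12             /* assumption: 4 KiB pages */
#define MIN_PAGELIST_FRACTION 8   /* minimum fraction quoted in the commit message */

/* Hypothetical holder for the two per-cpu pagelist tunables. */
struct pcp_sketch {
    unsigned long high;  /* high water mark, in pages */
    unsigned long batch; /* pages moved per refill or drain */
};

/*
 * Derive high and batch from the zone size and the fraction.
 * Returns 0 on success, -1 if the fraction is below the minimum.
 */
static int set_pagelist_fraction(struct pcp_sketch *pcp,
                                 unsigned long zone_pages,
                                 unsigned long fraction)
{
    if (fraction < MIN_PAGELIST_FRACTION)
        return -1;

    pcp->high = zone_pages / fraction;  /* at most 1/fraction of the zone */
    pcp->batch = pcp->high / 4;         /* batch is set to high/4 ... */
    if (pcp->batch > PAGE_SHIFT * 8)
        pcp->batch = PAGE_SHIFT * 8;    /* ... capped at PAGE_SHIFT * 8 */
    if (pcp->batch < 1)
        pcp->batch = 1;                 /* assumed floor so batch is never 0 */
    return 0;
}
```

For a zone of 1,000,000 pages and a fraction of 8, this gives high = 125000 and a batch capped at 96. For a zone of 800 pages the cap does not apply and batch is 25.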