summaryrefslogtreecommitdiffstats
path: root/mm/slab.c
Commit message (Collapse)AuthorAgeFilesLines
* [PATCH] slab: Fix kmem_cache_destroy() on NUMARoland Dreier2006-05-161-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With CONFIG_NUMA set, kmem_cache_destroy() may fail and say "Can't free all objects." The problem is caused by sequences such as the following (suppose we are on a NUMA machine with two nodes, 0 and 1): * Allocate an object from cache on node 0. * Free the object on node 1. The object is put into node 1's alien array_cache for node 0. * Call kmem_cache_destroy(), which ultimately ends up in __cache_shrink(). * __cache_shrink() does drain_cpu_caches(), which loops through all nodes. For each node it drains the shared array_cache and then handles the alien array_cache for the other node. However this means that node 0's shared array_cache will be drained, and then node 1 will move the contents of its alien[0] array_cache into that same shared array_cache. node 0's shared array_cache is never looked at again, so the objects left there will appear to be in use when __cache_shrink() calls __node_shrink() for node 0. So __node_shrink() will return 1 and kmem_cache_destroy() will fail. This patch fixes this by having drain_cpu_caches() do drain_alien_cache() on every node before it does drain_array() on the nodes' shared array_caches. The problem was originally reported by Or Gerlitz <ogerlitz@voltaire.com>. Signed-off-by: Roland Dreier <rolandd@cisco.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] add slab_is_available() routine for boot codeMike Kravetz2006-05-151-0/+8
| | | | | | | | | | slab_is_available() indicates slab based allocators are available for use. SPARSEMEM code needs to know this as it can be called at various times during the boot process. Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix crash on __drain_alien_cahce() during CPU Hotplugshin, jacob2006-04-281-1/+2
| | | | | | | | transfer_objects should only be called when all of the cpus in the node are online. CPU_DEAD notifier callback marks l3->shared to NULL. Signed-off-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Remove __devinit and __cpuinit from notifier_call definitionsChandra Seetharaman2006-04-261-1/+1
| | | | | | | | | | | | | Few of the notifier_chain_register() callers use __init in the definition of notifier_call. It is incorrect as the function definition should be available after the initializations (they do not unregister them during initializations). This patch fixes all such usages to _not_ have the notifier_call __init section. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] nommu: use compound page in slab allocatorLuke Yang2006-04-111-0/+7
| | | | | | | | | | | | | | The earlier patch to consolidate mmu and nommu page allocation and refcounting by using compound pages for nommu allocations had a bug: kmalloc slabs who's pages were initially allocated by a non-__GFP_COMP allocator could be passed into mm/nommu.c kmalloc allocations which really wanted __GFP_COMP underlying pages. Fix that by having nommu pass __GFP_COMP to all higher order slab allocations. Signed-off-by: Luke Yang <luke.adi@gmail.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: add statistics for alien cache overflowsRavikiran G Thirumalai2006-04-111-4/+10
| | | | | | | | | | | | Add a statistics counter which is incremented everytime the alien cache overflows. alien_cache limit is hardcoded to 12 right now. We can use this statistics to tune alien cache if needed in the future. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: allocate node local memory for off-slab slabmanagementRavikiran G Thirumalai2006-04-111-3/+6
| | | | | | | | | | | Allocate off-slab slab descriptors from node local memory. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* BUG_ON() Conversion in mm/slab.cEric Sesterhenn2006-04-021-12/+6
| | | | | | | | this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
* [PATCH] for_each_possible_cpu: fixes for generic partKAMEZAWA Hiroyuki2006-03-281-2/+2
| | | | | | | | replaces for_each_cpu with for_each_possible_cpu(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix memory leak in alloc_kmemlistChristoph Lameter2006-03-251-5/+28
| | | | | | | | | | | | | | | | | | | | | | | | | We have had this memory leak for a while now. The situation is complicated by the use of alloc_kmemlist() as a function to resize various caches by do_tune_cpucache(). What we do here is first of all make sure that we deallocate properly in the loop over all the nodes. If we are just resizing caches then we can simply return with -ENOMEM if an allocation fails. If the cache is new then we need to rollback and remove all earlier allocations. We detect that a cache is new by checking if the link to the global cache chain has been setup. This is a bit hackish .... (also fix up too overlong lines that I added in the last patch...) Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] alloc_kmemlist: Some cleanup in preparation for a real memory leak fixChristoph Lameter2006-03-251-17/+17
| | | | | | | | | | | | | | | | | | | | Inspired by Jesper Juhl's patch from today 1. Get rid of err We do not set it to anything else but zero. 2. Drop the CONFIG_NUMA stuff. There are definitions for alloc_alien_cache and free_alien_cache() that do the right thing for the non NUMA case. 3. Better naming of variables. 4. Remove redundant cachep->nodelists[node] expressions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: Bypass free lists for __drain_alien_cache()Christoph Lameter2006-03-251-2/+9
| | | | | | | | | | | | | | | | | | __drain_alien_cache() currently drains objects by freeing them to the (remote) freelists of the original node. However, each node also has a shared list containing objects to be used on any processor of that node. We can avoid a number of remote node accesses by copying the pointers to the free objects directly into the remote shared array. And while we are at it: Skip alien draining if the alien cache spinlock is already taken. Kiran reported that this is a performance benefit. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: add transfer_objects() functionChristoph Lameter2006-03-251-14/+28
| | | | | | | | | | | | | slabr_objects() can be used to transfer objects between various object caches of the slab allocator. It is currently only used during __cache_alloc() to retrieve elements from the shared array. We will be using it soon to transfer elements from the alien caches to the remote shared array. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: use kmem_cache_zallocPekka Enberg2006-03-251-2/+1
| | | | | | | | Convert mm/ to use the new kmem_cache_zalloc allocator. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: introduce kmem_cache_zalloc allocatorPekka Enberg2006-03-251-0/+17
| | | | | | | | | | Introduce a memory-zeroing variant of kmem_cache_alloc. The allocator already exits in XFS and there are potential users for it so this patch makes the allocator available for the general public. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: implement /proc/slab_allocatorsAl Viro2006-03-251-6/+174
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset: memory_spread_slab drop useless PF_SPREAD_PAGE checkPaul Jackson2006-03-241-3/+2
| | | | | | | | | | | | | | | | | | | The hook in the slab cache allocation path to handle cpuset memory spreading for tasks in cpusets with 'memory_spread_slab' enabled has a modest performance bug. The hook calls into the memory spreading handler alternate_node_alloc() if either of 'memory_spread_slab' or 'memory_spread_page' is enabled, even though the handler does nothing (albeit harmlessly) for the page case Fix - drop PF_SPREAD_PAGE from the set of flag bits that are used to trigger a call to alternate_node_alloc(). The page case is handled by separate hooks -- see the calls conditioned on cpuset_do_page_mem_spread() in mm/filemap.c Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset memory spread slab cache optimizationsPaul Jackson2006-03-241-13/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | The hooks in the slab cache allocator code path for support of NUMA mempolicies and cpuset memory spreading are in an important code path. Many systems will use neither feature. This patch optimizes those hooks down to a single check of some bits in the current tasks task_struct flags. For non NUMA systems, this hook and related code is already ifdef'd out. The optimization is done by using another task flag, set if the task is using a non-default NUMA mempolicy. Taking this flag bit along with the PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset memory spreading' patch set, one can check for the combination of any of these special case memory placement mechanisms with a single test of the current tasks task_struct flags. This patch also tightens up the code, to save a few bytes of kernel text space, and moves some of it out of line. Due to the nested inlines called from multiple places, we were ending up with three copies of this code, which once we get off the main code path (for local node allocation) seems a bit wasteful of instruction memory. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpuset memory spread slab cache implementationPaul Jackson2006-03-241-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide the slab cache infrastructure to support cpuset memory spreading. See the previous patches, cpuset_mem_spread, for an explanation of cpuset memory spreading. This patch provides a slab cache SLAB_MEM_SPREAD flag. If set in the kmem_cache_create() call defining a slab cache, then any task marked with the process state flag PF_MEMSPREAD will spread memory page allocations for that cache over all the allowed nodes, instead of preferring the local (faulting) node. On systems not configured with CONFIG_NUMA, this results in no change to the page allocation code path for slab caches. On systems with cpusets configured in the kernel, but the "memory_spread" cpuset option not enabled for the current tasks cpuset, this adds a call to a cpuset routine and failed bit test of the processor state flag PF_SPREAD_SLAB. For tasks so marked, a second inline test is done for the slab cache flag SLAB_MEM_SPREAD, and if that is set and if the allocation is not in_interrupt(), this adds a call to to a cpuset routine that computes which of the tasks mems_allowed nodes should be preferred for this allocation. ==> This patch adds another hook into the performance critical code path to allocating objects from the slab cache, in the ____cache_alloc() chunk, below. The next patch optimizes this hook, reducing the impact of the combined mempolicy plus memory spreading hooks on this critical code path to a single check against the tasks task_struct flags word. This patch provides the generic slab flags and logic needed to apply memory spreading to a particular slab. A subsequent patch will mark a few specific slab caches for this placement policy. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: slab cache interleave rotor fixPaul Jackson2006-03-221-1/+1
| | | | | | | | | | | | | The alien cache rotor in mm/slab.c assumes that the first online node is node 0. Eventually for some archs, especially with hotplug, this will no longer be true. Fix the interleave rotor to handle the general case of node numbering. Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix drain_array() so that it works correctly with the shared_arrayChristoph Lameter2006-03-221-9/+12
| | | | | | | | | | | | | | | The list_lock also protects the shared array and we call drain_array() with the shared array. Therefore we cannot go as far as I wanted to but have to take the lock in a way so that it also protects the array_cache in drain_pages. (Note: maybe we should make the array_cache locking more consistent? I.e. always take the array cache lock for shared arrays and disable interrupts for the per cpu arrays?) Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: remove drain_array_lockedChristoph Lameter2006-03-221-21/+10
| | | | | | | | | Remove drain_array_locked and use that opportunity to limit the time the l3 lock is taken further. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: make drain_array more universal by adding more parametersChristoph Lameter2006-03-221-9/+11
| | | | | | | | | | | And a parameter to drain_array to control the freeing of all objects and then use drain_array() to replace instances of drain_array_locked with drain_array. Doing so will avoid taking locks in those locations if the arrays are empty. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: cache_reap(): further reduction in interrupt holdoffChristoph Lameter2006-03-221-14/+43
| | | | | | | | | | | | | | | | | | | | | | | | | cache_reap takes the l3->list_lock (disabling interrupts) unconditionally and then does a few checks and maybe does some cleanup. This patch makes cache_reap() only take the lock if there is work to do and then the lock is taken and released for each cleaning action. The checking of when to do the next reaping is done without any locking and becomes racy. Should not matter since reaping can also be skipped if the slab mutex cannot be acquired. The same is true for the touched processing. If we get this wrong once in awhile then we will mistakenly clean or not clean the shared cache. This will impact performance slightly. Note that the additional drain_array() function introduced here will fall out in a subsequent patch since array cleaning will now be very similar from all callers. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: nommu use compound pagesNick Piggin2006-03-221-1/+8
| | | | | | | | | | | | | | | Now that compound page handling is properly fixed in the VM, move nommu over to using compound pages rather than rolling their own refcounting. nommu vm page refcounting is broken anyway, but there is no need to have divergent code in the core VM now, nor when it gets fixed. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: David Howells <dhowells@redhat.com> (Needs testing, please). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: use on_each_cpu()Andrew Morton2006-03-221-19/+2
| | | | | | | Slab duplicates on_each_cpu(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: Remove SLAB_NO_REAP optionChristoph Lameter2006-03-221-11/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | SLAB_NO_REAP is documented as an option that will cause this slab not to be reaped under memory pressure. However, that is not what happens. The only thing that SLAB_NO_REAP controls at the moment is the reclaim of the unused slab elements that were allocated in batch in cache_reap(). Cache_reap() is run every few seconds independently of memory pressure. Could we remove the whole thing? Its only used by three slabs anyways and I cannot find a reason for having this option. There is an additional problem with SLAB_NO_REAP. If set then the recovery of objects from alien caches is switched off. Objects not freed on the same node where they were initially allocated will only be reused if a certain amount of objects accumulates from one alien node (not very likely) or if the cache is explicitly shrunk. (Strangely __cache_shrink does not check for SLAB_NO_REAP) Getting rid of SLAB_NO_REAP fixes the problems with alien cache freeing. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix kernel-doc warningsRandy Dunlap2006-03-221-3/+12
| | | | | | | | Fix kernel-doc warnings in mm/slab.c. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: remove cachep->spinlockRavikiran G Thirumalai2006-03-221-11/+9
| | | | | | | | | | | | | | | | | | Remove cachep->spinlock. Locking has moved to the kmem_list3 and most of the structures protected earlier by cachep->spinlock is now protected by the l3->list_lock. slab cache tunables like batchcount are accessed always with the cache_chain_mutex held. Patch tested on SMP and NUMA kernels with dbench processes running, constant onlining/offlining, and constant cache tuning, all at the same time. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab cleanupAndrew Morton2006-03-221-292/+304
| | | | | | | | | | | | | slab.c has become a bit revolting again. Try to repair it. - Coding style fixes - Don't do assignments-in-if-statements. - Don't typecast assignments to/from void* Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: extract setup_cpu_cachePekka Enberg2006-03-221-54/+55
| | | | | | | | | Extract setup_cpu_cache() function from kmem_cache_create() to make the latter a little less complex. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: object to index mapping cleanupPekka Enberg2006-03-221-11/+23
| | | | | | | | Clean up the object to index mapping that has been spread around mm/slab.c. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm: slab less atomicsNick Piggin2006-03-221-3/+3
| | | | | | | | Atomic operation removal from slab Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages.Christoph Lameter2006-03-091-3/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: allocate larger cache_cache if order 0 failsJack Steiner2006-03-081-3/+8
| | | | | | | | | | | | kmem_cache_init() incorrectly assumes that the cache_cache object will fit in an order 0 allocation. On very large systems, this is not true. Change the code to try larger order allocations if order 0 fails. Signed-off-by: Jack Steiner <steiner@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* slab: fix calculate_slab_order() for SLAB_RECLAIM_ACCOUNTLinus Torvalds2006-03-081-11/+9
| | | | | | | | Instead of having a hard-to-read and confusing conditional in the caller, just make the slab order calculation handle this special case, since it's simple and obvious there. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* slab: clarify and fix calculate_slab_order()Linus Torvalds2006-03-061-12/+12
| | | | | | | | | If we triggered the 'offslab_limit' test, we would return with cachep->gfporder incremented once too many times. This clarifies the logic somewhat, and fixes that bug. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Fix "check_slabp" printout size calculationLinus Torvalds2006-03-061-1/+1
| | | | | | | | | | | We want to use the "struct slab" size, not the size of the pointer to same. As it is, we'd not print out the last <n> entry pointers in the slab (where <n> is ~10, depending on whether it's a 32-bit or 64-bit kernel). Gaah, that slab code was written by somebody who likes unreadable crud. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: Avoid deadlock at kmem_cache_create/kmem_cache_destroyRavikiran G Thirumalai2006-02-101-3/+7
| | | | | | | | | | | Prevents deadlock situation between kmem_cache_create()/kmem_cache_destory(), and kmem_cache_create() /cpu hotplug. The locking order probably got moved over time. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* mm/slab.c (non-NUMA): Fix compile warning and clean up codeLinus Torvalds2006-02-051-3/+8
| | | | | | | | | | | | | | The non-NUMA case would do an unmatched "free_alien_cache()" on an alien pointer that had never been allocated. It might not matter from a code generation standpoint (since in the non-NUMA case, the code doesn't actually _do_ anything), but it not only results in a compiler warning, it's really really ugly too. Fix the compiler warning by just having a matching dummy allocation. That also avoids an unnecessary #ifdef in the code. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NUMA slab locking fixes: fix cpu down and up lockingRavikiran G Thirumalai2006-02-051-38/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab allocator. Sonny Rao <sonny@burdell.org> reported problems sometime back on POWER5 boxes, when the last cpu on the nodes were being offlined. We could not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not being updated on cpu down. Since that issue is now fixed, we can reproduce Sonny's problems on x86_64 NUMA, and here is the fix. The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go down, the array_caches (shared, alien) and the kmem_list3 of the node were being freed (kfree) with the kmem_list3 lock held. If the l3 or the array_caches were to come from the same cache being cleared, we hit on badness. This patch cleans up the locking in cpu_up and cpu_down path. We cannot really free l3 on cpu down because, there is no node offlining yet and even though a cpu is not yet up, node local memory can be allocated for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get to the cachep->nodelist[nodeid] either. Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4 dbench process running all the time. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NUMA slab locking fixes: irq disabling from cahep->spinlock to l3 lockRavikiran G Thirumalai2006-02-051-18/+18
| | | | | | | | | | | | | | | | | Earlier, we had to disable on chip interrupts while taking the cachep->spinlock because, at cache_grow, on every addition of a slab to a slab cache, we incremented colour_next which was protected by the cachep->spinlock, and cache_grow could occur at interrupt context. Since, now we protect the per-node colour_next with the node's list_lock, we do not need to disable on chip interrupts while taking the per-cache spinlock, but we just need to disable interrupts when taking the per-node kmem_list3 list_lock. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NUMA slab locking fixes: move color_next to l3Ravikiran G Thirumalai2006-02-051-11/+11
| | | | | | | | | | | | | | | | | | | colour_next is used as an index to add a colouring offset to a new slab in the cache (colour_off * colour_next). Now with the NUMA aware slab allocator, it makes sense to colour slabs added on the same node sequentially with colour_next. This patch moves the colouring index "colour_next" per-node by placing it on kmem_list3 rather than kmem_cache. This also helps simplify locking for CPU up and down paths. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix sparse warningRandy Dunlap2006-02-011-2/+2
| | | | | | | | mm/slab.c:1522:13: error: incompatible types for operation (&) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mm/slab: add kernel-doc for one functionRandy.Dunlap2006-02-011-2/+7
| | | | | | | | Fix kernel-doc for calculate_slab_order(). Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: fix kzalloc and kstrdup caller report for CONFIG_DEBUG_SLABPekka Enberg2006-02-011-5/+24
| | | | | | | | | | | | | | Fix kzalloc() and kstrdup() caller report for CONFIG_DEBUG_SLAB. We must pass the caller to __cache_alloc() instead of directly doing __builtin_return_address(0) there; otherwise kzalloc() and kstrdup() are reported as the allocation site instead of the real one. Thanks to Valdis Kletnieks for reporting the problem and Steven Rostedt for the original idea. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: replace kmem_cache_t with struct kmem_cachePekka Enberg2006-02-011-97/+98
| | | | | | | | Replace uses of kmem_cache_t with proper struct kmem_cache in mm/slab.c. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: rename ac_data to cpu_cache_getPekka Enberg2006-02-011-18/+18
| | | | | | | | | Rename the ac_data() function to more descriptive cpu_cache_get(). Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: extract virt_to_{cache|slab}Pekka Enberg2006-02-011-5/+17
| | | | | | | | | | | Introduce virt_to_cache() and virt_to_slab() functions to reduce duplicate code and introduce a proper abstraction should we want to support other kind of mapping for address to slab and cache (eg. for vmalloc() or I/O memory). Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] slab: reduce inliningPekka Enberg2006-02-011-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | From: Manfred Spraul <manfred@colorfullife.com> Reduce the amount of inline functions in slab to the functions that are used in the hot path: - no inline for debug functions - no __always_inline, inline is already __always_inline - remove inline from a few numa support functions. Before: text data bss dec hex filename 13588 752 48 14388 3834 mm/slab.o (defconfig) 16671 2492 48 19211 4b0b mm/slab.o (numa) After: text data bss dec hex filename 13366 752 48 14166 3756 mm/slab.o (defconfig) 16230 2492 48 18770 4952 mm/slab.o (numa) Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
OpenPOWER on IntegriCloud