summaryrefslogtreecommitdiffstats
path: root/mm/hugetlb.c
Commit message (Collapse)AuthorAgeFilesLines
* Do not account for hugetlbfs quota at mmap() time if mapping [SHM|MAP]_NORESERVEMel Gorman2009-02-111-20/+33
| | | | | | | | | | | | | | | | | | | | Commit 5a6fe125950676015f5108fb71b2a67441755003 brought hugetlbfs more in line with the core VM by obeying VM_NORESERVE and not reserving hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE are specified. However, it is still taking filesystem quota unconditionally. At fault time, if there are no reserves and attempt is made to allocate the page and account for filesystem quota. If either fail, the fault fails. The impact is that quota is getting accounted for twice. This patch partially reverts 5a6fe125950676015f5108fb71b2a67441755003. To help prevent this mistake happening again, it improves the documentation of hugetlb_reserve_pages() Reported-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Do not account for the address space used by hugetlbfs using VM_ACCOUNTMel Gorman2009-02-101-14/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When overcommit is disabled, the core VM accounts for pages used by anonymous shared, private mappings and special mappings. It keeps track of VMAs that should be accounted for with VM_ACCOUNT and VMAs that never had a reserve with VM_NORESERVE. Overcommit for hugetlbfs is much riskier than overcommit for base pages due to contiguity requirements. It avoids overcommiting on both shared and private mappings using reservation counters that are checked and updated during mmap(). This ensures (within limits) that hugepages exist in the future when faults occurs or it is too easy to applications to be SIGKILLed. As hugetlbfs makes its own reservations of a different unit to the base page size, VM_ACCOUNT should never be set. Even if the units were correct, we would double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may be set because an application can request no reserves be made for hugetlbfs at the risk of getting killed later. With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This breaks the accounting for both the core VM and hugetlbfs, can trigger an OOM storm when hugepage pools are too small lockups and corrupted counters otherwise are used. This patch brings hugetlbfs more in line with how the core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: hugetlb: remove redundant `if' operationCyrill Gorcunov2009-01-061-3/+2
| | | | | | | | | | | At this point we already know that 'addr' is not NULL so get rid of redundant 'if'. Probably gcc eliminate it by optimization pass. [akpm@linux-foundation.org: use __weak, too] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: fix sparse warningsHannes Eder2009-01-061-4/+8
| | | | | | | | | | | | Fix the following sparse warnings: mm/hugetlb.c:375:3: warning: returning void-valued expression mm/hugetlb.c:408:3: warning: returning void-valued expression Signed-off-by: Hannes Eder <hannes@hanneseder.net> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: report the MMU pagesize in /proc/pid/smapsMel Gorman2009-01-061-0/+13
| | | | | | | | | | | | | | | The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the kernel to back a VMA. This matches the size used by the MMU in the majority of cases. However, one counter-example occurs on PPC64 kernels whereby a kernel using 64K as a base pagesize may still use 4K pages for the MMU on older processor. To distinguish, this patch reports MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: report the pagesize backing a VMA in /proc/pid/smapsMel Gorman2009-01-061-0/+16
| | | | | | | | | | | | | | | It is useful to verify a hugepage-aware application is using the expected pagesizes for its memory regions. This patch creates an entry called KernelPageSize in /proc/pid/smaps that is the size of page used by the kernel to back a VMA. The entry is not called PageSize as it is possible the MMU uses a different size. This extension should not break any sensible parser that skips lines containing unrecognised information. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: make unmap_ref_private multi-size-awareAdam Litke2008-11-121-2/+3
| | | | | | | | | | | | | | | | | Oops. Part of the hugetlb private reservation code was not fully converted to use hstates. When a huge page must be unmapped from VMAs due to a failed COW, HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of the page size being used. This works if the VMA is using the default huge page size. Otherwise we might unmap too much, too little, or trigger a BUG_ON. Rare but serious -- fix it. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: pull gigantic page initialisation out of the default pathAndy Whitcroft2008-11-061-1/+11
| | | | | | | | | | | | | | | | | | As we can determine exactly when a gigantic page is in use we can optimise the common regular page cases by pulling out gigantic page initialisation into its own function. As gigantic pages are never released to buddy we do not need a destructor. This effectivly reverts the previous change to the main buddy allocator. It also adds a paranoid check to ensure we never release gigantic pages from hugetlbfs to the main buddy. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlbfs: handle pages higher order than MAX_ORDERAndy Whitcroft2008-11-061-1/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | When working with hugepages, hugetlbfs assumes that those hugepages are smaller than MAX_ORDER. Specifically it assumes that the mem_map is contigious and uses that to optimise access to the elements of the mem_map that represent the hugepage. Gigantic pages (such as 16GB pages on powerpc) by definition are of greater order than MAX_ORDER (larger than MAX_ORDER_NR_PAGES in size). This means that we can no longer make use of the buddy alloctor guarentees for the contiguity of the mem_map, which ensures that the mem_map is at least contigious for maximmally aligned areas of MAX_ORDER_NR_PAGES pages. This patch adds new mem_map accessors and iterator helpers which handle any discontiguity at MAX_ORDER_NR_PAGES boundaries. It then uses these to implement gigantic page versions of copy_huge_page and clear_huge_page, and to allow follow_hugetlb_page handle gigantic pages. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: switch /proc/meminfo to seq_fileAlexey Dobriyan2008-10-231-2/+3
| | | | | | and move it to fs/proc/meminfo.c while I'm at it. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
* hugepage: support ZERO_PAGE()KOSAKI Motohiro2008-10-201-3/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Presently hugepage doesn't use zero page at all because zero page is only used for coredumping and hugepage can't core dump. However we have now implemented hugepage coredumping. Therefore we should implement the zero page of hugepage. Implementation note: o Why do we only check VM_SHARED for zero page? normal page checked as .. static inline int use_zero_page(struct vm_area_struct *vma) { if (vma->vm_flags & (VM_LOCKED | VM_SHARED)) return 0; return !vma->vm_ops || !vma->vm_ops->fault; } First, hugepages are never mlock()ed. We aren't concerned with VM_LOCKED. Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it doesn't have any file backing. Thus ops->fault checking is meaningless. o Why don't we use zero page if !pte. !pte indicate {pud, pmd} doesn't exist or some error happened. So we shouldn't return zero page if any error occurred. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: hugetlb.c make functions static, use NULL rather than 0Harvey Harrison2008-10-201-7/+5
| | | | | | | | | | | | mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static? mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static? mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: split LRU lists into anon & file setsRik van Riel2008-10-201-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs. The advantage of doing this is that the VM will not have to scan over lots of anonymous pages (which we generally do not want to swap out), just to find the page cache pages that it should evict. This patch has the infrastructure and a basic policy to balance how much we scan the anon lists and how much we scan the file lists. The big policy changes are in separate patches. [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset] [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru] [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page] [hugh@veritas.com: memcg swapbacked pages active] [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED] [akpm@linux-foundation.org: fix /proc/vmstat units] [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration] [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo] [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: handle updating of ACCESSED and DIRTY in hugetlb_fault()David Gibson2008-10-161-5/+18
| | | | | | | | | | | | | | | | | | | | | | | | | The page fault path for normal pages, if the fault is neither a no-page fault nor a write-protect fault, will update the DIRTY and ACCESSED bits in the page table appropriately. The hugepage fault path, however, does not do this, handling only no-page or write-protect type faults. It assumes that either the ACCESSED and DIRTY bits are irrelevant for hugepages (usually true, since they are never swapped) or that they are handled by the arch code. This is inconvenient for some software-loaded TLB architectures, where the _PAGE_ACCESSED (_PAGE_DIRTY) bits need to be set to enable read (write) access to the page at the TLB miss. This could be worked around in the arch TLB miss code, but the TLB miss fast path can be made simple more easily if the hugetlb_fault() path handles this, as the normal page fault path does. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* allocate structures for reservation tracking in hugetlbfs outside of ↵Andy Whitcroft2008-08-121-4/+11
| | | | | | | | | | | | | | | | | | | spinlocks v2 [Andrew this should replace the previous version which did not check the returns from the region prepare for errors. This has been tested by us and Gerald and it looks good. Bah, while reviewing the locking based on your previous email I spotted that we need to check the return from the vma_needs_reservation call for allocation errors. Here is an updated patch to correct this. This passes testing here.] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlbfs: allocate structures for reservation tracking outside of spinlocksAndy Whitcroft2008-08-121-9/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | In the normal case, hugetlbfs reserves hugepages at map time so that the pages exist for future faults. A struct file_region is used to track when reservations have been consumed and where. These file_regions are allocated as necessary with kmalloc() which can sleep with the mm->page_table_lock held. This is wrong and triggers may-sleep warning when PREEMPT is enabled. Updates to the underlying file_region are done in two phases. The first phase prepares the region for the change, allocating any necessary memory, without actually making the change. The second phase actually commits the change. This patch makes use of this by checking the reservations before the page_table_lock is taken; triggering any necessary allocations. This may then be safely repeated within the locks without any allocations being required. Credit to Mel Gorman for diagnosing this failure and initial versions of the patch. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: call arch_prepare_hugepage() for surplus pagesGerald Schaefer2008-08-121-1/+6
| | | | | | | | | | | | | | | | | | | | | | The s390 software large page emulation implements shared page tables by using page->index of the first tail page from a compound large page to store page table information. This is set up in arch_prepare_hugepage(), which is called from alloc_fresh_huge_page_node(). A similar call to arch_prepare_hugepage() is missing for surplus large pages that are allocated in alloc_buddy_huge_page(), which breaks the software emulation mode for (surplus) large pages on s390. This patch adds the missing call to arch_prepare_hugepage(). It will have no effect on other architectures where arch_prepare_hugepage() is a nop. Also, use the correct order in the error path in alloc_fresh_huge_page_node(). Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert duplicate "mm/hugetlb.c must #include <asm/io.h>"Linus Torvalds2008-08-061-1/+1
| | | | | | | | | This reverts commit 7cb93181629c613ee2b8f4ffe3446f8003074842, since we did that patch twice, and the problem was already fixed earlier by 78a34ae29bf1c9df62a5bd0f0798b6c62a54d520. Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hugetlb: don't crash when HPAGE_SHIFT is 0Benjamin Herrenschmidt2008-08-011-1/+6
| | | | | | | | | | | | | | Some platform decide whether they support huge pages at boot time. On these, such as powerpc, HPAGE_SHIFT is a variable, not a constant, and is set to 0 when there is no such support. The patches to introduce multiple huge pages support broke that causing the kernel to crash at boot time on machines such as POWER3 which lack support for multiple page sizes. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6Linus Torvalds2008-08-011-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (28 commits) mm/hugetlb.c must #include <asm/io.h> video: Fix up hp6xx driver build regressions. sh: defconfig updates. sh: Kill off stray mach-rsk7203 reference. serial: sh-sci: Fix up SH7760/SH7780/SH7785 early printk regression. sh: Move out individual boards without mach groups. sh: Make sure AT_SYSINFO_EHDR is exposed to userspace in asm/auxvec.h. sh: Allow SH-3 and SH-5 to use common headers. sh: Provide common CPU headers, prune the SH-2 and SH-2A directories. sh/maple: clean maple bus code sh: More header path fixups for mach dir refactoring. sh: Move out the solution engine headers to arch/sh/include/mach-se/ sh: I2C fix for AP325RXA and Migo-R sh: Shuffle the board directories in to mach groups. sh: dma-sh: Fix up dreamcast dma.h mach path. sh: Switch KBUILD_DEFCONFIG to shx3_defconfig. sh: Add ARCH_DEFCONFIG entries for sh and sh64. sh: Fix compile error of Solution Engine sh: Proper __put_user_asm() size mismatch fix. sh: Stub in a dummy ENTRY_OFFSET for uImage offset calculation. ...
| * mm/hugetlb.c must #include <asm/io.h>Adrian Bunk2008-07-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the following build error on sh caused by commit aa888a74977a8f2120ae9332376e179c39a6b07d (hugetlb: support larger than MAX_ORDER): <-- snip --> ... CC mm/hugetlb.o /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c: In function 'alloc_bootmem_huge_page': /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys' make[2]: *** [mm/hugetlb.o] Error 1 <-- snip --> Reported-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
* | mm/hugetlb.c must #include <asm/io.h>Adrian Bunk2008-07-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the following build error on sh caused by commit aa888a74977a8f2120ae9332376e179c39a6b07d ("hugetlb: support larger than MAX_ORDER"): mm/hugetlb.c: In function 'alloc_bootmem_huge_page': mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys' Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mmu-notifiers: coreAndrea Arcangeli2008-07-281-0/+3
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too. 2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. The patch also adds a few needed but missing includes that would prevent kernel to compile after these changes on non-x86 archs (x86 didn't need them by luck). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix mm/filemap_xip.c build] [akpm@linux-foundation.org: fix mm/mmu_notifier.c build] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: fix CONFIG_SYSCTL=n buildNishanth Aravamudan2008-07-261-12/+12
| | | | | | | | | | | | | | | | | | | | | | | Fixes a build failure reported by Alan Cox: mm/hugetlb.c: In function `hugetlb_acct_memory': mm/hugetlb.c:1507: error: implicit declaration of function `cpuset_mems_nr' Also reverts Ingo's commit e44d1b2998d62a1f2f4d7eb17b56ba396535509f Author: Ingo Molnar <mingo@elte.hu> Date: Fri Jul 25 12:57:41 2008 +0200 mm/hugetlb.c: fix build failure with !CONFIG_SYSCTL which fixed the build error but added some unused-static-function warnings. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hugetlb.c: fix build failure with !CONFIG_SYSCTLIngo Molnar2008-07-251-11/+11
| | | | | | | | | | | | on !CONFIG_SYSCTL on x86 with latest -git i get: mm/hugetlb.c: In function 'decrement_hugepage_resv_vma': mm/hugetlb.c:83: error: 'reserve' undeclared (first use in this function) mm/hugetlb.c:83: error: (Each undeclared identifier is reported only once mm/hugetlb.c:83: error: for each function it appears in.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: quota is not freed for unused reserved private huge pagesAdam Litke2008-07-241-1/+3
| | | | | | | | | | | | | | | | | | | | | With shared reservations (and now also with private reservations), we reserve huge pages at mmap time. We also account for the mapping against fs quota to prevent a reservation from being preempted by quota exhaustion. When testing with the libhugetlbfs test suite, I found a problem with quota accounting. FS quota for allocated pages is handled correctly but we are not releasing quota for private pages that were reserved but never allocated. Do this in hugetlb_vm_op_close() at the same time as unused page reservations are released. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: fix a hugepage reservation check for MAP_SHAREDMel Gorman2008-07-241-6/+6
| | | | | | | | | | | | | | | | | | | | | | When removing a huge page from the hugepage pool for a fault the system checks to see if the mapping requires additional pages to be reserved, and if it does whether there are any unreserved pages remaining. If not, the allocation fails without even attempting to get a page. In order to determine whether to apply this check we call vma_has_private_reserves() which tells us if this vma is MAP_PRIVATE and is the owner. This incorrectly triggers the remaining reservation test for MAP_SHARED mappings which prevents allocation of the final page in the pool even though it is reserved for this mapping. In reality we only want to check this for MAP_PRIVATE mappings where the process is not the original mapper. Replace vma_has_private_reserves() with vma_has_reserves() which indicates whether further reserves are required, and update the caller. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: allow arch overridden hugepage allocationJon Tollefson2008-07-241-8/+3
| | | | | | | | | | | | | | | Allow alloc_bootmem_huge_page() to be overridden by architectures that can't always use bootmem. This requires huge_boot_pages to be available for use by this function. This is required for powerpc 16G pages, which have to be reserved prior to boot-time. The location of these pages are indicated in the device tree. Acked-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: override default huge page sizeNick Piggin2008-07-241-6/+17
| | | | | | | | | | | | | | Allow configurations with the default huge page size which is different to the traditional HPAGE_SIZE size. The default huge page size is the one represented in the legacy /proc ABIs, SHM, and which is defaulted to when mounting hugetlbfs filesystems. This is implemented with a new kernel option default_hugepagesz=, which defaults to HPAGE_SIZE if not specified. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: introduce pud_hugeAndi Kleen2008-07-241-0/+9
| | | | | | | | | | | | Straight forward extensions for huge pages located in the PUD instead of PMDs. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: printk cleanupAndi Kleen2008-07-241-4/+16
| | | | | | | | | | | | | - Reword sentence to clarify meaning with multiple options - Add support for using GB prefixes for the page size - Add extra printk to delayed > MAX_ORDER allocation code Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: support boot allocate different sizesAndi Kleen2008-07-241-9/+30
| | | | | | | | | | | | | | | | | | | Make some infrastructure changes to allow boot-time allocation of different hugepage page sizes. - move all basic hstate initialisation into hugetlb_add_hstate - create a new function hugetlb_hstate_alloc_pages() to do the actual initial page allocations. Call this function early in order to allocate giant pages from bootmem. - Check for multiple hugepages= parameters Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Andrew Hastings <abh@cray.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: support larger than MAX_ORDERAndi Kleen2008-07-241-2/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | This is needed on x86-64 to handle GB pages in hugetlbfs, because it is not practical to enlarge MAX_ORDER to 1GB. Instead the 1GB pages are only allocated at boot using the bootmem allocator using the hugepages=... option. These 1G bootmem pages are never freed. In theory it would be possible to implement that with some complications, but since it would be a one-way street (>= MAX_ORDER pages cannot be allocated later) I decided not to currently. The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very big and the ifdef uglyness seemed not be worth it. Known problems: /proc/meminfo and "free" do not display the memory allocated for gb pages in "Total". This is a little confusing for the user. Acked-by: Andrew Hastings <abh@cray.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: abstract numa round robin selectionAndi Kleen2008-07-241-15/+22
| | | | | | | | | | | | | Need this as a separate function for a future patch. No behaviour change. Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: new sysfs interfaceNishanth Aravamudan2008-07-241-66/+222
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide new hugepages user APIs that are more suited to multiple hstates in sysfs. There is a new directory, /sys/kernel/hugepages. Underneath that directory there will be a directory per-supported hugepage size, e.g.: /sys/kernel/hugepages/hugepages-64kB /sys/kernel/hugepages/hugepages-16384kB /sys/kernel/hugepages/hugepages-16777216kB corresponding to 64k, 16m and 16g respectively. Within each hugepages-size directory there are a number of files, corresponding to the tracked counters in the hstate, e.g.: /sys/kernel/hugepages/hugepages-64/nr_hugepages /sys/kernel/hugepages/hugepages-64/nr_overcommit_hugepages /sys/kernel/hugepages/hugepages-64/free_hugepages /sys/kernel/hugepages/hugepages-64/resv_hugepages /sys/kernel/hugepages/hugepages-64/surplus_hugepages Of these files, the first two are read-write and the latter three are read-only. The size of the hugepage being manipulated is trivially deducible from the enclosing directory and is always expressed in kB (to match meminfo). [dave@linux.vnet.ibm.com: fix build] [nacc@us.ibm.com: hugetlb: hang off of /sys/kernel/mm rather than /sys/kernel] [nacc@us.ibm.com: hugetlb: remove CONFIG_SYSFS dependency] Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlbfs: per mount huge page sizesAndi Kleen2008-07-241-13/+3
| | | | | | | | | | | | | | | | | | | | | | Add the ability to configure the hugetlb hstate used on a per mount basis. - Add a new pagesize= option to the hugetlbfs mount that allows setting the page size - This option causes the mount code to find the hstate corresponding to the specified size, and sets up a pointer to the hstate in the mount's superblock. - Change the hstate accessors to use this information rather than the global_hstate they were using (requires a slight change in mm/memory.c so we don't NULL deref in the error-unmap path -- see comments). [np: take hstate out of hugetlbfs inode and vma->vm_private_data] Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: multiple hstates for multiple page sizesAndi Kleen2008-07-241-27/+121
| | | | | | | | | | | | | | | | | Add basic support for more than one hstate in hugetlbfs. This is the key to supporting multiple hugetlbfs page sizes at once. - Rather than a single hstate, we now have an array, with an iterator - default_hstate continues to be the struct hstate which we use by default - Add functions for architectures to register new hstates [akpm@linux-foundation.org: coding-style fixes] Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: modular state for hugetlb page sizeAndi Kleen2008-07-241-167/+201
| | | | | | | | | | | | | | | | | | | | | | | | | | The goal of this patchset is to support multiple hugetlb page sizes. This is achieved by introducing a new struct hstate structure, which encapsulates the important hugetlb state and constants (eg. huge page size, number of huge pages currently allocated, etc). The hstate structure is then passed around the code which requires these fields, they will do the right thing regardless of the exact hstate they are operating on. This patch adds the hstate structure, with a single global instance of it (default_hstate), and does the basic work of converting hugetlb to use the hstate. Future patches will add more hstate structures to allow for different hugetlbfs mounts to have different page sizes. [akpm@linux-foundation.org: coding-style fixes] Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: factor out prep_new_huge_pageAndi Kleen2008-07-241-6/+11
| | | | | | | | | | | Needed to avoid code duplication in follow up patches. Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vma_page_offset() has no callees: drop itJohannes Weiner2008-07-241-20/+9
| | | | | | | | | | | | | | | | | Hugh adds: vma_pagecache_offset() has a dangerously misleading name, since it's using hugepage units: rename it to vma_hugecache_offset(). [apw@shadowen.org: restack onto fixed MAP_PRIVATE reservations] [akpm@linux-foundation.org: vma_split conversion] Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splitsAndy Whitcroft2008-07-241-27/+145
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a hugetlb mapping with a reservation is split, a new VMA is cloned from the original. This new VMA is a direct copy of the original including the reservation count. When this pair of VMAs are unmapped we will incorrect double account the unused reservation and the overall reservation count will be incorrect, in extreme cases it will wrap. The problem occurs when we split an existing VMA say to unmap a page in the middle. split_vma() will create a new VMA copying all fields from the original. As we are storing our reservation count in vm_private_data this is also copies, endowing the new VMA with a duplicate of the original VMA's reservation. Neither of the new VMAs can exhaust these reservations as they are too small, but when we unmap and close these VMAs we will incorrect credit the remainder twice and resv_huge_pages will become out of sync. This can lead to allocation failures on mappings with reservations and even to resv_huge_pages wrapping which prevents all subsequent hugepage allocations. The simple fix would be to correctly apportion the remaining reservation count when the split is made. However the only hook we have vm_ops->open only has the new VMA we do not know the identity of the preceeding VMA. Also even if we did have that VMA to hand we do not know how much of the reservation was consumed each side of the split. This patch therefore takes a different tack. We know that the whole of any private mapping (which has a reservation) has a reservation over its whole size. Any present pages represent consumed reservation. Therefore if we track the instantiated pages we can calculate the remaining reservation. This patch reuses the existing regions code to track the regions for which we have consumed reservation (ie. the instantiated pages), as each page is faulted in we record the consumption of reservation for the new page. When we need to return unused reservations at unmap time we simply count the consumed reservation region subtracting that from the whole of the map. During a VMA split the newly opened VMA will point to the same region map, as this map is offset oriented it remains valid for both of the split VMAs. This map is referenced counted so that it is removed when all VMAs which are part of the mmap are gone. Thanks to Adam Litke and Mel Gorman for their review feedback. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: allow huge page mappings to be created without reservationsAndy Whitcroft2008-07-241-5/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | By default all shared mappings and most private mappings now have reservations associated with them. This improves semantics by providing allocation guarentees to the mapper. However a small number of applications may attempt to make very large sparse mappings, with these strict reservations the system will never be able to honour the mapping. This patch set brings MAP_NORESERVE support to hugetlb files. This allows new mappings to be made to hugetlbfs files without an associated reservation, for both shared and private mappings. This allows applications which want to create very sparse mappings to opt-out of the reservation system. Obviously as there is no reservation they are liable to fault at runtime if the huge page pool becomes exhausted; buyer beware. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: move reservation region support earlierAndy Whitcroft2008-07-241-121/+125
| | | | | | | | | | | | | | | | The following patch will require use of the reservation regions support. Move this earlier in the file. No changes have been made to this code. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* huge page private reservation review cleanupsAndy Whitcroft2008-07-241-13/+45
| | | | | | | | | | | | | | | | | | Create some new accessors for vma private data to cut down on and contain the casts. Encapsulates the huge and small page offset calculations. Also adds a couple of VM_BUG_ONs for consistency. [akpm@linux-foundation.org: Make things static] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: guarantee that COW faults for a process that called ↵Mel Gorman2008-07-241-18/+183
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mmap(MAP_PRIVATE) on hugetlbfs will succeed After patch 2 in this series, a process that successfully calls mmap() for a MAP_PRIVATE mapping will be guaranteed to successfully fault until a process calls fork(). At that point, the next write fault from the parent could fail due to COW if the child still has a reference. We only reserve pages for the parent but a copy must be made to avoid leaking data from the parent to the child after fork(). Reserves could be taken for both parent and child at fork time to guarantee faults but if the mapping is large it is highly likely we will not have sufficient pages for the reservation, and it is common to fork only to exec() immediatly after. A failure here would be very undesirable. Note that the current behaviour of mainline with MAP_PRIVATE pages is pretty bad. The following situation is allowed to occur today. 1. Process calls mmap(MAP_PRIVATE) 2. Process calls mlock() to fault all pages and makes sure it succeeds 3. Process forks() 4. Process writes to MAP_PRIVATE mapping while child still exists 5. If the COW fails at this point, the process gets SIGKILLed even though it had taken care to ensure the pages existed This patch improves the situation by guaranteeing the reliability of the process that successfully calls mmap(). When the parent performs COW, it will try to satisfy the allocation without using reserves. If that fails the parent will steal the page leaving any children without a page. Faults from the child after that point will result in failure. If the child COW happens first, an attempt will be made to allocate the page without reserves and the child will get SIGKILLed on failure. To summarise the new behaviour: 1. If the original mapper performs COW on a private mapping with multiple references, it will attempt to allocate a hugepage from the pool or the buddy allocator without using the existing reserves. On fail, VMAs mapping the same area are traversed and the page being COW'd is unmapped where found. It will then steal the original page as the last mapper in the normal way. 2. The VMAs the pages were unmapped from are flagged to note that pages with data no longer exist. Future no-page faults on those VMAs will terminate the process as otherwise it would appear that data was corrupted. A warning is printed to the console that this situation occured. 2. If the child performs COW first, it will attempt to satisfy the COW from the pool if there are enough pages or via the buddy allocator if overcommit is allowed and the buddy allocator can satisfy the request. If it fails, the child will be killed. If the pool is large enough, existing applications will not notice that the reserves were a factor. Existing applications depending on the no-reserves been set are unlikely to exist as for much of the history of hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that point or failing the mmap(). [npiggin@suse.de: fix CONFIG_HUGETLB=n build] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings ↵Mel Gorman2008-07-241-39/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | until fork() This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in a similar manner to the reservations taken for MAP_SHARED mappings. The reserve count is accounted both globally and on a per-VMA basis for private mappings. This guarantees that a process that successfully calls mmap() will successfully fault all pages in the future unless fork() is called. The characteristics of private mappings of hugetlbfs files behaviour after this patch are; 1. The process calling mmap() is guaranteed to succeed all future faults until it forks(). 2. On fork(), the parent may die due to SIGKILL on writes to the private mapping if enough pages are not available for the COW. For reasonably reliable behaviour in the face of a small huge page pool, children of hugepage-aware processes should not reference the mappings; such as might occur when fork()ing to exec(). 3. On fork(), the child VMAs inherit no reserves. Reads on pages already faulted by the parent will succeed. Successful writes will depend on enough huge pages being free in the pool. 4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper and at fault time otherwise. Before this patch, all reads or writes in the child potentially needs page allocations that can later lead to the death of the parent. This applies to reads and writes of uninstantiated pages as well as COW. After the patch it is only a write to an instantiated page that causes problems. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: move hugetlb_acct_memory()Mel Gorman2008-07-241-41/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a patchset to give reliable behaviour to a process that successfully calls mmap(MAP_PRIVATE) on a hugetlbfs file. Currently, it is possible for the process to be killed due to a small hugepage pool size even if it calls mlock(). MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This guarantees all future faults against the mapping will succeed. This allows local allocations at first use improving NUMA locality whilst retaining reliability. MAP_PRIVATE mappings do not reserve pages. This can result in an application being SIGKILLed later if a huge page is not available at fault time. This makes huge pages usage very ill-advised in some cases as the unexpected application failure cannot be detected and handled as it is immediately fatal. Although an application may force instantiation of the pages using mlock(), this may lead to poor memory placement and the process may still be killed when performing COW. This patchset introduces a reliability guarantee for the process which creates a private mapping, i.e. the process that calls mmap() on a hugetlbfs file successfully. The first patch of the set is purely mechanical code move to make later diffs easier to read. The second patch will guarantee faults up until the process calls fork(). After patch two, as long as the child keeps the mappings, the parent is no longer guaranteed to be reliable. Patch 3 guarantees that the parent will always successfully COW by unmapping the pages from the child in the event there are insufficient pages in the hugepage pool in allocate a new page, be it via a static or dynamic pool. Existing hugepage-aware applications are unlikely to be affected by this change. For much of hugetlbfs's history, pages were pre-faulted at mmap() time or mmap() failed which acts in a reserve-like manner. If the pool is sized correctly already so that parent and child can fault reliably, the application will not even notice the reserves. It's only when the pool is too small for the application to function perfectly reliably that the reserves come into play. Credit goes to Andy Whitcroft for cleaning up a number of mistakes during review before the patches were released. This patch: A later patch in this set needs to call hugetlb_acct_memory() before it is defined. This patch moves the function without modification. This makes later diffs easier to read. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hugetlb.c: fix duplicate variableAdrian Bunk2008-07-241-1/+0
| | | | | | | | | | | | | | | It's confusing that set_max_huge_pages() contained two different variables named "ret", and although the code works correctly this should be fixed. The inner of the two variables can simply be removed. Spotted by sparse. Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: fix lockdep errorNick Piggin2008-06-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ============================================= [ INFO: possible recursive locking detected ] 2.6.26-rc4 #30 --------------------------------------------- heap-overflow/2250 is trying to acquire lock: (&mm->page_table_lock){--..}, at: [<c0000000000cf2e8>] .copy_hugetlb_page_range+0x108/0x280 but task is already holding lock: (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280 other info that might help us debug this: 3 locks held by heap-overflow/2250: #0: (&mm->mmap_sem){----}, at: [<c000000000050e44>] .dup_mm+0x134/0x410 #1: (&mm->mmap_sem/1){--..}, at: [<c000000000050e54>] .dup_mm+0x144/0x410 #2: (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280 stack backtrace: Call Trace: [c00000003b2774e0] [c000000000010ce4] .show_stack+0x74/0x1f0 (unreliable) [c00000003b2775a0] [c0000000003f10e0] .dump_stack+0x20/0x34 [c00000003b277620] [c0000000000889bc] .__lock_acquire+0xaac/0x1080 [c00000003b277740] [c000000000089000] .lock_acquire+0x70/0xb0 [c00000003b2777d0] [c0000000003ee15c] ._spin_lock+0x4c/0x80 [c00000003b277870] [c0000000000cf2e8] .copy_hugetlb_page_range+0x108/0x280 [c00000003b277950] [c0000000000bcaa8] .copy_page_range+0x558/0x790 [c00000003b277ac0] [c000000000050fe0] .dup_mm+0x2d0/0x410 [c00000003b277ba0] [c000000000051d24] .copy_process+0xb94/0x1020 [c00000003b277ca0] [c000000000052244] .do_fork+0x94/0x310 [c00000003b277db0] [c000000000011240] .sys_clone+0x60/0x80 [c00000003b277e30] [c0000000000078c4] .ppc_clone+0x8/0xc Fix is the same way that mm/memory.c copy_page_range does the lockdep annotation. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: explicitly retry hugepage allocationsNishanth Aravamudan2008-04-291-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add __GFP_REPEAT to hugepage allocations. Do so to not necessitate userspace putting pressure on the VM by repeated echo's into /proc/sys/vm/nr_hugepages to grow the pool. With the previous patch to allow for large-order __GFP_REPEAT attempts to loop for a bit (as opposed to indefinitely), this increases the likelihood of getting hugepages when the system experiences (or recently experienced) load. Mel tested the patchset on an x86_32 laptop. With the patches, it was easier to use the proc interface to grow the hugepage pool. The following is the output of a script that grows the pool as much as possible running on 2.6.25-rc9. Allocating hugepages test ------------------------- Disabling OOM Killer for current test process Starting page count: 0 Attempt 1: 57 pages Progress made with 57 pages Attempt 2: 73 pages Progress made with 16 pages Attempt 3: 74 pages Progress made with 1 pages Attempt 4: 75 pages Progress made with 1 pages Attempt 5: 77 pages Progress made with 2 pages 77 pages was the most it allocated but it took 5 attempts from userspace to get it. With the 3 patches in this series applied, Allocating hugepages test ------------------------- Disabling OOM Killer for current test process Starting page count: 0 Attempt 1: 75 pages Progress made with 75 pages Attempt 2: 76 pages Progress made with 1 pages Attempt 3: 79 pages Progress made with 3 pages And 79 pages was the most it got. Your patches were able to allocate the bulk of possible pages on the first attempt. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Tested-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud