From 5beb49305251e5669852ed541e8e2f2f7696c53e Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Fri, 5 Mar 2010 13:42:07 -0800 Subject: mm: change anon_vma linking to fix multi-process server scalability issue The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by: Rik van Riel Cc: KOSAKI Motohiro Cc: Larry Woodman Cc: Lee Schermerhorn Cc: Minchan Kim Cc: Andrea Arcangeli Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 35 +++++++++++++++++++++++++++++++---- 1 file changed, 31 insertions(+), 4 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b019ae64e2ab..62da2001d55c 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -37,7 +37,27 @@ struct anon_vma { * is serialized by a system wide lock only visible to * mm_take_all_locks() (mm_all_locks_mutex). */ - struct list_head head; /* List of private "related" vmas */ + struct list_head head; /* Chain of private "related" vmas */ +}; + +/* + * The copy-on-write semantics of fork mean that an anon_vma + * can become associated with multiple processes. Furthermore, + * each child process will have its own anon_vma, where new + * pages for that process are instantiated. + * + * This structure allows us to find the anon_vmas associated + * with a VMA, or the VMAs associated with an anon_vma. + * The "same_vma" list contains the anon_vma_chains linking + * all the anon_vmas associated with this VMA. + * The "same_anon_vma" list contains the anon_vma_chains + * which link all the VMAs associated with this anon_vma. + */ +struct anon_vma_chain { + struct vm_area_struct *vma; + struct anon_vma *anon_vma; + struct list_head same_vma; /* locked by mmap_sem & page_table_lock */ + struct list_head same_anon_vma; /* locked by anon_vma->lock */ }; #ifdef CONFIG_MMU @@ -89,12 +109,19 @@ static inline void anon_vma_unlock(struct vm_area_struct *vma) */ void anon_vma_init(void); /* create anon_vma_cachep */ int anon_vma_prepare(struct vm_area_struct *); -void __anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *); -void anon_vma_unlink(struct vm_area_struct *); -void anon_vma_link(struct vm_area_struct *); +void unlink_anon_vmas(struct vm_area_struct *); +int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *); +int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *); void __anon_vma_link(struct vm_area_struct *); void anon_vma_free(struct anon_vma *); +static inline void anon_vma_merge(struct vm_area_struct *vma, + struct vm_area_struct *next) +{ + VM_BUG_ON(vma->anon_vma != next->anon_vma); + unlink_anon_vmas(next); +} + /* * rmap interfaces called when adding or removing pte of page */ -- cgit v1.2.1 From c44b674323f4a2480dbeb65d4b487fa5f06f49e0 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Fri, 5 Mar 2010 13:42:09 -0800 Subject: rmap: move exclusively owned pages to own anon_vma in do_wp_page() When the parent process breaks the COW on a page, both the original which is mapped at child and the new page which is mapped parent end up in that same anon_vma. Generally this won't be a problem, but for some workloads it could preserve the O(N) rmap scanning complexity. A simple fix is to ensure that, when a page which is mapped child gets reused in do_wp_page, because we already are the exclusive owner, the page gets moved to our own exclusive child's anon_vma. Signed-off-by: Rik van Riel Cc: KOSAKI Motohiro Cc: Larry Woodman Cc: Lee Schermerhorn Reviewed-by: Minchan Kim Cc: Andrea Arcangeli Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 62da2001d55c..72be23b1480a 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -125,6 +125,7 @@ static inline void anon_vma_merge(struct vm_area_struct *vma, /* * rmap interfaces called when adding or removing pte of page */ +void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long); void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long); void page_add_file_rmap(struct page *); -- cgit v1.2.1 From 645747462435d84c6c6a64269ed49cc3015f753d Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 5 Mar 2010 13:42:22 -0800 Subject: vmscan: detect mapped file pages used only once The VM currently assumes that an inactive, mapped and referenced file page is in use and promotes it to the active list. However, every mapped file page starts out like this and thus a problem arises when workloads create a stream of such pages that are used only for a short time. By flooding the active list with those pages, the VM quickly gets into trouble finding eligible reclaim canditates. The result is long allocation latencies and eviction of the wrong pages. This patch reuses the PG_referenced page flag (used for unmapped file pages) to implement a usage detection that scales with the speed of LRU list cycling (i.e. memory pressure). If the scanner encounters those pages, the flag is set and the page cycled again on the inactive list. Only if it returns with another page table reference it is activated. Otherwise it is reclaimed as 'not recently used cache'. This effectively changes the minimum lifetime of a used-once mapped file page from a full memory cycle to an inactive list cycle, which allows it to occur in linear streams without affecting the stable working set of the system. Signed-off-by: Johannes Weiner Reviewed-by: Rik van Riel Cc: Minchan Kim Cc: OSAKI Motohiro Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 72be23b1480a..d25bd224d370 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -209,7 +209,7 @@ static inline int page_referenced(struct page *page, int is_locked, unsigned long *vm_flags) { *vm_flags = 0; - return TestClearPageReferenced(page); + return 0; } #define try_to_unmap(page, refs) SWAP_FAIL -- cgit v1.2.1