From 10af6235e0d327d42e1bad974385197817923dc1 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski
Date: Mon, 24 Jul 2017 21:41:38 -0700
Subject: x86/mm: Implement PCID based optimization: try to preserve old TLB
 entries using PCID
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PCID is a "process context ID" -- it's what other architectures call
an address space ID.  Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB.  x86's PCID
is 12 bits.

This is an unorthodox approach to using PCID.  x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster systems
with over 4096 CPUs.  To make matters worse, past attempts to use all
12 PCID bits have resulted in slowdowns instead of speedups.

This patch uses PCID differently.  We use a PCID to identify a
recently-used mm on a per-cpu basis.  An mm has no fixed PCID binding
at all; instead, we give it a fresh PCID each time it's loaded, except
in cases where we want to preserve the TLB, in which case we reuse a
recent value.

Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
(turbo off, intel_pstate requesting max performance) under KVM with
the guest using idle=poll (to avoid artifacts when bouncing between
CPUs).  I haven't done any real statistics here -- I just ran them in
a loop and picked the fastest results that didn't look like outliers.
Unpatched means commit a4eb8b993554, so all the bookkeeping overhead
is gone.

ping-pong between two mms on the same CPU using eventfd:

  patched:         1.22µs
  patched, nopcid: 1.33µs
  unpatched:       1.34µs

Same ping-pong, but now touch 512 pages (all zero-page to minimize
cache misses) each iteration.  dTLB misses are measured by
dtlb_load_misses.miss_causes_a_walk:

  patched:         1.8µs, 11M dTLB misses
  patched, nopcid: 6.2µs, 207M dTLB misses
  unpatched:       6.1µs, 190M dTLB misses

Signed-off-by: Andy Lutomirski
Reviewed-by: Nadav Amit
Cc: Andrew Morton
Cc: Arjan van de Ven
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Linus Torvalds
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/9ee75f17a81770feed616358e6860d98a2a5b1e7.1500957502.git.luto@kernel.org
Signed-off-by: Ingo Molnar
---
 arch/x86/include/asm/tlbflush.h | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

(limited to 'arch/x86/include/asm/tlbflush.h')
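Since this view is limited to tlbflush.h, the arch/x86/mm/tlb.c side
of the change (the logic that fills in loaded_mm_asid and next_asid)
is not shown.  A minimal sketch of that per-cpu ASID selection,
assuming the cpu_tlbstate layout declared below -- the helper name and
exact shape are illustrative, not necessarily the verbatim tlb.c hunk:

static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
			    u16 *new_asid, bool *need_flush)
{
	u16 asid;

	if (!static_cpu_has(X86_FEATURE_PCID)) {
		/* No PCID: context 0 is the only context; always flush. */
		*new_asid = 0;
		*need_flush = true;
		return;
	}

	/* Prefer a slot whose TLB entries already belong to @next. */
	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
		    next->context.ctx_id)
			continue;

		*new_asid = asid;
		/* Flush only if the TLB's view of this mm is stale. */
		*need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
			       next_tlb_gen);
		return;
	}

	/* No slot matches: hand out a fresh ASID round-robin and flush it. */
	*new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
	if (*new_asid >= TLB_NR_DYN_ASIDS) {
		*new_asid = 0;
		this_cpu_write(cpu_tlbstate.next_asid, 1);
	}
	*need_flush = true;
}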
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6397275008db..d23e61dc0640 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,12 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define TLB_NR_DYN_ASIDS	6
+
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -95,6 +101,8 @@ struct tlb_state {
 	 * mode even if we've already switched back to swapper_pg_dir.
 	 */
 	struct mm_struct *loaded_mm;
+	u16 loaded_mm_asid;
+	u16 next_asid;
 
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -104,7 +112,8 @@ struct tlb_state {
 
 	/*
 	 * This is a list of all contexts that might exist in the TLB.
-	 * Since we don't yet use PCID, there is only one context.
+	 * There is one per ASID that we use, and the ASID (what the
+	 * CPU calls PCID) is the index into ctxts.
 	 *
 	 * For each context, ctx_id indicates which mm the TLB's user
 	 * entries came from.  As an invariant, the TLB will never
@@ -114,8 +123,13 @@ struct tlb_state {
 	 * To be clear, this means that it's legal for the TLB code to
 	 * flush the TLB without updating tlb_gen.  This can happen
 	 * (for now, at least) due to paravirt remote flushes.
+	 *
+	 * NB: context 0 is a bit special, since it's also used by
+	 * various bits of init code.  This is fine -- code that
+	 * isn't aware of PCID will end up harmlessly flushing
+	 * context 0.
 	 */
-	struct tlb_context ctxs[1];
+	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
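A note on how the chosen ASID is consumed: with CR4.PCIDE set, bit 63
of the value written to CR3 tells the CPU not to invalidate any TLB
entries on the write, so entries tagged with the incoming PCID are
preserved.  A sketch of the switch_mm_irqs_off() fragment that pairs
with the selection above -- CR3_NOFLUSH is shown as a local define
here, and the surrounding tlb.c code is illustrative:

	#define CR3_NOFLUSH	(1UL << 63)

	if (need_flush) {
		/* Claim the slot for the new mm, then do a flushing CR3 write. */
		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id,
			       next->context.ctx_id);
		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen,
			       next_tlb_gen);
		write_cr3(__pa(next->pgd) | new_asid);
	} else {
		/* The slot's TLB entries are still valid: switch PGDs, keep them. */
		write_cr3(__pa(next->pgd) | new_asid | CR3_NOFLUSH);
	}

	this_cpu_write(cpu_tlbstate.loaded_mm, next);
	this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);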