| Commit message | Author | Age | Files | Lines |
|
WARNING: CPU: 12 PID: 4322 at /arch/powerpc/mm/pgtable-book3s64.c:76 set_pmd_at+0x4c/0x2b0
Modules linked in:
CPU: 12 PID: 4322 Comm: qemu-system-ppc Tainted: G W 4.19.0-rc3-00758-g8f0c636b0542 #36
NIP: c0000000000872fc LR: c000000000484eec CTR: 0000000000000000
REGS: c000003fba876fe0 TRAP: 0700 Tainted: G W (4.19.0-rc3-00758-g8f0c636b0542)
MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 24282884 XER: 00000000
CFAR: c000000000484ee8 IRQMASK: 0
GPR00: c000000000484eec c000003fba877268 c000000001f0ec00 c000003fbd229f80
GPR04: 00007c8fe8e00000 c000003f864c5a38 860300853e0000c0 0000000000000080
GPR08: 0000000080000000 0000000000000001 0401000000000080 0000000000000001
GPR12: 0000000000002000 c000003fffff5400 c000003fce292000 00007c9024570000
GPR16: 0000000000000000 0000000000ffffff 0000000000000001 c000000001885950
GPR20: 0000000000000000 001ffffc0004807c 0000000000000008 c000000001f49d05
GPR24: 00007c8fe8e00000 c0000000020f2468 ffffffffffffffff c000003fcd33b090
GPR28: 00007c8fe8e00000 c000003fbd229f80 c000003f864c5a38 860300853e0000c0
NIP [c0000000000872fc] set_pmd_at+0x4c/0x2b0
LR [c000000000484eec] do_huge_pmd_numa_page+0xb1c/0xc20
Call Trace:
[c000003fba877268] [c00000000045931c] mpol_misplaced+0x1bc/0x230 (unreliable)
[c000003fba8772c8] [c000000000484eec] do_huge_pmd_numa_page+0xb1c/0xc20
[c000003fba877398] [c00000000040d344] __handle_mm_fault+0x5e4/0x2300
[c000003fba8774d8] [c00000000040f400] handle_mm_fault+0x3a0/0x420
[c000003fba877528] [c0000000003ff6f4] __get_user_pages+0x2e4/0x560
[c000003fba877628] [c000000000400314] get_user_pages_unlocked+0x104/0x2a0
[c000003fba8776c8] [c000000000118f44] __gfn_to_pfn_memslot+0x284/0x6a0
[c000003fba877748] [c0000000001463a0] kvmppc_book3s_radix_page_fault+0x360/0x12d0
[c000003fba877838] [c000000000142228] kvmppc_book3s_hv_page_fault+0x48/0x1300
[c000003fba877988] [c00000000013dc08] kvmppc_vcpu_run_hv+0x1808/0x1b50
[c000003fba877af8] [c000000000126b44] kvmppc_vcpu_run+0x34/0x50
[c000003fba877b18] [c000000000123268] kvm_arch_vcpu_ioctl_run+0x288/0x2d0
[c000003fba877b98] [c00000000011253c] kvm_vcpu_ioctl+0x1fc/0x8c0
[c000003fba877d08] [c0000000004e9b24] do_vfs_ioctl+0xa44/0xae0
[c000003fba877db8] [c0000000004e9c44] ksys_ioctl+0x84/0xf0
[c000003fba877e08] [c0000000004e9cd8] sys_ioctl+0x28/0x80
We removed the pte_protnone check earlier with the understanding that we
mark the pte invalid before any set_pte/set_pmd usage. But the huge pmd
autonuma path still uses set_pmd_at directly. This is OK because a protnone
pte won't have a translation cached in the TLB.
Fixes: da7ad366b497 ("powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
This patch fixes the following warnings (obtained with make W=1).
arch/powerpc/mm/slice.c: In function 'slice_range_to_mask':
arch/powerpc/mm/slice.c:73:12: error: comparison is always true due to limited range of data type [-Werror=type-limits]
if (start < SLICE_LOW_TOP) {
^
arch/powerpc/mm/slice.c:81:20: error: comparison is always false due to limited range of data type [-Werror=type-limits]
if ((start + len) > SLICE_LOW_TOP) {
^
arch/powerpc/mm/slice.c: In function 'slice_mask_for_free':
arch/powerpc/mm/slice.c:136:17: error: comparison is always true due to limited range of data type [-Werror=type-limits]
if (high_limit <= SLICE_LOW_TOP)
^
arch/powerpc/mm/slice.c: In function 'slice_check_range_fits':
arch/powerpc/mm/slice.c:185:12: error: comparison is always true due to limited range of data type [-Werror=type-limits]
if (start < SLICE_LOW_TOP) {
^
arch/powerpc/mm/slice.c:195:39: error: comparison is always false due to limited range of data type [-Werror=type-limits]
if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) {
^
arch/powerpc/mm/slice.c: In function 'slice_scan_available':
arch/powerpc/mm/slice.c:306:11: error: comparison is always true due to limited range of data type [-Werror=type-limits]
if (addr < SLICE_LOW_TOP) {
^
arch/powerpc/mm/slice.c: In function 'get_slice_psize':
arch/powerpc/mm/slice.c:709:11: error: comparison is always true due to limited range of data type [-Werror=type-limits]
if (addr < SLICE_LOW_TOP) {
^
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
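As a hedged illustration (not necessarily the actual fix), the warning fires
because on 32-bit builds the compared variable's type can never reach the
constant; doing the comparison in a wider type keeps the test meaningful
without the warning. The SLICE_LOW_TOP value below is made up for the example:
#include <stdbool.h>
#include <stdint.h>
#define SLICE_LOW_TOP 0x100000000ull	/* illustrative: 4GB, above any 32-bit address */
/* With a 32-bit unsigned long, "addr < SLICE_LOW_TOP" is always true,
 * which is what -Wtype-limits flags.  Widening first avoids that. */
bool slice_addr_is_low(unsigned long addr)
{
	uint64_t tmp = addr;
	return tmp < SLICE_LOW_TOP;
}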
|
This patch fixes the following warnings (obtained with make W=1).
arch/powerpc/mm/slice.c: At top level:
arch/powerpc/mm/slice.c:682:15: error: no previous prototype for 'arch_get_unmapped_area' [-Werror=missing-prototypes]
unsigned long arch_get_unmapped_area(struct file *filp,
^
arch/powerpc/mm/slice.c:692:15: error: no previous prototype for 'arch_get_unmapped_area_topdown' [-Werror=missing-prototypes]
unsigned long arch_get_unmapped_area_topdown(struct file *filp,
^
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
Add a trace point for tlbia (Translation Lookaside Buffer Invalidate
All) instruction.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
Commit 0428491cba927 ("powerpc/mm: Trace tlbie(l) instructions")
added tracepoints for tlbie calls, but _tlbil_va() was forgotten.
Fixes: 0428491cba927 ("powerpc/mm: Trace tlbie(l) instructions")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
Since commit bd0dbb73e013 ("powerpc/mm/books3s: Add new pte bit to
mark pte temporarily invalid."), _PAGE_PRESENT doesn't mean exactly
that a page is present. A page is also considered present when
_PAGE_INVALID is set.
This patch changes the meaning of "present" and adds a "valid" status
associated with the _PAGE_PRESENT flag.
Fixes: bd0dbb73e013 ("powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
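A minimal sketch of the two states as a dump tool might report them; the
helper names are illustrative and the _PAGE_INVALID value is a placeholder
(the real assignments live in the book3s/64 headers):
#define _PAGE_PRESENT	0x8000000000000000UL	/* per the later _PAGE_PRESENT patch */
#define _PAGE_INVALID	0x0000000000000040UL	/* placeholder value */
/* "present": the entry exists, including a pte temporarily marked invalid. */
static inline int pte_entry_present(unsigned long pte)
{
	return !!(pte & (_PAGE_PRESENT | _PAGE_INVALID));
}
/* "valid": the entry is present and usable for translation right now. */
static inline int pte_entry_valid(unsigned long pte)
{
	return !!(pte & _PAGE_PRESENT);
}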
|
Back when I added -Werror in commit ba55bd74360e ("powerpc: Add
configurable -Werror for arch/powerpc") I did it by adding it to most
of the arch Makefiles.
At the time we excluded math-emu, because apparently it didn't build
cleanly. But that seems to have been fixed somewhere in the interim.
So move the -Werror addition to the top-level of the arch, this saves
us from repeating it in every Makefile and means we won't forget to
add it to any new sub-dirs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
Currently we limit the max addressable memory to 128TB. This patch increases
the limit to 2PB. We can have devices like nvdimm which add memory above the
512TB limit.
We still don't support regular system RAM above 512TB. One of the challenges
there is the percpu allocator, which allocates per-node memory and uses the
max distance between nodes as the percpu offset. This means that with a large
gap in the address space (system RAM above 1PB) we would run out of vmalloc
space to map the percpu allocations.
In order to support addressable memory above 512TB, the kernel should be able
to linearly map this range. To do that with hash translation we now add 4
contexts to the kernel linear map region. Our per-context addressable range
is 512TB. We still keep the VMALLOC and VMEMMAP regions at their old size.
The SLB miss handlers are updated to validate these limits.
We also limit this update to SPARSEMEM_VMEMMAP and SPARSEMEM_EXTREME
configurations.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
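A rough, hypothetical illustration of the arithmetic only; the macro names
and the region mask below are not the kernel's:
#define CTX_SHIFT	49	/* each hash context maps 2^49 = 512TB */
#define NR_LINEAR_CTXS	4	/* 4 x 512TB = 2PB of kernel linear map */
static inline unsigned int linear_map_ctx(unsigned long ea)
{
	/* offset of 'ea' within the 0xc000... linear map region */
	unsigned long off = ea & ~0xc000000000000000UL;
	return off >> CTX_SHIFT;	/* 0..3 */
}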
|
We will be adding get_kernel_context later. Update the function name to
indicate that this handles context allocation for user space addresses.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
This adds CONFIG_DEBUG_VM checks to ensure:
- The kernel stack is in the SLB after it's flushed and bolted.
- We don't insert an SLB for an address that is already in the SLB.
- The kernel SLB miss handler does not take an SLB miss.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
slb_flush_and_rebolt() is misleading: it is called in virtual mode, so
it cannot possibly change the stack, and therefore it should not be
touching the shadow area. And since vmalloc is no longer bolted, it
should not change any bolted mappings at all.
Change the name to slb_flush_and_restore_bolted(), and have it just
load the kernel stack from what's currently in the shadow SLB area.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.
Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.
With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security measures
disabled) increases from 140K/s to 178K/s (27%).
POWER8 does not change much (within 1%), it's unclear why it does not
see a big gain like POWER9.
Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
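A minimal sketch of such a round-robin preload cache; the structure and
function names are illustrative, not the kernel's:
#define SLB_PRELOAD_NR	16
struct slb_preload {
	unsigned char nr;			/* number of valid entries */
	unsigned char tail;			/* index of the oldest entry */
	unsigned long esid[SLB_PRELOAD_NR];	/* cached segment indexes */
};
/* Record the segment of the latest SLB miss at the head of the ring. */
static void preload_add(struct slb_preload *p, unsigned long ea)
{
	unsigned long esid = ea >> 28;	/* 256MB segment index, for illustration */
	unsigned char head;
	if (p->nr < SLB_PRELOAD_NR)
		p->nr++;
	else
		p->tail = (p->tail + 1) % SLB_PRELOAD_NR;	/* drop the oldest */
	head = (p->tail + p->nr - 1) % SLB_PRELOAD_NR;
	p->esid[head] = esid;
}
/* Called every 256 context switches: age out the oldest entry so
 * segments that are no longer used stop being preloaded. */
static void preload_age(struct slb_preload *p)
{
	if (p->nr) {
		p->nr--;
		p->tail = (p->tail + 1) % SLB_PRELOAD_NR;
	}
}
At context switch, the switch-in path would then walk this ring and insert
an SLB entry for each cached segment.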
|
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
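A hedged sketch of how such bitmaps can drive allocation; the names and
layout are illustrative:
static unsigned int slb_used_bitmap;	/* bit n set: SLB slot n holds a valid entry */
static unsigned int slb_kernel_bitmap;	/* bit n set: that entry maps a kernel address */
static int slb_alloc_index(int round_robin_victim, int kernel)
{
	if (~slb_used_bitmap) {
		/* Prefer the first free tracked slot... */
		int i = __builtin_ctz(~slb_used_bitmap);
		slb_used_bitmap |= 1u << i;
		if (kernel)
			slb_kernel_bitmap |= 1u << i;
		return i;
	}
	/* ...and only fall back to round-robin eviction when all 32 are used. */
	return round_robin_victim;
}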
|
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.
This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.
Arbitrary kernel memory must not be accessed when handling kernel
space SLB misses, so care should be taken there. However user SLB
misses can access any kernel memory, which can be used to move some
fields out of the paca (in later patches).
User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except, however that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.
[ Credit to Aneesh for bug fixes, error checks, and improvements to
bad address handling, etc ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Disallow tracing for all of slb.c for now.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
_PAGE_PRIVILEGED corresponds to the SH bit, which doesn't protect
against user access but only disables ASID verification on kernel
accesses. User access is controlled with the _PMD_USER flag.
Name it _PAGE_SH instead of _PAGE_PRIVILEGED.
_PAGE_HUGE corresponds to the SPS bit, which doesn't really tell
that it is a huge page but only that it is not a 4k page.
Name it _PAGE_SPS instead of _PAGE_HUGE.
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
To reduce the complexity of flag_array, and to allow the removal of
the default 0 value for non-existing flags, let's have one flag_array
table for each platform family containing only the flags that really exist.
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
Get rid of platform-specific _PAGE_XXXX flags in powerpc common code and
use helpers instead.
mm/dump_linuxpagetables.c will be handled separately.
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
The 'access' parameter of hash_preload() is either 0 or _PAGE_EXEC.
Among the two versions of hash_preload(), only the PPC64 one does
anything with this 'access' parameter.
In order to remove the use of _PAGE_EXEC outside platform code, the
'access' parameter is replaced by 'is_exec', which will be either
true or false, and the PPC64 version of hash_preload() builds
the access flag based on 'is_exec'.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
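A hedged sketch of what the 64-bit side can then do internally (other
parameters omitted; the book3s/64 _PAGE_* flags are assumed):
static void hash_preload(struct mm_struct *mm, unsigned long ea, bool is_exec)
{
	/* Rebuild the platform access mask from the generic boolean. */
	unsigned long access = _PAGE_PRESENT | _PAGE_READ;
	if (is_exec)
		access |= _PAGE_EXEC;
	/* ... the existing preload logic keeps using 'access' as before ... */
}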
|
book3s/32 doesn't define _PAGE_EXEC, so no need to use it there.
All other platforms define _PAGE_EXEC, so no need to check
that it is non-zero when not on book3s/32.
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
In order to avoid multiple conversions, hand a pgprot_t over directly
to map_kernel_page(), as already done for radix.
Do the same for __ioremap_caller() and __ioremap_at().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
Set PAGE_KERNEL directly in the caller and do not rely on a
hack adding PAGE_KERNEL flags when _PAGE_PRESENT is not set.
As already done for PPC64, use pgprot_cache() helpers instead of
_PAGE_XXX flags in PPC32 ioremap() derived functions.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
Other arches have ioremap_wt() to map IO areas write-through.
Implement it on PPC as well in order to avoid drivers using
__ioremap(_PAGE_WRITETHRU).
Also implement ioremap_coherent() to avoid drivers using
__ioremap(_PAGE_COHERENT).
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
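Sketch of the intended driver-side change; the resource variables are
placeholders:
/* Before: a powerpc-only flag leaks into the driver. */
regs = __ioremap(res->start, resource_size(res), _PAGE_WRITETHRU);
/* After: portable helpers, no platform page flags. */
regs = ioremap_wt(res->start, resource_size(res));
buf  = ioremap_coherent(res->start, resource_size(res));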
|
The powerpc mobility code may receive RTAS requests to perform PRRN
(Platform Resource Reassignment Notification) topology changes at any
time, including during LPAR migration operations.
In some configurations where the affinity of CPUs or memory is being
changed on that platform, the PRRN requests may apply or refer to
outdated information prior to the complete update of the device-tree.
This patch changes the duration for which topology updates are
suppressed during LPAR migrations from just the rtas_ibm_suspend_me()
/ 'ibm,suspend-me' call(s) to cover the entire migration_store()
operation to allow all changes to the device-tree to be applied prior
to accepting and applying any PRRN requests.
For tracking purposes, pr_info notices are added to the functions
start_topology_update() and stop_topology_update() of 'numa.c'.
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|\
Merge our fixes branch. It has a few important fixes that are needed for
further testing and also some commits that will conflict with content in
next.
| |
With commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), a kdump kernel on a shared LPAR may crash.
The necessary conditions are
- Shared LPAR with at least 2 nodes having memory and CPUs.
- Memory requirement for kdump kernel must be met by the first N-1
nodes where there are at least N nodes with memory and CPUs.
Example numactl of such a machine.
$ numactl -H
available: 5 nodes (0,2,5-7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 255 MB
node 2 free: 189 MB
node 5 cpus: 24 25 26 27 28 29 30 31
node 5 size: 4095 MB
node 5 free: 4024 MB
node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 6 size: 6353 MB
node 6 free: 5998 MB
node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
node 7 size: 7640 MB
node 7 free: 7164 MB
node distances:
node 0 2 5 6 7
0: 10 40 40 40 40
2: 40 10 40 40 40
5: 40 40 10 40 40
6: 40 40 40 10 20
7: 40 40 40 20 10
Steps to reproduce.
1. Load / start kdump service.
2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger)
When booting a kdump kernel with 2048M:
kexec: Starting switchover sequence.
I'm in purgatory
Using 1TB segments
hash-mmu: Initializing hash mmu with SLB
Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
Found initrd at 0xc000000009e70000:0xc00000000ae554b4
Using pSeries machine description
-----------------------------------------------------
ppc64_pft_size = 0x1e
phys_mem_size = 0x88000000
dcache_bsize = 0x80
icache_bsize = 0x80
cpu_features = 0x000000ff8f5d91a7
possible = 0x0000fbffcf5fb1a7
always = 0x0000006f8b5c91a1
cpu_user_features = 0xdc0065c2 0xef000000
mmu_features = 0x7c006001
firmware_features = 0x00000007c45bfc57
htab_hash_mask = 0x7fffff
physical_start = 0x8000000
-----------------------------------------------------
numa: NODE_DATA [mem 0x87d5e300-0x87d67fff]
numa: NODE_DATA(0) on node 6
numa: NODE_DATA [mem 0x87d54600-0x87d5e2ff]
Top of RAM: 0x88000000, Total RAM: 0x88000000
Memory hole size: 0MB
Zone ranges:
DMA [mem 0x0000000000000000-0x0000000087ffffff]
DMA32 empty
Normal empty
Movable zone start for each node
Early memory node ranges
node 6: [mem 0x0000000000000000-0x0000000087ffffff]
Could not find start_pfn for node 0
Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
On node 0 totalpages: 0
Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
On node 6 totalpages: 34816
Unable to handle kernel paging request for data at address 0x00000060
Faulting instruction address: 0xc000000008703a54
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
NIP: c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000
REGS: c00000000b673440 TRAP: 0380 Not tainted (4.19.0-rc5-master+)
MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24022022 XER: 20000002
CFAR: c0000000086fc238 IRQMASK: 0
GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220
GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008
GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
NIP [c000000008703a54] bus_add_device+0x84/0x1e0
LR [c000000008703a38] bus_add_device+0x68/0x1e0
Call Trace:
[c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
[c00000000b673740] [c000000008700194] device_add+0x454/0x7c0
[c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240
[c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180
[c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90
[c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
[c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580
[c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108
[c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c
[c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450
[c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160
[c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
Instruction dump:
60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378
---[ end trace 593577668c2daa65 ]---
However a regular kernel with 4096M (2048 gets reserved for crash
kernel) boots properly.
Unlike regular kernels, which mark all available nodes as online, the
kdump kernel only marks just enough nodes as online and marks the rest
as offline at boot. However, the kdump kernel boots with all available
CPUs. With commit 2ea626306810 ("powerpc/topology: Get topology for
shared processors at boot"), all CPUs are onlined on their respective
nodes at boot time. try_online_node() tries to online the offline
nodes but fails, as the needed subsystems are not yet initialized.
As part of the fix, detect and skip early onlining of an offline node.
Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Reported-by: Pavithra Prakash <pavrampu@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
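A hedged sketch of the kind of guard this describes, using the generic NUMA
helpers; the exact placement in find_and_online_cpu_nid() is illustrative:
/* The kdump kernel deliberately leaves most nodes offline, and this runs
 * before the subsystems needed by try_online_node() are ready, so skip
 * the early onlining and fall back to an already-online node. */
if (!node_online(new_nid))
	new_nid = first_online_node;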
| |
Currently, associativity is used to look up the node-id even if the
preceding VPHN hcall failed. However, this can cause a CPU to be made
part of the wrong node (most likely node 0). This is because
VPHN is not enabled on KVM guests.
With 2ea6263 ("powerpc/topology: Get topology for shared processors at
boot"), that associativity ends up setting the wrong node. Hence the KVM
guest topology is broken.
For example, a 4-node KVM guest previously reported:
[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 1746 MB
node 0 free: 1604 MB
node 1 cpus: 4 5 6 7
node 1 size: 2044 MB
node 1 free: 1765 MB
node 2 cpus: 8 9 10 11
node 2 size: 2044 MB
node 2 free: 1837 MB
node 3 cpus: 12 13 14 15
node 3 size: 2044 MB
node 3 free: 1903 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10
Would now report:
[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 1746 MB
node 0 free: 1244 MB
node 1 cpus:
node 1 size: 2044 MB
node 1 free: 2032 MB
node 2 cpus: 1
node 2 size: 2044 MB
node 2 free: 2028 MB
node 3 cpus:
node 3 size: 2044 MB
node 3 free: 2032 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10
Fix this by skipping associativity lookup if the VPHN hcall failed.
Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
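A hedged sketch of the idea, assuming helpers along the lines of hcall_vphn()
and associativity_to_nid() and variables as in the surrounding code:
rc = hcall_vphn(cpu, associativity);
if (rc != H_SUCCESS)
	return NUMA_NO_NODE;	/* hcall failed: the buffer is not meaningful */
return associativity_to_nid(associativity);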
| |
After migration of a powerpc LPAR, the kernel executes code to
update the system state to reflect new platform characteristics.
Such changes include modifications to device tree properties provided
to the system by PHYP. Property notifications received by the
post_mobility_fixup() code are passed along to the kernel in general
through a call to of_update_property() which in turn passes such
events back to all modules through entries like the '.notifier_call'
function within the NUMA module.
When the NUMA module updates its state, it resets its event timer. If
this occurs after a previous call to stop_topology_update(), or on a
system without VPHN enabled, the code runs into an uninitialized timer
structure and crashes. This patch adds a safety check along the path
leading to the problem code.
An example crash log is as follows.
ibmvscsi 30000081: Re-enabling adapter!
------------[ cut here ]------------
kernel BUG at kernel/time/timer.c:958!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
...
NIP mod_timer+0x4c/0x400
LR reset_topology_timer+0x40/0x60
Call Trace:
0xc0000003f9407830 (unreliable)
reset_topology_timer+0x40/0x60
dt_update_callback+0x100/0x120
notifier_call_chain+0x90/0x100
__blocking_notifier_call_chain+0x60/0x90
of_property_notify+0x90/0xd0
of_update_property+0x104/0x150
update_dt_property+0xdc/0x1f0
pseries_devicetree_update+0x2d0/0x510
post_mobility_fixup+0x7c/0xf0
migration_store+0xa4/0xc0
kobj_attr_store+0x30/0x60
sysfs_kf_write+0x64/0xa0
kernfs_fop_write+0x16c/0x240
__vfs_write+0x40/0x200
vfs_write+0xc8/0x240
ksys_write+0x5c/0x100
system_call+0x58/0x6c
Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated")
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
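A hedged sketch of such a safety check; the guard condition and symbol names
are illustrative of numa.c, not a verbatim quote of the fix:
static void reset_topology_timer(void)
{
	/* Only rearm the timer if start_topology_update() actually set it up;
	 * otherwise a device tree notification after stop_topology_update(),
	 * or on a system without VPHN, touches an uninitialized timer. */
	if (vphn_enabled)
		mod_timer(&topology_timer, jiffies + topology_timer_secs * HZ);
}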
| |
scan_pkey_feature() uses of_property_read_u32_array() to read the
ibm,processor-storage-keys property and calls be32_to_cpu() on the
value it gets. The problem is that of_property_read_u32_array() already
returns the value converted to the CPU byte order.
The value of pkeys_total ends up more or less sane because there's a min()
call in pkey_initialize() which reduces pkeys_total to 32. So in practice
the kernel ignores the fact that the hypervisor reserved one key for
itself (the device tree advertises 31 keys in my test VM).
This is wrong, but the effect in practice is that when a process tries to
allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC, which
would indicate that there aren't any keys available.
Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
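A hedged sketch of the corrected read; the node and variable names are
placeholders:
u32 vals[2];
if (of_property_read_u32_array(pkey_node, "ibm,processor-storage-keys",
			       vals, ARRAY_SIZE(vals)))
	return 0;
/* of_property_read_u32_array() already returns CPU byte order; calling
 * be32_to_cpu() on vals[] would double-swap on little-endian. */
pkeys_total = vals[0];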
| |
This stops us from doing code patching in init sections after they've
been freed.
In this chain:
kvm_guest_init() ->
kvm_use_magic_page() ->
fault_in_pages_readable() ->
__get_user() ->
__get_user_nocheck() ->
barrier_nospec();
We have a code patching location at barrier_nospec() and
kvm_guest_init() is an init function. This whole chain gets inlined,
so when we free the init section (hence kvm_guest_init()), this code
goes away and hence should no longer be patched.
We have seen this as userspace memory corruption when using a memory
checker while doing partition migration testing on PowerVM (this
starts the code patching post migration via
/sys/kernel/mobility/migration). In theory, it could also happen when
using /sys/kernel/debug/powerpc/barrier_nospec.
Cc: stable@vger.kernel.org # 4.13+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
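A hedged sketch of the kind of guard this implies, assuming a flag such as
init_mem_is_free that is set when the init sections are freed:
/* In patch_instruction(), before writing to 'addr': */
if (init_mem_is_free && init_section_contains(addr, 4))
	return 0;	/* init text is gone; patching it would corrupt reused pages */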
| |
Local radix TLB flush operations that operate on congruence classes
have explicit ERAT flushes for POWER9. The process scoped LPID flush
did not have a flush, so add it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
with POWER9, and the result is undefined on earlier CPUs.
Commits 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and
d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on
POWER9") caused POWER7/8 code to use this instruction. Remove it. An
ERAT flush can be made by invalidating the SLB, but before POWER9 that
requires a flush and rebolt.
Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
When enumerating page size definitions to check hardware support,
we construct a constant which is (1U << (def->shift - 10)).
However, the array of page size definitions is only initialised for
various MMU_PAGE_* constants, so it contains a number of 0-initialised
elements with def->shift == 0. This means we end up shifting by a
very large number, which gives the following UBSan splat:
================================================================================
UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
Call Trace:
[c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
[c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
[c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
[c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
[c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
[c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
================================================================================
Fix this by first checking if the element exists (shift != 0) before
constructing the constant.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
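A hedged sketch of the guard; the hardware-support check itself is elided:
for (int psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
	struct mmu_psize_def *def = &mmu_psize_defs[psize];
	/* Uninitialised slots have shift == 0; (1U << (0 - 10)) is UB. */
	if (!def->shift)
		continue;
	/* ... safe to compute 1U << (def->shift - 10) and check support ... */
}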
| |
When a process allocates a hugepage, the following leak is
reported by kmemleak. This is a false positive which is
due to the pointer to the table being stored in the PGD
as a physical memory address rather than a virtual memory pointer.
unreferenced object 0xc30f8200 (size 512):
comm "mmap", pid 374, jiffies 4872494 (age 627.630s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<e32b68da>] huge_pte_alloc+0xdc/0x1f8
[<9e0df1e1>] hugetlb_fault+0x560/0x8f8
[<7938ec6c>] follow_hugetlb_page+0x14c/0x44c
[<afbdb405>] __get_user_pages+0x1c4/0x3dc
[<b8fd7cd9>] __mm_populate+0xac/0x140
[<3215421e>] vm_mmap_pgoff+0xb4/0xb8
[<c148db69>] ksys_mmap_pgoff+0xcc/0x1fc
[<4fcd760f>] ret_from_syscall+0x0/0x38
See commit a984506c542e2 ("powerpc/mm: Don't report PUDs as
memory leaks when using kmemleak") for detailed explanation.
To fix that, this patch tells kmemleak to ignore the allocated
hugepage table.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
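A hedged sketch of the fix pattern at the allocation site; the variable names
around the hugepd allocation are assumed:
#include <linux/kmemleak.h>
new = kmem_cache_zalloc(cachep, pgtable_gfp_flags(mm, GFP_KERNEL));
if (!new)
	return -ENOMEM;
/* Only the physical address of this table is stored in the PGD, so
 * kmemleak sees no pointer to it; tell it not to report the object. */
kmemleak_ignore(new);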
| |
Make sure we are operating on THP and hugetlb entries in the respective hash
fault handling routines.
No functional change in this patch. If we walked the table wrongly before, we
will retry the access.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
Update a few code paths to check for pmd_large.
set_pmd_at:
We want to use this to store a swap pte at the pmd level. For swap ptes we
don't want to set H_PAGE_THP_HUGE. Hence check for pmd_large in set_pmd_at.
This removes the false WARN_ON when using this with a swap pmd entry.
pmd_page:
We don't really use it on pmd migration entries, but it can also work with
migration entries and we don't differentiate at the pte level. Hence update
pmd_page to work with pmd migration entries too.
__find_linux_pte:
The lockless page table walk needs to handle pmd migration entries; the
pmd_trans_huge check will return false on them. We don't set thp = 1 for
such entries, but we do update hpage_shift correctly. Without this we would
walk pmd migration entries as a pte page pointer, which is wrong.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
This makes the hugetlb directory pointer similar to other page table entries.
A hugepd entry is identified by the lack of the _PAGE_PTE bit and by the
directory size stored in HUGEPD_SHIFT_MASK. We update that to also look at
_PAGE_PRESENT.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
With this patch we use 0x8000000000000000UL (_PAGE_PRESENT) to indicate a valid
pgd/pud/pmd entry. We also switch the p**_present() helpers to look at this bit.
With pmd_present we have a special case: we need to make sure a pmd marked
invalid during THP split is still considered present. Right now we clear the
_PAGE_PRESENT bit during pmdp_invalidate. In order to handle this special
case we add a new pte bit, _PAGE_INVALID (mapped to _RPAGE_SW0). This bit is
only used with _PAGE_PRESENT cleared, hence we are not really losing a pte bit
for this special case. pmd_present is also updated to look at _PAGE_INVALID.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
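A hedged sketch of the resulting check, mirroring the description above:
static inline int pmd_present(pmd_t pmd)
{
	/* A pmd invalidated during THP split has _PAGE_PRESENT cleared and
	 * _PAGE_INVALID set, but must still be treated as present. */
	if (pmd_raw(pmd) & cpu_to_be64(_PAGE_PRESENT | _PAGE_INVALID))
		return true;
	return false;
}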
| |
This reverts commits:
5e46e29e6a97 ("powerpc/64s/hash: convert SLB miss handlers to C")
8fed04d0f6ae ("powerpc/64s/hash: remove user SLB data from the paca")
655deecf67b2 ("powerpc/64s/hash: SLB allocation status bitmaps")
2e1626744e8d ("powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup")
89ca4e126a3f ("powerpc/64s/hash: Add a SLB preload cache")
This series had a few bugs, and the fixes are not all trivial. So
revert most of it for now.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.
Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.
With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security measures
disabled) increases from 140K/s to 178K/s (27%).
POWER8 does not change much (within 1%), it's unclear why it does not
see a big gain like POWER9.
Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
User SLB mapping data is copied into the PACA from the mm->context so
it can be accessed by the SLB miss handlers.
After the C conversion, SLB miss handlers now run with relocation on,
and user SLB misses are able to take recursive kernel SLB misses, so
the user SLB mapping data can be removed from the paca and accessed
directly.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.
This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.
Arbitrary kernel memory may not be accessed when handling kernel space
SLB misses, so care should be taken there. However user SLB misses can
access any kernel memory, which can be used to move some fields out of
the paca (in later patches).
User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except, however that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.
[ Credit to Aneesh for bug fixes, error checks, and improvements to bad
address handling, etc ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Since RFC:
- Added MSR[RI] handling
- Fixed up a register loss bug exposed by irq tracing (Aneesh)
- Reject misses outside the defined kernel regions (Aneesh)
- Added several more sanity checks and error handling (Aneesh); we may
look at consolidating these tests and tightening up the code, but for
a first pass we decided it's better to check carefully.
Since v1:
- Fixed SLB cache corruption (Aneesh)
- Fixed untidy SLBE allocation "leak" in get_vsid error case
- Now survives some stress testing on real hardware
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
POWER9 introduces SLBIA IH=3, which invalidates all SLB entries and
associated lookaside information that have a class value of 1, which
Linux assigns to user addresses. This matches what switch_slb wants,
and allows a simple fast implementation that avoids the slb_cache
complexity.
As a side-effect, the POWER5 < DD2.1 SLB invalidation workaround is
also avoided on POWER9.
Process context switching rate is improved about 2.2% for a small
process that hits the slb cache which is the best case for the current
code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
The SLBIA IH=1 hint will remove all non-zero SLBEs, but only
invalidate ERAT entries associated with a class value of 1, for
processors that support the hint (e.g., POWER6 and newer), which
Linux assigns to user addresses.
This prevents kernel ERAT entries from being invalidated when
context switching (if the thread faulted in more than 8 user SLBEs).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
Remove the vmalloc segment from bolted SLBEs. This is not required to
be bolted, and seems like it was added to help pre-load the SLB on
context switch. However there are now other segments like the vmemmap
segment and non-zero node memory that often take misses after a context
switch, so it is better to solve this in a more general way.
A subsequent change will track free SLB entries and use those rather
than round-robin overwriting valid entries, which makes it far less
likely for kernel SLBEs to be evicted after they are installed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
The POWER5 < DD2.1 issue is that slbie needs to be issued more than
once. It came in with this change:
ChangeSet@1.1608, 2004-04-29 07:12:31-07:00, david@gibson.dropbear.id.au
[PATCH] POWER5 erratum workaround
Early POWER5 revisions (<DD2.1) have a problem requiring slbie
instructions to be repeated under some circumstances. The patch below
adds a workaround (patch made by Anton Blanchard).
(aka. 3e4520f7605243abf66a7ccd3d2e49e48e8c0483 in the full history tree)
The extra slbie in switch_slb is done even for the case where slbia is
called (slb_flush_and_rebolt). I don't believe that is required
because there are other slb_flush_and_rebolt callers which do not
issue the workaround slbie, which would be broken if it was required.
It also seems to be fine inside the isync with the first slbie, as it
is in the kernel stack switch code.
So move this workaround to where it is required. This is not much of
an optimisation because this is the fast path, but it makes the code
more understandable and neater.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Retain slbie_data initialisation to avoid compiler warning]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
I only have POWER8/9 to test, so just remove it for those.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| |
This causes SLB allocation to start 1 beyond the start of the SLB.
There is no real problem because after it wraps it starts behaving
properly, it's just surprising to see when looking at SLB traces.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|