summaryrefslogtreecommitdiffstats
path: root/arch/powerpc
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2017-02-222-2/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge updates from Andrew Morton: "142 patches: - DAX updates - various misc bits - OCFS2 updates - most of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (142 commits) mm/z3fold.c: limit first_num to the actual range of possible buddy indexes mm: fix <linux/pagemap.h> stray kernel-doc notation zram: remove obsolete sysfs attrs mm/memblock.c: remove unnecessary log and clean up oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA mm: drop unused argument of zap_page_range() mm: drop zap_details::check_swap_entries mm: drop zap_details::ignore_dirty mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled mm: help __GFP_NOFAIL allocations which do not trigger OOM killer mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically mm: consolidate GFP_NOFAIL checks in the allocator slowpath lib/show_mem.c: teach show_mem to work with the given nodemask arch, mm: remove arch specific show_mem mm, page_alloc: warn_alloc print nodemask mm, page_alloc: do not report all nodes in show_mem Revert "mm: bail out in shrink_inactive_list()" mm, vmscan: consider eligible zones in get_scan_count mm, vmscan: cleanup lru size claculations mm, vmscan: do not count freed pages as PGDEACTIVATE ...
| * lib/show_mem.c: teach show_mem to work with the given nodemaskMichal Hocko2017-02-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * powerpc: do not make the entire heap executableDenys Vlasenko2017-02-221-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt, or with a toolchain which defaults to it) look like this: [17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4 [18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4 [19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4 Which results in an ELF load header: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000 This is all correct, the load region containing the PLT is marked as executable. Note that the PLT starts at 0002b00c but the file mapping ends at 0002aff8, so the PLT falls in the 0 fill section described by the load header, and after a page boundary. Unfortunately the generic ELF loader ignores the X bit in the load headers when it creates the 0 filled non-file backed mappings. It assumes all of these mappings are RW BSS sections, which is not the case for PPC. gcc/ld has an option (--secure-plt) to not do this, this is said to incur a small performance penalty. Currently, to support 32-bit binaries with PLT in BSS kernel maps *entire brk area* with executable rights for all binaries, even --secure-plt ones. Stop doing that. Teach the ELF loader to check the X bit in the relevant load header and create 0 filled anonymous mappings that are executable if the load header requests that. Test program showing the difference in /proc/$PID/maps: int main() { char buf[16*1024]; char *p = malloc(123); /* make "[heap]" mapping appear */ int fd = open("/proc/self/maps", O_RDONLY); int len = read(fd, buf, sizeof(buf)); write(1, buf, len); printf("%p\n", p); return 0; } Compiled using: gcc -mbss-plt -m32 -Os test.c -otest Unpatched ppc64 kernel: 00100000-00120000 r-xp 00000000 00:00 0 [vdso] 0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so 10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test 10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test 10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test 10690000-106c0000 rwxp 00000000 00:00 0 [heap] f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack] 0x10690008 Patched ppc64 kernel: 00100000-00120000 r-xp 00000000 00:00 0 [vdso] 0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so 10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test 10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test 10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test 10180000-101b0000 rw-p 00000000 00:00 0 [heap] ^^^^ this has changed f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so ff860000-ff890000 rw-p 00000000 00:00 0 [stack] 0x10180008 The patch was originally posted in 2012 by Jason Gunthorpe and apparently ignored: https://lkml.org/lkml/2012/9/30/138 Lightly run-tested. Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2017-02-2216-378/+930
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "4.11 is going to be a relatively large release for KVM, with a little over 200 commits and noteworthy changes for most architectures. ARM: - GICv3 save/restore - cache flushing fixes - working MSI injection for GICv3 ITS - physical timer emulation MIPS: - various improvements under the hood - support for SMP guests - a large rewrite of MMU emulation. KVM MIPS can now use MMU notifiers to support copy-on-write, KSM, idle page tracking, swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is also supported, so that writes to some memory regions can be treated as MMIO. The new MMU also paves the way for hardware virtualization support. PPC: - support for POWER9 using the radix-tree MMU for host and guest - resizable hashed page table - bugfixes. s390: - expose more features to the guest - more SIMD extensions - instruction execution protection - ESOP2 x86: - improved hashing in the MMU - faster PageLRU tracking for Intel CPUs without EPT A/D bits - some refactoring of nested VMX entry/exit code, preparing for live migration support of nested hypervisors - expose yet another AVX512 CPUID bit - host-to-guest PTP support - refactoring of interrupt injection, with some optimizations thrown in and some duct tape removed. - remove lazy FPU handling - optimizations of user-mode exits - optimizations of vcpu_is_preempted() for KVM guests generic: - alternative signaling mechanism that doesn't pound on tsk->sighand->siglock" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits) x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 x86/paravirt: Change vcp_is_preempted() arg type to long KVM: VMX: use correct vmcs_read/write for guest segment selector/base x86/kvm/vmx: Defer TR reload after VM exit x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss x86/kvm/vmx: Simplify segment_base() x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels x86/kvm/vmx: Don't fetch the TSS base from the GDT x86/asm: Define the kernel TSS limit in a macro kvm: fix page struct leak in handle_vmon KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now KVM: Return an error code only as a constant in kvm_get_dirty_log() KVM: Return an error code only as a constant in kvm_get_dirty_log_protect() KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl() KVM: x86: remove code for lazy FPU handling KVM: race-free exit from KVM_RUN without POSIX signals KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message KVM: PPC: Book3S PR: Ratelimit copy data failure error messages KVM: Support vCPU-based gfn->hva cache KVM: use separate generations for each address space ...
| * Merge branch 'kvm-ppc-next' of ↵Paolo Bonzini2017-02-204-5/+17
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD Paul Mackerras writes: "Please do a pull from my kvm-ppc-next branch to get some fixes which I would like to have in 4.11. There are four small commits there; two are fixes for potential host crashes in the new HPT resizing code, and the other two are changes to printks to make KVM on PPC a little less noisy."
| | * KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for nowPaul Mackerras2017-02-182-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new HPT resizing code added in commit b5baa6877315 ("KVM: PPC: Book3S HV: KVM-HV HPT resizing implementation", 2016-12-20) doesn't have code to handle the new HPTE format which POWER9 uses. Thus it would be best not to advertise it to userspace on POWER9 systems until it works properly. Also, since resize_hpt_rehash_hpte() contains BUG_ON() calls that could be hit on POWER9, let's prevent it from being called on POWER9 for now. Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug messageThomas Huth2017-02-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The average user likely does not know what a "htab" or "LPID" is, and it's annoying that these messages are quickly filling the dmesg log when you're doing boot cycle tests, so let's turn it into a debug message instead. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S PR: Ratelimit copy data failure error messagesVipin K Parashar2017-02-172-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_ppc_mmu_book3s_32/64 xlat() logs "KVM can't copy data" error upon failing to copy user data to kernel space. This floods kernel log once such fails occur in short time period. Ratelimit this error to avoid flooding kernel logs upon copy data failures. Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: Prevent double-free on HPT resize commit pathDavid Gibson2017-02-161-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | resize_hpt_release(), called once the HPT resize of a KVM guest is completed (successfully or unsuccessfully) frees the state structure for the resize. It is currently not safe to call with a NULL pointer. However, one of the error paths in kvm_vm_ioctl_resize_hpt_commit() can invoke it with a NULL pointer. This will occur if userspace improperly invokes KVM_PPC_RESIZE_HPT_COMMIT without previously calling KVM_PPC_RESIZE_HPT_PREPARE, or if it calls COMMIT twice without an intervening PREPARE. To fix this potential crash bug - and maybe others like it, make it safe (and a no-op) to call resize_hpt_release() with a NULL resize pointer. Found by Dan Carpenter with a static checker. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: race-free exit from KVM_RUN without POSIX signalsPaolo Bonzini2017-02-171-1/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick" a VCPU out of KVM_RUN through a POSIX signal. A signal is attached to a dummy signal handler; by blocking the signal outside KVM_RUN and unblocking it inside, this possible race is closed: VCPU thread service thread -------------------------------------------------------------- check flag set flag raise signal (signal handler does nothing) KVM_RUN However, one issue with KVM_SET_SIGNAL_MASK is that it has to take tsk->sighand->siglock on every KVM_RUN. This lock is often on a remote NUMA node, because it is on the node of a thread's creator. Taking this lock can be very expensive if there are many userspace exits (as is the case for SMP Windows VMs without Hyper-V reference time counter). As an alternative, we can put the flag directly in kvm_run so that KVM can see it: VCPU thread service thread -------------------------------------------------------------- raise signal signal handler set run->immediate_exit KVM_RUN check run->immediate_exit Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * Merge branch 'kvm-ppc-next' of ↵Paolo Bonzini2017-02-159-87/+56
| |\ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD This brings in two fixes for potential host crashes, from Ben Herrenschmidt and Nick Piggin.
| | * KVM: PPC: Book 3S: Fix error return in kvm_vm_ioctl_create_spapr_tce()Wei Yongjun2017-02-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Fix to return error code -ENOMEM from the memory alloc error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras2017-02-088-87/+55
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | This merges in a fix which touches both PPC and KVM code, which was therefore put into a topic branch in the powerpc tree. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | \ Merge tag 'kvmarm-for-4.11' of ↵Paolo Bonzini2017-02-0920-58/+131
| |\ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD kvmarm updates for 4.11 - GICv3 save restore - Cache flushing fixes - MSI injection fix for GICv3 ITS - Physical timer emulation support
| * | | KVM: PPC: Book3S HV: Advertise availablity of HPT resizing on KVM HVDavid Gibson2017-01-311-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This updates the KVM_CAP_SPAPR_RESIZE_HPT capability to advertise the presence of in-kernel HPT resizing on KVM HV. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: KVM-HV HPT resizing implementationDavid Gibson2017-01-311-1/+187
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the "guts" of the implementation for the HPT resizing PAPR extension. It has the code to allocate and clear a new HPT, rehash an existing HPT's entries into it, and accomplish the switchover for a KVM guest from the old HPT to the new one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementationDavid Gibson2017-01-314-0/+223
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a not yet working outline of the HPT resizing PAPR extension. Specifically it adds the necessary ioctl() functions, their basic steps, the work function which will handle preparation for the resize, and synchronization between these, the guest page fault path and guest HPT update path. The actual guts of the implementation isn't here yet, so for now the calls will always fail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Create kvmppc_unmap_hpte_helper()David Gibson2017-01-311-33/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kvm_unmap_rmapp() function, called from certain MMU notifiers, is used to force all guest mappings of a particular host page to be set ABSENT, and removed from the reverse mappings. For HPT resizing, we will have some cases where we want to set just a single guest HPTE ABSENT and remove its reverse mappings. To prepare with this, we split out the logic from kvm_unmap_rmapp() to evict a single HPTE, moving it to a new helper function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT sizeDavid Gibson2017-01-313-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page table (HPT) that userspace expects a guest VM to have, and is also used to clear that HPT when necessary (e.g. guest reboot). At present, once the ioctl() is called for the first time, the HPT size can never be changed thereafter - it will be cleared but always sized as from the first call. With upcoming HPT resize implementation, we're going to need to allow userspace to resize the HPT at reset (to change it back to the default size if the guest changed it). So, we need to allow this ioctl() to change the HPT size. This patch also updates Documentation/virtual/kvm/api.txt to reflect the new behaviour. In fact the documentation was already slightly incorrect since 572abd5 "KVM: PPC: Book3S HV: Don't fall back to smaller HPT size in allocation ioctl" Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Split HPT allocation from activationDavid Gibson2017-01-314-55/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT) and sets it up as the active page table for a VM. For the upcoming HPT resize implementation we're going to want to allocate HPTs separately from activating them. So, split the allocation itself out into kvmppc_allocate_hpt() and perform the activation with a new kvmppc_set_hpt() function. Likewise we split kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt() which unsets it as an active HPT, then frees it. We also move the logic to fall back to smaller HPT sizes if the first try fails into the single caller which used that behaviour, kvmppc_hv_setup_htab_rma(). This introduces a slight semantic change, in that previously if the initial attempt at CMA allocation failed, we would fall back to attempting smaller sizes with the page allocator. Now, we try first CMA, then the page allocator at each size. As far as I can tell this change should be harmless. To match, we make kvmppc_free_hpt() just free the actual HPT itself. The call to kvmppc_free_lpid() that was there, we move to the single caller. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Don't store values derivable from HPT orderDavid Gibson2017-01-314-26/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the kvm_hpt_info structure stores the hashed page table's order, and also the number of HPTEs it contains and a mask for its size. The last two can be easily derived from the order, so remove them and just calculate them as necessary with a couple of helper inlines. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Gather HPT related variables into sub-structureDavid Gibson2017-01-314-84/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the powerpc kvm_arch structure contains a number of variables tracking the state of the guest's hashed page table (HPT) in KVM HV. This patch gathers them all together into a single kvm_hpt_info substructure. This makes life more convenient for the upcoming HPT resizing implementation. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Rename kvm_alloc_hpt() for clarityDavid Gibson2017-01-313-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at all obvious from the name. In practice kvmppc_alloc_hpt() allocates an HPT by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate it with CMA only. To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma(). Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras2017-01-3131-183/+1496
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This merges in the POWER9 radix MMU host and guest support, which was put into a topic branch because it touches both powerpc and KVM code. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | | KVM: PPC: Book3S PR: Refactor program interrupt related code into separate ↵Thomas Huth2017-01-271-65/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | function The function kvmppc_handle_exit_pr() is quite huge and thus hard to read, and even contains a "spaghetti-code"-like goto between the different case labels of the big switch statement. This can be made much more readable by moving the code related to injecting program interrupts / instruction emulation into a separate function instead. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | | KVM: PPC: Book3S HV: Fix H_PROD to actually wake the target vcpuPaul Mackerras2017-01-271-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The H_PROD hypercall is supposed to wake up an idle vcpu. We have an implementation, but because Linux doesn't use it except when doing cpu hotplug, it was never tested properly. AIX does use it, and reported it broken. It turns out we were waking the wrong vcpu (the one doing H_PROD, not the target of the prod) and we weren't handling the case where the target needs an IPI to wake it. Fix it by using the existing kvmppc_fast_vcpu_kick_hv() function, which is intended for this kind of thing, and by using the target vcpu not the current vcpu. We were also not looking at the prodded flag when checking whether a ceded vcpu should wake up, so this adds checks for the prodded flag alongside the checks for pending exceptions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | | KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resendLi Zhong2017-01-272-51/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch improves the code that takes lock twice to check the resend flag and do the actual resending, by checking the resend flag locklessly, and add a boolean parameter check_resend to icp_[rm_]deliver_irq(), so the resend flag can be checked in the lock when doing the delivery. We need make sure when we clear the ics's bit in the icp's resend_map, we don't miss the resend flag of the irqs that set the bit. It could be ordered through the barrier in test_and_clear_bit(), and a newly added wmb between setting irq's resend flag, and icp's resend_map. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | | KVM: PPC: Book 3S: XICS: Implement ICS P/Q statesLi Zhong2017-01-274-71/+161
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements P(Presented)/Q(Queued) states for ICS irqs. When the interrupt is presented, set P. Present if P was not set. If P is already set, don't present again, set Q. When the interrupt is EOI'ed, move Q into P (and clear Q). If it is set, re-present. The asserted flag used by LSI is also incorporated into the P bit. When the irq state is saved, P/Q bits are also saved, they need some qemu modifications to be recognized and passed around to be restored. KVM_XICS_PENDING bit set and saved should also indicate KVM_XICS_PRESENTED bit set and saved. But it is possible some old code doesn't have/recognize the P bit, so when we restore, we set P for PENDING bit, too. The idea and much of the code come from Ben. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | | KVM: PPC: Book 3S: XICS: Fix potential issue with duplicate IRQ resendsLi Zhong2017-01-272-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible that in the following order, one irq is resent twice: CPU 1 CPU 2 ics_check_resend() lock ics_lock see resend set unlock ics_lock /* change affinity of the irq */ kvmppc_xics_set_xive() write_xive() lock ics_lock see resend set unlock ics_lock icp_deliver_irq() /* resend */ icp_deliver_irq() /* resend again */ It doesn't have any user-visible effect at present, but needs to be avoided when the following patch implementing the P/Q stuff is applied. This patch clears the resend flag before releasing the ics lock, when we know we will do a re-delivery after checking the flag, or setting the flag. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | | KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counterLi Zhong2017-01-271-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some counters are added in Commit 6e0365b78273 ("KVM: PPC: Book3S HV: Add ICP real mode counters"), to provide some performance statistics to determine whether further optimizing is needed for real mode functions. The n_reject counter counts how many times ICP rejects an irq because of priority in real mode. The redelivery of an lsi that is still asserted after eoi doesn't fall into this category, so the increasement there is removed. Also, it needs to be increased in icp_rm_deliver_irq() if it rejects another one. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | | KVM: PPC: Book 3S: XICS cleanup: remove XICS_RM_REJECTLi Zhong2017-01-272-11/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b0221556dbd3 ("KVM: PPC: Book3S HV: Move virtual mode ICP functions to real-mode") removed the setting of the XICS_RM_REJECT flag. And since that commit, nothing else sets the flag any more, so we can remove the flag and the remaining code that handles it, including the counter that counts how many times it get set. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | | KVM: PPC: Book3S HV: Don't try to signal cpu -1Paul Mackerras2017-01-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the target vcpu for kvmppc_fast_vcpu_kick_hv() is not running on any CPU, then we will have vcpu->arch.thread_cpu == -1, and as it happens, kvmppc_fast_vcpu_kick_hv will call kvmppc_ipi_thread with -1 as the cpu argument. Although this is not meaningful, in the past, before commit 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9", 2016-11-18), it was harmless because CPU -1 is not in the same core as any real CPU thread. On a POWER9, however, we don't do the "same core" check, so we were trying to do a msgsnd to thread -1, which is invalid. To avoid this, we add a check to see that vcpu->arch.thread_cpu is >= 0 before calling kvmppc_ipi_thread() with it. Since vcpu->arch.thread_vcpu can change asynchronously, we use READ_ONCE to ensure that the value we check is the same value that we use as the argument to kvmppc_ipi_thread(). Fixes: 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
* | | | | Merge tag 'powerpc-4.11-1' of ↵Linus Torvalds2017-02-22122-902/+3821
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for direct mapped LPC on POWER9, giving Linux direct access to devices that may be on there such as a UART. - Memory hotplug support for the Power9 Radix MMU. - Add new AUX vectors describing the processor's cache geometry, to be used by glibc. - The ability for a guest to ask the hypervisor to resize the guest's hash table, and in addition support for doing so automatically when memory is hotplugged into/out-of the guest. This allows the hash table to be sized based on the current memory usage of the guest, rather than the maximum possible memory usage. - Implementation of optprobes (kprobe optimisation) for powerpc. In addition there's the topic branch shared with the KVM tree, which includes support for guests to use the Radix MMU on Power9. Thanks to: Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton Blanchard, Benjamin Herrenschmidt, Chris Packham, Daniel Axtens, Daniel Borkmann, David Gibson, Finn Thain, Gautham R. Shenoy, Gavin Shan, Greg Kurz, Joel Stanley, John Allen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Reza Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun" * tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (129 commits) powerpc/mm/radix: Skip ptesync in pte update helpers powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mm powerpc/mm/radix: Update pte update sequence for pte clear case powerpc/mm: Update PROTFAULT handling in the page fault path powerpc/xmon: Fix data-breakpoint powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=y powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n powerpc/pseries: Fix typo in parameter description powerpc/kprobes: Remove kprobe_exceptions_notify() kprobes: Introduce weak variant of kprobe_exceptions_notify() powerpc/ftrace: Fix confusing help text for DISABLE_MPROFILE_KERNEL powerpc/powernv: Fix opal_exit tracepoint opcode powerpc: Add a prototype for mcount() so it can be versioned powerpc: Drop GPL from of_node_to_nid() export to match other arches powerpc/kprobes: Optimize kprobe in kretprobe_trampoline() powerpc/kprobes: Implement Optprobes powerpc/kprobes: Fixes for kprobe_lookup_name() on BE powerpc: Add helper to check if offset is within relative branch range powerpc/bpf: Introduce __PPC_SH64() ...
| * | | | | powerpc/mm/radix: Skip ptesync in pte update helpersAneesh Kumar K.V2017-02-151-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We do them at the start of tlb flush, and we are sure a pte update will be followed by a tlbflush. Hence we can skip the ptesync in pte update helpers. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | | powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mmAneesh Kumar K.V2017-02-152-1/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This helps us to do some optimization for application exit case, where we can skip the DD1 style pte update sequence. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | | powerpc/mm/radix: Update pte update sequence for pte clear caseAneesh Kumar K.V2017-02-151-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the kernel we do follow the below sequence in different code paths. pte = ptep_get_clear(ptep) .... set_pte_at(ptep, pte) We do that for mremap, autonuma protection update and softdirty clearing. This implies our optimization to skip a tlb flush when clearing a pte update is not valid, because for DD1 system that followup set_pte_at will be done witout doing the required tlbflush. Fix that by always doing the dd1 style pte update irrespective of new_pte value. In a later patch we will optimize the application exit case. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | | powerpc/mm: Update PROTFAULT handling in the page fault pathAneesh Kumar K.V2017-02-152-14/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With radix, we can get page fault with DSISR_PROTFAULT value set in case of PROT_NONE or autonuma mapping. The PROT_NONE case in handled by the vma check where we consider the access bad. For autonuma we should fall through and fixup the access mask correctly. Without this patch we trigger the WARN_ON() on radix. This code moves that WARN_ON() within a radix_enabled() check. I also moved the WARN_ON() outside the if condition making it apply for all type of faults (exec/write/read). It is also conditionalized for book3s, because BOOK3E can also get a PROTFAULT to handle the D/I cache sync. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | | powerpc/xmon: Fix data-breakpointRavi Bangoria2017-02-151-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently xmon data-breakpoint feature is broken. Whenever there is a watchpoint match occurs, hw_breakpoint_handler will be called by do_break via notifier chains mechanism. If watchpoint is registered by xmon, hw_breakpoint_handler won't find any associated perf_event and returns immediately with NOTIFY_STOP. Similarly, do_break also returns without notifying to xmon. Solve this by returning NOTIFY_DONE when hw_breakpoint_handler does not find any perf_event associated with matched watchpoint, rather than NOTIFY_STOP, which tells the core code to continue calling the other breakpoint handlers including the xmon one. Cc: stable@vger.kernel.org Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | | powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=yMichael Ellerman2017-02-151-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recently merged HPT (Hash Page Table) resize support broke the build when BOOK3S_64=n (ie. 32-bit or 64-bit Book3E) and MEMORY_HOTPLUG=y: arch/powerpc/mm/mem.o: In function `.arch_add_memory': (.text+0x4e4): undefined reference to `.resize_hpt_for_hotplug' Fix it by adding a dummy version. Fixes: 438cc81a41e8 ("powerpc/pseries: Automatically resize HPT for memory hot add/remove") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | | Merge branch 'topic/ppc-kvm' into nextMichael Ellerman2017-02-1435-249/+1530
| |\ \ \ \ \ | | | |_|_|/ | | |/| | | | | | | | | Merge the topic branch we're sharing with the kvm-ppc tree.
| | * | | | powerpc/powernv: Remove separate entry for OPAL real mode callsBenjamin Herrenschmidt2017-02-076-86/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All entry points already read the MSR so they can easily do the right thing. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | | powerpc/64: CONFIG_RELOCATABLE support for hmi interruptsNicholas Piggin2017-02-072-1/+9
| | | |/ / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The branch from hmi_exception_early to hmi_exception_realmode must use a "relocatable-style" branch, because it is branching from unrelocated exception code to beyond __end_interrupts. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | KVM: PPC: Book3S HV: Enable radix guest supportPaul Mackerras2017-01-315-27/+111
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a few last pieces of the support for radix guests: * Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and KVM_PPC_GET_RMMU_INFO ioctls for radix guests * On POWER9, allow secondary threads to be on/off-lined while guests are running. * Set up LPCR and the partition table entry for radix guests. * Don't allocate the rmap array in the kvm_memory_slot structure on radix. * Don't try to initialize the HPT for radix guests, since they don't have an HPT. * Take out the code that prevents the HV KVM module from initializing on radix hosts. At this stage, we only support radix guests if the host is running in radix mode, and only support HPT guests if the host is running in HPT mode. Thus a guest cannot switch from one mode to the other, which enables some simplifications. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | KVM: PPC: Book3S HV: Invalidate ERAT on guest entry/exit for POWER9 DD1Paul Mackerras2017-01-311-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On POWER9 DD1, we need to invalidate the ERAT (effective to real address translation cache) when changing the PIDR register, which we do as part of guest entry and exit. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | KVM: PPC: Book3S HV: Allow guest exit path to have MMU onPaul Mackerras2017-01-313-17/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we allow LPCR[AIL] to be set for radix guests, then interrupts from the guest to the host can be delivered by the hardware with relocation on, and thus the code path starting at kvmppc_interrupt_hv can be executed in virtual mode (MMU on) for radix guests (previously it was only ever executed in real mode). Most of the code is indifferent to whether the MMU is on or off, but the calls to OPAL that use the real-mode OPAL entry code need to be switched to use the virtual-mode code instead. The affected calls are the calls to the OPAL XICS emulation functions in kvmppc_read_one_intr() and related functions. We test the MSR[IR] bit to detect whether we are in real or virtual mode, and call the opal_rm_* or opal_* function as appropriate. The other place that depends on the MMU being off is the optimization where the guest exit code jumps to the external interrupt vector or hypervisor doorbell interrupt vector, or returns to its caller (which is __kvmppc_vcore_entry). If the MMU is on and we are returning to the caller, then we don't need to use an rfid instruction since the MMU is already on; a simple blr suffices. If there is an external or hypervisor doorbell interrupt to handle, we branch to the relocation-on version of the interrupt vector. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movementPaul Mackerras2017-01-314-14/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With radix, the guest can do TLB invalidations itself using the tlbie (global) and tlbiel (local) TLB invalidation instructions. Linux guests use local TLB invalidations for translations that have only ever been accessed on one vcpu. However, that doesn't mean that the translations have only been accessed on one physical cpu (pcpu) since vcpus can move around from one pcpu to another. Thus a tlbiel might leave behind stale TLB entries on a pcpu where the vcpu previously ran, and if that task then moves back to that previous pcpu, it could see those stale TLB entries and thus access memory incorrectly. The usual symptom of this is random segfaults in userspace programs in the guest. To cope with this, we detect when a vcpu is about to start executing on a thread in a core that is a different core from the last time it executed. If that is the case, then we mark the core as needing a TLB flush and then send an interrupt to any thread in the core that is currently running a vcpu from the same guest. This will get those vcpus out of the guest, and the first one to re-enter the guest will do the TLB flush. The reason for interrupting the vcpus executing on the old core is to cope with the following scenario: CPU 0 CPU 1 CPU 4 (core 0) (core 0) (core 1) VCPU 0 runs task X VCPU 1 runs core 0 TLB gets entries from task X VCPU 0 moves to CPU 4 VCPU 0 runs task X Unmap pages of task X tlbiel (still VCPU 1) task X moves to VCPU 1 task X runs task X sees stale TLB entries That is, as soon as the VCPU starts executing on the new core, it could unmap and tlbiel some page table entries, and then the task could migrate to one of the VCPUs running on the old core and potentially see stale TLB entries. Since the TLB is shared between all the threads in a core, we only use the bit of kvm->arch.need_tlb_flush corresponding to the first thread in the core. To ensure that we don't have a window where we can miss a flush, this moves the clearing of the bit from before the actual flush to after it. This way, two threads might both do the flush, but we prevent the situation where one thread can enter the guest before the flush is finished. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | KVM: PPC: Book3S HV: Make HPT-specific hypercalls return error in radix modePaul Mackerras2017-01-311-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the guest is in radix mode, then it doesn't have a hashed page table (HPT), so all of the hypercalls that manipulate the HPT can't work and should return an error. This adds checks to make them return H_FUNCTION ("function not supported"). Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | KVM: PPC: Book3S HV: Implement dirty page logging for radix guestsPaul Mackerras2017-01-314-33/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds code to keep track of dirty pages when requested (that is, when memslot->dirty_bitmap is non-NULL) for radix guests. We use the dirty bits in the PTEs in the second-level (partition-scoped) page tables, together with a bitmap of pages that were dirty when their PTE was invalidated (e.g., when the page was paged out). This bitmap is stored in the first half of the memslot->dirty_bitmap area, and kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the bitmap that gets returned to userspace. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | KVM: PPC: Book3S HV: MMU notifier callbacks for radix guestsPaul Mackerras2017-01-313-21/+103
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adapts our implementations of the MMU notifier callbacks (unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva) to call radix functions when the guest is using radix. These implementations are much simpler than for HPT guests because we have only one PTE to deal with, so we don't need to traverse rmap chains. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | | KVM: PPC: Book3S HV: Page table construction and page faults for radix guestsPaul Mackerras2017-01-315-3/+415
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the code to construct the second-level ("partition-scoped" in architecturese) page tables for guests using the radix MMU. Apart from the PGD level, which is allocated when the guest is created, the rest of the tree is all constructed in response to hypervisor page faults. As well as hypervisor page faults for missing pages, we also get faults for reference/change (RC) bits needing to be set, as well as various other error conditions. For now, we only set the R or C bit in the guest page table if the same bit is set in the host PTE for the backing page. This code can take advantage of the guest being backed with either transparent or ordinary 2MB huge pages, and insert 2MB page entries into the guest page tables. There is no support for 1GB huge pages yet. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
OpenPOWER on IntegriCloud