path: root/arch/powerpc/mm/book3s64
* Merge tag 'powerpc-5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
  Linus Torvalds, 2020-02-04 (3 files changed, -2/+18)

  Pull powerpc updates from Michael Ellerman:
  "A pretty small batch for us, and apologies for it being a bit late,
   I wanted to sneak Christophe's user_access_begin() series in.

   Summary:

    - Implement user_access_begin() and friends for our platforms that
      support controlling kernel access to userspace.

    - Enable CONFIG_VMAP_STACK on 32-bit Book3S and 8xx.

    - Some tweaks to our pseries IOMMU code to allow SVMs ("secure"
      virtual machines) to use the IOMMU.

    - Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 32-bit
      VDSO, and some other improvements.

    - A series to use the PCI hotplug framework to control opencapi
      cards so that they can be reset and re-read after flashing a new
      FPGA image.

   As well as other minor fixes and improvements as usual.

   Thanks to: Alastair D'Silva, Alexandre Ghiti, Alexey Kardashevskiy,
   Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Bai Yingjie, Chen
   Zhou, Christophe Leroy, Frederic Barrat, Greg Kurz, Jason A. Donenfeld,
   Joel Stanley, Jordan Niethe, Julia Lawall, Krzysztof Kozlowski, Laurent
   Dufour, Laurentiu Tudor, Linus Walleij, Michael Bringmann, Nathan
   Chancellor, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Peter
   Ujfalusi, Pingfan Liu, Ram Pai, Randy Dunlap, Russell Currey, Sam
   Bobroff, Sebastian Andrzej Siewior, Shawn Anastasio, Stephen Rothwell,
   Steve Best, Sukadev Bhattiprolu, Thiago Jung Bauermann, Tyrel Datwyler,
   Vaibhav Jain"

  * tag 'powerpc-5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (131 commits)
    powerpc: configs: Cleanup old Kconfig options
    powerpc/configs/skiroot: Enable some more hardening options
    powerpc/configs/skiroot: Disable xmon default & enable reboot on panic
    powerpc/configs/skiroot: Enable security features
    powerpc/configs/skiroot: Update for symbol movement only
    powerpc/configs/skiroot: Drop default n CONFIG_CRYPTO_ECHAINIV
    powerpc/configs/skiroot: Drop HID_LOGITECH
    powerpc/configs: Drop NET_VENDOR_HP which moved to staging
    powerpc/configs: NET_CADENCE became NET_VENDOR_CADENCE
    powerpc/configs: Drop CONFIG_QLGE which moved to staging
    powerpc: Do not consider weak unresolved symbol relocations as bad
    powerpc/32s: Fix kasan_early_hash_table() for CONFIG_VMAP_STACK
    powerpc: indent to improve Kconfig readability
    powerpc: Provide initial documentation for PAPR hcalls
    powerpc: Implement user_access_save() and user_access_restore()
    powerpc: Implement user_access_begin and friends
    powerpc/32s: Prepare prevent_user_access() for user_access_end()
    powerpc/32s: Drop NULL addr verification
    powerpc/kuap: Fix set direction in allow/prevent_user_access()
    powerpc/32s: Fix bad_kuap_fault()
    ...
* powerpc/mm: Remove kvm radix prefetch workaround for Power9 DD2.2
  Jordan Niethe, 2020-01-26 (2 files changed, -1/+8)

  Commit a25bd72badfa ("powerpc/mm/radix: Workaround prefetch issue with
  KVM") introduced a number of workarounds, because coming out of a guest
  with the mmu enabled would leave the cpu running in hypervisor state
  with the PID value from the guest. The cpu will then start prefetching
  for the hypervisor with that PID value.

  In Power9 DD2.2 the cpu behaviour was modified to fix this: when
  accessing Quadrant 0 in hypervisor mode with LPID != 0, prefetching
  will not be performed. This means that we can get rid of the
  workarounds for Power9 DD2.2 and later revisions. Add a new cpu feature
  CPU_FTR_P9_RADIX_PREFETCH_BUG to indicate if the workarounds are
  needed.

  Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
  Acked-by: Paul Mackerras <paulus@ozlabs.org>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20191206031722.25781-1-jniethe5@gmail.com
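  In code, the new flag simply gates the old flush sequence. A minimal
  kernel-context sketch, where the workaround body is hypothetical but
  cpu_has_feature() and the flag name come from the commit:

    /* sketch: skip the guest-exit workaround on fixed revisions */
    static void radix_kvm_prefetch_workaround(struct mm_struct *mm)
    {
            if (!cpu_has_feature(CPU_FTR_P9_RADIX_PREFETCH_BUG))
                    return;   /* DD2.2+: no Quadrant 0 prefetch with LPID != 0 */

            do_dd21_prefetch_flush(mm);   /* hypothetical: the old flush path */
    }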
* powerpc/book3s64/hash: Disable 16M linear mapping size if not aligned
  Russell Currey, 2020-01-16 (1 file changed, -1/+10)

  With STRICT_KERNEL_RWX on in a relocatable kernel under the hash MMU,
  if the position the kernel is loaded at is not 16M aligned, things go
  horribly wrong. Specifically hash__mark_initmem_nx() will call
  hash__change_memory_range(), which then aligns down the start address,
  and because the text is not 16M aligned this causes some of the kernel
  text to be marked non-executable.

  We can avoid this when selecting the linear mapping size, so do so and
  print a warning. I tested this for various alignments and as long as
  the position is 64K aligned it's fine (the base requirement for
  powerpc).

  Signed-off-by: Russell Currey <ruscur@russell.cc>
  [mpe: Add details of the failure mode]
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20191224064126.183670-1-ruscur@russell.cc
* powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
  Aneesh Kumar K.V, 2020-02-04 (1 file changed, -7/+0)

  Patch series "Fixup page directory freeing", v4.

  This is a repost of the patch series from Peter with the arch specific
  changes, except ppc64, dropped. ppc64 changes are added here because we
  are redoing the patch series on top of ppc64 changes. This makes it
  easy to backport these changes. Only the first 2 patches need to be
  backported to stable.

  The thing is, on anything SMP, freeing page directories should observe
  the exact same order as normal page freeing:

   1) unhook page/directory
   2) TLB invalidate
   3) free page/directory

  Without this, any concurrent page-table walk could end up with a
  use-after-free. This is esp. trivial for anything that has software
  page-table walkers (HAVE_FAST_GUP / software TLB fill) or where the
  hardware caches partial page-walks (ie. caches page directories).

  Even on UP this might give issues since mmu_gather is preemptible these
  days. An interrupt or preempted task accessing user pages might stumble
  into the free page if the hardware caches page directories.

  This patch series fixes ppc64 and adds generic MMU_GATHER changes to
  support the conversion of other architectures. I haven't added patches
  w.r.t. other architectures because they are yet to be acked.

  This patch (of 9):

  A followup patch is going to make sure we correctly invalidate the page
  walk cache before we free page table pages. In order to keep things
  simple, enable RCU_TABLE_FREE even for !SMP so that we don't have to
  fix up the !SMP case differently in the followup patch.

  The !SMP case is right now broken for radix translation w.r.t. page
  walk cache flush. We can get interrupted in between page table free and
  the flush, and that would imply we have page walk cache entries
  pointing to tables which got freed already. Michael said "both our
  platforms that run on Power9 force SMP on in Kconfig, so the !SMP case
  is unlikely to be a problem for anyone in practice, unless they've
  hacked their kernel to build it !SMP."

  Link: http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.kumar@linux.ibm.com
  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Michael Ellerman <mpe@ellerman.id.au>
  Cc: <stable@vger.kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
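  With RCU_TABLE_FREE, step 3 goes through the mmu_gather so the actual
  free is deferred until concurrent (lockless) walkers can no longer hold
  a reference. A hedged sketch of the idiom, using the real
  tlb_remove_table() hook with illustrative surrounding names:

    /* 1) unhook: clear the parent entry so new walks cannot find the table */
    pmd_clear(pmdp);
    /* 2) invalidate: the mmu_gather emits the TLB / page-walk-cache flush */
    /* 3) free: deferred via RCU instead of an immediate free_page() */
    tlb_remove_table(tlb, pte_table);   /* 'pte_table' is an illustrative pointer */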
* mm, tree-wide: rename put_user_page*() to unpin_user_page*()
  John Hubbard, 2020-01-31 (1 file changed, -2/+2)

  In order to provide a clearer, more symmetric API for pinning and
  unpinning DMA pages. This way, pin_user_pages*() calls match up with
  unpin_user_pages*() calls, and the API is a lot closer to being
  self-explanatory.

  Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
  Signed-off-by: John Hubbard <jhubbard@nvidia.com>
  Reviewed-by: Jan Kara <jack@suse.cz>
  Cc: Alex Williamson <alex.williamson@redhat.com>
  Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Cc: Björn Töpel <bjorn.topel@intel.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
  Cc: Dan Williams <dan.j.williams@intel.com>
  Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
  Cc: Ira Weiny <ira.weiny@intel.com>
  Cc: Jason Gunthorpe <jgg@mellanox.com>
  Cc: Jason Gunthorpe <jgg@ziepe.ca>
  Cc: Jens Axboe <axboe@kernel.dk>
  Cc: Jerome Glisse <jglisse@redhat.com>
  Cc: Jonathan Corbet <corbet@lwn.net>
  Cc: Kirill A. Shutemov <kirill@shutemov.name>
  Cc: Leon Romanovsky <leonro@mellanox.com>
  Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
  Cc: Mike Rapoport <rppt@linux.ibm.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
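  The renamed calls pair up as in this hedged sketch (error handling
  elided; addr and the surrounding context are illustrative):

    struct page *page;
    long nr;

    nr = pin_user_pages(addr, 1, FOLL_WRITE, &page, NULL);  /* pin for DMA */
    if (nr == 1) {
            /* ... hardware does DMA to/from the page ... */
            unpin_user_page(page);      /* was: put_user_page(page) */
    }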
* powerpc: book3s64: convert to pin_user_pages() and put_user_page()
  John Hubbard, 2020-01-31 (1 file changed, -5/+5)

  1. Convert from get_user_pages() to pin_user_pages().

  2. As required by pin_user_pages(), release these pages via
     put_user_page().

  Link: http://lkml.kernel.org/r/20200107224558.2362728-21-jhubbard@nvidia.com
  Signed-off-by: John Hubbard <jhubbard@nvidia.com>
  Reviewed-by: Jan Kara <jack@suse.cz>
  Cc: Alex Williamson <alex.williamson@redhat.com>
  Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Cc: Björn Töpel <bjorn.topel@intel.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
  Cc: Dan Williams <dan.j.williams@intel.com>
  Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
  Cc: Ira Weiny <ira.weiny@intel.com>
  Cc: Jason Gunthorpe <jgg@mellanox.com>
  Cc: Jason Gunthorpe <jgg@ziepe.ca>
  Cc: Jens Axboe <axboe@kernel.dk>
  Cc: Jerome Glisse <jglisse@redhat.com>
  Cc: Jonathan Corbet <corbet@lwn.net>
  Cc: Kirill A. Shutemov <kirill@shutemov.name>
  Cc: Leon Romanovsky <leonro@mellanox.com>
  Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
  Cc: Mike Rapoport <rppt@linux.ibm.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'akpm' (patches from Andrew)
  Linus Torvalds, 2019-12-01 (1 file changed, -0/+1)

  Merge updates from Andrew Morton:
  "Incoming:

   - a small number of updates to scripts/, ocfs2 and fs/buffer.c

   - most of MM

   I still have quite a lot of material (mostly not MM) staged after
   linux-next due to -next dependencies. I'll send those across next week
   as the prerequisites get merged up"

  * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
    mm/page_io.c: annotate refault stalls from swap_readpage
    mm/Kconfig: fix trivial help text punctuation
    mm/Kconfig: fix indentation
    mm/memory_hotplug.c: remove __online_page_set_limits()
    mm: fix typos in comments when calling __SetPageUptodate()
    mm: fix struct member name in function comments
    mm/shmem.c: cast the type of unmap_start to u64
    mm: shmem: use proper gfp flags for shmem_writepage()
    mm/shmem.c: make array 'values' static const, makes object smaller
    userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
    fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
    userfaultfd: wrap the common dst_vma check into an inlined function
    userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
    userfaultfd: use vma_pagesize for all huge page size calculation
    mm/madvise.c: use PAGE_ALIGN[ED] for range checking
    mm/madvise.c: replace with page_size() in madvise_inject_error()
    mm/mmap.c: make vma_merge() comment more easy to understand
    mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
    autonuma: reduce cache footprint when scanning page tables
    autonuma: fix watermark checking in migrate_balanced_pgdat()
    ...
* powerpc/mm: remove pmd_huge/pud_huge stubs and include hugetlb.h
  Mike Kravetz, 2019-12-01 (1 file changed, -0/+1)

  Patch series "hugetlbfs: convert macros to static inline, fix sparse
  warning".

  The definition for huge_pte_offset() in <linux/hugetlb.h> causes a
  sparse warning in the !CONFIG_HUGETLB_PAGE case. Fix this as well as
  converting all macros in this block of definitions to static inlines
  for better type checking.

  When making the above changes, build errors were found in powerpc due
  to duplicate definitions. A separate powerpc specific patch is included
  as a requisite to remove the definitions and get them from
  <linux/hugetlb.h>.

  This patch (of 2):

  This removes the powerpc specific stubs created by commit aad71e3928be
  ("powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n") used when
  !CONFIG_HUGETLB_PAGE. Instead, it addresses the build break by getting
  the definitions from <linux/hugetlb.h>. This allows the macros in
  <linux/hugetlb.h> to be replaced with static inlines.

  Link: http://lkml.kernel.org/r/20191112194558.139389-2-mike.kravetz@oracle.com
  Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
  Acked-by: Michael Ellerman <mpe@ellerman.id.au>
  Cc: Ben Dooks <ben.dooks@codethink.co.uk>
  Cc: Jason Gunthorpe <jgg@ziepe.ca>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
  Aneesh Kumar K.V, 2019-11-05 (1 file changed, -0/+1)

  With large memory (8TB and more) hotplug, we can get soft lockup
  warnings as below. These were caused by a long loop without any
  explicit cond_resched which is a problem for !PREEMPT kernels. Avoid
  this using cond_resched() while inserting hash page table entries. We
  already do similar cond_resched() in __add_pages(), see commit
  f64ac5e6e306 ("mm, memory_hotplug: add scheduling point to
  __add_pages").

   rcu: 3-....: (24002 ticks this GP) idle=13e/1/0x4000000000000002 softirq=722/722 fqs=12001
    (t=24003 jiffies g=4285 q=2002)
   NMI backtrace for cpu 3
   CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2
   Call Trace:
     dump_stack+0xb0/0xf4 (unreliable)
     nmi_cpu_backtrace+0x124/0x130
     nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
     arch_trigger_cpumask_backtrace+0x28/0x3c
     rcu_dump_cpu_stacks+0xf8/0x154
     rcu_sched_clock_irq+0x878/0xb40
     update_process_times+0x48/0x90
     tick_sched_handle.isra.16+0x4c/0x80
     tick_sched_timer+0x68/0xe0
     __hrtimer_run_queues+0x180/0x430
     hrtimer_interrupt+0x110/0x300
     timer_interrupt+0x108/0x2f0
     decrementer_common+0x114/0x120
   --- interrupt: 901 at arch_add_memory+0xc0/0x130
       LR = arch_add_memory+0x74/0x130
     memremap_pages+0x494/0x650
     devm_memremap_pages+0x3c/0xa0
     pmem_attach_disk+0x188/0x750
     nvdimm_bus_probe+0xac/0x2c0
     really_probe+0x148/0x570
     driver_probe_device+0x19c/0x1d0
     device_driver_attach+0xcc/0x100
     bind_store+0x134/0x1c0
     drv_attr_store+0x44/0x60
     sysfs_kf_write+0x64/0x90
     kernfs_fop_write+0x1a0/0x270
     __vfs_write+0x3c/0x70
     vfs_write+0xd0/0x260
     ksys_write+0xdc/0x130
     system_call+0x5c/0x68

  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20191001084656.31277-1-aneesh.kumar@linux.ibm.com
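  The shape of the fix is simply a periodic yield inside the long insert
  loop; a hedged sketch (the per-entry helper and loop bounds are
  hypothetical, cond_resched() is the real call):

    for (vaddr = vstart; vaddr < vend; vaddr += step) {
            ret = hash_insert_one(vaddr);   /* hypothetical per-entry insert */
            if (ret < 0)
                    break;
            cond_resched();   /* let !PREEMPT kernels schedule during 8TB+ hotplug */
    }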
* powerpc/mm/book3s64/radix: Flush the full mm even when need_flush_all is set
  Aneesh Kumar K.V, 2019-11-05 (1 file changed, -2/+1)

  With the previous patch, we should now not be using need_flush_all for
  powerpc. But then make sure we force a PID tlbie flush with RIC=2 if we
  ever find need_flush_all set. Also don't reset it after a mmu gather
  flush.

  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20191024075801.22434-3-aneesh.kumar@linux.ibm.com
* powerpc/mm/book3s64/radix: Use freed_tables instead of need_flush_all
  Aneesh Kumar K.V, 2019-11-05 (1 file changed, -8/+3)

  With commit 22a61c3c4f13 ("asm-generic/tlb: Track freeing of page-table
  directories in struct mmu_gather") we now track whether we freed page
  tables in mmu_gather. Use that to decide whether to flush the Page Walk
  Cache.

  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20191024075801.22434-2-aneesh.kumar@linux.ibm.com
* powerpc/mm/book3s64/radix: Remove unused code.
  Aneesh Kumar K.V, 2019-11-05 (1 file changed, -60/+6)

  The mm_tlb_flush_nested change was added in the mmu gather tlb flush to
  handle the case of a parallel pte invalidate happening with mmap_sem
  held in read mode. This fix was done by commit 02390f66bd23
  ("powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem
  with THP") and the problem is explained in detail in commit
  99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem").

  This was later updated by commit 7a30df49f63a ("mm: mmu_gather: remove
  __tlb_reset_range() for force flush") to do a full mm flush rather than
  a range flush. By commit dd2283f2605e ("mm: mmap: zap pages with read
  mmap_sem in munmap") we are also now allowing a page table free in
  mmap_sem read mode, which means we should do a PWC flush too. Our
  current full mm flush implies a PWC flush.

  With all the above changes, the mm_tlb_flush_nested(mm) branch in
  radix__tlb_flush will never be taken, because for the nested case we
  would have taken the if (tlb->fullmm) branch. This patch removes the
  unused code.

  Also, remove the gflush change in __radix__flush_tlb_range that was
  added to handle the range tlb flush code. We only check for THP there
  because hugetlb is flushed via a different code path where page size is
  explicitly specified.

  This is a partial revert of commit 02390f66bd23 ("powerpc/64s/radix:
  Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP").

  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20191024075801.22434-1-aneesh.kumar@linux.ibm.com
* powerpc/book3s64/hash: Use secondary hash for bolted mapping if the primary is full
  Aneesh Kumar K.V, 2019-10-28 (2 files changed, -10/+41)

  With a bolted hash page table entry, the kernel currently only uses the
  primary hash group when inserting the hash page table entry. In the
  rare case where the kernel finds all 8 primary hash slots occupied by
  bolted entries, this can result in a hash page table insert failure for
  bolted entries. Avoid this by using the secondary hash group.

  This is different from what the kernel does for non-bolted mappings.
  With non-bolted entries the kernel will try the secondary group before
  removing an existing entry from the hash page table group. With bolted
  entries, prefer the primary hash group and hence try to insert the page
  table entry by removing a slot from the primary before trying the
  secondary hash group.

  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20191024093542.29777-3-aneesh.kumar@linux.ibm.com
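  For reference, the secondary group is derived from the complement of
  the hash value; a hedged sketch of the fallback, where the insert
  helper is hypothetical but HPTES_PER_GROUP and htab_hash_mask are the
  real symbols:

    hpte_group = (hash & htab_hash_mask) * HPTES_PER_GROUP;     /* primary */
    slot = insert_bolted_hpte(hpte_group, vpn, paddr);          /* hypothetical */
    if (slot == -1) {
            /* all 8 primary slots bolted: fall back to the secondary group */
            hpte_group = (~hash & htab_hash_mask) * HPTES_PER_GROUP;
            slot = insert_bolted_hpte(hpte_group, vpn, paddr);
    }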
* powerpc/pseries: Don't fail hash page table insert for bolted mapping
  Aneesh Kumar K.V, 2019-10-28 (1 file changed, -1/+8)

  If the hypervisor returned H_PTEG_FULL for the H_ENTER hcall, retry the
  hash page table insert by removing a random entry from the group.

  After some runtime, it is quite possible to find all 8 hash page table
  entry slots in the hpte group used for mapping. Don't fail a bolted
  entry insert in that case. With storage class memory a user can hit
  this error easily, since a namespace enable/disable is equivalent to
  memory add/remove.

  This results in failures as reported below:

   $ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M
   libndctl: ndctl_dax_enable: dax1.3: failed to enable
     Error: namespace1.2: failed to enable
   failed to create namespace: No such device or address

  In the kernel log we find the details as below:

   Unable to create mapping for hot added memory 0xc000042006000000..0xc00004200d000000: -1
   dax_pmem: probe of dax1.3 failed with error -14

  This indicates that we failed to create a bolted hash table entry for
  the direct-map address backing the namespace. We also observe failures
  such that not all namespaces will be enabled with the ndctl
  enable-namespace all command.

  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20191024093542.29777-2-aneesh.kumar@linux.ibm.com
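  A hedged sketch of the retry path (the hcall wrapper usage and the
  remove helper are illustrative): on H_PTEG_FULL, evict one entry from
  the full group and issue the H_ENTER again rather than failing the
  bolted insert.

    lpar_rc = plpar_pte_enter(flags, hpte_group, hpte_v, hpte_r, &slot);
    if (lpar_rc == H_PTEG_FULL) {
            i = mftb() & 0x7;                       /* pseudo-random victim slot */
            clear_hpte_slot(hpte_group, i);         /* hypothetical remove */
            /* loop back and retry the H_ENTER instead of returning failure */
    }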
* powerpc/pkeys: remove unused pkey_allows_readwrite
  Qian Cai, 2019-10-11 (1 file changed, -10/+0)

  pkey_allows_readwrite() was first introduced in the commit 5586cf61e108
  ("powerpc: introduce execute-only pkey"), but the usage was removed
  entirely in the commit a4fcc877d4e1 ("powerpc/pkeys: Preallocate
  execute-only key").

  Found by the "-Wunused-function" compiler warning flag.

  Fixes: a4fcc877d4e1 ("powerpc/pkeys: Preallocate execute-only key")
  Signed-off-by: Qian Cai <cai@lca.pw>
  Acked-by: Ram Pai <linuxram@us.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/1568733750-14580-1-git-send-email-cai@lca.pw
* Merge tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
  Linus Torvalds, 2019-09-29 (2 files changed, -7/+2)

  More libnvdimm updates from Dan Williams:

   - Complete the reworks to interoperate with powerpc dynamic huge page
     sizes

   - Fix a crash due to missed accounting for the powerpc 'struct
     page'-memmap mapping granularity

   - Fix badblock initialization for volatile (DRAM emulated) pmem ranges

   - Stop triggering request_key() notifications to userspace when
     NVDIMM-security is disabled / not present

   - Miscellaneous small fixups

  * tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm/region: Enable MAP_SYNC for volatile regions
    libnvdimm: prevent nvdimm from requesting key when security is disabled
    libnvdimm/region: Initialize bad block for volatile namespaces
    libnvdimm/nfit_test: Fix acpi_handle redefinition
    libnvdimm/altmap: Track namespace boundaries in altmap
    libnvdimm: Fix endian conversion issues
    libnvdimm/dax: Pick the right alignment default when creating dax devices
    powerpc/book3s64: Export has_transparent_hugepage() related functions.
* powerpc/book3s64: Export has_transparent_hugepage() related functions.
  Aneesh Kumar K.V, 2019-09-24 (2 files changed, -7/+2)

  In a later patch, we want to use hash_transparent_hugepage() in a
  kernel module. Export two related functions.

  Acked-by: Michael Ellerman <mpe@ellerman.id.au>
  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Link: https://lore.kernel.org/r/20190924042440.27946-1-aneesh.kumar@linux.ibm.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Merge tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
  Linus Torvalds, 2019-09-28 (3 files changed, -15/+115)

  Pull powerpc fixes from Michael Ellerman:
  "An assortment of fixes that were either missed by me, or didn't arrive
   quite in time for the first v5.4 pull.

   - Most notable is a fix for an issue with tlbie (broadcast TLB
     invalidation) on Power9, when using the Radix MMU. The tlbie can
     race with an mtpid (move to PID register, essentially MMU context
     switch) on another thread of the core, which can cause stores to
     continue to go to a page after it's unmapped.

   - A fix in our KVM code to add a missing barrier, the lack of which
     has been observed to cause missed IPIs and subsequently stuck CPUs
     in the host.

   - A change to the way we initialise PCR (Processor Compatibility
     Register) to make it forward compatible with future CPUs.

   - On some older PowerVM systems our H_BLOCK_REMOVE support could oops,
     fix it to detect such systems and fallback to the old invalidation
     method.

   - A fix for an oops seen on some machines when using KASAN on 32-bit.

   - A handful of other minor fixes, and two new selftests.

   Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy,
   Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael
   Roth, Oliver O'Halloran"

  * tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/eeh: Fix eeh eeh_debugfs_break_device() with SRIOV devices
    powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error
    powerpc/nvdimm: Use HCALL error as the return value
    selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue
    powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
    powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag
    powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions
    powerpc/pseries: Call H_BLOCK_REMOVE when supported
    powerpc/pseries: Read TLB Block Invalidate Characteristics
    KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
    powerpc/mm: Fix an Oops in kasan_mmu_init()
    powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY
    powerpc/64s: Set reserved PCR bits
    powerpc: Fix definition of PCR bits to work with old binutils
    powerpc/book3s64/radix: Remove WARN_ON in destroy_context()
    powerpc/tm: Add tm-poison test
* powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
  Aneesh Kumar K.V, 2019-09-24 (2 files changed, -11/+98)

  On POWER9, under some circumstances, a broadcast TLB invalidation will
  fail to invalidate the ERAT cache on some threads when there are
  parallel mtpidr/mtlpidr happening on other threads of the same core.
  This can cause stores to continue to go to a page after it's unmapped.

  The workaround is to force an ERAT flush using a PID=0 or LPID=0 tlbie
  flush. This additional TLB flush will cause the ERAT cache
  invalidation. Since we are using PID=0 or LPID=0, we don't get filtered
  out by the TLB snoop filtering logic.

  We still need to follow this up with another tlbie to take care of the
  store vs tlbie ordering issue explained in commit a5d4b5891c2f
  ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9"). The
  presence of an ERAT cache implies we can still get new stores and they
  may miss the store queue marking flush.

  Cc: stable@vger.kernel.org
  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
* powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag
  Aneesh Kumar K.V, 2019-09-24 (2 files changed, -3/+3)

  Rename the #define to indicate this is related to the store vs tlbie
  ordering issue. In the next patch, we will be adding another feature
  flag that is used to handle the ERAT flush vs tlbie ordering issue.

  Fixes: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
  Cc: stable@vger.kernel.org # v4.16+
  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
* powerpc/book3s64/radix: Remove WARN_ON in destroy_context()
  Aneesh Kumar K.V, 2019-09-21 (1 file changed, -1/+14)

  On failed task initialization due to memory allocation failures, we can
  call into destroy_context() with the process_tb entry already
  populated. This patch forces the process_tb entry to zero in
  destroy_context(). With this patch, we lose the ability to track if we
  are destroying a context without flushing the process table entry.

   WARNING: CPU: 4 PID: 6368 at arch/powerpc/mm/mmu_context_book3s64.c:246 destroy_context+0x58/0x340
   NIP [c0000000000875f8] destroy_context+0x58/0x340
   LR [c00000000013da18] __mmdrop+0x78/0x270
   Call Trace:
   [c000000f7db77c80] [c00000000013da18] __mmdrop+0x78/0x270
   [c000000f7db77cf0] [c0000000004d6a34] __do_execve_file.isra.13+0xbd4/0x1000
   [c000000f7db77e00] [c0000000004d7428] sys_execve+0x58/0x70
   [c000000f7db77e30] [c00000000000b388] system_call+0x5c/0x70

  Reported-by: Priya M.A <priyama2@in.ibm.com>
  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  [mpe: Reformat/tweak comment wording]
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190918140103.24395-1-aneesh.kumar@linux.ibm.com
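  The essence of the fix, as a hedged sketch (the process-table field
  spelling follows the book3s64 headers; treat the details as
  illustrative):

    /* in destroy_context(): force-clear the PATE instead of warning */
    if (radix_enabled())
            process_tb[mm->context.id].prtb0 = 0;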
* thp: update split_huge_page_pmd() comment
  Kefeng Wang, 2019-09-24 (1 file changed, -1/+1)

  According to 78ddc5347341 ("thp: rename split_huge_page_pmd() to
  split_huge_pmd()"), update the related comment.

  Link: http://lkml.kernel.org/r/20190731033406.185285-1-wangkefeng.wang@huawei.com
  Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
  Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce page_shift()
  Matthew Wilcox (Oracle), 2019-09-24 (1 file changed, -5/+2)

  Replace PAGE_SHIFT + compound_order(page) with the new page_shift()
  function. Minor improvements in readability.

  [akpm@linux-foundation.org: fix build in tce_page_is_contained()]
  Link: http://lkml.kernel.org/r/201907241853.yNQTrJWd%25lkp@intel.com
  Link: http://lkml.kernel.org/r/20190721104612.19120-3-willy@infradead.org
  Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
  Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
  Reviewed-by: Ira Weiny <ira.weiny@intel.com>
  Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
  Cc: Michal Hocko <mhocko@suse.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
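  The new helper itself is a one-liner, essentially:

    /* size of the (possibly compound) page, expressed as a shift (log2) */
    static inline unsigned int page_shift(struct page *page)
    {
            return PAGE_SHIFT + compound_order(page);
    }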
* Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
  Linus Torvalds, 2019-09-21 (1 file changed, -6/+6)

  Pull hmm updates from Jason Gunthorpe:
  "This is more cleanup and consolidation of the hmm APIs and the very
   strongly related mmu_notifier interfaces. Many places across the tree
   using these interfaces are touched in the process. Beyond that a
   cleanup to the page walker API and a few memremap related changes
   round out the series:

   - General improvement of hmm_range_fault() and related APIs, more
     documentation, bug fixes from testing, API simplification &
     consolidation, and unused API removal

   - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
     and make them internal kconfig selects

   - Hoist a lot of code related to mmu notifier attachment out of
     drivers by using a refcount get/put attachment idiom and remove the
     convoluted mmu_notifier_unregister_no_release() and related APIs.

   - General API improvement for the migrate_vma API and revision of its
     only user in nouveau

   - Annotate mmu_notifiers with lockdep and sleeping region debugging

   Two series unrelated to HMM or mmu_notifiers came along due to
   dependencies:

   - Allow pagemap's memremap_pages family of APIs to work without
     providing a struct device

   - Make walk_page_range() and related use a constant structure for
     function pointers"

  * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...
* pagewalk: separate function pointers from iterator data
  Christoph Hellwig, 2019-09-07 (1 file changed, -5/+5)

  The mm_walk structure currently mixes data and code. Split out the
  operations vectors into a new mm_walk_ops structure, and while we are
  changing the API also declare the mm_walk structure inside the
  walk_page_range and walk_page_vma functions.

  Based on patch from Linus Torvalds.

  Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
  Reviewed-by: Steven Price <steven.price@arm.com>
  Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
  Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
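  After the split, a walker supplies only a const ops table plus its
  private data; a hedged usage sketch with a hypothetical callback:

    static int count_pmds(pmd_t *pmd, unsigned long addr,
                          unsigned long next, struct mm_walk *walk)
    {
            (*(unsigned long *)walk->private)++;    /* count each pmd visited */
            return 0;
    }

    static const struct mm_walk_ops count_ops = {
            .pmd_entry = count_pmds,
    };

    unsigned long count = 0;
    walk_page_range(mm, start, end, &count_ops, &count);  /* new-style call */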
* mm: split out a new pagewalk.h header from mm.h
  Christoph Hellwig, 2019-09-07 (1 file changed, -1/+1)

  Add a new header for the two handfuls of users of the walk_page_range /
  walk_page_vma interface instead of polluting all users of mm.h with it.

  Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
  Reviewed-by: Steven Price <steven.price@arm.com>
  Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
  Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* powerpc/mm/radix: remove useless kernel messages
  Qian Cai, 2019-09-14 (1 file changed, -3/+0)

  Booting a POWER9 PowerNV system generates the few messages below with
  "____ptrval____", because pointers printed without a specifier
  extension (i.e. unadorned %p) are hashed to prevent leaking information
  about the kernel memory layout.

   radix-mmu: Initializing Radix MMU
   radix-mmu: Partition table (____ptrval____)
   radix-mmu: Mapped 0x0000000000000000-0x0000000040000000 with 1.00 GiB pages (exec)
   radix-mmu: Mapped 0x0000000040000000-0x0000002000000000 with 1.00 GiB pages
   radix-mmu: Mapped 0x0000200000000000-0x0000202000000000 with 1.00 GiB pages
   radix-mmu: Process table (____ptrval____) and radix root for kernel: (____ptrval____)

  Signed-off-by: Qian Cai <cai@lca.pw>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/1566570120-16529-1-git-send-email-cai@lca.pw
* powerpc/64s: Remove overlaps_kvm_tmp()
  Michael Ellerman, 2019-09-14 (1 file changed, -4/+0)

  kvm_tmp is now in .text and so doesn't need a special overlap check.

  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190911115746.12433-2-mpe@ellerman.id.au
* powerpc/64s/radix: introduce options to disable use of the tlbie instruction
  Nicholas Piggin, 2019-09-05 (2 files changed, -18/+219)

  Introduce two options to control the use of the tlbie instruction. A
  boot time option completely disables the kernel's use of the
  instruction; this is currently incompatible with HASH MMU, KVM, and
  coherent accelerators. A debugfs option can be switched at runtime and
  avoids using tlbie for invalidating CPU TLBs for normal process and
  kernel address mappings. Coherent accelerators are still managed with
  tlbie, as will be KVM partition scope translations.

  Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
  basic implementation which does not attempt to make any optimisation
  beyond the tlbie implementation.

  This is useful for performance testing among other things. For example
  in certain situations on large systems, using IPIs may be faster than
  tlbie as they can be directed rather than broadcast. Later we may also
  take advantage of the IPIs to do more interesting things such as trim
  the mm cpumask more aggressively.

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com
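  A hedged sketch of the IPI-based path (the local flush helper is
  illustrative; on_each_cpu_mask() and mm_cpumask() are the real
  primitives): instead of one broadcast tlbie, each CPU that has used the
  mm runs a local tlbiel flush.

    static void do_tlbiel_mm(void *info)
    {
            struct mm_struct *mm = info;

            tlbiel_flush_mm(mm);    /* hypothetical: local (tlbiel) PID flush */
    }

    /* directed rather than broadcast: only CPUs in the mm's cpumask */
    on_each_cpu_mask(mm_cpumask(mm), do_tlbiel_mm, mm, 1);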
* powerpc/64s: remove unnecessary translation cache flushes at boot
  Nicholas Piggin, 2019-09-05 (3 files changed, -8/+7)

  The various translation structure invalidations performed in early boot
  when the MMU is off are not required, because everything is invalidated
  immediately before a CPU first enables its MMU (see early_init_mmu and
  early_init_mmu_secondary).

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190902152931.17840-6-npiggin@gmail.com
* powerpc/64s/pseries: radix flush translations before MMU is enabled at boot
  Nicholas Piggin, 2019-09-05 (2 files changed, -8/+10)

  Radix guests are responsible for managing their own translation caches,
  so make them match bare metal radix and hash, and make each CPU flush
  all its translations right before enabling its MMU.

  Radix guests may not flush partition scope translations, so in
  tlbiel_all, make these flushes conditional on CPU_FTR_HVMODE. Process
  scope translations are the only type visible to the guest.

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190902152931.17840-5-npiggin@gmail.com
* powerpc/64s: make mmu_partition_table_set_entry TLB flush optional
  Nicholas Piggin, 2019-09-05 (3 files changed, -4/+4)

  No functional change.

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190902152931.17840-4-npiggin@gmail.com
* powerpc/64s/radix: tidy up TLB flushing code
  Nicholas Piggin, 2019-09-05 (2 files changed, -100/+30)

  There should be no functional changes.

   - Use calls to existing radix_tlb.c functions in flush_partition.

   - Rename radix__flush_tlb_lpid to radix__flush_all_lpid and similar,
     because they flush everything, matching flush_all_mm rather than
     flush_tlb_mm for the lpid.

   - Remove some unused radix_tlb.c flush primitives.

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190902152931.17840-3-npiggin@gmail.com
* powerpc/64s: remove register_process_table callback
  Nicholas Piggin, 2019-09-05 (3 files changed, -42/+12)

  This callback is only required because the partition table init comes
  before process table allocation on powernv (aka bare metal aka native).
  Change the order to allocate the process table first, and remove the
  callback.

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190902152931.17840-2-npiggin@gmail.com
* powerpc/64s/exception: reduce page fault unnecessary loads
  Nicholas Piggin, 2019-08-30 (1 file changed, -2/+2)

  This avoids 3 loads in the radix page fault case, 1 load in the hash
  fault case, and 2 loads in the hash miss page fault case.

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190802105709.27696-37-npiggin@gmail.com
* Merge branch 'topic/ppc-kvm' into next
  Michael Ellerman, 2019-08-30 (3 files changed, -19/+46)

  Merge our ppc-kvm topic branch to bring in the Ultravisor support
  patches.
* powerpc/mm: Write to PTCR only if ultravisor disabled
  Claudio Carvalho, 2019-08-30 (3 files changed, -6/+9)

  In ultravisor enabled systems, PTCR becomes ultravisor privileged only
  for writing, and an attempt to write to it will cause a Hypervisor
  Emulation Assistance interrupt.

  This patch uses the set_ptcr_when_no_uv() function to restrict PTCR
  writing to only when the ultravisor is disabled.

  Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190822034838.27876-6-cclaudio@linux.ibm.com
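  The guard reads roughly as below, a sketch consistent with the commit
  description (treat the exact body as illustrative): write PTCR directly
  only when no ultravisor owns the register.

    static inline void set_ptcr_when_no_uv(u64 val)
    {
            if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
                    mtspr(SPRN_PTCR, val);
    }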
* powerpc/mm: Use UV_WRITE_PATE ucall to register a PATE
  Michael Anderson, 2019-08-30 (1 file changed, -13/+37)

  When the Ultravisor (UV) is enabled, the partition table is stored in
  secure memory and can only be accessed via the UV. The Hypervisor (HV)
  however maintains a copy of the partition table in normal memory to
  allow Nest MMU translations to occur (for normal VMs). The HV copy
  includes partition table entries (PATEs) for secure VMs, which would
  currently be unused (Nest MMU translations cannot access secure memory)
  but would be needed as we add functionality.

  This patch adds the UV_WRITE_PATE ucall which is used to update the
  PATE for a VM (both normal and secure) when the Ultravisor is enabled.

  Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
  Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
  Signed-off-by: Ram Pai <linuxram@us.ibm.com>
  [ cclaudio: Write the PATE in HV's table before doing that in UV's ]
  Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
  Reviewed-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190822034838.27876-5-cclaudio@linux.ibm.com
* powerpc/mm: refactor ioremap_range() and use ioremap_page_range()
  Christophe Leroy, 2019-08-27 (2 files changed, -41/+0)

  book3s64's ioremap_range() is almost the same as the fallback
  ioremap_range(), except that it calls radix__ioremap_range() when radix
  is enabled. radix__ioremap_range() is also very similar to the others,
  except that it calls ioremap_page_range() when slab is available.

  PPC32 __ioremap_caller() has a loop doing the same thing as
  ioremap_range(), so use it on PPC32 as well.

  Let's keep only one version of ioremap_range(), which calls
  ioremap_page_range() on all platforms when slab is available. At the
  same time, drop the nid parameter, which is not used.

  Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/4b1dca7096b01823b101be7338983578641547f1.1566309263.git.christophe.leroy@c-s.fr
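  Once slab is up, the unified path reduces to a single call into the
  generic mapper; a hedged sketch (va, pa, size and prot stand in for the
  caller's arguments):

    /* map [pa, pa + size) at va with the requested protection */
    ret = ioremap_page_range(va, va + size, pa, prot);
    if (ret)
            unmap_kernel_range(va, size);   /* unwind a partially built mapping */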
* powerpc/64s/radix: Fix memory hot-unplug page table split
  Nicholas Piggin, 2019-08-20 (1 file changed, -2/+2)

  create_physical_mapping expects physical addresses, but splitting these
  mappings on hot unplug is supplying virtual (effective) addresses.

  Fixes: 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug")
  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190724084638.24982-2-npiggin@gmail.com
* powerpc/64s/radix: Fix memory hotplug section page table creation
  Nicholas Piggin, 2019-08-20 (1 file changed, -1/+1)

  create_physical_mapping expects physical addresses, but creating and
  splitting these mappings after boot is supplying virtual (effective)
  addresses. This can be triggered by booting with mem= to limit memory,
  then probing an unused physical memory range:

   echo <addr> > /sys/devices/system/memory/probe

  This mostly works by accident, firstly because __va(__va(x)) == __va(x)
  so the virtual address does not get corrupted. Secondly because pfn_pte
  masks out the upper bits of the pfn beyond the physical address limit,
  so a pfn constructed with a 0xc000000000000000 virtual linear address
  will be masked back to the correct physical address in the pte.

  Fixes: 6cc27341b21a8 ("powerpc/mm: add radix__create_section_mapping()")
  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190724084638.24982-1-npiggin@gmail.com
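  The shape of the fix, hedged: convert the section's linear-map
  (virtual) addresses to physical before handing them to
  create_physical_mapping().

    /* start/end arrive as linear-map virtual addresses */
    return create_physical_mapping(__pa(start), __pa(end), nid);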
* powerpc/mm: move update_mmu_cache() into book3s hash utils.
  Christophe Leroy, 2019-08-20 (1 file changed, -2/+53)

  update_mmu_cache() is only for BOOK3S, and can be simplified for
  BOOK3S32.

  Move it out of mem.c into respective BOOK3S32 and BOOK3S64 files
  containing hash utils.

  BOOK3S64 version of hash_preload() is only used locally, declare it
  static.

  Remove the radix_enabled() stuff in BOOK3S32 version.

  Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/107aaf43583a5f5d09e0d4e84c4c4390ecfcd512.1565933217.git.christophe.leroy@c-s.fr
* powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt()
  Gautham R. Shenoy, 2019-08-19 (1 file changed, -1/+8)

  The calls to arch_add_memory()/arch_remove_memory() are always made
  with the read-side cpu_hotplug_lock acquired via
  memory_hotplug_begin(). On pSeries,
  arch_add_memory()/arch_remove_memory() eventually call resize_hpt(),
  which in turn calls stop_machine(), which acquires the read-side
  cpu_hotplug_lock again, thereby resulting in the recursive acquisition
  of this lock.

  In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
  lockup during a memory hotplug operation because cpus_read_lock() is a
  per-cpu rwsem read, which, in the fast-path (in the absence of the
  writer, which in our case is a CPU-hotplug operation) simply increments
  the read_count on the semaphore. Thus a recursive read in the fast-path
  doesn't cause any problems.

  However, we can hit this problem in practice if there is a concurrent
  CPU-hotplug operation in progress which is waiting to acquire the
  write-side of the lock. This will cause the second recursive read to
  block until the writer finishes, while the writer itself is blocked
  because the first read holds the lock. Thus both the reader and the
  writer fail to make any progress, blocking both CPU-hotplug as well as
  memory hotplug operations.

   Memory-Hotplug                           CPU-Hotplug
   CPU 0                                    CPU 1
   ------                                   ------
   1. down_read(cpu_hotplug_lock.rw_sem)
      [memory_hotplug_begin]
                                            2. down_write(cpu_hotplug_lock.rw_sem)
                                               [cpu_up/cpu_down]
   3. down_read(cpu_hotplug_lock.rw_sem)
      [stop_machine()]

  Lockdep complains as follows in these code-paths:

   swapper/0/1 is trying to acquire lock:
   (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60

   but task is already holding lock:
   (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50

   other info that might help us debug this:
    Possible unsafe locking scenario:

          CPU0
          ----
     lock(cpu_hotplug_lock.rw_sem);
     lock(cpu_hotplug_lock.rw_sem);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

   3 locks held by swapper/0/1:
    #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
    #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
    #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0

   stack backtrace:
   CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
   Call Trace:
     dump_stack+0xe8/0x164 (unreliable)
     __lock_acquire+0x1110/0x1c70
     lock_acquire+0x240/0x290
     cpus_read_lock+0x64/0xf0
     stop_machine+0x2c/0x60
     pseries_lpar_resize_hpt+0x19c/0x2c0
     resize_hpt_for_hotplug+0x70/0xd0
     arch_add_memory+0x58/0xfc
     devm_memremap_pages+0x5e8/0x8f0
     pmem_attach_disk+0x764/0x830
     nvdimm_bus_probe+0xac/0x2c0
     really_probe+0x230/0x4b0
     driver_probe_device+0x16c/0x1e0
     __driver_attach+0x148/0x1b0
     bus_for_each_dev+0x90/0x130
     driver_attach+0x34/0x50
     bus_add_driver+0x1a8/0x360
     driver_register+0x108/0x170
     __nd_driver_register+0xd0/0xf0
     nd_pmem_driver_init+0x34/0x48
     do_one_initcall+0x1e0/0x45c
     kernel_init_freeable+0x540/0x64c
     kernel_init+0x2c/0x160
     ret_from_kernel_thread+0x5c/0x68

  Fix this issue by

   1) Requiring all the calls to pseries_lpar_resize_hpt() be made with
      cpu_hotplug_lock held.

   2) In pseries_lpar_resize_hpt(), invoking stop_machine_cpuslocked() as
      a consequence of 1).

   3) To satisfy 1), in hpt_order_set(), calling mmu_hash_ops.resize_hpt()
      with cpu_hotplug_lock held.

  Fixes: dbcf929c0062 ("powerpc/pseries: Add support for hash table resizing")
  Cc: stable@vger.kernel.org # v4.11+
  Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com
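  The resulting locking pattern, as a hedged sketch (the stop_machine
  callback and data names are illustrative; stop_machine_cpuslocked() is
  the real API): the caller holds cpu_hotplug_lock exactly once, and
  resize_hpt uses the _cpuslocked variant so no recursive read is taken.

    cpus_read_lock();       /* or already held via mem_hotplug_begin() */
    rc = stop_machine_cpuslocked(resize_hpt_fn, &resize_data, NULL);
    cpus_read_unlock();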
* powerpc/64s: Make boot look nice(r)
  Nicholas Piggin, 2019-08-15 (1 file changed, -3/+0)

  Radix boot looks like this:

   -----------------------------------------------------
   phys_mem_size     = 0x200000000
   dcache_bsize      = 0x80
   icache_bsize      = 0x80
   cpu_features      = 0x0000c06f8f5fb1a7
     possible        = 0x0000fbffcf5fb1a7
     always          = 0x00000003800081a1
   cpu_user_features = 0xdc0065c2 0xaee00000
   mmu_features      = 0xbc006041
   firmware_features = 0x0000000010000000
   hash-mmu: ppc64_pft_size    = 0x0
   hash-mmu: kernel vmalloc start   = 0xc008000000000000
   hash-mmu: kernel IO start        = 0xc00a000000000000
   hash-mmu: kernel vmemmap start   = 0xc00c000000000000
   -----------------------------------------------------

  Fix:

   -----------------------------------------------------
   phys_mem_size     = 0x200000000
   dcache_bsize      = 0x80
   icache_bsize      = 0x80
   cpu_features      = 0x0000c06f8f5fb1a7
     possible        = 0x0000fbffcf5fb1a7
     always          = 0x00000003800081a1
   cpu_user_features = 0xdc0065c2 0xaee00000
   mmu_features      = 0xbc006041
   firmware_features = 0x0000000010000000
   vmalloc start     = 0xc008000000000000
   IO start          = 0xc00a000000000000
   vmemmap start     = 0xc00c000000000000
   -----------------------------------------------------

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190516020437.11783-1-npiggin@gmail.com
* Merge tag 'powerpc-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
  Linus Torvalds, 2019-07-24 (1 file changed, -0/+9)

  Pull powerpc fixes from Michael Ellerman:
  "An assortment of non-regression fixes that have accumulated since the
   start of the merge window.

   - A fix for a user triggerable oops on machines where transactional
     memory is disabled, eg. Power9 bare metal, Power8 with TM disabled
     on the command line, or all Power7 or earlier machines.

   - Three fixes for handling of PMU and power saving registers when
     running nested KVM on Power9.

   - Two fixes for bugs found while stress testing the XIVE interrupt
     controller code, also on Power9.

   - A fix to allow guests to boot under Qemu/KVM on Power9 using the
     Hash MMU with >= 1TB of memory.

   - Two fixes for bugs in the recent DMA cleanup, one of which could
     lead to checkstops.

   - And finally three fixes for the PAPR SCM nvdimm driver.

   Thanks to: Alexey Kardashevskiy, Andrea Arcangeli, Cédric Le Goater,
   Christoph Hellwig, David Gibson, Gautham R. Shenoy, Michael Neuling,
   Oliver O'Halloran, Satheesh Rajendran, Shawn Anastasio, Suraj Jitindar
   Singh, Vaibhav Jain"

  * tag 'powerpc-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails
    powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL
    powerpc/pseries: Update SCM hcall op-codes in hvcall.h
    powerpc/tm: Fix oops on sigreturn on systems without TM
    powerpc/dma: Fix invalid DMA mmap behavior
    KVM: PPC: Book3S HV: XIVE: fix rollback when kvmppc_xive_create fails
    powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask()
    powerpc: fix off by one in max_zone_pfn initialization for ZONE_DMA
    KVM: PPC: Book3S HV: Save and restore guest visible PSSCR bits on pseries
    powerpc/pmu: Set pmcregs_in_use in paca when running as LPAR
    KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting
    powerpc/mm: Limit rma_size to 1TB when running without HV mode
* powerpc/mm: Limit rma_size to 1TB when running without HV mode
  Suraj Jitindar Singh, 2019-07-15 (1 file changed, -0/+9)

  The virtual real mode addressing (VRMA) mechanism is used when a
  partition is using HPT (Hash Page Table) translation and performs real
  mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this mode
  effective address bits 0:23 are treated as zero (i.e. the access is
  aliased to 0) and the access is performed using an implicit 1TB SLB
  entry.

  The size of the RMA (Real Memory Area) is communicated to the guest as
  the size of the first memory region in the device tree, and because of
  the mechanism described above it can be expected not to exceed 1TB. In
  the event that the host erroneously represents the RMA as being larger
  than 1TB, guest accesses in real mode to memory addresses above 1TB
  will be aliased down to below 1TB. This means that a memory access
  performed in real mode may differ from one performed in virtual mode
  for the same memory address, which would likely have unintended
  consequences.

  To avoid this outcome, have the guest explicitly limit the size of the
  RMA to the current maximum, which is 1TB. This means that even if the
  first memory block is larger than 1TB, only the first 1TB should be
  accessed in real mode.

  Fixes: c610d65c0ad0 ("powerpc/pseries: lift RTAS limit for hash")
  Cc: stable@vger.kernel.org # v4.16+
  Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
  Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
  Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  Link: https://lore.kernel.org/r/20190710052018.14628-1-sjitindarsingh@gmail.com
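  The clamp is essentially a min() against 1TB applied only when not
  running in hypervisor mode; a hedged sketch (variable placement is
  illustrative, the feature check is the real interface):

    /* non-HV (guest): VRMA aliases real-mode accesses within a 1TB window */
    if (!early_cpu_has_feature(CPU_FTR_HVMODE))
            ppc64_rma_size = min_t(u64, ppc64_rma_size, 1UL << 40);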
* mm: add account_locked_vm utility function
  Daniel Jordan, 2019-07-16 (1 file changed, -37/+4)

  locked_vm accounting is done roughly the same way in five places, so
  unify them in a helper. Include the helper's caller in the debug print
  to distinguish between callsites.

  Error codes stay the same, so user-visible behavior does too. The one
  exception is that the -EPERM case in tce_account_locked_vm is removed
  because Alexey has never seen it triggered.

  [daniel.m.jordan@oracle.com: v3]
  Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
  [sfr@canb.auug.org.au: fix mm/util.c]
  Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
  Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
  Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
  Acked-by: Alex Williamson <alex.williamson@redhat.com>
  Cc: Alan Tull <atull@kernel.org>
  Cc: Alex Williamson <alex.williamson@redhat.com>
  Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  Cc: Christoph Lameter <cl@linux.com>
  Cc: Christophe Leroy <christophe.leroy@c-s.fr>
  Cc: Davidlohr Bueso <dave@stgolabs.net>
  Cc: Jason Gunthorpe <jgg@mellanox.com>
  Cc: Mark Rutland <mark.rutland@arm.com>
  Cc: Michael Ellerman <mpe@ellerman.id.au>
  Cc: Moritz Fischer <mdf@kernel.org>
  Cc: Paul Mackerras <paulus@ozlabs.org>
  Cc: Steve Sistare <steven.sistare@oracle.com>
  Cc: Wu Hao <hao.wu@intel.com>
  Cc: Ira Weiny <ira.weiny@intel.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
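  The helper is symmetric; a typical caller charges RLIMIT_MEMLOCK before
  pinning and uncharges on release (surrounding context illustrative):

    ret = account_locked_vm(current->mm, npages, true);   /* charge */
    if (ret)
            return ret;     /* -ENOMEM when over RLIMIT_MEMLOCK */
    /* ... pin and use the pages ... */
    account_locked_vm(current->mm, npages, false);        /* uncharge */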
* mm/ioremap: probe platform for p4d huge map support
  Anshuman Khandual, 2019-07-16 (1 file changed, -0/+5)

  Finish up what commit c2febafc6773 ("mm: convert generic code to
  5-level paging") started by bringing P4D huge mapping support on par
  with PUD and PMD. A new arch callback arch_ioremap_p4d_supported() is
  added which just maintains the status quo (P4D huge map not supported)
  on x86, arm64 and powerpc.

  When HAVE_ARCH_HUGE_VMAP is enabled it's just a simple check from the
  arch about the support, hence runtime effects are minimal.

  Link: http://lkml.kernel.org/r/1561699231-20991-1-git-send-email-anshuman.khandual@arm.com
  Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
  Acked-by: Thomas Gleixner <tglx@linutronix.de>
  Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
  Cc: Catalin Marinas <catalin.marinas@arm.com>
  Cc: Will Deacon <will.deacon@arm.com>
  Cc: Dave Hansen <dave.hansen@linux.intel.com>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: Ingo Molnar <mingo@redhat.com>
  Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
  Cc: Michal Hocko <mhocko@kernel.org>
  Cc: Stephen Rothwell <sfr@canb.auug.org.au>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
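  On the architectures touched, the new callback just preserves the
  status quo:

    int __init arch_ioremap_p4d_supported(void)
    {
            return 0;       /* no P4D huge ioremap mappings yet */
    }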
* Merge tag 'powerpc-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
  Linus Torvalds, 2019-07-13 (9 files changed, -134/+181)

  Pull powerpc updates from Michael Ellerman:
  "Notable changes:

   - Removal of the NPU DMA code, used by the out-of-tree Nvidia driver,
     as well as some other functions only used by drivers that haven't
     (yet?) made it upstream.

   - A fix for a bug in our handling of hardware watchpoints (eg. perf
     record -e mem: ...) which could lead to register corruption and
     kernel crashes.

   - Enable HAVE_ARCH_HUGE_VMAP, which allows us to use large pages for
     vmalloc when using the Radix MMU.

   - A large but incremental rewrite of our exception handling code to
     use gas macros rather than multiple levels of nested CPP macros.

   And the usual small fixes, cleanups and improvements.

   Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Andreas Schwab,
   Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Arnd Bergmann,
   Athira Rajeev, Cédric Le Goater, Christian Lamparter, Christophe
   Leroy, Christophe Lombard, Christoph Hellwig, Daniel Axtens, Denis
   Efremov, Enrico Weigelt, Frederic Barrat, Gautham R. Shenoy, Geert
   Uytterhoeven, Geliang Tang, Gen Zhang, Greg Kroah-Hartman, Greg Kurz,
   Gustavo Romero, Krzysztof Kozlowski, Madhavan Srinivasan, Masahiro
   Yamada, Mathieu Malaterre, Michael Neuling, Nathan Lynch, Naveen N.
   Rao, Nicholas Piggin, Nishad Kamdar, Oliver O'Halloran, Qian Cai, Ravi
   Bangoria, Sachin Sant, Sam Bobroff, Satheesh Rajendran, Segher
   Boessenkool, Shaokun Zhang, Shawn Anastasio, Stewart Smith, Suraj
   Jitindar Singh, Thiago Jung Bauermann, YueHaibing"

  * tag 'powerpc-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (163 commits)
    powerpc/powernv/idle: Fix restore of SPRN_LDBAR for POWER9 stop state.
    powerpc/eeh: Handle hugepages in ioremap space
    ocxl: Update for AFU descriptor template version 1.1
    powerpc/boot: pass CONFIG options in a simpler and more robust way
    powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h
    powerpc/irq: Don't WARN continuously in arch_local_irq_restore()
    powerpc/module64: Use symbolic instructions names.
    powerpc/module32: Use symbolic instructions names.
    powerpc: Move PPC_HA() PPC_HI() and PPC_LO() to ppc-opcode.h
    powerpc/module64: Fix comment in R_PPC64_ENTRY handling
    powerpc/boot: Add lzo support for uImage
    powerpc/boot: Add lzma support for uImage
    powerpc/boot: don't force gzipped uImage
    powerpc/8xx: Add microcode patch to move SMC parameter RAM.
    powerpc/8xx: Use IO accessors in microcode programming.
    powerpc/8xx: replace #ifdefs by IS_ENABLED() in microcode.c
    powerpc/8xx: refactor programming of microcode CPM params.
    powerpc/8xx: refactor printing of microcode patch name.
    powerpc/8xx: Refactor microcode write
    powerpc/8xx: refactor writing of CPM microcode arrays
    ...
* powerpc/mm: pmd_devmap implies pmd_large().
  Aneesh Kumar K.V, 2019-07-05 (1 file changed, -1/+1)

  Large devmap usage is dependent on THP, hence one check is sufficient.

  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>