summaryrefslogtreecommitdiffstats
path: root/arch/powerpc/kernel
Commit message (Collapse)AuthorAgeFilesLines
...
| | * | | powerpc: Detect the secure boot mode of the systemNayna Jain2019-11-122-0/+37
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch defines a function to detect the secure boot state of a PowerNV system. The PPC_SECURE_BOOT config represents the base enablement of secure boot for powerpc. Signed-off-by: Nayna Jain <nayna@linux.ibm.com> Signed-off-by: Eric Richter <erichte@linux.ibm.com> [mpe: Fold in change from Nayna to add "ibm,secureboot" to ids] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/46b003b9-3225-6bf7-9101-ed6580bb748c@linux.ibm.com
| * | | powerpc: Convert flush_icache_range & friends to CAlastair D'Silva2019-11-072-222/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to commit 22e9c88d486a ("powerpc/64: reuse PPC32 static inline flush_dcache_range()") this patch converts the following ASM symbols to C: flush_icache_range() __flush_dcache_icache() __flush_dcache_icache_phys() This was done as we discovered a long-standing bug where the length of the range was truncated due to using a 32 bit shift instead of a 64 bit one. By converting these functions to C, it becomes easier to maintain. flush_dcache_icache_phys() retains a critical assembler section as we must ensure there are no memory accesses while the data MMU is disabled (authored by Christophe Leroy). Since this has no external callers, it has also been made static, allowing the compiler to inline it within flush_dcache_icache_page(). Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> [mpe: Minor fixups, don't export __flush_dcache_icache()] Link: https://lore.kernel.org/r/20191104023305.9581-5-alastair@au1.ibm.com
| * | | powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GBAlastair D'Silva2019-11-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When calling __kernel_sync_dicache with a size >4GB, we were masking off the upper 32 bits, so we would incorrectly flush a range smaller than intended. This patch replaces the 32 bit shifts with 64 bit ones, so that the full size is accounted for. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191104023305.9581-3-alastair@au1.ibm.com
| * | | powerpc: Allow flush_icache_range to work across ranges >4GBAlastair D'Silva2019-11-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When calling flush_icache_range with a size >4GB, we were masking off the upper 32 bits, so we would incorrectly flush a range smaller than intended. This patch replaces the 32 bit shifts with 64 bit ones, so that the full size is accounted for. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191104023305.9581-2-alastair@au1.ibm.com
| * | | powerpc: Support CMDLINE_EXTENDChris Packham2019-11-071-12/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bring powerpc in line with other architectures that support extending or overriding the bootloader provided command line. The current behaviour is most like CMDLINE_FROM_BOOTLOADER where the bootloader command line is preferred but the kernel config can provide a fallback so CMDLINE_FROM_BOOTLOADER is the default. CMDLINE_EXTEND can be used to append the CMDLINE from the kernel config to the one provided by the bootloader. Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190801225006.21952-1-chris.packham@alliedtelesis.co.nz
| * | | powerpc/64s: Always disable branch profiling for prom_init.oMichael Ellerman2019-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise the build fails because prom_init is calling symbols it's not allowed to, eg: Error: External symbol 'ftrace_likely_update' referenced from prom_init.c make[3]: *** [arch/powerpc/kernel/Makefile:197: arch/powerpc/kernel/prom_init_check] Error 1 Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191106051129.7626-1-mpe@ellerman.id.au
| * | | powerpc/security: Fix debugfs data leak on 32-bitGeert Uytterhoeven2019-11-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "powerpc_security_features" is "unsigned long", i.e. 32-bit or 64-bit, depending on the platform (PPC_FSL_BOOK3E or PPC_BOOK3S_64). Hence casting its address to "u64 *", and calling debugfs_create_x64() is wrong, and leaks 32-bit of nearby data to userspace on 32-bit platforms. While all currently defined SEC_FTR_* security feature flags fit in 32-bit, they all have "ULL" suffixes to make them 64-bit constants. Hence fix the leak by changing the type of "powerpc_security_features" (and the parameter types of its accessors) to "u64". This also allows to drop the cast. Fixes: 398af571128fe75f ("powerpc/security: Show powerpc_security_features in debugfs") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191021142309.28105-1-geert+renesas@glider.be
| * | | powerpc/security/book3s64: Report L1TF status in sysfsAnthony Steinhauser2019-11-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some PowerPC CPUs are vulnerable to L1TF to the same extent as to Meltdown. It is also mitigated by flushing the L1D on privilege transition. Currently the sysfs gives a false negative on L1TF on CPUs that I verified to be vulnerable, a Power9 Talos II Boston 004e 1202, PowerNV T2P9D01. Signed-off-by: Anthony Steinhauser <asteinhauser@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> [mpe: Just have cpu_show_l1tf() call cpu_show_meltdown() directly] Link: https://lore.kernel.org/r/20191029190759.84821-1-asteinhauser@google.com
| * | | Merge branch 'fixes' into nextMichael Ellerman2019-11-042-1/+15
| |\ \ \ | | | |/ | | |/| | | | | | | | | Merge our fixes branch, primarily to bring in the powernv CPU hotplug warning fix.
| * | | powerpc: Add build-time check of ptrace PT_xx definesMichael Ellerman2019-10-301-0/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of the uapi we export a lot of PT_xx defines for each register in struct pt_regs. These are expressed as an index from gpr[0], in units of unsigned long. Currently there's nothing tying the values of those defines to the actual layout of the struct. But we *don't* want to change the uapi defines to derive the PT_xx values based on the layout of the struct, those values are ABI and must never change. Instead we want to do the reverse, make sure that the layout of the struct never changes vs the PT_xx defines. So add build time checks of that. This probably seems paranoid, but at least once in the past someone has sent a patch that would have broken the ABI if it hadn't been spotted. Although it probably would have been detected via testing, it's preferable to just quash any issues at the source. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191030111231.22720-1-mpe@ellerman.id.au
| * | | powerpc/ptrace: Add prototype for function pt_regs_checkMathieu Malaterre2019-10-301-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `pt_regs_check` is a dummy function, its purpose is to break the build if struct pt_regs and struct user_pt_regs don't match. This function has no functionnal purpose, and will get eliminated at link time or after init depending on CONFIG_LD_DEAD_CODE_DATA_ELIMINATION This commit adds a prototype to fix warning at W=1: arch/powerpc/kernel/ptrace.c:3339:13: error: no previous prototype for ‘pt_regs_check’ [-Werror=missing-prototypes] Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20181208154624.6504-1-malat@debian.org
| * | | powerpc/pseries: Mark accumulate_stolen_time() as notraceMichael Ellerman2019-10-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | accumulate_stolen_time() is called prior to interrupt state being reconciled, which can trip the warning in arch_local_irq_restore(): WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 .arch_local_irq_restore+0x9c/0x130 ... NIP .arch_local_irq_restore+0x9c/0x130 LR .rb_start_commit+0x38/0x80 Call Trace: .ring_buffer_lock_reserve+0xe4/0x620 .trace_function+0x44/0x210 .function_trace_call+0x148/0x170 .ftrace_ops_no_ops+0x180/0x1d0 ftrace_call+0x4/0x8 .accumulate_stolen_time+0x1c/0xb0 decrementer_common+0x124/0x160 For now just mark it as notrace. We may change the ordering to call it after interrupt state has been reconciled, but that is a larger change. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024055932.27940-1-mpe@ellerman.id.au
| * | | powerpc/setup_64: fix -Wempty-body warningsQian Cai2019-10-111-20/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the beginning of setup_64.c, it has, #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) #else #define DBG(fmt...) #endif where DBG() could be compiled away, and generate warnings, arch/powerpc/kernel/setup_64.c: In function 'initialize_cache_info': arch/powerpc/kernel/setup_64.c:579:49: warning: suggest braces around empty body in an 'if' statement [-Wempty-body] DBG("Argh, can't find dcache properties !\n"); ^ arch/powerpc/kernel/setup_64.c:582:49: warning: suggest braces around empty body in an 'if' statement [-Wempty-body] DBG("Argh, can't find icache properties !\n"); Fix it by using the suggestions from Michael: "Neither of those sites should use DBG(), that's not really early boot code, they should just use pr_warn(). And the other uses of DBG() in initialize_cache_info() should just be removed. In smp_release_cpus() the entry/exit DBG's should just be removed, and the spinning_secondaries line should just be pr_debug(). That would just leave the two calls in early_setup(). If we taught udbg_printf() to return early when udbg_putc is NULL, then we could just call udbg_printf() unconditionally and get rid of the DBG macro entirely." Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Qian Cai <cai@lca.pw> [mpe: Split udbg change out into previous patch] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1563215552-8166-1-git-send-email-cai@lca.pw
| * | | powerpc/udbg: Make it safe to call udbg_printf() alwaysMichael Ellerman2019-10-111-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make udbg_printf() check if udbg_putc is set, and if not just return. This makes it safe to call udbg_printf() anytime, even when a udbg backend has not been registered, which means we can avoid some ifdefs at call sites. Signed-off-by: Qian Cai <cai@lca.pw> [mpe: Split out of larger patch, write change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | powerpc: make syntax for FADump config options in kernel/Makefile readableHari Bathini2019-10-111-3/+2
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | arch/powerpc/kernel/fadump.c file needs to be compiled in if 'config FA_DUMP' or 'config PRESERVE_FA_DUMP' is set. The current syntax achieves that but looks a bit odd. Fix it for better readability. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/157063484064.11906.3586824898111397624.stgit@hbathini.in.ibm.com
* | | Merge tag 'powerpc-spectre-rsb' of powerpc-CVE-2019-18660.bundleLinus Torvalds2019-11-272-4/+59
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull powerpc Spectre-RSB fixes from Michael Ellerman: "We failed to activate the mitigation for Spectre-RSB (Return Stack Buffer, aka. ret2spec) on context switch, on CPUs prior to Power9 DD2.3. That allows a process to poison the RSB (called Link Stack on Power CPUs) and possibly misdirect speculative execution of another process. If the victim process can be induced to execute a leak gadget then it may be possible to extract information from the victim via a side channel. The fix is to correctly activate the link stack flush mitigation on all CPUs that have any mitigation of Spectre v2 in userspace enabled. There's a second commit which adds a link stack flush in the KVM guest exit path. A leak via that path has not been demonstrated, but we believe it's at least theoretically possible. This is the fix for CVE-2019-18660" * tag 'powerpc-spectre-rsb' of /home/torvalds/Downloads/powerpc-CVE-2019-18660.bundle: KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernel powerpc/book3s64: Fix link stack flush on context switch
| * | | KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernelMichael Ellerman2019-11-141-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some systems that are vulnerable to Spectre v2, it is up to software to flush the link stack (return address stack), in order to protect against Spectre-RSB. When exiting from a guest we do some house keeping and then potentially exit to C code which is several stack frames deep in the host kernel. We will then execute a series of returns without preceeding calls, opening up the possiblity that the guest could have poisoned the link stack, and direct speculative execution of the host to a gadget of some sort. To prevent this we add a flush of the link stack on exit from a guest. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | powerpc/book3s64: Fix link stack flush on context switchMichael Ellerman2019-11-142-4/+50
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit ee13cb249fab ("powerpc/64s: Add support for software count cache flush"), I added support for software to flush the count cache (indirect branch cache) on context switch if firmware told us that was the required mitigation for Spectre v2. As part of that code we also added a software flush of the link stack (return address stack), which protects against Spectre-RSB between user processes. That is all correct for CPUs that activate that mitigation, which is currently Power9 Nimbus DD2.3. What I got wrong is that on older CPUs, where firmware has disabled the count cache, we also need to flush the link stack on context switch. To fix it we create a new feature bit which is not set by firmware, which tells us we need to flush the link stack. We set that when firmware tells us that either of the existing Spectre v2 mitigations are enabled. Then we adjust the patching code so that if we see that feature bit we enable the link stack flush. If we're also told to flush the count cache in software then we fall through and do that also. On the older CPUs we don't need to do do the software count cache flush, firmware has disabled it, so in that case we patch in an early return after the link stack flush. The naming of some of the functions is awkward after this patch, because they're called "count cache" but they also do link stack. But we'll fix that up in a later commit to ease backporting. This is the fix for CVE-2019-18660. Reported-by: Anthony Steinhauser <asteinhauser@google.com> Fixes: ee13cb249fab ("powerpc/64s: Add support for software count cache flush") Cc: stable@vger.kernel.org # v4.4+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* | | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2019-11-261-3/+3
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The biggest changes in this cycle were: - Make kcpustat vtime aware (Frederic Weisbecker) - Rework the CFS load_balance() logic (Vincent Guittot) - Misc cleanups, smaller enhancements, fixes. The load-balancing rework is the most intrusive change: it replaces the old heuristics that have become less meaningful after the introduction of the PELT metrics, with a grounds-up load-balancing algorithm. As such it's not really an iterative series, but replaces the old load-balancing logic with the new one. We hope there are no performance regressions left - but statistically it's highly probable that there *is* going to be some workload that is hurting from these chnages. If so then we'd prefer to have a look at that workload and fix its scheduling, instead of reverting the changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) rackmeter: Use vtime aware kcpustat accessor leds: Use all-in-one vtime aware kcpustat accessor cpufreq: Use vtime aware kcpustat accessors for user time procfs: Use all-in-one vtime aware kcpustat accessor sched/vtime: Bring up complete kcpustat accessor sched/cputime: Support other fields on kcpustat_field() sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util() sched/fair: Add comments for group_type and balancing at SD_NUMA level sched/fair: Fix rework of find_idlest_group() sched/uclamp: Fix overzealous type replacement sched/Kconfig: Fix spelling mistake in user-visible help text sched/core: Further clarify sched_class::set_next_task() sched/fair: Use mul_u32_u32() sched/core: Simplify sched_class::pick_next_task() sched/core: Optimize pick_next_task() sched/core: Make pick_next_task_idle() more consistent sched/fair: Better document newidle_balance() leds: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM cpufreq: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM procfs: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM ...
| * | | Merge tag 'v5.4-rc7' into sched/core, to pick up fixesIngo Molnar2019-11-112-1/+15
| |\| | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/cputime: Rename vtime_account_system() to vtime_account_kernel()Frederic Weisbecker2019-10-091-3/+3
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vtime_account_system() decides if we need to account the time to the system (__vtime_account_system()) or to the guest (vtime_account_guest()). So this function is a misnomer as we are on a higher level than "system". All we know when we call that function is that we are accounting kernel cputime. Whether it belongs to guest or system time is a lower level detail. Rename this function to vtime_account_kernel(). This will clarify things and avoid too many underscored vtime_account_system() versions. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191003161745.28464-2-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds2019-11-261-30/+7
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: "The main changes in this cycle were: - Cross-arch changes to move the linker sections for NOTES and EXCEPTION_TABLE into the RO_DATA area, where they belong on most architectures. (Kees Cook) - Switch the x86 linker fill byte from x90 (NOP) to 0xcc (INT3), to trap jumps into the middle of those padding areas instead of sliding execution. (Kees Cook) - A thorough cleanup of symbol definitions within x86 assembler code. The rather randomly named macros got streamlined around a (hopefully) straightforward naming scheme: SYM_START(name, linkage, align...) SYM_END(name, sym_type) SYM_FUNC_START(name) SYM_FUNC_END(name) SYM_CODE_START(name) SYM_CODE_END(name) SYM_DATA_START(name) SYM_DATA_END(name) etc - with about three times of these basic primitives with some label, local symbol or attribute variant, expressed via postfixes. No change in functionality intended. (Jiri Slaby) - Misc other changes, cleanups and smaller fixes" * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits) x86/entry/64: Remove pointless jump in paranoid_exit x86/entry/32: Remove unused resume_userspace label x86/build/vdso: Remove meaningless CFLAGS_REMOVE_*.o m68k: Convert missed RODATA to RO_DATA x86/vmlinux: Use INT3 instead of NOP for linker fill bytes x86/mm: Report actual image regions in /proc/iomem x86/mm: Report which part of kernel image is freed x86/mm: Remove redundant address-of operators on addresses xtensa: Move EXCEPTION_TABLE to RO_DATA segment powerpc: Move EXCEPTION_TABLE to RO_DATA segment parisc: Move EXCEPTION_TABLE to RO_DATA segment microblaze: Move EXCEPTION_TABLE to RO_DATA segment ia64: Move EXCEPTION_TABLE to RO_DATA segment h8300: Move EXCEPTION_TABLE to RO_DATA segment c6x: Move EXCEPTION_TABLE to RO_DATA segment arm64: Move EXCEPTION_TABLE to RO_DATA segment alpha: Move EXCEPTION_TABLE to RO_DATA segment x86/vmlinux: Move EXCEPTION_TABLE to RO_DATA segment x86/vmlinux: Actually use _etext for the end of the text segment vmlinux.lds.h: Allow EXCEPTION_TABLE to live in RO_DATA ...
| * | powerpc: Move EXCEPTION_TABLE to RO_DATA segmentKees Cook2019-11-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the EXCEPTION_TABLE is read-only, collapse it into RO_DATA. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Will Deacon <will@kernel.org> Cc: x86-ml <x86@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20191029211351.13243-25-keescook@chromium.org
| * | vmlinux.lds.h: Move NOTES into RO_DATAKees Cook2019-11-041-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The .notes section should be non-executable read-only data. As such, move it to the RO_DATA macro instead of being per-architecture defined. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # s390 Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Will Deacon <will@kernel.org> Cc: x86-ml <x86@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20191029211351.13243-11-keescook@chromium.org
| * | vmlinux.lds.h: Move Program Header restoration into NOTES macroKees Cook2019-11-041-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for moving NOTES into RO_DATA, make the Program Header assignment restoration be part of the NOTES macro itself. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # s390 Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Will Deacon <will@kernel.org> Cc: x86-ml <x86@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20191029211351.13243-10-keescook@chromium.org
| * | vmlinux.lds.h: Provide EMIT_PT_NOTE to indicate export of .notesKees Cook2019-11-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for moving NOTES into RO_DATA, provide a mechanism for architectures that want to emit a PT_NOTE Program Header to do so. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # s390 Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Will Deacon <will@kernel.org> Cc: x86-ml <x86@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20191029211351.13243-9-keescook@chromium.org
| * | powerpc: Rename PT_LOAD identifier "kernel" to "text"Kees Cook2019-11-041-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for moving NOTES into RO_DATA, rename the linker script internal identifier for the PT_LOAD Program Header from "kernel" to "text" to match other architectures. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Will Deacon <will@kernel.org> Cc: x86@kernel.org Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20191029211351.13243-4-keescook@chromium.org
| * | powerpc: Remove PT_NOTE workaroundKees Cook2019-11-041-22/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for moving NOTES into RO_DATA, remove the PT_NOTE workaround since the kernel requires at least gcc 4.6 now. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Will Deacon <will@kernel.org> Cc: x86@kernel.org Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20191029211351.13243-3-keescook@chromium.org
| * | powerpc: Rename "notes" PT_NOTE to "note"Kees Cook2019-11-041-2/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Program Header identifiers are internal to the linker scripts. In preparation for moving the NOTES segment declaration into RO_DATA, standardize the identifier for the PT_NOTE entry to "note" as used by all other architectures that emit PT_NOTE. Note that there was discussion about changing all architectures to use "notes" instead, but I prefer to avoid that at this time. Changing only powerpc is the smallest change to standardize the entire kernel. And while this standardization does use singular "note" for a section that has more than one note in it, this is just an internal identifier. It matches the ELF "PT_NOTE", and is 4 characters (like "text", and "data") for pretty alignment. The more exposed macro, "NOTES", use the more sensible plural wording. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Will Deacon <will@kernel.org> Cc: x86-ml <x86@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20191029211351.13243-2-keescook@chromium.org
* / powerpc/prom_init: Undo relocation before entering secure modeThiago Jung Bauermann2019-10-292-1/+15
|/ | | | | | | | | | | | | | | | The ultravisor will do an integrity check of the kernel image but we relocated it so the check will fail. Restore the original image by relocating it back to the kernel virtual base address. This works because during build vmlinux is linked with an expected virtual runtime address of KERNELBASE. Fixes: 6a9c930bd775 ("powerpc/prom_init: Add the ESM call to prom_init") Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Tested-by: Michael Anderson <andmike@linux.ibm.com> [mpe: Add IS_ENABLED() to fix the CONFIG_RELOCATABLE=n build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190911163433.12822-1-bauerman@linux.ibm.com
* powerpc/eeh: Fix eeh eeh_debugfs_break_device() with SRIOV devicesOliver O'Halloran2019-09-271-2/+2
| | | | | | | | | | | | s/CONFIG_IOV/CONFIG_PCI_IOV/ Whoops. Fixes: bd6461cc7b3c ("powerpc/eeh: Add a eeh_dev_break debugfs interface") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> [mpe: Fixup the #endif comment as well] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190926122502.14826-1-oohall@gmail.com
* powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9Aneesh Kumar K.V2019-09-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | On POWER9, under some circumstances, a broadcast TLB invalidation will fail to invalidate the ERAT cache on some threads when there are parallel mtpidr/mtlpidr happening on other threads of the same core. This can cause stores to continue to go to a page after it's unmapped. The workaround is to force an ERAT flush using PID=0 or LPID=0 tlbie flush. This additional TLB flush will cause the ERAT cache invalidation. Since we are using PID=0 or LPID=0, we don't get filtered out by the TLB snoop filtering logic. We need to still follow this up with another tlbie to take care of store vs tlbie ordering issue explained in commit: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9"). The presence of ERAT cache implies we can still get new stores and they may miss store queue marking flush. Cc: stable@vger.kernel.org Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
* powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flagAneesh Kumar K.V2019-09-241-3/+3
| | | | | | | | | | | | Rename the #define to indicate this is related to store vs tlbie ordering issue. In the next patch, we will be adding another feature flag that is used to handles ERAT flush vs tlbie ordering issue. Fixes: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
* powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisionsAneesh Kumar K.V2019-09-241-2/+28
| | | | | | | | | | | | | | | | | | The store ordering vs tlbie issue mentioned in commit a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") is fixed for Nimbus 2.3 and Cumulus 1.3 revisions. We don't need to apply the fixup if we are running on them We can only do this on PowerNV. On pseries guest with KVM we still don't support redoing the feature fixup after migration. So we should be enabling all the workarounds needed, because whe can possibly migrate between DD 2.3 and DD 2.2 Fixes: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-1-aneesh.kumar@linux.ibm.com
* KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flagMichael Roth2019-09-241-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB of memory running the following guest configs: guest A: - 224GB of memory - 56 VCPUs (sockets=1,cores=28,threads=2), where: VCPUs 0-1 are pinned to CPUs 0-3, VCPUs 2-3 are pinned to CPUs 4-7, ... VCPUs 54-55 are pinned to CPUs 108-111 guest B: - 4GB of memory - 4 VCPUs (sockets=1,cores=4,threads=1) with the following workloads (with KSM and THP enabled in all): guest A: stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M guest B: stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M host: stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M the below soft-lockup traces were observed after an hour or so and persisted until the host was reset (this was found to be reliably reproducible for this configuration, for kernels 4.15, 4.18, 5.0, and 5.3-rc5): [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU [ 1253.183319] rcu: 124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941 [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709] [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913] [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331] [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338] [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525] [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU [ 1316.198032] rcu: 124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243 [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU [ 1379.212629] rcu: 124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714 [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU [ 1442.227115] rcu: 124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403 [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds. [ 1455.111822] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds. [ 1455.111905] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds. [ 1455.111986] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds. [ 1455.112068] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds. [ 1455.112159] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds. [ 1455.112231] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds. [ 1455.112303] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds. [ 1455.112392] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 CPUs 45, 24, and 124 are stuck on spin locks, likely held by CPUs 105 and 31. CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on target CPU 42. For instance: # CPU 105 registers (via xmon) R00 = c00000000020b20c R16 = 00007d1bcd800000 R01 = c00000363eaa7970 R17 = 0000000000000001 R02 = c0000000019b3a00 R18 = 000000000000006b R03 = 000000000000002a R19 = 00007d537d7aecf0 R04 = 000000000000002a R20 = 60000000000000e0 R05 = 000000000000002a R21 = 0801000000000080 R06 = c0002073fb0caa08 R22 = 0000000000000d60 R07 = c0000000019ddd78 R23 = 0000000000000001 R08 = 000000000000002a R24 = c00000000147a700 R09 = 0000000000000001 R25 = c0002073fb0ca908 R10 = c000008ffeb4e660 R26 = 0000000000000000 R11 = c0002073fb0ca900 R27 = c0000000019e2464 R12 = c000000000050790 R28 = c0000000000812b0 R13 = c000207fff623e00 R29 = c0002073fb0ca808 R14 = 00007d1bbee00000 R30 = c0002073fb0ca800 R15 = 00007d1bcd600000 R31 = 0000000000000800 pc = c00000000020b260 smp_call_function_many+0x3d0/0x460 cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460 lr = c00000000020b20c smp_call_function_many+0x37c/0x460 msr = 900000010288b033 cr = 44024824 ctr = c000000000050790 xer = 0000000000000000 trap = 100 CPU 42 is running normally, doing VCPU work: # CPU 42 stack trace (via xmon) [link register ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv] [c000008ed3343820] c000008ed3343850 (unreliable) [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv] [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv] [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm] [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm] [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70 [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120 [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80 [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70 --- Exception: c00 (System Call) at 00007d545cfd7694 SP (7d53ff7edf50) is in userspace It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION] was set for CPU 42 by at least 1 of the CPUs waiting in smp_call_function_many(), but somehow the corresponding call_single_queue entries were never processed by CPU 42, causing the callers to spin in csd_lock_wait() indefinitely. Nick Piggin suggested something similar to the following sequence as a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE IPI messages seems to be most common, but any mix of CALL_FUNCTION and !CALL_FUNCTION messages could trigger it): CPU X: smp_muxed_ipi_set_message(): X: smp_mb() X: message[RESCHEDULE] = 1 X: doorbell_global_ipi(42): X: kvmppc_set_host_ipi(42, 1) X: ppc_msgsnd_sync()/smp_mb() X: ppc_msgsnd() -> 42 42: doorbell_exception(): // from CPU X 42: ppc_msgsync() 105: smp_muxed_ipi_set_message(): 105: smb_mb() // STORE DEFERRED DUE TO RE-ORDERING --105: message[CALL_FUNCTION] = 1 | 105: doorbell_global_ipi(42): | 105: kvmppc_set_host_ipi(42, 1) | 42: kvmppc_set_host_ipi(42, 0) | 42: smp_ipi_demux_relaxed() | 42: // returns to executing guest | // RE-ORDERED STORE COMPLETES ->105: message[CALL_FUNCTION] = 1 105: ppc_msgsnd_sync()/smp_mb() 105: ppc_msgsnd() -> 42 42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored 105: // hangs waiting on 42 to process messages/call_single_queue This can be prevented with an smp_mb() at the beginning of kvmppc_set_host_ipi(), such that stores to message[<type>] (or other state indicated by the host_ipi flag) are ordered vs. the store to to host_ipi. However, doing so might still allow for the following scenario (not yet observed): CPU X: smp_muxed_ipi_set_message(): X: smp_mb() X: message[RESCHEDULE] = 1 X: doorbell_global_ipi(42): X: kvmppc_set_host_ipi(42, 1) X: ppc_msgsnd_sync()/smp_mb() X: ppc_msgsnd() -> 42 42: doorbell_exception(): // from CPU X 42: ppc_msgsync() // STORE DEFERRED DUE TO RE-ORDERING -- 42: kvmppc_set_host_ipi(42, 0) | 42: smp_ipi_demux_relaxed() | 105: smp_muxed_ipi_set_message(): | 105: smb_mb() | 105: message[CALL_FUNCTION] = 1 | 105: doorbell_global_ipi(42): | 105: kvmppc_set_host_ipi(42, 1) | // RE-ORDERED STORE COMPLETES -> 42: kvmppc_set_host_ipi(42, 0) 42: // returns to executing guest 105: ppc_msgsnd_sync()/smp_mb() 105: ppc_msgsnd() -> 42 42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored 105: // hangs waiting on 42 to process messages/call_single_queue Fixing this scenario would require an smp_mb() *after* clearing host_ipi flag in kvmppc_set_host_ipi() to order the store vs. subsequent processing of IPI messages. To handle both cases, this patch splits kvmppc_set_host_ipi() into separate set/clear functions, where we execute smp_mb() prior to setting host_ipi flag, and after clearing host_ipi flag. These functions pair with each other to synchronize the sender and receiver sides. With that change in place the above workload ran for 20 hours without triggering any lock-ups. Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0 Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
* powerpc/64s: Set reserved PCR bitsJordan Niethe2019-09-212-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the reserved bits of the Processor Compatibility Register (PCR) are cleared as per the Programming Note in Section 1.3.3 of version 3.0B of the Power ISA. This causes all new architecture features to be made available when running on newer processors with new architecture features added to the PCR as bits must be set to disable a given feature. For example to disable new features added as part of Version 2.07 of the ISA the corresponding bit in the PCR needs to be set. As new processor features generally require explicit kernel support they should be disabled until such support is implemented. Therefore kernels should set all unknown/reserved bits in the PCR such that any new architecture features which the kernel does not currently know about get disabled. An update is planned to the ISA to clarify that the PCR is an exception to the Programming Note on reserved bits in Section 1.3.3. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Tested-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190917004605.22471-2-alistair@popple.id.au
* Merge tag 'powerpc-5.4-1' of ↵Linus Torvalds2019-09-2055-2159/+2715
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "This is a bit late, partly due to me travelling, and partly due to a power outage knocking out some of my test systems *while* I was travelling. - Initial support for running on a system with an Ultravisor, which is software that runs below the hypervisor and protects guests against some attacks by the hypervisor. - Support for building the kernel to run as a "Secure Virtual Machine", ie. as a guest capable of running on a system with an Ultravisor. - Some changes to our DMA code on bare metal, to allow devices with medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space. - Support for firmware assisted crash dumps on bare metal (powernv). - Two series fixing bugs in and refactoring our PCI EEH code. - A large series refactoring our exception entry code to use gas macros, both to make it more readable and also enable some future optimisations. As well as many cleanups and other minor features & fixups. Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar, Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde" * tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits) powerpc/mm/mce: Keep irqs disabled during lockless page table walk powerpc: Use ftrace_graph_ret_addr() when unwinding powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR ftrace: Look up the address of return_to_handler() using helpers powerpc: dump kernel log before carrying out fadump or kdump docs: powerpc: Add missing documentation reference powerpc/xmon: Fix output of XIVE IPI powerpc/xmon: Improve output of XIVE interrupts powerpc/mm/radix: remove useless kernel messages powerpc/fadump: support holes in kernel boot memory area powerpc/fadump: remove RMA_START and RMA_END macros powerpc/fadump: update documentation about option to release opalcore powerpc/fadump: consider f/w load area powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel powerpc/fadump: improve how crashed kernel's memory is reserved powerpc/fadump: consider reserved ranges while releasing memory powerpc/fadump: make crash memory ranges array allocation generic ...
| * powerpc/mm/mce: Keep irqs disabled during lockless page table walkAneesh Kumar K.V2019-09-191-8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __find_linux_mm_pte() returns a page table entry pointer after walking the page table without holding locks. To make it safe against a THP split and/or collapse, we disable interrupts around the lockless page table walk. However we need to keep interrupts disabled as long as we use the page table entry pointer that is returned. Fix addr_to_pfn() to do that. Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [mpe: Rearrange code slightly and tweak change log wording] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190918145328.28602-1-aneesh.kumar@linux.ibm.com
| * powerpc: Use ftrace_graph_ret_addr() when unwindingNaveen N. Rao2019-09-181-13/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With support for HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, ftrace_graph_ret_addr() provides more robust unwinding when function graph is in use. Update show_stack() to use the same. With dump_stack() added to sysrq_sysctl_handler(), before this patch: root@(none):/sys/kernel/debug/tracing# cat /proc/sys/kernel/sysrq CPU: 0 PID: 218 Comm: cat Not tainted 5.3.0-rc7-00868-g8453ad4a078c-dirty #20 Call Trace: [c0000000d1e13c30] [c00000000006ab98] return_to_handler+0x0/0x40 (dump_stack+0xe8/0x164) (unreliable) [c0000000d1e13c80] [c000000000145680] sysrq_sysctl_handler+0x48/0xb8 [c0000000d1e13cd0] [c00000000006ab98] return_to_handler+0x0/0x40 (proc_sys_call_handler+0x274/0x2a0) [c0000000d1e13d60] [c00000000006ab98] return_to_handler+0x0/0x40 (return_to_handler+0x0/0x40) [c0000000d1e13d80] [c00000000006ab98] return_to_handler+0x0/0x40 (__vfs_read+0x3c/0x70) [c0000000d1e13dd0] [c00000000006ab98] return_to_handler+0x0/0x40 (vfs_read+0xb8/0x1b0) [c0000000d1e13e20] [c00000000006ab98] return_to_handler+0x0/0x40 (ksys_read+0x7c/0x140) After this patch: Call Trace: [c0000000d1e33c30] [c00000000006ab58] return_to_handler+0x0/0x40 (dump_stack+0xe8/0x164) (unreliable) [c0000000d1e33c80] [c000000000145680] sysrq_sysctl_handler+0x48/0xb8 [c0000000d1e33cd0] [c00000000006ab58] return_to_handler+0x0/0x40 (proc_sys_call_handler+0x274/0x2a0) [c0000000d1e33d60] [c00000000006ab58] return_to_handler+0x0/0x40 (__vfs_read+0x3c/0x70) [c0000000d1e33d80] [c00000000006ab58] return_to_handler+0x0/0x40 (vfs_read+0xb8/0x1b0) [c0000000d1e33dd0] [c00000000006ab58] return_to_handler+0x0/0x40 (ksys_read+0x7c/0x140) [c0000000d1e33e20] [c00000000006ab58] return_to_handler+0x0/0x40 (system_call+0x5c/0x68) Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/dc89c9a887121342d9c7819482c3dabdece2a323.1567707399.git.naveen.n.rao@linux.vnet.ibm.com
| * powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTRNaveen N. Rao2019-09-185-3/+7
| | | | | | | | | | | | | | | | | | | | This associates entries in the ftrace_ret_stack with corresponding stack frames, enabling more robust stack unwinding. Also update the only user of ftrace_graph_ret_addr() to pass the stack pointer. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/0224f2d0971b069c678e2ff678cfc2cd1e114cfe.1567707399.git.naveen.n.rao@linux.vnet.ibm.com
| * powerpc: dump kernel log before carrying out fadump or kdumpGanesh Goudar2019-09-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path"), pstore dmesg file is not updated when dump is triggered from HMC. This commit modified system reset (sreset) handler to invoke fadump or kdump (if configured), without pushing dmesg to pstore. This leaves pstore to have old dmesg data which won't be much of a help if kdump fails to capture the dump. This patch fixes that by calling kmsg_dump() before heading to fadump ot kdump. Fixes: 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path") Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190904075949.15607-1-ganeshgr@linux.ibm.com
| * powerpc/fadump: support holes in kernel boot memory areaHari Bathini2019-09-141-29/+162
| | | | | | | | | | | | | | | | | | | | | | With support to copy multiple kernel boot memory regions owing to copy size limitation, also handle holes in the memory area to be preserved. Support as many as 128 kernel boot memory regions. This allows having an adequate FADump capture kernel size for different scenarios. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821385448.5656.6124791213910877759.stgit@hbathini.in.ibm.com
| * powerpc/fadump: remove RMA_START and RMA_END macrosHari Bathini2019-09-141-15/+17
| | | | | | | | | | | | | | | | | | | | | | RMA_START is defined as '0' and there is even a BUILD_BUG_ON() to make sure it is never anything else. Remove this macro and use '0' instead as code change is needed anyway when it has to be something else. Also, remove unused RMA_END macro. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821384096.5656.15026984053970204652.stgit@hbathini.in.ibm.com
| * powerpc/fadump: consider f/w load areaHari Bathini2019-09-141-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OPAL loads kernel & initrd at 512MB offset (256MB size), also exported as ibm,opal/dump/fw-load-area. So, if boot memory size of FADump is less than 768MB, kernel memory to be exported as '/proc/vmcore' would be overwritten by f/w while loading kernel & initrd. To avoid such a scenario, enforce a minimum boot memory size of 768MB on OPAL platform and skip using FADump if a newer F/W version loads kernel & initrd above 768MB. Also, irrespective of RMA size, set the minimum boot memory size expected on pseries platform at 320MB. This is to avoid inflating the minimum memory requirements on systems with 512M/1024M RMA size. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821381414.5656.1592867278535469652.stgit@hbathini.in.ibm.com
| * powerpc/fadump: add support to preserve crash data on FADUMP disabled kernelHari Bathini2019-09-143-8/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new kernel config option, CONFIG_PRESERVE_FA_DUMP that ensures that crash data, from previously crash'ed kernel, is preserved. This helps in cases where FADump is not enabled but the subsequent memory preserving kernel boot is likely to process this crash data. One typical usecase for this config option is petitboot kernel. As OPAL allows registering address with it in the first kernel and retrieving it after MPIPL, use it to store the top of boot memory. A kernel that intends to preserve crash data retrieves it and avoids using memory beyond this address. Move arch_reserved_kernel_pages() function as it is needed for both FA_DUMP and PRESERVE_FA_DUMP configurations. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821375751.5656.11459483669542541602.stgit@hbathini.in.ibm.com
| * powerpc/fadump: improve how crashed kernel's memory is reservedHari Bathini2019-09-141-24/+29
| | | | | | | | | | | | | | | | | | | | The size parameter to fadump_reserve_crash_area() function is not needed as all the memory above boot memory size must be preserved anyway. Update the function by dropping this redundant parameter. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821374440.5656.2945512543806951766.stgit@hbathini.in.ibm.com
| * powerpc/fadump: consider reserved ranges while releasing memoryHari Bathini2019-09-141-21/+146
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0962e8004e97 ("powerpc/prom: Scan reserved-ranges node for memory reservations") enabled support to parse 'reserved-ranges' DT node to reserve kernel memory falling in these ranges for firmware purposes. Along with the preserved area memory, ensure memory in reserved ranges is not overlapped with memory released by capture kernel aftering saving vmcore. Also, fix the off-by-one error in fadump_release_reserved_area function while releasing memory. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821371358.5656.6061214942558818661.stgit@hbathini.in.ibm.com
| * powerpc/fadump: make crash memory ranges array allocation genericHari Bathini2019-09-141-53/+60
| | | | | | | | | | | | | | | | | | | | | | Make allocate_crash_memory_ranges() and free_crash_memory_ranges() functions generic to reuse them for memory management of all types of dynamic memory range arrays. This change helps in memory management of reserved ranges array to be added later. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821369863.5656.4375667005352155892.stgit@hbathini.in.ibm.com
| * powerpc/fadump: make use of memblock's bottom up allocation modeHari Bathini2019-09-141-14/+12
| | | | | | | | | | | | | | | | | | | | | | | | Earlier, memblock_find_in_range() was not used to find the memory to be reserved for FADump as bottom up allocation mode was not supported. But since commit 79442ed189acb8b ("mm/memblock.c: introduce bottom-up allocation mode") bottom up allocation mode is supported for memblock. So, use it to find the memory to be reserved for FADump. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821364211.5656.14336025460336135194.stgit@hbathini.in.ibm.com
| * powerpc/fadump: handle invalidation of crashdump and re-registraionHari Bathini2019-09-141-1/+7
| | | | | | | | | | | | | | | | | | | | Make OPAL call to indicate that the dump is processed and the metadata area in OPAL can be cleared/released. Also, setup/initialize FADump for re-registration. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821356046.5656.12270927048195494911.stgit@hbathini.in.ibm.com
OpenPOWER on IntegriCloud