From 2c71d7032484f4267ca425b5a58c5a6f06a2eaf3 Mon Sep 17 00:00:00 2001 From: Mahesh Salgaonkar Date: Fri, 22 Mar 2019 20:28:00 +0530 Subject: Fix hang in pnv_platform_error_reboot path due to TOD failure. On TOD failure, with TB stuck, when linux heads down to pnv_platform_error_reboot() path due to unrecoverable hmi event, the panic cpu gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time, rest all other cpus are in smp_handle_nmi_ipi() waiting for panic cpu to proceed. But with panic cpu stuck inside OPAL, linux never recovers/reboot. p0 c1 t0 NIA : 0x000000003001dd3c <.time_wait+0x64> CFAR : 0x000000003001dce4 <.time_wait+0xc> MSR : 0x9000000002803002 LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec> STACK: SP NIA 0x0000000031c236e0 0x0000000031c23760 (big-endian) 0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec> 0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c> 0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150> 0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc> 0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc> 0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc> 0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4> 0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0> 0x0000000031c23d20 0x00000000300051e4 0xc000001fea6e7870 0xc0000000000a9060 0xc000001fea6e78c0 0xc000000000030b84 0xc000001fea6e7960 0xc0000000000310b0 0xc000001fea6e7990 0xc0000000004792d4 0xc000001fea6e7ad0 0xc00000000018a570 0xc000001fea6e7b40 0xc000000000028e5c 0xc000001fea6e7b60 0xc0000000000a7168 0xc000001fea6e7bd0 0xc0000000000ac9b8 0xc000001fea6e7c80 0xc00000000012d6c8 0xc000001fea6e7d20 0xc00000000012da28 0xc000001fea6e7db0 0xc0000000001366f4 0xc000001fea6e7e20 0xc00000000000b65c This is because, there is a while loop towards the end of ipmi_queue_msg_sync() which keeps looping until "sync_msg" does not match with "msg". It loops over time_wait_ms() until exit condition is met. In normal scenario time_wait_ms() calls run pollers so that ipmi backend gets a chance to check ipmi response and set sync_msg to NULL. while (sync_msg == msg) time_wait_ms(10); But in the event when TB is in failed state time_wait_ms()->time_wait_poll() returns immediately without calling pollers and hence we end up looping forever. This patch fixes this hang by calling opal_run_pollers() in TB failed state as well. Fixes: 1764f2452 ("opal: Fix hang in time_wait* calls on HMI for TB errors.") Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Mahesh Salgaonkar Signed-off-by: Stewart Smith --- core/timebase.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/core/timebase.c b/core/timebase.c index 0ae11d25..b1a02a4a 100644 --- a/core/timebase.c +++ b/core/timebase.c @@ -29,6 +29,16 @@ static void time_wait_poll(unsigned long duration) unsigned long period = msecs_to_tb(5); if (this_cpu()->tb_invalid) { + /* + * Run pollers to allow some backends to process response. + * + * In TOD failure case where TOD is unrecoverable, running + * pollers allows ipmi backend to deal with ipmi response + * from bmc and helps ipmi_queue_msg_sync() to get un-stuck. + * Thus it avoids linux kernel to hang during panic due to + * TOD failure. + */ + opal_run_pollers(); cpu_relax(); return; } -- cgit v1.2.1