path: root/drivers/cpuidle
Commit message [Author, Age, Files, Lines]
* cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available [Wanpeng Li, 2019-09-11, files: 1, lines: -1/+2]
  The downside of guest side polling is that polling is performed even when other tasks are runnable on the host. However, even if polling in KVM can be made aware of whether other tasks are runnable on the same pCPU, it can still incur extra overhead in over-subscribed scenarios. So enable guest polling only when dedicated pCPUs are available.

  Acked-by: Paolo Bonzini <pbonzini@redhat.com>
  Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle-haltpoll: do not set an owner to allow modunload [Joao Martins, 2019-09-11, files: 1, lines: -1/+0]
  cpuidle-haltpoll can be built as a module to allow optional late load. Because @owner is set to THIS_MODULE, cpuidle attempts to grab a module reference every time a cpuidle_device is registered, so essentially every online CPU holds a reference. This prevents the module from being unloaded later, which leaves the module_exit callback entirely unused. Thus remove the @owner and allow the module to be unloaded.

  Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
  Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle-haltpoll: return -ENODEV on modinit failure [Joao Martins, 2019-09-11, files: 1, lines: -1/+1]
  When a user loads cpuidle-haltpoll on a non-KVM guest, the module loads successfully even though idle driver registration did not take place. We should instead return -ENODEV to signal that the driver cannot be loaded, like the other error paths in haltpoll_init(). One such error path returns -EBUSY when an idle driver is already registered (e.g. intel_idle loads at boot and then we attempt to insert module cpuidle-haltpoll).

  Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
  Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle-haltpoll: set haltpoll as preferred governor [Joao Martins, 2019-09-11, files: 2, lines: -1/+2]
  Right now, the governors available to a guest have the following ratings:

    * ladder -> 10
    * teo -> 19
    * menu -> 20
    * haltpoll -> 21
    * ladder + nohz=off -> 25

  The haltpoll governor got introduced and, having the highest rating, it is now the default governor regardless of the idle driver in the guest (ladder + nohz=off being the exception). An undesirable case is an x86 KVM guest with MWAIT, where intel_idle is registered first and haltpoll is consequently used as the governor; it is then limited to the poll state and state 1, and the other states are never used.

  To keep the previous defaults, decrease the rating of the haltpoll governor to 9 (below the current lowest rating) and rely on the @governor switch in cpuidle_register_driver() to tie the haltpoll idle driver and governor together.

  Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
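The effect of the rating change can be sketched with a small user-space model. This is only an illustration in Python, not kernel code: the ratings are the ones quoted in the commit message, the helper name `default_governor` is hypothetical, and the "ladder + nohz=off" special case is left out for brevity.

```python
# Illustrative model only; the kernel implements governor selection in C.
# Ratings are copied from the commit message; "ladder + nohz=off" is omitted.
RATINGS_BEFORE = {"ladder": 10, "teo": 19, "menu": 20, "haltpoll": 21}
RATINGS_AFTER = dict(RATINGS_BEFORE, haltpoll=9)


def default_governor(ratings):
    """Return the name of the governor with the highest rating."""
    return max(ratings, key=ratings.get)


print(default_governor(RATINGS_BEFORE))  # haltpoll
print(default_governor(RATINGS_AFTER))   # menu
```

With the rating lowered to 9, haltpoll no longer wins by rating alone, so it is used only when something selects it explicitly.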
* cpuidle: allow governor switch on cpuidle_register_driver() [Joao Martins, 2019-09-11, files: 3, lines: -3/+31]
| | | | | | | | | | | | | | | The recently introduced haltpoll driver is largely only useful with haltpoll governor. To allow drivers to associate with a particular idle behaviour, add a @governor property to 'struct cpuidle_driver' and thus allow a cpuidle driver to switch to a *preferred* governor on idle driver registration. We save the previous governor, and when an idle driver is unregistered we switch back to that. The @governor can be overridden by cpuidle.governor= boot param or alternatively be ignored if the governor doesn't exist. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
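The save, switch and restore behaviour described in this commit message can be modelled roughly as follows. This is a hedged Python sketch, not the kernel implementation: the class `CpuidleModel` and its attribute names are hypothetical, and only the rules stated in the message are modelled (a boot parameter overrides the preference, and a non-existent governor is ignored).

```python
class CpuidleModel:
    """Hypothetical user-space model of the preferred-governor switch."""

    def __init__(self, governors, current, boot_param=None):
        self.governors = set(governors)  # names of registered governors
        self.current = current           # governor currently in use
        self.boot_param = boot_param     # stands in for cpuidle.governor=
        self.previous = None             # saved when a driver switches governor

    def register_driver(self, preferred):
        # The boot parameter overrides the driver's preference, and a
        # preference naming an unknown governor is ignored.
        if self.boot_param is not None or preferred not in self.governors:
            return
        self.previous = self.current
        self.current = preferred

    def unregister_driver(self):
        # Switch back to the governor that was active before registration.
        if self.previous is not None:
            self.current = self.previous
            self.previous = None


model = CpuidleModel({"menu", "haltpoll"}, current="menu")
model.register_driver("haltpoll")
print(model.current)  # haltpoll
model.unregister_driver()
print(model.current)  # menu
```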
* cpuidle-haltpoll: vcpu hotplug support [Joao Martins, 2019-09-03, files: 1, lines: -5/+63]
  When cpus != maxcpus, cpuidle-haltpoll fails to register all vcpus past the online ones and thus fails to register the idle driver. This is because cpuidle_add_sysfs() returns -ENODEV when get_cpu_device() returns no device for a non-existing CPU.

  Instead, switch to cpuidle_register_driver() and manually register each of the present cpus, as well as future ones that get onlined or offlined, through cpuhp_setup_state() callbacks. This mimics similar logic in intel_idle.

  Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
  Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: teo: Get rid of redundant check in teo_update() [Rafael J. Wysocki, 2019-08-10, files: 1, lines: -12/+4]
| | | | | | | | | | Notice that setting measured_us to UINT_MAX in teo_update() earlier doesn't change the behavior of the following code, so do that and eliminate a redundant check used for setting measured_us to UINT_MAX. This change is not expected to alter functionality. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: teo: Allow tick to be stopped if PM QoS is used [Rafael J. Wysocki, 2019-08-05, files: 1, lines: -16/+16]
  The TEO governor prevents the scheduler tick from being stopped (unless stopped already) if there is a PM QoS latency constraint for the given CPU and the target residency of the deepest idle state matching that constraint is below the tick boundary.

  However, that is problematic if CPUs with PM QoS latency constraints are idle for long times, because it effectively causes the tick to run on them all the time which is wasteful. [It is also confusing and questionable if they are full dynticks CPUs.]

  To address that issue, modify the TEO governor to carry out the entire search for the most suitable idle state (from the target residency perspective) even if a latency constraint is present, to allow it to determine the expected idle duration in all cases. Also, when using the last several measured idle duration values to refine the idle state selection, make it compare those values with the current expected idle duration value (instead of comparing them with the target residency of the idle state selected so far) which should prevent the tick from being retained when it makes sense to stop it sometimes (especially in the presence of PM QoS latency constraints).

  Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems")
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: menu: Allow tick to be stopped if PM QoS is used [Rafael J. Wysocki, 2019-08-05, files: 1, lines: -11/+5]
| | | | | | | | | | | | | | | | | | | | | | | | After commit 554c8aa8ecad ("sched: idle: Select idle state before stopping the tick") the menu governor prevents the scheduler tick from being stopped (unless stopped already) if there is a PM QoS latency constraint for the given CPU and the target residency of the deepest idle state matching that constraint is below the tick boundary. However, that is problematic if CPUs with PM QoS latency constraints are idle for long times, because it effectively causes the tick to run on them all the time which is wasteful. [It is also confusing and questionable if they are full dynticks CPUs.] To address that issue, make the menu governor allow the tick to be stopped only if the idle duration predicted by it is beyond the tick boundary, except when the shallowest idle state is selected upfront and it is not a "polling" one. Fixes: 554c8aa8ecad ("sched: idle: Select idle state before stopping the tick") Link: https://lore.kernel.org/lkml/79b247b3-e056-610e-9a07-e685dfdaa6c9@gmail.com/ Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com> Tested-by: Thomas Lindroth <thomas.lindroth@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
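The rule stated in the last paragraph of this message can be written down as a small decision function. The Python sketch below is a simplified model under stated assumptions, not the menu governor's code: the function and parameter names are hypothetical, and treating a prediction exactly equal to the tick period as "beyond the boundary" is an assumption.

```python
def allow_tick_stop(predicted_us, tick_us,
                    shallowest_upfront=False, shallowest_is_polling=False):
    """Return True if this model would let the scheduler tick be stopped.

    predicted_us: idle duration predicted by the governor, in microseconds
    tick_us: tick period in microseconds (hypothetical input)
    """
    # Exception from the commit message: the shallowest state was selected
    # upfront and it is not a "polling" state.
    if shallowest_upfront and not shallowest_is_polling:
        return True
    # General rule: stop the tick only for predictions beyond the boundary.
    # Counting equality as "beyond" is an assumption of this sketch.
    return predicted_us >= tick_us


print(allow_tick_stop(10_000, 4_000))                         # True
print(allow_tick_stop(500, 4_000))                            # False
print(allow_tick_stop(500, 4_000, shallowest_upfront=True))   # True
```

Note that the PM QoS latency constraint no longer appears as an input: in this model, only the predicted idle duration decides whether the tick may be stopped.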
* cpuidle-haltpoll: disable host side polling when kvm virtualized [Marcelo Tosatti, 2019-07-30, files: 1, lines: -1/+8]
| | | | | | | | | | | When performing guest side polling, it is not necessary to also perform host side polling. So disable host side polling, via the new MSR interface, when loading cpuidle-haltpoll driver. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: add haltpoll governor [Marcelo Tosatti, 2019-07-30, files: 3, lines: -0/+162]
  The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle driver, allows guest vcpus to poll for a specified amount of time before halting. This provides the following benefits compared with host side polling:

    1) The POLL flag is set while polling is performed, which allows a remote vCPU to avoid sending an IPI (and the associated cost of handling the IPI) when performing a wakeup.
    2) The VM-exit cost can be avoided.

  The downside of guest side polling is that polling is performed even with other runnable tasks in the host.

  Results comparing halt_poll_ns and server/client application where a small packet is ping-ponged:

    host                                        --> 31.33
    halt_poll_ns=300000 / no guest busy spin    --> 33.40 (93.8%)
    halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73 (95.7%)

  For the SAP HANA benchmarks (where idle_spin is a parameter of the previous version of the patch, results should be the same):

    hpns == halt_poll_ns

                               idle_spin=0/   idle_spin=800/   idle_spin=0/
                               hpns=200000    hpns=0           hpns=800000
    DeleteC06T03 (100 thread)  1.76           1.71 (-3%)       1.78 (+1%)
    InsertC16T02 (100 thread)  2.14           2.07 (-3%)       2.18 (+1.8%)
    DeleteC00T01 (1 thread)    1.34           1.28 (-4.5%)     1.29 (-3.7%)
    UpdateC00T03 (1 thread)    4.72           4.18 (-12%)      4.53 (-5%)

  Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
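The percentages quoted for the ping-pong test appear to be the host figure divided by each guest figure. The message does not state the unit or the formula, so this reading is an assumption; the short Python check below only shows that the quoted numbers are consistent with it. The helper name `relative_to_host` is hypothetical.

```python
# Figures copied from the ping-pong results in the commit message.
HOST = 31.33
GUEST_RESULTS = {
    "halt_poll_ns=300000 / no guest busy spin": 33.40,
    "halt_poll_ns=0 / guest_halt_poll_ns=300000": 32.73,
}


def relative_to_host(value, host=HOST):
    """Host figure as a percentage of the guest figure, to one decimal."""
    return round(100 * host / value, 1)


for name, value in GUEST_RESULTS.items():
    print(f"{name}: {relative_to_host(value)}%")
```

Under this reading, guest side polling (95.7%) comes slightly closer to the host figure than host side polling (93.8%).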
* governors: unify last_state_idx [Marcelo Tosatti, 2019-07-30, files: 3, lines: -20/+18]
| | | | | | | | Since this field is shared by all governors, move it to cpuidle device structure. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: add poll_limit_ns to cpuidle_device structure [Marcelo Tosatti, 2019-07-30, files: 3, lines: -9/+39]
  Add a poll_limit_ns variable to the cpuidle_device structure. Calculate and configure it in the new cpuidle_poll_time() function if it is zero. Individual governors are allowed to override this value.

  Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
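The "compute only when zero" behaviour amounts to lazy initialisation with an override hook. The Python sketch below models just that pattern: the `Device` class and the `compute_limit_ns` callable are hypothetical stand-ins, and how the kernel actually derives the limit is not described in this message, so it is not modelled.

```python
class Device:
    """Hypothetical stand-in for the per-CPU cpuidle_device structure."""

    def __init__(self, poll_limit_ns=0):
        self.poll_limit_ns = poll_limit_ns  # 0 means "not configured yet"


def cpuidle_poll_time(dev, compute_limit_ns):
    """Return the poll limit, computing and caching it only when it is zero.

    compute_limit_ns is a placeholder callable for whatever calculation
    the kernel performs; its details are an assumption of this sketch.
    """
    if dev.poll_limit_ns == 0:
        dev.poll_limit_ns = compute_limit_ns()
    return dev.poll_limit_ns


dev = Device()
print(cpuidle_poll_time(dev, lambda: 200_000))  # 200000 (computed and cached)
dev.poll_limit_ns = 50_000                      # a governor overrides the value
print(cpuidle_poll_time(dev, lambda: 200_000))  # 50000 (override is kept)
```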
* add cpuidle-haltpoll driver [Marcelo Tosatti, 2019-07-30, files: 3, lines: -0/+78]
| | | | | | | | | Add a cpuidle driver that calls the architecture default_idle routine. To be used in conjunction with the haltpoll governor. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* Merge branch 'pm-cpufreq' [Rafael J. Wysocki, 2019-07-18, files: 1, lines: -1/+1]
|\
| |   * pm-cpufreq:
| |     cpufreq: Make cpufreq_generic_init() return void
| |     cpufreq: imx-cpufreq-dt: Add i.MX8MN support
| |     cpufreq: Add QoS requests for userspace constraints
| |     cpufreq: intel_pstate: Reuse refresh_frequency_limits()
| |     cpufreq: Register notifiers with the PM QoS framework
| |     PM / QoS: Add support for MIN/MAX frequency constraints
| |     PM / QOS: Pass request type to dev_pm_qos_read_value()
| |     PM / QOS: Rename __dev_pm_qos_read_value() and dev_pm_qos_raw_read_value()
| |     PM / QOS: Pass request type to dev_pm_qos_{add|remove}_notifier()
| * PM / QOS: Rename __dev_pm_qos_read_value() and dev_pm_qos_raw_read_value() [Viresh Kumar, 2019-07-04, files: 1, lines: -1/+1]
| |   dev_pm_qos_read_value() will soon need to support more constraint types (min/max frequency) and will take another argument, i.e. the type of the constraint. While that is fine for the existing users of dev_pm_qos_read_value(), it is not optimal for the callers of __dev_pm_qos_read_value() and dev_pm_qos_raw_read_value(), as all the callers of these two routines are only looking for the resume latency constraint.
| |
| |   Let's make these two routines care only about the resume latency constraint and rename them to __dev_pm_qos_resume_latency() and dev_pm_qos_raw_resume_latency().
| |
| |   Suggested-by: Rafael J. Wysocki <rjw@rjwysocki.net>
| |   Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
| |   Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 [Thomas Gleixner, 2019-06-19, files: 5, lines: -20/+5]
|/ | | | | | | | | | | | | | | | | | | | | | | | | | Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 [Thomas Gleixner, 2019-06-05, files: 1, lines: -2/+1]
| | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this file is released under the gplv2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 68 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Armijn Hemel <armijn@tjaldur.nl> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 215 [Thomas Gleixner, 2019-05-30, files: 1, lines: -3/+1]
| | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this code is licenced under the gpl version 2 as described in the copying file that acompanies the linux kernel extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 1 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Steve Winslow <swinslow@gmail.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171439.466585205@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201 [Thomas Gleixner, 2019-05-30, files: 2, lines: -24/+2]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms and conditions of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 228 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171438.107155473@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157 [Thomas Gleixner, 2019-05-30, files: 1, lines: -10/+1]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on 3 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version [author] [kishon] [vijay] [abraham] [i] [kishon]@[ti] [com] this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version [author] [graeme] [gregory] [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i] [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema] [hk] [hemahk]@[ti] [com] this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 1105 file(s). 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 [Thomas Gleixner, 2019-05-30, files: 2, lines: -10/+2]
| | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* treewide: Add SPDX license identifier - Makefile/Kconfig [Thomas Gleixner, 2019-05-21, files: 5, lines: -0/+5]
| | | | | | | | | | | | | | Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cpuidle: Export the next timer expiration for CPUs [Ulf Hansson, 2019-04-10, files: 1, lines: -2/+17]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To be able to predict the sleep duration for a CPU entering idle, it is essential to know the expiration time of the next timer. Both the teo and the menu cpuidle governors already use this information for CPU idle state selection. Moving forward, a similar prediction needs to be made for a group of idle CPUs rather than for a single one and the following changes implement a new genpd governor for that purpose. In order to support that feature, add a new function called tick_nohz_get_next_hrtimer() that will return the next hrtimer expiration time of a given CPU to be invoked after deciding whether or not to stop the scheduler tick on that CPU. Make the cpuidle core call tick_nohz_get_next_hrtimer() right before invoking the ->enter() callback provided by the cpuidle driver for the given state and store its return value in the per-CPU struct cpuidle_device, so as to make it available to code outside of cpuidle. Note that at the point when cpuidle calls tick_nohz_get_next_hrtimer(), the governor's ->select() callback has already returned and indicated whether or not the tick should be stopped, so in fact the value returned by tick_nohz_get_next_hrtimer() always is the next hrtimer expiration time for the given CPU, possibly including the tick (if it hasn't been stopped). Co-developed-by: Lina Iyer <lina.iyer@linaro.org> Co-developed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: exynos: Unify target residency for AFTR and coupled AFTR states [Marek Szyprowski, 2019-04-01, files: 1, lines: -1/+1]
  Since commit 45f1ff59e27c ("cpuidle: Return nohz hint from cpuidle_select()") the Exynos CPUidle driver stopped entering C1 (AFTR) mode on the Exynos4412-based Trats2 board.

  Further analysis revealed that the CPUidle framework changed the way it handles predicted timer ticks and reported target residency for the given idle states. As a result, the C1 (AFTR) state was not chosen anymore on a completely idle device. The main issue was a too high target residency value. The similar C1 (AFTR) state for the 'coupled' CPUidle version used a 10 times lower value for the target residency, despite the fact that it is the same state from the hardware perspective.

  The 100000us value for the standard C1 (AFTR) mode has been there from the beginning of the support for this idle state, added by commit 67173ca492ab ("ARM: EXYNOS: Add support AFTR mode on EXYNOS4210"). That commit doesn't give any reason for it; instead it looks like it was blindly copied from the WFI/IDLE state of the same driver at that time. At that time, the value was probably not really used by the framework for any critical decision, so it didn't matter that much. Now it turned out to be an issue, so unify the target residency with the 'coupled' version, as it seems to better match the real use case values and restores the operation of the Exynos CPUidle driver on the idle device.

  Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
  Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
  Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
  Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: governor: Add new governors to cpuidle_governors again [Rafael J. Wysocki, 2019-03-12, files: 1, lines: -0/+1]
| | | | | | | | | | | | | | | After commit 61cb5758d3c4 ("cpuidle: Add cpuidle.governor= command line parameter") new cpuidle governors are not added to the list of available governors, so governor selection via sysfs doesn't work as expected (even though it is rarely used anyway). Fix that by making cpuidle_register_governor() add new governors to cpuidle_governors again. Fixes: 61cb5758d3c4 ("cpuidle: Add cpuidle.governor= command line parameter") Reported-by: Kees Cook <keescook@chromium.org> Cc: 5.0+ <stable@vger.kernel.org> # 5.0+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: menu: Avoid overflows when computing variance [Rafael J. Wysocki, 2019-03-07, files: 1, lines: -1/+1]
| | | | | | | | | | | | | | | The variance computation in get_typical_interval() may overflow if the square of the value of diff exceeds the maximum for the int64_t data type value which basically is the case when it is of the order of UINT_MAX. However, data points so far in the future don't matter for idle state selection anyway, so change the initial threshold value in get_typical_interval() to INT_MAX which will cause more "outlying" data points to be discarded without affecting the selection result. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
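The overflow argument is plain arithmetic and can be checked directly. The Python snippet below assumes a 32-bit int, which is an assumption about the platform rather than something the message states. It shows that a squared difference on the order of UINT_MAX no longer fits in int64_t, while one bounded by INT_MAX does.

```python
# Limits for a platform with 32-bit int and 64-bit int64_t (assumed).
INT64_MAX = 2**63 - 1
UINT_MAX = 2**32 - 1
INT_MAX = 2**31 - 1

# Python integers are arbitrary precision, so the squares below are exact.
print(UINT_MAX**2 > INT64_MAX)  # True: would overflow int64_t
print(INT_MAX**2 > INT64_MAX)   # False: fits in int64_t
```

This is why lowering the initial threshold to INT_MAX keeps each squared term representable, at the cost of discarding a few more outlying data points.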
* cpuidle: dt: bail out if the idle-state DT node is not compatible [Joseph Lo, 2019-02-01, files: 1, lines: -6/+9]
  Currently, the idle-state DT nodes are parsed first, whether or not they are compatible. This can trigger a warning message when the CPUs do not support identical idle states. E.g. Tegra186 can run with 2 Cortex-A57 and 2 Denver cores, with different idle states on each type of core.

  Fix it by checking the match node earlier, so that only the idle states that the CPU supports are parsed.

  Signed-off-by: Joseph Lo <josephl@nvidia.com>
  Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* Merge back earlier cpuidle material for v5.1. [Rafael J. Wysocki, 2019-02-01, files: 3, lines: -1/+455]
|\
| * cpuidle: New timer events oriented governor for tickless systems [Rafael J. Wysocki, 2019-01-16, files: 3, lines: -1/+455]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The venerable menu governor does some things that are quite questionable in my view. First, it includes timer wakeups in the pattern detection data and mixes them up with wakeups from other sources which in some cases causes it to expect what essentially would be a timer wakeup in a time frame in which no timer wakeups are possible (because it knows the time until the next timer event and that is later than the expected wakeup time). Second, it uses the extra exit latency limit based on the predicted idle duration and depending on the number of tasks waiting on I/O, even though those tasks may run on a different CPU when they are woken up. Moreover, the time ranges used by it for the sleep length correction factors depend on whether or not there are tasks waiting on I/O, which again doesn't imply anything in particular, and they are not correlated to the list of available idle states in any way whatever. Also, the pattern detection code in menu may end up considering values that are too large to matter at all, in which cases running it is a waste of time. A major rework of the menu governor would be required to address these issues and the performance of at least some workloads (tuned specifically to the current behavior of the menu governor) is likely to suffer from that. It is thus better to introduce an entirely new governor without them and let everybody use the governor that works better with their actual workloads. The new governor introduced here, the timer events oriented (TEO) governor, uses the same basic strategy as menu: it always tries to find the deepest idle state that can be used in the given conditions. However, it applies a different approach to that problem. 
First, it doesn't use "correction factors" for the time till the closest timer, but instead it tries to correlate the measured idle duration values with the available idle states and use that information to pick up the idle state that is most likely to "match" the upcoming CPU idle interval. Second, it doesn't take the number of "I/O waiters" into account at all and the pattern detection code in it avoids taking timer wakeups into account. It also only uses idle duration values less than the current time till the closest timer (with the tick excluded) for that purpose. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
* | cpuidle: poll_state: Fix default time limit [Doug Smythies, 2019-01-30, files: 1, lines: -1/+1]
|/
  The default time is declared in units of microseconds, but is used as nanoseconds, resulting in significant accounting errors for idle state 0 time when all idle states deeper than 0 are disabled.

  Under these unusual conditions, we don't really care about the poll time limit anyhow.

  Fixes: 800fb34a99ce ("cpuidle: poll_state: Disregard disable idle states")
  Signed-off-by: Doug Smythies <dsmythies@telus.net>
  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* Merge tag 'powerpc-4.21-1' of ↵ [Linus Torvalds, 2018-12-27, files: 1, lines: -1/+7]
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - Mitigations for Spectre v2 on some Freescale (NXP) CPUs. - A large series adding support for pass-through of Nvidia V100 GPUs to guests on Power9. - Another large series to enable hardware assistance for TLB table walk on MPC8xx CPUs. - Some preparatory changes to our DMA code, to make way for further cleanups from Christoph. - Several fixes for our Transactional Memory handling discovered by fuzzing the signal return path. - Support for generating our system call table(s) from a text file like other architectures. - A fix to our page fault handler so that instead of generating a WARN_ON_ONCE, user accesses of kernel addresses instead print a ratelimited and appropriately scary warning. - A cosmetic change to make our unhandled page fault messages more similar to other arches and also more compact and informative. - Freescale updates from Scott: "Highlights include elimination of legacy clock bindings use from dts files, an 83xx watchdog handler, fixes to old dts interrupt errors, and some minor cleanup." And many clean-ups, reworks and minor fixes etc. Thanks to: Alexandre Belloni, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Benjamin Herrenschmidt, Breno Leitao, Christian Lamparter, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Darren Stevens, David Gibson, Diana Craciun, Dmitry V. Levin, Firoz Khan, Geert Uytterhoeven, Greg Kurz, Gustavo Romero, Hari Bathini, Joel Stanley, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Mathieu Malaterre, Michal Suchánek, Naveen N. 
Rao, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras, Ram Pai, Ravi Bangoria, Rob Herring, Russell Currey, Sabyasachi Gupta, Sam Bobroff, Satheesh Rajendran, Scott Wood, Segher Boessenkool, Stephen Rothwell, Tang Yuantian, Thiago Jung Bauermann, Yangtao Li, Yuantian Tang, Yue Haibing" * tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (201 commits) Revert "powerpc/fsl_pci: simplify fsl_pci_dma_set_mask" powerpc/zImage: Also check for stdout-path powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y macintosh: Use of_node_name_{eq, prefix} for node name comparisons ide: Use of_node_name_eq for node name comparisons powerpc: Use of_node_name_eq for node name comparisons powerpc/pseries/pmem: Convert to %pOFn instead of device_node.name powerpc/mm: Remove very old comment in hash-4k.h powerpc/pseries: Fix node leak in update_lmb_associativity_index() powerpc/configs/85xx: Enable CONFIG_DEBUG_KERNEL powerpc/dts/fsl: Fix dtc-flagged interrupt errors clk: qoriq: add more compatibles strings powerpc/fsl: Use new clockgen binding powerpc/83xx: handle machine check caused by watchdog timer powerpc/fsl-rio: fix spelling mistake "reserverd" -> "reserved" powerpc/fsl_pci: simplify fsl_pci_dma_set_mask arch/powerpc/fsl_rmu: Use dma_zalloc_coherent vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver vfio_pci: Allow regions to add own capabilities vfio_pci: Allow mapping extra regions ...
| * powerpc/pseries/cpuidle: Fix preempt warningBreno Leitao2018-12-041-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When booting a pseries kernel with PREEMPT enabled, it dumps the following warning: BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1 caller is pseries_processor_idle_init+0x5c/0x22c CPU: 13 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc3-00090-g12201a0128bc-dirty #828 Call Trace: [c000000429437ab0] [c0000000009c8878] dump_stack+0xec/0x164 (unreliable) [c000000429437b00] [c0000000005f2f24] check_preemption_disabled+0x154/0x160 [c000000429437b90] [c000000000cab8e8] pseries_processor_idle_init+0x5c/0x22c [c000000429437c10] [c000000000010ed4] do_one_initcall+0x64/0x300 [c000000429437ce0] [c000000000c54500] kernel_init_freeable+0x3f0/0x500 [c000000429437db0] [c0000000000112dc] kernel_init+0x2c/0x160 [c000000429437e20] [c00000000000c1d0] ret_from_kernel_thread+0x5c/0x6c This happens because the code calls get_lppaca() which calls get_paca() and it checks if preemption is disabled through check_preemption_disabled(). Preemption should be disabled because the per CPU variable may make no sense if there is a preemption (and a CPU switch) after it reads the per CPU data and when it is used. In this device driver specifically, it is not a problem, because this code just needs to have access to one lppaca struct, and it does not matter if it is the current per CPU lppaca struct or not (i.e. when there is a preemption and a CPU migration). That said, the most appropriate fix seems to be related to avoiding the debug_smp_processor_id() call at get_paca(), instead of calling preempt_disable() before get_paca(). Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* | cpuidle: Add 'above' and 'below' idle state metricsRafael J. Wysocki2018-12-122-1/+36
| | | | | | | | | | | | | | | | | | | | | | | | Add two new metrics for CPU idle states, "above" and "below", to count the number of times the given state had been asked for (or entered from the kernel's perspective), but the observed idle duration turned out to be too short or too long for it (respectively). These metrics help to estimate the quality of the CPU idle governor in use. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
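The accounting described in this commit message can be illustrated with a small user-space sketch. It is a simplified model and not the kernel implementation: `struct idle_state`, `account_idle()` and the choice to compare only against target residencies are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a per-state record. */
struct idle_state {
	unsigned int target_residency_us; /* idle time needed for the state to pay off */
	bool disabled;
	unsigned long long above; /* entered, but the idle time was too short */
	unsigned long long below; /* entered, but the idle time was too long */
};

/*
 * Update the metrics of the state that was entered, given the idle
 * duration that was actually observed.
 */
void account_idle(struct idle_state *states, int count, int entered,
		  unsigned int measured_us)
{
	int i;

	assert(entered >= 0 && entered < count);

	if (measured_us < states[entered].target_residency_us) {
		/* Too short: count it only if a shallower enabled state exists. */
		for (i = entered - 1; i >= 0; i--) {
			if (!states[i].disabled) {
				states[entered].above++;
				break;
			}
		}
		return;
	}

	/* Too long if the nearest deeper enabled state would have fit as well. */
	for (i = entered + 1; i < count; i++) {
		if (states[i].disabled)
			continue;
		if (states[i].target_residency_us <= measured_us)
			states[entered].below++;
		break;
	}
}
```

In this model, a governor that often picks states whose "above" or "below" counters grow quickly is mispredicting idle durations, which is the quality signal the commit message refers to.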
* | cpuidle: big.LITTLE: fix refcount leakYangtao Li2018-12-111-1/+6
| | | | | | | | | | | | | | | | | | | | of_find_node_by_path() acquires a reference to the node returned by it and that reference needs to be dropped by its caller. bl_idle_init() doesn't do that, so fix it. Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
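The pattern behind this fix can be sketched in user-space C. Everything below is a hypothetical model: `struct node`, `node_find_by_path()` and `node_put()` merely stand in for the kernel's device-tree lookup and its reference-dropping counterpart, and `init_with_lookup()` is an invented caller, not bl_idle_init().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference-counted node; the initial count is the owner's reference. */
struct node {
	const char *path;
	int refcount;
};

static struct node root_node = { "/", 1 };

/* Stand-in lookup: on success the caller receives its own reference. */
struct node *node_find_by_path(const char *path)
{
	if (strcmp(path, root_node.path) != 0)
		return NULL;
	root_node.refcount++;
	return &root_node;
}

/* Stand-in release: drops a reference taken by the lookup. */
void node_put(struct node *n)
{
	if (n)
		n->refcount--;
}

/* The caller must drop the reference on every path after a successful lookup. */
int init_with_lookup(const char *path)
{
	struct node *n = node_find_by_path(path);

	if (!n)
		return -1;

	/* ... inspect the node here ... */

	node_put(n); /* omitting this call is the kind of leak the commit fixes */
	return 0;
}
```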
* | cpuidle: Add cpuidle.governor= command line parameterRafael J. Wysocki2018-12-113-2/+9
| | | | | | | | | | | | | | | | | | | | Add cpuidle.governor= command line parameter to allow the default cpuidle governor to be replaced. That is useful, for example, if someone running a tickful kernel wants to use the menu governor on it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | cpuidle: poll_state: Disregard disabled idle statesRafael J. Wysocki2018-12-111-1/+10
| | | | | | | | | | | | | | | | | | | | When computing the limit of time to spend in the loop in poll_idle(), use the target residency of the first enabled idle state deeper than state 0 instead of always using the target residency of state 1. This helps when state 1 is disabled for diagnostics, for instance. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
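The selection rule described in this commit message can be sketched as follows. This is an illustrative user-space model, not the code in poll_idle(): `struct idle_state`, `poll_time_limit_ns()` and the caller-supplied `fallback_ns` (used when no deeper state is enabled) are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Hypothetical, simplified idle state description. */
struct idle_state {
	unsigned int target_residency_us;
	bool disabled;
};

/*
 * Return the polling time limit: the target residency of the first
 * enabled state deeper than state 0, or fallback_ns if there is none.
 */
uint64_t poll_time_limit_ns(const struct idle_state *states, int count,
			    uint64_t fallback_ns)
{
	int i;

	assert(count >= 1);

	for (i = 1; i < count; i++) {
		if (states[i].disabled)
			continue;
		return (uint64_t)states[i].target_residency_us * NSEC_PER_USEC;
	}

	return fallback_ns;
}
```

With state 1 disabled, the limit in this sketch comes from state 2 rather than from the disabled state, which is the behavior the commit message asks for.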
* | ARM: cpuidle: Convert to use cpuidle_register|unregister()Ulf Hansson2018-11-081-27/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only remaining reason why the ARM cpuidle driver calls cpuidle_register_driver() is to avoid printing an error message in case another driver has already been registered for the CPU. This seems a bit silly, but more importantly, if that is a common scenario, perhaps we should change cpuidle_register() accordingly instead. In either case, let's consolidate the code by converting it to use cpuidle_register|unregister(), which also avoids the unnecessary allocation of the struct cpuidle_device. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | ARM: cpuidle: Don't register the driver when back-end init returns -ENXIOUlf Hansson2018-11-081-12/+10
|/ | | | | | | | | | | | | | | | There's no point in registering the cpuidle driver for the current CPU when the initialization of the arch-specific back-end data fails by returning -ENXIO. Instead, let's re-order the sequence to its original flow, by first trying to initialize the back-end part and then acting accordingly on the returned error code. Additionally, let's print the error message no matter what error code was returned. Fixes: a0d46a3dfdc3 (ARM: cpuidle: Register per cpuidle device) Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: 4.19+ <stable@vger.kernel.org> # v4.19+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* Merge tag 'pm-4.20-rc1-2' of ↵Linus Torvalds2018-10-301-19/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These remove a questionable heuristic from the menu cpuidle governor, fix a recent build regression in the intel_pstate driver, clean up ARM big-Little support in cpufreq and fix up hung task watchdog's interaction with system-wide power management transitions. Specifics: - Fix build regression in the intel_pstate driver that doesn't build without CONFIG_ACPI after recent changes (Dominik Brodowski). - One of the heuristics in the menu cpuidle governor is based on a function returning 0 most of the time, so drop it and clean up the scheduler code related to it (Daniel Lezcano). - Prevent the arm_big_little cpufreq driver from being used on ARM64 which is not suitable for it and drop the arm_big_little_dt driver that is not used any more (Sudeep Holla). - Prevent the hung task watchdog from triggering during resume from system-wide sleep states by disabling it before freezing tasks and enabling it again after they have been thawed (Vitaly Kuznetsov)" * tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: kernel: hung_task.c: disable on suspend cpufreq: remove unused arm_big_little_dt driver cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64 cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI cpuidle: menu: Remove get_loadavg() from the performance multiplier sched: Factor out nr_iowait and nr_iowait_cpu
| * cpuidle: menu: Remove get_loadavg() from the performance multiplierDaniel Lezcano2018-10-251-19/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function get_loadavg() almost always returns zero. To be more precise, statistically speaking, for a total of 1023379 passes through the function, the load is equal to zero 1020728 times and greater than 100 610 times; the remainder is between 0 and 5. In 2011, get_loadavg() was removed from the Android tree because of the above [1]. At that time, the load was: unsigned long this_cpu_load(void) { struct rq *this = this_rq(); return this->cpu_load[0]; } In 2014, the code was changed by commit 372ba8cb46b2 (cpuidle: menu: Lookup CPU runqueues less) and the load is: void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) { struct rq *rq = this_rq(); *nr_waiters = atomic_read(&rq->nr_iowait); *load = rq->load.weight; } with the same result. Both measurements show that using the load in this code path does not matter anymore, so remove it. [1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17 Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOADJohannes Weiner2018-10-261-4/+0
|/ | | | | | | | | | | | | | | | | | | | | | | There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cpuidle: menu: Avoid computations when result will be discardedRafael J. Wysocki2018-10-181-3/+16
| | | | | | | | | | | | | | | | If the minimum interval taken into account in the average computation loop in get_typical_interval() is not below the expected idle duration determined so far, the resultant average cannot be below that value either, and the entire return result of the function is going to be discarded anyway going forward. In that case, it is a waste of time to carry out the remaining computations in get_typical_interval(), so avoid that by returning early if the minimum interval is not below the expected idle duration. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: menu: Drop redundant comparisonRafael J. Wysocki2018-10-181-6/+1
| | | | | | | | | | | | | | | | Since the correction factor cannot be greater than RESOLUTION * DECAY, the result of the predicted_us computation in menu_select() cannot be greater than data->next_timer_us, so it is not necessary to compare the "typical interval" value coming from get_typical_interval() with data->next_timer_us separately. It is sufficient to compare predicted_us with the return value of get_typical_interval() directly, so do that and drop the now redundant expected_interval variable. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: menu: Simplify checks related to the polling stateRafael J. Wysocki2018-10-121-4/+4
| | | | | | | | | | | | | | | | | | | | | | | After some recent menu governor changes, the promotion of the "polling" state to a physical one is mostly controlled by the latency limit (resulting from the "interactivity" factor) and not by the time to the closest timer event, so it should be sufficient to check the exit latency of that state for this purpose (of course, its target residency still needs to be within the next timer event range for energy-efficiency). Also, the physical state the "polling" one is promoted to need not be the next one in principle (in case the next state is disabled, for example). For these reasons, simplify the checks made to decide whether or not to promote the "polling" state to a physical one and update the target idle duration when it is promoted in case the residency of the new state turns out to be above the tick boundary (in which case there is no reason to stop the tick). Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: poll_state: Revise loop termination conditionRafael J. Wysocki2018-10-041-2/+2
| | | | | | | | | | | | | | | | | If need_resched() returns "false", breaking out of the loop in poll_idle() will cause a new idle state to be selected, so in fact it usually doesn't make sense to spin in it longer than the target residency of the second state. [Note that the "polling" state is used only if there is at least one "real" state defined in addition to it, so the second state is always there.] On the other hand, breaking out of it early (say in case the next state is disabled) shouldn't hurt as it is polling anyway. For this reason, make the loop in poll_idle() break if the CPU has been spinning longer than the target residency of the second state (the "polling" state can only be state[0]). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpuidle: menu: Move the latency_req == 0 special case checkRafael J. Wysocki2018-10-041-7/+1
| | | | | | | | | | It is better to always update data->bucket before returning from menu_select() to avoid updating the correction factor for a stale bucket, so combine the latency_req == 0 special check with the more general check below. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* cpuidle: menu: Avoid computations for very close timersRafael J. Wysocki2018-10-041-0/+12
| | | | | | | | | | | | | | If the next timer event (not including the tick) is closer than the target residency of the second state or the PM QoS latency constraint is below its exit latency, state[0] will be used regardless of any other factors, so skip the computations in menu_select() then and return 0 straight away from it. Still, do that after the bucket has been determined to avoid updating the correction factor for a stale bucket. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
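The early-exit condition described in this commit message can be sketched as a small predicate. It is a simplified user-space illustration rather than the code in menu_select(): `struct idle_state`, `must_use_state0()` and its parameter names are assumptions made for the example, and the bucket update that the message says must happen first is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified idle state description. */
struct idle_state {
	unsigned int target_residency_us;
	unsigned int exit_latency_us;
};

/*
 * Return true when state[0] has to be used regardless of other factors:
 * either the next timer is closer than the target residency of the second
 * state, or the latency constraint is below its exit latency.
 */
bool must_use_state0(const struct idle_state *states, int count,
		     unsigned int next_timer_us, unsigned int latency_req_us)
{
	assert(count >= 2);

	return next_timer_us < states[1].target_residency_us ||
	       latency_req_us < states[1].exit_latency_us;
}
```

A caller in this model would determine the bucket first, then return 0 straight away whenever the predicate holds, skipping the remaining prediction work.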
* cpuidle: menu: Do not update last_state_idx in menu_select()Rafael J. Wysocki2018-10-041-5/+2
| | | | | | | | | | | | | | | | | It is not necessary to update data->last_state_idx in menu_select() as it only is used in menu_update() which only runs when data->needs_update is set and that is set only when updating data->last_state_idx in menu_reflect(). Accordingly, drop the update of data->last_state_idx from menu_select() and get rid of the (now redundant) "out" label from it. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
* cpuidle: menu: Get rid of first_idx from menu_select()Rafael J. Wysocki2018-10-041-18/+14
| | | | | | | | | | | | | | Rearrange the code in menu_select() so that the loop over idle states always starts from 0 and get rid of the first_idx variable. While at it, add two empty lines to separate conditional statements from one another. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>