summaryrefslogtreecommitdiffstats
path: root/kernel/sched/sched.h
Commit message (Collapse)AuthorAgeFilesLines
* jump_label: move 'asm goto' support test to KconfigMasahiro Yamada2019-01-061-3/+3
| | | | | | | | | | | | | | | | | | | | | Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label". The jump label is controlled by HAVE_JUMP_LABEL, which is defined like this: #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) # define HAVE_JUMP_LABEL #endif We can improve this by testing 'asm goto' support in Kconfig, then make JUMP_LABEL depend on CC_HAS_ASM_GOTO. Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will match to the real kernel capability. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
* sched/fair: Add over-utilization/tipping point indicatorMorten Rasmussen2018-12-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Energy-aware scheduling is only meant to be active while the system is _not_ over-utilized. That is, there are spare cycles available to shift tasks around based on their actual utilization to get a more energy-efficient task distribution without depriving any tasks. When above the tipping point task placement is done the traditional way based on load_avg, spreading the tasks across as many cpus as possible based on priority scaled load to preserve smp_nice. Below the tipping point we want to use util_avg instead. We need to define a criteria for when we make the switch. The util_avg for each cpu converges towards 100% regardless of how many additional tasks we may put on it. If we define over-utilized as: sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity) some individual cpus may be over-utilized running multiple tasks even when the above condition is false. That should be okay as long as we try to spread the tasks out to avoid per-cpu over-utilization as much as possible and if all tasks have the _same_ priority. If the latter isn't true, we have to consider priority to preserve smp_nice. For example, we could have n_cpus nice=-10 util_avg=55% tasks and n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks getting their own as we 1.5*n_cpus tasks in total and 55%+55% is less over-utilized than 55%+60% for those cpus that have to be shared. The system utilization is only 85% of the system capacity, but we are breaking smp_nice. To be sure not to break smp_nice, we have defined over-utilization conservatively as when any cpu in the system is fully utilized at its highest frequency instead: cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg to factor in priority to preserve smp_nice. With this definition, we can skip periodic load-balance as no cpu has an always-running task when the system is not over-utilized. All tasks will be periodic and we can balance them at wake-up. This conservative condition does however mean that some scenarios that could benefit from energy-aware decisions even if one cpu is fully utilized would not get those benefits. For systems where some cpus might have reduced capacity on some cpus (RT-pressure and/or big.LITTLE), we want periodic load-balance checks as soon a just a single cpu is fully utilized as it might one of those with reduced capacity and in that case we want to migrate it. [ peterz: Added a comment explaining why new tasks are not accounted during overutilization detection. ] Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-13-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/fair: Clean-up update_sg_lb_stats parametersQuentin Perret2018-12-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for the introduction of a new root domain flag which can be set during load balance (the 'overutilized' flag), clean-up the set of parameters passed to update_sg_lb_stats(). More specifically, the 'local_group' and 'local_idx' parameters can be removed since they can easily be reconstructed from within the function. While at it, transform the 'overload' parameter into a flag stored in the 'sg_status' parameter hence facilitating the definition of new flags when needed. Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-12-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/toplogy: Introduce the 'sched_energy_present' static keyQuentin Perret2018-12-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to make sure Energy Aware Scheduling (EAS) will not impact systems where no Energy Model is available, introduce a static key guarding the access to EAS code. Since EAS is enabled on a per-root-domain basis, the static key is enabled when at least one root domain meets all conditions for EAS. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-10-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/topology: Make Energy Aware Scheduling depend on schedutilQuentin Perret2018-12-111-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Energy Aware Scheduling (EAS) is designed with the assumption that frequencies of CPUs follow their utilization value. When using a CPUFreq governor other than schedutil, the chances of this assumption being true are small, if any. When schedutil is being used, EAS' predictions are at least consistent with the frequency requests. Although those requests have no guarantees to be honored by the hardware, they should at least guide DVFS in the right direction and provide some hope in regards to the EAS model being accurate. To make sure EAS is only used in a sane configuration, create a strong dependency on schedutil being used. Since having sugov compiled-in does not provide that guarantee, make CPUFreq call a scheduler function on governor changes hence letting it rebuild the scheduling domains, check the governors of the online CPUs, and enable/disable EAS accordingly. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-9-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/topology: Add lowest CPU asymmetry sched_domain level pointerQuentin Perret2018-12-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add another member to the family of per-cpu sched_domain shortcut pointers. This one, sd_asym_cpucapacity, points to the lowest level at which the SD_ASYM_CPUCAPACITY flag is set. While at it, rename the sd_asym shortcut to sd_asym_packing to avoid confusions. Generally speaking, the largest opportunity to save energy via scheduling comes from a smarter exploitation of heterogeneous platforms (i.e. big.LITTLE). Consequently, the sd_asym_cpucapacity shortcut will be used at first as the lowest domain where Energy-Aware Scheduling (EAS) should be applied. For example, it is possible to apply EAS within a socket on a multi-socket system, as long as each socket has an asymmetric topology. Energy-aware cross-sockets wake-up balancing will only happen when the system is over-utilized, or this_cpu and prev_cpu are in different sockets. Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-7-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/topology: Reference the Energy Model of CPUs when availableQuentin Perret2018-12-111-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing scheduling domain hierarchy is defined to map to the cache topology of the system. However, Energy Aware Scheduling (EAS) requires more knowledge about the platform, and specifically needs to know about the span of Performance Domains (PD), which do not always align with caches. To address this issue, use the Energy Model (EM) of the system to extend the scheduler topology code with a representation of the PDs, alongside the scheduling domains. More specifically, a linked list of PDs is attached to each root domain. When multiple root domains are in use, each list contains only the PDs covering the CPUs of its root domain. If a PD spans over CPUs of multiple different root domains, it will be duplicated in all lists. The lists are fully maintained by the scheduler from partition_sched_domains() in order to cope with hotplug and cpuset changes. As for scheduling domains, the list are protected by RCU to ensure safe concurrent updates. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-6-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/cpufreq: Prepare schedutil for Energy Aware SchedulingQuentin Perret2018-12-111-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Schedutil requests frequency by aggregating utilization signals from the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of them. Since Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests, it needs to forecast the decisions made by the governor. In order to prepare the introduction of EAS, introduce schedutil_freq_util() to centralize the aforementioned signal aggregation and make it available to both schedutil and EAS. Since frequency selection and energy estimation still need to deal with RT and DL signals slightly differently, schedutil_freq_util() is called with a different 'type' parameter in those two contexts, and returns an aggregated utilization signal accordingly. While at it, introduce the map_util_freq() function which is designed to make schedutil's 25% margin usable easily for both sugov and EAS. As EAS will be able to predict schedutil's frequency requests more accurately than any other governor by design, it'd be sensible to make sure EAS cannot be used without schedutil. This will be done later, once EAS has actually been introduced. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-3-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/topology: Relocate arch_scale_cpu_capacity() to the internal headerQuentin Perret2018-12-111-18/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default, arch_scale_cpu_capacity() is only visible from within the kernel/sched folder. Relocate it to include/linux/sched/topology.h to make it visible to other clients needing to know about the capacity of CPUs, such as the Energy Model framework. This also shrinks the <linux/sched/topology.h> public header. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-2-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/topology: Remove the ::smt_gain field from 'struct sched_domain'Vincent Guittot2018-12-111-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ::smt_gain is used to compute the capacity of CPUs of a SMT core with the constraint 1 < ::smt_gain < 2 in order to be able to compute number of CPUs per core. The field has_free_capacity of struct numa_stat, which was the last user of this computation of number of CPUs per core, has been removed by: 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()") We can now remove this constraint on core capacity and use the defautl value SCHED_CAPACITY_SCALE for SMT CPUs. With this remove, SCHED_CAPACITY_SCALE becomes the maximum compute capacity of CPUs on every systems. This should help to simplify some code and remove fields like rd->max_cpu_capacity Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other places in the code when it wants the capacity of a CPUs to scale some metrics like in pelt, deadline or schedutil. In case on SMT, the value returned is not the capacity of SMT CPUs but default SCHED_CAPACITY_SCALE. So remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1535548752-4434-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Fix various typos in commentsIngo Molnar2018-12-031-2/+2
| | | | | | | | | | | | | Go over the scheduler source code and fix common typos in comments - and a typo in an actual variable name. No change in functionality intended. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge tag 'v4.20-rc5' into sched/core, to pick up fixesIngo Molnar2018-12-031-3/+1
|\ | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * sched/smt: Expose sched_smt_present static keyThomas Gleixner2018-11-281-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the scheduler's 'sched_smt_present' static key globaly available, so it can be used in the x86 speculation control code. Provide a query function and a stub for the CONFIG_SMP=n case. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
* | sched/core: Clean up the #ifdef block in add_nr_running()Viresh Kumar2018-11-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no point in keeping the conditional statement of the #if block outside of the #ifdef block, while all of its body is contained within the #ifdef block. Move the conditional statement under the #ifdef block as well. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/78cbd78a615d6f9fdcd3327f1ead68470f92593e.1541482935.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | sched/core: Create task_has_idle_policy() helperViresh Kumar2018-11-121-0/+5
|/ | | | | | | | | | | | | | | | | | We already have task_has_rt_policy() and task_has_dl_policy() helpers, create task_has_idle_policy() as well and update sched core to start using it. While at it, use task_has_dl_policy() at one more place. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* psi: pressure stall information for CPU, memory, and IOJohannes Weiner2018-10-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays: cpu: some tasks are runnable but not executing on a CPU memory: tasks are reclaiming, or waiting for swapin or thrashing cache io: tasks are waiting for io completions These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage). [hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched: introduce this_rq_lock_irq()Johannes Weiner2018-10-261-0/+12
| | | | | | | | | | | | | | | | | | | | | | | do_sched_yield() disables IRQs, looks up this_rq() and locks it. The next patch is adding another site with the same pattern, so provide a convenience function for it. Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched: sched.h: make rq locking and clock functions available in stats.hJohannes Weiner2018-10-261-82/+82
| | | | | | | | | | | | | | | | | | | | | | | | kernel/sched/sched.h includes "stats.h" half-way through the file. The next patch introduces users of sched.h's rq locking functions and update_rq_clock() in kernel/sched/stats.h. Move those definitions up in the file so they are available in stats.h. Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds2018-10-231-1/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "Lots of changes in this cycle: - Lots of CPA (change page attribute) optimizations and related cleanups (Thomas Gleixner, Peter Zijstra) - Make lazy TLB mode even lazier (Rik van Riel) - Fault handler cleanups and improvements (Dave Hansen) - kdump, vmcore: Enable kdumping encrypted memory with AMD SME enabled (Lianbo Jiang) - Clean up VM layout documentation (Baoquan He, Ingo Molnar) - ... plus misc other fixes and enhancements" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry() x86/mm: Kill stray kernel fault handling comment x86/mm: Do not warn about PCI BIOS W+X mappings resource: Clean it up a bit resource: Fix find_next_iomem_res() iteration issue resource: Include resource end in walk_*() interfaces x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error x86/mm: Remove spurious fault pkey check x86/mm/vsyscall: Consider vsyscall page part of user address space x86/mm: Add vsyscall address helper x86/mm: Fix exception table comments x86/mm: Add clarifying comments for user addr space x86/mm: Break out user address space handling x86/mm: Break out kernel address space handling x86/mm: Clarify hardware vs. software "error_code" x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Add freed_tables element to flush_tlb_info x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range smp,cpumask: introduce on_each_cpu_cond_mask smp: use __cpumask_set_cpu in on_each_cpu_cond ...
| * x86/stackprotector: Remove the call to boot_init_stack_canary() from ↵Christophe Leroy2018-10-221-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpu_startup_entry() The following commit: d7880812b359 ("idle: Add the stack canary init to cpu_startup_entry()") ... added an x86 specific boot_init_stack_canary() call to the generic cpu_startup_entry() as a temporary hack, with the intention to remove the #ifdef CONFIG_X86 later. More than 5 years later let's finally realize that plan! :-) While implementing stack protector support for PowerPC, we found that calling boot_init_stack_canary() is also needed for PowerPC which uses per task (TLS) stack canary like the X86. However, calling boot_init_stack_canary() would break architectures using a global stack canary (ARM, SH, MIPS and XTENSA). Instead of modifying the #ifdef CONFIG_X86 to an even messier: #if defined(CONFIG_X86) || defined(CONFIG_PPC) PowerPC implemented the call to boot_init_stack_canary() in the function calling cpu_startup_entry(). Let's try the same cleanup on the x86 side as well. On x86 we have two functions calling cpu_startup_entry(): - start_secondary() - cpu_bringup_and_idle() start_secondary() already calls boot_init_stack_canary(), so it's good, and this patch adds the call to boot_init_stack_canary() in cpu_bringup_and_idle(). I.e. now x86 catches up to the rest of the world and the ugly init sequence in init/main.c can be removed from cpu_startup_entry(). As a final benefit we can also remove the <linux/stackprotector.h> dependency from <linux/sched.h>. [ mingo: Improved the changelog a bit, added language explaining x86 borkage and sched.h change. ] Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev@lists.ozlabs.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20181020072649.5B59310483E@pc16082vm.idsi0.si.c-s.fr Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2018-10-231-8/+15
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes are: - Migrate CPU-intense 'misfit' tasks on asymmetric capacity systems, to better utilize (much) faster 'big core' CPUs. (Morten Rasmussen, Valentin Schneider) - Topology handling improvements, in particular when CPU capacity changes and related load-balancing fixes/improvements (Morten Rasmussen) - ... plus misc other improvements, fixes and updates" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits) sched/completions/Documentation: Add recommendation for dynamic and ONSTACK completions sched/completions/Documentation: Clean up the document some more sched/completions/Documentation: Fix a couple of punctuation nits cpu/SMT: State SMT is disabled even with nosmt and without "=force" sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load() sched/fair: Remove setting task's se->runnable_weight during PELT update sched/fair: Disable LB_BIAS by default sched/pelt: Fix warning and clean up IRQ PELT config sched/topology: Make local variables static sched/debug: Use symbolic names for task state constants sched/numa: Remove unused numa_stats::nr_running field sched/numa: Remove unused code from update_numa_stats() sched/debug: Explicitly cast sched_feat() to bool sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains sched/fair: Don't move tasks to lower capacity CPUs unless necessary sched/fair: Set rq->rd->overload when misfit sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE() sched/core: Change root_domain->overload type to int sched/fair: Change 'prefer_sibling' type to bool sched/fair: Kick nohz balance if rq->misfit_task_load ...
| * | sched/pelt: Fix warning and clean up IRQ PELT configVincent Guittot2018-10-021-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create a config for enabling irq load tracking in the scheduler. irq load tracking is useful only when irq or paravirtual time is accounted but it's only possible with SMP for now. Also use __maybe_unused to remove the compilation warning in update_rq_clock_task() that has been introduced by: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()") Suggested-by: Ingo Molnar <mingo@redhat.com> Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: dou_liyang@163.com Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()") Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar2018-10-021-1/+2
| |\ \ | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/debug: Explicitly cast sched_feat() to boolPeter Zijlstra2018-09-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | LLVM has a warning that tags expressions like: if (foo && non-bool-const) This pattern triggers for CONFIG_SCHED_DEBUG=n where sched_feat() ends up being whatever bit we select. Avoid the warning with an explicit cast to bool. Reported-by: Philipp Klocke <philipp97kl@gmail.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/fair: Set rq->rd->overload when misfitValentin Schneider2018-09-101-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Idle balance is a great opportunity to pull a misfit task. However, there are scenarios where misfit tasks are present but idle balance is prevented by the overload flag. A good example of this is a workload of n identical tasks. Let's suppose we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly CPU-intensive tasks - for the sake of simplicity let's say they are just CPU hogs, even when running on big CPUs. They are identical tasks, so on an SMP system they should all end at (roughly) the same time. However, in our case the LITTLE CPUs are less performing than the big CPUs, so tasks running on the LITTLEs will have a longer completion time. This means that the big CPUs will complete their work earlier, at which point they should pull the tasks from the LITTLEs. What we want to happen is summarized as follows: a,b,c,d are our CPU-hogging tasks _ signifies idling LITTLE_0 | a a a a _ _ LITTLE_1 | b b b b _ _ ---------|------------- big_0 | c c c c a a big_1 | d d d d b b ^ ^ Tasks end on the big CPUs, idle balance happens and the misfit tasks are pulled straight away This however won't happen, because currently the overload flag is only set when there is any CPU that has more than one runnable task - which may very well not be the case here if our CPU-hogging workload is all there is to run. As such, this commit sets the overload flag in update_sg_lb_stats when a group is flagged as having a misfit task. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()Valentin Schneider2018-09-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This variable can be read and set locklessly within update_sd_lb_stats(). As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong can happen because of the compiler. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/core: Change root_domain->overload type to intValentin Schneider2018-09-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sizeof(_Bool) is implementation defined, so let's just go with 'int' as is done for other structures e.g. sched_domain_shared->has_idle_cores. The local 'overload' variable used in update_sd_lb_stats can remain bool, as it won't impact any struct layout and can be assigned to the root_domain field. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-8-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/fair: Add sched_group per-CPU max capacityMorten Rasmussen2018-09-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current sg->min_capacity tracks the lowest per-CPU compute capacity available in the sched_group when rt/irq pressure is taken into account. Minimum capacity isn't the ideal metric for tracking if a sched_group needs offloading to another sched_group for some scenarios, e.g. a sched_group with multiple CPUs if only one is under heavy pressure. Tracking maximum capacity isn't perfect either but a better choice for some situations as it indicates that the sched_group definitely compute capacity constrained either due to rt/irq pressure on all CPUs or asymmetric CPU capacities (e.g. big.LITTLE). Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/fair: Add 'group_misfit_task' load-balance typeMorten Rasmussen2018-09-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To maximize throughput in systems with asymmetric CPU capacities (e.g. ARM big.LITTLE) load-balancing has to consider task and CPU utilization as well as per-CPU compute capacity when load-balancing in addition to the current average load based load-balancing policy. Tasks with high utilization that are scheduled on a lower capacity CPU need to be identified and migrated to a higher capacity CPU if possible to maximize throughput. To implement this additional policy an additional group_type (load-balance scenario) is added: 'group_misfit_task'. This represents scenarios where a sched_group has one or more tasks that are not suitable for its per-CPU capacity. 'group_misfit_task' is only considered if the system is not overloaded or imbalanced ('group_imbalanced' or 'group_overloaded'). Identifying misfit tasks requires the rq lock to be held. To avoid taking remote rq locks to examine source sched_groups for misfit tasks, each CPU is responsible for tracking misfit tasks themselves and update the rq->misfit_task flag. This means checking task utilization when tasks are scheduled and on sched_tick. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/topology: Add static_key for asymmetric CPU capacity optimizationsMorten Rasmussen2018-09-101-0/+1
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing asymmetric CPU capacity code should cause minimal overhead for others. Putting it behind a static_key, it has been done for SMT optimizations, would make it easier to extend and improve without causing harm to others moving forward. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | sched/fair: Fix throttle_list starvation with low CFS quotaPhil Auld2018-10-111-0/+2
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With a very low cpu.cfs_quota_us setting, such as the minimum of 1000, distribute_cfs_runtime may not empty the throttled_list before it runs out of runtime to distribute. In that case, due to the change from c06f04c7048 to put throttled entries at the head of the list, later entries on the list will starve. Essentially, the same X processes will get pulled off the list, given CPU time and then, when expired, get put back on the head of the list where distribute_cfs_runtime will give runtime to the same set of processes leaving the rest. Fix the issue by setting a bit in struct cfs_bandwidth when distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can decide to put the throttled entry on the tail or the head of the list. The bit is set/cleared by the callers of distribute_cfs_runtime while they hold cfs_bandwidth->lock. This is easy to reproduce with a handful of CPU consumers. I use 'crash' on the live system. In some cases you can simply look at the throttled list and see the later entries are not changing: crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3 1 ffff90b56cb2d200 -976050 2 ffff90b56cb2cc00 -484925 3 ffff90b56cb2bc00 -658814 4 ffff90b56cb2ba00 -275365 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3 1 ffff90b56cb2d200 -994147 2 ffff90b56cb2cc00 -306051 3 ffff90b56cb2bc00 -961321 4 ffff90b56cb2ba00 -24490 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 Sometimes it is easier to see by finding a process getting starved and looking at the sched_info: crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, Signed-off-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop") Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | sched/numa: Pass destination CPU as a parameter to migrate_task_rqSrikar Dronamraju2018-10-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This additional parameter (new_cpu) is used later for identifying if task migration is across nodes. No functional change. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 203353 200668 -1.32036 1 328205 321791 -1.95427 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 214384 204848 -4.44809 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 188553 188098 -0.241311 1 196273 200351 2.07772 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 57581.2 58145.9 0.980702 1 103468 103798 0.318939 Brings out the variance between different specjbb2005 runs. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,941,377 13,912,183 migrations 1,157,323 1,155,931 faults 382,175 367,139 cache-misses 54,993,823,500 54,240,196,814 sched:sched_move_numa 2,005 1,571 sched:sched_stick_numa 14 9 sched:sched_swap_numa 529 463 migrate:mm_migrate_pages 1,573 703 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 67099 50155 numa_hint_faults_local 58456 45264 numa_hit 240416 239652 numa_huge_pte_updates 18 36 numa_interleave 65 68 numa_local 240339 239576 numa_other 77 76 numa_pages_migrated 1574 680 numa_pte_updates 77182 71146 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,176,453 3,156,720 migrations 30,238 30,354 faults 87,869 97,261 cache-misses 12,544,479,391 12,400,026,826 sched:sched_move_numa 23 4 sched:sched_stick_numa 0 0 sched:sched_swap_numa 6 1 migrate:mm_migrate_pages 10 20 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 236 272 numa_hint_faults_local 201 186 numa_hit 72293 71362 numa_huge_pte_updates 0 0 numa_interleave 26 23 numa_local 72233 71299 numa_other 60 63 numa_pages_migrated 8 2 numa_pte_updates 0 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,478,820 8,606,824 migrations 171,323 155,352 faults 307,499 301,409 cache-misses 240,353,599 157,759,224 sched:sched_move_numa 214 168 sched:sched_stick_numa 0 0 sched:sched_swap_numa 4 3 migrate:mm_migrate_pages 89 125 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 5301 4650 numa_hint_faults_local 4745 3946 numa_hit 92943 90489 numa_huge_pte_updates 0 0 numa_interleave 899 892 numa_local 92345 90034 numa_other 598 455 numa_pages_migrated 88 124 numa_pte_updates 5505 4818 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,066,172 2,113,167 migrations 11,076 10,533 faults 149,544 142,727 cache-misses 10,398,067 5,594,192 sched:sched_move_numa 43 10 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 6 6 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 3552 744 numa_hint_faults_local 3347 584 numa_hit 25611 25551 numa_huge_pte_updates 0 0 numa_interleave 213 263 numa_local 25583 25302 numa_other 28 249 numa_pages_migrated 6 6 numa_pte_updates 3535 744 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 99,358,136 101,227,352 migrations 4,041,607 4,151,829 faults 749,653 745,233 cache-misses 225,562,543,251 224,669,561,766 sched:sched_move_numa 771 617 sched:sched_stick_numa 14 2 sched:sched_swap_numa 204 187 migrate:mm_migrate_pages 1,180 316 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 27409 24195 numa_hint_faults_local 20677 21639 numa_hit 239988 238331 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 239983 238331 numa_other 5 0 numa_pages_migrated 1016 204 numa_pte_updates 27916 24561 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 60,899,307 62,738,978 migrations 544,668 562,702 faults 270,834 228,465 cache-misses 74,543,455,635 75,778,067,952 sched:sched_move_numa 735 648 sched:sched_stick_numa 25 13 sched:sched_swap_numa 174 137 migrate:mm_migrate_pages 816 733 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 11059 10281 numa_hint_faults_local 4733 3242 numa_hit 41384 36338 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 41383 36338 numa_other 1 0 numa_pages_migrated 815 706 numa_pte_updates 11323 10176 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | sched/numa: Stop multiple tasks from moving to the CPU at the same timeSrikar Dronamraju2018-10-021-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Task migration under NUMA balancing can happen in parallel. More than one task might choose to migrate to the same CPU at the same time. This can result in: - During task swap, choosing a task that was not part of the evaluation. - During task swap, task which just got moved into its preferred node, moving to a completely different node. - During task swap, task failing to move to the preferred node, will have to wait an extra interval for the next migrate opportunity. - During task movement, multiple task movements can cause load imbalance. This problem is more likely if there are more cores per node or more nodes in the system. Use a per run-queue variable to check if NUMA-balance is active on the run-queue. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 200194 203353 1.57797 1 311331 328205 5.41995 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 197654 214384 8.46429 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 192605 188553 -2.10379 1 213402 196273 -8.02664 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 52227.1 57581.2 10.2516 1 102529 103468 0.915838 There is a regression on power 9 box. If we look at the details, that box has a sudden jump in cache-misses with this patch. All other parameters seem to be pointing towards NUMA consolidation. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,345,784 13,941,377 migrations 1,127,820 1,157,323 faults 374,736 382,175 cache-misses 55,132,054,603 54,993,823,500 sched:sched_move_numa 1,923 2,005 sched:sched_stick_numa 52 14 sched:sched_swap_numa 595 529 migrate:mm_migrate_pages 1,932 1,573 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 60605 67099 numa_hint_faults_local 51804 58456 numa_hit 239945 240416 numa_huge_pte_updates 14 18 numa_interleave 60 65 numa_local 239865 240339 numa_other 80 77 numa_pages_migrated 1931 1574 numa_pte_updates 67823 77182 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,016,467 3,176,453 migrations 37,326 30,238 faults 115,342 87,869 cache-misses 11,692,155,554 12,544,479,391 sched:sched_move_numa 965 23 sched:sched_stick_numa 8 0 sched:sched_swap_numa 35 6 migrate:mm_migrate_pages 1,168 10 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 16286 236 numa_hint_faults_local 11863 201 numa_hit 112482 72293 numa_huge_pte_updates 33 0 numa_interleave 20 26 numa_local 112419 72233 numa_other 63 60 numa_pages_migrated 1144 8 numa_pte_updates 32859 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,629,724 8,478,820 migrations 221,052 171,323 faults 308,661 307,499 cache-misses 135,574,913 240,353,599 sched:sched_move_numa 147 214 sched:sched_stick_numa 0 0 sched:sched_swap_numa 2 4 migrate:mm_migrate_pages 64 89 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11481 5301 numa_hint_faults_local 10968 4745 numa_hit 89773 92943 numa_huge_pte_updates 0 0 numa_interleave 1116 899 numa_local 89220 92345 numa_other 553 598 numa_pages_migrated 62 88 numa_pte_updates 11694 5505 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,272,887 2,066,172 migrations 12,206 11,076 faults 163,704 149,544 cache-misses 4,801,186 10,398,067 sched:sched_move_numa 44 43 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 17 6 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 2261 3552 numa_hint_faults_local 1993 3347 numa_hit 25726 25611 numa_huge_pte_updates 0 0 numa_interleave 239 213 numa_local 25498 25583 numa_other 228 28 numa_pages_migrated 17 6 numa_pte_updates 2266 3535 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 117,980,962 99,358,136 migrations 3,950,220 4,041,607 faults 736,979 749,653 cache-misses 224,976,072,879 225,562,543,251 sched:sched_move_numa 504 771 sched:sched_stick_numa 50 14 sched:sched_swap_numa 239 204 migrate:mm_migrate_pages 1,260 1,180 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 18293 27409 numa_hint_faults_local 11969 20677 numa_hit 240854 239988 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 240851 239983 numa_other 3 5 numa_pages_migrated 1190 1016 numa_pte_updates 18106 27916 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 61,053,158 60,899,307 migrations 551,586 544,668 faults 244,174 270,834 cache-misses 74,326,766,973 74,543,455,635 sched:sched_move_numa 344 735 sched:sched_stick_numa 24 25 sched:sched_swap_numa 140 174 migrate:mm_migrate_pages 568 816 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 6461 11059 numa_hint_faults_local 2283 4733 numa_hit 35661 41384 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 35661 41383 numa_other 0 1 numa_pages_migrated 568 815 numa_pte_updates 6518 11323 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Modify migrate_swap() to accept additional parametersSrikar Dronamraju2018-07-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are checks in migrate_swap_stop() that check if the task/CPU combination is as per migrate_swap_arg before migrating. However atleast one of the two tasks to be swapped by migrate_swap() could have migrated to a completely different CPU before updating the migrate_swap_arg. The new CPU where the task is currently running could be a different node too. If the task has migrated, numa balancer might end up placing a task in a wrong node. Instead of achieving node consolidation, it may end up spreading the load across nodes. To avoid that pass the CPUs as additional parameters. While here, place migrate_swap under CONFIG_NUMA_BALANCING. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25377.3 25226.6 -0.59 1 72287 73326 1.437 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/fair: Remove #ifdefs from scale_rt_capacity()Vincent Guittot2018-07-251-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Reuse cpu_util_irq() that has been defined for schedutil and set irq util to 0 when !CONFIG_IRQ_TIME_ACCOUNTING. But the compiler is not able to optimize the sequence (at least with aarch64 GCC 7.2.1): free *= (max - irq); free /= max; when irq is fixed to 0 Add a new inline function scale_irq_capacity() that will scale utilization when irq is accounted. Reuse this funciton in schedutil which applies similar formula. Suggested-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: rjw@rjwysocki.net Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/sysctl: Remove unused sched_time_avg_ms sysctlVincent Guittot2018-07-161-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere, remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-12-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/core: Remove the rt_avg codeVincent Guittot2018-07-161-17/+0
| | | | | | | | | | | | | | | | | | | | | | | | rt_avg is not used anywhere anymore, so we can remove all related code. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-11-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/cpufreq: Remove sugov_aggregate_util()Vincent Guittot2018-07-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | There is no reason why sugov_get_util() and sugov_aggregate_util() were in fact separate functions. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> [ Rebased after adding irq tracking and fixed some compilation errors. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1530200714-4504-9-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* cpufreq/schedutil: Take time spent in interrupts into accountVincent Guittot2018-07-151-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The time spent executing IRQ handlers can be significant but it is not reflected in the utilization of CPU when deciding to choose an OPP. Now that we have access to this metric, schedutil can take it into account when selecting the OPP for a CPU. RQS utilization don't see the time spend under interrupt context and report their value in the normal context time window. We need to compensate this when adding interrupt utilization The CPU utilization is: IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg A test with iperf on hikey (octo arm64) gives the following speedup: iperf -c server_address -r -t 5 w/o patch w/ patch Tx 276 Mbits/sec 304 Mbits/sec +10% Rx 299 Mbits/sec 328 Mbits/sec +9% 8 iterations stdev is lower than 1% Only WFI idle state is enabled (shallowest idle state). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1530200714-4504-8-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/irq: Add IRQ utilization trackingVincent Guittot2018-07-151-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | interrupt and steal time are the only remaining activities tracked by rt_avg. Like for sched classes, we can use PELT to track their average utilization of the CPU. But unlike sched class, we don't track when entering/leaving interrupt; Instead, we take into account the time spent under interrupt context when we update rqs' clock (rq_clock_task). This also means that we have to decay the normal context time and account for interrupt time during the update. That's also important to note that because: rq_clock == rq_clock_task + interrupt time and rq_clock_task is used by a sched class to compute its utilization, the util_avg of a sched class only reflects the utilization of the time spent in normal context and not of the whole time of the CPU. The utilization of interrupt gives an more accurate level of utilization of CPU. The CPU utilization is: avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq Most of the time, avg_irq is small and neglictible so the use of the approximation CPU utilization = /Sum avg_rq was enough. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* cpufreq/schedutil: Use DL utilization trackingVincent Guittot2018-07-151-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we have both the DL class bandwidth requirement and the DL class utilization, we can detect when CPU is fully used so we should run at max. Otherwise, we keep using the DL bandwidth requirement to define the utilization of the CPU. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1530200714-4504-6-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/dl: Add dl_rq utilization trackingVincent Guittot2018-07-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly to what happens with RT tasks, CFS tasks can be preempted by DL tasks and the CFS's utilization might no longer describes the real utilization level. Current DL bandwidth reflects the requirements to meet deadline when tasks are enqueued but not the current utilization of the DL sched class. We track DL class utilization to estimate the system utilization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/rt: Add rt_rq utilization trackingVincent Guittot2018-07-151-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks are preempted by RT tasks and in this case util_avg reflects the remaining capacity but not what CFS want to use. In such case, schedutil can select a lower OPP whereas the CPU is overloaded. In order to have a more accurate view of the utilization of the CPU, we track the utilization of RT tasks. Only util_avg is correctly tracked but not load_avg and runnable_load_avg which are useless for rt_rq. rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are the same at the root group level, so the PELT windows of the util_sum are aligned. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/pelt: Move PELT related code in a dedicated fileVincent Guittot2018-07-151-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to track rt_rq's utilization as a part of the estimation of the whole rq's utilization. This is necessary because rt tasks can steal utilization to cfs tasks and make them lighter than they are. As we want to use the same load tracking mecanism for both and prevent useless dependency between cfs and rt code, PELT code is moved in a dedicated file. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/fair: Fix bandwidth timer clock drift conditionXunlei Pang2018-07-031-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I noticed that cgroup task groups constantly get throttled even if they have low CPU usage, this causes some jitters on the response time to some of our business containers when enabling CPU quotas. It's very simple to reproduce: mkdir /sys/fs/cgroup/cpu/test cd /sys/fs/cgroup/cpu/test echo 100000 > cpu.cfs_quota_us echo $$ > tasks then repeat: cat cpu.stat | grep nr_throttled # nr_throttled will increase steadily After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b|q}->runtime_expires" after period timer is re-armed. The current condition to judge clock drift in expire_cfs_rq_runtime() is wrong, the two runtime_expires are actually the same when clock drift happens, so this condtion can never hit. The orginal design was correctly done by this commit: a9cf55b28610 ("sched: Expire invalid runtime") ... but was changed to be the current implementation due to its locking bug. This patch introduces another way, it adds a new field in both structures cfs_rq and cfs_bandwidth to record the expiration update sequence, and uses them to figure out if clock drift happens (true if they are equal). Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period") Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/rt: Fix call to cpufreq_update_util()Vincent Guittot2018-07-031-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit: 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support") the schedutil governor uses rq->rt.rt_nr_running to detect whether an RT task is currently running on the CPU and to set frequency to max if necessary. cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is called so schedutil still considers that an RT task is running when the last task is dequeued. The update of rq->rt.rt_nr_running happens later in dequeue_rt_stack(). In fact, we can take advantage of the sequence that the dequeue then re-enqueue rt entities when a rt task is enqueued or dequeued; As a result enqueue_top_rt_rq() is always called when a task is enqueued or dequeued and also when groups are throttled or unthrottled. The only place that not use enqueue_top_rt_rq() is when root rt_rq is throttled. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: juri.lelli@redhat.com Cc: patrick.bellasi@arm.com Cc: viresh.kumar@linaro.org Fixes: 8f111bc357aa ('cpufreq/schedutil: Rewrite CPUFREQ_RT support') Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2018-06-041-0/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - power-aware scheduling improvements (Patrick Bellasi) - NUMA balancing improvements (Mel Gorman) - vCPU scheduling fixes (Rohit Jain) * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Update util_est before updating schedutil sched/cpufreq: Modify aggregate utilization to always include blocked FAIR utilization sched/deadline/Documentation: Add overrun signal and GRUB-PA documentation sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu() sched/wait: Include <linux/wait.h> in <linux/swait.h> sched/numa: Stagger NUMA balancing scan periods for new threads sched/core: Don't schedule threads on pre-empted vCPUs sched/fair: Avoid calling sync_entity_load_avg() unnecessarily sched/fair: Rearrange select_task_rq_fair() to optimize it
| * Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar2018-05-251-2/+3
| |\ | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | sched/numa: Stagger NUMA balancing scan periods for new threadsMel Gorman2018-05-141-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Threads share an address space and each can change the protections of the same address space to trap NUMA faults. This is redundant and potentially counter-productive as any thread doing the update will suffice. Potentially only one thread is required but that thread may be idle or it may not have any locality concerns and pick an unsuitable scan rate. This patch uses independent scan period but they are staggered based on the number of address space users when the thread is created. The intent is that threads will avoid scanning at the same time and have a chance to adapt their scan rate later if necessary. This reduces the total scan activity early in the lifetime of the threads. The different in headline performance across a range of machines and workloads is marginal but the system CPU usage is reduced as well as overall scan activity. The following is the time reported by NAS Parallel Benchmark using unbound openmp threads and a D size class: 4.17.0-rc1 4.17.0-rc1 vanilla stagger-v1r1 Time bt.D 442.77 ( 0.00%) 419.70 ( 5.21%) Time cg.D 171.90 ( 0.00%) 180.85 ( -5.21%) Time ep.D 33.10 ( 0.00%) 32.90 ( 0.60%) Time is.D 9.59 ( 0.00%) 9.42 ( 1.77%) Time lu.D 306.75 ( 0.00%) 304.65 ( 0.68%) Time mg.D 54.56 ( 0.00%) 52.38 ( 4.00%) Time sp.D 1020.03 ( 0.00%) 903.77 ( 11.40%) Time ua.D 400.58 ( 0.00%) 386.49 ( 3.52%) Note it's not a universal win but we have no prior knowledge of which thread matters but the number of threads created often exceeds the size of the node when the threads are not bound. However, there is a reducation of overall system CPU usage: 4.17.0-rc1 4.17.0-rc1 vanilla stagger-v1r1 sys-time-bt.D 48.78 ( 0.00%) 48.22 ( 1.15%) sys-time-cg.D 25.31 ( 0.00%) 26.63 ( -5.22%) sys-time-ep.D 1.65 ( 0.00%) 0.62 ( 62.42%) sys-time-is.D 40.05 ( 0.00%) 24.45 ( 38.95%) sys-time-lu.D 37.55 ( 0.00%) 29.02 ( 22.72%) sys-time-mg.D 47.52 ( 0.00%) 34.92 ( 26.52%) sys-time-sp.D 119.01 ( 0.00%) 109.05 ( 8.37%) sys-time-ua.D 51.52 ( 0.00%) 45.13 ( 12.40%) NUMA scan activity is also reduced: NUMA alloc local 1042828 1342670 NUMA base PTE updates 140481138 93577468 NUMA huge PMD updates 272171 180766 NUMA page range updates 279832690 186129660 NUMA hint faults 1395972 1193897 NUMA hint local faults 877925 855053 NUMA hint local percent 62 71 NUMA pages migrated 12057909 9158023 Similar observations are made for other thread-intensive workloads. System CPU usage is lower even though the headline gains in performance tend to be small. For example, specjbb 2005 shows almost no difference in performance but scan activity is reduced by a third on a 4-socket box. I didn't find a workload (thread intensive or otherwise) that suffered badly. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | sched/headers: Fix typoDavidlohr Bueso2018-05-311-1/+1
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | I cannot spell 'throttling'. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180530224940.17839-1-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
OpenPOWER on IntegriCloud