Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/RCU/trace.txt               | 100
-rw-r--r-- Documentation/kernel-per-CPU-kthreads.txt |  47
-rw-r--r-- Documentation/timers/NO_HZ.txt            |  79
3 files changed, 118 insertions, 108 deletions
diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index c776968f4463..f3778f8952da 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -530,113 +530,21 @@ o	"nos" counts the number of times we balked for other
 	reasons, e.g., the grace period ended first.
 
 
-CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU debugfs Files and Formats
+CONFIG_TINY_RCU debugfs Files and Formats
 
 These implementations of RCU provides a single debugfs file under the
 top-level directory RCU, namely rcu/rcudata, which displays fields in
-rcu_bh_ctrlblk, rcu_sched_ctrlblk and, for CONFIG_TINY_PREEMPT_RCU,
-rcu_preempt_ctrlblk.
+rcu_bh_ctrlblk and rcu_sched_ctrlblk.
 
 The output of "cat rcu/rcudata" is as follows:
 
-rcu_preempt: qlen=24 gp=1097669 g197/p197/c197 tasks=...
-             ttb=. btg=no ntb=184 neb=0 nnb=183 j=01f7 bt=0274
-             normal balk: nt=1097669 gt=0 bt=371 b=0 ny=25073378 nos=0
-             exp balk: bt=0 nos=0
 rcu_sched: qlen: 0
 rcu_bh: qlen: 0
 
-This is split into rcu_preempt, rcu_sched, and rcu_bh sections, with the
-rcu_preempt section appearing only in CONFIG_TINY_PREEMPT_RCU builds.
-The last three lines of the rcu_preempt section appear only in
-CONFIG_RCU_BOOST kernel builds.  The fields are as follows:
+This is split into rcu_sched and rcu_bh sections.  The field is as
+follows:
 
 o	"qlen" is the number of RCU callbacks currently waiting either
 	for an RCU grace period or waiting to be invoked.  This is the
 	only field present for rcu_sched and rcu_bh, due to the
 	short-circuiting of grace period in those two cases.
-
-o	"gp" is the number of grace periods that have completed.
-
-o	"g197/p197/c197" displays the grace-period state, with the
-	"g" number being the number of grace periods that have started
-	(mod 256), the "p" number being the number of grace periods
-	that the CPU has responded to (also mod 256), and the "c"
-	number being the number of grace periods that have completed
-	(once again mode 256).
-
-	Why have both "gp" and "g"?  Because the data flowing into
-	"gp" is only present in a CONFIG_RCU_TRACE kernel.
-
-o	"tasks" is a set of bits.  The first bit is "T" if there are
-	currently tasks that have recently blocked within an RCU
-	read-side critical section, the second bit is "N" if any of the
-	aforementioned tasks are blocking the current RCU grace period,
-	and the third bit is "E" if any of the aforementioned tasks are
-	blocking the current expedited grace period.  Each bit is "."
-	if the corresponding condition does not hold.
-
-o	"ttb" is a single bit.  It is "B" if any of the blocked tasks
-	need to be priority boosted and "." otherwise.
-
-o	"btg" indicates whether boosting has been carried out during
-	the current grace period, with "exp" indicating that boosting
-	is in progress for an expedited grace period, "no" indicating
-	that boosting has not yet started for a normal grace period,
-	"begun" indicating that boosting has bebug for a normal grace
-	period, and "done" indicating that boosting has completed for
-	a normal grace period.
-
-o	"ntb" is the total number of tasks subjected to RCU priority boosting
-	periods since boot.
-
-o	"neb" is the number of expedited grace periods that have had
-	to resort to RCU priority boosting since boot.
-
-o	"nnb" is the number of normal grace periods that have had
-	to resort to RCU priority boosting since boot.
-
-o	"j" is the low-order 16 bits of the jiffies counter in hexadecimal.
-
-o	"bt" is the low-order 16 bits of the value that the jiffies counter
-	will have at the next time that boosting is scheduled to begin.
-
-o	In the line beginning with "normal balk", the fields are as follows:
-
-	o	"nt" is the number of times that the system balked from
-		boosting because there were no blocked tasks to boost.
-		Note that the system will balk from boosting even if the
-		grace period is overdue when the currently running task
-		is looping within an RCU read-side critical section.
-		There is no point in boosting in this case, because
-		boosting a running task won't make it run any faster.
-
-	o	"gt" is the number of times that the system balked
-		from boosting because, although there were blocked tasks,
-		none of them were preventing the current grace period
-		from completing.
-
-	o	"bt" is the number of times that the system balked
-		from boosting because boosting was already in progress.
-
-	o	"b" is the number of times that the system balked from
-		boosting because boosting had already completed for
-		the grace period in question.
-
-	o	"ny" is the number of times that the system balked from
-		boosting because it was not yet time to start boosting
-		the grace period in question.
-
-	o	"nos" is the number of times that the system balked from
-		boosting for inexplicable ("not otherwise specified")
-		reasons.  This can actually happen due to races involving
-		increments of the jiffies counter.
-
-o	In the line beginning with "exp balk", the fields are as follows:
-
-	o	"bt" is the number of times that the system balked from
-		boosting because there were no blocked tasks to boost.
-
-	o	"nos" is the number of times that the system balked from
-		boosting for inexplicable ("not otherwise specified")
-		reasons.
diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index cbf7ae412da4..5f39ef55c6f6 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -157,6 +157,53 @@ RCU_SOFTIRQ:  Do at least one of the following:
 	calls and by forcing both kernel threads and interrupts
 	to execute elsewhere.
 
+Name: kworker/%u:%d%s (cpu, id, priority)
+Purpose: Execute workqueue requests
+To reduce its OS jitter, do any of the following:
+1.	Run your workload at a real-time priority, which will allow
+	preempting the kworker daemons.
+2.	Do any of the following needed to avoid jitter that your
+	application cannot tolerate:
+	a.	Build your kernel with CONFIG_SLUB=y rather than
+		CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
+		use of each CPU's workqueues to run its cache_reap()
+		function.
+	b.	Avoid using oprofile, thus avoiding OS jitter from
+		wq_sync_buffer().
+	c.	Limit your CPU frequency so that a CPU-frequency
+		governor is not required, possibly enlisting the aid of
+		special heatsinks or other cooling technologies.  If done
+		correctly, and if you CPU architecture permits, you should
+		be able to build your kernel with CONFIG_CPU_FREQ=n to
+		avoid the CPU-frequency governor periodically running
+		on each CPU, including cs_dbs_timer() and od_dbs_timer().
+		WARNING:  Please check your CPU specifications to
+		make sure that this is safe on your particular system.
+	d.	It is not possible to entirely get rid of OS jitter
+		from vmstat_update() on CONFIG_SMP=y systems, but you
+		can decrease its frequency by writing a large value to
+		/proc/sys/vm/stat_interval.  The default value is HZ,
+		for an interval of one second.  Of course, larger values
+		will make your virtual-memory statistics update more
+		slowly.  Of course, you can also run your workload at
+		a real-time priority, thus preempting vmstat_update().
+	e.	If running on high-end powerpc servers, build with
+		CONFIG_PPC_RTAS_DAEMON=n.  This prevents the RTAS
+		daemon from running on each CPU every second or so.
+		(This will require editing Kconfig files and will defeat
+		this platform's RAS functionality.)  This avoids jitter
+		due to the rtas_event_scan() function.
+		WARNING:  Please check your CPU specifications to
+		make sure that this is safe on your particular system.
+	f.	If running on Cell Processor, build your kernel with
+		CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
+		spu_gov_work().
+		WARNING:  Please check your CPU specifications to
+		make sure that this is safe on your particular system.
+	g.	If running on PowerMAC, build your kernel with
+		CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
+		avoiding OS jitter from rackmeter_do_timer().
+
 Name: rcuc/%u
 Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
 To reduce its OS jitter, do at least one of the following:
diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt
index 5b5322024067..88697584242b 100644
--- a/Documentation/timers/NO_HZ.txt
+++ b/Documentation/timers/NO_HZ.txt
@@ -7,21 +7,59 @@ efficiency and reducing OS jitter.  Reducing OS jitter is important for
 some types of computationally intensive high-performance computing (HPC)
 applications and for real-time applications.
 
-There are two main contexts in which the number of scheduling-clock
-interrupts can be reduced compared to the old-school approach of sending
-a scheduling-clock interrupt to all CPUs every jiffy whether they need
-it or not (CONFIG_HZ_PERIODIC=y or CONFIG_NO_HZ=n for older kernels):
+There are three main ways of managing scheduling-clock interrupts
+(also known as "scheduling-clock ticks" or simply "ticks"):
 
-1.	Idle CPUs (CONFIG_NO_HZ_IDLE=y or CONFIG_NO_HZ=y for older kernels).
+1.	Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or
+	CONFIG_NO_HZ=n for older kernels).  You normally will -not-
+	want to choose this option.
 
-2.	CPUs having only one runnable task (CONFIG_NO_HZ_FULL=y).
+2.	Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or
+	CONFIG_NO_HZ=y for older kernels).  This is the most common
+	approach, and should be the default.
 
-These two cases are described in the following two sections, followed
+3.	Omit scheduling-clock ticks on CPUs that are either idle or that
+	have only one runnable task (CONFIG_NO_HZ_FULL=y).  Unless you
+	are running realtime applications or certain types of HPC
+	workloads, you will normally -not- want this option.
+
+These three cases are described in the following three sections, followed
 by a third section on RCU-specific considerations and a fourth and final
 section listing known issues.
 
 
-IDLE CPUs
+NEVER OMIT SCHEDULING-CLOCK TICKS
+
+Very old versions of Linux from the 1990s and the very early 2000s
+are incapable of omitting scheduling-clock ticks.  It turns out that
+there are some situations where this old-school approach is still the
+right approach, for example, in heavy workloads with lots of tasks
+that use short bursts of CPU, where there are very frequent idle
+periods, but where these idle periods are also quite short (tens or
+hundreds of microseconds).  For these types of workloads, scheduling
+clock interrupts will normally be delivered any way because there
+will frequently be multiple runnable tasks per CPU.  In these cases,
+attempting to turn off the scheduling clock interrupt will have no effect
+other than increasing the overhead of switching to and from idle and
+transitioning between user and kernel execution.
+
+This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or
+CONFIG_NO_HZ=n for older kernels).
+
+However, if you are instead running a light workload with long idle
+periods, failing to omit scheduling-clock interrupts will result in
+excessive power consumption.  This is especially bad on battery-powered
+devices, where it results in extremely short battery lifetimes.  If you
+are running light workloads, you should therefore read the following
+section.
+
+In addition, if you are running either a real-time workload or an HPC
+workload with short iterations, the scheduling-clock interrupts can
+degrade your applications performance.  If this describes your workload,
+you should read the following two sections.
+
+
+OMIT SCHEDULING-CLOCK TICKS FOR IDLE CPUs
 
 If a CPU is idle, there is little point in sending it a scheduling-clock
 interrupt.  After all, the primary purpose of a scheduling-clock interrupt
@@ -59,10 +97,12 @@ By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling
 dyntick-idle mode.
 
 
-CPUs WITH ONLY ONE RUNNABLE TASK
+OMIT SCHEDULING-CLOCK TICKS FOR CPUs WITH ONLY ONE RUNNABLE TASK
 
 If a CPU has only one runnable task, there is little point in sending it
 a scheduling-clock interrupt because there is no other task to switch to.
+Note that omitting scheduling-clock ticks for CPUs with only one runnable
+task implies also omitting them for idle CPUs.
 
 The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
 sending scheduling-clock interrupts to CPUs with a single runnable task,
@@ -238,6 +278,11 @@ o	Adaptive-ticks does not do anything unless there is only one
 	single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
 	tasks, even though these interrupts are unnecessary.
 
+	And even when there are multiple runnable tasks on a given CPU,
+	there is little point in interrupting that CPU until the current
+	running task's timeslice expires, which is almost always way
+	longer than the time of the next scheduling-clock interrupt.
+
 	Better handling of these sorts of situations is future work.
 
 o	A reboot is required to reconfigure both adaptive idle and RCU
@@ -268,6 +313,16 @@ o	Unless all CPUs are idle, at least one CPU must keep the
 	scheduling-clock interrupt going in order to support accurate
 	timekeeping.
 
-o	If there are adaptive-ticks CPUs, there will be at least one
-	CPU keeping the scheduling-clock interrupt going, even if all
-	CPUs are otherwise idle.
+o	If there might potentially be some adaptive-ticks CPUs, there
+	will be at least one CPU keeping the scheduling-clock interrupt
+	going, even if all CPUs are otherwise idle.
+
+	Better handling of this situation is ongoing work.
+
+o	Some process-handling operations still require the occasional
+	scheduling-clock tick.  These operations include calculating CPU
+	load, maintaining sched average, computing CFS entity vruntime,
+	computing avenrun, and carrying out load balancing.  They are
+	currently accommodated by scheduling-clock tick every second
+	or so.  On-going work will eliminate the need even for these
+	infrequent scheduling-clock ticks.