summaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/RCU/trace.txt100
-rw-r--r--Documentation/kernel-per-CPU-kthreads.txt47
-rw-r--r--Documentation/timers/NO_HZ.txt79
3 files changed, 118 insertions, 108 deletions
diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index c776968f4463..f3778f8952da 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -530,113 +530,21 @@ o "nos" counts the number of times we balked for other
reasons, e.g., the grace period ended first.
-CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU debugfs Files and Formats
+CONFIG_TINY_RCU debugfs Files and Formats
These implementations of RCU provides a single debugfs file under the
top-level directory RCU, namely rcu/rcudata, which displays fields in
-rcu_bh_ctrlblk, rcu_sched_ctrlblk and, for CONFIG_TINY_PREEMPT_RCU,
-rcu_preempt_ctrlblk.
+rcu_bh_ctrlblk and rcu_sched_ctrlblk.
The output of "cat rcu/rcudata" is as follows:
-rcu_preempt: qlen=24 gp=1097669 g197/p197/c197 tasks=...
- ttb=. btg=no ntb=184 neb=0 nnb=183 j=01f7 bt=0274
- normal balk: nt=1097669 gt=0 bt=371 b=0 ny=25073378 nos=0
- exp balk: bt=0 nos=0
rcu_sched: qlen: 0
rcu_bh: qlen: 0
-This is split into rcu_preempt, rcu_sched, and rcu_bh sections, with the
-rcu_preempt section appearing only in CONFIG_TINY_PREEMPT_RCU builds.
-The last three lines of the rcu_preempt section appear only in
-CONFIG_RCU_BOOST kernel builds. The fields are as follows:
+This is split into rcu_sched and rcu_bh sections. The field is as
+follows:
o "qlen" is the number of RCU callbacks currently waiting either
for an RCU grace period or waiting to be invoked. This is the
only field present for rcu_sched and rcu_bh, due to the
short-circuiting of grace period in those two cases.
-
-o "gp" is the number of grace periods that have completed.
-
-o "g197/p197/c197" displays the grace-period state, with the
- "g" number being the number of grace periods that have started
- (mod 256), the "p" number being the number of grace periods
- that the CPU has responded to (also mod 256), and the "c"
- number being the number of grace periods that have completed
- (once again mode 256).
-
- Why have both "gp" and "g"? Because the data flowing into
- "gp" is only present in a CONFIG_RCU_TRACE kernel.
-
-o "tasks" is a set of bits. The first bit is "T" if there are
- currently tasks that have recently blocked within an RCU
- read-side critical section, the second bit is "N" if any of the
- aforementioned tasks are blocking the current RCU grace period,
- and the third bit is "E" if any of the aforementioned tasks are
- blocking the current expedited grace period. Each bit is "."
- if the corresponding condition does not hold.
-
-o "ttb" is a single bit. It is "B" if any of the blocked tasks
- need to be priority boosted and "." otherwise.
-
-o "btg" indicates whether boosting has been carried out during
- the current grace period, with "exp" indicating that boosting
- is in progress for an expedited grace period, "no" indicating
- that boosting has not yet started for a normal grace period,
- "begun" indicating that boosting has bebug for a normal grace
- period, and "done" indicating that boosting has completed for
- a normal grace period.
-
-o "ntb" is the total number of tasks subjected to RCU priority boosting
- periods since boot.
-
-o "neb" is the number of expedited grace periods that have had
- to resort to RCU priority boosting since boot.
-
-o "nnb" is the number of normal grace periods that have had
- to resort to RCU priority boosting since boot.
-
-o "j" is the low-order 16 bits of the jiffies counter in hexadecimal.
-
-o "bt" is the low-order 16 bits of the value that the jiffies counter
- will have at the next time that boosting is scheduled to begin.
-
-o In the line beginning with "normal balk", the fields are as follows:
-
- o "nt" is the number of times that the system balked from
- boosting because there were no blocked tasks to boost.
- Note that the system will balk from boosting even if the
- grace period is overdue when the currently running task
- is looping within an RCU read-side critical section.
- There is no point in boosting in this case, because
- boosting a running task won't make it run any faster.
-
- o "gt" is the number of times that the system balked
- from boosting because, although there were blocked tasks,
- none of them were preventing the current grace period
- from completing.
-
- o "bt" is the number of times that the system balked
- from boosting because boosting was already in progress.
-
- o "b" is the number of times that the system balked from
- boosting because boosting had already completed for
- the grace period in question.
-
- o "ny" is the number of times that the system balked from
- boosting because it was not yet time to start boosting
- the grace period in question.
-
- o "nos" is the number of times that the system balked from
- boosting for inexplicable ("not otherwise specified")
- reasons. This can actually happen due to races involving
- increments of the jiffies counter.
-
-o In the line beginning with "exp balk", the fields are as follows:
-
- o "bt" is the number of times that the system balked from
- boosting because there were no blocked tasks to boost.
-
- o "nos" is the number of times that the system balked from
- boosting for inexplicable ("not otherwise specified")
- reasons.
diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index cbf7ae412da4..5f39ef55c6f6 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -157,6 +157,53 @@ RCU_SOFTIRQ: Do at least one of the following:
calls and by forcing both kernel threads and interrupts
to execute elsewhere.
+Name: kworker/%u:%d%s (cpu, id, priority)
+Purpose: Execute workqueue requests
+To reduce its OS jitter, do any of the following:
+1. Run your workload at a real-time priority, which will allow
+ preempting the kworker daemons.
+2. Do any of the following needed to avoid jitter that your
+ application cannot tolerate:
+ a. Build your kernel with CONFIG_SLUB=y rather than
+ CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
+ use of each CPU's workqueues to run its cache_reap()
+ function.
+ b. Avoid using oprofile, thus avoiding OS jitter from
+ wq_sync_buffer().
+ c. Limit your CPU frequency so that a CPU-frequency
+ governor is not required, possibly enlisting the aid of
+ special heatsinks or other cooling technologies. If done
+ correctly, and if you CPU architecture permits, you should
+ be able to build your kernel with CONFIG_CPU_FREQ=n to
+ avoid the CPU-frequency governor periodically running
+ on each CPU, including cs_dbs_timer() and od_dbs_timer().
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ d. It is not possible to entirely get rid of OS jitter
+ from vmstat_update() on CONFIG_SMP=y systems, but you
+ can decrease its frequency by writing a large value to
+ /proc/sys/vm/stat_interval. The default value is HZ,
+ for an interval of one second. Of course, larger values
+ will make your virtual-memory statistics update more
+ slowly. Of course, you can also run your workload at
+ a real-time priority, thus preempting vmstat_update().
+ e. If running on high-end powerpc servers, build with
+ CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
+ daemon from running on each CPU every second or so.
+ (This will require editing Kconfig files and will defeat
+ this platform's RAS functionality.) This avoids jitter
+ due to the rtas_event_scan() function.
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ f. If running on Cell Processor, build your kernel with
+ CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
+ spu_gov_work().
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ g. If running on PowerMAC, build your kernel with
+ CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
+ avoiding OS jitter from rackmeter_do_timer().
+
Name: rcuc/%u
Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
To reduce its OS jitter, do at least one of the following:
diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt
index 5b5322024067..88697584242b 100644
--- a/Documentation/timers/NO_HZ.txt
+++ b/Documentation/timers/NO_HZ.txt
@@ -7,21 +7,59 @@ efficiency and reducing OS jitter. Reducing OS jitter is important for
some types of computationally intensive high-performance computing (HPC)
applications and for real-time applications.
-There are two main contexts in which the number of scheduling-clock
-interrupts can be reduced compared to the old-school approach of sending
-a scheduling-clock interrupt to all CPUs every jiffy whether they need
-it or not (CONFIG_HZ_PERIODIC=y or CONFIG_NO_HZ=n for older kernels):
+There are three main ways of managing scheduling-clock interrupts
+(also known as "scheduling-clock ticks" or simply "ticks"):
-1. Idle CPUs (CONFIG_NO_HZ_IDLE=y or CONFIG_NO_HZ=y for older kernels).
+1. Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or
+ CONFIG_NO_HZ=n for older kernels). You normally will -not-
+ want to choose this option.
-2. CPUs having only one runnable task (CONFIG_NO_HZ_FULL=y).
+2. Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or
+ CONFIG_NO_HZ=y for older kernels). This is the most common
+ approach, and should be the default.
-These two cases are described in the following two sections, followed
+3. Omit scheduling-clock ticks on CPUs that are either idle or that
+ have only one runnable task (CONFIG_NO_HZ_FULL=y). Unless you
+ are running realtime applications or certain types of HPC
+ workloads, you will normally -not- want this option.
+
+These three cases are described in the following three sections, followed
by a third section on RCU-specific considerations and a fourth and final
section listing known issues.
-IDLE CPUs
+NEVER OMIT SCHEDULING-CLOCK TICKS
+
+Very old versions of Linux from the 1990s and the very early 2000s
+are incapable of omitting scheduling-clock ticks. It turns out that
+there are some situations where this old-school approach is still the
+right approach, for example, in heavy workloads with lots of tasks
+that use short bursts of CPU, where there are very frequent idle
+periods, but where these idle periods are also quite short (tens or
+hundreds of microseconds). For these types of workloads, scheduling
+clock interrupts will normally be delivered any way because there
+will frequently be multiple runnable tasks per CPU. In these cases,
+attempting to turn off the scheduling clock interrupt will have no effect
+other than increasing the overhead of switching to and from idle and
+transitioning between user and kernel execution.
+
+This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or
+CONFIG_NO_HZ=n for older kernels).
+
+However, if you are instead running a light workload with long idle
+periods, failing to omit scheduling-clock interrupts will result in
+excessive power consumption. This is especially bad on battery-powered
+devices, where it results in extremely short battery lifetimes. If you
+are running light workloads, you should therefore read the following
+section.
+
+In addition, if you are running either a real-time workload or an HPC
+workload with short iterations, the scheduling-clock interrupts can
+degrade your applications performance. If this describes your workload,
+you should read the following two sections.
+
+
+OMIT SCHEDULING-CLOCK TICKS FOR IDLE CPUs
If a CPU is idle, there is little point in sending it a scheduling-clock
interrupt. After all, the primary purpose of a scheduling-clock interrupt
@@ -59,10 +97,12 @@ By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling
dyntick-idle mode.
-CPUs WITH ONLY ONE RUNNABLE TASK
+OMIT SCHEDULING-CLOCK TICKS FOR CPUs WITH ONLY ONE RUNNABLE TASK
If a CPU has only one runnable task, there is little point in sending it
a scheduling-clock interrupt because there is no other task to switch to.
+Note that omitting scheduling-clock ticks for CPUs with only one runnable
+task implies also omitting them for idle CPUs.
The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
sending scheduling-clock interrupts to CPUs with a single runnable task,
@@ -238,6 +278,11 @@ o Adaptive-ticks does not do anything unless there is only one
single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
tasks, even though these interrupts are unnecessary.
+ And even when there are multiple runnable tasks on a given CPU,
+ there is little point in interrupting that CPU until the current
+ running task's timeslice expires, which is almost always way
+ longer than the time of the next scheduling-clock interrupt.
+
Better handling of these sorts of situations is future work.
o A reboot is required to reconfigure both adaptive idle and RCU
@@ -268,6 +313,16 @@ o Unless all CPUs are idle, at least one CPU must keep the
scheduling-clock interrupt going in order to support accurate
timekeeping.
-o If there are adaptive-ticks CPUs, there will be at least one
- CPU keeping the scheduling-clock interrupt going, even if all
- CPUs are otherwise idle.
+o If there might potentially be some adaptive-ticks CPUs, there
+ will be at least one CPU keeping the scheduling-clock interrupt
+ going, even if all CPUs are otherwise idle.
+
+ Better handling of this situation is ongoing work.
+
+o Some process-handling operations still require the occasional
+ scheduling-clock tick. These operations include calculating CPU
+ load, maintaining sched average, computing CFS entity vruntime,
+ computing avenrun, and carrying out load balancing. They are
+ currently accommodated by scheduling-clock tick every second
+ or so. On-going work will eliminate the need even for these
+ infrequent scheduling-clock ticks.
OpenPOWER on IntegriCloud