Diffstat (limited to 'Documentation')
40 files changed, 1355 insertions, 1033 deletions
diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index da5c087462b1..c3c705591532 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -819,7 +819,7 @@ printk(KERN_INFO "my ip: %pI4\n", &ipaddress);
 certain condition is true. They must be used carefully to ensure
 there is no race condition. You declare a
 <type>wait_queue_head_t</type>, and then processes which want to
- wait for that condition declare a <type>wait_queue_t</type>
+ wait for that condition declare a <type>wait_queue_entry_t</type>
 referring to themselves, and place that in the queue.
 </para>
diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index 1672573b037a..f46980c060aa 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -28,8 +28,6 @@ stallwarn.txt
 - RCU CPU stall warnings (module parameter rcu_cpu_stall_suppress)
 torture.txt
 - RCU Torture Test Operation (CONFIG_RCU_TORTURE_TEST)
-trace.txt
- - CONFIG_RCU_TRACE debugfs files and formats
 UP.txt
 - RCU on Uniprocessor Systems
 whatisRCU.txt
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index f60adf112663..95b30fa25d56 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -559,9 +559,7 @@ The <tt>rcu_access_pointer()</tt> on line 6 is similar to
 For <tt>remove_gp_synchronous()</tt>, as long as all modifications
 to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>,
 the above optimizations are harmless.
- However,
- with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt>,
- <tt>sparse</tt> will complain if you
+ However, <tt>sparse</tt> will complain if you
 define <tt>gp</tt> with <tt>__rcu</tt> and then
 access it without using
 either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>.
@@ -1849,7 +1847,8 @@ mass storage, or user patience, whichever comes first.
 If the nesting is not visible to the compiler, as is the case with
 mutually recursive functions each in its own translation unit,
 stack overflow will result.
-If the nesting takes the form of loops, either the control variable
+If the nesting takes the form of loops, perhaps in the guise of tail
+recursion, either the control variable
 will overflow or (in the Linux kernel) you will get an RCU CPU stall warning.
 Nevertheless, this class of RCU implementations is one of the most
 composable constructs in existence.
@@ -1977,9 +1976,8 @@ guard against mishaps and misuse:
 and <tt>rcu_dereference()</tt>, perhaps (incorrectly)
 substituting a simple assignment.
 To catch this sort of error, a given RCU-protected pointer may be
- tagged with <tt>__rcu</tt>, after which running sparse
- with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt> will complain
- about simple-assignment accesses to that pointer.
+ tagged with <tt>__rcu</tt>, after which sparse
+ will complain about simple-assignment accesses to that pointer.
 Arnd Bergmann made me aware of this requirement, and also
 supplied the needed
 <a href="https://lwn.net/Articles/376011/">patch series</a>.
@@ -2036,7 +2034,7 @@ guard against mishaps and misuse:
 some other synchronization mechanism, for example, reference
 counting.
 <li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related
- information is provided via both debugfs and event tracing.
+ information is provided via event tracing.
 <li> Open-coded use of <tt>rcu_assign_pointer()</tt> and
 <tt>rcu_dereference()</tt> to create typical linked
 data structures can be surprisingly error-prone.
@@ -2519,11 +2517,7 @@ It is similarly socially unacceptable to interrupt an
 <tt>nohz_full</tt> CPU running in userspace.
 RCU must therefore track <tt>nohz_full</tt> userspace
 execution.
-And in
-<a href="https://lwn.net/Articles/558284/"><tt>CONFIG_NO_HZ_FULL_SYSIDLE=y</tt></a>
-kernels, RCU must separately track idle CPUs on the one hand and
-CPUs that are either idle or executing in userspace on the other.
-In both cases, RCU must be able to sample state at two points in
+RCU must therefore be able to sample state at two points in
 time, and be able to determine whether or not some other CPU spent
 any time idle and/or executing in userspace.
@@ -2936,6 +2930,20 @@ to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt>
 need not exclude CPU-hotplug operations.
 <p>
+SRCU also differs from other RCU flavors in that SRCU's expedited and
+non-expedited grace periods are implemented by the same mechanism.
+This means that in the current SRCU implementation, expediting a
+future grace period has the side effect of expediting all prior
+grace periods that have not yet completed.
+(But please note that this is a property of the current implementation,
+not necessarily of future implementations.)
+In addition, if SRCU has been idle for longer than the interval
+specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter
+(25 microseconds by default),
+and if a <tt>synchronize_srcu()</tt> invocation ends this idle period,
+that invocation will be automatically expedited.
+
+<p>
 As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating
 a locking bottleneck present in prior kernel versions.
 Although this will allow users to put much heavier stress on
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 877947130ebe..6beda556faf3 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -413,11 +413,11 @@ over a rather long period of time, but improvements are always welcome!
 read-side critical sections. It is the responsibility of the
 RCU update-side primitives to deal with this.
-17.
Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
- __rcu sparse checks (enabled by CONFIG_SPARSE_RCU_POINTER) to
- validate your RCU code. These can help find problems as follows:
+17. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
+ __rcu sparse checks to validate your RCU code. These can help
+ find problems as follows:
- CONFIG_PROVE_RCU: check that accesses to RCU-protected data
+ CONFIG_PROVE_LOCKING: check that accesses to RCU-protected data
 structures are carried out under the proper RCU
 read-side critical section, while holding the right
 combination of locks, or whatever other conditions
diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
deleted file mode 100644
index 6549012033f9..000000000000
--- a/Documentation/RCU/trace.txt
+++ /dev/null
@@ -1,535 +0,0 @@
-CONFIG_RCU_TRACE debugfs Files and Formats
-
-
-The rcutree and rcutiny implementations of RCU provide debugfs trace
-output that summarizes counters and state. This information is useful for
-debugging RCU itself, and can sometimes also help to debug abuses of RCU.
-The following sections describe the debugfs files and formats, first
-for rcutree and next for rcutiny.
-
-
-CONFIG_TREE_RCU and CONFIG_PREEMPT_RCU debugfs Files and Formats
-
-These implementations of RCU provide several debugfs directories under the
-top-level directory "rcu":
-
-rcu/rcu_bh
-rcu/rcu_preempt
-rcu/rcu_sched
-
-Each directory contains files for the corresponding flavor of RCU.
-Note that rcu/rcu_preempt is only present for CONFIG_PREEMPT_RCU.
-For CONFIG_TREE_RCU, the RCU flavor maps onto the RCU-sched flavor,
-so that activity for both appears in rcu/rcu_sched.
-
-In addition, the following file appears in the top-level directory:
-rcu/rcutorture. This file displays rcutorture test progress.
The output -of "cat rcu/rcutorture" looks as follows: - -rcutorture test sequence: 0 (test in progress) -rcutorture update version number: 615 - -The first line shows the number of rcutorture tests that have completed -since boot. If a test is currently running, the "(test in progress)" -string will appear as shown above. The second line shows the number of -update cycles that the current test has started, or zero if there is -no test in progress. - - -Within each flavor directory (rcu/rcu_bh, rcu/rcu_sched, and possibly -also rcu/rcu_preempt) the following files will be present: - -rcudata: - Displays fields in struct rcu_data. -rcuexp: - Displays statistics for expedited grace periods. -rcugp: - Displays grace-period counters. -rcuhier: - Displays the struct rcu_node hierarchy. -rcu_pending: - Displays counts of the reasons rcu_pending() decided that RCU had - work to do. -rcuboost: - Displays RCU boosting statistics. Only present if - CONFIG_RCU_BOOST=y. - -The output of "cat rcu/rcu_preempt/rcudata" looks as follows: - - 0!c=30455 g=30456 cnq=1/0:1 dt=126535/140000000000000/0 df=2002 of=4 ql=0/0 qs=N... b=10 ci=74572 nci=0 co=1131 ca=716 - 1!c=30719 g=30720 cnq=1/0:0 dt=132007/140000000000000/0 df=1874 of=10 ql=0/0 qs=N... b=10 ci=123209 nci=0 co=685 ca=982 - 2!c=30150 g=30151 cnq=1/1:1 dt=138537/140000000000000/0 df=1707 of=8 ql=0/0 qs=N... b=10 ci=80132 nci=0 co=1328 ca=1458 - 3 c=31249 g=31250 cnq=1/1:0 dt=107255/140000000000000/0 df=1749 of=6 ql=0/450 qs=NRW. b=10 ci=151700 nci=0 co=509 ca=622 - 4!c=29502 g=29503 cnq=1/0:1 dt=83647/140000000000000/0 df=965 of=5 ql=0/0 qs=N... b=10 ci=65643 nci=0 co=1373 ca=1521 - 5 c=31201 g=31202 cnq=1/0:1 dt=70422/0/0 df=535 of=7 ql=0/0 qs=.... b=10 ci=58500 nci=0 co=764 ca=698 - 6!c=30253 g=30254 cnq=1/0:1 dt=95363/140000000000000/0 df=780 of=5 ql=0/0 qs=N... b=10 ci=100607 nci=0 co=1414 ca=1353 - 7 c=31178 g=31178 cnq=1/0:0 dt=91536/0/0 df=547 of=4 ql=0/0 qs=.... 
b=10 ci=109819 nci=0 co=1115 ca=969 - -This file has one line per CPU, or eight for this 8-CPU system. -The fields are as follows: - -o The number at the beginning of each line is the CPU number. - CPUs numbers followed by an exclamation mark are offline, - but have been online at least once since boot. There will be - no output for CPUs that have never been online, which can be - a good thing in the surprisingly common case where NR_CPUS is - substantially larger than the number of actual CPUs. - -o "c" is the count of grace periods that this CPU believes have - completed. Offlined CPUs and CPUs in dynticks idle mode may lag - quite a ways behind, for example, CPU 4 under "rcu_sched" above, - which has been offline through 16 RCU grace periods. It is not - unusual to see offline CPUs lagging by thousands of grace periods. - Note that although the grace-period number is an unsigned long, - it is printed out as a signed long to allow more human-friendly - representation near boot time. - -o "g" is the count of grace periods that this CPU believes have - started. Again, offlined CPUs and CPUs in dynticks idle mode - may lag behind. If the "c" and "g" values are equal, this CPU - has already reported a quiescent state for the last RCU grace - period that it is aware of, otherwise, the CPU believes that it - owes RCU a quiescent state. - -o "pq" indicates that this CPU has passed through a quiescent state - for the current grace period. It is possible for "pq" to be - "1" and "c" different than "g", which indicates that although - the CPU has passed through a quiescent state, either (1) this - CPU has not yet reported that fact, (2) some other CPU has not - yet reported for this grace period, or (3) both. - -o "qp" indicates that RCU still expects a quiescent state from - this CPU. Offlined CPUs and CPUs in dyntick idle mode might - well have qp=1, which is OK: RCU is still ignoring them. 
- -o "dt" is the current value of the dyntick counter that is incremented - when entering or leaving idle, either due to a context switch or - due to an interrupt. This number is even if the CPU is in idle - from RCU's viewpoint and odd otherwise. The number after the - first "/" is the interrupt nesting depth when in idle state, - or a large number added to the interrupt-nesting depth when - running a non-idle task. Some architectures do not accurately - count interrupt nesting when running in non-idle kernel context, - which can result in interesting anomalies such as negative - interrupt-nesting levels. The number after the second "/" - is the NMI nesting depth. - -o "df" is the number of times that some other CPU has forced a - quiescent state on behalf of this CPU due to this CPU being in - idle state. - -o "of" is the number of times that some other CPU has forced a - quiescent state on behalf of this CPU due to this CPU being - offline. In a perfect world, this might never happen, but it - turns out that offlining and onlining a CPU can take several grace - periods, and so there is likely to be an extended period of time - when RCU believes that the CPU is online when it really is not. - Please note that erring in the other direction (RCU believing a - CPU is offline when it is really alive and kicking) is a fatal - error, so it makes sense to err conservatively. - -o "ql" is the number of RCU callbacks currently residing on - this CPU. The first number is the number of "lazy" callbacks - that are known to RCU to only be freeing memory, and the number - after the "/" is the total number of callbacks, lazy or not. - These counters count callbacks regardless of what phase of - grace-period processing that they are in (new, waiting for - grace period to start, waiting for grace period to end, ready - to invoke). 
- -o "qs" gives an indication of the state of the callback queue - with four characters: - - "N" Indicates that there are callbacks queued that are not - ready to be handled by the next grace period, and thus - will be handled by the grace period following the next - one. - - "R" Indicates that there are callbacks queued that are - ready to be handled by the next grace period. - - "W" Indicates that there are callbacks queued that are - waiting on the current grace period. - - "D" Indicates that there are callbacks queued that have - already been handled by a prior grace period, and are - thus waiting to be invoked. Note that callbacks in - the process of being invoked are not counted here. - Callbacks in the process of being invoked are those - that have been removed from the rcu_data structures - queues by rcu_do_batch(), but which have not yet been - invoked. - - If there are no callbacks in a given one of the above states, - the corresponding character is replaced by ".". - -o "b" is the batch limit for this CPU. If more than this number - of RCU callbacks is ready to invoke, then the remainder will - be deferred. - -o "ci" is the number of RCU callbacks that have been invoked for - this CPU. Note that ci+nci+ql is the number of callbacks that have - been registered in absence of CPU-hotplug activity. - -o "nci" is the number of RCU callbacks that have been offloaded from - this CPU. This will always be zero unless the kernel was built - with CONFIG_RCU_NOCB_CPU=y and the "rcu_nocbs=" kernel boot - parameter was specified. - -o "co" is the number of RCU callbacks that have been orphaned due to - this CPU going offline. These orphaned callbacks have been moved - to an arbitrarily chosen online CPU. - -o "ca" is the number of RCU callbacks that have been adopted by this - CPU due to other CPUs going offline. Note that ci+co-ca+ql is - the number of RCU callbacks registered on this CPU. 
- - -Kernels compiled with CONFIG_RCU_BOOST=y display the following from -/debug/rcu/rcu_preempt/rcudata: - - 0!c=12865 g=12866 cnq=1/0:1 dt=83113/140000000000000/0 df=288 of=11 ql=0/0 qs=N... kt=0/O ktl=944 b=10 ci=60709 nci=0 co=748 ca=871 - 1 c=14407 g=14408 cnq=1/0:0 dt=100679/140000000000000/0 df=378 of=7 ql=0/119 qs=NRW. kt=0/W ktl=9b6 b=10 ci=109740 nci=0 co=589 ca=485 - 2 c=14407 g=14408 cnq=1/0:0 dt=105486/0/0 df=90 of=9 ql=0/89 qs=NRW. kt=0/W ktl=c0c b=10 ci=83113 nci=0 co=533 ca=490 - 3 c=14407 g=14408 cnq=1/0:0 dt=107138/0/0 df=142 of=8 ql=0/188 qs=NRW. kt=0/W ktl=b96 b=10 ci=121114 nci=0 co=426 ca=290 - 4 c=14405 g=14406 cnq=1/0:1 dt=50238/0/0 df=706 of=7 ql=0/0 qs=.... kt=0/W ktl=812 b=10 ci=34929 nci=0 co=643 ca=114 - 5!c=14168 g=14169 cnq=1/0:0 dt=45465/140000000000000/0 df=161 of=11 ql=0/0 qs=N... kt=0/O ktl=b4d b=10 ci=47712 nci=0 co=677 ca=722 - 6 c=14404 g=14405 cnq=1/0:0 dt=59454/0/0 df=94 of=6 ql=0/0 qs=.... kt=0/W ktl=e57 b=10 ci=55597 nci=0 co=701 ca=811 - 7 c=14407 g=14408 cnq=1/0:1 dt=68850/0/0 df=31 of=8 ql=0/0 qs=.... kt=0/W ktl=14bd b=10 ci=77475 nci=0 co=508 ca=1042 - -This is similar to the output discussed above, but contains the following -additional fields: - -o "kt" is the per-CPU kernel-thread state. The digit preceding - the first slash is zero if there is no work pending and 1 - otherwise. The character between the first pair of slashes is - as follows: - - "S" The kernel thread is stopped, in other words, all - CPUs corresponding to this rcu_node structure are - offline. - - "R" The kernel thread is running. - - "W" The kernel thread is waiting because there is no work - for it to do. - - "O" The kernel thread is waiting because it has been - forced off of its designated CPU or because its - ->cpus_allowed mask permits it to run on other than - its designated CPU. - - "Y" The kernel thread is yielding to avoid hogging CPU. - - "?" Unknown value, indicates a bug. 
- - The number after the final slash is the CPU that the kthread - is actually running on. - - This field is displayed only for CONFIG_RCU_BOOST kernels. - -o "ktl" is the low-order 16 bits (in hexadecimal) of the count of - the number of times that this CPU's per-CPU kthread has gone - through its loop servicing invoke_rcu_cpu_kthread() requests. - - This field is displayed only for CONFIG_RCU_BOOST kernels. - - -The output of "cat rcu/rcu_preempt/rcuexp" looks as follows: - -s=21872 wd1=0 wd2=0 wd3=5 enq=0 sc=21872 - -These fields are as follows: - -o "s" is the sequence number, with an odd number indicating that - an expedited grace period is in progress. - -o "wd1", "wd2", and "wd3" are the number of times that an attempt - to start an expedited grace period found that someone else had - completed an expedited grace period that satisfies the attempted - request. "Our work is done." - -o "enq" is the number of quiescent states still outstanding. - -o "sc" is the number of times that the attempt to start a - new expedited grace period succeeded. - - -The output of "cat rcu/rcu_preempt/rcugp" looks as follows: - -completed=31249 gpnum=31250 age=1 max=18 - -These fields are taken from the rcu_state structure, and are as follows: - -o "completed" is the number of grace periods that have completed. - It is comparable to the "c" field from rcu/rcudata in that a - CPU whose "c" field matches the value of "completed" is aware - that the corresponding RCU grace period has completed. - -o "gpnum" is the number of grace periods that have started. It is - similarly comparable to the "g" field from rcu/rcudata in that - a CPU whose "g" field matches the value of "gpnum" is aware that - the corresponding RCU grace period has started. - - If these two fields are equal, then there is no grace period - in progress, in other words, RCU is idle. On the other hand, - if the two fields differ (as they are above), then an RCU grace - period is in progress. 
- -o "age" is the number of jiffies that the current grace period - has extended for, or zero if there is no grace period currently - in effect. - -o "max" is the age in jiffies of the longest-duration grace period - thus far. - -The output of "cat rcu/rcu_preempt/rcuhier" looks as follows: - -c=14407 g=14408 s=0 jfq=2 j=c863 nfqs=12040/nfqsng=0(12040) fqlh=1051 oqlen=0/0 -3/3 ..>. 0:7 ^0 -e/e ..>. 0:3 ^0 d/d ..>. 4:7 ^1 - -The fields are as follows: - -o "c" is exactly the same as "completed" under rcu/rcu_preempt/rcugp. - -o "g" is exactly the same as "gpnum" under rcu/rcu_preempt/rcugp. - -o "s" is the current state of the force_quiescent_state() - state machine. - -o "jfq" is the number of jiffies remaining for this grace period - before force_quiescent_state() is invoked to help push things - along. Note that CPUs in idle mode throughout the grace period - will not report on their own, but rather must be check by some - other CPU via force_quiescent_state(). - -o "j" is the low-order four hex digits of the jiffies counter. - Yes, Paul did run into a number of problems that turned out to - be due to the jiffies counter no longer counting. Why do you ask? - -o "nfqs" is the number of calls to force_quiescent_state() since - boot. - -o "nfqsng" is the number of useless calls to force_quiescent_state(), - where there wasn't actually a grace period active. This can - no longer happen due to grace-period processing being pushed - into a kthread. The number in parentheses is the difference - between "nfqs" and "nfqsng", or the number of times that - force_quiescent_state() actually did some real work. - -o "fqlh" is the number of calls to force_quiescent_state() that - exited immediately (without even being counted in nfqs above) - due to contention on ->fqslock. - -o Each element of the form "3/3 ..>. 0:7 ^0" represents one rcu_node - structure. Each line represents one level of the hierarchy, - from root to leaves. 
It is best to think of the rcu_data - structures as forming yet another level after the leaves. - Note that there might be either one, two, three, or even four - levels of rcu_node structures, depending on the relationship - between CONFIG_RCU_FANOUT, CONFIG_RCU_FANOUT_LEAF (possibly - adjusted using the rcu_fanout_leaf kernel boot parameter), and - CONFIG_NR_CPUS (possibly adjusted using the nr_cpu_ids count of - possible CPUs for the booting hardware). - - o The numbers separated by the "/" are the qsmask followed - by the qsmaskinit. The qsmask will have one bit - set for each entity in the next lower level that has - not yet checked in for the current grace period ("e" - indicating CPUs 5, 6, and 7 in the example above). - The qsmaskinit will have one bit for each entity that is - currently expected to check in during each grace period. - The value of qsmaskinit is assigned to that of qsmask - at the beginning of each grace period. - - o The characters separated by the ">" indicate the state - of the blocked-tasks lists. A "G" preceding the ">" - indicates that at least one task blocked in an RCU - read-side critical section blocks the current grace - period, while a "E" preceding the ">" indicates that - at least one task blocked in an RCU read-side critical - section blocks the current expedited grace period. - A "T" character following the ">" indicates that at - least one task is blocked within an RCU read-side - critical section, regardless of whether any current - grace period (expedited or normal) is inconvenienced. - A "." character appears if the corresponding condition - does not hold, so that "..>." indicates that no tasks - are blocked. In contrast, "GE>T" indicates maximal - inconvenience from blocked tasks. CONFIG_TREE_RCU - builds of the kernel will always show "..>.". - - o The numbers separated by the ":" are the range of CPUs - served by this struct rcu_node. This can be helpful - in working out how the hierarchy is wired together. 
- - For example, the example rcu_node structure shown above - has "0:7", indicating that it covers CPUs 0 through 7. - - o The number after the "^" indicates the bit in the - next higher level rcu_node structure that this rcu_node - structure corresponds to. For example, the "d/d ..>. 4:7 - ^1" has a "1" in this position, indicating that it - corresponds to the "1" bit in the "3" shown in the - "3/3 ..>. 0:7 ^0" entry on the next level up. - - -The output of "cat rcu/rcu_sched/rcu_pending" looks as follows: - - 0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903 ndw=0 - 1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113 ndw=0 - 2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889 ndw=0 - 3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469 ndw=0 - 4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042 ndw=0 - 5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422 ndw=0 - 6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699 ndw=0 - 7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147 ndw=0 - -The fields are as follows: - -o The leading number is the CPU number, with "!" indicating - an offline CPU. - -o "np" is the number of times that __rcu_pending() has been invoked - for the corresponding flavor of RCU. - -o "qsp" is the number of times that the RCU was waiting for a - quiescent state from this CPU. - -o "rpq" is the number of times that the CPU had passed through - a quiescent state, but not yet reported it to RCU. - -o "cbr" is the number of times that this CPU had RCU callbacks - that had passed through a grace period, and were thus ready - to be invoked. - -o "cng" is the number of times that this CPU needed another - grace period while RCU was idle. - -o "gpc" is the number of times that an old grace period had - completed, but this CPU was not yet aware of it. 
- -o "gps" is the number of times that a new grace period had started, - but this CPU was not yet aware of it. - -o "ndw" is the number of times that a wakeup of an rcuo - callback-offload kthread had to be deferred in order to avoid - deadlock. - -o "nn" is the number of times that this CPU needed nothing. - - -The output of "cat rcu/rcuboost" looks as follows: - -0:3 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=c864 bt=c894 - balk: nt=0 egt=4695 bt=0 nb=0 ny=56 nos=0 -4:7 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=c864 bt=c894 - balk: nt=0 egt=6541 bt=0 nb=0 ny=126 nos=0 - -This information is output only for rcu_preempt. Each two-line entry -corresponds to a leaf rcu_node structure. The fields are as follows: - -o "n:m" is the CPU-number range for the corresponding two-line - entry. In the sample output above, the first entry covers - CPUs zero through three and the second entry covers CPUs four - through seven. - -o "tasks=TNEB" gives the state of the various segments of the - rnp->blocked_tasks list: - - "T" This indicates that there are some tasks that blocked - while running on one of the corresponding CPUs while - in an RCU read-side critical section. - - "N" This indicates that some of the blocked tasks are preventing - the current normal (non-expedited) grace period from - completing. - - "E" This indicates that some of the blocked tasks are preventing - the current expedited grace period from completing. - - "B" This indicates that some of the blocked tasks are in - need of RCU priority boosting. - - Each character is replaced with "." if the corresponding - condition does not hold. - -o "kt" is the state of the RCU priority-boosting kernel - thread associated with the corresponding rcu_node structure. - The state can be one of the following: - - "S" The kernel thread is stopped, in other words, all - CPUs corresponding to this rcu_node structure are - offline. - - "R" The kernel thread is running. 
- - "W" The kernel thread is waiting because there is no work - for it to do. - - "Y" The kernel thread is yielding to avoid hogging CPU. - - "?" Unknown value, indicates a bug. - -o "ntb" is the number of tasks boosted. - -o "neb" is the number of tasks boosted in order to complete an - expedited grace period. - -o "nnb" is the number of tasks boosted in order to complete a - normal (non-expedited) grace period. When boosting a task - that was blocking both an expedited and a normal grace period, - it is counted against the expedited total above. - -o "j" is the low-order 16 bits of the jiffies counter in - hexadecimal. - -o "bt" is the low-order 16 bits of the value that the jiffies - counter will have when we next start boosting, assuming that - the current grace period does not end beforehand. This is - also in hexadecimal. - -o "balk: nt" counts the number of times we didn't boost (in - other words, we balked) even though it was time to boost because - there were no blocked tasks to boost. This situation occurs - when there is one blocked task on one rcu_node structure and - none on some other rcu_node structure. - -o "egt" counts the number of times we balked because although - there were blocked tasks, none of them were blocking the - current grace period, whether expedited or otherwise. - -o "bt" counts the number of times we balked because boosting - had already been initiated for the current grace period. - -o "nb" counts the number of times we balked because there - was at least one task blocking the current non-expedited grace - period that never had blocked. If it is already running, it - just won't help to boost its priority! - -o "ny" counts the number of times we balked because it was - not yet time to start boosting. - -o "nos" counts the number of times we balked for other - reasons, e.g., the grace period ended first. 
-
-
-CONFIG_TINY_RCU debugfs Files and Formats
-
-These implementations of RCU provides a single debugfs file under the
-top-level directory RCU, namely rcu/rcudata, which displays fields in
-rcu_bh_ctrlblk and rcu_sched_ctrlblk.
-
-The output of "cat rcu/rcudata" is as follows:
-
-rcu_sched: qlen: 0
-rcu_bh: qlen: 0
-
-This is split into rcu_sched and rcu_bh sections. The field is as
-follows:
-
-o "qlen" is the number of RCU callbacks currently waiting either
- for an RCU grace period or waiting to be invoked. This is the
- only field present for rcu_sched and rcu_bh, due to the
- short-circuiting of grace period in those two cases.
diff --git a/Documentation/acpi/acpi-lid.txt b/Documentation/acpi/acpi-lid.txt
index 22cb3091f297..effe7af3a5af 100644
--- a/Documentation/acpi/acpi-lid.txt
+++ b/Documentation/acpi/acpi-lid.txt
@@ -59,20 +59,28 @@ button driver uses the following 3 modes in order not to trigger issues.
 If the userspace hasn't been prepared to ignore the unreliable "opened"
 events and the unreliable initial state notification, Linux users can use
 the following kernel parameters to handle the possible issues:
-A. button.lid_init_state=open:
+A. button.lid_init_state=method:
+ When this option is specified, the ACPI button driver reports the
+ initial lid state using the returning value of the _LID control method
+ and whether the "opened"/"closed" events are paired fully relies on the
+ firmware implementation.
+ This option can be used to fix some platforms where the returning value
+ of the _LID control method is reliable but the initial lid state
+ notification is missing.
+ This option is the default behavior during the period the userspace
+ isn't ready to handle the buggy AML tables.
+B. button.lid_init_state=open:
 When this option is specified, the ACPI button driver always reports the
 initial lid state as "opened" and whether the "opened"/"closed" events
 are paired fully relies on the firmware implementation.
 This may fix some platforms where the returning value of the _LID
 control method is not reliable and the initial lid state notification is
 missing.
- This option is the default behavior during the period the userspace
- isn't ready to handle the buggy AML tables.
 If the userspace has been prepared to ignore the unreliable "opened" events
 and the unreliable initial state notification, Linux users should always
 use the following kernel parameter:
-B. button.lid_init_state=ignore:
+C. button.lid_init_state=ignore:
 When this option is specified, the ACPI button driver never reports the
 initial lid state and there is a compensation mechanism implemented to
 ensure that the reliable "closed" notifications can always be delievered
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 15f79c27748d..f59aad5c2270 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -866,6 +866,15 @@
 dscc4.setup= [NET]
+ dt_cpu_ftrs= [PPC]
+ Format: {"off" | "known"}
+ Control how the dt_cpu_ftrs device-tree binding is
+ used for CPU feature discovery and setup (if it
+ exists).
+ off: Do not use it, fall back to legacy cpu table.
+ known: Do not pass through unknown features to guests
+ or userspace, only those that the kernel is aware of.
+
 dump_apple_properties [X86]
 Dump name and content of EFI device properties on
 x86 Macs. Useful for driver authors to determine
@@ -3229,21 +3238,17 @@
 rcutree.gp_cleanup_delay= [KNL]
 Set the number of jiffies to delay each step of
- RCU grace-period cleanup. This only has effect
- when CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP is set.
+ RCU grace-period cleanup.
 rcutree.gp_init_delay= [KNL]
 Set the number of jiffies to delay each step of
- RCU grace-period initialization. This only has
- effect when CONFIG_RCU_TORTURE_TEST_SLOW_INIT
- is set.
+ RCU grace-period initialization.
rcutree.gp_preinit_delay= [KNL] Set the number of jiffies to delay each step of RCU grace-period pre-initialization, that is, the propagation of recent CPU-hotplug changes up - the rcu_node combining tree. This only has effect - when CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT is set. + the rcu_node combining tree. rcutree.rcu_fanout_exact= [KNL] Disable autobalancing of the rcu_node combining @@ -3319,6 +3324,17 @@ This wake_up() will be accompanied by a WARN_ONCE() splat and an ftrace_dump(). + rcuperf.gp_async= [KNL] + Measure performance of asynchronous + grace-period primitives such as call_rcu(). + + rcuperf.gp_async_max= [KNL] + Specify the maximum number of outstanding + callbacks per writer thread. When a writer + thread exceeds this limit, it invokes the + corresponding flavor of rcu_barrier() to allow + previously posted callbacks to drain. + rcuperf.gp_exp= [KNL] Measure performance of expedited synchronous grace-period primitives. @@ -3346,17 +3362,22 @@ rcuperf.perf_runnable= [BOOT] Start rcuperf running at boot time. + rcuperf.perf_type= [KNL] + Specify the RCU implementation to test. + rcuperf.shutdown= [KNL] Shut the system down after performance tests complete. This is useful for hands-off automated testing. - rcuperf.perf_type= [KNL] - Specify the RCU implementation to test. - rcuperf.verbose= [KNL] Enable additional printk() statements. + rcuperf.writer_holdoff= [KNL] + Write-side holdoff between grace periods, + in microseconds. The default of zero says + no holdoff. + rcutorture.cbflood_inter_holdoff= [KNL] Set holdoff time (jiffies) between successive callback-flood tests. @@ -3794,6 +3815,15 @@ spia_pedr= spia_peddr= + srcutree.counter_wrap_check [KNL] + Specifies how frequently to check for + grace-period sequence counter wrap for the + srcu_data structure's ->srcu_gp_seq_needed field. + The greater the number of bits set in this kernel + parameter, the less frequently counter wrap will + be checked for. 
Note that the bottom two bits + are ignored. + srcutree.exp_holdoff [KNL] Specifies how many nanoseconds must elapse since the end of the last SRCU grace period for @@ -3802,6 +3832,13 @@ expediting. Set to zero to disable automatic expediting. + stack_guard_gap= [MM] + override the default stack gap protection. The value + is in page units and it defines how many pages prior + to (for stacks growing down) resp. after (for stacks + growing up) the main stack are reserved for no other + mapping. Default value is 256 pages. + stacktrace [FTRACE] Enabled the stack tracer on boot up. diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 289c80f7760e..09aa2e949787 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -1,4 +1,5 @@ .. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>` +.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>` ======================= CPU Performance Scaling @@ -75,7 +76,7 @@ feedback registers, as that information is typically specific to the hardware interface it comes from and may not be easily represented in an abstract, platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers to bypass the governor layer and implement their own performance scaling -algorithms. That is done by the ``intel_pstate`` scaling driver. +algorithms. That is done by the |intel_pstate| scaling driver. ``CPUFreq`` Policy Objects @@ -174,13 +175,13 @@ necessary to restart the scaling governor so that it can take the new online CPU into account. That is achieved by invoking the governor's ``->stop`` and ``->start()`` callbacks, in this order, for the entire policy. -As mentioned before, the ``intel_pstate`` scaling driver bypasses the scaling +As mentioned before, the |intel_pstate| scaling driver bypasses the scaling governor layer of ``CPUFreq`` and provides its own P-state selection algorithms. 
-Consequently, if ``intel_pstate`` is used, scaling governors are not attached to +Consequently, if |intel_pstate| is used, scaling governors are not attached to new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked to register per-CPU utilization update callbacks for each policy. These callbacks are invoked by the CPU scheduler in the same way as for scaling -governors, but in the ``intel_pstate`` case they both determine the P-state to +governors, but in the |intel_pstate| case they both determine the P-state to use and change the hardware configuration accordingly in one go from scheduler context. @@ -257,7 +258,7 @@ are the following: ``scaling_available_governors`` List of ``CPUFreq`` scaling governors present in the kernel that can - be attached to this policy or (if the ``intel_pstate`` scaling driver is + be attached to this policy or (if the |intel_pstate| scaling driver is in use) list of scaling algorithms provided by the driver that can be applied to this policy. @@ -274,7 +275,7 @@ are the following: the CPU is actually running at (due to hardware design and other limitations). - Some scaling drivers (e.g. ``intel_pstate``) attempt to provide + Some scaling drivers (e.g. |intel_pstate|) attempt to provide information more precisely reflecting the current CPU frequency through this attribute, but that still may not be the exact current CPU frequency as seen by the hardware at the moment. @@ -284,13 +285,13 @@ are the following: ``scaling_governor`` The scaling governor currently attached to this policy or (if the - ``intel_pstate`` scaling driver is in use) the scaling algorithm + |intel_pstate| scaling driver is in use) the scaling algorithm provided by the driver that is currently applied to this policy. 
This attribute is read-write and writing to it will cause a new scaling governor to be attached to this policy or a new scaling algorithm provided by the scaling driver to be applied to it (in the - ``intel_pstate`` case), as indicated by the string written to this + |intel_pstate| case), as indicated by the string written to this attribute (which must be one of the names listed by the ``scaling_available_governors`` attribute described above). @@ -619,7 +620,7 @@ This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls the "boost" setting for the whole system. It is not present if the underlying scaling driver does not support the frequency boost mechanism (or supports it, but provides a driver-specific interface for controlling it, like -``intel_pstate``). +|intel_pstate|). If the value in this file is 1, the frequency boost mechanism is enabled. This means that either the hardware can be put into states in which it is able to diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst index c80f087321fc..7f148f76f432 100644 --- a/Documentation/admin-guide/pm/index.rst +++ b/Documentation/admin-guide/pm/index.rst @@ -6,6 +6,7 @@ Power Management :maxdepth: 2 cpufreq + intel_pstate .. only:: subproject and html diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst new file mode 100644 index 000000000000..33d703989ea8 --- /dev/null +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -0,0 +1,755 @@ +=============================================== +``intel_pstate`` CPU Performance Scaling Driver +=============================================== + +:: + + Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> + + +General Information +=================== + +``intel_pstate`` is a part of the +:doc:`CPU performance scaling subsystem <cpufreq>` in the Linux kernel +(``CPUFreq``). 
It is a scaling driver for the Sandy Bridge and later +generations of Intel processors. Note, however, that some of those processors +may not be supported. [To understand ``intel_pstate`` it is necessary to know +how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if +you have not done that yet.] + +For the processors supported by ``intel_pstate``, the P-state concept is broader +than just an operating frequency or an operating performance point (see the +`LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more +information about that). For this reason, the representation of P-states used +by ``intel_pstate`` internally follows the hardware specification (for details +refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual +Volume 3: System Programming Guide <SDM_>`_). However, the ``CPUFreq`` core +uses frequencies for identifying operating performance points of CPUs and +frequencies are involved in the user space interface exposed by it, so +``intel_pstate`` maps its internal representation of P-states to frequencies too +(fortunately, that mapping is unambiguous). At the same time, it would not be +practical for ``intel_pstate`` to supply the ``CPUFreq`` core with a table of +available frequencies due to the possible size of it, so the driver does not do +that. Some functionality of the core is limited by that. + +Since the hardware P-state selection interface used by ``intel_pstate`` is +available at the logical CPU level, the driver always works with individual +CPUs. Consequently, if ``intel_pstate`` is in use, every ``CPUFreq`` policy +object corresponds to one logical CPU and ``CPUFreq`` policies are effectively +equivalent to CPUs. In particular, this means that they become "inactive" every +time the corresponding CPU is taken offline and need to be re-initialized when +it goes back online. 
+ +``intel_pstate`` is not modular, so it cannot be unloaded, which means that the +only way to pass early-configuration-time parameters to it is via the kernel +command line. However, its configuration can be adjusted via ``sysfs`` to a +great extent. In some configurations it even is possible to unregister it via +``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and +registered (see `below <status_attr_>`_). + + +Operation Modes +=============== + +``intel_pstate`` can operate in three different modes: in the active mode with +or without hardware-managed P-states support and in the passive mode. Which of +them will be in effect depends on what kernel command line options are used and +on the capabilities of the processor. + +Active Mode +----------- + +This is the default operation mode of ``intel_pstate``. If it works in this +mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` +policies contains the string "intel_pstate". + +In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and +provides its own scaling algorithms for P-state selection. Those algorithms +can be applied to ``CPUFreq`` policies in the same way as generic scaling +governors (that is, through the ``scaling_governor`` policy attribute in +``sysfs``). [Note that different P-state selection algorithms may be chosen for +different policies, but that is not recommended.] + +They are not generic scaling governors, but their names are the same as the +names of some of those governors. Moreover, confusingly enough, they generally +do not work in the same way as the generic governors they share the names with. +For example, the ``powersave`` P-state selection algorithm provided by +``intel_pstate`` is not a counterpart of the generic ``powersave`` governor +(roughly, it corresponds to the ``schedutil`` and ``ondemand`` governors). 
+ +There are two P-state selection algorithms provided by ``intel_pstate`` in the +active mode: ``powersave`` and ``performance``. The way they both operate +depends on whether or not the hardware-managed P-states (HWP) feature has been +enabled in the processor and possibly on the processor model. + +Which of the P-state selection algorithms is used by default depends on the +:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option. +Namely, if that option is set, the ``performance`` algorithm will be used by +default, and the other one will be used by default if it is not set. + +Active Mode With HWP +~~~~~~~~~~~~~~~~~~~~ + +If the processor supports the HWP feature, it will be enabled during the +processor initialization and cannot be disabled after that. It is possible +to avoid enabling it by passing the ``intel_pstate=no_hwp`` argument to the +kernel in the command line. + +If the HWP feature has been enabled, ``intel_pstate`` relies on the processor to +select P-states by itself, but still it can give hints to the processor's +internal P-state selection logic. What those hints are depends on which P-state +selection algorithm has been applied to the given policy (or to the CPU it +corresponds to). + +Even though the P-state selection is carried out by the processor automatically, +``intel_pstate`` registers utilization update callbacks with the CPU scheduler +in this mode. However, they are not used for running a P-state selection +algorithm, but for periodic updates of the current CPU frequency information to +be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs``. + +HWP + ``performance`` +..................... + +In this configuration ``intel_pstate`` will write 0 to the processor's +Energy-Performance Preference (EPP) knob (if supported) or its +Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's +internal P-state selection logic is expected to focus entirely on performance. 
+ +This will override the EPP/EPB setting coming from the ``sysfs`` interface +(see `Energy vs Performance Hints`_ below). + +Also, in this configuration the range of P-states available to the processor's +internal P-state selection logic is always restricted to the upper boundary +(that is, the maximum P-state that the driver is allowed to use). + +HWP + ``powersave`` +................... + +In this configuration ``intel_pstate`` will set the processor's +Energy-Performance Preference (EPP) knob (if supported) or its +Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was +previously set to via ``sysfs`` (or whatever default value it was +set to by the platform firmware). This usually causes the processor's +internal P-state selection logic to be less performance-focused. + +Active Mode Without HWP +~~~~~~~~~~~~~~~~~~~~~~~ + +This is the default operation mode for processors that do not support the HWP +feature. It also is used by default with the ``intel_pstate=no_hwp`` argument +in the kernel command line. However, in this mode ``intel_pstate`` may refuse +to work with the given processor if it does not recognize it. [Note that +``intel_pstate`` will never refuse to work with any processor with the HWP +feature enabled.] + +In this mode ``intel_pstate`` registers utilization update callbacks with the +CPU scheduler in order to run a P-state selection algorithm, either +``powersave`` or ``performance``, depending on the ``scaling_governor`` policy +setting in ``sysfs``. The current CPU frequency information to be made +available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is +periodically updated by those utilization update callbacks too. + +``performance`` +............... + +Without HWP, this P-state selection algorithm is always the same regardless of +the processor model and platform configuration.
+ +It selects the maximum P-state it is allowed to use, subject to limits set via +``sysfs``, every time the P-state selection computations are carried out by the +driver's utilization update callback for the given CPU (that does not happen +more often than every 10 ms), but the hardware configuration will not be changed +if the new P-state is the same as the current one. + +This is the default P-state selection algorithm if the +:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option +is set. + +``powersave`` +............. + +Without HWP, this P-state selection algorithm generally depends on the +processor model and/or the system profile setting in the ACPI tables and there +are two variants of it. + +One of them is used with processors from the Atom line and (regardless of the +processor model) on platforms with the system profile in the ACPI tables set to +"mobile" (laptops mostly), "tablet", "appliance PC", "desktop", or +"workstation". It is also used with processors supporting the HWP feature if +that feature has not been enabled (that is, with the ``intel_pstate=no_hwp`` +argument in the kernel command line). It is similar to the algorithm +implemented by the generic ``schedutil`` scaling governor except that the +utilization metric used by it is based on numbers coming from feedback +registers of the CPU. It generally selects P-states proportional to the +current CPU utilization, so it is referred to as the "proportional" algorithm. + +The second variant of the ``powersave`` P-state selection algorithm, used in all +of the other cases (generally, on processors from the Core line, so it is +referred to as the "Core" algorithm), is based on the values read from the APERF +and MPERF feedback registers and the previously requested target P-state. 
+It does not really take CPU utilization into account explicitly, but as a rule +it causes the CPU P-state to ramp up very quickly in response to increased +utilization which is generally desirable in server environments. + +Regardless of the variant, this algorithm is run by the driver's utilization +update callback for the given CPU when it is invoked by the CPU scheduler, but +not more often than every 10 ms (that can be tweaked via ``debugfs`` in `this +particular case <Tuning Interface in debugfs_>`_). Like in the ``performance`` +case, the hardware configuration is not touched if the new P-state turns out to +be the same as the current one. + +This is the default P-state selection algorithm if the +:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option +is not set. + +Passive Mode +------------ + +This mode is used if the ``intel_pstate=passive`` argument is passed to the +kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting too). +Like in the active mode without HWP support, in this mode ``intel_pstate`` may +refuse to work with the given processor if it does not recognize it. + +If the driver works in this mode, the ``scaling_driver`` policy attribute in +``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq". +Then, the driver behaves like a regular ``CPUFreq`` scaling driver. That is, +it is invoked by generic scaling governors when necessary to talk to the +hardware in order to change the P-state of a CPU (in particular, the +``schedutil`` governor can invoke it directly from scheduler context). + +While in this mode, ``intel_pstate`` can be used with all of the (generic) +scaling governors listed by the ``scaling_available_governors`` policy attribute +in ``sysfs`` (and the P-state selection algorithms described above are not +used). 
Then, it is responsible for the configuration of policy objects +corresponding to CPUs and provides the ``CPUFreq`` core (and the scaling +governors attached to the policy objects) with accurate information on the +maximum and minimum operating frequencies supported by the hardware (including +the so-called "turbo" frequency ranges). In other words, in the passive mode +the entire range of available P-states is exposed by ``intel_pstate`` to the +``CPUFreq`` core. However, in this mode the driver does not register +utilization update callbacks with the CPU scheduler and the ``scaling_cur_freq`` +information comes from the ``CPUFreq`` core (and is the last frequency selected +by the current scaling governor for the given policy). + + +.. _turbo: + +Turbo P-states Support +====================== + +In the majority of cases, the entire range of P-states available to +``intel_pstate`` can be divided into two sub-ranges that correspond to +different types of processor behavior, above and below a boundary that +will be referred to as the "turbo threshold" in what follows. + +The P-states above the turbo threshold are referred to as "turbo P-states" and +the whole sub-range of P-states they belong to is referred to as the "turbo +range". These names are related to the Turbo Boost technology allowing a +multicore processor to opportunistically increase the P-state of one or more +cores if there is enough power to do that and if that is not going to cause the +thermal envelope of the processor package to be exceeded. + +Specifically, if software sets the P-state of a CPU core within the turbo range +(that is, above the turbo threshold), the processor is permitted to take over +performance scaling control for that core and put it into turbo P-states of its +choice going forward. However, that permission is interpreted differently by +different processor generations. 
Namely, the Sandy Bridge generation of +processors will never use any P-states above the last one set by software for +the given core, even if it is within the turbo range, whereas all of the later +processor generations will take it as a license to use any P-states from the +turbo range, even above the one set by software. In other words, on those +processors setting any P-state from the turbo range will enable the processor +to put the given core into all turbo P-states up to and including the maximum +supported one as it sees fit. + +One important property of turbo P-states is that they are not sustainable. More +precisely, there is no guarantee that any CPUs will be able to stay in any of +those states indefinitely, because the power distribution within the processor +package may change over time or the thermal envelope it was designed for might +be exceeded if a turbo P-state was used for too long. + +In turn, the P-states below the turbo threshold generally are sustainable. In +fact, if one of them is set by software, the processor is not expected to change +it to a lower one unless in a thermal stress or a power limit violation +situation (a higher P-state may still be used if it is set for another CPU in +the same package at the same time, for example). + +Some processors allow multiple cores to be in turbo P-states at the same time, +but the maximum P-state that can be set for them generally depends on the number +of cores running concurrently. The maximum turbo P-state that can be set for 3 +cores at the same time usually is lower than the analogous maximum P-state for +2 cores, which in turn usually is lower than the maximum turbo P-state that can +be set for 1 core. The one-core maximum turbo P-state is thus the maximum +supported one overall. 
+ +The maximum supported turbo P-state, the turbo threshold (the maximum supported +non-turbo P-state) and the minimum supported P-state are specific to the +processor model and can be determined by reading the processor's model-specific +registers (MSRs). Moreover, some processors support the Configurable TDP +(Thermal Design Power) feature and, when that feature is enabled, the turbo +threshold effectively becomes a configurable value that can be set by the +platform firmware. + +Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes +the entire range of available P-states, including the whole turbo range, to the +``CPUFreq`` core and (in the passive mode) to generic scaling governors. This +generally causes turbo P-states to be set more often when ``intel_pstate`` is +used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_ +for more information). + +Moreover, since ``intel_pstate`` always knows what the real turbo threshold is +(even if the Configurable TDP feature is enabled in the processor), its +``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should +work as expected in all cases (that is, if set to disable turbo P-states, it +always should prevent ``intel_pstate`` from using them). + + +Processor Support +================= + +To handle a given processor ``intel_pstate`` requires a number of different +pieces of information on it to be known, including: + + * The minimum supported P-state. + + * The maximum supported `non-turbo P-state <turbo_>`_. + + * Whether or not turbo P-states are supported at all. + + * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states + are supported). + + * The scaling formula to translate the driver's internal representation + of P-states into frequencies and the other way around. + +Generally, ways to obtain that information are specific to the processor model +or family. 
Although it often is possible to obtain all of it from the processor +itself (using model-specific registers), there are cases in which hardware +manuals need to be consulted to get to it too. + +For this reason, there is a list of supported processors in ``intel_pstate`` and +the driver initialization will fail if the detected processor is not in that +list, unless it supports the `HWP feature <Active Mode_>`_. [The interface to +obtain all of the information listed above is the same for all of the processors +supporting the HWP feature, which is why they all are supported by +``intel_pstate``.] + + +User Space Interface in ``sysfs`` +================================= + +Global Attributes +----------------- + +``intel_pstate`` exposes several global attributes (files) in ``sysfs`` to +control its functionality at the system level. They are located in the +``/sys/devices/system/cpu/cpufreq/intel_pstate/`` directory and affect all +CPUs. + +Some of them are not present if the ``intel_pstate=per_cpu_perf_limits`` +argument is passed to the kernel in the command line. + +``max_perf_pct`` + Maximum P-state the driver is allowed to set in percent of the + maximum supported performance level (the highest supported `turbo + P-state <turbo_>`_). + + This attribute will not be exposed if the + ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel + command line. + +``min_perf_pct`` + Minimum P-state the driver is allowed to set in percent of the + maximum supported performance level (the highest supported `turbo + P-state <turbo_>`_). + + This attribute will not be exposed if the + ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel + command line. + +``num_pstates`` + Number of P-states supported by the processor (between 0 and 255 + inclusive) including both turbo and non-turbo P-states (see + `Turbo P-states Support`_). + + The value of this attribute is not affected by the ``no_turbo`` + setting described `below <no_turbo_attr_>`_. 
+ + This attribute is read-only. + +``turbo_pct`` + Ratio of the `turbo range <turbo_>`_ size to the size of the entire + range of supported P-states, in percent. + + This attribute is read-only. + +.. _no_turbo_attr: + +``no_turbo`` + If set (equal to 1), the driver is not allowed to set any turbo P-states + (see `Turbo P-states Support`_). If unset (equal to 0, which is the + default), turbo P-states can be set by the driver. + [Note that ``intel_pstate`` does not support the general ``boost`` + attribute (supported by some other scaling drivers) which is replaced + by this one.] + + This attribute does not affect the maximum supported frequency value + supplied to the ``CPUFreq`` core and exposed via the policy interface, + but it affects the maximum possible value of per-policy P-state limits + (see `Interpretation of Policy Attributes`_ below for details). + +.. _status_attr: + +``status`` + Operation mode of the driver: "active", "passive" or "off". + + "active" + The driver is functional and in the `active mode + <Active Mode_>`_. + + "passive" + The driver is functional and in the `passive mode + <Passive Mode_>`_. + + "off" + The driver is not functional (it is not registered as a scaling + driver with the ``CPUFreq`` core). + + This attribute can be written to in order to change the driver's + operation mode or to unregister it. The string written to it must be + one of the possible values of it and, if successful, the write will + cause the driver to switch over to the operation mode represented by + that string - or to be unregistered in the "off" case. [Actually, + switching over from the active mode to the passive mode or the other + way around causes the driver to be unregistered and registered again + with a different set of callbacks, so all of its settings (the global + as well as the per-policy ones) are then reset to their default + values, possibly depending on the target operation mode.]
+ + That only is supported in some configurations, though (for example, if + the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, + the operation mode of the driver cannot be changed), and if it is not + supported in the current configuration, writes to this attribute will + fail with an appropriate error. + +Interpretation of Policy Attributes +----------------------------------- + +The interpretation of some ``CPUFreq`` policy attributes described in +:doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver +and it generally depends on the driver's `operation mode <Operation Modes_>`_. + +First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and +``scaling_cur_freq`` attributes are produced by applying a processor-specific +multiplier to the internal P-state representation used by ``intel_pstate``. +Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq`` +attributes are capped by the frequency corresponding to the maximum P-state that +the driver is allowed to set. + +If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is +not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq`` +and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency. +Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and +``scaling_min_freq`` to go down to that value if they were above it before. +However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be +restored after unsetting ``no_turbo``, unless these attributes have been written +to after ``no_turbo`` was set. + +If ``no_turbo`` is not set, the maximum possible value of ``scaling_max_freq`` +and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state, +which also is the value of ``cpuinfo_max_freq`` in either case.
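
The effect of ``no_turbo`` on the visible per-policy limits described above can be modelled roughly as follows. This is only an illustrative sketch, not code from the driver: the function name, its parameters and the use of kHz for every value are assumptions made here for the example.

```python
def visible_scaling_limit(requested_khz, max_turbo_khz, max_non_turbo_khz, no_turbo):
    """Return the value a scaling_{min,max}_freq limit would show.

    Hypothetical model: the requested value is remembered unchanged and
    only the value that is reported gets capped, which is why the old
    value reappears once no_turbo is unset again.
    """
    # With no_turbo set the cap is the maximum non-turbo frequency,
    # otherwise it is the maximum supported turbo frequency.
    cap_khz = max_non_turbo_khz if no_turbo else max_turbo_khz
    return min(requested_khz, cap_khz)


if __name__ == "__main__":
    # Made-up numbers: 3.0 GHz turbo threshold, 3.8 GHz maximum turbo.
    for no_turbo in (False, True):
        shown = visible_scaling_limit(3_800_000, 3_800_000, 3_000_000, no_turbo)
        print(f"no_turbo={int(no_turbo)}: scaling_max_freq={shown}")
```

In this model a limit that was written while ``no_turbo`` was set simply replaces the remembered request, matching the exception stated in the text.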
+ +Next, the following policy attributes have special meaning if +``intel_pstate`` works in the `active mode <Active Mode_>`_: + +``scaling_available_governors`` + List of P-state selection algorithms provided by ``intel_pstate``. + +``scaling_governor`` + P-state selection algorithm provided by ``intel_pstate`` currently in + use with the given policy. + +``scaling_cur_freq`` + Frequency of the average P-state of the CPU represented by the given + policy for the time interval between the last two invocations of the + driver's utilization update callback by the CPU scheduler for that CPU. + +The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the +same as for other scaling drivers. + +Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate`` +depends on the operation mode of the driver. Namely, it is either +"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the +`passive mode <Passive Mode_>`_). + +Coordination of P-State Limits +------------------------------ + +``intel_pstate`` allows P-state limits to be set in two ways: with the help of +the ``max_perf_pct`` and ``min_perf_pct`` `global attributes +<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq`` +``CPUFreq`` policy attributes. The coordination between those limits is based +on the following rules, regardless of the current operation mode of the driver: + + 1. All CPUs are affected by the global limits (that is, none of them can be + requested to run faster than the global maximum and none of them can be + requested to run slower than the global minimum). + + 2. Each individual CPU is affected by its own per-policy limits (that is, it + cannot be requested to run faster than its own per-policy maximum and it + cannot be requested to run slower than its own per-policy minimum). + + 3. The global and per-policy limits can be set independently. 
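
The three rules above can be sketched as a small calculation. This is a simplified model under stated assumptions, not the driver's implementation: the function name is hypothetical, the percentages are applied directly to the maximum turbo frequency in kHz, and clamping the minimum to the maximum when they conflict is a choice made only for this example.

```python
def effective_limits(max_turbo_khz, max_perf_pct, min_perf_pct,
                     scaling_max_khz, scaling_min_khz):
    """Combine global and per-policy limits for one CPU (illustrative only)."""
    # Global limits are percentages of the maximum supported performance level.
    global_max_khz = max_turbo_khz * max_perf_pct // 100
    global_min_khz = max_turbo_khz * min_perf_pct // 100

    # Rules 1 and 2: a CPU may exceed neither maximum and may drop
    # below neither minimum, so the tighter bound applies on each side.
    eff_max_khz = min(global_max_khz, scaling_max_khz)
    eff_min_khz = max(global_min_khz, scaling_min_khz)

    # Assumption of this sketch: if the limits conflict, the maximum wins.
    eff_min_khz = min(eff_min_khz, eff_max_khz)
    return eff_min_khz, eff_max_khz


if __name__ == "__main__":
    # Made-up values: 4.0 GHz maximum turbo, global 20%..80%,
    # per-policy range 1.0 GHz..3.6 GHz.
    print(effective_limits(4_000_000, 80, 20, 3_600_000, 1_000_000))
```

Because the inputs are independent arguments, changing the global values never alters the per-policy ones, which reflects rule 3.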
+ +If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the +resulting effective values are written into its registers whenever the limits +change in order to request its internal P-state selection logic to always set +P-states within these limits. Otherwise, the limits are taken into account by +scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver +every time before setting a new P-state for a CPU. + +Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument +is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed +at all and the only way to set the limits is by using the policy attributes. + + +Energy vs Performance Hints +--------------------------- + +If ``intel_pstate`` works in the `active mode with the HWP feature enabled +<Active Mode With HWP_>`_ in the processor, additional attributes are present +in every ``CPUFreq`` policy directory in ``sysfs``. They are intended to allow +user space to help ``intel_pstate`` to adjust the processor's internal P-state +selection logic by focusing it on performance or on energy-efficiency, or +somewhere between the two extremes: + +``energy_performance_preference`` + Current value of the energy vs performance hint for the given policy + (or the CPU represented by it). + + The hint can be changed by writing to this attribute. + +``energy_performance_available_preferences`` + List of strings that can be written to the + ``energy_performance_preference`` attribute. + + They represent different energy vs performance hints and should be + self-explanatory, except that ``default`` represents whatever hint + value was set by the platform firmware. + +Strings written to the ``energy_performance_preference`` attribute are +internally translated to integer values written to the processor's +Energy-Performance Preference (EPP) knob (if supported) or its +Energy-Performance Bias (EPB) knob. 
+ +[Note that tasks may be migrated from one CPU to another by the scheduler's +load-balancing algorithm and if different energy vs performance hints are +set for those CPUs, that may lead to undesirable outcomes. To avoid such +issues it is better to set the same energy vs performance hint for all CPUs +or to pin every task potentially sensitive to them to a specific CPU.] + +.. _acpi-cpufreq: + +``intel_pstate`` vs ``acpi-cpufreq`` +==================================== + +On the majority of systems supported by ``intel_pstate``, the ACPI tables +provided by the platform firmware contain ``_PSS`` objects returning information +that can be used for CPU performance scaling (refer to the `ACPI specification`_ +for details on the ``_PSS`` objects and the format of the information returned +by them). + +The information returned by the ACPI ``_PSS`` objects is used by the +``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate`` +the ``acpi-cpufreq`` driver uses the same hardware CPU performance scaling +interface, but the set of P-states it can use is limited by the ``_PSS`` +output. + +On those systems each ``_PSS`` object returns a list of P-states supported by +the corresponding CPU which basically is a subset of the P-states range that can +be used by ``intel_pstate`` on the same system, with one exception: the whole +`turbo range <turbo_>`_ is represented by one item in it (the topmost one). By +convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz +than the frequency of the highest non-turbo P-state listed by it, but the +corresponding P-state representation (following the hardware specification) +returned for it matches the maximum supported turbo P-state (or is the +special value 255 meaning essentially "go as high as you can get").
+ +The list of P-states returned by ``_PSS`` is reflected by the table of +available frequencies supplied by ``acpi-cpufreq`` to the ``CPUFreq`` core and +scaling governors and the minimum and maximum supported frequencies reported by +it come from that list as well. In particular, given the special representation +of the turbo range described above, this means that the maximum supported +frequency reported by ``acpi-cpufreq`` is higher by 1 MHz than the frequency +of the highest supported non-turbo P-state listed by ``_PSS`` which, of course, +affects decisions made by the scaling governors, except for ``powersave`` and +``performance``. + +For example, if a given governor attempts to select a frequency proportional to +estimated CPU load and maps the load of 100% to the maximum supported frequency +(possibly multiplied by a constant), then it will tend to choose P-states below +the turbo threshold if ``acpi-cpufreq`` is used as the scaling driver, because +in that case the turbo range corresponds to a small fraction of the frequency +band it can use (1 MHz vs 1 GHz or more). In consequence, it will only go to +the turbo range for the highest loads and the other loads above 50% that might +benefit from running at turbo frequencies will be given non-turbo P-states +instead. + +One more issue related to that may appear on systems supporting the +`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the +turbo threshold. Namely, if that is not coordinated with the lists of P-states +returned by ``_PSS`` properly, there may be more than one item corresponding to +a turbo P-state in those lists and there may be a problem with avoiding the +turbo range (if desirable or necessary). Usually, to avoid using turbo +P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed +by ``_PSS``, but that is not sufficient when there are other turbo P-states in +the list returned by it. 
+ +Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the +`passive mode <Passive Mode_>`_, except that the number of P-states it can set +is limited to the ones listed by the ACPI ``_PSS`` objects. + + +Kernel Command Line Options for ``intel_pstate`` +================================================ + +Several kernel command line options can be used to pass early-configuration-time +parameters to ``intel_pstate`` in order to enforce specific behavior of it. All +of them have to be prepended with the ``intel_pstate=`` prefix. + +``disable`` + Do not register ``intel_pstate`` as the scaling driver even if the + processor is supported by it. + +``passive`` + Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to + start with. + + This option implies the ``no_hwp`` one described below. + +``force`` + Register ``intel_pstate`` as the scaling driver instead of + ``acpi-cpufreq`` even if the latter is preferred on the given system. + + This may prevent some platform features (such as thermal controls and + power capping) that rely on the availability of ACPI P-states + information from functioning as expected, so it should be used with + caution. + + This option does not work with processors that are not supported by + ``intel_pstate`` and on platforms where the ``pcc-cpufreq`` scaling + driver is used instead of ``acpi-cpufreq``. + +``no_hwp`` + Do not enable the `hardware-managed P-states (HWP) feature + <Active Mode With HWP_>`_ even if it is supported by the processor. + +``hwp_only`` + Register ``intel_pstate`` as the scaling driver only if the + `hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is + supported by the processor. + +``support_acpi_ppc`` + Take ACPI ``_PPC`` performance limits into account. 
+ + If the preferred power management profile in the FADT (Fixed ACPI + Description Table) is set to "Enterprise Server" or "Performance + Server", the ACPI ``_PPC`` limits are taken into account by default + and this option has no effect. + +``per_cpu_perf_limits`` + Use per-logical-CPU P-State limits (see `Coordination of P-state + Limits`_ for details). + + +Diagnostics and Tuning +====================== + +Trace Events +------------ + +There are two static trace events that can be used for ``intel_pstate`` +diagnostics. One of them is the ``cpu_frequency`` trace event generally used +by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific +to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if +it works in the `active mode <Active Mode_>`_. + +The following sequence of shell commands can be used to enable them and see +their output (if the kernel is generally configured to support event tracing):: + + # cd /sys/kernel/debug/tracing/ + # echo 1 > events/power/pstate_sample/enable + # echo 1 > events/power/cpu_frequency/enable + # cat trace + gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476 + cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 + +If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the +``cpu_frequency`` trace event will be triggered either by the ``schedutil`` +scaling governor (for the policies it is attached to), or by the ``CPUFreq`` +core (for the policies with other scaling governors). + +``ftrace`` +---------- + +The ``ftrace`` interface can be used for low-level diagnostics of +``intel_pstate``. 
For example, to check how often the function to set a +P-state is called, the ``ftrace`` filter can be set to +:c:func:`intel_pstate_set_pstate`:: + + # cd /sys/kernel/debug/tracing/ + # cat available_filter_functions | grep -i pstate + intel_pstate_set_pstate + intel_pstate_cpu_init + ... + # echo intel_pstate_set_pstate > set_ftrace_filter + # echo function > current_tracer + # cat trace | head -15 + # tracer: function + # + # entries-in-buffer/entries-written: 80/80 #P:4 + # + # _-----=> irqs-off + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # ||| / delay + # TASK-PID CPU# |||| TIMESTAMP FUNCTION + # | | | |||| | | + Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func + gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func + gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func + <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func + +Tuning Interface in ``debugfs`` +------------------------------- + +The ``powersave`` algorithm provided by ``intel_pstate`` for `the Core line of +processors in the active mode <powersave_>`_ is based on a `PID controller`_ +whose parameters were chosen to address a number of different use cases at the +same time. However, it still is possible to fine-tune it to a specific workload +and the ``debugfs`` interface under ``/sys/kernel/debug/pstate_snb/`` is +provided for this purpose. [Note that the ``pstate_snb`` directory will be +present only if the specific P-state selection algorithm matching the interface +in it actually is in use.]
+ +The following files present in that directory can be used to modify the PID +controller parameters at run time: + +| ``deadband`` +| ``d_gain_pct`` +| ``i_gain_pct`` +| ``p_gain_pct`` +| ``sample_rate_ms`` +| ``setpoint`` + +Note, however, that achieving desirable results this way generally requires +expert-level understanding of the power vs performance tradeoff, so extra care +is recommended when attempting to do that. + + +.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf +.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html +.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf +.. _PID controller: https://en.wikipedia.org/wiki/PID_controller diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 01ddeaf64b0f..9490f2845f06 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -632,7 +632,7 @@ to i/o submission, if the bio fields are likely to be accessed after the i/o is issued (since the bio may otherwise get freed in case i/o completion happens in the meantime). -The bio_clone() routine may be used to duplicate a bio, where the clone +The bio_clone_fast() routine may be used to duplicate a bio, where the clone shares the bio_vec_list with the original bio (i.e. both point to the same bio_vec_list). This would typically be used for splitting i/o requests in lvm or md. 
diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst index 55e43f1c80de..fce929144ccd 100644 --- a/Documentation/core-api/atomic_ops.rst +++ b/Documentation/core-api/atomic_ops.rst @@ -303,6 +303,11 @@ defined which accomplish this:: void smp_mb__before_atomic(void); void smp_mb__after_atomic(void); +Preceding a non-value-returning read-modify-write atomic operation with +smp_mb__before_atomic() and following it with smp_mb__after_atomic() +provides the same full ordering that is provided by value-returning +read-modify-write atomic operations. + For example, smp_mb__before_atomic() can be used like so:: obj->dead = 1; diff --git a/Documentation/cpu-freq/intel-pstate.txt b/Documentation/cpu-freq/intel-pstate.txt deleted file mode 100644 index 3fdcdfd968ba..000000000000 --- a/Documentation/cpu-freq/intel-pstate.txt +++ /dev/null @@ -1,281 +0,0 @@ -Intel P-State driver --------------------- - -This driver provides an interface to control the P-State selection for the -SandyBridge+ Intel processors. - -The following document explains P-States: -http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf -As stated in the document, P-State doesn’t exactly mean a frequency. However, for -the sake of the relationship with cpufreq, P-State and frequency are used -interchangeably. - -Understanding the cpufreq core governors and policies are important before -discussing more details about the Intel P-State driver. Based on what callbacks -a cpufreq driver provides to the cpufreq core, it can support two types of -drivers: -- with target_index() callback: In this mode, the drivers using cpufreq core -simply provide the minimum and maximum frequency limits and an additional -interface target_index() to set the current frequency. The cpufreq subsystem -has a number of scaling governors ("performance", "powersave", "ondemand", -etc.). 
Depending on which governor is in use, cpufreq core will call for -transitions to a specific frequency using target_index() callback. -- setpolicy() callback: In this mode, drivers do not provide target_index() -callback, so cpufreq core can't request a transition to a specific frequency. -The driver provides minimum and maximum frequency limits and callbacks to set a -policy. The policy in cpufreq sysfs is referred to as the "scaling governor". -The cpufreq core can request the driver to operate in any of the two policies: -"performance" and "powersave". The driver decides which frequency to use based -on the above policy selection considering minimum and maximum frequency limits. - -The Intel P-State driver falls under the latter category, which implements the -setpolicy() callback. This driver decides what P-State to use based on the -requested policy from the cpufreq core. If the processor is capable of -selecting its next P-State internally, then the driver will offload this -responsibility to the processor (aka HWP: Hardware P-States). If not, the -driver implements algorithms to select the next P-State. - -Since these policies are implemented in the driver, they are not same as the -cpufreq scaling governors implementation, even if they have the same name in -the cpufreq sysfs (scaling_governors). For example the "performance" policy is -similar to cpufreq’s "performance" governor, but "powersave" is completely -different than the cpufreq "powersave" governor. The strategy here is similar -to cpufreq "ondemand", where the requested P-State is related to the system load. - -Sysfs Interface - -In addition to the frequency-controlling interfaces provided by the cpufreq -core, the driver provides its own sysfs files to control the P-State selection. -These files have been added to /sys/devices/system/cpu/intel_pstate/. 
-Any changes made to these files are applicable to all CPUs (even in a -multi-package system, Refer to later section on placing "Per-CPU limits"). - - max_perf_pct: Limits the maximum P-State that will be requested by - the driver. It states it as a percentage of the available performance. The - available (P-State) performance may be reduced by the no_turbo - setting described below. - - min_perf_pct: Limits the minimum P-State that will be requested by - the driver. It states it as a percentage of the max (non-turbo) - performance level. - - no_turbo: Limits the driver to selecting P-State below the turbo - frequency range. - - turbo_pct: Displays the percentage of the total performance that - is supported by hardware that is in the turbo range. This number - is independent of whether turbo has been disabled or not. - - num_pstates: Displays the number of P-States that are supported - by hardware. This number is independent of whether turbo has - been disabled or not. - -For example, if a system has these parameters: - Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State) - Max non turbo ratio: 0x17 - Minimum ratio : 0x08 (Here the ratio is called max efficiency ratio) - -Sysfs will show : - max_perf_pct:100, which corresponds to 1 core ratio - min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio - no_turbo:0, turbo is not disabled - num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1) - turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates - -Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual -Volume 3: System Programming Guide" to understand ratios. - -There is one more sysfs attribute in /sys/devices/system/cpu/intel_pstate/ -that can be used for controlling the operation mode of the driver: - - status: Three settings are possible: - "off" - The driver is not in use at this time. - "active" - The driver works as a P-state governor (default). 
- "passive" - The driver works as a regular cpufreq one and collaborates - with the generic cpufreq governors (it sets P-states as - requested by those governors). - The current setting is returned by reads from this attribute. Writing one - of the above strings to it changes the operation mode as indicated by that - string, if possible. If HW-managed P-states (HWP) are enabled, it is not - possible to change the driver's operation mode and attempts to write to - this attribute will fail. - -cpufreq sysfs for Intel P-State - -Since this driver registers with cpufreq, cpufreq sysfs is also presented. -There are some important differences, which need to be considered. - -scaling_cur_freq: This displays the real frequency which was used during -the last sample period instead of what is requested. Some other cpufreq driver, -like acpi-cpufreq, displays what is requested (Some changes are on the -way to fix this for acpi-cpufreq driver). The same is true for frequencies -displayed at /proc/cpuinfo. - -scaling_governor: This displays current active policy. Since each CPU has a -cpufreq sysfs, it is possible to set a scaling governor to each CPU. But this -is not possible with Intel P-States, as there is one common policy for all -CPUs. Here, the last requested policy will be applicable to all CPUs. It is -suggested that one use the cpupower utility to change policy to all CPUs at the -same time. - -scaling_setspeed: This attribute can never be used with Intel P-State. - -scaling_max_freq/scaling_min_freq: This interface can be used similarly to -the max_perf_pct/min_perf_pct of Intel P-State sysfs. However since frequencies -are converted to nearest possible P-State, this is prone to rounding errors. -This method is not preferred to limit performance. - -affected_cpus: Not used -related_cpus: Not used - -For contemporary Intel processors, the frequency is controlled by the -processor itself and the P-State exposed to software is related to -performance levels. 
The idea that frequency can be set to a single -frequency is fictional for Intel Core processors. Even if the scaling -driver selects a single P-State, the actual frequency the processor -will run at is selected by the processor itself. - -Per-CPU limits - -The kernel command line option "intel_pstate=per_cpu_perf_limits" forces -the intel_pstate driver to use per-CPU performance limits. When it is set, -the sysfs control interface described above is subject to limitations. -- The following controls are not available for both read and write - /sys/devices/system/cpu/intel_pstate/max_perf_pct - /sys/devices/system/cpu/intel_pstate/min_perf_pct -- The following controls can be used to set performance limits, as far as the -architecture of the processor permits: - /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq - /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq - /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -- User can still observe turbo percent and number of P-States from - /sys/devices/system/cpu/intel_pstate/turbo_pct - /sys/devices/system/cpu/intel_pstate/num_pstates -- User can read write system wide turbo status - /sys/devices/system/cpu/no_turbo - -Support of energy performance hints -It is possible to provide hints to the HWP algorithms in the processor -to be more performance centric to more energy centric. When the driver -is using HWP, two additional cpufreq sysfs attributes are presented for -each logical CPU. -These attributes are: - - energy_performance_available_preferences - - energy_performance_preference - -To get list of supported hints: -$ cat energy_performance_available_preferences - default performance balance_performance balance_power power - -The current preference can be read or changed via cpufreq sysfs -attribute "energy_performance_preference". Reading from this attribute -will display current effective setting. User can write any of the valid -preference string to this attribute. 
User can always restore to power-on -default by writing "default". - -Since threads can migrate to different CPUs, this is possible that the -new CPU may have different energy performance preference than the previous -one. To avoid such issues, either threads can be pinned to specific CPUs -or set the same energy performance preference value to all CPUs. - -Tuning Intel P-State driver - -When the performance can be tuned using PID (Proportional Integral -Derivative) controller, debugfs files are provided for adjusting performance. -They are presented under: -/sys/kernel/debug/pstate_snb/ - -The PID tunable parameters are: - deadband - d_gain_pct - i_gain_pct - p_gain_pct - sample_rate_ms - setpoint - -To adjust these parameters, some understanding of driver implementation is -necessary. There are some tweeks described here, but be very careful. Adjusting -them requires expert level understanding of power and performance relationship. -These limits are only useful when the "powersave" policy is active. - --To make the system more responsive to load changes, sample_rate_ms can -be adjusted (current default is 10ms). --To make the system use higher performance, even if the load is lower, setpoint -can be adjusted to a lower number. This will also lead to faster ramp up time -to reach the maximum P-State. -If there are no derivative and integral coefficients, The next P-State will be -equal to: - current P-State - ((setpoint - current cpu load) * p_gain_pct) - -For example, if the current PID parameters are (Which are defaults for the core -processors like SandyBridge): - deadband = 0 - d_gain_pct = 0 - i_gain_pct = 0 - p_gain_pct = 20 - sample_rate_ms = 10 - setpoint = 97 - -If the current P-State = 0x08 and current load = 100, this will result in the -next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State -goes up by only 1. If during next sample interval the current load doesn't -change and still 100, then P-State goes up by one again. 
This process will -continue as long as the load is more than the setpoint until the maximum P-State -is reached. - -For the same load at setpoint = 60, this will result in the next P-State -= 0x08 - ((60 - 100) * 0.2) = 16 -So by changing the setpoint from 97 to 60, there is an increase of the -next P-State from 9 to 16. So this will make processor execute at higher -P-State for the same CPU load. If the load continues to be more than the -setpoint during next sample intervals, then P-State will go up again till the -maximum P-State is reached. But the ramp up time to reach the maximum P-State -will be much faster when the setpoint is 60 compared to 97. - -Debugging Intel P-State driver - -Event tracing -To debug P-State transition, the Linux event tracing interface can be used. -There are two specific events, which can be enabled (Provided the kernel -configs related to event tracing are enabled). - -# cd /sys/kernel/debug/tracing/ -# echo 1 > events/power/pstate_sample/enable -# echo 1 > events/power/cpu_frequency/enable -# cat trace -gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 - scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 - freq=2474476 -cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 - - -Using ftrace - -If function level tracing is required, the Linux ftrace interface can be used. -For example if we want to check how often a function to set a P-State is -called, we can set ftrace filter to intel_pstate_set_pstate. - -# cd /sys/kernel/debug/tracing/ -# cat available_filter_functions | grep -i pstate -intel_pstate_set_pstate -intel_pstate_cpu_init -... 
- -# echo intel_pstate_set_pstate > set_ftrace_filter -# echo function > current_tracer -# cat trace | head -15 -# tracer: function -# -# entries-in-buffer/entries-written: 80/80 #P:4 -# -# _-----=> irqs-off -# / _----=> need-resched -# | / _---=> hardirq/softirq -# || / _--=> preempt-depth -# ||| / delay -# TASK-PID CPU# |||| TIMESTAMP FUNCTION -# | | | |||| | | - Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func - gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func - gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func - <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func diff --git a/Documentation/dev-tools/sparse.rst b/Documentation/dev-tools/sparse.rst index ffdcc97f6f5a..78aa00a604a0 100644 --- a/Documentation/dev-tools/sparse.rst +++ b/Documentation/dev-tools/sparse.rst @@ -103,9 +103,3 @@ have already built it. The optional make variable CF can be used to pass arguments to sparse. The build system passes -Wbitwise to sparse automatically. - -Checking RCU annotations -~~~~~~~~~~~~~~~~~~~~~~~~ - -RCU annotations are not checked by default. To enable RCU annotation -checks, include -DCONFIG_SPARSE_RCU_POINTER in your CF flags. 
diff --git a/Documentation/devicetree/bindings/clock/sunxi-ccu.txt b/Documentation/devicetree/bindings/clock/sunxi-ccu.txt index e9c5a1d9834a..f465647a4dd2 100644 --- a/Documentation/devicetree/bindings/clock/sunxi-ccu.txt +++ b/Documentation/devicetree/bindings/clock/sunxi-ccu.txt @@ -22,7 +22,8 @@ Required properties : - #clock-cells : must contain 1 - #reset-cells : must contain 1 -For the PRCM CCUs on H3/A64, one more clock is needed: +For the PRCM CCUs on H3/A64, two more clocks are needed: +- "pll-periph": the SoC's peripheral PLL from the main CCU - "iosc": the SoC's internal frequency oscillator Example for generic CCU: @@ -39,8 +40,8 @@ Example for PRCM CCU: r_ccu: clock@01f01400 { compatible = "allwinner,sun50i-a64-r-ccu"; reg = <0x01f01400 0x100>; - clocks = <&osc24M>, <&osc32k>, <&iosc>; - clock-names = "hosc", "losc", "iosc"; + clocks = <&osc24M>, <&osc32k>, <&iosc>, <&ccu CLK_PLL_PERIPH0>; + clock-names = "hosc", "losc", "iosc", "pll-periph"; #clock-cells = <1>; #reset-cells = <1>; }; diff --git a/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt b/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt index 42c3bb2d53e8..01e331a5f3e7 100644 --- a/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt +++ b/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt @@ -41,9 +41,9 @@ Required properties: Optional properties: In order to use the GPIO lines in PWM mode, some additional optional -properties are required. Only Armada 370 and XP support these properties. +properties are required. -- compatible: Must contain "marvell,armada-370-xp-gpio" +- compatible: Must contain "marvell,armada-370-gpio" - reg: an additional register set is needed, for the GPIO Blink Counter on/off registers. 
@@ -71,7 +71,7 @@ Example: }; gpio1: gpio@18140 { - compatible = "marvell,armada-370-xp-gpio"; + compatible = "marvell,armada-370-gpio"; reg = <0x18140 0x40>, <0x181c8 0x08>; reg-names = "gpio", "pwm"; ngpios = <17>; diff --git a/Documentation/devicetree/bindings/input/touchscreen/edt-ft5x06.txt b/Documentation/devicetree/bindings/input/touchscreen/edt-ft5x06.txt index 6db22103e2dd..025cf8c9324a 100644 --- a/Documentation/devicetree/bindings/input/touchscreen/edt-ft5x06.txt +++ b/Documentation/devicetree/bindings/input/touchscreen/edt-ft5x06.txt @@ -36,7 +36,7 @@ Optional properties: control gpios - threshold: allows setting the "click"-threshold in the range - from 20 to 80. + from 0 to 80. - gain: allows setting the sensitivity in the range from 0 to 31. Note that lower values indicate higher diff --git a/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt b/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt index 05485699d70e..9630ac0e4b56 100644 --- a/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt +++ b/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt @@ -16,6 +16,11 @@ Required properties: - reg: Base address of PMIC on Hi6220 SoC. - interrupt-controller: Hi655x has internal IRQs (has own IRQ domain). - pmic-gpios: The GPIO used by PMIC IRQ. 
+- #clock-cells: From common clock binding; shall be set to 0 + +Optional properties: +- clock-output-names: From common clock binding to override the + default output clock name Example: pmic: pmic@f8000000 { @@ -24,4 +29,5 @@ Example: interrupt-controller; #interrupt-cells = <2>; pmic-gpios = <&gpio1 2 GPIO_ACTIVE_HIGH>; + #clock-cells = <0>; } diff --git a/Documentation/devicetree/bindings/mfd/stm32-timers.txt b/Documentation/devicetree/bindings/mfd/stm32-timers.txt index bbd083f5600a..1db6e0057a63 100644 --- a/Documentation/devicetree/bindings/mfd/stm32-timers.txt +++ b/Documentation/devicetree/bindings/mfd/stm32-timers.txt @@ -31,7 +31,7 @@ Example: compatible = "st,stm32-timers"; reg = <0x40010000 0x400>; clocks = <&rcc 0 160>; - clock-names = "clk_int"; + clock-names = "int"; pwm { compatible = "st,stm32-pwm"; diff --git a/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt b/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt index e25436861867..9029b45b8a22 100644 --- a/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt +++ b/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt @@ -18,6 +18,8 @@ Optional properties: "ext_clock" (External clock provided to the card). - post-power-on-delay-ms : Delay in ms after powering the card and de-asserting the reset-gpios (if any) +- power-off-delay-us : Delay in us after asserting the reset-gpios (if any) + during power off of the card. 
Example: diff --git a/Documentation/devicetree/bindings/net/dsa/b53.txt b/Documentation/devicetree/bindings/net/dsa/b53.txt index d6c6e41648d4..8ec2ca21adeb 100644 --- a/Documentation/devicetree/bindings/net/dsa/b53.txt +++ b/Documentation/devicetree/bindings/net/dsa/b53.txt @@ -34,7 +34,7 @@ Required properties: "brcm,bcm6328-switch" "brcm,bcm6368-switch" and the mandatory "brcm,bcm63xx-switch" -See Documentation/devicetree/bindings/dsa/dsa.txt for a list of additional +See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of additional required and optional properties. Examples: diff --git a/Documentation/devicetree/bindings/net/dsa/marvell.txt b/Documentation/devicetree/bindings/net/dsa/marvell.txt index 7ef9dbb08957..1d4d0f49c9d0 100644 --- a/Documentation/devicetree/bindings/net/dsa/marvell.txt +++ b/Documentation/devicetree/bindings/net/dsa/marvell.txt @@ -26,6 +26,10 @@ Optional properties: - interrupt-controller : Indicates the switch is itself an interrupt controller. This is used for the PHY interrupts. #interrupt-cells = <2> : Controller uses two cells, number and flag +- eeprom-length : Set to the length of an EEPROM connected to the + switch. Must be set if the switch can not detect + the presence and/or size of a connected EEPROM, + otherwise optional. - mdio : Container of PHY and devices on the switches MDIO bus. - mdio? : Container of PHYs and devices on the external MDIO diff --git a/Documentation/devicetree/bindings/net/fsl-fec.txt b/Documentation/devicetree/bindings/net/fsl-fec.txt index a1e3693cca16..6f55bdd52f8a 100644 --- a/Documentation/devicetree/bindings/net/fsl-fec.txt +++ b/Documentation/devicetree/bindings/net/fsl-fec.txt @@ -15,6 +15,10 @@ Optional properties: - phy-reset-active-high : If present then the reset sequence using the GPIO specified in the "phy-reset-gpios" property is reversed (H=reset state, L=operation state). +- phy-reset-post-delay : Post reset delay in milliseconds. 
If present then + a delay of phy-reset-post-delay milliseconds will be observed after the + phy-reset-gpios has been toggled. Can be omitted thus no delay is + observed. Delay is in range of 1ms to 1000ms. Other delays are invalid. - phy-supply : regulator that powers the Ethernet PHY. - phy-handle : phandle to the PHY device connected to this device. - fixed-link : Assume a fixed link. See fixed-link.txt in the same directory. diff --git a/Documentation/devicetree/bindings/net/smsc911x.txt b/Documentation/devicetree/bindings/net/smsc911x.txt index 16c3a9501f5d..acfafc8e143c 100644 --- a/Documentation/devicetree/bindings/net/smsc911x.txt +++ b/Documentation/devicetree/bindings/net/smsc911x.txt @@ -27,6 +27,7 @@ Optional properties: of the device. On many systems this is wired high so the device goes out of reset at power-on, but if it is under program control, this optional GPIO can wake up in response to it. +- vdd33a-supply, vddvario-supply : 3.3V analog and IO logic power supplies Examples: diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt index 71a3c134af1b..f01d154090da 100644 --- a/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt @@ -247,7 +247,6 @@ bias-bus-hold - latch weakly bias-pull-up - pull up the pin bias-pull-down - pull down the pin bias-pull-pin-default - use pin-default pull state -bi-directional - pin supports simultaneous input/output operations drive-push-pull - drive actively high and low drive-open-drain - drive with open drain drive-open-source - drive with open source @@ -260,7 +259,6 @@ input-debounce - debounce mode with debound time X power-source - select between different power supplies low-power-enable - enable low power mode low-power-disable - disable low power mode -output-enable - enable output on pin regardless of output value output-low - set the pin to 
output mode with low level output-high - set the pin to output mode with high level slew-rate - set the slew rate diff --git a/Documentation/devicetree/bindings/staging/ion/hi6220-ion.txt b/Documentation/devicetree/bindings/staging/ion/hi6220-ion.txt deleted file mode 100644 index c59e27c632c1..000000000000 --- a/Documentation/devicetree/bindings/staging/ion/hi6220-ion.txt +++ /dev/null @@ -1,31 +0,0 @@ -Hi6220 SoC ION -=================================================================== -Required properties: -- compatible : "hisilicon,hi6220-ion" -- list of the ION heaps - - heap name : maybe heap_sys_user@0 - - heap id : id should be unique in the system. - - heap base : base ddr address of the heap,0 means that - it is dynamic. - - heap size : memory size and 0 means it is dynamic. - - heap type : the heap type of the heap, please also - see the define in ion.h(drivers/staging/android/uapi/ion.h) -------------------------------------------------------------------- -Example: - hi6220-ion { - compatible = "hisilicon,hi6220-ion"; - heap_sys_user@0 { - heap-name = "sys_user"; - heap-id = <0x0>; - heap-base = <0x0>; - heap-size = <0x0>; - heap-type = "ion_system"; - }; - heap_sys_contig@0 { - heap-name = "sys_contig"; - heap-id = <0x1>; - heap-base = <0x0>; - heap-size = <0x0>; - heap-type = "ion_system_contig"; - }; - }; diff --git a/Documentation/devicetree/bindings/usb/dwc2.txt b/Documentation/devicetree/bindings/usb/dwc2.txt index 00bea038639e..fcf199b64d3d 100644 --- a/Documentation/devicetree/bindings/usb/dwc2.txt +++ b/Documentation/devicetree/bindings/usb/dwc2.txt @@ -10,6 +10,7 @@ Required properties: - "rockchip,rk3288-usb", "rockchip,rk3066-usb", "snps,dwc2": for rk3288 Soc; - "lantiq,arx100-usb": The DWC2 USB controller instance in Lantiq ARX SoCs; - "lantiq,xrx200-usb": The DWC2 USB controller instance in Lantiq XRX SoCs; + - "amlogic,meson8-usb": The DWC2 USB controller instance in Amlogic Meson8 SoCs; - "amlogic,meson8b-usb": The DWC2 USB controller 
instance in Amlogic Meson8b SoCs; - "amlogic,meson-gxbb-usb": The DWC2 USB controller instance in Amlogic S905 SoCs; - "amcc,dwc-otg": The DWC2 USB controller instance in AMCC Canyonlands 460EX SoCs; diff --git a/Documentation/filesystems/autofs4.txt b/Documentation/filesystems/autofs4.txt index f10dd590f69f..8444dc3d57e8 100644 --- a/Documentation/filesystems/autofs4.txt +++ b/Documentation/filesystems/autofs4.txt @@ -316,7 +316,7 @@ For version 5, the format of the message is: struct autofs_v5_packet { int proto_version; /* Protocol version */ int type; /* Type of packet */ - autofs_wqt_t wait_queue_token; + autofs_wqt_t wait_queue_entry_token; __u32 dev; __u64 ino; __u32 uid; @@ -341,12 +341,12 @@ The pipe will be set to "packet mode" (equivalent to passing `O_DIRECT`) to _pipe2(2)_ so that a read from the pipe will return at most one packet, and any unread portion of a packet will be discarded. -The `wait_queue_token` is a unique number which can identify a +The `wait_queue_entry_token` is a unique number which can identify a particular request to be acknowledged. When a message is sent over the pipe the affected dentry is marked as either "active" or "expiring" and other accesses to it block until the message is acknowledged using one of the ioctls below and the relevant -`wait_queue_token`. +`wait_queue_entry_token`. Communicating with autofs: root directory ioctls ------------------------------------------------ @@ -358,7 +358,7 @@ capability, or must be the automount daemon. The available ioctl commands are: - **AUTOFS_IOC_READY**: a notification has been handled. The argument - to the ioctl command is the "wait_queue_token" number + to the ioctl command is the "wait_queue_entry_token" number corresponding to the notification being acknowledged. - **AUTOFS_IOC_FAIL**: similar to above, but indicates failure with the error code `ENOENT`. 
@@ -382,14 +382,14 @@ The available ioctl commands are: struct autofs_packet_expire_multi { int proto_version; /* Protocol version */ int type; /* Type of packet */ - autofs_wqt_t wait_queue_token; + autofs_wqt_t wait_queue_entry_token; int len; char name[NAME_MAX+1]; }; is required. This is filled in with the name of something that can be unmounted or removed. If nothing can be expired, - `errno` is set to `EAGAIN`. Even though a `wait_queue_token` + `errno` is set to `EAGAIN`. Even though a `wait_queue_entry_token` is present in the structure, no "wait queue" is established and no acknowledgment is needed. - **AUTOFS_IOC_EXPIRE_MULTI**: This is similar to diff --git a/Documentation/input/devices/edt-ft5x06.rst b/Documentation/input/devices/edt-ft5x06.rst index 2032f0b7a8fa..1ccc94b192b7 100644 --- a/Documentation/input/devices/edt-ft5x06.rst +++ b/Documentation/input/devices/edt-ft5x06.rst @@ -15,7 +15,7 @@ It has been tested with the following devices: The driver allows configuration of the touch screen via a set of sysfs files: /sys/class/input/eventX/device/device/threshold: - allows setting the "click"-threshold in the range from 20 to 80. + allows setting the "click"-threshold in the range from 0 to 80. /sys/class/input/eventX/device/device/gain: allows setting the sensitivity in the range from 0 to 31. Note that diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt index df31e30b6a02..2cb7dc5c0e0d 100644 --- a/Documentation/kernel-per-CPU-kthreads.txt +++ b/Documentation/kernel-per-CPU-kthreads.txt @@ -109,13 +109,12 @@ SCHED_SOFTIRQ: Do all of the following: on that CPU. If a thread that expects to run on the de-jittered CPU awakens, the scheduler will send an IPI that can result in a subsequent SCHED_SOFTIRQ. -2. 
Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y, - CONFIG_NO_HZ_FULL=y, and, in addition, ensure that the CPU - to be de-jittered is marked as an adaptive-ticks CPU using the - "nohz_full=" boot parameter. This reduces the number of - scheduler-clock interrupts that the de-jittered CPU receives, - minimizing its chances of being selected to do the load balancing - work that runs in SCHED_SOFTIRQ context. +2. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be de-jittered + is marked as an adaptive-ticks CPU using the "nohz_full=" + boot parameter. This reduces the number of scheduler-clock + interrupts that the de-jittered CPU receives, minimizing its + chances of being selected to do the load balancing work that + runs in SCHED_SOFTIRQ context. 3. To the extent possible, keep the CPU out of the kernel when it is non-idle, for example, by avoiding system calls and by forcing both kernel threads and interrupts to execute elsewhere. @@ -135,11 +134,10 @@ HRTIMER_SOFTIRQ: Do all of the following: RCU_SOFTIRQ: Do at least one of the following: 1. Offload callbacks and keep the CPU in either dyntick-idle or adaptive-ticks state by doing all of the following: - a. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y, - CONFIG_NO_HZ_FULL=y, and, in addition ensure that the CPU - to be de-jittered is marked as an adaptive-ticks CPU using - the "nohz_full=" boot parameter. Bind the rcuo kthreads - to housekeeping CPUs, which can tolerate OS jitter. + a. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be + de-jittered is marked as an adaptive-ticks CPU using the + "nohz_full=" boot parameter. Bind the rcuo kthreads to + housekeeping CPUs, which can tolerate OS jitter. b. 
To the extent possible, keep the CPU out of the kernel when it is non-idle, for example, by avoiding system calls and by forcing both kernel threads and interrupts @@ -236,11 +234,10 @@ To reduce its OS jitter, do at least one of the following: is feasible only if your workload never requires RCU priority boosting, for example, if you ensure frequent idle time on all CPUs that might execute within the kernel. -3. Build with CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y, - which offloads all RCU callbacks to kthreads that can be moved - off of CPUs susceptible to OS jitter. This approach prevents the - rcuc/%u kthreads from having any work to do, so that they are - never awakened. +3. Build with CONFIG_RCU_NOCB_CPU=y and boot with the rcu_nocbs= + boot parameter offloading RCU callbacks from all CPUs susceptible + to OS jitter. This approach prevents the rcuc/%u kthreads from + having any work to do, so that they are never awakened. 4. Ensure that the CPU never enters the kernel, and, in particular, avoid initiating any CPU hotplug operations on this CPU. This is another way of preventing any callbacks from being queued on the diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 732f10ea382e..9d5e0f853f08 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -27,7 +27,7 @@ The purpose of this document is twofold: (2) to provide a guide as to how to use the barriers that are available. Note that an architecture can provide more than the minimum requirement -for any particular barrier, but if the architecure provides less than +for any particular barrier, but if the architecture provides less than that, that architecture is incorrect. 
Note also that it is possible that a barrier may be a no-op for an diff --git a/Documentation/networking/dpaa.txt b/Documentation/networking/dpaa.txt new file mode 100644 index 000000000000..76e016d4d344 --- /dev/null +++ b/Documentation/networking/dpaa.txt @@ -0,0 +1,194 @@ +The QorIQ DPAA Ethernet Driver +============================== + +Authors: +Madalin Bucur <madalin.bucur@nxp.com> +Camelia Groza <camelia.groza@nxp.com> + +Contents +======== + + - DPAA Ethernet Overview + - DPAA Ethernet Supported SoCs + - Configuring DPAA Ethernet in your kernel + - DPAA Ethernet Frame Processing + - DPAA Ethernet Features + - Debugging + +DPAA Ethernet Overview +====================== + +DPAA stands for Data Path Acceleration Architecture and it is a +set of networking acceleration IPs that are available on several +generations of SoCs, both on PowerPC and ARM64. + +The Freescale DPAA architecture consists of a series of hardware blocks +that support Ethernet connectivity. The Ethernet driver depends upon the +following drivers in the Linux kernel: + + - Peripheral Access Memory Unit (PAMU) (* needed only for PPC platforms) + drivers/iommu/fsl_* + - Frame Manager (FMan) + drivers/net/ethernet/freescale/fman + - Queue Manager (QMan), Buffer Manager (BMan) + drivers/soc/fsl/qbman + +A simplified view of the dpaa_eth interfaces mapped to FMan MACs: + + dpaa_eth /eth0\ ... /ethN\ + driver | | | | + ------------- ---- ----------- ---- ------------- + -Ports / Tx Rx \ ... / Tx Rx \ + FMan | | | | + -MACs | MAC0 | | MACN | + / dtsec0 \ ... 
/ dtsecN \ (or tgec) + / \ / \(or memac) + --------- -------------- --- -------------- --------- + FMan, FMan Port, FMan SP, FMan MURAM drivers + --------------------------------------------------------- + FMan HW blocks: MURAM, MACs, Ports, SP + --------------------------------------------------------- + +The dpaa_eth relation to the QMan, BMan and FMan: + ________________________________ + dpaa_eth / eth0 \ + driver / \ + --------- -^- -^- -^- --- --------- + QMan driver / \ / \ / \ \ / | BMan | + |Rx | |Rx | |Tx | |Tx | | driver | + --------- |Dfl| |Err| |Cnf| |FQs| | | + QMan HW |FQ | |FQ | |FQs| | | | | + / \ / \ / \ \ / | | + --------- --- --- --- -v- --------- + | FMan QMI | | + | FMan HW FMan BMI | BMan HW | + ----------------------- -------- + +where the acronyms used above (and in the code) are: +DPAA = Data Path Acceleration Architecture +FMan = DPAA Frame Manager +QMan = DPAA Queue Manager +BMan = DPAA Buffers Manager +QMI = QMan interface in FMan +BMI = BMan interface in FMan +FMan SP = FMan Storage Profiles +MURAM = Multi-user RAM in FMan +FQ = QMan Frame Queue +Rx Dfl FQ = default reception FQ +Rx Err FQ = Rx error frames FQ +Tx Cnf FQ = Tx confirmation FQs +Tx FQs = transmission frame queues +dtsec = datapath three speed Ethernet controller (10/100/1000 Mbps) +tgec = ten gigabit Ethernet controller (10 Gbps) +memac = multirate Ethernet MAC (10/100/1000/10000) + +DPAA Ethernet Supported SoCs +============================ + +The DPAA drivers enable the Ethernet controllers present on the following SoCs: + +# PPC +P1023 +P2041 +P3041 +P4080 +P5020 +P5040 +T1023 +T1024 +T1040 +T1042 +T2080 +T4240 +B4860 + +# ARM +LS1043A +LS1046A + +Configuring DPAA Ethernet in your kernel +======================================== + +To enable the DPAA Ethernet driver, the following Kconfig options are required: + +# common for arch/arm64 and arch/powerpc platforms +CONFIG_FSL_DPAA=y +CONFIG_FSL_FMAN=y +CONFIG_FSL_DPAA_ETH=y +CONFIG_FSL_XGMAC_MDIO=y + +# for 
arch/powerpc only +CONFIG_FSL_PAMU=y + +# common options needed for the PHYs used on the RDBs +CONFIG_VITESSE_PHY=y +CONFIG_REALTEK_PHY=y +CONFIG_AQUANTIA_PHY=y + +DPAA Ethernet Frame Processing +============================== + +On Rx, buffers for the incoming frames are retrieved from one of the three +existing buffers pools. The driver initializes and seeds these, each with +buffers of different sizes: 1KB, 2KB and 4KB. + +On Tx, all transmitted frames are returned to the driver through Tx +confirmation frame queues. The driver is then responsible for freeing the +buffers. In order to do this properly, a backpointer is added to the buffer +before transmission that points to the skb. When the buffer returns to the +driver on a confirmation FQ, the skb can be correctly consumed. + +DPAA Ethernet Features +====================== + +Currently the DPAA Ethernet driver enables the basic features required for +a Linux Ethernet driver. The support for advanced features will be added +gradually. + +The driver has Rx and Tx checksum offloading for UDP and TCP. Currently the Rx +checksum offload feature is enabled by default and cannot be controlled through +ethtool. + +The driver has support for multiple prioritized Tx traffic classes. Priorities +range from 0 (lowest) to 3 (highest). These are mapped to HW workqueues with +strict priority levels. Each traffic class contains NR_CPU TX queues. By +default, only one traffic class is enabled and the lowest priority Tx queues +are used. Higher priority traffic classes can be enabled with the mqprio +qdisc. For example, all four traffic classes are enabled on an interface with +the following command. 
Furthermore, skb priority levels are mapped to traffic +classes as follows: + + * priorities 0 to 3 - traffic class 0 (low priority) + * priorities 4 to 7 - traffic class 1 (medium-low priority) + * priorities 8 to 11 - traffic class 2 (medium-high priority) + * priorities 12 to 15 - traffic class 3 (high priority) + +tc qdisc add dev <int> root handle 1: \ + mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1 + +Debugging +========= + +The following statistics are exported for each interface through ethtool: + + - interrupt count per CPU + - Rx packets count per CPU + - Tx packets count per CPU + - Tx confirmed packets count per CPU + - Tx S/G frames count per CPU + - Tx error count per CPU + - Rx error count per CPU + - Rx error count per type + - congestion related statistics: + - congestion status + - time spent in congestion + - number of times the device entered congestion + - dropped packets count per cause + +The driver also exports the following information in sysfs: + + - the FQ IDs for each FQ type + /sys/devices/platform/dpaa-ethernet.0/net/<int>/fqids + + - the IDs of the buffer pools in use + /sys/devices/platform/dpaa-ethernet.0/net/<int>/bpids diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt index 59f4db2a0c85..f55639d71d35 100644 --- a/Documentation/networking/scaling.txt +++ b/Documentation/networking/scaling.txt @@ -122,7 +122,7 @@ associated flow of the packet. The hash is either provided by hardware or will be computed in the stack. Capable hardware can pass the hash in the receive descriptor for the packet; this would usually be the same hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in -skb->rx_hash and can be used elsewhere in the stack as a hash of the +skb->hash and can be used elsewhere in the stack as a hash of the packet’s flow.
Each receive hardware queue has an associated list of CPUs to which diff --git a/Documentation/networking/tcp.txt b/Documentation/networking/tcp.txt index bdc4c0db51e1..9c7139d57e57 100644 --- a/Documentation/networking/tcp.txt +++ b/Documentation/networking/tcp.txt @@ -1,7 +1,7 @@ TCP protocol ============ -Last updated: 9 February 2008 +Last updated: 3 June 2017 Contents ======== @@ -29,18 +29,19 @@ As of 2.6.13, Linux supports pluggable congestion control algorithms. A congestion control mechanism can be registered through functions in tcp_cong.c. The functions used by the congestion control mechanism are registered via passing a tcp_congestion_ops struct to -tcp_register_congestion_control. As a minimum name, ssthresh, -cong_avoid must be valid. +tcp_register_congestion_control. As a minimum, the congestion control +mechanism must provide a valid name and must implement either ssthresh, +cong_avoid and undo_cwnd hooks or the "omnipotent" cong_control hook. Private data for a congestion control mechanism is stored in tp->ca_priv. tcp_ca(tp) returns a pointer to this space. This is preallocated space - it is important to check the size of your private data will fit this space, or -alternatively space could be allocated elsewhere and a pointer to it could +alternatively, space could be allocated elsewhere and a pointer to it could be stored here. There are three kinds of congestion control algorithms currently: The simplest ones are derived from TCP reno (highspeed, scalable) and just -provide an alternative the congestion window calculation. More complex +provide an alternative congestion window calculation. More complex ones like BIC try to look at other events to provide better heuristics. There are also round trip time based algorithms like Vegas and Westwood+. @@ -49,21 +50,15 @@ Good TCP congestion control is a complex problem because the algorithm needs to maintain fairness and performance. 
Please review current research and RFC's before developing new modules. -The method that is used to determine which congestion control mechanism is -determined by the setting of the sysctl net.ipv4.tcp_congestion_control. -The default congestion control will be the last one registered (LIFO); -so if you built everything as modules, the default will be reno. If you -build with the defaults from Kconfig, then CUBIC will be builtin (not a -module) and it will end up the default. +The default congestion control mechanism is chosen based on the +DEFAULT_TCP_CONG Kconfig parameter. If you really want a particular default +value then you can set it using sysctl net.ipv4.tcp_congestion_control. The +module will be autoloaded if needed and you will get the expected protocol. If +you ask for an unknown congestion method, then the sysctl attempt will fail. -If you really want a particular default value then you will need -to set it with the sysctl. If you use a sysctl, the module will be autoloaded -if needed and you will get the expected protocol. If you ask for an -unknown congestion method, then the sysctl attempt will fail. - -If you remove a tcp congestion control module, then you will get the next +If you remove a TCP congestion control module, then you will get the next available one. Since reno cannot be built as a module, and cannot be -deleted, it will always be available. +removed, it will always be available. How the new TCP output machine [nyi] works. =========================================== diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt index cbc1b46cbf70..e89e36ec15a5 100644 --- a/Documentation/scheduler/sched-deadline.txt +++ b/Documentation/scheduler/sched-deadline.txt @@ -7,6 +7,8 @@ CONTENTS 0. WARNING 1. Overview 2. Scheduling algorithm + 2.1 Main algorithm + 2.2 Bandwidth reclaiming 3. 
Scheduling Real-Time Tasks 3.1 Definitions 3.2 Schedulability Analysis for Uniprocessor Systems @@ -44,6 +46,9 @@ CONTENTS 2. Scheduling algorithm ================== +2.1 Main algorithm +------------------ + SCHED_DEADLINE uses three parameters, named "runtime", "period", and "deadline", to schedule tasks. A SCHED_DEADLINE task should receive "runtime" microseconds of execution time every "period" microseconds, and @@ -113,6 +118,160 @@ CONTENTS remaining runtime = remaining runtime + runtime +2.2 Bandwidth reclaiming +------------------------ + + Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy + Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled + when flag SCHED_FLAG_RECLAIM is set. + + The following diagram illustrates the state names for tasks handled by GRUB: + + ------------ + (d) | Active | + ------------->| | + | | Contending | + | ------------ + | A | + ---------- | | + | | | | + | Inactive | |(b) | (a) + | | | | + ---------- | | + A | V + | ------------ + | | Active | + --------------| Non | + (c) | Contending | + ------------ + + A task can be in one of the following states: + + - ActiveContending: if it is ready for execution (or executing); + + - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag + time; + + - Inactive: if it is blocked and has surpassed the 0-lag time. + + State transitions: + + (a) When a task blocks, it does not become immediately inactive since its + bandwidth cannot be immediately reclaimed without breaking the + real-time guarantees. It therefore enters a transitional state called + ActiveNonContending. The scheduler arms the "inactive timer" to fire at + the 0-lag time, when the task's bandwidth can be reclaimed without + breaking the real-time guarantees. 
+ The 0-lag time for a task entering the ActiveNonContending state is + computed as + + (runtime * dl_period) + deadline - --------------------- + dl_runtime + + where runtime is the remaining runtime, while dl_runtime and dl_period + are the reservation parameters. + + (b) If the task wakes up before the inactive timer fires, the task re-enters + the ActiveContending state and the "inactive timer" is canceled. + In addition, if the task wakes up on a different runqueue, then + the task's utilization must be removed from the previous runqueue's active + utilization and must be added to the new runqueue's active utilization. + In order to avoid races between a task waking up on a runqueue while the + "inactive timer" is running on a different CPU, the "dl_non_contending" + flag is used to indicate that a task is not on a runqueue but is active + (so, the flag is set when the task blocks and is cleared when the + "inactive timer" fires or when the task wakes up). + + (c) When the "inactive timer" fires, the task enters the Inactive state and + its utilization is removed from the runqueue's active utilization. + + (d) When an inactive task wakes up, it enters the ActiveContending state and + its utilization is added to the active utilization of the runqueue where + it has been enqueued. + + For each runqueue, the algorithm GRUB keeps track of two different bandwidths: + + - Active bandwidth (running_bw): this is the sum of the bandwidths of all + tasks in active state (i.e., ActiveContending or ActiveNonContending); + + - Total bandwidth (this_bw): this is the sum of all tasks "belonging" to the + runqueue, including the tasks in Inactive state. + + + The algorithm reclaims the bandwidth of the tasks in Inactive state. + It does so by decrementing the runtime of the executing task Ti at a pace equal + to + + dq = -max{ Ui, (1 - Uinact) } dt + + where Uinact is the inactive utilization, computed as (this_bw - running_bw), + and Ui is the bandwidth of task Ti.
+ + Let's now see a trivial example of two deadline tasks with runtime equal + to 4 and period equal to 8 (i.e., bandwidth equal to 0.5): + + A Task T1 + | + | | + | | + |-------- |---- + | | V + |---|---|---|---|---|---|---|---|--------->t + 0 1 2 3 4 5 6 7 8 + + + A Task T2 + | + | | + | | + | ------------------------| + | | V + |---|---|---|---|---|---|---|---|--------->t + 0 1 2 3 4 5 6 7 8 + + + A running_bw + | + 1 ----------------- ------ + | | | + 0.5- ----------------- + | | + |---|---|---|---|---|---|---|---|--------->t + 0 1 2 3 4 5 6 7 8 + + + - Time t = 0: + + Both tasks are ready for execution and therefore in ActiveContending state. + Suppose Task T1 is the first task to start execution. + Since there are no inactive tasks, its runtime is decreased as dq = -1 dt. + + - Time t = 2: + + Suppose that task T1 blocks. + Task T1 therefore enters the ActiveNonContending state. Since its remaining + runtime is equal to 2, its 0-lag time is equal to t = 4. + Task T2 starts execution, with runtime still decreased as dq = -1 dt since + there are no inactive tasks. + + - Time t = 4: + + This is the 0-lag time for Task T1. Since it has not woken up in the + meantime, it enters the Inactive state. Its bandwidth is removed from + running_bw. + Task T2 continues its execution. However, its runtime is now decreased as + dq = - 0.5 dt because Uinact = 0.5. + Task T2 therefore reclaims the bandwidth unused by Task T1. + + - Time t = 8: + + Task T1 wakes up. It enters the ActiveContending state again, and the + running_bw is incremented. + + 3. Scheduling Real-Time Tasks ============================= @@ -330,6 +489,15 @@ CONTENTS 14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for Global EDF. Proceedings of the 22nd Euromicro Conference on Real-Time Systems, 2010. + 15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in + constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time + Systems, 2000. + 16 - L. Abeni, J. Lelli, C.
Scordino, L. Palopoli, Greedy CPU reclaiming for + SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS), + Dusseldorf, Germany, 2014. + 17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel + or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied + Computing, 2016. 4. Bandwidth management diff --git a/Documentation/sound/hd-audio/models.rst b/Documentation/sound/hd-audio/models.rst index 5338673c88d9..773d2bfacc6c 100644 --- a/Documentation/sound/hd-audio/models.rst +++ b/Documentation/sound/hd-audio/models.rst @@ -16,6 +16,8 @@ ALC880 6-jack in back, 2-jack in front 6stack-digout 6-jack with a SPDIF out +6stack-automute + 6-jack with headphone jack detection ALC260 ====== @@ -62,6 +64,8 @@ lenovo-dock Enables docking station I/O for some Lenovos hp-gpio-led GPIO LED support on HP laptops +hp-dock-gpio-mic1-led + HP dock with mic LED support dell-headset-multi Headset jack, which can also be used as mic-in dell-headset-dock @@ -72,6 +76,12 @@ alc283-sense-combo Combo jack sensing on ALC283 tpt440-dock Pin configs for Lenovo Thinkpad Dock support +tpt440 + Lenovo Thinkpad T440s setup +tpt460 + Lenovo Thinkpad T460/560 setup +dual-codecs + Lenovo laptops with dual codecs ALC66x/67x/892 ============== @@ -97,6 +107,8 @@ inv-dmic Inverted internal mic workaround dell-headset-multi Headset jack, which can also be used as mic-in +dual-codecs + Lenovo laptops with dual codecs ALC680 ====== @@ -114,6 +126,8 @@ inv-dmic Inverted internal mic workaround no-primary-hp VAIO Z/VGC-LN51JGB workaround (for fixed speaker DAC) +dual-codecs + ALC1220 dual codecs for Gaming mobos ALC861/660 ========== @@ -206,65 +220,47 @@ auto Conexant 5045 ============= -laptop-hpsense - Laptop with HP sense (old model laptop) -laptop-micsense - Laptop with Mic sense (old model fujitsu) -laptop-hpmicsense - Laptop with HP and Mic senses -benq - Benq R55E -laptop-hp530 - HP 530 laptop -test - for testing/debugging purpose, almost all 
controls can be - adjusted. Appearing only when compiled with $CONFIG_SND_DEBUG=y +cap-mix-amp + Fix max input level on mixer widget +toshiba-p105 + Toshiba P105 quirk +hp-530 + HP 530 quirk Conexant 5047 ============= -laptop - Basic Laptop config -laptop-hp - Laptop config for some HP models (subdevice 30A5) -laptop-eapd - Laptop config with EAPD support -test - for testing/debugging purpose, almost all controls can be - adjusted. Appearing only when compiled with $CONFIG_SND_DEBUG=y +cap-mix-amp + Fix max input level on mixer widget Conexant 5051 ============= -laptop - Basic Laptop config (default) -hp - HP Spartan laptop -hp-dv6736 - HP dv6736 -hp-f700 - HP Compaq Presario F700 -ideapad - Lenovo IdeaPad laptop -toshiba - Toshiba Satellite M300 +lenovo-x200 + Lenovo X200 quirk Conexant 5066 ============= -laptop - Basic Laptop config (default) -hp-laptop - HP laptops, e g G60 -asus - Asus K52JU, Lenovo G560 -dell-laptop - Dell laptops -dell-vostro - Dell Vostro -olpc-xo-1_5 - OLPC XO 1.5 -ideapad - Lenovo IdeaPad U150 +stereo-dmic + Workaround for inverted stereo digital mic +gpio1 + Enable GPIO1 pin +headphone-mic-pin + Enable headphone mic NID 0x18 without detection +tp410 + Thinkpad T400 & co quirks thinkpad - Lenovo Thinkpad + Thinkpad mute/mic LED quirk +lemote-a1004 + Lemote A1004 quirk +lemote-a1205 + Lemote A1205 quirk +olpc-xo + OLPC XO quirk +mute-led-eapd + Mute LED control via EAPD +hp-dock + HP dock support +mute-led-gpio + Mute LED control via GPIO STAC9200 ======== @@ -444,6 +440,8 @@ dell-eq Dell desktops/laptops alienware Alienware M17x +asus-mobo + Pin configs for ASUS mobo with 5.1/SPDIF out auto BIOS setup (default) @@ -477,6 +475,8 @@ hp-envy-ts-bass Pin fixup for HP Envy TS bass speaker (NID 0x10) hp-bnb13-eq Hardware equalizer setup for HP laptops +hp-envy-ts-bass + HP Envy TS bass support auto BIOS setup (default) @@ -496,10 +496,22 @@ auto Cirrus Logic CS4206/4207 ======================== +mbp53 + MacBook Pro 5,3 mbp55 MacBook Pro 5,5 
imac27 IMac 27 Inch +imac27_122 + iMac 12,2 +apple + Generic Apple quirk +mbp101 + MacBookPro 10,1 +mbp81 + MacBookPro 8,1 +mba42 + MacBookAir 4,2 auto BIOS setup (default) @@ -509,6 +521,10 @@ mba6 MacBook Air 6,1 and 6,2 gpio0 Enable GPIO 0 amp +mbp11 + MacBookPro 11,2 +macmini + MacMini 7,1 auto BIOS setup (default) diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt index 6eaf576294f3..2dcaf9adb7a7 100644 --- a/Documentation/timers/NO_HZ.txt +++ b/Documentation/timers/NO_HZ.txt @@ -194,32 +194,9 @@ that the RCU callbacks are processed in a timely fashion. Another approach is to offload RCU callback processing to "rcuo" kthreads using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to -offload may be selected via several methods: - -1. One of three mutually exclusive Kconfig options specify a - build-time default for the CPUs to offload: - - a. The CONFIG_RCU_NOCB_CPU_NONE=y Kconfig option results in - no CPUs being offloaded. - - b. The CONFIG_RCU_NOCB_CPU_ZERO=y Kconfig option causes - CPU 0 to be offloaded. - - c. The CONFIG_RCU_NOCB_CPU_ALL=y Kconfig option causes all - CPUs to be offloaded. Note that the callbacks will be - offloaded to "rcuo" kthreads, and that those kthreads - will in fact run on some CPU. However, this approach - gives fine-grained control on exactly which CPUs the - callbacks run on, along with their scheduling priority - (including the default of SCHED_OTHER), and it further - allows this control to be varied dynamically at runtime. - -2. The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated - list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1, - 3, 4, and 5. The specified CPUs will be offloaded in addition to - any CPUs specified as offloaded by CONFIG_RCU_NOCB_CPU_ZERO=y or - CONFIG_RCU_NOCB_CPU_ALL=y. This means that the "rcu_nocbs=" boot - parameter has no effect for kernels built with RCU_NOCB_CPU_ALL=y. 
+offload may be selected using The "rcu_nocbs=" kernel boot parameter,
+which takes a comma-separated list of CPUs and CPU ranges, for example,
+"1,3-5" selects CPUs 1, 3, 4, and 5.
 
 The offloaded CPUs will never queue RCU callbacks, and therefore RCU
 never prevents offloaded CPUs from entering either dyntick-idle mode
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 94a987bd2bc5..fff8ff6d4893 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -1609,7 +1609,7 @@ Doing the same with chrt -r 5 and function-trace set.
   <idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update
   <idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz
   <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock
-  <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit
+  <idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit
   <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
   <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit
   <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel
diff --git a/Documentation/usb/typec.rst b/Documentation/usb/typec.rst
index b67a46779de9..8a7249f2ff04 100644
--- a/Documentation/usb/typec.rst
+++ b/Documentation/usb/typec.rst
@@ -114,8 +114,7 @@ the details during registration.
 The class offers the following API for registering/unregistering cables and their plugs:
 
 .. kernel-doc:: drivers/usb/typec/typec.c
-   :functions: typec_register_cable typec_unregister_cable typec_register_plug
-     typec_unregister_plug
+   :functions: typec_register_cable typec_unregister_cable typec_register_plug typec_unregister_plug
 
 The class will provide a handle to struct typec_cable and struct typec_plug if
 the registration is successful, or NULL if it isn't.
@@ -137,8 +136,7 @@ during connection of a partner or cable, the port driver must use the following
 APIs to report it to the class:
 
 ..
kernel-doc:: drivers/usb/typec/typec.c
-   :functions: typec_set_data_role typec_set_pwr_role typec_set_vconn_role
-     typec_set_pwr_opmode
+   :functions: typec_set_data_role typec_set_pwr_role typec_set_vconn_role typec_set_pwr_opmode
 
 Alternate Modes
 ~~~~~~~~~~~~~~~
diff --git a/Documentation/watchdog/watchdog-parameters.txt b/Documentation/watchdog/watchdog-parameters.txt
index 4f7d86dd0a5d..914518aeb972 100644
--- a/Documentation/watchdog/watchdog-parameters.txt
+++ b/Documentation/watchdog/watchdog-parameters.txt
@@ -117,7 +117,7 @@ nowayout: Watchdog cannot be stopped once started
 -------------------------------------------------
 iTCO_wdt:
 heartbeat: Watchdog heartbeat in seconds.
-    (2<heartbeat<39 (TCO v1) or 613 (TCO v2), default=30)
+    (5<=heartbeat<=74 (TCO v1) or 1226 (TCO v2), default=30)
 nowayout: Watchdog cannot be stopped once started
     (default=kernel config parameter)
 -------------------------------------------------