19 files changed, 924 insertions, 233 deletions
diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index 35bff92cc773..e91a2eb19592 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -27,9 +27,10 @@ Explanation of the Linux-Kernel Memory Consistency Model
   19. AND THEN THERE WAS ALPHA
   20. THE HAPPENS-BEFORE RELATION: hb
   21. THE PROPAGATES-BEFORE RELATION: pb
-  22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
+  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
   23. LOCKING
-  24. ODDS AND ENDS
+  24. PLAIN ACCESSES AND DATA RACES
+  25. ODDS AND ENDS
 
 
 
@@ -205,7 +206,7 @@ goes like this:
 	P0 stores 1 to buf before storing 1 to flag, since it executes
 	its instructions in order.
 
-	Since an instruction (in this case, P1's store to flag) cannot
+	Since an instruction (in this case, P0's store to flag) cannot
 	execute before itself, the specified outcome is impossible.
 
 However, real computer hardware almost never follows the Sequential
@@ -354,31 +355,25 @@ be extremely complex.
 Optimizing compilers have great freedom in the way they translate
 source code to object code.  They are allowed to apply transformations
 that add memory accesses, eliminate accesses, combine them, split them
-into pieces, or move them around.  Faced with all these possibilities,
-the LKMM basically gives up.  It insists that the code it analyzes
-must contain no ordinary accesses to shared memory; all accesses must
-be performed using READ_ONCE(), WRITE_ONCE(), or one of the other
-atomic or synchronization primitives.  These primitives prevent a
-large number of compiler optimizations.  In particular, it is
-guaranteed that the compiler will not remove such accesses from the
-generated code (unless it can prove the accesses will never be
-executed), it will not change the order in which they occur in the
-code (within limits imposed by the C standard), and it will not
-introduce extraneous accesses.
-
-This explains why the MP and SB examples above used READ_ONCE() and
-WRITE_ONCE() rather than ordinary memory accesses.  Thanks to this
-usage, we can be certain that in the MP example, P0's write event to
-buf really is po-before its write event to flag, and similarly for the
-other shared memory accesses in the examples.
-
-Private variables are not subject to this restriction.  Since they are
-not shared between CPUs, they can be accessed normally without
-READ_ONCE() or WRITE_ONCE(), and there will be no ill effects.  In
-fact, they need not even be stored in normal memory at all -- in
-principle a private variable could be stored in a CPU register (hence
-the convention that these variables have names starting with the
-letter 'r').
+into pieces, or move them around.  The use of READ_ONCE(), WRITE_ONCE(),
+or one of the other atomic or synchronization primitives prevents a
+large number of compiler optimizations.  In particular, it is guaranteed
+that the compiler will not remove such accesses from the generated code
+(unless it can prove the accesses will never be executed), it will not
+change the order in which they occur in the code (within limits imposed
+by the C standard), and it will not introduce extraneous accesses.
+
+The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather
+than ordinary memory accesses.  Thanks to this usage, we can be certain
+that in the MP example, the compiler won't reorder P0's write event to
+buf and P0's write event to flag, and similarly for the other shared
+memory accesses in the examples.
+
+Since private variables are not shared between CPUs, they can be
+accessed normally without READ_ONCE() or WRITE_ONCE().  In fact, they
+need not even be stored in normal memory at all -- in principle a
+private variable could be stored in a CPU register (hence the convention
+that these variables have names starting with the letter 'r').
 
 
 A WARNING
@@ -424,7 +419,7 @@ example:
 
 The object code might call f(5) either before or after g(6); the
 memory model cannot assume there is a fixed program order relation
-between them.  (In fact, if the functions are inlined then the
+between them.  (In fact, if the function calls are inlined then the
 compiler might even interleave their object code.)
 
 
@@ -504,7 +499,7 @@ different CPUs (external reads-from, or rfe).
 
 For our purposes, a memory location's initial value is treated as
 though it had been written there by an imaginary initial store that
-executes on a separate CPU before the program runs.
+executes on a separate CPU before the main program runs.
 
 Usage of the rf relation implicitly assumes that loads will always
 read from a single store.  It doesn't apply properly in the presence
@@ -862,7 +857,7 @@ outlined above.  These restrictions involve the necessity of
 maintaining cache coherence and the fact that a CPU can't operate on a
 value before it knows what that value is, among other things.
 
-The formal version of the LKMM is defined by five requirements, or
+The formal version of the LKMM is defined by six requirements, or
 axioms:
 
 	Sequential consistency per variable: This requires that the
@@ -882,10 +877,14 @@ axioms:
 	grace periods obey the rules of RCU, in particular, the
 	Grace-Period Guarantee.
 
+	Plain-coherence: This requires that plain memory accesses
+	(those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey
+	the operational model's rules regarding cache coherence.
+
 The first and second are quite common; they can be found in many
 memory models (such as those for C11/C++11).  The "happens-before" and
 "propagation" axioms have analogs in other memory models as well.  The
-"rcu" axiom is specific to the LKMM.
+"rcu" and "plain-coherence" axioms are specific to the LKMM.
 
 Each of these axioms is discussed below.
 
@@ -960,7 +959,7 @@ atomic update.  This is what the LKMM's "atomic" axiom says.
 THE PRESERVED PROGRAM ORDER RELATION: ppo
 -----------------------------------------
 
-There are many situations where a CPU is obligated to execute two
+There are many situations where a CPU is obliged to execute two
 instructions in program order.  We amalgamate them into the ppo (for
 "preserved program order") relation, which links the po-earlier
 instruction to the po-later instruction and is thus a sub-relation of
@@ -1302,7 +1301,7 @@ followed by an arbitrary number of cumul-fence links, ending with an
 rfe link.  You can concoct more exotic examples, containing more than
 one fence, although this quickly leads to diminishing returns in terms
 of complexity.  For instance, here's an example containing a coe link
-followed by two fences and an rfe link, utilizing the fact that
+followed by two cumul-fences and an rfe link, utilizing the fact that
 release fences are A-cumulative:
 
 	int x, y, z;
@@ -1334,10 +1333,10 @@ If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop
 link from P0's store to its load.  This is because P0's store gets
 overwritten by P1's store since x = 2 at the end (a coe link), the
 smp_wmb() ensures that P1's store to x propagates to P2 before the
-store to y does (the first fence), the store to y propagates to P2
+store to y does (the first cumul-fence), the store to y propagates to P2
 before P2's load and store execute, P2's smp_store_release()
 guarantees that the stores to x and y both propagate to P0 before the
-store to z does (the second fence), and P0's load executes after the
+store to z does (the second cumul-fence), and P0's load executes after the
 store to z has propagated to P0 (an rfe link).
 
 In summary, the fact that the hb relation links memory access events
@@ -1430,8 +1429,8 @@ they execute means that it cannot have cycles.  This requirement is
 the content of the LKMM's "propagation" axiom.
 
 
-RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
-----------------------------------------------------
+RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
+------------------------------------------------------------------------
 
 RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
 rests on two concepts: grace periods and read-side critical sections.
@@ -1446,17 +1445,19 @@ As far as memory models are concerned, RCU's main feature is its
 Grace-Period Guarantee, which states that a critical section can never
 span a full grace period.  In more detail, the Guarantee says:
 
-	If a critical section starts before a grace period then it
-	must end before the grace period does.  In addition, every
-	store that propagates to the critical section's CPU before the
-	end of the critical section must propagate to every CPU before
-	the end of the grace period.
+	For any critical section C and any grace period G, at least
+	one of the following statements must hold:
+
+(1)	C ends before G does, and in addition, every store that
+	propagates to C's CPU before the end of C must propagate to
+	every CPU before G ends.
+
+(2)	G starts before C does, and in addition, every store that
+	propagates to G's CPU before the start of G must propagate
+	to every CPU before C starts.
 
-	If a critical section ends after a grace period ends then it
-	must start after the grace period does.  In addition, every
-	store that propagates to the grace period's CPU before the
-	start of the grace period must propagate to every CPU before
-	the start of the critical section.
+In particular, it is not possible for a critical section to both start
+before and end after a grace period.
 
 Here is a simple example of RCU in action:
 
@@ -1483,10 +1484,11 @@ The Grace Period Guarantee tells us that when this code runs, it will
 never end with r1 = 1 and r2 = 0.  The reasoning is as follows.  r1 = 1
 means that P0's store to x propagated to P1 before P1 called
 synchronize_rcu(), so P0's critical section must have started before
-P1's grace period.  On the other hand, r2 = 0 means that P0's store to
-y, which occurs before the end of the critical section, did not
-propagate to P1 before the end of the grace period, violating the
-Guarantee.
+P1's grace period, contrary to part (2) of the Guarantee.  On the
+other hand, r2 = 0 means that P0's store to y, which occurs before the
+end of the critical section, did not propagate to P1 before the end of
+the grace period, contrary to part (1).  Together the results violate
+the Guarantee.
 
 In the kernel's implementations of RCU, the requirements for stores
 to propagate to every CPU are fulfilled by placing strong fences at
@@ -1504,11 +1506,11 @@ before" or "ends after" a grace period?  Some aspects of the meaning
 are pretty obvious, as in the example above, but the details aren't
 entirely clear.  The LKMM formalizes this notion by means of the
 rcu-link relation.  rcu-link encompasses a very general notion of
-"before": Among other things, X ->rcu-link Z includes cases where X
-happens-before or is equal to some event Y which is equal to or comes
-before Z in the coherence order.  When Y = Z this says that X ->rfe Z
-implies X ->rcu-link Z.  In addition, when Y = X it says that X ->fr Z
-and X ->co Z each imply X ->rcu-link Z.
+"before": If E and F are RCU fence events (i.e., rcu_read_lock(),
+rcu_read_unlock(), or synchronize_rcu()) then among other things,
+E ->rcu-link F includes cases where E is po-before some memory-access
+event X, F is po-after some memory-access event Y, and we have any of
+X ->rfe Y, X ->co Y, or X ->fr Y.
 
 The formal definition of the rcu-link relation is more than a little
 obscure, and we won't give it here.  It is closely related to the pb
@@ -1516,171 +1518,192 @@ relation, and the details don't matter unless you want to comb through
 a somewhat lengthy formal proof.  Pretty much all you need to know
 about rcu-link is the information in the preceding paragraph.
 
-The LKMM also defines the gp and rscs relations.  They bring grace
-periods and read-side critical sections into the picture, in the
+The LKMM also defines the rcu-gp and rcu-rscsi relations.  They bring
+grace periods and read-side critical sections into the picture, in the
 following way:
 
-	E ->gp F means there is a synchronize_rcu() fence event S such
-	that E ->po S and either S ->po F or S = F.  In simple terms,
-	there is a grace period po-between E and F.
+	E ->rcu-gp F means that E and F are in fact the same event,
+	and that event is a synchronize_rcu() fence (i.e., a grace
+	period).
 
-	E ->rscs F means there is a critical section delimited by an
-	rcu_read_lock() fence L and an rcu_read_unlock() fence U, such
-	that E ->po U and either L ->po F or L = F.  You can think of
-	this as saying that E and F are in the same critical section
-	(in fact, it also allows E to be po-before the start of the
-	critical section and F to be po-after the end).
+	E ->rcu-rscsi F means that E and F are the rcu_read_unlock()
+	and rcu_read_lock() fence events delimiting some read-side
+	critical section.  (The 'i' at the end of the name emphasizes
+	that this relation is "inverted": It links the end of the
+	critical section to the start.)
 
 If we think of the rcu-link relation as standing for an extended
-"before", then X ->gp Y ->rcu-link Z says that X executes before a
-grace period which ends before Z executes.  (In fact it covers more
-than this, because it also includes cases where X executes before a
-grace period and some store propagates to Z's CPU before Z executes
-but doesn't propagate to some other CPU until after the grace period
-ends.)  Similarly, X ->rscs Y ->rcu-link Z says that X is part of (or
-before the start of) a critical section which starts before Z
-executes.
-
-The LKMM goes on to define the rcu-fence relation as a sequence of gp
-and rscs links separated by rcu-link links, in which the number of gp
-links is >= the number of rscs links.  For example:
+"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a
+grace period which ends before Z begins.  (In fact it covers more than
+this, because it also includes cases where some store propagates to
+Z's CPU before Z begins but doesn't propagate to some other CPU until
+after X ends.)  Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is
+the end of a critical section which starts before Z begins.
+
+The LKMM goes on to define the rcu-order relation as a sequence of
+rcu-gp and rcu-rscsi links separated by rcu-link links, in which the
+number of rcu-gp links is >= the number of rcu-rscsi links.  For
+example:
 
-	X ->gp Y ->rcu-link Z ->rscs T ->rcu-link U ->gp V
+	X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
 
-would imply that X ->rcu-fence V, because this sequence contains two
-gp links and only one rscs link.  (It also implies that X ->rcu-fence T
-and Z ->rcu-fence V.)  On the other hand:
+would imply that X ->rcu-order V, because this sequence contains two
+rcu-gp links and one rcu-rscsi link.  (It also implies that
+X ->rcu-order T and Z ->rcu-order V.)  On the other hand:
 
-	X ->rscs Y ->rcu-link Z ->rscs T ->rcu-link U ->gp V
+	X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
 
-does not imply X ->rcu-fence V, because the sequence contains only
-one gp link but two rscs links.
+does not imply X ->rcu-order V, because the sequence contains only
+one rcu-gp link but two rcu-rscsi links.
 
-The rcu-fence relation is important because the Grace Period Guarantee
-means that rcu-fence acts kind of like a strong fence.  In particular,
-if W is a write and we have W ->rcu-fence Z, the Guarantee says that W
-will propagate to every CPU before Z executes.
+The rcu-order relation is important because the Grace Period Guarantee
+means that rcu-order links act kind of like strong fences.  In
+particular, E ->rcu-order F implies not only that E begins before F
+ends, but also that any write po-before E will propagate to every CPU
+before any instruction po-after F can execute.  (However, it does not
+imply that E must execute before F; in fact, each synchronize_rcu()
+fence event is linked to itself by rcu-order as a degenerate case.)
 
 To prove this in full generality requires some intellectual effort.
 We'll consider just a very simple case:
 
-	W ->gp X ->rcu-link Y ->rscs Z.
+	G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F.
+
+This formula means that G and W are the same event (a grace period),
+and there are events X, Y and a read-side critical section C such that:
+
+	1. G = W is po-before or equal to X;
+
+	2. X comes "before" Y in some sense (including rfe, co and fr);
 
-This formula means that there is a grace period G and a critical
-section C such that:
+	3. Y is po-before Z;
 
-	1. W is po-before G;
+	4. Z is the rcu_read_unlock() event marking the end of C;
 
-	2. X is equal to or po-after G;
+	5. F is the rcu_read_lock() event marking the start of C.
 
-	3. X comes "before" Y in some sense;
+From 1 - 4 we deduce that the grace period G ends before the critical
+section C.  Then part (2) of the Grace Period Guarantee says not only
+that G starts before C does, but also that any write which executes on
+G's CPU before G starts must propagate to every CPU before C starts.
+In particular, the write propagates to every CPU before F finishes
+executing and hence before any instruction po-after F can execute.
+This sort of reasoning can be extended to handle all the situations
+covered by rcu-order.
 
-	4. Y is po-before the end of C;
+The rcu-fence relation is a simple extension of rcu-order.  While
+rcu-order only links certain fence events (calls to synchronize_rcu(),
+rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events
+that are separated by an rcu-order link.  This is analogous to the way
+the strong-fence relation links events that are separated by an
+smp_mb() fence event (as mentioned above, rcu-order links act kind of
+like strong fences).  Written symbolically, X ->rcu-fence Y means
+there are fence events E and F such that:
 
-	5. Z is equal to or po-after the start of C.
+	X ->po E ->rcu-order F ->po Y.
 
-From 2 - 4 we deduce that the grace period G ends before the critical
-section C.  Then the second part of the Grace Period Guarantee says
-not only that G starts before C does, but also that W (which executes
-on G's CPU before G starts) must propagate to every CPU before C
-starts.  In particular, W propagates to every CPU before Z executes
-(or finishes executing, in the case where Z is equal to the
-rcu_read_lock() fence event which starts C.)  This sort of reasoning
-can be expanded to handle all the situations covered by rcu-fence.
+From the discussion above, we see this implies not only that X
+executes before Y, but also (if X is a store) that X propagates to
+every CPU before Y executes.  Thus rcu-fence is sort of a
+"super-strong" fence: Unlike the original strong fences (smp_mb() and
+synchronize_rcu()), rcu-fence is able to link events on different
+CPUs.  (Perhaps this fact should lead us to say that rcu-fence isn't
+really a fence at all!)
 
 Finally, the LKMM defines the RCU-before (rb) relation in terms of
 rcu-fence.  This is done in essentially the same way as the pb
 relation was defined in terms of strong-fence.  We will omit the
-details; the end result is that E ->rb F implies E must execute before
-F, just as E ->pb F does (and for much the same reasons).
+details; the end result is that E ->rb F implies E must execute
+before F, just as E ->pb F does (and for much the same reasons).
 
 Putting this all together, the LKMM expresses the Grace Period
 Guarantee by requiring that the rb relation does not contain a cycle.
-Equivalently, this "rcu" axiom requires that there are no events E and
-F with E ->rcu-link F ->rcu-fence E.  Or to put it a third way, the
-axiom requires that there are no cycles consisting of gp and rscs
-alternating with rcu-link, where the number of gp links is >= the
-number of rscs links.
+Equivalently, this "rcu" axiom requires that there are no events E
+and F with E ->rcu-link F ->rcu-order E.  Or to put it a third way,
+the axiom requires that there are no cycles consisting of rcu-gp and
+rcu-rscsi alternating with rcu-link, where the number of rcu-gp links
+is >= the number of rcu-rscsi links.
 
 Justifying the axiom isn't easy, but it is in fact a valid
 formalization of the Grace Period Guarantee.  We won't attempt to go
 through the detailed argument, but the following analysis gives a
-taste of what is involved.  Suppose we have a violation of the first
-part of the Guarantee: A critical section starts before a grace
-period, and some store propagates to the critical section's CPU before
-the end of the critical section but doesn't propagate to some other
-CPU until after the end of the grace period.
+taste of what is involved.  Suppose both parts of the Guarantee are
+violated: A critical section starts before a grace period, and some
+store propagates to the critical section's CPU before the end of the
+critical section but doesn't propagate to some other CPU until after
+the end of the grace period.
 
 Putting symbols to these ideas, let L and U be the rcu_read_lock() and
 rcu_read_unlock() fence events delimiting the critical section in
 question, and let S be the synchronize_rcu() fence event for the grace
 period.  Saying that the critical section starts before S means there
-are events E and F where E is po-after L (which marks the start of the
-critical section), E is "before" F in the sense of the rcu-link
-relation, and F is po-before the grace period S:
+are events Q and R where Q is po-after L (which marks the start of the
+critical section), Q is "before" R in the sense used by the rcu-link
+relation, and R is po-before the grace period S.  Thus we have:
 
-	L ->po E ->rcu-link F ->po S.
+	L ->rcu-link S.
 
-Let W be the store mentioned above, let Z come before the end of the
+Let W be the store mentioned above, let Y come before the end of the
 critical section and witness that W propagates to the critical
-section's CPU by reading from W, and let Y on some arbitrary CPU be a
-witness that W has not propagated to that CPU, where Y happens after
+section's CPU by reading from W, and let Z on some arbitrary CPU be a
+witness that W has not propagated to that CPU, where Z happens after
 some event X which is po-after S.  Symbolically, this amounts to:
 
-	S ->po X ->hb* Y ->fr W ->rf Z ->po U.
+	S ->po X ->hb* Z ->fr W ->rf Y ->po U.
 
-The fr link from Y to W indicates that W has not propagated to Y's CPU
-at the time that Y executes.  From this, it can be shown (see the
-discussion of the rcu-link relation earlier) that X and Z are related
-by rcu-link, yielding:
+The fr link from Z to W indicates that W has not propagated to Z's CPU
+at the time that Z executes.  From this, it can be shown (see the
+discussion of the rcu-link relation earlier) that S and U are related
+by rcu-link:
 
-	S ->po X ->rcu-link Z ->po U.
+	S ->rcu-link U.
 
-The formulas say that S is po-between F and X, hence F ->gp X.  They
-also say that Z comes before the end of the critical section and E
-comes after its start, hence Z ->rscs E.  From all this we obtain:
+Since S is a grace period we have S ->rcu-gp S, and since L and U are
+the start and end of the critical section C we have U ->rcu-rscsi L.
+From this we obtain:
 
-	F ->gp X ->rcu-link Z ->rscs E ->rcu-link F,
+	S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S,
 
 a forbidden cycle.  Thus the "rcu" axiom rules out this violation of
 the Grace Period Guarantee.
 
 For something a little more down-to-earth, let's see how the axiom
 works out in practice.  Consider the RCU code example from above, this
-time with statement labels added to the memory access instructions:
+time with statement labels added:
 
 	int x, y;
 
 	P0()
 	{
-		rcu_read_lock();
-		W: WRITE_ONCE(x, 1);
-		X: WRITE_ONCE(y, 1);
-		rcu_read_unlock();
+		L: rcu_read_lock();
+		X: WRITE_ONCE(x, 1);
+		Y: WRITE_ONCE(y, 1);
+		U: rcu_read_unlock();
 	}
 
 	P1()
 	{
 		int r1, r2;
 
-		Y: r1 = READ_ONCE(x);
-		synchronize_rcu();
-		Z: r2 = READ_ONCE(y);
+		Z: r1 = READ_ONCE(x);
+		S: synchronize_rcu();
+		W: r2 = READ_ONCE(y);
 	}
 
 
-If r2 = 0 at the end then P0's store at X overwrites the value that
-P1's load at Z reads from, so we have Z ->fre X and thus Z ->rcu-link X.
-In addition, there is a synchronize_rcu() between Y and Z, so therefore
-we have Y ->gp Z.
+If r2 = 0 at the end then P0's store at Y overwrites the value that
+P1's load at W reads from, so we have W ->fre Y.  Since S ->po W and
+also Y ->po U, we get S ->rcu-link U.  In addition, S ->rcu-gp S
+because S is a grace period.
 
-If r1 = 1 at the end then P1's load at Y reads from P0's store at W,
-so we have W ->rcu-link Y.  In addition, W and X are in the same critical
-section, so therefore we have X ->rscs W.
+If r1 = 1 at the end then P1's load at Z reads from P0's store at X,
+so we have X ->rfe Z.  Together with L ->po X and Z ->po S, this
+yields L ->rcu-link S.  And since L and U are the start and end of a
+critical section, we have U ->rcu-rscsi L.
 
-Then X ->rscs W ->rcu-link Y ->gp Z ->rcu-link X is a forbidden cycle,
-violating the "rcu" axiom.  Hence the outcome is not allowed by the
-LKMM, as we would expect.
+Then U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a
+forbidden cycle, violating the "rcu" axiom.  Hence the outcome is not
+allowed by the LKMM, as we would expect.
 
 For contrast, let's see what can happen in a more complicated example:
 
@@ -1690,51 +1713,52 @@ For contrast, let's see what can happen in a more complicated example:
 	{
 		int r0;
 
-		rcu_read_lock();
-		W: r0 = READ_ONCE(x);
-		X: WRITE_ONCE(y, 1);
-		rcu_read_unlock();
+		L0: rcu_read_lock();
+		    r0 = READ_ONCE(x);
+		    WRITE_ONCE(y, 1);
+		U0: rcu_read_unlock();
 	}
 
 	P1()
 	{
 		int r1;
 
-		Y: r1 = READ_ONCE(y);
-		synchronize_rcu();
-		Z: WRITE_ONCE(z, 1);
+		    r1 = READ_ONCE(y);
+		S1: synchronize_rcu();
+		    WRITE_ONCE(z, 1);
 	}
 
 	P2()
 	{
 		int r2;
 
-		rcu_read_lock();
-		U: r2 = READ_ONCE(z);
-		V: WRITE_ONCE(x, 1);
-		rcu_read_unlock();
+		L2: rcu_read_lock();
+		    r2 = READ_ONCE(z);
+		    WRITE_ONCE(x, 1);
+		U2: rcu_read_unlock();
 	}
 
 If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows
-that W ->rscs X ->rcu-link Y ->gp Z ->rcu-link U ->rscs V ->rcu-link W.
-However this cycle is not forbidden, because the sequence of relations
-contains fewer instances of gp (one) than of rscs (two).  Consequently
-the outcome is allowed by the LKMM.  The following instruction timing
-diagram shows how it might actually occur:
+that U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi
+L2 ->rcu-link U0.  However this cycle is not forbidden, because the
+sequence of relations contains fewer instances of rcu-gp (one) than of
+rcu-rscsi (two).  Consequently the outcome is allowed by the LKMM.
+The following instruction timing diagram shows how it might actually
+occur:
 
 P0			P1			P2
 --------------------	--------------------	--------------------
 rcu_read_lock()
-X: WRITE_ONCE(y, 1)
-			Y: r1 = READ_ONCE(y)
+WRITE_ONCE(y, 1)
+			r1 = READ_ONCE(y)
 			synchronize_rcu() starts
 			.			rcu_read_lock()
-			.			V: WRITE_ONCE(x, 1)
-W: r0 = READ_ONCE(x)	.
+			.			WRITE_ONCE(x, 1)
+r0 = READ_ONCE(x)	.
 rcu_read_unlock()	.
 			synchronize_rcu() ends
-			Z: WRITE_ONCE(z, 1)
-						U: r2 = READ_ONCE(z)
+			WRITE_ONCE(z, 1)
+						r2 = READ_ONCE(z)
 						rcu_read_unlock()
 
 This requires P0 and P2 to execute their loads and stores out of
@@ -1744,6 +1768,15 @@ section in P0 both starts before P1's grace period does and ends
 before it does, and the critical section in P2 both starts after P1's
 grace period does and ends after it does.
 
+Addendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in
+addition to normal RCU.  The ideas involved are much the same as
+above, with new relations srcu-gp and srcu-rscsi added to represent
+SRCU grace periods and read-side critical sections.  There is a
+restriction on the srcu-gp and srcu-rscsi links that can appear in an
+rcu-order sequence (the srcu-rscsi links must be paired with srcu-gp
+links having the same SRCU domain with proper nesting); the details
+are relatively unimportant.
+
 
 LOCKING
 -------
@@ -1886,6 +1919,521 @@ architectures supported by the Linux kernel, albeit for various
 differing reasons.
 
 
+PLAIN ACCESSES AND DATA RACES
+-----------------------------
+
+In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
+smp_load_acquire(&z), and so on are collectively referred to as
+"marked" accesses, because they are all annotated with special
+operations of one kind or another.  Ordinary C-language memory
+accesses such as x or y = 0 are simply called "plain" accesses.
+
+Early versions of the LKMM had nothing to say about plain accesses.
+The C standard allows compilers to assume that the variables affected
+by plain accesses are not concurrently read or written by any other
+threads or CPUs.  This leaves compilers free to implement all manner
+of transformations or optimizations of code containing plain accesses,
+making such code very difficult for a memory model to handle.
+
+Here is just one example of a possible pitfall:
+
+	int a = 6;
+	int *x = &a;
+
+	P0()
+	{
+		int *r1;
+		int r2 = 0;
+
+		r1 = x;
+		if (r1 != NULL)
+			r2 = READ_ONCE(*r1);
+	}
+
+	P1()
+	{
+		WRITE_ONCE(x, NULL);
+	}
+
+On the face of it, one would expect that when this code runs, the only
+possible final values for r2 are 6 and 0, depending on whether or not
+P1's store to x propagates to P0 before P0's load from x executes.
+But since P0's load from x is a plain access, the compiler may decide
+to carry out the load twice (for the comparison against NULL, then again
+for the READ_ONCE()) and eliminate the temporary variable r1.  The
+object code generated for P0 could therefore end up looking rather
+like this:
+
+	P0()
+	{
+		int r2 = 0;
+
+		if (x != NULL)
+			r2 = READ_ONCE(*x);
+	}
+
+And now it is obvious that this code runs the risk of dereferencing a
+NULL pointer, because P1's store to x might propagate to P0 after the
+test against NULL has been made but before the READ_ONCE() executes.
+If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
+the compiler would not have performed this optimization and there
+would be no possibility of a NULL-pointer dereference.
+
+Given the possibility of transformations like this one, the LKMM
+doesn't try to predict all possible outcomes of code containing plain
+accesses.  It is instead content to determine whether the code
+violates the compiler's assumptions, which would render the ultimate
+outcome undefined.
+
+In technical terms, the compiler is allowed to assume that when the
+program executes, there will not be any data races.  A "data race"
+occurs when two conflicting memory accesses execute concurrently;
+two memory accesses "conflict" if:
+
+	they access the same location,
+
+	they occur on different CPUs (or in different threads on the
+	same CPU),
+
+	at least one of them is a plain access,
+
+	and at least one of them is a store.
+
+The LKMM tries to determine whether a program contains two conflicting
+accesses which may execute concurrently; if it does then the LKMM says
+there is a potential data race and makes no predictions about the
+program's outcome.
+
+Determining whether two accesses conflict is easy; you can see that
+all the concepts involved in the definition above are already part of
+the memory model.  The hard part is telling whether they may execute
+concurrently.  The LKMM takes a conservative attitude, assuming that
+accesses may be concurrent unless it can prove they cannot.
+
+If two memory accesses aren't concurrent then one must execute before
+the other.  Therefore the LKMM decides two accesses aren't concurrent
+if they can be connected by a sequence of hb, pb, and rb links
+(together referred to as xb, for "executes before").  However, there
+are two complicating factors.
+
+If X is a load and X executes before a store Y, then indeed there is
+no danger of X and Y being concurrent.  After all, Y can't have any
+effect on the value obtained by X until the memory subsystem has
+propagated Y from its own CPU to X's CPU, which won't happen until
+some time after Y executes and thus after X executes.  But if X is a
+store, then even if X executes before Y it is still possible that X
+will propagate to Y's CPU just as Y is executing.  In such a case X
+could very well interfere somehow with Y, and we would have to
+consider X and Y to be concurrent.
+
+Therefore when X is a store, for X and Y to be non-concurrent the LKMM
+requires not only that X must execute before Y but also that X must
+propagate to Y's CPU before Y executes.  (Or vice versa, of course, if
+Y executes before X -- then Y must propagate to X's CPU before X
+executes if Y is a store.)  This is expressed by the visibility
+relation (vis), where X ->vis Y is defined to hold if there is an
+intermediate event Z such that:
+
+	X is connected to Z by a possibly empty sequence of
+	cumul-fence links followed by an optional rfe link (if none of
+	these links are present, X and Z are the same event),
+
+and either:
+
+	Z is connected to Y by a strong-fence link followed by a
+	possibly empty sequence of xb links,
+
+or:
+
+	Z is on the same CPU as Y and is connected to Y by a possibly
+	empty sequence of xb links (again, if the sequence is empty it
+	means Z and Y are the same event).
+
+The motivations behind this definition are straightforward:
+
+	cumul-fence memory barriers force stores that are po-before
+	the barrier to propagate to other CPUs before stores that are
+	po-after the barrier.
+
+	An rfe link from an event W to an event R says that R reads
+	from W, which certainly means that W must have propagated to
+	R's CPU before R executed.
+
+	strong-fence memory barriers force stores that are po-before
+	the barrier, or that propagate to the barrier's CPU before the
+	barrier executes, to propagate to all CPUs before any events
+	po-after the barrier can execute.
+
+To see how this works out in practice, consider our old friend, the MP
+pattern (with fences and statement labels, but without the conditional
+test):
+
+	int buf = 0, flag = 0;
+
+	P0()
+	{
+		X: WRITE_ONCE(buf, 1);
+		   smp_wmb();
+		W: WRITE_ONCE(flag, 1);
+	}
+
+	P1()
+	{
+		int r1;
+		int r2 = 0;
+
+		Z: r1 = READ_ONCE(flag);
+		   smp_rmb();
+		Y: r2 = READ_ONCE(buf);
+	}
+
+The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
+assuming r1 = 1 at the end, there is an rfe link from W to Z.  This
+means that the store to buf must propagate from P0 to P1 before Z
+executes.  Next, Z and Y are on the same CPU and the smp_rmb() fence
+provides an xb link from Z to Y (i.e., it forces Z to execute before
+Y).  Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
+executes.
+
+The second complicating factor mentioned above arises from the fact
+that when we are considering data races, some of the memory accesses
+are plain.  Now, although we have not said so explicitly, up to this
+point most of the relations defined by the LKMM (ppo, hb, prop,
+cumul-fence, pb, and so on -- including vis) apply only to marked
+accesses.
+
+There are good reasons for this restriction.  The compiler is not
+allowed to apply fancy transformations to marked accesses, and
+consequently each such access in the source code corresponds more or
+less directly to a single machine instruction in the object code.  But
+plain accesses are a different story; the compiler may combine them,
+split them up, duplicate them, eliminate them, invent new ones, and
+who knows what else.  Seeing a plain access in the source code tells
+you almost nothing about what machine instructions will end up in the
+object code.
+
+Fortunately, the compiler isn't completely free; it is subject to some
+limitations.  For one, it is not allowed to introduce a data race into
+the object code if the source code does not already contain a data
+race (if it could, memory models would be useless and no multithreaded
+code would be safe!).  For another, it cannot move a plain access past
+a compiler barrier.
+
+A compiler barrier is a kind of fence, but as the name implies, it
+only affects the compiler; it does not necessarily have any effect on
+how instructions are executed by the CPU.  In Linux kernel source
+code, the barrier() function is a compiler barrier.  It doesn't give
+rise directly to any machine instructions in the object code; rather,
+it affects how the compiler generates the rest of the object code.
+Given source code like this:
+
+	... some memory accesses ...
+	barrier();
+	... some other memory accesses ...
+
+the barrier() function ensures that the machine instructions
+corresponding to the first group of accesses will all end po-before
+any machine instructions corresponding to the second group of accesses
+-- even if some of the accesses are plain.  (Of course, the CPU may
+then execute some of those accesses out of program order, but we
+already know how to deal with such issues.)  Without the barrier()
+there would be no such guarantee; the two groups of accesses could be
+intermingled or even reversed in the object code.
+
+The LKMM doesn't say much about the barrier() function, but it does
+require that all fences are also compiler barriers.  In addition, it
+requires that the ordering properties of memory barriers such as
+smp_rmb() or smp_store_release() apply to plain accesses as well as to
+marked accesses.
+
+This is the key to analyzing data races.  Consider the MP pattern
+again, now using plain accesses for buf:
+
+	int buf = 0, flag = 0;
+
+	P0()
+	{
+		U: buf = 1;
+		   smp_wmb();
+		X: WRITE_ONCE(flag, 1);
+	}
+
+	P1()
+	{
+		int r1;
+		int r2 = 0;
+
+		Y: r1 = READ_ONCE(flag);
+		   if (r1) {
+			   smp_rmb();
+			V: r2 = buf;
+		   }
+	}
+
+This program does not contain a data race.  Although the U and V
+accesses conflict, the LKMM can prove they are not concurrent as
+follows:
+
+	The smp_wmb() fence in P0 is both a compiler barrier and a
+	cumul-fence.  It guarantees that no matter what hash of
+	machine instructions the compiler generates for the plain
+	access U, all those instructions will be po-before the fence.
+	Consequently U's store to buf, no matter how it is carried out
+	at the machine level, must propagate to P1 before X's store to
+	flag does.
+
+	X and Y are both marked accesses.  Hence an rfe link from X to
+	Y is a valid indicator that X propagated to P1 before Y
+	executed, i.e., X ->vis Y.  (And if there is no rfe link then
+	r1 will be 0, so V will not be executed and ipso facto won't
+	race with U.)
+
+	The smp_rmb() fence in P1 is a compiler barrier as well as a
+	fence.  It guarantees that all the machine-level instructions
+	corresponding to the access V will be po-after the fence, and
+	therefore any loads among those instructions will execute
+	after the fence does and hence after Y does.
+
+Thus U's store to buf is forced to propagate to P1 before V's load
+executes (assuming V does execute), ruling out the possibility of a
+data race between them.
+
+This analysis illustrates how the LKMM deals with plain accesses in
+general.  Suppose R is a plain load and we want to show that R
+executes before some marked access E.  We can do this by finding a
+marked access X such that R and X are ordered by a suitable fence and
+X ->xb* E.  If E was also a plain access, we would also look for a
+marked access Y such that X ->xb* Y, and Y and E are ordered by a
+fence.  We describe this arrangement by saying that R is
+"post-bounded" by X and E is "pre-bounded" by Y.
+
+In fact, we go one step further: Since R is a read, we say that R is
+"r-post-bounded" by X.  Similarly, E would be "r-pre-bounded" or
+"w-pre-bounded" by Y, depending on whether E was a store or a load.
+This distinction is needed because some fences affect only loads
+(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
+the two types of bounds are the same.  And as a degenerate case, we
+say that a marked access pre-bounds and post-bounds itself (e.g., if R
+above were a marked load then X could simply be taken to be R itself.)
+
+The need to distinguish between r- and w-bounding raises yet another
+issue.  When the source code contains a plain store, the compiler is
+allowed to put plain loads of the same location into the object code.
+For example, given the source code:
+
+	x = 1;
+
+the compiler is theoretically allowed to generate object code that
+looks like:
+
+	if (x != 1)
+		x = 1;
+
+thereby adding a load (and possibly replacing the store entirely).
+For this reason, whenever the LKMM requires a plain store to be
+w-pre-bounded or w-post-bounded by a marked access, it also requires
+the store to be r-pre-bounded or r-post-bounded, so as to handle cases
+where the compiler adds a load.
+
+(This may be overly cautious.  We don't know of any examples where a
+compiler has augmented a store with a load in this fashion, and the
+Linux kernel developers would probably fight pretty hard to change a
+compiler if it ever did this.  Still, better safe than sorry.)
+
+Incidentally, the other tranformation -- augmenting a plain load by
+adding in a store to the same location -- is not allowed.  This is
+because the compiler cannot know whether any other CPUs might perform
+a concurrent load from that location.  Two concurrent loads don't
+constitute a race (they can't interfere with each other), but a store
+does race with a concurrent load.  Thus adding a store might create a
+data race where one was not already present in the source code,
+something the compiler is forbidden to do.  Augmenting a store with a
+load, on the other hand, is acceptable because doing so won't create a
+data race unless one already existed.
+
+The LKMM includes a second way to pre-bound plain accesses, in
+addition to fences: an address dependency from a marked load.  That
+is, in the sequence:
+
+	p = READ_ONCE(ptr);
+	r = *p;
+
+the LKMM says that the marked load of ptr pre-bounds the plain load of
+*p; the marked load must execute before any of the machine
+instructions corresponding to the plain load.  This is a reasonable
+stipulation, since after all, the CPU can't perform the load of *p
+until it knows what value p will hold.  Furthermore, without some
+assumption like this one, some usages typical of RCU would count as
+data races.  For example:
+
+	int a = 1, b;
+	int *ptr = &a;
+
+	P0()
+	{
+		b = 2;
+		rcu_assign_pointer(ptr, &b);
+	}
+
+	P1()
+	{
+		int *p;
+		int r;
+
+		rcu_read_lock();
+		p = rcu_dereference(ptr);
+		r = *p;
+		rcu_read_unlock();
+	}
+
+(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
+really do anything, because there aren't any grace periods.  They are
+included merely for the sake of good form; typically P0 would call
+synchronize_rcu() somewhere after the rcu_assign_pointer().)
+
+rcu_assign_pointer() performs a store-release, so the plain store to b
+is definitely w-post-bounded before the store to ptr, and the two
+stores will propagate to P1 in that order.  However, rcu_dereference()
+is only equivalent to READ_ONCE().  While it is a marked access, it is
+not a fence or compiler barrier.  Hence the only guarantee we have
+that the load of ptr in P1 is r-pre-bounded before the load of *p
+(thus avoiding a race) is the assumption about address dependencies.
+
+This is a situation where the compiler can undermine the memory model,
+and a certain amount of care is required when programming constructs
+like this one.  In particular, comparisons between the pointer and
+other known addresses can cause trouble.  If you have something like:
+
+	p = rcu_dereference(ptr);
+	if (p == &x)
+		r = *p;
+
+then the compiler just might generate object code resembling:
+
+	p = rcu_dereference(ptr);
+	if (p == &x)
+		r = x;
+
+or even:
+
+	rtemp = x;
+	p = rcu_dereference(ptr);
+	if (p == &x)
+		r = rtemp;
+
+which would invalidate the memory model's assumption, since the CPU
+could now perform the load of x before the load of ptr (there might be
+a control dependency but no address dependency at the machine level).
+
+Finally, it turns out there is a situation in which a plain write does
+not need to be w-post-bounded: when it is separated from the
+conflicting access by a fence.  At first glance this may seem
+impossible.  After all, to be conflicting the second access has to be
+on a different CPU from the first, and fences don't link events on
+different CPUs.  Well, normal fences don't -- but rcu-fence can!
+Here's an example:
+
+	int x, y;
+
+	P0()
+	{
+		WRITE_ONCE(x, 1);
+		synchronize_rcu();
+		y = 3;
+	}
+
+	P1()
+	{
+		rcu_read_lock();
+		if (READ_ONCE(x) == 0)
+			y = 2;
+		rcu_read_unlock();
+	}
+
+Do the plain stores to y race?  Clearly not if P1 reads a non-zero
+value for x, so let's assume the READ_ONCE(x) does obtain 0.  This
+means that the read-side critical section in P1 must finish executing
+before the grace period in P0 does, because RCU's Grace-Period
+Guarantee says that otherwise P0's store to x would have propagated to
+P1 before the critical section started and so would have been visible
+to the READ_ONCE().  (Another way of putting it is that the fre link
+from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
+between those two events.)
+
+This means there is an rcu-fence link from P1's "y = 2" store to P0's
+"y = 3" store, and consequently the first must propagate from P1 to P0
+before the second can execute.  Therefore the two stores cannot be
+concurrent and there is no race, even though P1's plain store to y
+isn't w-post-bounded by any marked accesses.
+
+Putting all this material together yields the following picture.  For
+two conflicting stores W and W', where W ->co W', the LKMM says the
+stores don't race if W can be linked to W' by a
+
+	w-post-bounded ; vis ; w-pre-bounded
+
+sequence.  If W is plain then they also have to be linked by an
+
+	r-post-bounded ; xb* ; w-pre-bounded
+
+sequence, and if W' is plain then they also have to be linked by a
+
+	w-post-bounded ; vis ; r-pre-bounded
+
+sequence.  For a conflicting load R and store W, the LKMM says the two
+accesses don't race if R can be linked to W by an
+
+	r-post-bounded ; xb* ; w-pre-bounded
+
+sequence or if W can be linked to R by a
+
+	w-post-bounded ; vis ; r-pre-bounded
+
+sequence.  For the cases involving a vis link, the LKMM also accepts
+sequences in which W is linked to W' or R by a
+
+	strong-fence ; xb* ; {w and/or r}-pre-bounded
+
+sequence with no post-bounding, and in every case the LKMM also allows
+the link simply to be a fence with no bounding at all.  If no sequence
+of the appropriate sort exists, the LKMM says that the accesses race.
+
+There is one more part of the LKMM related to plain accesses (although
+not to data races) we should discuss.  Recall that many relations such
+as hb are limited to marked accesses only.  As a result, the
+happens-before, propagates-before, and rcu axioms (which state that
+various relation must not contain a cycle) doesn't apply to plain
+accesses.  Nevertheless, we do want to rule out such cycles, because
+they don't make sense even for plain accesses.
+
+To this end, the LKMM imposes three extra restrictions, together
+called the "plain-coherence" axiom because of their resemblance to the
+rules used by the operational model to ensure cache coherence (that
+is, the rules governing the memory subsystem's choice of a store to
+satisfy a load request and its determination of where a store will
+fall in the coherence order):
+
+	If R and W conflict and it is possible to link R to W by one
+	of the xb* sequences listed above, then W ->rfe R is not
+	allowed (i.e., a load cannot read from a store that it
+	executes before, even if one or both is plain).
+
+	If W and R conflict and it is possible to link W to R by one
+	of the vis sequences listed above, then R ->fre W is not
+	allowed (i.e., if a store is visible to a load then the load
+	must read from that store or one coherence-after it).
+
+	If W and W' conflict and it is possible to link W to W' by one
+	of the vis sequences listed above, then W' ->co W is not
+	allowed (i.e., if one store is visible to a second then the
+	second must come after the first in the coherence order).
+
+This is the extent to which the LKMM deals with plain accesses.
+Perhaps it could say more (for example, plain accesses might
+contribute to the ppo relation), but at the moment it seems that this
+minimal, conservative approach is good enough.
+
+
 ODDS AND ENDS
 -------------
 
@@ -1933,6 +2481,16 @@ treated as READ_ONCE() and rcu_assign_pointer() is treated as
 smp_store_release() -- which is basically how the Linux kernel treats
 them.
 
+Although we said that plain accesses are not linked by the ppo
+relation, they do contribute to it indirectly.  Namely, when there is
+an address dependency from a marked load R to a plain store W,
+followed by smp_wmb() and then a marked store W', the LKMM creates a
+ppo link from R to W'.  The reasoning behind this is perhaps a little
+shaky, but essentially it says there is no way to generate object code
+for this source code in which W' could execute before R.  Just as with
+pre-bounding by address dependencies, it is possible for the compiler
+to undermine this relation if sufficient care is not taken.
+
 There are a few oddball fences which need special treatment:
 smp_mb__before_atomic(), smp_mb__after_atomic(), and
 smp_mb__after_spinlock().  The LKMM uses fence events with special
diff --git a/tools/memory-model/README b/tools/memory-model/README
index 0f2c366518c6..fc07b52f2028 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -20,13 +20,17 @@ that litmus test to be exercised within the Linux kernel.
 REQUIREMENTS
 ============
 
-Version 7.49 of the "herd7" and "klitmus7" tools must be downloaded
-separately:
+Version 7.52 or higher of the "herd7" and "klitmus7" tools must be
+downloaded separately:
 
   https://github.com/herd/herdtools7
 
 See "herdtools7/INSTALL.md" for installation instructions.
 
+Note that although these tools usually provide backwards compatibility,
+this is not absolutely guaranteed.  Therefore, if a later version does
+not work, please try using the exact version called out above.
+
 
 ==================
 BASIC USAGE: HERD7
@@ -163,15 +167,15 @@ scripts	Various scripts, see scripts/README.
 LIMITATIONS
 ===========
 
-The Linux-kernel memory model has the following limitations:
+The Linux-kernel memory model (LKMM) has the following limitations:
 
-1.	Compiler optimizations are not modeled.  Of course, the use
-	of READ_ONCE() and WRITE_ONCE() limits the compiler's ability
-	to optimize, but there is Linux-kernel code that uses bare C
-	memory accesses.  Handling this code is on the to-do list.
-	For more information, see Documentation/explanation.txt (in
-	particular, the "THE PROGRAM ORDER RELATION: po AND po-loc"
-	and "A WARNING" sections).
+1.	Compiler optimizations are not accurately modeled.  Of course,
+	the use of READ_ONCE() and WRITE_ONCE() limits the compiler's
+	ability to optimize, but under some circumstances it is possible
+	for the compiler to undermine the memory model.  For more
+	information, see Documentation/explanation.txt (in particular,
+	the "THE PROGRAM ORDER RELATION: po AND po-loc" and "A WARNING"
+	sections).
 
 	Note that this limitation in turn limits LKMM's ability to
 	accurately model address, control, and data dependencies.
@@ -221,8 +225,29 @@ The Linux-kernel memory model has the following limitations:
 		additional call_rcu() process to the site of the
 		emulated rcu-barrier().
 
-	e.	Sleepable RCU (SRCU) is not modeled.  It can be
-		emulated, but perhaps not simply.
+	e.	Although sleepable RCU (SRCU) is now modeled, there
+		are some subtle differences between its semantics and
+		those in the Linux kernel.  For example, the kernel
+		might interpret the following sequence as two partially
+		overlapping SRCU read-side critical sections:
+
+			 1  r1 = srcu_read_lock(&my_srcu);
+			 2  do_something_1();
+			 3  r2 = srcu_read_lock(&my_srcu);
+			 4  do_something_2();
+			 5  srcu_read_unlock(&my_srcu, r1);
+			 6  do_something_3();
+			 7  srcu_read_unlock(&my_srcu, r2);
+
+		In contrast, LKMM will interpret this as a nested pair of
+		SRCU read-side critical sections, with the outer critical
+		section spanning lines 1-7 and the inner critical section
+		spanning lines 3-5.
+
+		This difference would be more of a concern had anyone
+		identified a reasonable use case for partially overlapping
+		SRCU read-side critical sections.  For more information,
+		please see: https://paulmck.livejournal.com/40593.html
 
 	f.	Reader-writer locking is not modeled.  It can be
 		emulated in litmus tests using atomic read-modify-write
diff --git a/tools/memory-model/linux-kernel.bell b/tools/memory-model/linux-kernel.bell
index 796513362c05..5be86b1025e8 100644
--- a/tools/memory-model/linux-kernel.bell
+++ b/tools/memory-model/linux-kernel.bell
@@ -24,6 +24,7 @@ instructions RMW[{'once,'acquire,'release}]
 enum Barriers = 'wmb (*smp_wmb*) ||
 		'rmb (*smp_rmb*) ||
 		'mb (*smp_mb*) ||
+		'barrier (*barrier*) ||
 		'rcu-lock (*rcu_read_lock*)  ||
 		'rcu-unlock (*rcu_read_unlock*) ||
 		'sync-rcu (*synchronize_rcu*) ||
@@ -33,8 +34,14 @@ enum Barriers = 'wmb (*smp_wmb*) ||
 		'after-unlock-lock (*smp_mb__after_unlock_lock*)
 instructions F[Barriers]
 
+(* SRCU *)
+enum SRCU = 'srcu-lock || 'srcu-unlock || 'sync-srcu
+instructions SRCU[SRCU]
+(* All srcu events *)
+let Srcu = Srcu-lock | Srcu-unlock | Sync-srcu
+
 (* Compute matching pairs of nested Rcu-lock and Rcu-unlock *)
-let matched = let rec
+let rcu-rscs = let rec
 	    unmatched-locks = Rcu-lock \ domain(matched)
 	and unmatched-unlocks = Rcu-unlock \ range(matched)
 	and unmatched = unmatched-locks | unmatched-unlocks
@@ -46,8 +53,32 @@ let matched = let rec
 	in matched
 
 (* Validate nesting *)
-flag ~empty Rcu-lock \ domain(matched) as unbalanced-rcu-locking
-flag ~empty Rcu-unlock \ range(matched) as unbalanced-rcu-locking
+flag ~empty Rcu-lock \ domain(rcu-rscs) as unbalanced-rcu-locking
+flag ~empty Rcu-unlock \ range(rcu-rscs) as unbalanced-rcu-locking
+
+(* Compute matching pairs of nested Srcu-lock and Srcu-unlock *)
+let srcu-rscs = let rec
+	    unmatched-locks = Srcu-lock \ domain(matched)
+	and unmatched-unlocks = Srcu-unlock \ range(matched)
+	and unmatched = unmatched-locks | unmatched-unlocks
+	and unmatched-po = ([unmatched] ; po ; [unmatched]) & loc
+	and unmatched-locks-to-unlocks =
+		([unmatched-locks] ; po ; [unmatched-unlocks]) & loc
+	and matched = matched | (unmatched-locks-to-unlocks \
+		(unmatched-po ; unmatched-po))
+	in matched
+
+(* Validate nesting *)
+flag ~empty Srcu-lock \ domain(srcu-rscs) as unbalanced-srcu-locking
+flag ~empty Srcu-unlock \ range(srcu-rscs) as unbalanced-srcu-locking
+
+(* Check for use of synchronize_srcu() inside an RCU critical section *)
+flag ~empty rcu-rscs & (po ; [Sync-srcu] ; po) as invalid-sleep
+
+(* Validate SRCU dynamic match *)
+flag ~empty different-values(srcu-rscs) as srcu-bad-nesting
 
-(* Outermost level of nesting only *)
-let crit = matched \ (po^-1 ; matched ; po^-1)
+(* Compute marked and plain memory accesses *)
+let Marked = (~M) | IW | Once | Release | Acquire | domain(rmw) | range(rmw) |
+		LKR | LKW | UL | LF | RL | RU
+let Plain = M \ Marked
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 8f23c74a96fd..2a9b4fe4a84e 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -24,8 +24,14 @@ include "lock.cat"
 (* Basic relations *)
 (*******************)
 
+(* Release Acquire *)
+let acq-po = [Acquire] ; po ; [M]
+let po-rel = [M] ; po ; [Release]
+let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
+
 (* Fences *)
-let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
+let R4rmb = R \ Noreturn	(* Reads for which rmb works *)
+let rmb = [R4rmb] ; fencerel(Rmb) ; [R4rmb]
 let wmb = [W] ; fencerel(Wmb) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
@@ -33,14 +39,15 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
 	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
 		fencerel(After-unlock-lock) ; [M])
-let gp = po ; [Sync-rcu] ; po?
-
+let gp = po ; [Sync-rcu | Sync-srcu] ; po?
 let strong-fence = mb | gp
 
-(* Release Acquire *)
-let acq-po = [Acquire] ; po ; [M]
-let po-rel = [M] ; po ; [Release]
-let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
+let nonrw-fence = strong-fence | po-rel | acq-po
+let fence = nonrw-fence | wmb | rmb
+let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
+		Before-atomic | After-atomic | Acquire | Release |
+		Rcu-lock | Rcu-unlock | Srcu-lock | Srcu-unlock) |
+	(po ; [Release]) | ([Acquire] ; po)
 
 (**********************************)
 (* Fundamental coherence ordering *)
@@ -61,21 +68,22 @@ empty rmw & (fre ; coe) as atomic
 let dep = addr | data
 let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
-let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi)
-let fence = strong-fence | wmb | po-rel | rmb | acq-po
+let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
+let to-r = addr | (dep ; [Marked] ; rfi)
 let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
 
 (* Propagation: Ordering from release operations and strong fences. *)
-let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
-let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
+let A-cumul(r) = (rfe ; [Marked])? ; r
+let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb |
+	po-unlock-rf-lock-po) ; [Marked]
+let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ;
+	[Marked] ; rfe? ; [Marked]
 
 (*
  * Happens Before: Ordering from the passage of time.
  * No fences needed here for prop because relation confined to one process.
  *)
-let hb = ppo | rfe | ((prop \ id) & int)
+let hb = [Marked] ; (ppo | rfe | ((prop \ id) & int)) ; [Marked]
 acyclic hb as happens-before
 
 (****************************************)
@@ -83,7 +91,7 @@ acyclic hb as happens-before
 (****************************************)
 
 (* Propagation: Each non-rf link needs a strong fence. *)
-let pb = prop ; strong-fence ; hb*
+let pb = prop ; strong-fence ; hb* ; [Marked]
 acyclic pb as propagation
 
 (*******)
@@ -91,32 +99,51 @@ acyclic pb as propagation
 (*******)
 
 (*
- * Effect of read-side critical section proceeds from the rcu_read_lock()
- * onward on the one hand and from the rcu_read_unlock() backwards on the
- * other hand.
+ * Effects of read-side critical sections proceed from the rcu_read_unlock()
+ * or srcu_read_unlock() backwards on the one hand, and from the
+ * rcu_read_lock() or srcu_read_lock() forwards on the other hand.
+ *
+ * In the definition of rcu-fence below, the po term at the left-hand side
+ * of each disjunct and the po? term at the right-hand end have been factored
+ * out.  They have been moved into the definitions of rcu-link and rb.
+ * This was necessary in order to apply the "& loc" tests correctly.
  *)
-let rscs = po ; crit^-1 ; po?
+let rcu-gp = [Sync-rcu]		(* Compare with gp *)
+let srcu-gp = [Sync-srcu]
+let rcu-rscsi = rcu-rscs^-1
+let srcu-rscsi = srcu-rscs^-1
 
 (*
  * The synchronize_rcu() strong fence is special in that it can order not
  * one but two non-rf relations, but only in conjunction with an RCU
  * read-side critical section.
  *)
-let rcu-link = hb* ; pb* ; prop
+let rcu-link = po? ; hb* ; pb* ; prop ; po
 
 (*
  * Any sequence containing at least as many grace periods as RCU read-side
- * critical sections (joined by rcu-link) acts as a generalized strong fence.
+ * critical sections (joined by rcu-link) induces order like a generalized
+ * inter-CPU strong fence.
+ * Likewise for SRCU grace periods and read-side critical sections, provided
+ * the synchronize_srcu() and srcu_read_[un]lock() calls refer to the same
+ * struct srcu_struct location.
  *)
-let rec rcu-fence = gp |
-	(gp ; rcu-link ; rscs) |
-	(rscs ; rcu-link ; gp) |
-	(gp ; rcu-link ; rcu-fence ; rcu-link ; rscs) |
-	(rscs ; rcu-link ; rcu-fence ; rcu-link ; gp) |
-	(rcu-fence ; rcu-link ; rcu-fence)
+let rec rcu-order = rcu-gp | srcu-gp |
+	(rcu-gp ; rcu-link ; rcu-rscsi) |
+	((srcu-gp ; rcu-link ; srcu-rscsi) & loc) |
+	(rcu-rscsi ; rcu-link ; rcu-gp) |
+	((srcu-rscsi ; rcu-link ; srcu-gp) & loc) |
+	(rcu-gp ; rcu-link ; rcu-order ; rcu-link ; rcu-rscsi) |
+	((srcu-gp ; rcu-link ; rcu-order ; rcu-link ; srcu-rscsi) & loc) |
+	(rcu-rscsi ; rcu-link ; rcu-order ; rcu-link ; rcu-gp) |
+	((srcu-rscsi ; rcu-link ; rcu-order ; rcu-link ; srcu-gp) & loc) |
+	(rcu-order ; rcu-link ; rcu-order)
+let rcu-fence = po ; rcu-order ; po?
+let fence = fence | rcu-fence
+let strong-fence = strong-fence | rcu-fence
 
 (* rb orders instructions just as pb does *)
-let rb = prop ; rcu-fence ; hb* ; pb*
+let rb = prop ; rcu-fence ; hb* ; pb* ; [Marked]
 
 irreflexive rb as rcu
 
@@ -128,3 +155,49 @@ irreflexive rb as rcu
  * let xb = hb | pb | rb
  * acyclic xb as executes-before
  *)
+
+(*********************************)
+(* Plain accesses and data races *)
+(*********************************)
+
+(* Warn about plain writes and marked accesses in the same region *)
+let mixed-accesses = ([Plain & W] ; (po-loc \ barrier) ; [Marked]) |
+	([Marked] ; (po-loc \ barrier) ; [Plain & W])
+flag ~empty mixed-accesses as mixed-accesses
+
+(* Executes-before and visibility *)
+let xbstar = (hb | pb | rb)*
+let vis = cumul-fence* ; rfe? ; [Marked] ;
+	((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
+
+(* Boundaries for lifetimes of plain accesses *)
+let w-pre-bounded = [Marked] ; (addr | fence)?
+let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
+	([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
+let w-post-bounded = fence? ; [Marked]
+let r-post-bounded = (nonrw-fence | ([~Noreturn] ; fencerel(Rmb) ; [R4rmb]))? ;
+	[Marked]
+
+(* Visibility and executes-before for plain accesses *)
+let ww-vis = fence | (strong-fence ; xbstar ; w-pre-bounded) |
+	(w-post-bounded ; vis ; w-pre-bounded)
+let wr-vis = fence | (strong-fence ; xbstar ; r-pre-bounded) |
+	(w-post-bounded ; vis ; r-pre-bounded)
+let rw-xbstar = fence | (r-post-bounded ; xbstar ; w-pre-bounded)
+
+(* Potential races *)
+let pre-race = ext & ((Plain * M) | ((M \ IW) * Plain))
+
+(* Coherence requirements for plain accesses *)
+let wr-incoh = pre-race & rf & rw-xbstar^-1
+let rw-incoh = pre-race & fr & wr-vis^-1
+let ww-incoh = pre-race & co & ww-vis^-1
+empty (wr-incoh | rw-incoh | ww-incoh) as plain-coherence
+
+(* Actual races *)
+let ww-nonrace = ww-vis & ((Marked * W) | rw-xbstar) & ((W * Marked) | wr-vis)
+let ww-race = (pre-race & co) \ ww-nonrace
+let wr-race = (pre-race & (co? ; rf)) \ wr-vis \ rw-xbstar^-1
+let rw-race = (pre-race & fr) \ rw-xbstar
+
+flag ~empty (ww-race | wr-race | rw-race) as data-race
diff --git a/tools/memory-model/linux-kernel.def b/tools/memory-model/linux-kernel.def
index b27911cc087d..ef0f3c1850de 100644
--- a/tools/memory-model/linux-kernel.def
+++ b/tools/memory-model/linux-kernel.def
@@ -24,6 +24,7 @@ smp_mb__before_atomic() { __fence{before-atomic}; }
 smp_mb__after_atomic() { __fence{after-atomic}; }
 smp_mb__after_spinlock() { __fence{after-spinlock}; }
 smp_mb__after_unlock_lock() { __fence{after-unlock-lock}; }
+barrier() { __fence{barrier}; }
 
 // Exchange
 xchg(X,V)  __xchg{mb}(X,V)
@@ -47,6 +48,12 @@ rcu_read_unlock() { __fence{rcu-unlock}; }
 synchronize_rcu() { __fence{sync-rcu}; }
 synchronize_rcu_expedited() { __fence{sync-rcu}; }
 
+// SRCU
+srcu_read_lock(X)  __srcu{srcu-lock}(X)
+srcu_read_unlock(X,Y) { __srcu{srcu-unlock}(X,Y); }
+synchronize_srcu(X)  { __srcu{sync-srcu}(X); }
+synchronize_srcu_expedited(X)  { __srcu{sync-srcu}(X); }
+
 // Atomic
 atomic_read(X) READ_ONCE(*X)
 atomic_set(X,V) { WRITE_ONCE(*X,V); }
diff --git a/tools/memory-model/litmus-tests/MP+poonceonces.litmus b/tools/memory-model/litmus-tests/MP+poonceonces.litmus
index b2b60b84fb9d..172f0145301c 100644
--- a/tools/memory-model/litmus-tests/MP+poonceonces.litmus
+++ b/tools/memory-model/litmus-tests/MP+poonceonces.litmus
@@ -1,7 +1,7 @@
 C MP+poonceonces
 
 (*
- * Result: Maybe
+ * Result: Sometimes
  *
  * Can the counter-intuitive message-passing outcome be prevented with
  * no ordering at all?
diff --git a/tools/memory-model/litmus-tests/README b/tools/memory-model/litmus-tests/README
index 5ee08f129094..681f9067fa9e 100644
--- a/tools/memory-model/litmus-tests/README
+++ b/tools/memory-model/litmus-tests/README
@@ -244,7 +244,7 @@ produce the name:
 Adding the ".litmus" suffix: SB+rfionceonce-poonceonces.litmus
 
 The descriptors that describe connections between consecutive accesses
-within the cycle through a given litmus test can be provided by the herd
+within the cycle through a given litmus test can be provided by the herd7
 tool (Rfi, Po, Fre, and so on) or by the linux-kernel.bell file (Once,
 Release, Acquire, and so on).
 
diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
index 305ded17e741..6b52f365d73a 100644
--- a/tools/memory-model/lock.cat
+++ b/tools/memory-model/lock.cat
@@ -6,15 +6,12 @@
 
 (*
  * Generate coherence orders and handle lock operations
- *
- * Warning: spin_is_locked() crashes herd7 versions strictly before 7.48.
- * spin_is_locked() is functional from herd7 version 7.49.
  *)
 
 include "cross.cat"
 
 (*
- * The lock-related events generated by herd are as follows:
+ * The lock-related events generated by herd7 are as follows:
  *
  * LKR		Lock-Read: the read part of a spin_lock() or successful
  *			spin_trylock() read-modify-write event pair
diff --git a/tools/memory-model/scripts/README b/tools/memory-model/scripts/README
index 29375a1fbbfa..095c7eb36f9f 100644
--- a/tools/memory-model/scripts/README
+++ b/tools/memory-model/scripts/README
@@ -22,7 +22,7 @@ checklitmushist.sh
 
 	Run all litmus tests having .litmus.out files from previous
 	initlitmushist.sh or newlitmushist.sh runs, comparing the
-	herd output to that of the original runs.
+	herd7 output to that of the original runs.
 
 checklitmus.sh
 
@@ -43,7 +43,7 @@ initlitmushist.sh
 
 judgelitmus.sh
 
-	Given a .litmus file and its .litmus.out herd output, check the
+	Given a .litmus file and its .litmus.out herd7 output, check the
 	.litmus.out file against the .litmus file's "Result:" comment to
 	judge whether the test ran correctly.  Not normally run manually,
 	provided instead for use by other scripts.
diff --git a/tools/memory-model/scripts/checkalllitmus.sh b/tools/memory-model/scripts/checkalllitmus.sh
index b35fcd61ecf6..3c0c7fbbd223 100755
--- a/tools/memory-model/scripts/checkalllitmus.sh
+++ b/tools/memory-model/scripts/checkalllitmus.sh
@@ -1,7 +1,7 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0+
 #
-# Run herd tests on all .litmus files in the litmus-tests directory
+# Run herd7 tests on all .litmus files in the litmus-tests directory
 # and check each file's result against a "Result:" comment within that
 # litmus test.  If the verification result does not match that specified
 # in the litmus test, this script prints an error message prefixed with
diff --git a/tools/memory-model/scripts/checkghlitmus.sh b/tools/memory-model/scripts/checkghlitmus.sh
index 6589fbb6f653..6589fbb6f653 100644..100755
--- a/tools/memory-model/scripts/checkghlitmus.sh
+++ b/tools/memory-model/scripts/checkghlitmus.sh
diff --git a/tools/memory-model/scripts/checklitmus.sh b/tools/memory-model/scripts/checklitmus.sh
index dd08801a30b0..11461ed40b5e 100755
--- a/tools/memory-model/scripts/checklitmus.sh
+++ b/tools/memory-model/scripts/checklitmus.sh
@@ -1,7 +1,7 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0+
 #
-# Run a herd test and invokes judgelitmus.sh to check the result against
+# Run a herd7 test and invokes judgelitmus.sh to check the result against
 # a "Result:" comment within the litmus test.  It also outputs verification
 # results to a file whose name is that of the specified litmus test, but
 # with ".out" appended.
diff --git a/tools/memory-model/scripts/checklitmushist.sh b/tools/memory-model/scripts/checklitmushist.sh
index 1d210ffb7c8a..1d210ffb7c8a 100644..100755
--- a/tools/memory-model/scripts/checklitmushist.sh
+++ b/tools/memory-model/scripts/checklitmushist.sh
diff --git a/tools/memory-model/scripts/cmplitmushist.sh b/tools/memory-model/scripts/cmplitmushist.sh
index 0f498aeeccf5..0f498aeeccf5 100644..100755
--- a/tools/memory-model/scripts/cmplitmushist.sh
+++ b/tools/memory-model/scripts/cmplitmushist.sh
diff --git a/tools/memory-model/scripts/initlitmushist.sh b/tools/memory-model/scripts/initlitmushist.sh
index 956b6957484d..956b6957484d 100644..100755
--- a/tools/memory-model/scripts/initlitmushist.sh
+++ b/tools/memory-model/scripts/initlitmushist.sh
diff --git a/tools/memory-model/scripts/judgelitmus.sh b/tools/memory-model/scripts/judgelitmus.sh
index 0cc63875e395..0cc63875e395 100644..100755
--- a/tools/memory-model/scripts/judgelitmus.sh
+++ b/tools/memory-model/scripts/judgelitmus.sh
diff --git a/tools/memory-model/scripts/newlitmushist.sh b/tools/memory-model/scripts/newlitmushist.sh
index 991f8f814881..991f8f814881 100644..100755
--- a/tools/memory-model/scripts/newlitmushist.sh
+++ b/tools/memory-model/scripts/newlitmushist.sh
diff --git a/tools/memory-model/scripts/parseargs.sh b/tools/memory-model/scripts/parseargs.sh
index 859e1d581e05..40f52080fdbd 100644..100755
--- a/tools/memory-model/scripts/parseargs.sh
+++ b/tools/memory-model/scripts/parseargs.sh
@@ -91,7 +91,7 @@ do
 		shift
 		;;
 	--herdopts|--herdopt)
-		checkarg --destdir "(herd options)" "$#" "$2" '.*' '^--'
+		checkarg --destdir "(herd7 options)" "$#" "$2" '.*' '^--'
 		LKMM_HERD_OPTIONS="$2"
 		shift
 		;;
diff --git a/tools/memory-model/scripts/runlitmushist.sh b/tools/memory-model/scripts/runlitmushist.sh
index e507f5f933d5..6ed376f495bb 100644..100755
--- a/tools/memory-model/scripts/runlitmushist.sh
+++ b/tools/memory-model/scripts/runlitmushist.sh
@@ -79,7 +79,7 @@ then
 	echo ' ---' Summary: 1>&2
 	grep '!!!' $T/*.sh.out 1>&2
 	nfail="`grep '!!!' $T/*.sh.out | wc -l`"
-	echo 'Number of failed herd runs (e.g., timeout): ' $nfail 1>&2
+	echo 'Number of failed herd7 runs (e.g., timeout): ' $nfail 1>&2
 	exit 1
 else
 	echo All runs completed successfully. 1>&2