LLE 6/6: Add LoopLoadElimination pass

Summary: The goal of this pass is to perform store-to-load forwarding across the backedge of a loop. E.g.: for (i) A[i + 1] = A[i] + B[i] => T = A[0] for (i) T = T + B[i] A[i + 1] = T The pass relies on loop dependence analysis via LoopAccessAnalisys to find opportunities of loop-carried dependences with a distance of one between a store and a load. Since it's using LoopAccessAnalysis, it was easy to also add support for versioning away may-aliasing intervening stores that would otherwise prevent this transformation. This optimization is also performed by Load-PRE in GVN without the option of multi-versioning. As was discussed with Daniel Berlin in http://reviews.llvm.org/D9548, this is inferior to a more loop-aware solution applied here. Hopefully, we will be able to remove some complexity from GVN/MemorySSA as a consequence. In the long run, we may want to extend this pass (or create a new one if there is little overlap) to also eliminate loop-indepedent redundant loads and store that *require* versioning due to may-aliasing intervening stores/loads. I have some motivating cases for store elimination. My plan right now is to wait for MemorySSA to come online first rather than using memdep for this. The main motiviation for this pass is the 456.hmmer loop in SPECint2006 where after distributing the original loop and vectorizing the top part, we are left with the critical path exposed in the bottom loop. Being able to promote the memory dependence into a register depedence (even though the HW does perform store-to-load fowarding as well) results in a major gain (~20%). This gain also transfers over to x86: it's around 8-10%. Right now the pass is off by default and can be enabled with -enable-loop-load-elim. On the LNT testsuite, there are two performance changes (negative number -> improvement): 1. -28% in Polybench/linear-algebra/solvers/dynprog: the length of the critical paths is reduced 2. +2% in Polybench/stencils/adi: Unfortunately, I couldn't reproduce this outside of LNT The pass is scheduled after the loop vectorizer (which is after loop distribution). The rational is to try to reuse LAA state, rather than recomputing it. The order between LV and LLE is not critical because normally LV does not touch scalar st->ld forwarding cases where vectorizing would inhibit the CPU's st->ld forwarding to kick in. LoopLoadElimination requires LAA to provide the full set of dependences (including forward dependences). LAA is known to omit loop-independent dependences in certain situations. The big comment before removeDependencesFromMultipleStores explains why this should not occur for the cases that we're interested in. Reviewers: dberlin, hfinkel Subscribers: junbuml, dberlin, mssimpso, rengolin, sanjoy, llvm-commits Differential Revision: http://reviews.llvm.org/D13259 llvm-svn: 252017
author: Adam Nemet <anemet@apple.com> 2015-11-03 23:50:08 +0000
committer: Adam Nemet <anemet@apple.com> 2015-11-03 23:50:08 +0000
commit: e54a4fa95d308578e30818d3fc6c2575ce6611fb (patch)
tree: 2cdac733350af324fb624d303066de2a77e7d52c /llvm/test
parent: 397f5829c7c8d64c3c0edc3e07535835ad35262a (diff)
download: bcm5719-llvm-e54a4fa95d308578e30818d3fc6c2575ce6611fb.tar.gz
bcm5719-llvm-e54a4fa95d308578e30818d3fc6c2575ce6611fb.zip
6 files changed, 268 insertions, 0 deletions
diff --git a/llvm/test/Transforms/LoopLoadElim/backward.ll b/llvm/test/Transforms/LoopLoadElim/backward.ll
new file mode 100644
index 00000000000..7c750a51a2a
--- /dev/null
+++ b/llvm/test/Transforms/LoopLoadElim/backward.ll
@@ -0,0 +1,32 @@
+; RUN: opt -loop-load-elim -S < %s | FileCheck %s
+
+; Simple st->ld forwarding derived from a lexical backward dep.
+;
+;   for (unsigned i = 0; i < 100; i++)
+;     A[i+1] = A[i] + B[i];
+
+target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
+
+define void @f(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i64 %N) {
+entry:
+; CHECK: %load_initial = load i32, i32* %A
+  br label %for.body
+
+for.body:                                         ; preds = %for.body, %entry
+; CHECK: %store_forwarded = phi i32 [ %load_initial, %entry ], [ %add, %for.body ]
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
+  %load = load i32, i32* %arrayidx, align 4
+  %arrayidx2 = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
+  %load_1 = load i32, i32* %arrayidx2, align 4
+; CHECK: %add = add i32 %load_1, %store_forwarded
+  %add = add i32 %load_1, %load
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %arrayidx_next = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
+  store i32 %add, i32* %arrayidx_next, align 4
+  %exitcond = icmp eq i64 %indvars.iv.next, %N
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:                                          ; preds = %for.body
+  ret void
+}
diff --git a/llvm/test/Transforms/LoopLoadElim/def-store-before-load.ll b/llvm/test/Transforms/LoopLoadElim/def-store-before-load.ll
new file mode 100644
index 00000000000..3dc93f6786e
--- /dev/null
+++ b/llvm/test/Transforms/LoopLoadElim/def-store-before-load.ll
@@ -0,0 +1,35 @@
+; RUN: opt -loop-load-elim -S < %s | FileCheck %s
+
+;  No loop-carried forwarding: The intervening store to A[i] kills the stored
+;  value from the previous iteration.
+;
+;   for (unsigned i = 0; i < 100; i++) {
+;     A[i] = 1;
+;     A[i+1] = A[i] + B[i];
+;   }
+
+target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
+
+define void @f(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i64 %N) {
+entry:
+  br label %for.body
+
+for.body:                                         ; preds = %for.body, %entry
+; CHECK-NOT: %store_forwarded
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
+  store i32 1, i32* %arrayidx, align 4
+  %a = load i32, i32* %arrayidx, align 4
+  %arrayidxB = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
+  %b = load i32, i32* %arrayidxB, align 4
+; CHECK: %add = add i32 %b, %a
+  %add = add i32 %b, %a
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %arrayidx_next = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
+  store i32 %add, i32* %arrayidx_next, align 4
+  %exitcond = icmp eq i64 %indvars.iv.next, %N
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:                                          ; preds = %for.body
+  ret void
+}
diff --git a/llvm/test/Transforms/LoopLoadElim/forward.ll b/llvm/test/Transforms/LoopLoadElim/forward.ll
new file mode 100644
index 00000000000..1a77297a064
--- /dev/null
+++ b/llvm/test/Transforms/LoopLoadElim/forward.ll
@@ -0,0 +1,47 @@
+; RUN: opt -loop-load-elim -S < %s | FileCheck %s
+
+; Simple st->ld forwarding derived from a lexical forwrad dep.
+;
+;   for (unsigned i = 0; i < 100; i++) {
+;     A[i+1] = B[i] + 2;
+;     C[i] = A[i] * 2;
+;   }
+
+target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
+
+define void @f(i32* %A, i32* %B, i32* %C, i64 %N) {
+
+; CHECK:   for.body.lver.memcheck:
+; CHECK:     %found.conflict{{.*}} =
+; CHECK-NOT: %found.conflict{{.*}} =
+
+entry:
+; for.body.ph:
+; CHECK: %load_initial = load i32, i32* %A
+  br label %for.body
+
+for.body:                                         ; preds = %for.body, %entry
+; CHECK: %store_forwarded = phi i32 [ %load_initial, %for.body.ph ], [ %a_p1, %for.body ]
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+
+  %Aidx_next = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
+  %Bidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
+  %Cidx = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
+  %Aidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
+
+  %b = load i32, i32* %Bidx, align 4
+  %a_p1 = add i32 %b, 2
+  store i32 %a_p1, i32* %Aidx_next, align 4
+
+  %a = load i32, i32* %Aidx, align 4
+; CHECK: %c = mul i32 %store_forwarded, 2
+  %c = mul i32 %a, 2
+  store i32 %c, i32* %Cidx, align 4
+
+  %exitcond = icmp eq i64 %indvars.iv.next, %N
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:                                          ; preds = %for.body
+  ret void
+}
diff --git a/llvm/test/Transforms/LoopLoadElim/memcheck.ll b/llvm/test/Transforms/LoopLoadElim/memcheck.ll
new file mode 100644
index 00000000000..ebb52825754
--- /dev/null
+++ b/llvm/test/Transforms/LoopLoadElim/memcheck.ll
@@ -0,0 +1,52 @@
+; RUN: opt -loop-load-elim -S < %s | FileCheck %s
+; RUN: opt -loop-load-elim -S -runtime-check-per-loop-load-elim=2 < %s | FileCheck %s --check-prefix=AGGRESSIVE
+
+; This needs two pairs of memchecks (A * { C, D }) for a single load
+; elimination which is considered to expansive by default.
+;
+;   for (unsigned i = 0; i < 100; i++) {
+;     A[i+1] = B[i] + 2;
+;     C[i] = A[i] * 2;
+;     D[i] = 2;
+;   }
+
+target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
+
+define void @f(i32*  %A, i32*  %B, i32*  %C, i64 %N, i32* %D) {
+entry:
+  br label %for.body
+
+; AGGRESSIVE: for.body.lver.memcheck:
+; AGGRESSIVE: %found.conflict{{.*}} =
+; AGGRESSIVE: %found.conflict{{.*}} =
+; AGGRESSIVE-NOT: %found.conflict{{.*}} =
+
+for.body:                                         ; preds = %for.body, %entry
+; CHECK-NOT: %store_forwarded =
+; AGGRESSIVE: %store_forwarded =
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+
+  %Aidx_next = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
+  %Bidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
+  %Cidx = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
+  %Aidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
+  %Didx = getelementptr inbounds i32, i32* %D, i64 %indvars.iv
+
+  %b = load i32, i32* %Bidx, align 4
+  %a_p1 = add i32 %b, 2
+  store i32 %a_p1, i32* %Aidx_next, align 4
+
+  %a = load i32, i32* %Aidx, align 4
+; CHECK: %c = mul i32 %a, 2
+; AGGRESSIVE: %c = mul i32 %store_forwarded, 2
+  %c = mul i32 %a, 2
+  store i32 %c, i32* %Cidx, align 4
+  store i32 2, i32* %Didx, align 4
+
+  %exitcond = icmp eq i64 %indvars.iv.next, %N
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:                                          ; preds = %for.body
+  ret void
+}
diff --git a/llvm/test/Transforms/LoopLoadElim/multiple-stores-same-block.ll b/llvm/test/Transforms/LoopLoadElim/multiple-stores-same-block.ll
new file mode 100644
index 00000000000..b0c0f3dee86
--- /dev/null
+++ b/llvm/test/Transforms/LoopLoadElim/multiple-stores-same-block.ll
@@ -0,0 +1,48 @@
+; RUN: opt -basicaa -loop-load-elim -S < %s | FileCheck %s
+
+; In this case the later store forward to the load:
+;
+;   for (unsigned i = 0; i < 100; i++) {
+;     B[i] = A[i] + 1;
+;     A[i+1] = C[i] + 2;
+;     A[i+1] = D[i] + 3;
+;   }
+
+target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
+
+define void @f(i32* noalias nocapture %A, i32* noalias nocapture readonly %B,
+               i32* noalias nocapture %C, i32* noalias nocapture readonly %D,
+               i64 %N) {
+entry:
+; CHECK: %load_initial = load i32, i32* %A
+  br label %for.body
+
+for.body:                                         ; preds = %for.body, %entry
+; CHECK: %store_forwarded = phi i32 [ %load_initial, %entry ], [ %addD, %for.body ]
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidxA = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
+  %loadA = load i32, i32* %arrayidxA, align 4
+; CHECK: %addA = add i32 %store_forwarded, 1
+  %addA = add i32 %loadA, 1
+
+  %arrayidxB = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
+  store i32 %addA, i32* %arrayidxB, align 4
+
+  %arrayidxC = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
+  %loadC = load i32, i32* %arrayidxC, align 4
+  %addC = add i32 %loadC, 2
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %arrayidxA_next = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
+  store i32 %addC, i32* %arrayidxA_next, align 4
+
+  %arrayidxD = getelementptr inbounds i32, i32* %D, i64 %indvars.iv
+  %loadD = load i32, i32* %arrayidxD, align 4
+  %addD = add i32 %loadD, 3
+  store i32 %addD, i32* %arrayidxA_next, align 4
+
+  %exitcond = icmp eq i64 %indvars.iv.next, %N
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:                                          ; preds = %for.body
+  ret void
+}
diff --git a/llvm/test/Transforms/LoopLoadElim/unknown-dep.ll b/llvm/test/Transforms/LoopLoadElim/unknown-dep.ll
new file mode 100644
index 00000000000..d2df718ca4c
--- /dev/null
+++ b/llvm/test/Transforms/LoopLoadElim/unknown-dep.ll
@@ -0,0 +1,54 @@
+; RUN: opt -basicaa -loop-load-elim -S < %s | FileCheck %s
+
+target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
+
+; Give up in the presence of unknown deps. Here, the different strides result
+; in unknown dependence:
+;
+;   for (unsigned i = 0; i < 100; i++) {
+;     A[i+1] = B[i] + 2;
+;     A[2*i] = C[i] + 2;
+;     D[i] = A[i] + 2;
+;   }
+
+define void @f(i32* noalias %A, i32* noalias %B, i32* noalias %C,
+               i32* noalias %D, i64 %N) {
+
+entry:
+; for.body.ph:
+; CHECK-NOT: %load_initial =
+  br label %for.body
+
+for.body:                                         ; preds = %for.body, %entry
+; CHECK-NOT: %store_forwarded =
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+
+  %Aidx_next = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
+  %Bidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
+  %Cidx = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
+  %Didx = getelementptr inbounds i32, i32* %D, i64 %indvars.iv
+  %Aidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
+  %indvars.m2 = mul nuw nsw i64 %indvars.iv, 2
+  %A2idx = getelementptr inbounds i32, i32* %A, i64 %indvars.m2
+
+  %b = load i32, i32* %Bidx, align 4
+  %a_p1 = add i32 %b, 2
+  store i32 %a_p1, i32* %Aidx_next, align 4
+
+  %c = load i32, i32* %Cidx, align 4
+  %a_m2 = add i32 %c, 2
+  store i32 %a_m2, i32* %A2idx, align 4
+
+  %a = load i32, i32* %Aidx, align 4
+; CHECK-NOT: %d = add i32 %store_forwarded, 2
+; CHECK: %d = add i32 %a, 2
+  %d = add i32 %a, 2
+  store i32 %d, i32* %Didx, align 4
+
+  %exitcond = icmp eq i64 %indvars.iv.next, %N
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:                                          ; preds = %for.body
+  ret void
+}
author	Adam Nemet <anemet@apple.com>	2015-11-03 23:50:08 +0000
committer	Adam Nemet <anemet@apple.com>	2015-11-03 23:50:08 +0000
commit	e54a4fa95d308578e30818d3fc6c2575ce6611fb (patch)
tree	2cdac733350af324fb624d303066de2a77e7d52c /llvm/test
parent	397f5829c7c8d64c3c0edc3e07535835ad35262a (diff)
download	bcm5719-llvm-e54a4fa95d308578e30818d3fc6c2575ce6611fb.tar.gz bcm5719-llvm-e54a4fa95d308578e30818d3fc6c2575ce6611fb.zip