diff options
author | Roman Lebedev <lebedev.ri@gmail.com> | 2019-02-20 09:14:04 +0000 |
---|---|---|
committer | Roman Lebedev <lebedev.ri@gmail.com> | 2019-02-20 09:14:04 +0000 |
commit | 69716394f3d65210c8ca62bf380b75f3f1346ec6 (patch) | |
tree | 30cb9a5e869cc4f0f9f44a32e49b621dd94620d5 | |
parent | 2d6bb13443d86d650e92b280d4f7df0805eebac7 (diff) | |
download | bcm5719-llvm-69716394f3d65210c8ca62bf380b75f3f1346ec6.tar.gz bcm5719-llvm-69716394f3d65210c8ca62bf380b75f3f1346ec6.zip |
[llvm-exegesis] Opcode stabilization / reclusterization (PR40715)
Summary:
Given an instruction `Opcode`, we can make benchmarks (measurements) of the
instruction characteristics/performance. Then, to facilitate further analysis
we group the benchmarks with *similar* characteristics into clusters.
Now, this is all not entirely deterministic. Some instructions have variable
characteristics, depending on their arguments. And thus, if we do several
benchmarks of the same instruction `Opcode`, we may end up with *different*
performance characteristics measurements. And when we then do clustering,
these several benchmarks of the same instruction `Opcode` may end up being
clustered into *different* clusters. This is not great for further analysis.
We shall find every `Opcode` with benchmarks not in just one cluster, and move
*all* the benchmarks of said `Opcode` into one new unstable cluster per `Opcode`.
I have solved this by making `ClusterId` a bit field, adding a `IsUnstable` bit,
and introducing `-analysis-display-unstable-clusters` switch to toggle between
displaying stable-only clusters and unstable-only clusters.
The reclusterization is deterministically stable, produces identical reports
between runs. (Or at least that is what i'm seeing, maybe it isn't)
Timings/comparisons:
old (current trunk/head) {F8303582}
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-old.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-old.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-old.html'
Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-old.html' (25 runs):
6624.73 msec task-clock # 0.999 CPUs utilized ( +- 0.53% )
172 context-switches # 25.965 M/sec ( +- 29.89% )
0 cpu-migrations # 0.042 M/sec ( +- 56.54% )
31073 page-faults # 4690.754 M/sec ( +- 0.08% )
26538711696 cycles # 4006230.292 GHz ( +- 0.53% ) (83.31%)
2017496807 stalled-cycles-frontend # 7.60% frontend cycles idle ( +- 0.93% ) (83.32%)
13403650062 stalled-cycles-backend # 50.51% backend cycles idle ( +- 0.33% ) (33.37%)
19770706799 instructions # 0.74 insn per cycle
# 0.68 stalled cycles per insn ( +- 0.04% ) (50.04%)
4419821812 branches # 667207369.714 M/sec ( +- 0.03% ) (66.69%)
121741669 branch-misses # 2.75% of all branches ( +- 0.28% ) (83.34%)
6.6283 +- 0.0358 seconds time elapsed ( +- 0.54% )
```
patch, with reclustering but without filtering (i.e. outputting all the stable *and* unstable clusters) {F8303586}
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-all.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-all.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-all.html'
Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-all.html' (25 runs):
6475.29 msec task-clock # 0.999 CPUs utilized ( +- 0.31% )
213 context-switches # 32.952 M/sec ( +- 23.81% )
1 cpu-migrations # 0.130 M/sec ( +- 43.84% )
31287 page-faults # 4832.057 M/sec ( +- 0.08% )
25939086577 cycles # 4006160.279 GHz ( +- 0.31% ) (83.31%)
1958812858 stalled-cycles-frontend # 7.55% frontend cycles idle ( +- 0.68% ) (83.32%)
13218961512 stalled-cycles-backend # 50.96% backend cycles idle ( +- 0.29% ) (33.37%)
19752995402 instructions # 0.76 insn per cycle
# 0.67 stalled cycles per insn ( +- 0.04% ) (50.04%)
4417079244 branches # 682195472.305 M/sec ( +- 0.03% ) (66.70%)
121510065 branch-misses # 2.75% of all branches ( +- 0.19% ) (83.34%)
6.4832 +- 0.0229 seconds time elapsed ( +- 0.35% )
```
Funnily, *this* measurement shows that said reclustering actually improved performance.
patch, with reclustering, only the stable clusters {F8303594}
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-stable.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-stable.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-stable.html'
Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-stable.html' (25 runs):
6387.71 msec task-clock # 0.999 CPUs utilized ( +- 0.13% )
133 context-switches # 20.792 M/sec ( +- 23.39% )
0 cpu-migrations # 0.063 M/sec ( +- 61.24% )
31318 page-faults # 4903.256 M/sec ( +- 0.08% )
25591984967 cycles # 4006786.266 GHz ( +- 0.13% ) (83.31%)
1881234904 stalled-cycles-frontend # 7.35% frontend cycles idle ( +- 0.25% ) (83.33%)
13209749965 stalled-cycles-backend # 51.62% backend cycles idle ( +- 0.16% ) (33.36%)
19767554347 instructions # 0.77 insn per cycle
# 0.67 stalled cycles per insn ( +- 0.04% ) (50.03%)
4417480305 branches # 691618858.046 M/sec ( +- 0.03% ) (66.68%)
118676358 branch-misses # 2.69% of all branches ( +- 0.07% ) (83.33%)
6.3954 +- 0.0118 seconds time elapsed ( +- 0.18% )
```
Performance improved even further?! Makes sense i guess, less clusters to print.
patch, with reclustering, only the unstable clusters {F8303601}
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-unstable.html -analysis-display-unstable-clusters
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-unstable.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-unstable.html'
Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-unstable.html -analysis-display-unstable-clusters' (25 runs):
6124.96 msec task-clock # 1.000 CPUs utilized ( +- 0.20% )
194 context-switches # 31.709 M/sec ( +- 20.46% )
0 cpu-migrations # 0.039 M/sec ( +- 49.77% )
31413 page-faults # 5129.261 M/sec ( +- 0.06% )
24536794267 cycles # 4006425.858 GHz ( +- 0.19% ) (83.31%)
1676085087 stalled-cycles-frontend # 6.83% frontend cycles idle ( +- 0.46% ) (83.32%)
13035595603 stalled-cycles-backend # 53.13% backend cycles idle ( +- 0.16% ) (33.36%)
18260877653 instructions # 0.74 insn per cycle
# 0.71 stalled cycles per insn ( +- 0.05% ) (50.03%)
4112411983 branches # 671484364.603 M/sec ( +- 0.03% ) (66.68%)
114066929 branch-misses # 2.77% of all branches ( +- 0.11% ) (83.32%)
6.1278 +- 0.0121 seconds time elapsed ( +- 0.20% )
```
This tells us that the actual `-analysis-inconsistencies-output-file=` outputting only takes ~0.4 sec for 43970 benchmark points (3 whole sweeps)
(Also, wow this is fast, it used to take several minutes originally)
Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40715 | PR40715 ]].
Reviewers: courbet, gchatelet
Reviewed By: courbet
Subscribers: tschuett, jdoerfert, llvm-commits, RKSimon
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58355
llvm-svn: 354441
-rw-r--r-- | llvm/docs/CommandGuide/llvm-exegesis.rst | 7 | ||||
-rw-r--r-- | llvm/test/tools/llvm-exegesis/X86/analysis-cluster-stabilization.test | 82 | ||||
-rw-r--r-- | llvm/tools/llvm-exegesis/lib/Analysis.cpp | 14 | ||||
-rw-r--r-- | llvm/tools/llvm-exegesis/lib/Analysis.h | 5 | ||||
-rw-r--r-- | llvm/tools/llvm-exegesis/lib/BenchmarkResult.h | 2 | ||||
-rw-r--r-- | llvm/tools/llvm-exegesis/lib/Clustering.cpp | 101 | ||||
-rw-r--r-- | llvm/tools/llvm-exegesis/lib/Clustering.h | 28 | ||||
-rw-r--r-- | llvm/tools/llvm-exegesis/llvm-exegesis.cpp | 18 |
8 files changed, 241 insertions, 16 deletions
diff --git a/llvm/docs/CommandGuide/llvm-exegesis.rst b/llvm/docs/CommandGuide/llvm-exegesis.rst index ff62e0285ae..b5263973d16 100644 --- a/llvm/docs/CommandGuide/llvm-exegesis.rst +++ b/llvm/docs/CommandGuide/llvm-exegesis.rst @@ -224,6 +224,13 @@ OPTIONS Specify the numPoints parameters to be used for DBSCAN clustering (`analysis` mode). +.. option:: -analysis-display-unstable-clusters + + If there is more than one benchmark for an opcode, said benchmarks may end up + not being clustered into the same cluster if the measured performance + characteristics are different. by default all such opcodes are filtered out. + This flag will instead show only such unstable opcodes. + .. option:: -ignore-invalid-sched-class=false If set, ignore instructions that do not have a sched class (class idx = 0). diff --git a/llvm/test/tools/llvm-exegesis/X86/analysis-cluster-stabilization.test b/llvm/test/tools/llvm-exegesis/X86/analysis-cluster-stabilization.test new file mode 100644 index 00000000000..aa14d7b8e86 --- /dev/null +++ b/llvm/test/tools/llvm-exegesis/X86/analysis-cluster-stabilization.test @@ -0,0 +1,82 @@ +# RUN: llvm-exegesis -mode=analysis -benchmarks-file=%s -analysis-clusters-output-file=- -analysis-epsilon=0.1 -analysis-numpoints=1 | FileCheck -check-prefixes=CHECK-CLUSTERS %s +# RUN: llvm-exegesis -mode=analysis -benchmarks-file=%s -analysis-inconsistencies-output-file=- -analysis-epsilon=0.5 -analysis-numpoints=1 | FileCheck -check-prefixes=CHECK-INCONSISTENCIES-ALL,CHECK-INCONSISTENCIES-STABLE %s +# RUN: llvm-exegesis -mode=analysis -benchmarks-file=%s -analysis-inconsistencies-output-file=- -analysis-epsilon=0.5 -analysis-display-unstable-clusters -analysis-numpoints=1 | FileCheck -check-prefixes=CHECK-INCONSISTENCIES-ALL,CHECK-INCONSISTENCIES-UNSTABLE %s + +# We have one ADD32rr measurement, and two measurements for SQRTSSr. +# The ADD32rr measurement and one of the SQRTSSr measurements are identical, +# and thus will be be in the same cluster. But the second SQRTSSr measurement +# is different from the first SQRTSSr measurement, and thus it will be in it's +# own cluster. We do reclusterization, and thus since there is more than one +# measurement from SQRTSSr, and they are not in the same cluster, we move +# all two SQRTSSr measurements into their own cluster, and mark it as unstable. +# By default, we do not show such unstable clusters. +# If told to show, we *only* show such unstable clusters. + +# CHECK-CLUSTERS: {{^}}cluster_id,opcode_name,config,sched_class,latency{{$}} +# CHECK-CLUSTERS-NEXT: {{^}}0, +# CHECK-CLUSTERS-SAME: ,90.00{{$}} +# CHECK-CLUSTERS: {{^}}3, +# CHECK-CLUSTERS-SAME: ,90.11{{$}} +# CHECK-CLUSTERS-NEXT: {{^}}3, +# CHECK-CLUSTERS-SAME: ,100.00{{$}} + +# CHECK-INCONSISTENCIES-STABLE: ADD32rr +# CHECK-INCONSISTENCIES-STABLE-NOT: ADD32rr +# CHECK-INCONSISTENCIES-STABLE-NOT: SQRTSSr + +# CHECK-INCONSISTENCIES-UNSTABLE: SQRTSSr +# CHECK-INCONSISTENCIES-UNSTABLE: SQRTSSr +# CHECK-INCONSISTENCIES-UNSTABLE-NOT: SQRTSSr +# CHECK-INCONSISTENCIES-UNSTABLE-NOT: ADD32rr + +--- +mode: latency +key: + instructions: + - 'ADD32rr EDX EDX EAX' + config: '' + register_initial_values: + - 'EDX=0x0' + - 'EAX=0x0' +cpu_name: bdver2 +llvm_triple: x86_64-unknown-linux-gnu +num_repetitions: 10000 +measurements: + - { key: latency, value: 90.0000, per_snippet_value: 90.0000 } +error: '' +info: Repeating a single implicitly serial instruction +assembled_snippet: BA00000000B80000000001C201C201C201C201C201C201C201C201C201C201C201C201C201C201C201C2C3 +--- +mode: latency +key: + instructions: + - 'SQRTSSr XMM11 XMM11' + config: '' + register_initial_values: + - 'XMM11=0x0' +cpu_name: bdver2 +llvm_triple: x86_64-unknown-linux-gnu +num_repetitions: 10000 +measurements: + - { key: latency, value: 90.1111, per_snippet_value: 90.1111 } +error: '' +info: Repeating a single explicitly serial instruction +assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C410F3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBC3 +... +--- +mode: latency +key: + instructions: + - 'SQRTSSr XMM11 XMM11' + config: '' + register_initial_values: + - 'XMM11=0x0' +cpu_name: bdver2 +llvm_triple: x86_64-unknown-linux-gnu +num_repetitions: 10000 +measurements: + - { key: latency, value: 100, per_snippet_value: 100 } +error: '' +info: Repeating a single explicitly serial instruction +assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C410F3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBF3450F51DBC3 +... diff --git a/llvm/tools/llvm-exegesis/lib/Analysis.cpp b/llvm/tools/llvm-exegesis/lib/Analysis.cpp index 158fbfe3b0a..ba5620e8c4b 100644 --- a/llvm/tools/llvm-exegesis/lib/Analysis.cpp +++ b/llvm/tools/llvm-exegesis/lib/Analysis.cpp @@ -149,7 +149,7 @@ void Analysis::printInstructionRowCsv(const size_t PointId, writeEscaped<kEscapeCsv>(OS, Point.Key.Config); OS << kCsvSep; assert(!Point.Key.Instructions.empty()); - const llvm::MCInst &MCI = Point.Key.Instructions[0]; + const llvm::MCInst &MCI = Point.keyInstruction(); const unsigned SchedClassId = resolveSchedClassId( *SubtargetInfo_, InstrInfo_->get(MCI.getOpcode()).getSchedClass(), MCI); @@ -168,13 +168,15 @@ void Analysis::printInstructionRowCsv(const size_t PointId, } Analysis::Analysis(const llvm::Target &Target, - const InstructionBenchmarkClustering &Clustering) - : Clustering_(Clustering) { + std::unique_ptr<llvm::MCInstrInfo> InstrInfo, + const InstructionBenchmarkClustering &Clustering, + bool AnalysisDisplayUnstableOpcodes) + : Clustering_(Clustering), InstrInfo_(std::move(InstrInfo)), + AnalysisDisplayUnstableOpcodes_(AnalysisDisplayUnstableOpcodes) { if (Clustering.getPoints().empty()) return; const InstructionBenchmark &FirstPoint = Clustering.getPoints().front(); - InstrInfo_.reset(Target.createMCInstrInfo()); RegInfo_.reset(Target.createMCRegInfo(FirstPoint.LLVMTriple)); AsmInfo_.reset(Target.createMCAsmInfo(*RegInfo_, FirstPoint.LLVMTriple)); SubtargetInfo_.reset(Target.createMCSubtargetInfo(FirstPoint.LLVMTriple, @@ -233,7 +235,7 @@ Analysis::makePointsPerSchedClass() const { assert(!Point.Key.Instructions.empty()); // FIXME: we should be using the tuple of classes for instructions in the // snippet as key. - const llvm::MCInst &MCI = Point.Key.Instructions[0]; + const llvm::MCInst &MCI = Point.keyInstruction(); unsigned SchedClassId = InstrInfo_->get(MCI.getOpcode()).getSchedClass(); const bool WasVariant = SchedClassId && SubtargetInfo_->getSchedModel() .getSchedClassDesc(SchedClassId) @@ -668,6 +670,8 @@ llvm::Error Analysis::run<Analysis::PrintSchedClassInconsistencies>( const auto &ClusterId = Clustering_.getClusterIdForPoint(PointId); if (!ClusterId.isValid()) continue; // Ignore noise and errors. FIXME: take noise into account ? + if (ClusterId.isUnstable() ^ AnalysisDisplayUnstableOpcodes_) + continue; // Either display stable or unstable clusters only. auto SchedClassClusterIt = std::find_if(SchedClassClusters.begin(), SchedClassClusters.end(), [ClusterId](const SchedClassCluster &C) { diff --git a/llvm/tools/llvm-exegesis/lib/Analysis.h b/llvm/tools/llvm-exegesis/lib/Analysis.h index 852aaae0012..d55a9a8c684 100644 --- a/llvm/tools/llvm-exegesis/lib/Analysis.h +++ b/llvm/tools/llvm-exegesis/lib/Analysis.h @@ -36,7 +36,9 @@ namespace exegesis { class Analysis { public: Analysis(const llvm::Target &Target, - const InstructionBenchmarkClustering &Clustering); + std::unique_ptr<llvm::MCInstrInfo> InstrInfo, + const InstructionBenchmarkClustering &Clustering, + bool AnalysisDisplayUnstableOpcodes); // Prints a csv of instructions for each cluster. struct PrintClusters {}; @@ -125,6 +127,7 @@ private: std::unique_ptr<llvm::MCAsmInfo> AsmInfo_; std::unique_ptr<llvm::MCInstPrinter> InstPrinter_; std::unique_ptr<llvm::MCDisassembler> Disasm_; + const bool AnalysisDisplayUnstableOpcodes_; }; // Computes the idealized ProcRes Unit pressure. This is the expected diff --git a/llvm/tools/llvm-exegesis/lib/BenchmarkResult.h b/llvm/tools/llvm-exegesis/lib/BenchmarkResult.h index 0ef4fb3caa9..d75e18b7489 100644 --- a/llvm/tools/llvm-exegesis/lib/BenchmarkResult.h +++ b/llvm/tools/llvm-exegesis/lib/BenchmarkResult.h @@ -61,6 +61,8 @@ struct InstructionBenchmark { ModeE Mode; std::string CpuName; std::string LLVMTriple; + // Which instruction is being benchmarked here? + const llvm::MCInst &keyInstruction() const { return Key.Instructions[0]; } // The number of instructions inside the repeated snippet. For example, if a // snippet of 3 instructions is repeated 4 times, this is 12. int NumRepetitions = 0; diff --git a/llvm/tools/llvm-exegesis/lib/Clustering.cpp b/llvm/tools/llvm-exegesis/lib/Clustering.cpp index 36e6c5b110f..bebb535f431 100644 --- a/llvm/tools/llvm-exegesis/lib/Clustering.cpp +++ b/llvm/tools/llvm-exegesis/lib/Clustering.cpp @@ -8,8 +8,11 @@ #include "Clustering.h" #include "llvm/ADT/SetVector.h" +#include "llvm/ADT/SmallSet.h" #include "llvm/ADT/SmallVector.h" +#include <algorithm> #include <string> +#include <vector> namespace llvm { namespace exegesis { @@ -147,10 +150,102 @@ void InstructionBenchmarkClustering::dbScan(const size_t MinPts) { } } +// Given an instruction Opcode, we can make benchmarks (measurements) of the +// instruction characteristics/performance. Then, to facilitate further analysis +// we group the benchmarks with *similar* characteristics into clusters. +// Now, this is all not entirely deterministic. Some instructions have variable +// characteristics, depending on their arguments. And thus, if we do several +// benchmarks of the same instruction Opcode, we may end up with *different* +// performance characteristics measurements. And when we then do clustering, +// these several benchmarks of the same instruction Opcode may end up being +// clustered into *different* clusters. This is not great for further analysis. +// We shall find every opcode with benchmarks not in just one cluster, and move +// *all* the benchmarks of said Opcode into one new unstable cluster per Opcode. +void InstructionBenchmarkClustering::stabilize(unsigned NumOpcodes) { + // Given an instruction Opcode, in which clusters do benchmarks of this + // instruction lie? Normally, they all should be in the same cluster. + std::vector<llvm::SmallSet<ClusterId, 1>> OpcodeToClusterIDs; + OpcodeToClusterIDs.resize(NumOpcodes); + // The list of opcodes that have more than one cluster. + llvm::SetVector<size_t> UnstableOpcodes; + // Populate OpcodeToClusterIDs and UnstableOpcodes data structures. + assert(ClusterIdForPoint_.size() == Points_.size() && "size mismatch"); + for (const auto &Point : zip(Points_, ClusterIdForPoint_)) { + const ClusterId &ClusterIdOfPoint = std::get<1>(Point); + if (!ClusterIdOfPoint.isValid()) + continue; // Only process fully valid clusters. + const unsigned Opcode = std::get<0>(Point).keyInstruction().getOpcode(); + assert(Opcode < NumOpcodes && "NumOpcodes is incorrect (too small)"); + llvm::SmallSet<ClusterId, 1> &ClusterIDsOfOpcode = + OpcodeToClusterIDs[Opcode]; + ClusterIDsOfOpcode.insert(ClusterIdOfPoint); + // Is there more than one ClusterID for this opcode?. + if (ClusterIDsOfOpcode.size() < 2) + continue; // If not, then at this moment this Opcode is stable. + // Else let's record this unstable opcode for future use. + UnstableOpcodes.insert(Opcode); + } + assert(OpcodeToClusterIDs.size() == NumOpcodes && "sanity check"); + + // We know with how many [new] clusters we will end up with. + const auto NewTotalClusterCount = Clusters_.size() + UnstableOpcodes.size(); + Clusters_.reserve(NewTotalClusterCount); + for (const size_t UnstableOpcode : UnstableOpcodes.getArrayRef()) { + const llvm::SmallSet<ClusterId, 1> &ClusterIDs = + OpcodeToClusterIDs[UnstableOpcode]; + assert(ClusterIDs.size() > 1 && + "Should only have Opcodes with more than one cluster."); + + // Create a new unstable cluster, one per Opcode. + Clusters_.emplace_back(ClusterId::makeValidUnstable(Clusters_.size())); + Cluster &UnstableCluster = Clusters_.back(); + // We will find *at least* one point in each of these clusters. + UnstableCluster.PointIndices.reserve(ClusterIDs.size()); + + // Go through every cluster which we recorded as containing benchmarks + // of this UnstableOpcode. NOTE: we only recorded valid clusters. + for (const ClusterId &CID : ClusterIDs) { + assert(CID.isValid() && + "We only recorded valid clusters, not noise/error clusters."); + Cluster &OldCluster = Clusters_[CID.getId()]; // Valid clusters storage. + // Within each cluster, go through each point, and either move it to the + // new unstable cluster, or 'keep' it. + // In this case, we'll reshuffle OldCluster.PointIndices vector + // so that all the points that are *not* for UnstableOpcode are first, + // and the rest of the points is for the UnstableOpcode. + const auto it = std::stable_partition( + OldCluster.PointIndices.begin(), OldCluster.PointIndices.end(), + [this, UnstableOpcode](size_t P) { + return Points_[P].keyInstruction().getOpcode() != UnstableOpcode; + }); + assert(std::distance(it, OldCluster.PointIndices.end()) > 0 && + "Should have found at least one bad point"); + // Mark to-be-moved points as belonging to the new cluster. + std::for_each(it, OldCluster.PointIndices.end(), + [this, &UnstableCluster](size_t P) { + ClusterIdForPoint_[P] = UnstableCluster.Id; + }); + // Actually append to-be-moved points to the new cluster. + UnstableCluster.PointIndices.insert(UnstableCluster.PointIndices.cend(), + it, OldCluster.PointIndices.end()); + // And finally, remove "to-be-moved" points form the old cluster. + OldCluster.PointIndices.erase(it, OldCluster.PointIndices.cend()); + // Now, the old cluster may end up being empty, but let's just keep it + // in whatever state it ended up. Purging empty clusters isn't worth it. + }; + assert(UnstableCluster.PointIndices.size() > 1 && + "New unstable cluster should end up with more than one point."); + assert(UnstableCluster.PointIndices.size() >= ClusterIDs.size() && + "New unstable cluster should end up with no less points than there " + "was clusters"); + } + assert(Clusters_.size() == NewTotalClusterCount && "sanity check"); +} + llvm::Expected<InstructionBenchmarkClustering> InstructionBenchmarkClustering::create( const std::vector<InstructionBenchmark> &Points, const size_t MinPts, - const double Epsilon) { + const double Epsilon, llvm::Optional<unsigned> NumOpcodes) { InstructionBenchmarkClustering Clustering(Points, Epsilon * Epsilon); if (auto Error = Clustering.validateAndSetup()) { return std::move(Error); @@ -160,6 +255,10 @@ InstructionBenchmarkClustering::create( } Clustering.dbScan(MinPts); + + if (NumOpcodes.hasValue()) + Clustering.stabilize(NumOpcodes.getValue()); + return Clustering; } diff --git a/llvm/tools/llvm-exegesis/lib/Clustering.h b/llvm/tools/llvm-exegesis/lib/Clustering.h index dccc28d9518..70082044f56 100644 --- a/llvm/tools/llvm-exegesis/lib/Clustering.h +++ b/llvm/tools/llvm-exegesis/lib/Clustering.h @@ -15,7 +15,9 @@ #define LLVM_TOOLS_LLVM_EXEGESIS_CLUSTERING_H #include "BenchmarkResult.h" +#include "llvm/ADT/Optional.h" #include "llvm/Support/Error.h" +#include <limits> #include <vector> namespace llvm { @@ -27,21 +29,28 @@ public: // for more explanations on the algorithm. static llvm::Expected<InstructionBenchmarkClustering> create(const std::vector<InstructionBenchmark> &Points, size_t MinPts, - double Epsilon); + double Epsilon, llvm::Optional<unsigned> NumOpcodes = llvm::None); class ClusterId { public: static ClusterId noise() { return ClusterId(kNoise); } static ClusterId error() { return ClusterId(kError); } static ClusterId makeValid(size_t Id) { return ClusterId(Id); } - ClusterId() : Id_(kUndef) {} + static ClusterId makeValidUnstable(size_t Id) { + return ClusterId(Id, /*IsUnstable=*/true); + } + + ClusterId() : Id_(kUndef), IsUnstable_(false) {} + + // Compare id's, ignoring the 'unstability' bit. bool operator==(const ClusterId &O) const { return Id_ == O.Id_; } bool operator<(const ClusterId &O) const { return Id_ < O.Id_; } bool isValid() const { return Id_ <= kMaxValid; } - bool isUndef() const { return Id_ == kUndef; } + bool isUnstable() const { return IsUnstable_; } bool isNoise() const { return Id_ == kNoise; } bool isError() const { return Id_ == kError; } + bool isUndef() const { return Id_ == kUndef; } // Precondition: isValid(). size_t getId() const { @@ -50,14 +59,19 @@ public: } private: - explicit ClusterId(size_t Id) : Id_(Id) {} + ClusterId(size_t Id, bool IsUnstable = false) + : Id_(Id), IsUnstable_(IsUnstable) {} + static constexpr const size_t kMaxValid = - std::numeric_limits<size_t>::max() - 4; + (std::numeric_limits<size_t>::max() >> 1) - 4; static constexpr const size_t kNoise = kMaxValid + 1; static constexpr const size_t kError = kMaxValid + 2; static constexpr const size_t kUndef = kMaxValid + 3; - size_t Id_; + + size_t Id_ : (std::numeric_limits<size_t>::digits - 1); + size_t IsUnstable_ : 1; }; + static_assert(sizeof(ClusterId) == sizeof(size_t), "should be a bit field."); struct Cluster { Cluster() = delete; @@ -101,8 +115,10 @@ public: private: InstructionBenchmarkClustering( const std::vector<InstructionBenchmark> &Points, double EpsilonSquared); + llvm::Error validateAndSetup(); void dbScan(size_t MinPts); + void stabilize(unsigned NumOpcodes); void rangeQuery(size_t Q, std::vector<size_t> &Scratchpad) const; const std::vector<InstructionBenchmark> &Points_; diff --git a/llvm/tools/llvm-exegesis/llvm-exegesis.cpp b/llvm/tools/llvm-exegesis/llvm-exegesis.cpp index d9976cfb155..915ffc5863f 100644 --- a/llvm/tools/llvm-exegesis/llvm-exegesis.cpp +++ b/llvm/tools/llvm-exegesis/llvm-exegesis.cpp @@ -96,13 +96,21 @@ static cl::opt<std::string> AnalysisInconsistenciesOutputFile("analysis-inconsistencies-output-file", cl::desc(""), cl::init("")); +static cl::opt<bool> AnalysisDisplayUnstableOpcodes( + "analysis-display-unstable-clusters", + cl::desc("if there is more than one benchmark for an opcode, said " + "benchmarks may end up not being clustered into the same cluster " + "if the measured performance characteristics are different. by " + "default all such opcodes are filtered out. this flag will " + "instead show only such unstable opcodes"), + cl::init(false)); + static cl::opt<std::string> CpuName("mcpu", cl::desc( "cpu name to use for pfm counters, leave empty to autodetect"), cl::init("")); - static ExitOnError ExitOnErr; #ifdef LLVM_EXEGESIS_INITIALIZE_NATIVE_TARGET @@ -432,10 +440,14 @@ static void analysisMain() { llvm::errs() << "unknown target '" << Points[0].LLVMTriple << "'\n"; return; } + + std::unique_ptr<llvm::MCInstrInfo> InstrInfo(TheTarget->createMCInstrInfo()); + const auto Clustering = ExitOnErr(InstructionBenchmarkClustering::create( - Points, AnalysisNumPoints, AnalysisEpsilon)); + Points, AnalysisNumPoints, AnalysisEpsilon, InstrInfo->getNumOpcodes())); - const Analysis Analyzer(*TheTarget, Clustering); + const Analysis Analyzer(*TheTarget, std::move(InstrInfo), Clustering, + AnalysisDisplayUnstableOpcodes); maybeRunAnalysis<Analysis::PrintClusters>(Analyzer, "analysis clusters", AnalysisClustersOutputFile); |