diff options
| author | Roman Lebedev <lebedev.ri@gmail.com> | 2019-09-16 16:18:24 +0000 | 
|---|---|---|
| committer | Roman Lebedev <lebedev.ri@gmail.com> | 2019-09-16 16:18:24 +0000 | 
| commit | 10151f661854e3ee4922662f1d0f62b327cbfa8c (patch) | |
| tree | 73c772df4cb51ff054d13a86f2bee66ddb3533c5 /llvm/lib/Transforms/Utils | |
| parent | 685d8a95c5a392b016b259b41c17065d30b66afe (diff) | |
| download | bcm5719-llvm-10151f661854e3ee4922662f1d0f62b327cbfa8c.tar.gz bcm5719-llvm-10151f661854e3ee4922662f1d0f62b327cbfa8c.zip  | |
[SimplifyCFG] FoldTwoEntryPHINode(): consider *total* speculation cost, not per-BB cost
Summary:
Previously, if the threshold was 2, we were willing to speculatively
execute 2 cheap instructions in both basic blocks (thus we were willing
to speculatively execute cost = 4), but weren't willing to speculate
when one BB had 3 instructions and other one had no instructions,
even thought that would have total cost of 3.
This looks inconsistent to me.
I don't think `cmov`-like instructions will start executing
until both of it's inputs are available: https://godbolt.org/z/zgHePf
So i don't see why the existing behavior is the correct one.
Also, let's add it's own `cl::opt` for this threshold,
with default=4, so it is not stricter than the previous threshold:
will allow to fold when there are 2 BB's each with cost=2.
And since the logic has changed, it will also allow to fold when
one BB has cost=3 and other cost=1, or there is only one BB with cost=4.
This is an alternative solution to D65148:
This fix is mainly motivated by `signbit-like-value-extension.ll` test.
That pattern comes up in JPEG decoding, see e.g.
`Figure F.12 – Extending the sign bit of a decoded value in V`
of `ITU T.81` (JPEG specification).
That branch is not predictable, and it is within the innermost loop,
so the fact that that pattern ends up being stuck with a branch
instead of `select` (i.e. `CMOV` for x86) is unlikely to be beneficial.
This has great results on the final assembly (vanilla test-suite + RawSpeed): (metric pass - D67240)
| metric                                 |     old |     new | delta |      % |
| x86-mi-counting.NumMachineFunctions    |   37720 |   37721 |     1 |  0.00% |
| x86-mi-counting.NumMachineBasicBlocks  |  773545 |  771181 | -2364 | -0.31% |
| x86-mi-counting.NumMachineInstructions | 7488843 | 7486442 | -2401 | -0.03% |
| x86-mi-counting.NumUncondBR            |  135770 |  135543 |  -227 | -0.17% |
| x86-mi-counting.NumCondBR              |  423753 |  422187 | -1566 | -0.37% |
| x86-mi-counting.NumCMOV                |   24815 |   25731 |   916 |  3.69% |
| x86-mi-counting.NumVecBlend            |      17 |      17 |     0 |  0.00% |
We significantly decrease basic block count, notably decrease instruction count,
significantly decrease branch count and very significantly increase `cmov` count.
Performance-wise, unsurprisingly, this has great effect on
target RawSpeed benchmark. I'm seeing 5 **major** improvements:
```
Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3064         -0.3064      226.9913      157.4452      226.9800      157.4384
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3057         -0.3057      226.8407      157.4926      226.8282      157.4828
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.4985         -0.4954        0.3051        0.1530        0.3040        0.1534
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1747         -0.1747       80.4787       66.4227       80.4771       66.4146
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1742         -0.1743       80.4686       66.4542       80.4690       66.4436
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +0.6089         +0.5797        0.0670        0.1078        0.0673        0.1062
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1598         -0.1598      171.6996      144.2575      171.6915      144.2538
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1598         -0.1597      171.7109      144.2755      171.7018      144.2766
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +0.4024         +0.3850        0.0847        0.1187        0.0848        0.1175
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0550         -0.0551      280.3046      264.8800      280.3017      264.8559
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0554         -0.0554      280.2628      264.7360      280.2574      264.7297
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 +0.7005         +0.7041        0.2779        0.4725        0.2775        0.4729
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                                   -0.0354         -0.0355      316.7396      305.5208      316.7342      305.4890
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                                 -0.0354         -0.0356      316.6969      305.4798      316.6917      305.4324
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                                 +0.0493         +0.0330        0.3562        0.3737        0.3563        0.3681
```
That being said, it's always best-effort, so there will likely
be cases where this worsens things.
Reviewers: efriedma, craig.topper, dmgreen, jmolloy, fhahn, Carrot, hfinkel, chandlerc
Reviewed By: jmolloy
Subscribers: xbolva00, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67318
llvm-svn: 372009
Diffstat (limited to 'llvm/lib/Transforms/Utils')
| -rw-r--r-- | llvm/lib/Transforms/Utils/SimplifyCFG.cpp | 27 | 
1 files changed, 14 insertions, 13 deletions
diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp index 2e0a07f105d..b7b644c981c 100644 --- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp +++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp @@ -94,6 +94,12 @@ static cl::opt<unsigned> PHINodeFoldingThreshold(      cl::desc(          "Control the amount of phi node folding to perform (default = 2)")); +static cl::opt<unsigned> TwoEntryPHINodeFoldingThreshold( +    "two-entry-phi-node-folding-threshold", cl::Hidden, cl::init(4), +    cl::desc("Control the maximal total instruction cost that we are willing " +             "to speculatively execute to fold a 2-entry PHI node into a " +             "select (default = 4)")); +  static cl::opt<bool> DupRet(      "simplifycfg-dup-ret", cl::Hidden, cl::init(false),      cl::desc("Duplicate return instructions into unconditional branches")); @@ -332,7 +338,7 @@ static unsigned ComputeSpeculationCost(const User *I,  /// CostRemaining, false is returned and CostRemaining is undefined.  static bool DominatesMergePoint(Value *V, BasicBlock *BB,                                  SmallPtrSetImpl<Instruction *> &AggressiveInsts, -                                unsigned &CostRemaining, +                                int &BudgetRemaining,                                  const TargetTransformInfo &TTI,                                  unsigned Depth = 0) {    // It is possible to hit a zero-cost cycle (phi/gep instructions for example), @@ -375,7 +381,7 @@ static bool DominatesMergePoint(Value *V, BasicBlock *BB,    if (!isSafeToSpeculativelyExecute(I))      return false; -  unsigned Cost = ComputeSpeculationCost(I, TTI); +  BudgetRemaining -= ComputeSpeculationCost(I, TTI);    // Allow exactly one instruction to be speculated regardless of its cost    // (as long as it is safe to do so). @@ -383,17 +389,14 @@ static bool DominatesMergePoint(Value *V, BasicBlock *BB,    // or other expensive operation. The speculation of an expensive instruction    // is expected to be undone in CodeGenPrepare if the speculation has not    // enabled further IR optimizations. -  if (Cost > CostRemaining && +  if (BudgetRemaining < 0 &&        (!SpeculateOneExpensiveInst || !AggressiveInsts.empty() || Depth > 0))      return false; -  // Avoid unsigned wrap. -  CostRemaining = (Cost > CostRemaining) ? 0 : CostRemaining - Cost; -    // Okay, we can only really hoist these out if their operands do    // not take us over the cost threshold.    for (User::op_iterator i = I->op_begin(), e = I->op_end(); i != e; ++i) -    if (!DominatesMergePoint(*i, BB, AggressiveInsts, CostRemaining, TTI, +    if (!DominatesMergePoint(*i, BB, AggressiveInsts, BudgetRemaining, TTI,                               Depth + 1))        return false;    // Okay, it's safe to do this!  Remember this instruction. @@ -2322,10 +2325,8 @@ static bool FoldTwoEntryPHINode(PHINode *PN, const TargetTransformInfo &TTI,    // instructions.  While we are at it, keep track of the instructions    // that need to be moved to the dominating block.    SmallPtrSet<Instruction *, 4> AggressiveInsts; -  unsigned MaxCostVal0 = PHINodeFoldingThreshold, -           MaxCostVal1 = PHINodeFoldingThreshold; -  MaxCostVal0 *= TargetTransformInfo::TCC_Basic; -  MaxCostVal1 *= TargetTransformInfo::TCC_Basic; +  int BudgetRemaining = +      TwoEntryPHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic;    for (BasicBlock::iterator II = BB->begin(); isa<PHINode>(II);) {      PHINode *PN = cast<PHINode>(II++); @@ -2336,9 +2337,9 @@ static bool FoldTwoEntryPHINode(PHINode *PN, const TargetTransformInfo &TTI,      }      if (!DominatesMergePoint(PN->getIncomingValue(0), BB, AggressiveInsts, -                             MaxCostVal0, TTI) || +                             BudgetRemaining, TTI) ||          !DominatesMergePoint(PN->getIncomingValue(1), BB, AggressiveInsts, -                             MaxCostVal1, TTI)) +                             BudgetRemaining, TTI))        return false;    }  | 

