[HotColdSplit] Introduce a cost model to control splitting behavior

The main goal of the model is to avoid *increasing* function size, as that
would eradicate any memory locality benefits from splitting. This happens when:

  - There are too many inputs or outputs to the cold region. Argument
    materialization and reloads of outputs have a cost.
  - The cold region has too many distinct exit blocks, causing a large switch
    to be formed in the caller.
  - The code size cost of the split code is less than the cost of a set-up
    call.

A secondary goal is to prevent excessive overall binary size growth.

With the cost model in place, I experimented to find a splitting threshold
that works well in practice. To make warm & cold code easily separable for
analysis purposes, I moved split functions to a "cold" section. I experimented
with thresholds between [0, 4] and set the default to the threshold which
minimized geomean __text size.

Experiment data from building LNT+externals for X86 (N = 639 programs, all
sizes in bytes):

| Configuration | __text geom size | __cold geom size | TEXT geom size |
| **-Os**       | 1736.3           | 0, n=0           | 10961.6        |
| -Os, thresh=0 | 1740.53          | 124.482, n=134   | 11014          |
| -Os, thresh=1 | 1734.79          | 57.8781, n=90    | 10978.6        |
| -Os, thresh=2 | ** 1733.85 **    | 65.6604, n=61    | 10977.6        |
| -Os, thresh=3 | 1733.85          | 65.3071, n=61    | 10977.6        |
| -Os, thresh=4 | 1735.08          | 67.5156, n=54    | 10965.7        |
| **-Oz**       | 1554.4           | 0, n=0           | 10153          |
| -Oz, thresh=2 | ** 1552.2 **     | 65.633, n=61     | 10176          |
| **-O3**       | 2563.37          | 0, n=0           | 13105.4        |
| -O3, thresh=2 | ** 2559.49 **    | 71.1072, n=61    | 13162.4        |

Picking thresh=2 reduces the geomean __text section size by 0.14% at -Os, -Oz,
and -O3 and causes ~0.2% growth in the TEXT segment. Note that TEXT size is
page-aligned, whereas section sizes are byte-aligned.

Experiment data from building LNT+externals for ARM64 (N = 558 programs, all
sizes in bytes):

| Configuration | __text geom size | __cold geom size | TEXT geom size |
| **-Os**       | 1763.96          | 0, n=0           | 42934.9        |
| -Os, thresh=2 | ** 1760.9 **     | 76.6755, n=61    | 42934.9        |

Picking thresh=2 reduces the geomean __text section size by 0.17% at -Os and
causes no growth in the TEXT segment.

Measurements were done with D57082 (r352080) applied.

Differential Revision: https://reviews.llvm.org/D57125
llvm-svn: 352228
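
The __text reductions quoted in the message can be rechecked from the geomean figures it reports. The sketch below is only an illustration: `percentChange` is a hypothetical helper, not part of the patch, and the numbers in the usage note are copied from the experiment data.

```cpp
#include <cassert>

// Hypothetical helper (not part of the patch): relative change from Before
// to After, in percent. A negative result means the size shrank.
double percentChange(double Before, double After) {
  assert(Before != 0.0 && "baseline size must be non-zero");
  return (After - Before) / Before * 100.0;
}
```

For the X86 -Os row, `percentChange(1736.3, 1733.85)` comes out at roughly -0.14, and for the ARM64 -Os row, `percentChange(1763.96, 1760.9)` comes out at roughly -0.17, in line with the percentages stated in the message.
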
author: Vedant Kumar <vsk@apple.com> 2019-01-25 18:30:37 +0000
committer: Vedant Kumar <vsk@apple.com> 2019-01-25 18:30:37 +0000
commit: db3f9774eefc662cbcf976b51f459c80d2664d82 (patch)
tree: 30a63bacfcc58e059c7c5d13bf22cebb266d6e64 /llvm/lib
parent: 13ef84fced995ad67942349e6814bfcb005e19ea (diff)
download: bcm5719-llvm-db3f9774eefc662cbcf976b51f459c80d2664d82.tar.gz
bcm5719-llvm-db3f9774eefc662cbcf976b51f459c80d2664d82.zip
1 files changed, 91 insertions, 36 deletions
diff --git a/llvm/lib/Transforms/IPO/HotColdSplitting.cpp b/llvm/lib/Transforms/IPO/HotColdSplitting.cpp
index 8250db706e3..36dd6fa4be7 100644
--- a/llvm/lib/Transforms/IPO/HotColdSplitting.cpp
+++ b/llvm/lib/Transforms/IPO/HotColdSplitting.cpp
@@ -80,9 +80,9 @@ static cl::opt<bool> EnableStaticAnalyis("hot-cold-static-analysis",
                               cl::init(true), cl::Hidden);
 
 static cl::opt<int>
-    SplittingThreshold("hotcoldsplit-threshold", cl::init(3), cl::Hidden,
-                       cl::desc("Code size threshold for splitting cold code "
-                                "(as a multiple of TCC_Basic)"));
+    SplittingThreshold("hotcoldsplit-threshold", cl::init(2), cl::Hidden,
+                       cl::desc("Base penalty for splitting cold code (as a "
+                                "multiple of TCC_Basic)"));
 
 namespace {
 
@@ -139,31 +139,6 @@ static bool mayExtractBlock(const BasicBlock &BB) {
          !isa<InvokeInst>(BB.getTerminator());
 }
 
-/// Check whether \p Region is profitable to outline.
-static bool isProfitableToOutline(const BlockSequence &Region,
-                                  TargetTransformInfo &TTI) {
-  // If the splitting threshold is set at or below zero, skip the usual
-  // profitability check.
-  if (SplittingThreshold <= 0)
-    return true;
-
-  if (Region.size() > 1)
-    return true;
-
-  int Cost = 0;
-  const BasicBlock &BB = *Region[0];
-  for (const Instruction &I : BB) {
-    if (isa<DbgInfoIntrinsic>(&I) || &I == BB.getTerminator())
-      continue;
-
-    Cost += TTI.getInstructionCost(&I, TargetTransformInfo::TCK_CodeSize);
-
-    if (Cost >= (SplittingThreshold * TargetTransformInfo::TCC_Basic))
-      return true;
-  }
-  return false;
-}
-
 /// Mark \p F cold. Based on this assumption, also optimize it for minimum size.
 /// Return true if the function is changed.
 static bool markFunctionCold(Function &F) {
@@ -247,6 +222,82 @@ bool HotColdSplitting::shouldOutlineFrom(const Function &F) const {
   return true;
 }
 
+/// Get the benefit score of outlining \p Region.
+static int getOutliningBenefit(ArrayRef<BasicBlock *> Region,
+                               TargetTransformInfo &TTI) {
+  // Sum up the code size costs of non-terminator instructions. Tight coupling
+  // with \ref getOutliningPenalty is needed to model the costs of terminators.
+  int Benefit = 0;
+  for (BasicBlock *BB : Region)
+    for (Instruction &I : BB->instructionsWithoutDebug())
+      if (&I != BB->getTerminator())
+        Benefit +=
+            TTI.getInstructionCost(&I, TargetTransformInfo::TCK_CodeSize);
+
+  return Benefit;
+}
+
+/// Get the penalty score for outlining \p Region.
+static int getOutliningPenalty(ArrayRef<BasicBlock *> Region,
+                               unsigned NumInputs, unsigned NumOutputs) {
+  int Penalty = SplittingThreshold;
+  LLVM_DEBUG(dbgs() << "Applying penalty for splitting: " << Penalty << "\n");
+
+  // If the splitting threshold is set at or below zero, skip the usual
+  // profitability check.
+  if (SplittingThreshold <= 0)
+    return Penalty;
+
+  // The typical code size cost for materializing an argument for the outlined
+  // call.
+  LLVM_DEBUG(dbgs() << "Applying penalty for: " << NumInputs << " inputs\n");
+  const int CostForArgMaterialization = TargetTransformInfo::TCC_Basic;
+  Penalty += CostForArgMaterialization * NumInputs;
+
+  // The typical code size cost for an output alloca, its associated store, and
+  // its associated reload.
+  LLVM_DEBUG(dbgs() << "Applying penalty for: " << NumOutputs << " outputs\n");
+  const int CostForRegionOutput = 3 * TargetTransformInfo::TCC_Basic;
+  Penalty += CostForRegionOutput * NumOutputs;
+
+  // Find the number of distinct exit blocks for the region. Use a conservative
+  // check to determine whether control returns from the region.
+  bool NoBlocksReturn = true;
+  SmallPtrSet<BasicBlock *, 2> SuccsOutsideRegion;
+  for (BasicBlock *BB : Region) {
+    // If a block has no successors, only assume it does not return if it's
+    // unreachable.
+    if (succ_empty(BB)) {
+      NoBlocksReturn &= isa<UnreachableInst>(BB->getTerminator());
+      continue;
+    }
+
+    for (BasicBlock *SuccBB : successors(BB)) {
+      if (find(Region, SuccBB) == Region.end()) {
+        NoBlocksReturn = false;
+        SuccsOutsideRegion.insert(SuccBB);
+      }
+    }
+  }
+
+  // Apply a `noreturn` bonus.
+  if (NoBlocksReturn) {
+    LLVM_DEBUG(dbgs() << "Applying bonus for: " << Region.size()
+                      << " non-returning terminators\n");
+    Penalty -= Region.size();
+  }
+
+  // Apply a penalty for having more than one successor outside of the region.
+  // This penalty accounts for the switch needed in the caller.
+  if (!SuccsOutsideRegion.empty()) {
+    LLVM_DEBUG(dbgs() << "Applying penalty for: " << SuccsOutsideRegion.size()
+                      << " non-region successors\n");
+    Penalty += (SuccsOutsideRegion.size() - 1) * TargetTransformInfo::TCC_Basic;
+  }
+
+  return Penalty;
+}
+
 Function *HotColdSplitting::extractColdRegion(const BlockSequence &Region,
                                               DominatorTree &DT,
                                               BlockFrequencyInfo *BFI,
@@ -261,6 +312,18 @@ Function *HotColdSplitting::extractColdRegion(const BlockSequence &Region,
                    /* AllowAlloca */ false,
                    /* Suffix */ "cold." + std::to_string(Count));
 
+  // Perform a simple cost/benefit analysis to decide whether or not to permit
+  // splitting.
+  SetVector<Value *> Inputs, Outputs, Sinks;
+  CE.findInputsOutputs(Inputs, Outputs, Sinks);
+  int OutliningBenefit = getOutliningBenefit(Region, TTI);
+  int OutliningPenalty =
+      getOutliningPenalty(Region, Inputs.size(), Outputs.size());
+  LLVM_DEBUG(dbgs() << "Split profitability: benefit = " << OutliningBenefit
+                    << ", penalty = " << OutliningPenalty << "\n");
+  if (OutliningBenefit <= OutliningPenalty)
+    return nullptr;
+
   Function *OrigF = Region[0]->getParent();
   if (Function *OutF = CE.extractCodeRegion()) {
     User *U = *OutF->user_begin();
@@ -556,14 +619,6 @@ bool HotColdSplitting::outlineColdRegions(Function &F, bool HasProfileSummary) {
     assert(!Region.empty() && "Empty outlining region in worklist");
     do {
       BlockSequence SubRegion = Region.takeSingleEntrySubRegion(*DT);
-      if (!isProfitableToOutline(SubRegion, TTI)) {
-        LLVM_DEBUG({
-          dbgs() << "Skipping outlining; not profitable to outline\n";
-          SubRegion[0]->dump();
-        });
-        continue;
-      }
-
       LLVM_DEBUG({
         dbgs() << "Hot/cold splitting attempting to outline these blocks:\n";
         for (BasicBlock *BB : SubRegion)
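
The cost/benefit decision introduced by this patch can be exercised outside of LLVM with a small stand-alone sketch. Everything named below (`RegionSummary`, `kBasicCost`, `outliningPenalty`, `shouldSplit`) is a hypothetical stand-in for illustration; the real code queries TargetTransformInfo and CodeExtractor for these inputs. `kBasicCost` assumes that `TargetTransformInfo::TCC_Basic` is 1.

```cpp
#include <cassert>

// Assumed value of TargetTransformInfo::TCC_Basic.
constexpr int kBasicCost = 1;

// Hypothetical summary of a cold region. In the patch these values come from
// the region's blocks, TTI, and CodeExtractor::findInputsOutputs.
struct RegionSummary {
  int NumBlocks;       // Number of basic blocks in the region.
  int BodyCost;        // Summed code-size cost of non-terminator instructions.
  int NumInputs;       // Values passed into the outlined function.
  int NumOutputs;      // Values passed back out of it.
  int NumExitTargets;  // Distinct successor blocks outside the region.
  bool NoBlocksReturn; // No exits, and every successor-less block is unreachable.
};

// Mirrors the structure of getOutliningPenalty from the diff.
int outliningPenalty(const RegionSummary &R, int Threshold) {
  assert(R.NumBlocks > 0 && "a region needs at least one block");
  int Penalty = Threshold;
  // A threshold at or below zero skips the remaining adjustments.
  if (Threshold <= 0)
    return Penalty;
  Penalty += kBasicCost * R.NumInputs;      // Argument materialization.
  Penalty += 3 * kBasicCost * R.NumOutputs; // Alloca, store, and reload.
  if (R.NoBlocksReturn)
    Penalty -= R.NumBlocks;                 // The `noreturn` bonus.
  if (R.NumExitTargets > 0)
    Penalty += (R.NumExitTargets - 1) * kBasicCost; // Switch in the caller.
  return Penalty;
}

// Splitting is permitted only when the benefit strictly exceeds the penalty.
bool shouldSplit(const RegionSummary &R, int Threshold) {
  return R.BodyCost > outliningPenalty(R, Threshold);
}
```

With the default threshold of 2, a single non-returning block with a body cost of 2 and no inputs or outputs has a penalty of 1 and is split. A block with a body cost of 4, one input, one output, and two exit targets has a penalty of 7 and is left in place.
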