author     Sanjay Patel <spatel@rotateright.com>  2019-10-05 18:03:58 +0000
committer  Sanjay Patel <spatel@rotateright.com>  2019-10-05 18:03:58 +0000
commit     e2321bb4488a81b87742f3343e3bdf8e161aa35b (patch)
tree       48e6260a743b8adf2a2866d6250955e09c2ce8a6 /llvm/lib/Analysis
parent     9ecacb0d54fb89dc7e6da66d9ecae934ca5c01d4 (diff)
[SLP] avoid reduction transform on patterns that the backend can load-combine
I don't see an ideal solution to these 2 related, potentially large, perf regressions:

https://bugs.llvm.org/show_bug.cgi?id=42708
https://bugs.llvm.org/show_bug.cgi?id=43146

We decided that load combining was unsuitable for IR because it could obscure other optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend. Therefore, preventing SLP from destroying load combine opportunities requires that it recognize patterns that could be combined later, but not do the optimization itself (it's not a vector combine anyway, so it's probably out of scope for SLP).

Here, we add a scalar cost model adjustment with a conservative pattern match and cost summation for a multi-instruction sequence that can probably be reduced later. This should prevent SLP from creating a vector reduction unless that sequence is extremely cheap.

In the x86 tests shown (and discussed in more detail in the bug reports), SDAG combining will produce a single instruction on these tests, such as:

  movbe rax, qword ptr [rdi]

or:

  mov rax, qword ptr [rdi]

not the (half) vector monstrosity that we currently produce via SLP:

  vpmovzxbq ymm0, dword ptr [rdi + 1] # ymm0 = mem[0],zero,zero,..
  vpsllvq ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
  movzx eax, byte ptr [rdi]
  movzx ecx, byte ptr [rdi + 5]
  shl rcx, 40
  movzx edx, byte ptr [rdi + 6]
  shl rdx, 48
  or rdx, rcx
  movzx ecx, byte ptr [rdi + 7]
  shl rcx, 56
  or rcx, rdx
  or rcx, rax
  vextracti128 xmm1, ymm0, 1
  vpor xmm0, xmm0, xmm1
  vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
  vpor xmm0, xmm0, xmm1
  vmovq rax, xmm0
  or rax, rcx
  vzeroupper
  ret

Differential Revision: https://reviews.llvm.org/D67841

llvm-svn: 373833
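For context, a rough source-level sketch of the kind of pattern at stake (illustrative only; the function name and exact shape are assumptions, not code taken from the bug reports):

  #include <cstdint>

  // Assembles a 64-bit value from 8 consecutive bytes. In IR this becomes a
  // chain of zext(load i8), shl, and or; SelectionDAG can combine it into a
  // single 8-byte load (or movbe for the byte-swapped variant), as long as
  // SLP does not turn the 'or' chain into a vector reduction first.
  uint64_t load_le64(const uint8_t *p) {
    return (uint64_t)p[0]         | ((uint64_t)p[1] << 8)  |
           ((uint64_t)p[2] << 16) | ((uint64_t)p[3] << 24) |
           ((uint64_t)p[4] << 32) | ((uint64_t)p[5] << 40) |
           ((uint64_t)p[6] << 48) | ((uint64_t)p[7] << 56);
  }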
Diffstat (limited to 'llvm/lib/Analysis')
-rw-r--r--  llvm/lib/Analysis/TargetTransformInfo.cpp | 53
1 file changed, 53 insertions(+), 0 deletions(-)
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index f3d20ce984d..6730aa86a99 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -571,11 +571,64 @@ TargetTransformInfo::getOperandInfo(Value *V, OperandValueProperties &OpProps) {
   return OpInfo;
 }
 
+Optional<int>
+TargetTransformInfo::getLoadCombineCost(unsigned Opcode,
+                                        ArrayRef<const Value *> Args) const {
+  if (Opcode != Instruction::Or)
+    return llvm::None;
+  if (Args.empty())
+    return llvm::None;
+
+  // Look past the reduction to find a source value. Arbitrarily follow the
+  // path through operand 0 of any 'or'. Also, peek through optional
+  // shift-left-by-constant.
+  const Value *ZextLoad = Args.front();
+  while (match(ZextLoad, m_Or(m_Value(), m_Value())) ||
+         match(ZextLoad, m_Shl(m_Value(), m_Constant())))
+    ZextLoad = cast<BinaryOperator>(ZextLoad)->getOperand(0);
+
+  // Check if the input to the reduction is an extended load.
+  Value *LoadPtr;
+  if (!match(ZextLoad, m_ZExt(m_Load(m_Value(LoadPtr)))))
+    return llvm::None;
+
+  // Require that the total load bit width is a legal integer type.
+  // For example, <8 x i8> --> i64 is a legal integer on a 64-bit target.
+  // But <16 x i8> --> i128 is not, so the backend probably can't reduce it.
+  Type *WideType = ZextLoad->getType();
+  Type *EltType = LoadPtr->getType()->getPointerElementType();
+  unsigned WideWidth = WideType->getIntegerBitWidth();
+  unsigned EltWidth = EltType->getIntegerBitWidth();
+  if (!isTypeLegal(WideType) || WideWidth % EltWidth != 0)
+    return llvm::None;
+
+  // Calculate relative cost: {narrow load+zext+shl+or} are assumed to be
+  // removed and replaced by a single wide load.
+  // FIXME: This is not accurate for the larger pattern where we replace
+  // multiple narrow load sequences with just 1 wide load. We could
+  // remove the addition of the wide load cost here and expect the caller
+  // to make an adjustment for that.
+  int Cost = 0;
+  Cost -= getMemoryOpCost(Instruction::Load, EltType, 0, 0);
+  Cost -= getCastInstrCost(Instruction::ZExt, WideType, EltType);
+  Cost -= getArithmeticInstrCost(Instruction::Shl, WideType);
+  Cost -= getArithmeticInstrCost(Instruction::Or, WideType);
+  Cost += getMemoryOpCost(Instruction::Load, WideType, 0, 0);
+  return Cost;
+}
+
+
 int TargetTransformInfo::getArithmeticInstrCost(
     unsigned Opcode, Type *Ty, OperandValueKind Opd1Info,
     OperandValueKind Opd2Info, OperandValueProperties Opd1PropInfo,
     OperandValueProperties Opd2PropInfo,
     ArrayRef<const Value *> Args) const {
+  // Check if we can match this instruction as part of a larger pattern.
+  Optional<int> LoadCombineCost = getLoadCombineCost(Opcode, Args);
+  if (LoadCombineCost)
+    return LoadCombineCost.getValue();
+
+  // Fallback to implementation-specific overrides or base class.
   int Cost = TTIImpl->getArithmeticInstrCost(Opcode, Ty, Opd1Info, Opd2Info,
                                              Opd1PropInfo, Opd2PropInfo, Args);
   assert(Cost >= 0 && "TTI should not produce negative costs!");
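To make the relative-cost bookkeeping in getLoadCombineCost above concrete, here is a standalone sketch using assumed unit costs (hypothetical numbers and an invented helper name; the real values come from the target's TTI hooks):

  #include <cassert>

  // Sketch of the cost summation above, assuming every narrow load, zext,
  // shl, 'or', and the wide load each cost 1 (hypothetical values; actual
  // costs are target-specific).
  int loadCombineCostSketch(int NarrowLoadCost = 1, int ZExtCost = 1,
                            int ShlCost = 1, int OrCost = 1,
                            int WideLoadCost = 1) {
    int Cost = 0;
    Cost -= NarrowLoadCost; // the narrow load goes away
    Cost -= ZExtCost;       // so does its zero-extension
    Cost -= ShlCost;        // and the shift that positions the byte
    Cost -= OrCost;         // and the 'or' that merges it into the result
    Cost += WideLoadCost;   // replaced by (a share of) one wide load
    return Cost;            // -3 with unit costs: strongly favors scalar code
  }

  int main() {
    // Each matched 'or' in the chain reports this negative scalar cost, so
    // the vector reduction only wins if it is modeled as extremely cheap.
    assert(loadCombineCostSketch() == -3);
    return 0;
  }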