| field | value | date |
|---|---|---|
| author | Sanjay Patel <spatel@rotateright.com> | 2015-04-15 15:22:55 +0000 |
| committer | Sanjay Patel <spatel@rotateright.com> | 2015-04-15 15:22:55 +0000 |
| commit | 7024b8121a9e51d468302e43ed41aeb6e1fb7274 (patch) | |
| tree | 48ff3f56bac06c236228c51546fe471acba71c48 /llvm/lib | |
| parent | 280d8dc9f06989dea6b304d780f43e522146a6eb (diff) | |
[x86] Implement combineRepeatedFPDivisors
Set the transform bar at 2 divisions because the fastest current
x86 FP divider circuit (Sandy Bridge / Haswell) has a best-case
latency of 10 cycles, versus 5 cycles for a multiply. That is the
worst case for this transform (no latency win), but multiplies are
pipelined while divisions are not, so there is still a big
throughput win, which we would expect to show up in typical FP code.
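
For reference, a hypothetical source function like the one below (names invented for illustration, not taken from the commit) could compile to the "before" sequence that follows: with the SysV calling convention, `a`, `b`, and `d` arrive in %xmm0, %xmm1, and %xmm2, and both divisions share the divisor `d`.

```cpp
// Hypothetical example: two divisions by the same divisor 'd'.
// Parsed as ((a / d) * b) / d, i.e. div, mul, div.
// Rewriting the divisions as multiplies by 1/d changes rounding slightly,
// which is why this kind of combine is normally reserved for relaxed FP math.
float repeated_div(float a, float b, float d) {
  return a / d * b / d;
}
```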
These are the sequences I'm comparing:
```asm
	divss	%xmm2, %xmm0
	mulss	%xmm1, %xmm0
	divss	%xmm2, %xmm0
```
Becomes:
```asm
	movss	LCPI0_0(%rip), %xmm3    ## xmm3 = mem[0],zero,zero,zero
	divss	%xmm2, %xmm3
	mulss	%xmm3, %xmm0
	mulss	%xmm1, %xmm0
	mulss	%xmm3, %xmm0
```
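
At the source level, the transformed sequence corresponds to computing the reciprocal once and replacing each division with a multiply. A rough sketch, again with invented names, not the compiler's literal output:

```cpp
// Sketch of what the transformed asm computes: one real division to form
// the reciprocal (the constant loaded from LCPI0_0 is presumably 1.0),
// then the chain of three mulss shown above.
float repeated_div_recip(float a, float b, float d) {
  float r = 1.0f / d;      // the single remaining divss
  return a * r * b * r;    // ((a * r) * b) * r
}
```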
[Ignore for the moment that we don't yet optimize the chain of 3 multiplies
into 2 independent fmuls followed by 1 dependent fmul (see the sketch below);
this is the DAG version of https://llvm.org/bugs/show_bug.cgi?id=21768.
If we fix that, the transform becomes even more profitable on all targets.]
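
The reassociated form that the note refers to would compute the two multiplies by the reciprocal independently and join them with one final multiply, shortening the dependency chain. An illustrative sketch only:

```cpp
// Illustrative only: reassociating ((a * r) * b) * r into (a * r) * (b * r)
// yields two independent fmuls followed by one dependent fmul.
float repeated_div_reassoc(float a, float b, float d) {
  float r = 1.0f / d;
  return (a * r) * (b * r);
}
```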
Differential Revision: http://reviews.llvm.org/D8941
llvm-svn: 235012
Diffstat (limited to 'llvm/lib')
| mode | file | lines |
|---|---|---|
| -rw-r--r-- | llvm/lib/Target/X86/X86ISelLowering.cpp | 10 |
| -rw-r--r-- | llvm/lib/Target/X86/X86ISelLowering.h | 3 |
2 files changed, 13 insertions, 0 deletions
```diff
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1c60237f75b..c32412a741c 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -12818,6 +12818,16 @@ SDValue X86TargetLowering::getRecipEstimate(SDValue Op,
   return SDValue();
 }
 
+/// If we have at least two divisions that use the same divisor, convert to
+/// multiplication by a reciprocal. This may need to be adjusted for a given
+/// CPU if a division's cost is not at least twice the cost of a multiplication.
+/// This is because we still need one division to calculate the reciprocal and
+/// then we need two multiplies by that reciprocal as replacements for the
+/// original divisions.
+bool X86TargetLowering::combineRepeatedFPDivisors(unsigned NumUsers) const {
+  return NumUsers > 1;
+}
+
 static bool isAllOnes(SDValue V) {
   ConstantSDNode *C = dyn_cast<ConstantSDNode>(V);
   return C && C->isAllOnesValue();
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index dd20ec23976..5130c37b042 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1072,6 +1072,9 @@ namespace llvm {
     /// Use rcp* to speed up fdiv calculations.
     SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                              unsigned &RefinementSteps) const override;
+
+    /// Reassociate floating point divisions into multiply by reciprocal.
+    bool combineRepeatedFPDivisors(unsigned NumUsers) const override;
   };
 
   namespace X86 {
```
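
As a standalone illustration of the `NumUsers > 1` threshold, here is a small self-contained model of the rewrite's effect in plain C++ (this is not LLVM code; the function name and structure are invented): a single division is left alone, while two or more divisions by the same divisor are funneled through one reciprocal.

```cpp
#include <cstdio>
#include <vector>

// Toy model (not the DAG-level implementation): divide every element of
// 'nums' by 'd'. With more than one division by the same divisor, pay for
// one real division to form the reciprocal and use multiplies; with a
// single division the rewrite would not help, so divide directly.
static std::vector<float> divideAllBy(const std::vector<float> &nums, float d) {
  std::vector<float> out;
  out.reserve(nums.size());
  if (nums.size() > 1) {        // mirrors the NumUsers > 1 threshold
    float r = 1.0f / d;         // one real division
    for (float n : nums)
      out.push_back(n * r);     // multiplies replace the remaining divisions
  } else {
    for (float n : nums)
      out.push_back(n / d);
  }
  return out;
}

int main() {
  // Note: n * (1/d) can differ from n / d in the last bit, which is why
  // this kind of rewrite is reserved for relaxed FP-math settings.
  for (float v : divideAllBy({1.0f, 2.0f, 3.0f}, 7.0f))
    std::printf("%.9g\n", v);
  return 0;
}
```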