| field | value | date |
|---|---|---|
| author | Sanjay Patel <spatel@rotateright.com> | 2015-04-15 15:22:55 +0000 |
| committer | Sanjay Patel <spatel@rotateright.com> | 2015-04-15 15:22:55 +0000 |
| commit | 7024b8121a9e51d468302e43ed41aeb6e1fb7274 (patch) | |
| tree | 48ff3f56bac06c236228c51546fe471acba71c48 /llvm/lib | |
| parent | 280d8dc9f06989dea6b304d780f43e522146a6eb (diff) | |
[x86] Implement combineRepeatedFPDivisors
Set the transform bar at 2 divisions because the fastest current
x86 FP divider circuit (Sandy Bridge / Haswell) has a best-case
latency of 10 cycles, versus 5 cycles for a multiply. That is the
worst case for this transform (no latency win), but multiplies are
pipelined while divisions are not, so there is still a big
throughput win, which we would expect to show up in typical FP code.
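
For reference, a hypothetical source function like the one below (names invented for illustration, not taken from the commit) could compile to the "before" sequence that follows: with the SysV calling convention, `a`, `b`, and `d` arrive in %xmm0, %xmm1, and %xmm2, and both divisions share the divisor `d`.

```cpp
// Hypothetical example: two divisions by the same divisor 'd'.
// Parsed as ((a / d) * b) / d, i.e. div, mul, div.
// Rewriting the divisions as multiplies by 1/d changes rounding slightly,
// which is why this kind of combine is normally reserved for relaxed FP math.
float repeated_div(float a, float b, float d) {
  return a / d * b / d;
}
```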
These are the sequences I'm comparing:
```asm
	divss	%xmm2, %xmm0
	mulss	%xmm1, %xmm0
	divss	%xmm2, %xmm0
```
Becomes:
```asm
	movss	LCPI0_0(%rip), %xmm3    ## xmm3 = mem[0],zero,zero,zero
	divss	%xmm2, %xmm3
	mulss	%xmm3, %xmm0
	mulss	%xmm1, %xmm0
	mulss	%xmm3, %xmm0
```
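
At the source level, the transformed sequence corresponds to computing the reciprocal once and replacing each division with a multiply. A rough sketch, again with invented names, not the compiler's literal output:

```cpp
// Sketch of what the transformed asm computes: one real division to form
// the reciprocal (the constant loaded from LCPI0_0 is presumably 1.0),
// then the chain of three mulss shown above.
float repeated_div_recip(float a, float b, float d) {
  float r = 1.0f / d;      // the single remaining divss
  return a * r * b * r;    // ((a * r) * b) * r
}
```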
[Ignore for the moment that we don't yet optimize the chain of 3 multiplies
into 2 independent fmuls followed by 1 dependent fmul (see the sketch below);
this is the DAG version of https://llvm.org/bugs/show_bug.cgi?id=21768.
If we fix that, the transform becomes even more profitable on all targets.]
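
The reassociated form that the note refers to would compute the two multiplies by the reciprocal independently and join them with one final multiply, shortening the dependency chain. An illustrative sketch only:

```cpp
// Illustrative only: reassociating ((a * r) * b) * r into (a * r) * (b * r)
// yields two independent fmuls followed by one dependent fmul.
float repeated_div_reassoc(float a, float b, float d) {
  float r = 1.0f / d;
  return (a * r) * (b * r);
}
```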
Differential Revision: http://reviews.llvm.org/D8941
llvm-svn: 235012
Diffstat (limited to 'llvm/lib')
| mode | file | lines |
|---|---|---|
| -rw-r--r-- | llvm/lib/Target/X86/X86ISelLowering.cpp | 10 |
| -rw-r--r-- | llvm/lib/Target/X86/X86ISelLowering.h | 3 |
2 files changed, 13 insertions, 0 deletions
```diff
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1c60237f75b..c32412a741c 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -12818,6 +12818,16 @@ SDValue X86TargetLowering::getRecipEstimate(SDValue Op,
   return SDValue();
 }
 
+/// If we have at least two divisions that use the same divisor, convert to
+/// multiplication by a reciprocal. This may need to be adjusted for a given
+/// CPU if a division's cost is not at least twice the cost of a multiplication.
+/// This is because we still need one division to calculate the reciprocal and
+/// then we need two multiplies by that reciprocal as replacements for the
+/// original divisions.
+bool X86TargetLowering::combineRepeatedFPDivisors(unsigned NumUsers) const {
+  return NumUsers > 1;
+}
+
 static bool isAllOnes(SDValue V) {
   ConstantSDNode *C = dyn_cast<ConstantSDNode>(V);
   return C && C->isAllOnesValue();
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index dd20ec23976..5130c37b042 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1072,6 +1072,9 @@ namespace llvm {
     /// Use rcp* to speed up fdiv calculations.
     SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                              unsigned &RefinementSteps) const override;
+
+    /// Reassociate floating point divisions into multiply by reciprocal.
+    bool combineRepeatedFPDivisors(unsigned NumUsers) const override;
   };
 
   namespace X86 {
```
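
As a standalone illustration of the `NumUsers > 1` threshold, here is a small self-contained model of the rewrite's effect in plain C++ (this is not LLVM code; the function name and structure are invented): a single division is left alone, while two or more divisions by the same divisor are funneled through one reciprocal.

```cpp
#include <cstdio>
#include <vector>

// Toy model (not the DAG-level implementation): divide every element of
// 'nums' by 'd'. With more than one division by the same divisor, pay for
// one real division to form the reciprocal and use multiplies; with a
// single division the rewrite would not help, so divide directly.
static std::vector<float> divideAllBy(const std::vector<float> &nums, float d) {
  std::vector<float> out;
  out.reserve(nums.size());
  if (nums.size() > 1) {        // mirrors the NumUsers > 1 threshold
    float r = 1.0f / d;         // one real division
    for (float n : nums)
      out.push_back(n * r);     // multiplies replace the remaining divisions
  } else {
    for (float n : nums)
      out.push_back(n / d);
  }
  return out;
}

int main() {
  // Note: n * (1/d) can differ from n / d in the last bit, which is why
  // this kind of rewrite is reserved for relaxed FP-math settings.
  for (float v : divideAllBy({1.0f, 2.0f, 3.0f}, 7.0f))
    std::printf("%.9g\n", v);
  return 0;
}
```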