From 6dafe97271c871f1568f6c98c8daacbd09ecbae9 Mon Sep 17 00:00:00 2001
From: Adam Nemet <anemet@apple.com>
Date: Sun, 30 Mar 2014 18:07:13 +0000
Subject: [X86] Adjust cost of FP_TO_UINT v8f32->v8i32

There is no direct AVX instruction to convert to unsigned.  I have some ideas
how we may be able to do this with three vector instructions but the current
backend just bails on this to get it scalarized.

See the comment why we need to adjust the cost returned by BasicTTI.

The test is a bit roundabout (and checks assembly rather than bit code) because
I'd like it to work even if at some point we could vectorize this conversion.

Fixes <rdar://problem/16371920>

llvm-svn: 205159
---
 llvm/lib/Target/X86/X86TargetTransformInfo.cpp | 6 ++++++
 1 file changed, 6 insertions(+)

(limited to 'llvm/lib/Target/X86/X86TargetTransformInfo.cpp')

diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index f58c94683fd..e1e151328fa 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -522,6 +522,12 @@ unsigned X86TTI::getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src) const {
 
     { ISD::FP_TO_SINT,  MVT::v8i8,  MVT::v8f32, 7 },
     { ISD::FP_TO_SINT,  MVT::v4i8,  MVT::v4f32, 1 },
+    // This node is expanded into scalarized operations but BasicTTI is overly
+    // optimistic estimating its cost.  It computes 3 per element (one
+    // vector-extract, one scalar conversion and one vector-insert).  The
+    // problem is that the inserts form a read-modify-write chain so latency
+    // should be factored in too.  Inflating the cost per element by 1.
+    { ISD::FP_TO_UINT,  MVT::v8i32, MVT::v8f32, 8*4 },
   };
 
   if (ST->hasAVX2()) {
-- 
cgit v1.2.3