bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	[X86][SSE] Transform truncations between vectors of integers into ↵	Cong Hou	2015-12-21	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	X86ISD::PACKUS/PACKSS operations during DAG combine. This patch transforms truncation between vectors of integers into X86ISD::PACKUS/PACKSS operations during DAG combine. We don't do it in lowering phase because after type legalization, the original truncation will be turned into a BUILD_VECTOR with each element that is extracted from a vector and then truncated, and from them it is difficult to do this optimization. This greatly improves the performance of truncations on some specific types. Cost table is updated accordingly. Differential revision: http://reviews.llvm.org/D14588 llvm-svn: 256194
*	[X86][SSE] Update the cost table for integer-integer conversions on SSE2/SSE4.1.	Cong Hou	2015-12-11	2	-3/+356
\| \| \| \| \| \| \| \| \| \| \| \|	Previously in the conversion cost table there are no entries for integer-integer conversions on SSE2. This will result in imprecise costs for certain vectorized operations. This patch adds those entries for SSE2 and SSE4.1. The cost numbers are counted from the result of running llc on the new test case in this patch. Differential revision: http://reviews.llvm.org/D15132 llvm-svn: 255315
*	Don't punish vectorized arithmetic instruction whose type will be split to ↵	Cong Hou	2015-12-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	multiple registers Currently in LLVM's cost model, a vectorized arithmetic instruction will have high cost if its type is split into multiple registers. However, this punishment is too heavy and unnecessary. The overhead of the split should not be on arithmetic instructions but instructions that implement the split. Note that during vectorization we have calculated the register pressure, and we only choose proper interleaving factor (and also vectorization factor) so that we don't use more registers than the maximum number. Here is a very simple example: if a vadd has the cost 1, and if we double VF so that we need two registers to perform it, then its cost will become 4 with the current implementation, which will prevent us to use larger VF. Differential revision: http://reviews.llvm.org/D15159 llvm-svn: 254671
*	AVX-512: Updated cost of FP/SINT/UINT conversion operations	Elena Demikhovsky	2015-12-02	3	-364/+432
\| \| \| \| \| \| \| \| \|	I checked and updated the cost of AVX-512 conversion operations. Added cost of conversion operations in DQ mode. Conversion of illegal types that requires vector split is not calculated right now (like for other X86 targets). Differential Revision: http://reviews.llvm.org/D15074 llvm-svn: 254494
*	Fixed a failure in cost calculation for vector GEP	Elena Demikhovsky	2015-12-01	1	-0/+17
\| \| \| \| \| \| \| \| \|	Cost calculation for vector GEP failed with due to invalid cast to GEP index operand. The bug is fixed, added a test. http://reviews.llvm.org/D14976 llvm-svn: 254408
*	[CostModel] Fixed AVX integer shift costs	Simon Pilgrim	2015-10-17	3	-34/+34
\| \| \| \| \| \|	Targets with AVX but without AVX2 were incorrectly reporting costs of 256-bit integer shifts. llvm-svn: 250611
*	[X86] Completed SHL cost model tests	Simon Pilgrim	2015-10-11	1	-1/+399
\| \| \| \| \| \|	As discussed in D8690. llvm-svn: 249990
*	[X86] Renamed SHL cost model tests	Simon Pilgrim	2015-10-11	1	-0/+0
\| \| \| \| \| \| \| \|	Matches naming conventions for ASHR/LSHR cost tests As discussed in D8690. llvm-svn: 249984
*	[X86] Added LSHR cost model tests	Simon Pilgrim	2015-10-11	1	-0/+400
\| \| \| \| \| \| \| \|	There are several dodgy costings due to AVX1 legalizing 256-bit integer vectors that need fixing. As discussed in D8690. llvm-svn: 249983
*	[X86] Added ASHR cost model tests	Simon Pilgrim	2015-10-11	1	-0/+392
\| \| \| \| \| \| \| \|	There are several dodgy costings due to AVX1 legalizing 256-bit integer vectors that need fixing. As discussed in D8690. llvm-svn: 249981
*	[X86][XOP] Added support for the lowering of 128-bit vector shifts to XOP ↵	Simon Pilgrim	2015-09-30	1	-2/+17
\| \| \| \| \| \| \| \| \| \| \| \|	shift instructions The XOP shifts just have logical/arithmetic versions and the left/right shifts are controlled by whether the value is positive/negative. Because of this I've added new X86ISD nodes instead of trying to force them to use the existing shift nodes. Additionally Excavator cores (bdver4) support XOP and AVX2 - meaning that it should use the AVX2 shifts when it can and fall back to XOP in other cases. Differential Revision: http://reviews.llvm.org/D8690 llvm-svn: 248878
*	[X86][SSE] Vectorize i64 ASHR operations	Simon Pilgrim	2015-07-29	2	-18/+18
\| \| \| \| \| \| \| \|	This patch vectorizes the v2i64/v4i64 ASHR shift operations - the last remaining integer vector shifts that are still being transferred to/from the scalar unit to be completed. Differential Revision: http://reviews.llvm.org/D11439 llvm-svn: 243569
*	[X86][SSE] Updated SHL/LSHR i64 vectorization costs.	Simon Pilgrim	2015-07-18	3	-25/+25
\| \| \| \| \| \|	This was missed in D8416. llvm-svn: 242621
*	[X86][SSE] Vectorized v4i32 non-uniform shifts.	Simon Pilgrim	2015-07-12	2	-24/+24
\| \| \| \| \| \| \| \| \| \|	While the v4i32 shl operation is already vectorized using a cvttps2dq/pmulld pattern, the lshr/ashr opeations are still scalarized. This patch adds vectorization support for non-uniform v4i32 shift operations - it splats constant shift amounts to allow them to use the immediate sse shift instructions, or extracts/zero-extends non-constant shift amounts. The individual results are then blended together. Differential Revision: http://reviews.llvm.org/D11063 llvm-svn: 241989
*	[X86][SSE] Vectorized i64 uniform constant SRA shifts	Simon Pilgrim	2015-07-06	1	-16/+16
\| \| \| \| \| \| \| \|	This patch adds vectorization support for uniform constant i64 arithmetic shift right operators. Differential Revision: http://reviews.llvm.org/D9645 llvm-svn: 241514
*	[X86][SSE][CostModel] Added full set of sitofp/uitofp costings for ↵	Simon Pilgrim	2015-06-20	2	-126/+755
\| \| \| \| \| \| \| \| \| \| \| \|	SSE2/AVX/AVX2/AVX512F. Merged separate (but equivalent) SSE2/AVX512F tests. Removed codegen tests since these are already done better in test/CodeGen/X86. The actual cost values still need to be updated to match recent codegen improvements. llvm-svn: 240219
*	[X86][SSE][CostModel] Fixed uitofp/sitofp cost target tests to specify ↵	Simon Pilgrim	2015-06-18	2	-5/+5
\| \| \| \| \| \|	sse2/avx2/avx512f directly instead of via a cpu model. llvm-svn: 240062
*	[X86][SSE] Vectorized i8 and i16 shift operators	Simon Pilgrim	2015-06-11	3	-36/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch ensures that SHL/SRL/SRA shifts for i8 and i16 vectors avoid scalarization. It builds on the existing i8 SHL vectorized implementation of moving the shift bits up to the sign bit position and separating the 4, 2 & 1 bit shifts with several improvements: 1 - SSE41 targets can use (v)pblendvb directly with the sign bit instead of performing a comparison to feed into a VSELECT node. 2 - pre-SSE41 targets were masking + comparing with an 0x80 constant - we avoid this by using the fact that a set sign bit means a negative integer which can be compared against zero to then feed into VSELECT, avoiding the need for a constant mask (zero generation is much cheaper). 3 - SRA i8 needs to be unpacked to the upper byte of a i16 so that the i16 psraw instruction can be correctly used for sign extension - we have to do more work than for SHL/SRL but perf tests indicate that this is still beneficial. The i16 implementation is similar but simpler than for i8 - we have to do 8, 4, 2 & 1 bit shifts but less shift masking is involved. SSE41 use of (v)pblendvb requires that the i16 shift amount is splatted to both bytes however. Tested on SSE2, SSE41 and AVX machines. Differential Revision: http://reviews.llvm.org/D9474 llvm-svn: 239509
*	[X86][SSE] Avoid scalarization of v2i64 vector shifts (REAPPLIED)	Simon Pilgrim	2015-03-18	2	-16/+16
\| \| \| \| \| \| \| \|	Fixed broken tests. Differential Revision: http://reviews.llvm.org/D8416 llvm-svn: 232682
*	TTI: Honour cost model for estimating cost of vector-intrinsic and calls.	Michael Zolotukhin	2015-03-17	1	-2/+2
\| \| \| \| \|	Review: http://reviews.llvm.org/D8096 llvm-svn: 232528
*	[opaque pointer type] Add textual IR support for explicit type parameter to ↵	David Blaikie	2015-02-27	4	-27/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	load instruction Essentially the same as the GEP change in r230786. A similar migration script can be used to update test cases, though a few more test case improvements/changes were required this time around: (r229269-r229278) import fileinput import sys import re pat = re.compile(r"((?:=\|:\|^)\sload (?:atomic )?(?:volatile )?(.?))(\| addrspace$\d+$ )\($\| (?:%\|@\|null\|undef\|blockaddress\|getelementptr\|addrspacecast\|bitcast\|inttoptr\|\[\[[a-zA-Z]\|\{\{).$)") for line in sys.stdin: sys.stdout.write(re.sub(pat, r"\1, \2\3*\4", line)) Reviewers: rafael, dexonsmith, grosser Differential Revision: http://reviews.llvm.org/D7649 llvm-svn: 230794
*	[opaque pointer type] Add textual IR support for explicit type parameter to ↵	David Blaikie	2015-02-27	4	-34/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	getelementptr instruction One of several parallel first steps to remove the target type of pointers, replacing them with a single opaque pointer type. This adds an explicit type parameter to the gep instruction so that when the first parameter becomes an opaque pointer type, the type to gep through is still available to the instructions. * This doesn't modify gep operators, only instructions (operators will be handled separately) * Textual IR changes only. Bitcode (including upgrade) and changing the in-memory representation will be in separate changes. * geps of vectors are transformed as: getelementptr <4 x float> %x, ... ->getelementptr float, <4 x float> %x, ... Then, once the opaque pointer type is introduced, this will ultimately look like: getelementptr float, <4 x ptr> %x with the unambiguous interpretation that it is a vector of pointers to float. * address spaces remain on the pointer, not the type: getelementptr float addrspace(1)* %x ->getelementptr float, float addrspace(1)* %x Then, eventually: getelementptr float, ptr addrspace(1) %x Importantly, the massive amount of test case churn has been automated by same crappy python code. I had to manually update a few test cases that wouldn't fit the script's model (r228970,r229196,r229197,r229198). The python script just massages stdin and writes the result to stdout, I then wrapped that in a shell script to handle replacing files, then using the usual find+xargs to migrate all the files. update.py: import fileinput import sys import re ibrep = re.compile(r"(^.?[^%\w]getelementptr inbounds )(((?:<\d x )?)(.?)(\| addrspace$\d$) \(\|>)(?:$\| (?:%\|@\|null\|undef\|blockaddress\|getelementptr\|addrspacecast\|bitcast\|inttoptr\|\[\[[a-zA-Z]\|\{\{).$))") normrep = re.compile( r"(^.?[^%\w]getelementptr )(((?:<\d* x )?)(.?)(\| addrspace$\d$) \(\|>)(?:$\| (?:%\|@\|null\|undef\|blockaddress\|getelementptr\|addrspacecast\|bitcast\|inttoptr\|\[\[[a-zA-Z]\|\{\{).$))") def conv(match, line): if not match: return line line = match.groups()[0] if len(match.groups()[5]) == 0: line += match.groups()[2] line += match.groups()[3] line += ", " line += match.groups()[1] line += "\n" return line for line in sys.stdin: if line.find("getelementptr ") == line.find("getelementptr inbounds"): if line.find("getelementptr inbounds") != line.find("getelementptr inbounds ("): line = conv(re.match(ibrep, line), line) elif line.find("getelementptr ") != line.find("getelementptr ("): line = conv(re.match(normrep, line), line) sys.stdout.write(line) apply.sh: for name in "$@" do python3 `dirname "$0"`/update.py < "$name" > "$name.tmp" && mv "$name.tmp" "$name" rm -f "$name.tmp" done The actual commands: From llvm/src: find test/ -name .ll \| xargs ./apply.sh From llvm/src/tools/clang: find test/ -name .mm -o -name .m -o -name .cpp -o -name .c \| xargs -I '{}' ../../apply.sh "{}" From llvm/src/tools/polly: find test/ -name *.ll \| xargs ./apply.sh After that, check-all (with llvm, clang, clang-tools-extra, lld, compiler-rt, and polly all checked out). The extra 'rm' in the apply.sh script is due to a few files in clang's test suite using interesting unicode stuff that my python script was throwing exceptions on. None of those files needed to be migrated, so it seemed sufficient to ignore those cases. Reviewers: rafael, dexonsmith, grosser Differential Revision: http://reviews.llvm.org/D7636 llvm-svn: 230786
*	[x86,sdag] Two interrelated changes to the x86 and sdag code.	Chandler Carruth	2015-02-19	1	-15/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	First, don't combine bit masking into vector shuffles (even ones the target can handle) once operation legalization has taken place. Custom legalization of vector shuffles may exist for these patterns (making the predicate return true) but that custom legalization may in some cases produce the exact bit math this matches. We only really want to handle this prior to operation legalization. However, the x86 backend, in a fit of awesome, relied on this. What it would do is mark VSELECTs as expand, which would turn them into arithmetic, which this would then match back into vector shuffles, which we would then lower properly. Amazing. Instead, the second change is to teach the x86 backend to directly form vector shuffles from VSELECT nodes with constant conditions, and to mark all of the vector types we support lowering blends as shuffles as custom VSELECT lowering. We still mark the forms which actually support variable blends as legal so that the custom lowering is bypassed, and the legal lowering can even be used by the vector shuffle legalization (yes, i know, this is confusing. but that's how the patterns are written). This makes the VSELECT lowering much more sensible, and in fact should fix a bunch of bugs with it. However, as you'll see in the test cases, right now what it does is point out the hilarious deficiency of the new vector shuffle lowering when it comes to blends. Fortunately, my very next patch fixes that. I can't submit it yet, because that patch, somewhat obviously, forms the exact and/or pattern that the DAG combine is matching here! Without this patch, teaching the vector shuffle lowering to produce the right code infloops in the DAG combiner. With this patch alone, we produce terrible code but at least lower through the right paths. With both patches, all the regressions here should be fixed, and a bunch of the improvements (like using 2 shufps with no memory loads instead of 2 andps with memory loads and an orps) will stay. Win! There is one other change worth noting here. We had hilariously wrong vectorization cost estimates for vselect because we fell through to the code path that assumed all "expand" vector operations are scalarized. However, the "expand" lowering of VSELECT is vector bit math, most definitely not scalarized. So now we go back to the correct if horribly naive cost of "1" for "not scalarized". If anyone wants to add actual modeling of shuffle costs, that would be cool, but this seems an improvement on its own. Note the removal of 16 and 32 "costs" for doing a blend. Even in SSE2 we can blend in fewer than 16 instructions. ;] Of course, we don't right now because of OMG bad code, but I'm going to fix that. Next patch. I promise. llvm-svn: 229835
*	Implemented cost model for masked load/store operations.	Elena Demikhovsky	2015-01-25	1	-0/+89
\| \| \| \|	llvm-svn: 227035
*	AVX-512: SINT_TO_FP cost model and some bugfixes	Elena Demikhovsky	2014-11-13	1	-0/+45
\| \| \| \| \| \| \|	Checked some corner cases, for example translation of <8 x i1> to <8 x double> llvm-svn: 221883
*	[X86] Custom lower UINT_TO_FP from v4f32 to v4i32, and for v8f32 to v8i32 if	Quentin Colombet	2014-11-11	2	-7/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	AVX2 is available. According to IACA, the new lowering has a throughput of 8 cycles instead of 13 with the previous one. Althought this lowering kicks in some SPECs benchmarks, the performance improvement was within the noise. Correctness testing has been done for the whole range of uint32_t with the following program: uint4 v = (uint4) {0,1,2,3}; uint32_t i; //Check correctness over entire range for uint4 -> float4 conversion for( i = 0; i < 1U << (32-2); i++ ) { float4 t = test(v); float4 c = correct(v); if( 0xf != _mm_movemask_ps( t == c )) { printf( "Error @ %vx: %vf vs. %vf\n", v, c, t); return -1; } v += 4; } Where "correct" is the old lowering and "test" the new one. The patch adds a test case for the two custom lowering instruction. It also modifies the vector cost model, which is why cast.ll and uitofp.ll are modified. 2009-02-26-MachineLICMBug.ll is also modified because we now hoist 7 instructions instead of 4 (3 more constant loads). rdar://problem/18153096> llvm-svn: 221657
*	AVX-512: added cost for some AVX-512 instructions	Elena Demikhovsky	2014-09-16	2	-0/+50
\| \| \| \|	llvm-svn: 217863
*	[CostModel][x86] Improved cost model for alternate shuffles.	Andrea Di Biagio	2014-07-03	1	-0/+347
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch: 1) Improves the cost model for x86 alternate shuffles (originally added at revision 211339); 2) Teaches the Cost Model Analysis pass how to analyze alternate shuffles. Alternate shuffles are a special kind of blend; on x86, we can often easily lowered alternate shuffled into single blend instruction (depending on the subtarget features). The existing cost model didn't take into account subtarget features. Also, it had a couple of "dead" entries for vector types that are never legal (example: on x86 types v2i32 and v2f32 are not legal; those are always either promoted or widened to 128-bit vector types). The new x86 cost model takes into account what target features we have before returning the shuffle cost (i.e. the number of instructions after the blend is lowered/expanded). This patch also teaches the Cost Model Analysis how to identify and analyze alternate shuffles (i.e. 'SK_Alternate' shufflevector instructions): - added function 'isAlternateVectorMask'; - added some logic to check if an instruction is a alternate shuffle and, in case, call the target specific TTI to get the corresponding shuffle cost; - added a test to verify the cost model analysis on alternate shuffles. llvm-svn: 212296
*	Reduce verbiage of lit.local.cfg files	Alp Toker	2014-06-09	1	-2/+1
\| \| \| \| \| \|	We can just split targets_to_build in one place and make it immutable. llvm-svn: 210496
*	Added tests for the cost of lowering VSELECT instructions.	Filipe Cabecinhas	2014-05-16	1	-0/+126
\| \| \| \|	llvm-svn: 209045
*	TTI: Estimate @llvm.fmuladd cost as fmul + fadd when FMA's aren't legal on ↵	Benjamin Kramer	2014-05-06	1	-0/+28
\| \| \| \| \| \|	the target. llvm-svn: 208115
*	X86TTI: Adjust sdiv cost now that we can lower it on plain SSE2.	Benjamin Kramer	2014-04-27	1	-2/+2
\| \| \| \| \| \| \|	Includes a fix for a horrible typo that caused all SDIV costs to be slightly off :) llvm-svn: 207371
*	X86TTI: i16/i32 vector div with a constant (splat) divisor are reasonably ↵	Benjamin Kramer	2014-04-26	1	-0/+92
\| \| \| \| \| \| \| \|	cheap now. Turn vectorization back on. llvm-svn: 207320
*	When analyzing vectors of element type that require legalization,	Raul E. Silvera	2014-03-10	1	-0/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	the legalization cost must be included to get an accurate estimation of the total cost of the scalarized vector. The inaccurate cost triggered unprofitable SLP vectorization on 32-bit X86. Summary: Include legalization overhead when computing scalarization cost Reviewers: hfinkel, nadav CC: chandlerc, rnk, llvm-commits Differential Revision: http://llvm-reviews.chandlerc.com/D2992 llvm-svn: 203509
*	Add extra CHECK prefix to tests with explicit prefix	Nico Rieck	2014-02-16	1	-2/+2
\| \| \| \| \| \| \|	These tests mistakenly assume that CHECK is still available even if an explicit prefix is specified. llvm-svn: 201492
*	[Vectorizer] Add a new 'OperandValueKind' in TargetTransformInfo called	Andrea Di Biagio	2014-02-12	1	-0/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	'OK_NonUniformConstValue' to identify operands which are constants but not constant splats. The cost model now allows returning 'OK_NonUniformConstValue' for non splat operands that are instances of ConstantVector or ConstantDataVector. With this change, targets are now able to compute different costs for instructions with non-uniform constant operands. For example, On X86 the cost of a vector shift may vary depending on whether the second operand is a uniform or non-uniform constant. This patch applies the following changes: - The cost model computation now takes into account non-uniform constants; - The cost of vector shift instructions has been improved in X86TargetTransformInfo analysis pass; - BBVectorize, SLPVectorizer and LoopVectorize now know how to distinguish between non-uniform and uniform constant operands. Added a new test to verify that the output of opt '-cost-model -analyze' is valid in the following configurations: SSE2, SSE4.1, AVX, AVX2. llvm-svn: 201272
*	X86: add costs for 64-bit vector ext/trunc & rebalance	Tim Northover	2014-02-06	1	-22/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The most important part of this is probably adding any cost at all for operations like zext <8 x i8> to <8 x i32>. Before they were being recorded as extremely costly (24, I believe) which made LLVM fall back on a 4-wide vectorisation of a loop. It also rebalances the values for sext, zext and trunc. Lacking any other sane metric that might work across CPU microarchitectures I went for instructions. This seems to be in reasonable accord with the rest of the table (sitofp, ...) though no doubt at least one value is sub-optimal for some bizarre reason. Finally, separate AVX and AVX2 values are provided where appropriate. The CodeGen is quite different in many cases. rdar://problem/15981990 llvm-svn: 200928
*	X86: Custom lower sext v16i8 to v16i16, and the corresponding truncate.	Benjamin Kramer	2013-10-23	1	-1/+7
\| \| \| \| \| \|	Also update the cost model. llvm-svn: 193270
*	X86 horizontal vector reduction cost model	Yi Jiang	2013-09-19	1	-0/+271
\| \| \| \|	llvm-svn: 191021
*	Costmodel: Add support for horizontal vector reductions	Arnold Schwaighofer	2013-09-17	1	-0/+94
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Upcoming SLP vectorization improvements will want to be able to estimate costs of horizontal reductions. Add infrastructure to support this. We model reductions as a series of (shufflevector,add) tuples ultimately followed by an extractelement. For example, for an add-reduction of <4 x float> we could generate the following sequence: (v0, v1, v2, v3) \ \ / / \ \ / + + (v0+v2, v1+v3, undef, undef) \ / ((v0+v2) + (v1+v3), undef, undef) %rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef> %bin.rdx = fadd <4 x float> %rdx, %rdx.shuf %rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef> %bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7 %r = extractelement <4 x float> %bin.rdx8, i32 0 This commit adds a cost model interface "getReductionCost(Opcode, Ty, Pairwise)" that will allow clients to ask for the cost of such a reduction (as backends might generate more efficient code than the cost of the individual instructions summed up). This interface is excercised by the CostModel analysis pass which looks for reduction patterns like the one above - starting at extractelements - and if it sees a matching sequence will call the cost model interface. We will also support a second form of pairwise reduction that is well supported on common architectures (haddps, vpadd, faddp). (v0, v1, v2, v3) \ / \ / (v0+v1, v2+v3, undef, undef) \ / ((v0+v1)+(v2+v3), undef, undef, undef) %rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 0, i32 2 , i32 undef, i32 undef> %rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef> %bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1 %rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef> %rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef> %bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1 %r = extractelement <4 x float> %bin.rdx.1, i32 0 llvm-svn: 190876
*	[tests] Cleanup initialization of test suffixes.	Daniel Dunbar	2013-08-16	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Instead of setting the suffixes in a bunch of places, just set one master list in the top-level config. We now only modify the suffix list in a few suites that have one particular unique suffix (.ml, .mc, .yaml, .td, .py). - Aside from removing the need for a bunch of lit.local.cfg files, this enables 4 tests that were inadvertently being skipped (one in Transforms/BranchFolding, a .s file each in DebugInfo/AArch64 and CodeGen/PowerPC, and one in CodeGen/SI which is now failing and has been XFAILED). - This commit also fixes a bunch of config files to use config.root instead of older copy-pasted code. llvm-svn: 188513
*	Add the nearbyint -> FNEARBYINT mapping to BasicTargetTransformInfo	Hal Finkel	2013-07-08	1	-0/+28
\| \| \| \| \| \| \| \|	This fixes an oversight that Intrinsic::nearbyint was not being mapped to ISD::FNEARBYINT (thus fixing the over-optimistic cost we were assigning to nearbyint calls for some targets). llvm-svn: 185783
*	CostModel: improve the cost model for load/store of non power-of-two types ↵	Nadav Rotem	2013-06-27	1	-0/+19
\| \| \| \| \| \|	such as <3 x float>, which are popular in graphics. llvm-svn: 185085
*	X86 cost model: Vectorizing integer division is a bad idea	Arnold Schwaighofer	2013-06-25	1	-0/+32
\| \| \| \| \| \|	radar://14057959 llvm-svn: 184872
*	TBAA: remove !tbaa from testing cases if not used.	Manman Ren	2013-04-29	2	-13/+5
\| \| \| \| \| \| \|	This will make it easier to turn on struct-path aware TBAA since the metadata format will change. llvm-svn: 180743
*	X86 cost model: Exit before calling getSimpleVT on non-simple VTs	Arnold Schwaighofer	2013-04-17	1	-0/+6
\| \| \| \| \| \| \| \|	getSimpleVT can only handle simple value types. radar://13676022 llvm-svn: 179714
*	CostModel: increase the default cost of supported floating point operations ↵	Nadav Rotem	2013-04-12	1	-2/+2
\| \| \| \| \| \|	from 1 to two. Fixed a few tests that changes because now the cost of one insert + a vector operation on two doubles is lower than two scalar operations on doubles. llvm-svn: 179413
*	X86 cost model: Model cost for uitofp and sitofp on SSE2	Arnold Schwaighofer	2013-04-08	2	-0/+643
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The costs are overfitted so that I can still use the legalization factor. For example the following kernel has about half the throughput vectorized than unvectorized when compiled with SSE2. Before this patch we would vectorize it. unsigned short A[1024]; double B[1024]; void f() { int i; for (i = 0; i < 1024; ++i) { B[i] = (double) A[i]; } } radar://13599001 llvm-svn: 179033
*	TargetLowering: Fix getTypeConversion handling of extended vector types	Arnold Schwaighofer	2013-04-07	3	-14/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The code in getTypeConversion attempts to promote the element vector type before it trys to split or widen the vector. After it failed finding a legal vector type by promoting it would continue using the promoted vector element type. Thereby missing legal splitted vector types. For example the type v32i32 that has a legal split of 4 x v3i32 on x86/sse2 would be transformed to: v32i256 and from there on successively split to: v16i256, v8i256, v1i256 and then finally ends up as an i64 type. By resetting the vector element type to the original vector element type that existed before the promotion the code will attempt to split the vector type to smaller vector widths of the same type. llvm-svn: 178999
*	X86 cost model: Differentiate cost for vector shifts of constants	Arnold Schwaighofer	2013-04-04	3	-0/+863
\| \| \| \| \| \| \| \| \| \| \| \|	SSE2 has efficient support for shifts by a scalar. My previous change of making shifts expensive did not take this into account marking all shifts as expensive. This would prevent vectorization from happening where it is actually beneficial. With this change we differentiate between shifts of constants and other shifts. radar://13576547 llvm-svn: 178808