bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	Revert "Add a pass to lower is.constant and objectsize intrinsics"	Dmitri Gribenko	2019-10-14	5	-122/+154
\| \| \| \| \| \| \|	This reverts commit r374743. It broke the build with Ocaml enabled: http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/19218 llvm-svn: 374768
*	Add a pass to lower is.constant and objectsize intrinsics	Joerg Sonnenberger	2019-10-13	5	-154/+122
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This pass lowers is.constant and objectsize intrinsics not simplified by earlier constant folding, i.e. if the object given is not constant or if not using the optimized pass chain. The result is recursively simplified and constant conditionals are pruned, so that dead blocks are removed even for -O0. This allows inline asm blocks with operand constraints to work all the time. The new pass replaces the existing lowering in the codegen-prepare pass and fallbacks in SDAG/GlobalISEL and FastISel. The latter now assert on the intrinsics. Differential Revision: https://reviews.llvm.org/D65280 llvm-svn: 374743
*	[Attributor] Shortcut no-return through will-return	Johannes Doerfert	2019-10-13	6	-6/+18
\| \| \| \| \| \| \| \|	No-return and will-return are exclusive, assuming the latter is more prominent we can avoid updates of the former unless will-return is not known for sure. llvm-svn: 374739
*	[Attributor][FIX] NullPointerIsDefined needs the pointer AS (AANonNull)	Johannes Doerfert	2019-10-13	4	-3/+10
\| \| \| \| \| \|	Also includes a shortcut via AADereferenceable if possible. llvm-svn: 374737
*	[Attributor][MemBehavior] Fallback to the function state for arguments	Johannes Doerfert	2019-10-13	3	-11/+13
\| \| \| \| \| \| \| \| \|	Even if an argument is captured, we cannot have an effect the function does not have. This is fine except for the special case of `inalloca` as it does not behave by the rules. TODO: Maybe the special rule for `inalloca` is wrong after all. llvm-svn: 374736
*	[Attributor][FIX] Use check prefix that is actually tested	Johannes Doerfert	2019-10-13	3	-15/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: This changes "CHECK" check lines to "ATTRIBUTOR" check lines where necessary and also fixes the now exposed, mostly minor, problems. Reviewers: sstefan1, uenoku Subscribers: hiraditya, bollu, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68929 llvm-svn: 374735
*	[ConstantFold] fix inconsistent handling of extractelement with undef index ↵	Sanjay Patel	2019-10-13	1	-1/+1
\| \| \| \| \| \| \| \| \|	(PR42689) Any constant other than zero was already folded to undef if the index is undef. https://bugs.llvm.org/show_bug.cgi?id=42689 llvm-svn: 374729
*	[InstCombine] don't assume 'inbounds' for bitcast deref or null pointer in ↵	Sanjay Patel	2019-10-13	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	non-default address space Follow-up to D68244 to account for a corner case discussed in: https://bugs.llvm.org/show_bug.cgi?id=43501 Add one more restriction: if the pointer is deref-or-null and in a non-default (non-zero) address space, we can't assume inbounds. Differential Revision: https://reviews.llvm.org/D68706 llvm-svn: 374728
*	[NFC][InstCombine] More test for "sign bit test via shifts" pattern (PR43595)	Roman Lebedev	2019-10-13	6	-13/+377
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While that pattern is indirectly handled via reassociateShiftAmtsOfTwoSameDirectionShifts(), that incursme one-use restriction on truncation, which is pointless since we know that we'll produce a single instruction. Additionally, if we are only looking for sign bit, we don't need shifts to be identical, which isn't the case in general, and is the blocker for me in bug in question: https://bugs.llvm.org/show_bug.cgi?id=43595 llvm-svn: 374726
*	[Attributor][FIX] Avoid splitting blocks if possible	Johannes Doerfert	2019-10-13	3	-8/+8
\| \| \| \| \| \| \|	Before, we eagerly split blocks even if it was not necessary, e.g., they had a single unreachable instruction and only a single predecessor. llvm-svn: 374703
*	[Attributor][FIX] Ensure h2s doesn't trigger on escaped pointers	Johannes Doerfert	2019-10-13	1	-52/+67
\| \| \| \| \| \| \| \|	We do not yet perform h2s because we know something is free'ed but we do it because we know the pointer does not escape. Storing the pointer allows it to escape so we have to prevent that. llvm-svn: 374699
*	[Attributor][FIX] Do not apply h2s for arbitrary mallocs	Johannes Doerfert	2019-10-13	1	-0/+5
\| \| \| \| \| \| \| \|	H2S did apply to mallocs of non-constant sizes if the uses were OK. This is now forbidden through reording of the "good" and "bad" cases in the conditional. llvm-svn: 374698
*	[Attributor][FIX] Add missing function declaration in test case	Johannes Doerfert	2019-10-13	1	-0/+2
\| \| \| \|	llvm-svn: 374696
*	[Attributor][FIX] Avoid modifying naked/optnone functions	Johannes Doerfert	2019-10-13	1	-0/+24
\| \| \| \| \| \| \| \|	The check for naked/optnone was insufficient for different reasons. We now check before we initialize an abstract attribute and we do it for all abstract attributes. llvm-svn: 374694
*	[SROA] Reuse existing lifetime markers if possible	Johannes Doerfert	2019-10-13	1	-0/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: If the underlying alloca did not change, we do not necessarily need new lifetime markers. This patch adds a check and reuses the old ones if possible. Reviewers: reames, ssarda, t.p.northover, hfinkel Subscribers: hiraditya, bollu, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68900 llvm-svn: 374692
*	[LoopIdiomRecognize] Recommit: BCmp loop idiom recognition	Roman Lebedev	2019-10-12	3	-595/+407
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: This is a recommit, this originally landed in rL370454 but was subsequently reverted in rL370788 due to https://bugs.llvm.org/show_bug.cgi?id=43206 The reduced testcase was added to bcmp-negative-tests.ll as @pr43206_different_loops - we must ensure that the SCEV's we got are both for the same loop we are currently investigating. Original commit message: @mclow.lists brought up this issue up in IRC. It is a reasonably common problem to compare some two values for equality. Those may be just some integers, strings or arrays of integers. In C, there is `memcmp()`, `bcmp()` functions. In C++, there exists `std::equal()` algorithm. One can also write that function manually. libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ libc++ does not do anything like that, it simply relies on simple C++'s `operator==()`. https://godbolt.org/z/er0Zwf (GOOD!) So likely, there exists a certain performance opportunities. Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that is using `memcmp()` (in this case, compiled with modified compiler). {F8768213} ``` #include <algorithm> #include <cmath> #include <cstdint> #include <iterator> #include <limits> #include <random> #include <type_traits> #include <utility> #include <vector> #include "benchmark/benchmark.h" template <class T> bool equal(T* a, T* a_end, T* b) noexcept { for (; a != a_end; ++a, ++b) { if (a != b) return false; } return true; } template <typename T> std::vector<T> getVectorOfRandomNumbers(size_t count) { std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(), std::numeric_limits<T>::max()); std::vector<T> v; v.reserve(count); std::generate_n(std::back_inserter(v), count, [&dis, &gen]() { return dis(gen); }); assert(v.size() == count); return v; } struct Identical { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { auto Tmp = getVectorOfRandomNumbers<T>(count); return std::make_pair(Tmp, std::move(Tmp)); } }; struct InequalHalfway { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { auto V0 = getVectorOfRandomNumbers<T>(count); auto V1 = V0; V1[V1.size() / size_t(2)]++; // just change the value. return std::make_pair(std::move(V0), std::move(V1)); } }; template <class T, class Gen> void BM_bcmp(benchmark::State& state) { const size_t Length = state.range(0); const std::pair<std::vector<T>, std::vector<T>> Data = Gen::template Gen<T>(Length); const std::vector<T>& a = Data.first; const std::vector<T>& b = Data.second; assert(a.size() == Length && b.size() == a.size()); benchmark::ClobberMemory(); benchmark::DoNotOptimize(a); benchmark::DoNotOptimize(a.data()); benchmark::DoNotOptimize(b); benchmark::DoNotOptimize(b.data()); for (auto _ : state) { const bool is_equal = equal(a.data(), a.data() + a.size(), b.data()); benchmark::DoNotOptimize(is_equal); } state.SetComplexityN(Length); state.counters["eltcnt"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant); state.counters["eltcnt/sec"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate); const size_t BytesRead = 2 * sizeof(T) * Length; state.counters["bytes_read/iteration"] = benchmark::Counter(BytesRead, benchmark::Counter::kDefaults, benchmark::Counter::OneK::kIs1024); state.counters["bytes_read/sec"] = benchmark::Counter( BytesRead, benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024); } template <typename T> static void CustomArguments(benchmark::internal::Benchmark* b) { const size_t L2SizeBytes = []() { for (const benchmark::CPUInfo::CacheInfo& I : benchmark::CPUInfo::Get().caches) { if (I.level == 2) return I.size; } return 0; }(); // What is the largest range we can check to always fit within given L2 cache? const size_t MaxLen = L2SizeBytes / /total bufs/ 2 / /maximal elt size/ sizeof(T) / /safety margin/ 2; b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN); } BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical) ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical) ->Apply(CustomArguments<uint64_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway) ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway) ->Apply(CustomArguments<uint64_t>); ``` {F8768210} ``` $ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx 2019-04-25 21:17:11 Running build-old/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 0.65, 3.90, 4.14 --------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... --------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 432131 ns 432101 ns 1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s BM_bcmp<uint8_t, Identical>_BigO 0.86 N 0.86 N BM_bcmp<uint8_t, Identical>_RMS 8 % 8 % <...> BM_bcmp<uint16_t, Identical>/256000 161408 ns 161409 ns 4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s BM_bcmp<uint16_t, Identical>_BigO 0.67 N 0.67 N BM_bcmp<uint16_t, Identical>_RMS 25 % 25 % <...> BM_bcmp<uint32_t, Identical>/128000 81497 ns 81488 ns 8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s BM_bcmp<uint32_t, Identical>_BigO 0.71 N 0.71 N BM_bcmp<uint32_t, Identical>_RMS 42 % 42 % <...> BM_bcmp<uint64_t, Identical>/64000 50138 ns 50138 ns 10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s BM_bcmp<uint64_t, Identical>_BigO 0.84 N 0.84 N BM_bcmp<uint64_t, Identical>_RMS 27 % 27 % <...> BM_bcmp<uint8_t, InequalHalfway>/512000 192405 ns 192392 ns 3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s BM_bcmp<uint8_t, InequalHalfway>_BigO 0.38 N 0.38 N BM_bcmp<uint8_t, InequalHalfway>_RMS 3 % 3 % <...> BM_bcmp<uint16_t, InequalHalfway>/256000 127858 ns 127860 ns 5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s BM_bcmp<uint16_t, InequalHalfway>_BigO 0.50 N 0.50 N BM_bcmp<uint16_t, InequalHalfway>_RMS 0 % 0 % <...> BM_bcmp<uint32_t, InequalHalfway>/128000 49140 ns 49140 ns 14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s BM_bcmp<uint32_t, InequalHalfway>_BigO 0.40 N 0.40 N BM_bcmp<uint32_t, InequalHalfway>_RMS 18 % 18 % <...> BM_bcmp<uint64_t, InequalHalfway>/64000 32101 ns 32099 ns 21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s BM_bcmp<uint64_t, InequalHalfway>_BigO 0.50 N 0.50 N BM_bcmp<uint64_t, InequalHalfway>_RMS 1 % 1 % RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0 2019-04-25 21:19:29 Running build-new/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 1.01, 2.85, 3.71 --------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... --------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 18593 ns 18590 ns 37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s BM_bcmp<uint8_t, Identical>_BigO 0.04 N 0.04 N BM_bcmp<uint8_t, Identical>_RMS 37 % 37 % <...> BM_bcmp<uint16_t, Identical>/256000 18950 ns 18948 ns 37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s BM_bcmp<uint16_t, Identical>_BigO 0.08 N 0.08 N BM_bcmp<uint16_t, Identical>_RMS 34 % 34 % <...> BM_bcmp<uint32_t, Identical>/128000 18627 ns 18627 ns 37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s BM_bcmp<uint32_t, Identical>_BigO 0.16 N 0.16 N BM_bcmp<uint32_t, Identical>_RMS 35 % 35 % <...> BM_bcmp<uint64_t, Identical>/64000 18855 ns 18855 ns 37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s BM_bcmp<uint64_t, Identical>_BigO 0.32 N 0.32 N BM_bcmp<uint64_t, Identical>_RMS 33 % 33 % <...> BM_bcmp<uint8_t, InequalHalfway>/512000 9570 ns 9569 ns 73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s BM_bcmp<uint8_t, InequalHalfway>_BigO 0.02 N 0.02 N BM_bcmp<uint8_t, InequalHalfway>_RMS 29 % 29 % <...> BM_bcmp<uint16_t, InequalHalfway>/256000 9547 ns 9547 ns 74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s BM_bcmp<uint16_t, InequalHalfway>_BigO 0.04 N 0.04 N BM_bcmp<uint16_t, InequalHalfway>_RMS 29 % 29 % <...> BM_bcmp<uint32_t, InequalHalfway>/128000 9396 ns 9394 ns 73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s BM_bcmp<uint32_t, InequalHalfway>_BigO 0.08 N 0.08 N BM_bcmp<uint32_t, InequalHalfway>_RMS 30 % 30 % <...> BM_bcmp<uint64_t, InequalHalfway>/64000 9499 ns 9498 ns 73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s BM_bcmp<uint64_t, InequalHalfway>_BigO 0.16 N 0.16 N BM_bcmp<uint64_t, InequalHalfway>_RMS 28 % 28 % Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench Benchmark Time CPU Time Old Time New CPU Old CPU New --------------------------------------------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 -0.9570 -0.9570 432131 18593 432101 18590 <...> BM_bcmp<uint16_t, Identical>/256000 -0.8826 -0.8826 161408 18950 161409 18948 <...> BM_bcmp<uint32_t, Identical>/128000 -0.7714 -0.7714 81497 18627 81488 18627 <...> BM_bcmp<uint64_t, Identical>/64000 -0.6239 -0.6239 50138 18855 50138 18855 <...> BM_bcmp<uint8_t, InequalHalfway>/512000 -0.9503 -0.9503 192405 9570 192392 9569 <...> BM_bcmp<uint16_t, InequalHalfway>/256000 -0.9253 -0.9253 127858 9547 127860 9547 <...> BM_bcmp<uint32_t, InequalHalfway>/128000 -0.8088 -0.8088 49140 9396 49140 9394 <...> BM_bcmp<uint64_t, InequalHalfway>/64000 -0.7041 -0.7041 32101 9499 32099 9498 ``` What can we tell from the benchmark? * Performance of naive equality check somewhat improves with element size, maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s for uint64_t. I think, that instability implies performance problems. * Performance of `memcmp()`-aware benchmark always maxes out at around bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the naive variant! * eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and linearly decreases with element size. For uint64_t, it's ~4x+ the elements/second. * The call obvious is more pricey than the loop, with small element count. As it can be seen from the full output {F8768210}, the `memcmp()` is almost universally worse, independent of the element size (and thus buffer size) when element count is less than 8. So all in all, bcmp idiom does indeed pose untapped performance headroom. This diff does implement said idiom recognition. I think a reasonable test coverage is present, but do tell if there is anything obvious missing. Now, quality. This does succeed to build and pass the test-suite, at least without any non-bundled elements. {F8768216} {F8768217} This transform fires 91 times: ``` $ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json Tests: 1149 Metric: loop-idiom.NumBCmp Program result-new MultiSourc...Benchmarks/7zip/7zip-benchmark 79.00 MultiSource/Applications/d/make_dparser 3.00 SingleSource/UnitTests/vla 2.00 MultiSource/Applications/Burg/burg 1.00 MultiSourc.../Applications/JM/lencod/lencod 1.00 MultiSource/Applications/lemon/lemon 1.00 MultiSource/Benchmarks/Bullet/bullet 1.00 MultiSourc...e/Benchmarks/MallocBench/gs/gs 1.00 MultiSourc...gs-C/TimberWolfMC/timberwolfmc 1.00 MultiSourc...Prolangs-C/simulator/simulator 1.00 ``` The size changes are: I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look. ``` $ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash Tests: 1149 Same hash: 907 (filtered out) Remaining: 242 Metric: size..text Program result-old result-new diff test-suite...ingleSource/UnitTests/vla.test 753.00 833.00 10.6% test-suite...marks/7zip/7zip-benchmark.test 1001697.00 966657.00 -3.5% test-suite...ngs-C/simulator/simulator.test 32369.00 32321.00 -0.1% test-suite...plications/d/make_dparser.test 89585.00 89505.00 -0.1% test-suite...ce/Applications/Burg/burg.test 40817.00 40785.00 -0.1% test-suite.../Applications/lemon/lemon.test 47281.00 47249.00 -0.1% test-suite...TimberWolfMC/timberwolfmc.test 250065.00 250113.00 0.0% test-suite...chmarks/MallocBench/gs/gs.test 149889.00 149873.00 -0.0% test-suite...ications/JM/lencod/lencod.test 769585.00 769569.00 -0.0% test-suite.../Benchmarks/Bullet/bullet.test 770049.00 770049.00 0.0% test-suite...HMARK_ANISTROPIC_DIFFUSION/128 NaN NaN nan% test-suite...HMARK_ANISTROPIC_DIFFUSION/256 NaN NaN nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/64 NaN NaN nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/32 NaN NaN nan% test-suite...ENCHMARK_BILATERAL_FILTER/64/4 NaN NaN nan% Geomean difference nan% result-old result-new diff count 1.000000e+01 10.00000 10.000000 mean 3.152090e+05 311695.40000 0.006749 std 3.790398e+05 372091.42232 0.036605 min 7.530000e+02 833.00000 -0.034981 25% 4.243300e+04 42401.00000 -0.000866 50% 1.197370e+05 119689.00000 -0.000392 75% 6.397050e+05 639705.00000 -0.000005 max 1.001697e+06 966657.00000 0.106242 ``` I don't have timings though. And now to the code. The basic idea is to completely replace the whole loop. If we can't fully kill it, don't transform. I have left one or two comments in the code, so hopefully it can be understood. Also, there is a few TODO's that i have left for follow-ups: * widening of `memcmp()`/`bcmp()` * step smaller than the comparison size * Metadata propagation * more than two blocks as long as there is still a single backedge? * ??? Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet Reviewed By: courbet Subscribers: miyuki, hiraditya, xbolva00, nikic, jfb, gchatelet, courbet, llvm-commits, mclow.lists Tags: #llvm Differential Revision: https://reviews.llvm.org/D61144 llvm-svn: 374662
*	[NFC][LoopIdiom] Add bcmp loop idiom miscompile test from PR43206.	Roman Lebedev	2019-10-12	1	-3/+54
\| \| \| \| \| \| \| \|	The transform forgot to check SCEV loop scopes. https://bugs.llvm.org/show_bug.cgi?id=43206 llvm-svn: 374661
*	[NFC][LoopIdiom] Move one bcmp test into the proper place	Roman Lebedev	2019-10-12	2	-23/+47
\| \| \| \|	llvm-svn: 374660
*	[CostModel][X86] Improve sum reduction costs.	Simon Pilgrim	2019-10-12	1	-1/+1
\| \| \| \| \| \| \| \|	I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2. I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674. llvm-svn: 374655
*	recommit: [LoopVectorize][PowerPC] Estimate int and float register pressure ↵	Zi Xuan Wu	2019-10-12	3	-10/+165
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	separately in loop-vectorize In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not estimate different register pressure for different register class separately(especially for scalar type, float type should not be on the same position with int type), so it's not accurate. Specifically, it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance. So we need classify the register classes in IR level, and importantly these are abstract register classes, and are not the target register class of backend provided in td file. It's used to establish the mapping between the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types. For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR), float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled, and 3 kinds of register class when VSX is NOT enabled. It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions. Differential revision: https://reviews.llvm.org/D67148 llvm-svn: 374634
*	Dead Virtual Function Elimination	Oliver Stannard	2019-10-11	9	-0/+749
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, it is hard for the compiler to remove unused C++ virtual functions, because they are all referenced from vtables, which are referenced by constructors. This means that if the constructor is called from any live code, then we keep every virtual function in the final link, even if there are no call sites which can use it. This patch allows unused virtual functions to be removed during LTO (and regular compilation in limited circumstances) by using type metadata to match virtual function call sites to the vtable slots they might load from. This information can then be used in the global dead code elimination pass instead of the references from vtables to virtual functions, to more accurately determine which functions are reachable. To make this transformation safe, I have changed clang's code-generation to always load virtual function pointers using the llvm.type.checked.load intrinsic, instead of regular load instructions. I originally tried writing this using clang's existing code-generation, which uses the llvm.type.test and llvm.assume intrinsics after doing a normal load. However, it is possible for optimisations to obscure the relationship between the GEP, load and llvm.type.test, causing GlobalDCE to fail to find virtual function call sites. The existing linkage and visibility types don't accurately describe the scope in which a virtual call could be made which uses a given vtable. This is wider than the visibility of the type itself, because a virtual function call could be made using a more-visible base class. I've added a new !vcall_visibility metadata type to represent this, described in TypeMetadata.rst. The internalization pass and libLTO have been updated to change this metadata when linking is performed. This doesn't currently work with ThinLTO, because it needs to see every call to llvm.type.checked.load in the linkage unit. It might be possible to extend this optimisation to be able to use the ThinLTO summary, as was done for devirtualization, but until then that combination is rejected in the clang driver. To test this, I've written a fuzzer which generates random C++ programs with complex class inheritance graphs, and virtual functions called through object and function pointers of different types. The programs are spread across multiple translation units and DSOs to test the different visibility restrictions. I've also tried doing bootstrap builds of LLVM to test this. This isn't ideal, because only classes in anonymous namespaces can be optimised with -fvisibility=default, and some parts of LLVM (plugins and bugpoint) do not work correctly with -fvisibility=hidden. However, there are only 12 test failures when building with -fvisibility=hidden (and an unmodified compiler), and this change does not cause any new failures for either value of -fvisibility. On the 7 C++ sub-benchmarks of SPEC2006, this gives a geomean code-size reduction of ~6%, over a baseline compiled with "-O2 -flto -fvisibility=hidden -fwhole-program-vtables". The best cases are reductions of ~14% in 450.soplex and 483.xalancbmk, and there are no code size increases. I've also run this on a set of 8 mbed-os examples compiled for Armv7M, which show a geomean size reduction of ~3%, again with no size increases. I had hoped that this would have no effect on performance, which would allow it to awlays be enabled (when using -fwhole-program-vtables). However, the changes in clang to use the llvm.type.checked.load intrinsic are causing ~1% performance regression in the C++ parts of SPEC2006. It should be possible to recover some of this perf loss by teaching optimisations about the llvm.type.checked.load intrinsic, which would make it worth turning this on by default (though it's still dependent on -fwhole-program-vtables). Differential revision: https://reviews.llvm.org/D63932 llvm-svn: 374539
*	[NFC] run specific pass instead of whole -O3 pipeline for popcount ↵	Chen Zheng	2019-10-11	1	-5/+5
\| \| \| \| \| \|	recoginzation testcase. llvm-svn: 374514
*	[InstCombine] recognize popcount.	Chen Zheng	2019-10-11	1	-0/+193
\| \| \| \| \| \| \| \| \|	This patch recognizes popcount intrinsic according to algorithm from website http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel Differential Revision: https://reviews.llvm.org/D68189 llvm-svn: 374512
*	[CVP] Remove a masking operation if range information implies it's a noop	Philip Reames	2019-10-11	3	-3/+128
\| \| \| \| \| \| \| \| \| \|	This is really a known bits style transformation, but known bits isn't context sensitive. The particular case which comes up happens to involve a range which allows range based reasoning to eliminate the mask pattern, so handle that case specifically in CVP. InstCombine likes to generate the mask-by-low-bits pattern when widening an arithmetic expression which includes a zext in the middle. Differential Revision: https://reviews.llvm.org/D68811 llvm-svn: 374506
*	[Attributor][FIX] Do not replace musstail calls with constant	Johannes Doerfert	2019-10-11	1	-0/+5
\| \| \| \|	llvm-svn: 374498
*	[ValueTracking] Improve pointer offset computation for cases of same base	Rong Xu	2019-10-10	1	-0/+77
\| \| \| \| \| \| \| \| \| \| \| \|	This patch improves the handling of pointer offset in GEP expressions where one argument is the base pointer. isPointerOffset() is being used by memcpyopt where current code synthesizes consecutive 32 bytes stores to one store and two memset intrinsic calls. With this patch, we convert the stores to one memset intrinsic. Differential Revision: https://reviews.llvm.org/D67989 llvm-svn: 374454
*	[InstCombine] Add test case for PR43617 (NFC)	Evandro Menezes	2019-10-10	1	-0/+10
\| \| \| \| \| \|	Also, refactor check in `LibCallSimplifier::optimizeLog()`. llvm-svn: 374453
*	Revert "[IRBuilder] Update IRBuilder::CreateFNeg(...) to return a UnaryOperator"	Dmitri Gribenko	2019-10-10	4	-17/+17
\| \| \| \| \| \| \|	This reverts commit r374240. It broke OCaml tests: http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/19014 llvm-svn: 374354
*	[Attributor] Handle `null` differently in capture and alias logic	Johannes Doerfert	2019-10-10	2	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: `null` in the default address space (=AS 0) cannot be captured nor can it alias anything. We make this clear now as it can be important for callbacks and other cases later on. In addition, this patch improves the debug output for noalias deduction. Reviewers: sstefan1, uenoku Subscribers: hiraditya, bollu, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68624 llvm-svn: 374280
*	[IRBuilder] Update IRBuilder::CreateFNeg(...) to return a UnaryOperator	Cameron McInally	2019-10-09	4	-17/+17
\| \| \| \| \| \| \| \|	Also update Clang to call Builder.CreateFNeg(...) for UnaryMinus. Differential Revision: https://reviews.llvm.org/D61675 llvm-svn: 374240
*	[SampleFDO] Add indexing for function profiles so they can be loaded on demand	Wei Mi	2019-10-09	2	-0/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in ExtBinary format Currently for Text, Binary and ExtBinary format profiles, when we compile a module with samplefdo, even if there is no function showing up in the profile, we have to load all the function profiles from the profile input. That is a waste of compile time. CompactBinary format profile has already had the support of loading function profiles on demand. In this patch, we add the support to load profile on demand for ExtBinary format. It will work no matter the sections in ExtBinary format profile are compressed or not. Experiment shows it reduces the time to compile a server benchmark by 30%. When profile remapping and loading function profiles on demand are both used, extra work needs to be done so that the loading on demand process will take the name remapping into consideration. It will be addressed in a follow-up patch. Differential Revision: https://reviews.llvm.org/D68601 llvm-svn: 374233
*	[ConstProp] add tests for extractelement with undef index; NFC	Sanjay Patel	2019-10-09	1	-6/+27
\| \| \| \|	llvm-svn: 374210
*	[InstCombine] add another test for gep inbounds; NFC	Sanjay Patel	2019-10-09	1	-0/+11
\| \| \| \|	llvm-svn: 374190
*	[SLP] respect target register width for GEP vectorization (PR43578)	Sanjay Patel	2019-10-09	3	-58/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We failed to account for the target register width (max vector factor) when vectorizing starting from GEPs. This causes vectorization to proceed to obviously illegal widths as in: https://bugs.llvm.org/show_bug.cgi?id=43578 For x86, this also means that SLP can produce rogue AVX or AVX512 code even when the user specifies a narrower vector width. The AArch64 test in ext-trunc.ll appears to be better using the narrower width. I'm not exactly sure what getelementptr.ll is trying to do, but it's testing with "-slp-threshold=-18", so I'm not worried about those diffs. The x86 test is an over-reduction from SPEC h264; this patch appears to restore the perf loss caused by SLP when using -march=haswell. Differential Revision: https://reviews.llvm.org/D68667 llvm-svn: 374183
*	[LV] Emitting SCEV checks with OptForSize	Sjoerd Meijer	2019-10-09	1	-0/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When optimising for size and SCEV runtime checks need to be emitted to check overflow behaviour, the loop vectorizer can run in this assert: LoopVectorize.cpp:2699: void llvm::InnerLoopVectorizer::emitSCEVChecks( llvm::Loop , llvm::BasicBlock ): Assertion `!BB->getParent()->hasOptSize() && "Cannot SCEV check stride or overflow when opt We should not generate predicates while optimising for size because code will be generated for predicates such as these SCEV overflow runtime checks. This should fix PR43371. Differential Revision: https://reviews.llvm.org/D68082 llvm-svn: 374166
*	[CVP} Replace SExt with ZExt if the input is known-non-negative	Roman Lebedev	2019-10-08	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: zero-extension is far more friendly for further analysis. While this doesn't directly help with the shift-by-signext problem, this is not unrelated. This has the following effect on test-suite (numbers collected after the finish of middle-end module pass manager): \| Statistic \| old \| new \| delta \| percent change \| \| correlated-value-propagation.NumSExt \| 0 \| 6026 \| 6026 \| +100.00% \| \| instcount.NumAddInst \| 272860 \| 271283 \| -1577 \| -0.58% \| \| instcount.NumAllocaInst \| 27227 \| 27226 \| -1 \| 0.00% \| \| instcount.NumAndInst \| 63502 \| 63320 \| -182 \| -0.29% \| \| instcount.NumAShrInst \| 13498 \| 13407 \| -91 \| -0.67% \| \| instcount.NumAtomicCmpXchgInst \| 1159 \| 1159 \| 0 \| 0.00% \| \| instcount.NumAtomicRMWInst \| 5036 \| 5036 \| 0 \| 0.00% \| \| instcount.NumBitCastInst \| 672482 \| 672353 \| -129 \| -0.02% \| \| instcount.NumBrInst \| 702768 \| 702195 \| -573 \| -0.08% \| \| instcount.NumCallInst \| 518285 \| 518205 \| -80 \| -0.02% \| \| instcount.NumExtractElementInst \| 18481 \| 18482 \| 1 \| 0.01% \| \| instcount.NumExtractValueInst \| 18290 \| 18288 \| -2 \| -0.01% \| \| instcount.NumFAddInst \| 139035 \| 138963 \| -72 \| -0.05% \| \| instcount.NumFCmpInst \| 10358 \| 10348 \| -10 \| -0.10% \| \| instcount.NumFDivInst \| 30310 \| 30302 \| -8 \| -0.03% \| \| instcount.NumFenceInst \| 387 \| 387 \| 0 \| 0.00% \| \| instcount.NumFMulInst \| 93873 \| 93806 \| -67 \| -0.07% \| \| instcount.NumFPExtInst \| 7148 \| 7144 \| -4 \| -0.06% \| \| instcount.NumFPToSIInst \| 2823 \| 2838 \| 15 \| 0.53% \| \| instcount.NumFPToUIInst \| 1251 \| 1251 \| 0 \| 0.00% \| \| instcount.NumFPTruncInst \| 2195 \| 2191 \| -4 \| -0.18% \| \| instcount.NumFSubInst \| 92109 \| 92103 \| -6 \| -0.01% \| \| instcount.NumGetElementPtrInst \| 1221423 \| 1219157 \| -2266 \| -0.19% \| \| instcount.NumICmpInst \| 479140 \| 478929 \| -211 \| -0.04% \| \| instcount.NumIndirectBrInst \| 2 \| 2 \| 0 \| 0.00% \| \| instcount.NumInsertElementInst \| 66089 \| 66094 \| 5 \| 0.01% \| \| instcount.NumInsertValueInst \| 2032 \| 2030 \| -2 \| -0.10% \| \| instcount.NumIntToPtrInst \| 19641 \| 19641 \| 0 \| 0.00% \| \| instcount.NumInvokeInst \| 21789 \| 21788 \| -1 \| 0.00% \| \| instcount.NumLandingPadInst \| 12051 \| 12051 \| 0 \| 0.00% \| \| instcount.NumLoadInst \| 880079 \| 878673 \| -1406 \| -0.16% \| \| instcount.NumLShrInst \| 25919 \| 25921 \| 2 \| 0.01% \| \| instcount.NumMulInst \| 42416 \| 42417 \| 1 \| 0.00% \| \| instcount.NumOrInst \| 100826 \| 100576 \| -250 \| -0.25% \| \| instcount.NumPHIInst \| 315118 \| 314092 \| -1026 \| -0.33% \| \| instcount.NumPtrToIntInst \| 15933 \| 15939 \| 6 \| 0.04% \| \| instcount.NumResumeInst \| 2156 \| 2156 \| 0 \| 0.00% \| \| instcount.NumRetInst \| 84485 \| 84484 \| -1 \| 0.00% \| \| instcount.NumSDivInst \| 8599 \| 8597 \| -2 \| -0.02% \| \| instcount.NumSelectInst \| 45577 \| 45913 \| 336 \| 0.74% \| \| instcount.NumSExtInst \| 84026 \| 78278 \| -5748 \| -6.84% \| \| instcount.NumShlInst \| 39796 \| 39726 \| -70 \| -0.18% \| \| instcount.NumShuffleVectorInst \| 100272 \| 100292 \| 20 \| 0.02% \| \| instcount.NumSIToFPInst \| 29131 \| 29113 \| -18 \| -0.06% \| \| instcount.NumSRemInst \| 1543 \| 1543 \| 0 \| 0.00% \| \| instcount.NumStoreInst \| 805394 \| 804351 \| -1043 \| -0.13% \| \| instcount.NumSubInst \| 61337 \| 61414 \| 77 \| 0.13% \| \| instcount.NumSwitchInst \| 8527 \| 8524 \| -3 \| -0.04% \| \| instcount.NumTruncInst \| 60523 \| 60484 \| -39 \| -0.06% \| \| instcount.NumUDivInst \| 2381 \| 2381 \| 0 \| 0.00% \| \| instcount.NumUIToFPInst \| 5549 \| 5549 \| 0 \| 0.00% \| \| instcount.NumUnreachableInst \| 9855 \| 9855 \| 0 \| 0.00% \| \| instcount.NumURemInst \| 1305 \| 1305 \| 0 \| 0.00% \| \| instcount.NumXorInst \| 10230 \| 10081 \| -149 \| -1.46% \| \| instcount.NumZExtInst \| 60353 \| 66840 \| 6487 \| 10.75% \| \| instcount.TotalBlocks \| 829582 \| 829004 \| -578 \| -0.07% \| \| instcount.TotalFuncs \| 83818 \| 83817 \| -1 \| 0.00% \| \| instcount.TotalInsts \| 7316574 \| 7308483 \| -8091 \| -0.11% \| TLDR: we produce -0.11% less instructions, -6.84% less `sext`, +10.75% more `zext`. To be noted, clearly, not all new `zext`'s are produced by this fold. (And now i guess it might have been interesting to measure this for D68103 :S) Reviewers: nikic, spatel, reames, dberlin Reviewed By: nikic Subscribers: hiraditya, jfb, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68654 llvm-svn: 374112
*	[CVP][NFC] Revisit sext vs. zext test	Roman Lebedev	2019-10-08	1	-5/+33
\| \| \| \|	llvm-svn: 374111
*	Revert "[LoopVectorize][PowerPC] Estimate int and float register pressure ↵	Jinsong Ji	2019-10-08	3	-215/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	separately in loop-vectorize" Also Revert "[LoopVectorize] Fix non-debug builds after rL374017" This reverts commit 9f41deccc0e648a006c9f38e11919f181b6c7e0a. This reverts commit 18b6fe07bcf44294f200bd2b526cb737ed275c04. The patch is breaking PowerPC internal build, checked with author, reverting on behalf of him for now due to timezone. llvm-svn: 374091
*	[SLP] add test with prefer-vector-width function attribute; NFC (PR43578)	Sanjay Patel	2019-10-08	1	-0/+59
\| \| \| \|	llvm-svn: 374090
*	[Attributor][Fix] Temporary fix for windows build bot failure	Hideto Ueno	2019-10-08	1	-1/+3
\| \| \| \| \| \| \| \|	D65402 causes test failure related to attributor-max-iterations. This commit removes attributor-max-iterations-verify for now. I'll examine the factor and the flag should be reverted. llvm-svn: 374086
*	[NFC][CVP] Add tests where we can replace sext with zext	Roman Lebedev	2019-10-08	1	-0/+107
\| \| \| \| \| \| \| \|	If the sign bit of the value that is being sign-extended is not set, i.e. the value is non-negative (s>= 0), then zero-extension will suffice, and is better for analysis: https://rise4fun.com/Alive/a8PD llvm-svn: 374075
*	[Attributor][MustExec] Deduce dereferenceable and nonnull attribute using ↵	Hideto Ueno	2019-10-08	15	-65/+295
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MustBeExecutedContextExplorer Summary: In D65186 and related patches, MustBeExecutedContextExplorer is introduced. This enables us to traverse instructions guaranteed to execute from function entry. If we can know the argument is used as `dereferenceable` or `nonnull` in these instructions, we can mark `dereferenceable` or `nonnull` in the argument definition: 1. Memory instruction (similar to D64258) Trace memory instruction pointer operand. Currently, only inbounds GEPs are traced. ``` define i64* @f(i64* %a) { entry: %add.ptr = getelementptr inbounds i64, i64* %a, i64 1 ; (because of inbounds GEP we can know that %a is at least dereferenceable(16)) store i64 1, i64* %add.ptr, align 8 ret i64* %add.ptr ; dereferenceable 8 (because above instruction stores into it) } ``` 2. Propagation from callsite (similar to D27855) If `deref` or `nonnull` are known in call site parameter attributes we can also say that argument also that attribute. ``` declare void @use3(i8* %x, i8* %y, i8* %z); declare void @use3nonnull(i8* nonnull %x, i8* nonnull %y, i8* nonnull %z); define void @parent1(i8* %a, i8* %b, i8* %c) { call void @use3nonnull(i8* %b, i8* %c, i8* %a) ; Above instruction is always executed so we can say that@parent1(i8* nonnnull %a, i8* nonnull %b, i8* nonnull %c) call void @use3(i8* %c, i8* %a, i8* %b) ret void } ``` Reviewers: jdoerfert, sstefan1, spatel, reames Reviewed By: jdoerfert Subscribers: xbolva00, hiraditya, jfb, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D65402 llvm-svn: 374063
*	[SLP] add test with prefer-vector-width function attribute; NFC	Sanjay Patel	2019-10-08	1	-31/+73
\| \| \| \|	llvm-svn: 374039
*	[NFC] Add REQUIRES for r374017 in testcase	Zi Xuan Wu	2019-10-08	1	-0/+1
\| \| \| \|	llvm-svn: 374027
*	[LoopVectorize][PowerPC] Estimate int and float register pressure separately ↵	Zi Xuan Wu	2019-10-08	3	-10/+214
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in loop-vectorize In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not estimate different register pressure for different register class separately(especially for scalar type, float type should not be on the same position with int type), so it's not accurate. Specifically, it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance. So we need classify the register classes in IR level, and importantly these are abstract register classes, and are not the target register class of backend provided in td file. It's used to establish the mapping between the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types. For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR), float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled, and 3 kinds of register class when VSX is NOT enabled. It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions. Differential revision: https://reviews.llvm.org/D67148 llvm-svn: 374017
*	[Attributor] Use local linkage instead of internal	Johannes Doerfert	2019-10-07	1	-2/+2
\| \| \| \| \| \| \|	Local linkage is internal or private, and private is a specialization of internal, so either is fine for all our "local linkage" queries. llvm-svn: 373986
*	[Attributor] Use abstract call sites for call site callback	Johannes Doerfert	2019-10-07	1	-0/+63
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: When we iterate over uses of functions and expect them to be call sites, we now use abstract call sites to allow callback calls. Reviewers: sstefan1, uenoku Subscribers: hiraditya, bollu, hfinkel, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D67871 llvm-svn: 373985
*	[Attributor] Deduce memory behavior of functions and arguments	Johannes Doerfert	2019-10-07	15	-159/+155
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Deduce the memory behavior, aka "read-none", "read-only", or "write-only", for functions and arguments. Reviewers: sstefan1, uenoku Subscribers: hiraditya, bollu, jfb, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D67384 llvm-svn: 373965
*	[InstCombine] Fold conditional sign-extend of high-bit-extract into ↵	Roman Lebedev	2019-10-07	1	-19/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	high-bit-extract-with-signext (PR42389) This can come up in Bit Stream abstractions. The pattern looks big/scary, but it can't be simplified any further. It only is so simple because a number of my preparatory folds had happened already (shift amount reassociation / shift amount reassociation in bit test, sign bit test detection). Highlights: * There are two main flavors: https://rise4fun.com/Alive/zWi The difference is add vs. sub, and left-shift of -1 vs. 1 * Since we only change the shift opcode, we can preserve the exact-ness: https://rise4fun.com/Alive/4u4 * There can be truncation after high-bit-extraction: https://rise4fun.com/Alive/slHc1 (the main pattern i'm after!) Which means that we need to ignore zext of shift amounts and of NBits. * The sign-extending magic can be extended itself (in add pattern via sext, in sub pattern via zext. not the other way around!) https://rise4fun.com/Alive/NhG (or those sext/zext can be sinked into `select`!) Which again means we should pay attention when matching NBits. * We can have both truncation of extraction and widening of magic: https://rise4fun.com/Alive/XTw In other words, i don't believe we need to have any checks on bitwidths of any of these constructs. This is worsened in general by the fact that we may have `sext` instead of `zext` for shift amounts, and we don't yet canonicalize to `zext`, although we should. I have not done anything about that here. Also, we really should have something to weed out `sub` like these, by folding them into `add` variant. https://bugs.llvm.org/show_bug.cgi?id=42389 llvm-svn: 373964
*	[InstCombine][NFC] Tests for "conditional sign-extend of high-bit-extract" ↵	Roman Lebedev	2019-10-07	1	-0/+1040
\| \| \| \| \| \| \| \|	pattern (PR42389) https://bugs.llvm.org/show_bug.cgi?id=42389 llvm-svn: 373963