summaryrefslogtreecommitdiffstats
path: root/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
Commit message (Collapse)AuthorAgeFilesLines
* Revert "[SCEV] Move ScalarEvolutionExpander.cpp to Transforms/Utils (NFC)."Florian Hahn2020-01-041-1/+1
| | | | | This reverts commit 51ef53f3bd23559203fe9af82ff2facbfedc1db3, as it breaks some bots.
* [SCEV] Move ScalarEvolutionExpander.cpp to Transforms/Utils (NFC).Florian Hahn2020-01-041-1/+1
| | | | | | | | | | | | SCEVExpander modifies the underlying function so it is more suitable in Transforms/Utils, rather than Analysis. This allows using other transform utils in SCEVExpander. Reviewers: sanjoy.google, efriedma, reames Reviewed By: sanjoy.google Differential Revision: https://reviews.llvm.org/D71537
* Resubmit "[Alignment][NFC] Deprecate CreateMemCpy/CreateMemMove"Guillaume Chatelet2019-12-171-2/+2
| | | | | | | | | | | | | | | | | | | | Summary: This is a resubmit of D71473. This patch introduces a set of functions to enable deprecation of IRBuilder functions without breaking out of tree clients. Functions will be deprecated one by one and as in tree code is cleaned up. This is patch is part of a series to introduce an Alignment type. See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html See this patch for the introduction of the type: https://reviews.llvm.org/D64790 Reviewers: aaron.ballman, courbet Subscribers: llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D71547
* Revert "[Alignment][NFC] Deprecate CreateMemCpy/CreateMemMove"Guillaume Chatelet2019-12-161-2/+2
| | | | This reverts commit 181ab91efc9fb08dedda10a2fbc5fccb83ce8799.
* [Alignment][NFC] Deprecate CreateMemCpy/CreateMemMoveGuillaume Chatelet2019-12-161-2/+2
| | | | | | | | | | | | | | | | | | Summary: This patch introduces a set of functions to enable deprecation of IRBuilder functions without breaking out of tree clients. Functions will be deprecated one by one and as in tree code is cleaned up. This is patch is part of a series to introduce an Alignment type. See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html See this patch for the introduction of the type: https://reviews.llvm.org/D64790 Reviewers: courbet Subscribers: arsenm, jvesely, nhaehnle, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D71473
* Reland [DataLayout] Fix occurrences that size and range of pointers are ↵Nicola Zaghen2019-12-131-10/+10
| | | | | | | | | | | | | | assumed to be the same. GEP index size can be specified in the DataLayout, introduced in D42123. However, there were still places in which getIndexSizeInBits was used interchangeably with getPointerSizeInBits. This notably caused issues with Instcombine's visitPtrToInt; but the unit tests was incorrect, so this remained undiscovered. This fixes the buildbot failures. Differential Revision: https://reviews.llvm.org/D68328 Patch by Joseph Faulls!
* Temporarily Revert "[DataLayout] Fix occurrences that size and range of ↵Nicola Zaghen2019-12-121-10/+10
| | | | | | | | | pointers are assumed to be the same." This reverts commit 5f6208778ff92567c57d7c1e2e740c284d7e69a5. This caused failures in Transforms/PhaseOrdering/scev-custom-dl.ll const: Assertion `getBitWidth() == CR.getBitWidth() && "ConstantRange types don't agree!"' failed.
* [DataLayout] Fix occurrences that size and range of pointers are assumed to ↵Nicola Zaghen2019-12-121-10/+10
| | | | | | | | | | | | be the same. GEP index size can be specified in the DataLayout, introduced in D42123. However, there were still places in which getIndexSizeInBits was used interchangeably with getPointerSizeInBits. This notably caused issues with Instcombine's visitPtrToInt; but the unit tests was incorrect, so this remained undiscovered. Differential Revision: https://reviews.llvm.org/D68328 Patch by Joseph Faulls!
* [Alignment][NFC] CreateMemSet use MaybeAlignGuillaume Chatelet2019-12-101-8/+9
| | | | | | | | | | | | | | | Summary: This is patch is part of a series to introduce an Alignment type. See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html See this patch for the introduction of the type: https://reviews.llvm.org/D64790 Reviewers: courbet Subscribers: arsenm, jvesely, nhaehnle, hiraditya, cfe-commits, llvm-commits Tags: #clang, #llvm Differential Revision: https://reviews.llvm.org/D71213
* Sink all InitializePasses.h includesReid Kleckner2019-11-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This file lists every pass in LLVM, and is included by Pass.h, which is very popular. Every time we add, remove, or rename a pass in LLVM, it caused lots of recompilation. I found this fact by looking at this table, which is sorted by the number of times a file was changed over the last 100,000 git commits multiplied by the number of object files that depend on it in the current checkout: recompiles touches affected_files header 342380 95 3604 llvm/include/llvm/ADT/STLExtras.h 314730 234 1345 llvm/include/llvm/InitializePasses.h 307036 118 2602 llvm/include/llvm/ADT/APInt.h 213049 59 3611 llvm/include/llvm/Support/MathExtras.h 170422 47 3626 llvm/include/llvm/Support/Compiler.h 162225 45 3605 llvm/include/llvm/ADT/Optional.h 158319 63 2513 llvm/include/llvm/ADT/Triple.h 140322 39 3598 llvm/include/llvm/ADT/StringRef.h 137647 59 2333 llvm/include/llvm/Support/Error.h 131619 73 1803 llvm/include/llvm/Support/FileSystem.h Before this change, touching InitializePasses.h would cause 1345 files to recompile. After this change, touching it only causes 550 compiles in an incremental rebuild. Reviewers: bkramer, asbirlea, bollu, jdoerfert Differential Revision: https://reviews.llvm.org/D70211
* Revert BCmp Loop Idiom recognition transform (PR43870)Roman Lebedev2019-11-021-877/+8
| | | | | | | | | | | | | | | | | As discussed in https://bugs.llvm.org/show_bug.cgi?id=43870, this transform is missing a crucial legality check: the old (non-countable) loop would early-return upon first mismatch, but there is no such guarantee for bcmp/memcmp. We'd need to ensure that [PtrA, PtrA+NBytes) and [PtrB, PtrB+NBytes) are fully dereferenceable memory regions. But that would limit the transform to constant loop trip counts and would further cripple it because dereferenceability analysis is *very* partial. Furthermore, even if all that is done, every single test would need to be rewritten from scratch. So let's just give up.
* [LoopIdiom] BCmp: check, not assert that loop exits exit out of the loop ↵Roman Lebedev2019-10-171-7/+8
| | | | | | | | | | | | | | | (PR43687) We can't normally stumble into that assertion because a tautological *conditional* `br` in loop body is required, one that always branches to loop latch. But that should have been always folded to an unconditional branch before we get it. But that is not guaranteed if the pass is run standalone. So let's just promote the assertion into a proper check. Fixes https://bugs.llvm.org/show_bug.cgi?id=43687 llvm-svn: 375100
* [Alignment][NFC] Remove dependency on GlobalObject::setAlignment(unsigned)Guillaume Chatelet2019-10-151-1/+1
| | | | | | | | | | | | | | | | | Summary: This is patch is part of a series to introduce an Alignment type. See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html See this patch for the introduction of the type: https://reviews.llvm.org/D64790 Reviewers: courbet Subscribers: arsenm, mehdi_amini, jvesely, nhaehnle, hiraditya, steven_wu, dexonsmith, dang, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68944 llvm-svn: 374880
* [LoopIdiom] BCmp: loop exit count must not be wider than size_t that `bcmp` ↵Roman Lebedev2019-10-141-0/+6
| | | | | | | | | | | | takes As reported by Joerg Sonnenberger in IRC, for 32-bit systems, where pointer and size_t are 32-bit, if you use 64-bit-wide variable in the loop, you could end up with loop exit count being of the type wider than the size_t. Now, i'm not sure if we can produce `bcmp` from that (just truncate?), but we certainly should not assert/miscompile. llvm-svn: 374811
* [NFC][LoopIdiom] Adjust FIXME to be self-explanatoryRoman Lebedev2019-10-121-1/+1
| | | | llvm-svn: 374670
* [LoopIdiomRecognize] Recommit: BCmp loop idiom recognitionRoman Lebedev2019-10-121-8/+870
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: This is a recommit, this originally landed in rL370454 but was subsequently reverted in rL370788 due to https://bugs.llvm.org/show_bug.cgi?id=43206 The reduced testcase was added to bcmp-negative-tests.ll as @pr43206_different_loops - we must ensure that the SCEV's we got are both for the same loop we are currently investigating. Original commit message: @mclow.lists brought up this issue up in IRC. It is a reasonably common problem to compare some two values for equality. Those may be just some integers, strings or arrays of integers. In C, there is `memcmp()`, `bcmp()` functions. In C++, there exists `std::equal()` algorithm. One can also write that function manually. libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ libc++ does not do anything like that, it simply relies on simple C++'s `operator==()`. https://godbolt.org/z/er0Zwf (GOOD!) So likely, there exists a certain performance opportunities. Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that is using `memcmp()` (in this case, compiled with modified compiler). {F8768213} ``` #include <algorithm> #include <cmath> #include <cstdint> #include <iterator> #include <limits> #include <random> #include <type_traits> #include <utility> #include <vector> #include "benchmark/benchmark.h" template <class T> bool equal(T* a, T* a_end, T* b) noexcept { for (; a != a_end; ++a, ++b) { if (*a != *b) return false; } return true; } template <typename T> std::vector<T> getVectorOfRandomNumbers(size_t count) { std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(), std::numeric_limits<T>::max()); std::vector<T> v; v.reserve(count); std::generate_n(std::back_inserter(v), count, [&dis, &gen]() { return dis(gen); }); assert(v.size() == count); return v; } struct Identical { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { auto Tmp = getVectorOfRandomNumbers<T>(count); return std::make_pair(Tmp, std::move(Tmp)); } }; struct InequalHalfway { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { auto V0 = getVectorOfRandomNumbers<T>(count); auto V1 = V0; V1[V1.size() / size_t(2)]++; // just change the value. return std::make_pair(std::move(V0), std::move(V1)); } }; template <class T, class Gen> void BM_bcmp(benchmark::State& state) { const size_t Length = state.range(0); const std::pair<std::vector<T>, std::vector<T>> Data = Gen::template Gen<T>(Length); const std::vector<T>& a = Data.first; const std::vector<T>& b = Data.second; assert(a.size() == Length && b.size() == a.size()); benchmark::ClobberMemory(); benchmark::DoNotOptimize(a); benchmark::DoNotOptimize(a.data()); benchmark::DoNotOptimize(b); benchmark::DoNotOptimize(b.data()); for (auto _ : state) { const bool is_equal = equal(a.data(), a.data() + a.size(), b.data()); benchmark::DoNotOptimize(is_equal); } state.SetComplexityN(Length); state.counters["eltcnt"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant); state.counters["eltcnt/sec"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate); const size_t BytesRead = 2 * sizeof(T) * Length; state.counters["bytes_read/iteration"] = benchmark::Counter(BytesRead, benchmark::Counter::kDefaults, benchmark::Counter::OneK::kIs1024); state.counters["bytes_read/sec"] = benchmark::Counter( BytesRead, benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024); } template <typename T> static void CustomArguments(benchmark::internal::Benchmark* b) { const size_t L2SizeBytes = []() { for (const benchmark::CPUInfo::CacheInfo& I : benchmark::CPUInfo::Get().caches) { if (I.level == 2) return I.size; } return 0; }(); // What is the largest range we can check to always fit within given L2 cache? const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 / /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2; b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN); } BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical) ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical) ->Apply(CustomArguments<uint64_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway) ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway) ->Apply(CustomArguments<uint64_t>); ``` {F8768210} ``` $ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx 2019-04-25 21:17:11 Running build-old/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 0.65, 3.90, 4.14 --------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... --------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 432131 ns 432101 ns 1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s BM_bcmp<uint8_t, Identical>_BigO 0.86 N 0.86 N BM_bcmp<uint8_t, Identical>_RMS 8 % 8 % <...> BM_bcmp<uint16_t, Identical>/256000 161408 ns 161409 ns 4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s BM_bcmp<uint16_t, Identical>_BigO 0.67 N 0.67 N BM_bcmp<uint16_t, Identical>_RMS 25 % 25 % <...> BM_bcmp<uint32_t, Identical>/128000 81497 ns 81488 ns 8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s BM_bcmp<uint32_t, Identical>_BigO 0.71 N 0.71 N BM_bcmp<uint32_t, Identical>_RMS 42 % 42 % <...> BM_bcmp<uint64_t, Identical>/64000 50138 ns 50138 ns 10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s BM_bcmp<uint64_t, Identical>_BigO 0.84 N 0.84 N BM_bcmp<uint64_t, Identical>_RMS 27 % 27 % <...> BM_bcmp<uint8_t, InequalHalfway>/512000 192405 ns 192392 ns 3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s BM_bcmp<uint8_t, InequalHalfway>_BigO 0.38 N 0.38 N BM_bcmp<uint8_t, InequalHalfway>_RMS 3 % 3 % <...> BM_bcmp<uint16_t, InequalHalfway>/256000 127858 ns 127860 ns 5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s BM_bcmp<uint16_t, InequalHalfway>_BigO 0.50 N 0.50 N BM_bcmp<uint16_t, InequalHalfway>_RMS 0 % 0 % <...> BM_bcmp<uint32_t, InequalHalfway>/128000 49140 ns 49140 ns 14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s BM_bcmp<uint32_t, InequalHalfway>_BigO 0.40 N 0.40 N BM_bcmp<uint32_t, InequalHalfway>_RMS 18 % 18 % <...> BM_bcmp<uint64_t, InequalHalfway>/64000 32101 ns 32099 ns 21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s BM_bcmp<uint64_t, InequalHalfway>_BigO 0.50 N 0.50 N BM_bcmp<uint64_t, InequalHalfway>_RMS 1 % 1 % RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0 2019-04-25 21:19:29 Running build-new/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 1.01, 2.85, 3.71 --------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... --------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 18593 ns 18590 ns 37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s BM_bcmp<uint8_t, Identical>_BigO 0.04 N 0.04 N BM_bcmp<uint8_t, Identical>_RMS 37 % 37 % <...> BM_bcmp<uint16_t, Identical>/256000 18950 ns 18948 ns 37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s BM_bcmp<uint16_t, Identical>_BigO 0.08 N 0.08 N BM_bcmp<uint16_t, Identical>_RMS 34 % 34 % <...> BM_bcmp<uint32_t, Identical>/128000 18627 ns 18627 ns 37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s BM_bcmp<uint32_t, Identical>_BigO 0.16 N 0.16 N BM_bcmp<uint32_t, Identical>_RMS 35 % 35 % <...> BM_bcmp<uint64_t, Identical>/64000 18855 ns 18855 ns 37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s BM_bcmp<uint64_t, Identical>_BigO 0.32 N 0.32 N BM_bcmp<uint64_t, Identical>_RMS 33 % 33 % <...> BM_bcmp<uint8_t, InequalHalfway>/512000 9570 ns 9569 ns 73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s BM_bcmp<uint8_t, InequalHalfway>_BigO 0.02 N 0.02 N BM_bcmp<uint8_t, InequalHalfway>_RMS 29 % 29 % <...> BM_bcmp<uint16_t, InequalHalfway>/256000 9547 ns 9547 ns 74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s BM_bcmp<uint16_t, InequalHalfway>_BigO 0.04 N 0.04 N BM_bcmp<uint16_t, InequalHalfway>_RMS 29 % 29 % <...> BM_bcmp<uint32_t, InequalHalfway>/128000 9396 ns 9394 ns 73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s BM_bcmp<uint32_t, InequalHalfway>_BigO 0.08 N 0.08 N BM_bcmp<uint32_t, InequalHalfway>_RMS 30 % 30 % <...> BM_bcmp<uint64_t, InequalHalfway>/64000 9499 ns 9498 ns 73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s BM_bcmp<uint64_t, InequalHalfway>_BigO 0.16 N 0.16 N BM_bcmp<uint64_t, InequalHalfway>_RMS 28 % 28 % Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench Benchmark Time CPU Time Old Time New CPU Old CPU New --------------------------------------------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 -0.9570 -0.9570 432131 18593 432101 18590 <...> BM_bcmp<uint16_t, Identical>/256000 -0.8826 -0.8826 161408 18950 161409 18948 <...> BM_bcmp<uint32_t, Identical>/128000 -0.7714 -0.7714 81497 18627 81488 18627 <...> BM_bcmp<uint64_t, Identical>/64000 -0.6239 -0.6239 50138 18855 50138 18855 <...> BM_bcmp<uint8_t, InequalHalfway>/512000 -0.9503 -0.9503 192405 9570 192392 9569 <...> BM_bcmp<uint16_t, InequalHalfway>/256000 -0.9253 -0.9253 127858 9547 127860 9547 <...> BM_bcmp<uint32_t, InequalHalfway>/128000 -0.8088 -0.8088 49140 9396 49140 9394 <...> BM_bcmp<uint64_t, InequalHalfway>/64000 -0.7041 -0.7041 32101 9499 32099 9498 ``` What can we tell from the benchmark? * Performance of naive equality check somewhat improves with element size, maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s for uint64_t. I think, that instability implies performance problems. * Performance of `memcmp()`-aware benchmark always maxes out at around bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the naive variant! * eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and linearly decreases with element size. For uint64_t, it's ~4x+ the elements/second. * The call obvious is more pricey than the loop, with small element count. As it can be seen from the full output {F8768210}, the `memcmp()` is almost universally worse, independent of the element size (and thus buffer size) when element count is less than 8. So all in all, bcmp idiom does indeed pose untapped performance headroom. This diff does implement said idiom recognition. I think a reasonable test coverage is present, but do tell if there is anything obvious missing. Now, quality. This does succeed to build and pass the test-suite, at least without any non-bundled elements. {F8768216} {F8768217} This transform fires 91 times: ``` $ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json Tests: 1149 Metric: loop-idiom.NumBCmp Program result-new MultiSourc...Benchmarks/7zip/7zip-benchmark 79.00 MultiSource/Applications/d/make_dparser 3.00 SingleSource/UnitTests/vla 2.00 MultiSource/Applications/Burg/burg 1.00 MultiSourc.../Applications/JM/lencod/lencod 1.00 MultiSource/Applications/lemon/lemon 1.00 MultiSource/Benchmarks/Bullet/bullet 1.00 MultiSourc...e/Benchmarks/MallocBench/gs/gs 1.00 MultiSourc...gs-C/TimberWolfMC/timberwolfmc 1.00 MultiSourc...Prolangs-C/simulator/simulator 1.00 ``` The size changes are: I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look. ``` $ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash Tests: 1149 Same hash: 907 (filtered out) Remaining: 242 Metric: size..text Program result-old result-new diff test-suite...ingleSource/UnitTests/vla.test 753.00 833.00 10.6% test-suite...marks/7zip/7zip-benchmark.test 1001697.00 966657.00 -3.5% test-suite...ngs-C/simulator/simulator.test 32369.00 32321.00 -0.1% test-suite...plications/d/make_dparser.test 89585.00 89505.00 -0.1% test-suite...ce/Applications/Burg/burg.test 40817.00 40785.00 -0.1% test-suite.../Applications/lemon/lemon.test 47281.00 47249.00 -0.1% test-suite...TimberWolfMC/timberwolfmc.test 250065.00 250113.00 0.0% test-suite...chmarks/MallocBench/gs/gs.test 149889.00 149873.00 -0.0% test-suite...ications/JM/lencod/lencod.test 769585.00 769569.00 -0.0% test-suite.../Benchmarks/Bullet/bullet.test 770049.00 770049.00 0.0% test-suite...HMARK_ANISTROPIC_DIFFUSION/128 NaN NaN nan% test-suite...HMARK_ANISTROPIC_DIFFUSION/256 NaN NaN nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/64 NaN NaN nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/32 NaN NaN nan% test-suite...ENCHMARK_BILATERAL_FILTER/64/4 NaN NaN nan% Geomean difference nan% result-old result-new diff count 1.000000e+01 10.00000 10.000000 mean 3.152090e+05 311695.40000 0.006749 std 3.790398e+05 372091.42232 0.036605 min 7.530000e+02 833.00000 -0.034981 25% 4.243300e+04 42401.00000 -0.000866 50% 1.197370e+05 119689.00000 -0.000392 75% 6.397050e+05 639705.00000 -0.000005 max 1.001697e+06 966657.00000 0.106242 ``` I don't have timings though. And now to the code. The basic idea is to completely replace the whole loop. If we can't fully kill it, don't transform. I have left one or two comments in the code, so hopefully it can be understood. Also, there is a few TODO's that i have left for follow-ups: * widening of `memcmp()`/`bcmp()` * step smaller than the comparison size * Metadata propagation * more than two blocks as long as there is still a single backedge? * ??? Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet Reviewed By: courbet Subscribers: miyuki, hiraditya, xbolva00, nikic, jfb, gchatelet, courbet, llvm-commits, mclow.lists Tags: #llvm Differential Revision: https://reviews.llvm.org/D61144 llvm-svn: 374662
* Change TargetLibraryInfo analysis passes to always require FunctionTeresa Johnson2019-09-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: This is the first change to enable the TLI to be built per-function so that -fno-builtin* handling can be migrated to use function attributes. See discussion on D61634 for background. This is an enabler for fixing handling of these options for LTO, for example. This change should not affect behavior, as the provided function is not yet used to build a specifically per-function TLI, but rather enables that migration. Most of the changes were very mechanical, e.g. passing a Function to the legacy analysis pass's getTLI interface, or in Module level cases, adding a callback. This is similar to the way the per-function TTI analysis works. There was one place where we were looking for builtins but not in the context of a specific function. See FindCXAAtExit in lib/Transforms/IPO/GlobalOpt.cpp. I'm somewhat concerned my workaround could provide the wrong behavior in some corner cases. Suggestions welcome. Reviewers: chandlerc, hfinkel Subscribers: arsenm, dschuff, jvesely, nhaehnle, mehdi_amini, javed.absar, sbc100, jgravelle-google, eraman, aheejin, steven_wu, george.burgess.iv, dexonsmith, jfb, asbirlea, gchatelet, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66428 llvm-svn: 371284
* Revert r370454 "[LoopIdiomRecognize] BCmp loop idiom recognition"Roman Lebedev2019-09-031-869/+8
| | | | | | | | | | https://bugs.llvm.org/show_bug.cgi?id=43206 was filed, claiming that there is a miscompilation. Reverting until i investigate. This reverts commit r370454 llvm-svn: 370788
* Remove an extra ";", NFC.Haojian Wu2019-08-301-1/+1
| | | | llvm-svn: 370465
* [LoopIdiomRecognize] BCmp loop idiom recognitionRoman Lebedev2019-08-301-8/+869
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: @mclow.lists brought up this issue up in IRC. It is a reasonably common problem to compare some two values for equality. Those may be just some integers, strings or arrays of integers. In C, there is `memcmp()`, `bcmp()` functions. In C++, there exists `std::equal()` algorithm. One can also write that function manually. libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ libc++ does not do anything like that, it simply relies on simple C++'s `operator==()`. https://godbolt.org/z/er0Zwf (GOOD!) So likely, there exists a certain performance opportunities. Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that is using `memcmp()` (in this case, compiled with modified compiler). {F8768213} ``` #include <algorithm> #include <cmath> #include <cstdint> #include <iterator> #include <limits> #include <random> #include <type_traits> #include <utility> #include <vector> #include "benchmark/benchmark.h" template <class T> bool equal(T* a, T* a_end, T* b) noexcept { for (; a != a_end; ++a, ++b) { if (*a != *b) return false; } return true; } template <typename T> std::vector<T> getVectorOfRandomNumbers(size_t count) { std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(), std::numeric_limits<T>::max()); std::vector<T> v; v.reserve(count); std::generate_n(std::back_inserter(v), count, [&dis, &gen]() { return dis(gen); }); assert(v.size() == count); return v; } struct Identical { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { auto Tmp = getVectorOfRandomNumbers<T>(count); return std::make_pair(Tmp, std::move(Tmp)); } }; struct InequalHalfway { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { auto V0 = getVectorOfRandomNumbers<T>(count); auto V1 = V0; V1[V1.size() / size_t(2)]++; // just change the value. return std::make_pair(std::move(V0), std::move(V1)); } }; template <class T, class Gen> void BM_bcmp(benchmark::State& state) { const size_t Length = state.range(0); const std::pair<std::vector<T>, std::vector<T>> Data = Gen::template Gen<T>(Length); const std::vector<T>& a = Data.first; const std::vector<T>& b = Data.second; assert(a.size() == Length && b.size() == a.size()); benchmark::ClobberMemory(); benchmark::DoNotOptimize(a); benchmark::DoNotOptimize(a.data()); benchmark::DoNotOptimize(b); benchmark::DoNotOptimize(b.data()); for (auto _ : state) { const bool is_equal = equal(a.data(), a.data() + a.size(), b.data()); benchmark::DoNotOptimize(is_equal); } state.SetComplexityN(Length); state.counters["eltcnt"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant); state.counters["eltcnt/sec"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate); const size_t BytesRead = 2 * sizeof(T) * Length; state.counters["bytes_read/iteration"] = benchmark::Counter(BytesRead, benchmark::Counter::kDefaults, benchmark::Counter::OneK::kIs1024); state.counters["bytes_read/sec"] = benchmark::Counter( BytesRead, benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024); } template <typename T> static void CustomArguments(benchmark::internal::Benchmark* b) { const size_t L2SizeBytes = []() { for (const benchmark::CPUInfo::CacheInfo& I : benchmark::CPUInfo::Get().caches) { if (I.level == 2) return I.size; } return 0; }(); // What is the largest range we can check to always fit within given L2 cache? const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 / /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2; b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN); } BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical) ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical) ->Apply(CustomArguments<uint64_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway) ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway) ->Apply(CustomArguments<uint64_t>); ``` {F8768210} ``` $ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx 2019-04-25 21:17:11 Running build-old/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 0.65, 3.90, 4.14 --------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... --------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 432131 ns 432101 ns 1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s BM_bcmp<uint8_t, Identical>_BigO 0.86 N 0.86 N BM_bcmp<uint8_t, Identical>_RMS 8 % 8 % <...> BM_bcmp<uint16_t, Identical>/256000 161408 ns 161409 ns 4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s BM_bcmp<uint16_t, Identical>_BigO 0.67 N 0.67 N BM_bcmp<uint16_t, Identical>_RMS 25 % 25 % <...> BM_bcmp<uint32_t, Identical>/128000 81497 ns 81488 ns 8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s BM_bcmp<uint32_t, Identical>_BigO 0.71 N 0.71 N BM_bcmp<uint32_t, Identical>_RMS 42 % 42 % <...> BM_bcmp<uint64_t, Identical>/64000 50138 ns 50138 ns 10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s BM_bcmp<uint64_t, Identical>_BigO 0.84 N 0.84 N BM_bcmp<uint64_t, Identical>_RMS 27 % 27 % <...> BM_bcmp<uint8_t, InequalHalfway>/512000 192405 ns 192392 ns 3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s BM_bcmp<uint8_t, InequalHalfway>_BigO 0.38 N 0.38 N BM_bcmp<uint8_t, InequalHalfway>_RMS 3 % 3 % <...> BM_bcmp<uint16_t, InequalHalfway>/256000 127858 ns 127860 ns 5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s BM_bcmp<uint16_t, InequalHalfway>_BigO 0.50 N 0.50 N BM_bcmp<uint16_t, InequalHalfway>_RMS 0 % 0 % <...> BM_bcmp<uint32_t, InequalHalfway>/128000 49140 ns 49140 ns 14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s BM_bcmp<uint32_t, InequalHalfway>_BigO 0.40 N 0.40 N BM_bcmp<uint32_t, InequalHalfway>_RMS 18 % 18 % <...> BM_bcmp<uint64_t, InequalHalfway>/64000 32101 ns 32099 ns 21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s BM_bcmp<uint64_t, InequalHalfway>_BigO 0.50 N 0.50 N BM_bcmp<uint64_t, InequalHalfway>_RMS 1 % 1 % RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0 2019-04-25 21:19:29 Running build-new/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 1.01, 2.85, 3.71 --------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... --------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 18593 ns 18590 ns 37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s BM_bcmp<uint8_t, Identical>_BigO 0.04 N 0.04 N BM_bcmp<uint8_t, Identical>_RMS 37 % 37 % <...> BM_bcmp<uint16_t, Identical>/256000 18950 ns 18948 ns 37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s BM_bcmp<uint16_t, Identical>_BigO 0.08 N 0.08 N BM_bcmp<uint16_t, Identical>_RMS 34 % 34 % <...> BM_bcmp<uint32_t, Identical>/128000 18627 ns 18627 ns 37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s BM_bcmp<uint32_t, Identical>_BigO 0.16 N 0.16 N BM_bcmp<uint32_t, Identical>_RMS 35 % 35 % <...> BM_bcmp<uint64_t, Identical>/64000 18855 ns 18855 ns 37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s BM_bcmp<uint64_t, Identical>_BigO 0.32 N 0.32 N BM_bcmp<uint64_t, Identical>_RMS 33 % 33 % <...> BM_bcmp<uint8_t, InequalHalfway>/512000 9570 ns 9569 ns 73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s BM_bcmp<uint8_t, InequalHalfway>_BigO 0.02 N 0.02 N BM_bcmp<uint8_t, InequalHalfway>_RMS 29 % 29 % <...> BM_bcmp<uint16_t, InequalHalfway>/256000 9547 ns 9547 ns 74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s BM_bcmp<uint16_t, InequalHalfway>_BigO 0.04 N 0.04 N BM_bcmp<uint16_t, InequalHalfway>_RMS 29 % 29 % <...> BM_bcmp<uint32_t, InequalHalfway>/128000 9396 ns 9394 ns 73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s BM_bcmp<uint32_t, InequalHalfway>_BigO 0.08 N 0.08 N BM_bcmp<uint32_t, InequalHalfway>_RMS 30 % 30 % <...> BM_bcmp<uint64_t, InequalHalfway>/64000 9499 ns 9498 ns 73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s BM_bcmp<uint64_t, InequalHalfway>_BigO 0.16 N 0.16 N BM_bcmp<uint64_t, InequalHalfway>_RMS 28 % 28 % Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench Benchmark Time CPU Time Old Time New CPU Old CPU New --------------------------------------------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 -0.9570 -0.9570 432131 18593 432101 18590 <...> BM_bcmp<uint16_t, Identical>/256000 -0.8826 -0.8826 161408 18950 161409 18948 <...> BM_bcmp<uint32_t, Identical>/128000 -0.7714 -0.7714 81497 18627 81488 18627 <...> BM_bcmp<uint64_t, Identical>/64000 -0.6239 -0.6239 50138 18855 50138 18855 <...> BM_bcmp<uint8_t, InequalHalfway>/512000 -0.9503 -0.9503 192405 9570 192392 9569 <...> BM_bcmp<uint16_t, InequalHalfway>/256000 -0.9253 -0.9253 127858 9547 127860 9547 <...> BM_bcmp<uint32_t, InequalHalfway>/128000 -0.8088 -0.8088 49140 9396 49140 9394 <...> BM_bcmp<uint64_t, InequalHalfway>/64000 -0.7041 -0.7041 32101 9499 32099 9498 ``` What can we tell from the benchmark? * Performance of naive equality check somewhat improves with element size, maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s for uint64_t. I think, that instability implies performance problems. * Performance of `memcmp()`-aware benchmark always maxes out at around bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the naive variant! * eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and linearly decreases with element size. For uint64_t, it's ~4x+ the elements/second. * The call obvious is more pricey than the loop, with small element count. As it can be seen from the full output {F8768210}, the `memcmp()` is almost universally worse, independent of the element size (and thus buffer size) when element count is less than 8. So all in all, bcmp idiom does indeed pose untapped performance headroom. This diff does implement said idiom recognition. I think a reasonable test coverage is present, but do tell if there is anything obvious missing. Now, quality. This does succeed to build and pass the test-suite, at least without any non-bundled elements. {F8768216} {F8768217} This transform fires 91 times: ``` $ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json Tests: 1149 Metric: loop-idiom.NumBCmp Program result-new MultiSourc...Benchmarks/7zip/7zip-benchmark 79.00 MultiSource/Applications/d/make_dparser 3.00 SingleSource/UnitTests/vla 2.00 MultiSource/Applications/Burg/burg 1.00 MultiSourc.../Applications/JM/lencod/lencod 1.00 MultiSource/Applications/lemon/lemon 1.00 MultiSource/Benchmarks/Bullet/bullet 1.00 MultiSourc...e/Benchmarks/MallocBench/gs/gs 1.00 MultiSourc...gs-C/TimberWolfMC/timberwolfmc 1.00 MultiSourc...Prolangs-C/simulator/simulator 1.00 ``` The size changes are: I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look. ``` $ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash Tests: 1149 Same hash: 907 (filtered out) Remaining: 242 Metric: size..text Program result-old result-new diff test-suite...ingleSource/UnitTests/vla.test 753.00 833.00 10.6% test-suite...marks/7zip/7zip-benchmark.test 1001697.00 966657.00 -3.5% test-suite...ngs-C/simulator/simulator.test 32369.00 32321.00 -0.1% test-suite...plications/d/make_dparser.test 89585.00 89505.00 -0.1% test-suite...ce/Applications/Burg/burg.test 40817.00 40785.00 -0.1% test-suite.../Applications/lemon/lemon.test 47281.00 47249.00 -0.1% test-suite...TimberWolfMC/timberwolfmc.test 250065.00 250113.00 0.0% test-suite...chmarks/MallocBench/gs/gs.test 149889.00 149873.00 -0.0% test-suite...ications/JM/lencod/lencod.test 769585.00 769569.00 -0.0% test-suite.../Benchmarks/Bullet/bullet.test 770049.00 770049.00 0.0% test-suite...HMARK_ANISTROPIC_DIFFUSION/128 NaN NaN nan% test-suite...HMARK_ANISTROPIC_DIFFUSION/256 NaN NaN nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/64 NaN NaN nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/32 NaN NaN nan% test-suite...ENCHMARK_BILATERAL_FILTER/64/4 NaN NaN nan% Geomean difference nan% result-old result-new diff count 1.000000e+01 10.00000 10.000000 mean 3.152090e+05 311695.40000 0.006749 std 3.790398e+05 372091.42232 0.036605 min 7.530000e+02 833.00000 -0.034981 25% 4.243300e+04 42401.00000 -0.000866 50% 1.197370e+05 119689.00000 -0.000392 75% 6.397050e+05 639705.00000 -0.000005 max 1.001697e+06 966657.00000 0.106242 ``` I don't have timings though. And now to the code. The basic idea is to completely replace the whole loop. If we can't fully kill it, don't transform. I have left one or two comments in the code, so hopefully it can be understood. Also, there is a few TODO's that i have left for follow-ups: * widening of `memcmp()`/`bcmp()` * step smaller than the comparison size * Metadata propagation * more than two blocks as long as there is still a single backedge? * ??? Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet Reviewed By: courbet Subscribers: hiraditya, xbolva00, nikic, jfb, gchatelet, courbet, llvm-commits, mclow.lists Tags: #llvm Differential Revision: https://reviews.llvm.org/D61144 llvm-svn: 370454
* NFC: Pass DataLayout into isBytewiseValueVitaly Buka2019-07-101-4/+4
| | | | | | | | | | | | | | | | | | Summary: We will need to handle IntToPtr which I will submit in a separate patch as it's not going to be NFC. Reviewers: eugenis, pcc Reviewed By: eugenis Subscribers: hiraditya, cfe-commits, llvm-commits Tags: #clang, #llvm Differential Revision: https://reviews.llvm.org/D63940 llvm-svn: 365709
* [LoopIdiom] Basic OptimizationRemarkEmitter handlingRoman Lebedev2019-05-301-4/+40
| | | | | | | | | | | | | | | | | | Summary: I'm adding ORE to memset/memcpy formation, with tests, but mainly this is split off from D61144. Reviewers: reames, anemet, thegameg, craig.topper Reviewed By: thegameg Subscribers: llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D62631 llvm-svn: 362092
* [LoopIdiomRecognize][NFC] Sort includesRoman Lebedev2019-05-301-2/+2
| | | | | | Split off from D61144 llvm-svn: 362091
* [LoopIdiomRecognize][NFC] Use DEBUG_TYPE, add LLVM_DEBUG() to ↵Roman Lebedev2019-05-291-2/+8
| | | | | | | | runOnNoncountableLoop() Split off from D61144 llvm-svn: 362022
* [IR] Refactor attribute methods in Function class (NFC)Evandro Menezes2019-04-041-1/+1
| | | | | | | | Rename the functions that query the optimization kind attributes. Differential revision: https://reviews.llvm.org/D60287 llvm-svn: 357731
* [LoopIdiomRecognize] @llvm.dbg values shouldn't affect the transformation.Davide Italiano2019-02-031-2/+8
| | | | | | | | | | | | | | Summary: PR40564 Reviewers: aprantl, rnk Subscribers: llvm-commits, hiraditya Tags: #llvm Differential Revision: https://reviews.llvm.org/D57629 llvm-svn: 353007
* [opaque pointer types] Pass function types to CallInst creation.James Y Knight2019-02-011-2/+2
| | | | | | | | | This cleans up all CallInst creation in LLVM to explicitly pass a function type rather than deriving it from the pointer's element-type. Differential Revision: https://reviews.llvm.org/D57170 llvm-svn: 352909
* [opaque pointer types] Add a FunctionCallee wrapper type, and use it.James Y Knight2019-02-011-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recommit r352791 after tweaking DerivedTypes.h slightly, so that gcc doesn't choke on it, hopefully. Original Message: The FunctionCallee type is effectively a {FunctionType*,Value*} pair, and is a useful convenience to enable code to continue passing the result of getOrInsertFunction() through to EmitCall, even once pointer types lose their pointee-type. Then: - update the CallInst/InvokeInst instruction creation functions to take a Callee, - modify getOrInsertFunction to return FunctionCallee, and - update all callers appropriately. One area of particular note is the change to the sanitizer code. Previously, they had been casting the result of `getOrInsertFunction` to a `Function*` via `checkSanitizerInterfaceFunction`, and storing that. That would report an error if someone had already inserted a function declaraction with a mismatching signature. However, in general, LLVM allows for such mismatches, as `getOrInsertFunction` will automatically insert a bitcast if needed. As part of this cleanup, cause the sanitizer code to do the same. (It will call its functions using the expected signature, however they may have been declared.) Finally, in a small number of locations, callers of `getOrInsertFunction` actually were expecting/requiring that a brand new function was being created. In such cases, I've switched them to Function::Create instead. Differential Revision: https://reviews.llvm.org/D57315 llvm-svn: 352827
* Revert "[opaque pointer types] Add a FunctionCallee wrapper type, and use it."James Y Knight2019-01-311-2/+3
| | | | | | | | | This reverts commit f47d6b38c7a61d50db4566b02719de05492dcef1 (r352791). Seems to run into compilation failures with GCC (but not clang, where I tested it). Reverting while I investigate. llvm-svn: 352800
* [opaque pointer types] Add a FunctionCallee wrapper type, and use it.James Y Knight2019-01-311-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The FunctionCallee type is effectively a {FunctionType*,Value*} pair, and is a useful convenience to enable code to continue passing the result of getOrInsertFunction() through to EmitCall, even once pointer types lose their pointee-type. Then: - update the CallInst/InvokeInst instruction creation functions to take a Callee, - modify getOrInsertFunction to return FunctionCallee, and - update all callers appropriately. One area of particular note is the change to the sanitizer code. Previously, they had been casting the result of `getOrInsertFunction` to a `Function*` via `checkSanitizerInterfaceFunction`, and storing that. That would report an error if someone had already inserted a function declaraction with a mismatching signature. However, in general, LLVM allows for such mismatches, as `getOrInsertFunction` will automatically insert a bitcast if needed. As part of this cleanup, cause the sanitizer code to do the same. (It will call its functions using the expected signature, however they may have been declared.) Finally, in a small number of locations, callers of `getOrInsertFunction` actually were expecting/requiring that a brand new function was being created. In such cases, I've switched them to Function::Create instead. Differential Revision: https://reviews.llvm.org/D57315 llvm-svn: 352791
* Update the file headers across all of the LLVM projects in the monorepoChandler Carruth2019-01-191-4/+3
| | | | | | | | | | | | | | | | | to reflect the new license. We understand that people may be surprised that we're moving the header entirely to discuss the new license. We checked this carefully with the Foundation's lawyer and we believe this is the correct approach. Essentially, all code in the project is now made available by the LLVM project under our new license, so you will see that the license headers include that license only. Some of our contributors have contributed code under our old license, and accordingly, we have retained a copy of our old license notice in the top-level files in each project and repository. llvm-svn: 351636
* [LoopIdiomRecognize] Add CTTZ supportCraig Topper2018-12-261-64/+85
| | | | | | | | | | | | | | | | | | | | | Summary: Existing LIR recognizes CTLZ where shifting input variable right until it is zero. (Shift-Until-Zero idiom) This commit: 1. Augments Shift-Until-Zero idiom to recognize CTTZ where input variable is shifted left. 2. Prepare for BitScan idiom recognition. Patch by Yuanfang Chen (tabloid.adroit) Reviewers: craig.topper, evstupac Reviewed By: craig.topper Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D55876 llvm-svn: 350074
* [LoopIdioms] More LocationSize::precise annotations; NFCGeorge Burgess IV2018-12-241-2/+3
| | | | | | | | | Both of these places reference memset-like loops. Memset is precise. Trying to keep these patches super small so they're easily post-commit verifiable, as requested in D44748. llvm-svn: 350044
* [InstCombine] Cleanup libfunc attribute inferringDavid Bolvansky2018-10-161-2/+3
| | | | | | | | | | | | Reviewers: efriedma Reviewed By: efriedma Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D53338 llvm-svn: 344645
* [NFC] Make LoopSafetyInfo abstract to allow alternative implementationsMax Kazantsev2018-10-161-1/+1
| | | | llvm-svn: 344592
* [InstCombine] Fixed crash with aliased functionsDavid Bolvansky2018-10-131-1/+1
| | | | | | | | | | | | | | Summary: Fixes PR39177 Reviewers: spatel, jbuening Reviewed By: jbuening Subscribers: jbuening, llvm-commits Differential Revision: https://reviews.llvm.org/D53129 llvm-svn: 344454
* Merge clang's isRepeatedBytePattern with LLVM's isBytewiseValueJF Bastien2018-09-211-0/+7
| | | | | | | | | | | | | Summary: his code was in CGDecl.cpp and really belongs in LLVM's isBytewiseValue. Teach isBytewiseValue the tricks clang's isRepeatedBytePattern had, including merging undef properly, and recursing on more types. clang part of this patch: D51752 Subscribers: dexonsmith, llvm-commits Differential Revision: https://reviews.llvm.org/D51751 llvm-svn: 342709
* NFC: remove magic bool in LoopIdiomRecognizeJF Bastien2018-09-071-9/+9
| | | | | | Use an enum class instead. llvm-svn: 341684
* [NFC] Refactoring of LoopSafetyInfo, step 1Max Kazantsev2018-08-151-2/+2
| | | | | | | | | Turn structure into class, encapsulate methods, add clarifying comments. Differential Revision: https://reviews.llvm.org/D50693 Reviewed By: reames llvm-svn: 339752
* Remove trailing spaceFangrui Song2018-07-301-1/+1
| | | | | | sed -Ei 's/[[:space:]]+$//' include/**/*.{def,h,td} lib/**/*.{cpp,h} llvm-svn: 338293
* [LoopIdiomRecognize] Don't convert a do while loop to ctlz.Craig Topper2018-07-111-10/+15
| | | | | | | | | | | | | | | | | | | This commit suppresses turning loops like this into "(bitwidth - ctlz(input))". unsigned foo(unsigned input) { unsigned num = 0; do { ++num; input >>= 1; } while (input != 0); return num; } The loop version returns a value of 1 for both an input of 0 and an input of 1. Converting to a naive ctlz does not preserve that. Theoretically we could do better if we checked isKnownNonZero or we could insert a select to handle the divergence. But until we have motivating cases for that, this is the easiest solution. llvm-svn: 336864
* [LoopIdiomRecognize] Support for converting loops that use LSHR to CTLZ.Craig Topper2018-07-081-13/+21
| | | | | | | | | | | | | | | | | | | | | In the 'detectCTLZIdiom' function support for loops that use LSHR instruction instead of ASHR has been added. This supports creating ctlz from the following code. int lzcnt(int x) { int count = 0; while (x > 0) { count++; x = x >> 1; } return count; } Patch by Olga Moldovanova Differential Revision: https://reviews.llvm.org/D48354 llvm-svn: 336509
* [LoopIdiomRecognize] Fix a couple places where it appears we were ↵Craig Topper2018-06-251-6/+5
| | | | | | unintenionally making copies of DebugLoc. llvm-svn: 335521
* [NFC] fix trivial typos in commentsHiroshi Inoue2018-06-141-2/+2
| | | | llvm-svn: 334687
* Move Analysis/Utils/Local.h back to TransformsDavid Blaikie2018-06-041-1/+1
| | | | | | | | | | Review feedback from r328165. Split out just the one function from the file that's used by Analysis. (As chandlerc pointed out, the original change only moved the header and not the implementation anyway - which was fine for the one function that was used (since it's a template/inlined in the header) but not in general) llvm-svn: 333954
* [LoopIdiomRecognize] Only convert loops to ctlz if we can prove that the ↵Craig Topper2018-05-311-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | input is non-negative. Summary: Loop idiom recognize tries to convert loops like ``` int foo(int x) { int cnt = 0; while (x) { x >>= 1; ++cnt; } return cnt; } ``` into calls to ctlz, but if x is initially negative this loop should be infinite. It happens that the cases that motivated this change have an absolute value of x before the loop. So this patch restricts the transform to cases where we know x is positive. Note: We are relying on the absolute value of INT_MIN to be undefined so we can assume that the result is always positive. Fixes PR37479 Reviewers: spatel, hfinkel, efriedma, javed.absar Reviewed By: efriedma Subscribers: dmgreen, llvm-commits Differential Revision: https://reviews.llvm.org/D47348 llvm-svn: 333702
* Rename DEBUG macro to LLVM_DEBUG.Nicola Zaghen2018-05-141-11/+13
| | | | | | | | | | | | | | | | The DEBUG() macro is very generic so it might clash with other projects. The renaming was done as follows: - git grep -l 'DEBUG' | xargs sed -i 's/\bDEBUG\s\?(/LLVM_DEBUG(/g' - git diff -U0 master | ../clang/tools/clang-format/clang-format-diff.py -i -p1 -style LLVM - Manual change to APInt - Manually chage DOCS as regex doesn't match it. In the transition period the DEBUG() macro is still present and aliased to the LLVM_DEBUG() one. Differential Revision: https://reviews.llvm.org/D43624 llvm-svn: 332240
* [LoopIdiomRecognize] Don't create an IRBuilder just to call getTrue/getFalse.Craig Topper2018-05-041-2/+2
| | | | | | We can call the methods in ConstantInt directly. We just need a context. llvm-svn: 331542
* [LoopIdiomRecognize] Replace more unchecked dyn_casts with cast.Craig Topper2018-05-041-4/+4
| | | | | | Two of these are immediately dereferenced on the next line. The other two are passed immediately to the IRBuilder constructor which can't handle a nullptr. llvm-svn: 331500
* [LoopIdiomRecognize] Use a regular array instead of a SmallVector and ↵Craig Topper2018-05-041-2/+1
| | | | | | explicit ArrayRef. llvm-svn: 331499
OpenPOWER on IntegriCloud