path: root/llvm/lib
Commit message | Author | Age | Files | Lines
...
* [MemorySSA] Rename all phi entries.Alina Sbirlea2019-08-301-3/+8
| | | | | | | When renaming a Phi's incoming values, there may be multiple incoming edges from the same block (e.g. from a switch). Rename all of them. llvm-svn: 370548
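For illustration only (not code from the patch), here is a hypothetical C++ function whose lowered CFG has the shape described above: after CFG simplification, two switch cases target the same join block, so a phi (or MemoryPhi) there can carry more than one entry for the single switch predecessor, and a renaming pass must update every one of them.

```cpp
// Hypothetical sketch: both case 0 and case 1 end up branching to the same
// join block, giving the join two incoming edges from the one switch block.
int classify(int x, int *p) {
  int r = 0;                 // value merged at the join block
  switch (x) {
  case 0:                    // after CFG simplification these two cases
  case 1:                    //   branch straight to the join, so the join's
    break;                   //   phi has two entries for the switch block
  default:
    r = *p;                  // a memory access, so a MemoryPhi forms too
    break;
  }
  return r;                  // join: phi(r) / MemoryPhi over memory state
}
```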
* [GVN] Verify value equality before doing phi translation for call instructionWei Mi2019-08-301-1/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | This is an updated version of https://reviews.llvm.org/D66909 to fix PR42605. Basically, the current phi translation translates an old value number to a new value number for a call instruction based on literal equality of the call expression, without verifying that there is no clobber in between. This is incorrect. To get a fine-grained check, use MemoryDependence analysis to do the job. However, this is still not ideal. Given a call instruction, `MemoryDependenceResults::getCallDependencyFrom` returns identical call instructions with no clobber in between, using a MemDepResult whose DepType is `Def`. But "identical" is too strict here and we want to relax it a little to account for phi translation -- the callee is the same, while the parameter operands can be different. That would mean changing the semantics of `MemDepResult::Def`, and I don't know the potential impact. So currently the patch is still conservative and only handles MemDepResult::NonFuncLocal, which means the current call has no function-local clobber. If there is a clobber, even if the clobber doesn't stand between the current call and the call with the new value, we won't do phi translation. Differential Revision: https://reviews.llvm.org/D67013 llvm-svn: 370547
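A deliberately simplified C++ sketch (not the PR42605 reproducer; names are invented) of why literal equality of call expressions is not enough without a clobber check:

```cpp
#include <cstddef>
#include <cstring>

std::size_t demo(char *s, bool flip) {
  std::size_t a = std::strlen(s);   // first call, gets a value number
  if (flip)
    std::memset(s, 'x', 1);         // clobbers the memory strlen reads
  std::size_t b = std::strlen(s);   // textually identical call, but its
                                    //   result may differ because of the
                                    //   clobber, so it must not simply
                                    //   reuse the first call's value number
  return a + b;
}
```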
* Fix SEH_NoReturn machine verifier errorReid Kleckner2019-08-301-1/+1
| | | | llvm-svn: 370543
* [MC] Avoid crashes from improperly nested or wrong target .seh_handlerdata ↵Reid Kleckner2019-08-303-3/+10
| | | | | | directives llvm-svn: 370540
* [X86] Print register names in .seh_* directivesReid Kleckner2019-08-305-189/+210
| | | | | | | | | | Also improve assembler parser register validation for .seh_ directives. This requires moving X86-specific seh directive handling into the x86 backend, which addresses some assembler FIXMEs. Differential Revision: https://reviews.llvm.org/D66625 llvm-svn: 370533
* [Windows] Disable TrapUnreachable for Win64, add SEH_NoReturnReid Kleckner2019-08-307-10/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Users have complained llvm.trap produce two ud2 instructions on Win64, one for the trap, and one for unreachable. This change fixes that. TrapUnreachable was added and enabled for Win64 in r206684 (April 2014) to avoid poorly understood issues with the Windows unwinder. There seem to be two major things in play: - the unwinder - C++ EH, _CxxFrameHandler3 & co The unwinder disassembles forward from the return address to scan for epilogues. Inserting a ud2 had the effect of stopping the unwinder, and ensuring that it ran the EH personality function for the current frame. However, it's not clear what the unwinder does when the return address happens to be the last address of one function and the first address of the next function. The Visual C++ EH personality, _CxxFrameHandler3, needs to figure out what the current EH state number is. It does this by consulting the ip2state table, which maps from PC to state number. This seems to go wrong when the return address is the last PC of the function or catch funclet. I'm not sure precisely which system is involved here, but in order to address these real or hypothetical problems, I believe it is enough to insert int3 after a call site if it would otherwise be the last instruction in a function or funclet. I was able to reproduce some similar problems locally by arranging for a noreturn call to appear at the end of a catch block immediately before an unrelated function, and I confirmed that the problems go away when an extra trailing int3 instruction is added. MSVC inserts int3 after every noreturn function call, but I believe it's only necessary to do it if the call would be the last instruction. This change inserts a pseudo instruction that expands to int3 if it is in the last basic block of a function or funclet. I did what I could to run the Microsoft compiler EH tests, and the ones I was able to run showed no behavior difference before or after this change. Differential Revision: https://reviews.llvm.org/D66980 llvm-svn: 370525
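A hedged, hypothetical example of the pattern discussed above -- a noreturn call that ends a catch funclet; the helper name and structure are illustrative only:

```cpp
#include <cstdlib>

void might_throw() { throw 1; }   // placeholder to make the TU self-contained

void f() {
  try {
    might_throw();
  } catch (...) {
    std::abort();   // noreturn call; without a trailing int3 this would be
                    //   the last instruction of the catch funclet, and the
                    //   return address could fall into the next function
  }
}
```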
* [MachinePipeliner] Separate schedule emission, NFCJames Molloy2019-08-303-1125/+1236
| | | | | | | | | | | | | | | | This is the first stage in refactoring the pipeliner and making it more accessible for backends to override and control. This separates the logic and state required to *emit* a schedule from the logic that *computes* and validates a schedule. This will enable (a) new schedule emitters and (b) new modulo scheduling implementations to coexist. NFC. Differential Revision: https://reviews.llvm.org/D67006 llvm-svn: 370500
* [DAGCombine] ReduceLoadWidth - remove duplicate SDLoc. NFCI.Simon Pilgrim2019-08-301-3/+2
| | | | | | SDLoc(N0) and SDLoc(cast<LoadSDNode>(N0)) should be equivalent. llvm-svn: 370498
* [TargetLowering] SimplifyDemandedBits ADD/SUB/MUL - correctly inherit ↵Simon Pilgrim2019-08-301-4/+2
| | | | | | | | | | SDNodeFlags from the original node. Just disable NSW/NUW flags. This matches what we're already doing for the other situations for these nodes, it was just missed for the demanded constant case. Noticed by inspection - confirmed in offline discussion with @spatel. I've checked we have test coverage in the x86 extract-bits.ll and extract-lowbits.ll tests llvm-svn: 370497
* GlobalISel: Fix missing pass dependencyMatt Arsenault2019-08-301-0/+1
| | | | llvm-svn: 370496
* [X86] Pass v32i16/v64i8 in zmm registers on KNL target.Craig Topper2019-08-301-0/+15
| | | | | | | | | | | | | | | gcc and icc pass these types in zmm registers. This patch implements a quick hack to override the register type to a legal one before calling convention handling. Longer term we might want to do something similar to what we do for 256-bit integer registers on AVX1, where we just split all the operations. Fixes PR42957 Differential Revision: https://reviews.llvm.org/D66708 llvm-svn: 370495
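A hypothetical illustration using GCC/Clang vector extensions (the typedef names are invented here) of the 512-bit integer vector argument types whose calling convention this change affects on KNL:

```cpp
typedef short v32i16 __attribute__((vector_size(64)));  // 32 x i16 = 512 bits
typedef char  v64i8  __attribute__((vector_size(64)));  // 64 x i8  = 512 bits

// With this change, such by-value arguments travel in zmm registers on KNL
// (AVX-512F without AVX-512BW), matching gcc and icc.
v32i16 add16(v32i16 a, v32i16 b) { return a + b; }
v64i8  add8 (v64i8  a, v64i8  b) { return a + b; }
```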
* [ValueTypes] Add v16f16 and v32f16 to EVT::getEVTString and Tablegen's ↵Craig Topper2019-08-301-0/+2
| | | | | | | | getEnumName Missed these when I added the enum entries. llvm-svn: 370494
* MemTag: unchecked load/store optimization.Evgeniy Stepanov2019-08-306-1/+247
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: MTE allows memory access to bypass tag check iff the address argument is [SP, #imm]. This change takes advantage of this to demote uses of tagged addresses to regular FrameIndex operands, reducing register pressure in large functions. MO_TAGGED target flag is used to signal that the FrameIndex operand refers to memory that might be tagged, and needs to be handled with care. Such operand must be lowered to [SP, #imm] directly, without a scratch register. The transformation pass attempts to predict when the offset will be out of range and disable the optimization. AArch64RegisterInfo::eliminateFrameIndex has an escape hatch in case this prediction has been wrong, but it is quite inefficient and should be avoided. Reviewers: pcc, vitalybuka, ostannard Subscribers: mgorny, javed.absar, kristof.beyls, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66457 llvm-svn: 370490
* [DAGCombine] visitVSELECT - remove equivalent getValueType() call. NFCI.Simon Pilgrim2019-08-301-1/+0
| | | | llvm-svn: 370489
* [Attributor] Fix: do not pretend to preserve the CFGJohannes Doerfert2019-08-301-1/+0
| | | | llvm-svn: 370485
* [X86] Merge X86InstrInfo::loadRegFromAddr/storeRegToAddr into their only ↵Craig Topper2019-08-302-50/+21
| | | | | | | | | | | call site. I'm looking at unfolding broadcast loads on AVX512 which will require refactoring this code to select broadcast opcodes instead of regular load/stores in some cases. Merging them to avoid further complicating their interfaces. llvm-svn: 370484
* [Attributor] Use existing function information for the call siteJohannes Doerfert2019-08-301-16/+259
| | | | | | | | | | | | | | | | | | | | | | | Summary: Instead of recomputing information for call sites we now use the function information directly. This is always valid and once we have call site specific information we can improve here. This patch also bootstraps attributes that are created on-demand through an initial update call. Information that is known will then directly be available in the new attribute without causing an iteration delay. The tests show how this improves the iteration count. Reviewers: sstefan1, uenoku Subscribers: hiraditya, bollu, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66781 llvm-svn: 370480
* [Attributor] Manifest load/store alignment generallyJohannes Doerfert2019-08-301-25/+25
| | | | | | | | | | | | | | | Summary: Any pointer could have load/store users, not only floating ones, so we move the manifest logic for alignment into the AAAlignImpl class. Reviewers: uenoku, sstefan1 Subscribers: hiraditya, bollu, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66922 llvm-svn: 370479
* [DAGCombine] visitVSELECT - remove duplicate getOperand calls. NFCI.Simon Pilgrim2019-08-301-4/+3
| | | | llvm-svn: 370478
* [InstCombine][AMDGPU] Simplify tbuffer loadsPiotr Sobczak2019-08-301-0/+3
| | | | | | | | | | | | | | | | Summary: Add missing tbuffer loads intrinsics in SimplifyDemandedVectorElts. Reviewers: arsenm, nhaehnle Reviewed By: arsenm Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66926 llvm-svn: 370475
* [yaml2obj][obj2yaml] - Use a single "Other" field instead of "Other", ↵George Rimar2019-08-303-67/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "Visibility" and "StOther". Currently we can encode the 'st_other' field of a symbol using 3 fields. 'Visibility' is used to encode STV_* values. 'Other' is used to encode everything except the visibility, but it can't handle arbitrary values. 'StOther' is used to encode arbitrary values when 'Visibility'/'Other' are not helpful enough. The 'st_other' field is used to encode symbol visibility and platform-dependent flags and values. The problem with encoding it is that it consists of a Visibility part (STV_* values), which are enumeration values, and the Other part, which is different and inconsistent. For MIPS the Other part contains flags for all STO_MIPS_* values except STO_MIPS_MIPS16. (As the comment in ELFDumper says: "Someones in their infinite wisdom decided to make STO_MIPS_MIPS16 flag overlapped with other ST_MIPS_xxx flags."...) And for PPC64 the Other part might actually encode any value. This patch implements custom logic for handling st_other and removes the 'Visibility' and 'StOther' fields. Here is an example of the new YAML style this patch allows:
  - Name: foo
    Other: [ 0x4 ]
  - Name: bar
    Other: [ STV_PROTECTED, 4 ]
  - Name: zed
    Other: [ STV_PROTECTED, STO_MIPS_OPTIONAL, 0xf8 ]
Differential revision: https://reviews.llvm.org/D66886 llvm-svn: 370472
* [DAGCombine] visitVSELECT - use getShiftAmountTy for shift amounts.Simon Pilgrim2019-08-301-3/+3
| | | | llvm-svn: 370471
* [DAGCombine] visitMULHS - use getScalarValueSizeInBits() to make safe for ↵Simon Pilgrim2019-08-301-1/+1
| | | | | | | | vector types. This is hidden behind a (scalar-only) isOneConstant(N1) check at the moment, but once we get around to adding vector support we need to ensure we're dealing with the scalar bitwidth, not the total. llvm-svn: 370468
* Remove an extra ";", NFC.Haojian Wu2019-08-301-1/+1
| | | | llvm-svn: 370465
* [CodeGen] Introduce MachineBasicBlock::replacePhiUsesWith helper and use it. NFCBjorn Pettersson2019-08-302-24/+16
| | | | | | | | | | | | | | | | | | | | | Summary: Found a couple of places in the code where all the PHI nodes of an MBB are updated, replacing references to one MBB with references to another MBB. This patch simply refactors the code to use a common helper (MachineBasicBlock::replacePhiUsesWith) for such PHI node updates. Reviewers: t.p.northover, arsenm, uabelho Subscribers: wdng, hiraditya, jsji, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66750 llvm-svn: 370463
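A rough sketch, under the assumption that the helper simply rewrites the basic-block operands of PHI instructions; the function name with the Sketch suffix is invented, and the real method lives on MachineBasicBlock itself:

```cpp
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

using namespace llvm;

// Retarget every PHI operand in MBB that names Old so it names New instead.
static void replacePhiUsesWithSketch(MachineBasicBlock &MBB,
                                     MachineBasicBlock *Old,
                                     MachineBasicBlock *New) {
  for (MachineInstr &Phi : MBB.phis())        // PHIs sit at the block start
    for (MachineOperand &MO : Phi.operands())
      if (MO.isMBB() && MO.getMBB() == Old)   // incoming-block operand
        MO.setMBB(New);                       // point it at the new block
}
```

Centralizing this loop is the whole point of the patch: the two call sites that previously open-coded it now share one helper.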
* [DAGCombine] visitMULHS/visitMULHU - isBuildVectorAllZeros doesn't mean node ↵Simon Pilgrim2019-08-301-8/+8
| | | | | | | | | | is all zeros Return a proper zero vector, just in case some elements are undef. Noticed by inspection after dealing with a similar issue in PR43159. llvm-svn: 370460
* [Attributor] Implement AANoAliasCallSiteArgument initializationHideto Ueno2019-08-301-2/+4
| | | | | | | | | | | | | | | | Summary: This patch adds an appropriate `initialize` method for `AANoAliasCallSiteArgument`. Reviewers: jdoerfert, sstefan1 Reviewed By: jdoerfert Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66927 llvm-svn: 370456
* [LoopIdiomRecognize] BCmp loop idiom recognitionRoman Lebedev2019-08-301-8/+869
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: @mclow.lists brought up this issue up in IRC. It is a reasonably common problem to compare some two values for equality. Those may be just some integers, strings or arrays of integers. In C, there is `memcmp()`, `bcmp()` functions. In C++, there exists `std::equal()` algorithm. One can also write that function manually. libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ libc++ does not do anything like that, it simply relies on simple C++'s `operator==()`. https://godbolt.org/z/er0Zwf (GOOD!) So likely, there exists a certain performance opportunities. Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that is using `memcmp()` (in this case, compiled with modified compiler). {F8768213} ``` #include <algorithm> #include <cmath> #include <cstdint> #include <iterator> #include <limits> #include <random> #include <type_traits> #include <utility> #include <vector> #include "benchmark/benchmark.h" template <class T> bool equal(T* a, T* a_end, T* b) noexcept { for (; a != a_end; ++a, ++b) { if (*a != *b) return false; } return true; } template <typename T> std::vector<T> getVectorOfRandomNumbers(size_t count) { std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(), std::numeric_limits<T>::max()); std::vector<T> v; v.reserve(count); std::generate_n(std::back_inserter(v), count, [&dis, &gen]() { return dis(gen); }); assert(v.size() == count); return v; } struct Identical { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { auto Tmp = getVectorOfRandomNumbers<T>(count); return std::make_pair(Tmp, std::move(Tmp)); } }; struct InequalHalfway { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { auto V0 = getVectorOfRandomNumbers<T>(count); auto V1 = V0; V1[V1.size() / size_t(2)]++; // just change the value. 
return std::make_pair(std::move(V0), std::move(V1)); } }; template <class T, class Gen> void BM_bcmp(benchmark::State& state) { const size_t Length = state.range(0); const std::pair<std::vector<T>, std::vector<T>> Data = Gen::template Gen<T>(Length); const std::vector<T>& a = Data.first; const std::vector<T>& b = Data.second; assert(a.size() == Length && b.size() == a.size()); benchmark::ClobberMemory(); benchmark::DoNotOptimize(a); benchmark::DoNotOptimize(a.data()); benchmark::DoNotOptimize(b); benchmark::DoNotOptimize(b.data()); for (auto _ : state) { const bool is_equal = equal(a.data(), a.data() + a.size(), b.data()); benchmark::DoNotOptimize(is_equal); } state.SetComplexityN(Length); state.counters["eltcnt"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant); state.counters["eltcnt/sec"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate); const size_t BytesRead = 2 * sizeof(T) * Length; state.counters["bytes_read/iteration"] = benchmark::Counter(BytesRead, benchmark::Counter::kDefaults, benchmark::Counter::OneK::kIs1024); state.counters["bytes_read/sec"] = benchmark::Counter( BytesRead, benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024); } template <typename T> static void CustomArguments(benchmark::internal::Benchmark* b) { const size_t L2SizeBytes = []() { for (const benchmark::CPUInfo::CacheInfo& I : benchmark::CPUInfo::Get().caches) { if (I.level == 2) return I.size; } return 0; }(); // What is the largest range we can check to always fit within given L2 cache? const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 / /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2; b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN); } BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical) ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical) ->Apply(CustomArguments<uint64_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway) ->Apply(CustomArguments<uint8_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway) ->Apply(CustomArguments<uint64_t>); ``` {F8768210} ``` $ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx 2019-04-25 21:17:11 Running build-old/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 0.65, 3.90, 4.14 --------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... 
--------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 432131 ns 432101 ns 1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s BM_bcmp<uint8_t, Identical>_BigO 0.86 N 0.86 N BM_bcmp<uint8_t, Identical>_RMS 8 % 8 % <...> BM_bcmp<uint16_t, Identical>/256000 161408 ns 161409 ns 4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s BM_bcmp<uint16_t, Identical>_BigO 0.67 N 0.67 N BM_bcmp<uint16_t, Identical>_RMS 25 % 25 % <...> BM_bcmp<uint32_t, Identical>/128000 81497 ns 81488 ns 8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s BM_bcmp<uint32_t, Identical>_BigO 0.71 N 0.71 N BM_bcmp<uint32_t, Identical>_RMS 42 % 42 % <...> BM_bcmp<uint64_t, Identical>/64000 50138 ns 50138 ns 10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s BM_bcmp<uint64_t, Identical>_BigO 0.84 N 0.84 N BM_bcmp<uint64_t, Identical>_RMS 27 % 27 % <...> BM_bcmp<uint8_t, InequalHalfway>/512000 192405 ns 192392 ns 3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s BM_bcmp<uint8_t, InequalHalfway>_BigO 0.38 N 0.38 N BM_bcmp<uint8_t, InequalHalfway>_RMS 3 % 3 % <...> BM_bcmp<uint16_t, InequalHalfway>/256000 127858 ns 127860 ns 5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s BM_bcmp<uint16_t, InequalHalfway>_BigO 0.50 N 0.50 N BM_bcmp<uint16_t, InequalHalfway>_RMS 0 % 0 % <...> BM_bcmp<uint32_t, InequalHalfway>/128000 49140 ns 49140 ns 14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s BM_bcmp<uint32_t, InequalHalfway>_BigO 0.40 N 0.40 N BM_bcmp<uint32_t, InequalHalfway>_RMS 18 % 18 % <...> BM_bcmp<uint64_t, InequalHalfway>/64000 32101 ns 32099 ns 21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s BM_bcmp<uint64_t, InequalHalfway>_BigO 0.50 N 0.50 N BM_bcmp<uint64_t, InequalHalfway>_RMS 1 % 1 % RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0 2019-04-25 21:19:29 Running build-new/test/llvm-bcmp-bench Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 1.01, 2.85, 3.71 --------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... 
--------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 18593 ns 18590 ns 37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s BM_bcmp<uint8_t, Identical>_BigO 0.04 N 0.04 N BM_bcmp<uint8_t, Identical>_RMS 37 % 37 % <...> BM_bcmp<uint16_t, Identical>/256000 18950 ns 18948 ns 37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s BM_bcmp<uint16_t, Identical>_BigO 0.08 N 0.08 N BM_bcmp<uint16_t, Identical>_RMS 34 % 34 % <...> BM_bcmp<uint32_t, Identical>/128000 18627 ns 18627 ns 37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s BM_bcmp<uint32_t, Identical>_BigO 0.16 N 0.16 N BM_bcmp<uint32_t, Identical>_RMS 35 % 35 % <...> BM_bcmp<uint64_t, Identical>/64000 18855 ns 18855 ns 37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s BM_bcmp<uint64_t, Identical>_BigO 0.32 N 0.32 N BM_bcmp<uint64_t, Identical>_RMS 33 % 33 % <...> BM_bcmp<uint8_t, InequalHalfway>/512000 9570 ns 9569 ns 73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s BM_bcmp<uint8_t, InequalHalfway>_BigO 0.02 N 0.02 N BM_bcmp<uint8_t, InequalHalfway>_RMS 29 % 29 % <...> BM_bcmp<uint16_t, InequalHalfway>/256000 9547 ns 9547 ns 74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s BM_bcmp<uint16_t, InequalHalfway>_BigO 0.04 N 0.04 N BM_bcmp<uint16_t, InequalHalfway>_RMS 29 % 29 % <...> BM_bcmp<uint32_t, InequalHalfway>/128000 9396 ns 9394 ns 73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s BM_bcmp<uint32_t, InequalHalfway>_BigO 0.08 N 0.08 N BM_bcmp<uint32_t, InequalHalfway>_RMS 30 % 30 % <...> BM_bcmp<uint64_t, InequalHalfway>/64000 9499 ns 9498 ns 73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s BM_bcmp<uint64_t, InequalHalfway>_BigO 0.16 N 0.16 N BM_bcmp<uint64_t, InequalHalfway>_RMS 28 % 28 % Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench Benchmark Time CPU Time Old Time New CPU Old CPU New --------------------------------------------------------------------------------------------------------------------------------------- <...> BM_bcmp<uint8_t, Identical>/512000 -0.9570 -0.9570 432131 18593 432101 18590 <...> BM_bcmp<uint16_t, Identical>/256000 -0.8826 -0.8826 161408 18950 161409 18948 <...> BM_bcmp<uint32_t, Identical>/128000 -0.7714 -0.7714 81497 18627 81488 18627 <...> BM_bcmp<uint64_t, Identical>/64000 -0.6239 -0.6239 50138 18855 50138 18855 <...> BM_bcmp<uint8_t, InequalHalfway>/512000 -0.9503 -0.9503 192405 9570 192392 9569 <...> BM_bcmp<uint16_t, InequalHalfway>/256000 -0.9253 -0.9253 127858 9547 127860 9547 <...> BM_bcmp<uint32_t, InequalHalfway>/128000 -0.8088 -0.8088 49140 9396 49140 9394 <...> BM_bcmp<uint64_t, InequalHalfway>/64000 -0.7041 -0.7041 32101 9499 32099 9498 ``` What can we tell from the benchmark? * Performance of naive equality check somewhat improves with element size, maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s for uint64_t. I think, that instability implies performance problems. * Performance of `memcmp()`-aware benchmark always maxes out at around bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the naive variant! 
* eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and linearly decreases with element size. For uint64_t, it's ~4x+ the elements/second. * The call obvious is more pricey than the loop, with small element count. As it can be seen from the full output {F8768210}, the `memcmp()` is almost universally worse, independent of the element size (and thus buffer size) when element count is less than 8. So all in all, bcmp idiom does indeed pose untapped performance headroom. This diff does implement said idiom recognition. I think a reasonable test coverage is present, but do tell if there is anything obvious missing. Now, quality. This does succeed to build and pass the test-suite, at least without any non-bundled elements. {F8768216} {F8768217} This transform fires 91 times: ``` $ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json Tests: 1149 Metric: loop-idiom.NumBCmp Program result-new MultiSourc...Benchmarks/7zip/7zip-benchmark 79.00 MultiSource/Applications/d/make_dparser 3.00 SingleSource/UnitTests/vla 2.00 MultiSource/Applications/Burg/burg 1.00 MultiSourc.../Applications/JM/lencod/lencod 1.00 MultiSource/Applications/lemon/lemon 1.00 MultiSource/Benchmarks/Bullet/bullet 1.00 MultiSourc...e/Benchmarks/MallocBench/gs/gs 1.00 MultiSourc...gs-C/TimberWolfMC/timberwolfmc 1.00 MultiSourc...Prolangs-C/simulator/simulator 1.00 ``` The size changes are: I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look. ``` $ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash Tests: 1149 Same hash: 907 (filtered out) Remaining: 242 Metric: size..text Program result-old result-new diff test-suite...ingleSource/UnitTests/vla.test 753.00 833.00 10.6% test-suite...marks/7zip/7zip-benchmark.test 1001697.00 966657.00 -3.5% test-suite...ngs-C/simulator/simulator.test 32369.00 32321.00 -0.1% test-suite...plications/d/make_dparser.test 89585.00 89505.00 -0.1% test-suite...ce/Applications/Burg/burg.test 40817.00 40785.00 -0.1% test-suite.../Applications/lemon/lemon.test 47281.00 47249.00 -0.1% test-suite...TimberWolfMC/timberwolfmc.test 250065.00 250113.00 0.0% test-suite...chmarks/MallocBench/gs/gs.test 149889.00 149873.00 -0.0% test-suite...ications/JM/lencod/lencod.test 769585.00 769569.00 -0.0% test-suite.../Benchmarks/Bullet/bullet.test 770049.00 770049.00 0.0% test-suite...HMARK_ANISTROPIC_DIFFUSION/128 NaN NaN nan% test-suite...HMARK_ANISTROPIC_DIFFUSION/256 NaN NaN nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/64 NaN NaN nan% test-suite...CHMARK_ANISTROPIC_DIFFUSION/32 NaN NaN nan% test-suite...ENCHMARK_BILATERAL_FILTER/64/4 NaN NaN nan% Geomean difference nan% result-old result-new diff count 1.000000e+01 10.00000 10.000000 mean 3.152090e+05 311695.40000 0.006749 std 3.790398e+05 372091.42232 0.036605 min 7.530000e+02 833.00000 -0.034981 25% 4.243300e+04 42401.00000 -0.000866 50% 1.197370e+05 119689.00000 -0.000392 75% 6.397050e+05 639705.00000 -0.000005 max 1.001697e+06 966657.00000 0.106242 ``` I don't have timings though. And now to the code. The basic idea is to completely replace the whole loop. If we can't fully kill it, don't transform. I have left one or two comments in the code, so hopefully it can be understood. 
Also, there are a few TODOs that I have left for follow-ups:
* widening of `memcmp()`/`bcmp()`
* step smaller than the comparison size
* Metadata propagation
* more than two blocks as long as there is still a single backedge?
* ???
Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet Reviewed By: courbet Subscribers: hiraditya, xbolva00, nikic, jfb, gchatelet, courbet, llvm-commits, mclow.lists Tags: #llvm Differential Revision: https://reviews.llvm.org/D61144 llvm-svn: 370454
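A minimal sketch of the kind of loop this recognition targets (simplified relative to what the pass actually matches):

```cpp
// A byte-wise equality loop of roughly this shape can be collapsed into a
// single bcmp()/memcmp()-style comparison of n bytes.
bool equal_bytes(const unsigned char *a, const unsigned char *b, unsigned n) {
  for (unsigned i = 0; i != n; ++i)
    if (a[i] != b[i])
      return false;   // early exit on the first mismatch
  return true;        // all n bytes compared equal
}
```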
* [LiveDebugValues] Insert entry values after bundlesDavid Stenberg2019-08-301-2/+1
| | | | | | | | | | | | | | | | | | | | Summary: Change LiveDebugValues so that it inserts entry values after the bundle which contains the clobbering instruction. Previously it would insert the debug value after the bundle head using insertAfter(), breaking the bundle. Reviewers: djtodoro, NikolaPrica, aprantl, vsk Reviewed By: vsk Subscribers: hiraditya, llvm-commits Tags: #debug-info, #llvm Differential Revision: https://reviews.llvm.org/D66888 llvm-svn: 370448
* [LLD] [COFF] Support merging resource object filesMartin Storsjo2019-08-301-0/+124
| | | | | | | | | | | | | | | | | | | | | | | | Extend WindowsResourceParser to support using a ResourceSectionRef for loading resources from an object file. Only allow merging resource object files in mingw mode; keep the existing error on multiple resource objects in link mode. If there is only one resource object file and no .res resources, don't parse and recreate the .rsrc section, but just link it in without inspecting it. This allows users to produce any .rsrc section (outside of what the parser supports), just like before. (I don't have a specific need for this, but it reduces the risk of this new feature.) Separate out the .rsrc section chunks in InputFiles.cpp, and only include them in the list of section chunks to link if we've determined that there was only a single resource object. (We need to keep other chunks from those object files, as they can legitimately contain other sections as well, in addition to .rsrc section chunks.) Differential Revision: https://reviews.llvm.org/D66824 llvm-svn: 370436
* [WindowsResource] Remove use of global variables in WindowsResourceParserMartin Storsjo2019-08-301-58/+45
| | | | | | | | | | | | | | | Instead of updating a global variable counter for the next index of strings and data blobs, pass along a reference to the actual data/string vectors and let the TreeNode insertion methods add their data/strings to the vectors when a new entry is needed. Additionally, if the resource tree had duplicates that were ignored with -force:multipleres in lld, we no longer store all versions of the duplicated resource data; now we only keep the one that actually ends up referenced. Differential Revision: https://reviews.llvm.org/D66823 llvm-svn: 370435
* [WindowsResource] Avoid duplicating the input filenames for each resource. NFC.Martin Storsjo2019-08-301-4/+5
| | | | | | Differential Revision: https://reviews.llvm.org/D66821 llvm-svn: 370434
* [COFF] Add a ResourceSectionRef method for getting resource contentsMartin Storsjo2019-08-301-0/+117
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows llvm-readobj to print the contents of each resource when printing resources from an object file or executable, like it already does for plain .res files. This requires providing the whole COFFObjectFile to ResourceSectionRef. This supports both object files and executables. For executables, the DataRVA field is used as is to look up the right section. For object files, ideally we would need to complete linking of them and fix up all relocations to know what the DataRVA field would end up being. In practice, the only thing that makes sense for an RVA field is an ADDR32NB relocation. Thus, find a relocation pointing at this field, verify that it has the expected type, locate the symbol it points at, look up the section the symbol points at, and read from the right offset in that section. This works both for GNU windres object files (which use one single .rsrc section, with all relocations against the base of the .rsrc section, with the original value of the DataRVA field being the offset of the data from the beginning of the .rsrc section) and cvtres object files (with two separate .rsrc$01 and .rsrc$02 sections, and one symbol per data entry, with the original pre-relocated DataRVA field being set to zero). Differential Revision: https://reviews.llvm.org/D66820 llvm-svn: 370433
* [MIPS GlobalISel] Lower uitofpPetar Avramovic2019-08-301-1/+48
| | | | | | | | Add custom lowering for G_UITOFP for MIPS32. Differential Revision: https://reviews.llvm.org/D66930 llvm-svn: 370432
* [MIPS GlobalISel] Lower fptouiPetar Avramovic2019-08-302-0/+45
| | | | | | | | | | Add lowering for G_FPTOUI. The algorithm is similar to the SDAG version in TargetLowering::expandFP_TO_UINT. Lower G_FPTOUI for MIPS32. Differential Revision: https://reviews.llvm.org/D66929 llvm-svn: 370431
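A hedged C++ sketch of the classic expansion that expandFP_TO_UINT-style lowering is based on, assuming the target only provides a signed fp-to-int conversion (function name invented):

```cpp
#include <cstdint>

// Values below 2^31 convert directly; larger values are rebased by 2^31,
// converted in the signed range, and the removed 2^31 is added back on the
// integer side.
uint32_t fptoui_sketch(float x) {
  const float Cut = 2147483648.0f;          // 2^31, exactly representable
  if (x < Cut)
    return (uint32_t)(int32_t)x;            // fits signed range: plain fptosi
  int32_t rebased = (int32_t)(x - Cut);     // shift into the signed range
  return (uint32_t)rebased + 0x80000000u;   // re-add the removed 2^31
}
```

As with the IR operation itself, negative or NaN inputs are out of range and left undefined in this sketch.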
* [CodeGen] Fix lowering for returning the result of an extractvalueDan Gohman2019-08-301-1/+1
| | | | | | | | | | | | | | When the number of return values exceeds the number of registers available, SelectionDAGBuilder::visitRet transforms a function's return to use a pointer to a buffer to hold return values. When the returned value is an operator such as extractvalue, the value may have a non-zero result number. Add that number to the indexing when obtaining the values to store. This fixes https://bugs.llvm.org/show_bug.cgi?id=43132. Differential Revision: https://reviews.llvm.org/D66978 llvm-svn: 370430
* [PowerPC][NFC] Use inline Subtarget->isPPC64()Jinsong Ji2019-08-301-7/+6
| | | | | | To be consistent with all the other instances. llvm-svn: 370428
* [PPC32] Emit R_PPC_GOT_TPREL16 instead R_PPC_GOT_TPREL16_LOFangrui Song2019-08-301-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike ppc64, which has ADDISgotTprelHA+LDgotTprelL pairs, ppc32 just uses LDgotTprelL32, so it does not make lots of sense to use _LO without a paired _HA. Emit R_PPC_GOT_TPREL16 instead R_PPC_GOT_TPREL16_LO to match GCC, and get better linker relocation check. Note, R_PPC_GOT_TPREL16_{HA,LO} don't have good linker support: (a) lld does not support R_PPC_GOT_TPREL16_{HA,LO}. (b) Top of tree ld.bfd does not support R_PPC_GOT_REL16_HA Initial-Exec -> Local-Exec relaxation: // a.o addis 3, 3, tsd_tls@got@tprel@ha lwz 3, tsd_tls@got@tprel@l(3) add 3, 3, tsd_tls@tls // b.o .section .tdata,"awT"; .globl tsd_tls; tsd_tls: // ld/ld-new a.o b.o internal error, aborting at ../../bfd/elf32-ppc.c:7952 in ppc_elf_relocate_section Reviewed By: adalava Differential Revision: https://reviews.llvm.org/D66925 llvm-svn: 370426
* [X86] Explicitly list all the always trivially rematerializable instructions.Craig Topper2019-08-301-5/+40
| | | | | | | | Add a default with an llvm_unreachable for anything we don't expect. This seems safer than just blindly returning true for anything missing from the switch. llvm-svn: 370424
* [WebAssembly] Make __attribute__((used)) not imply export.Dan Gohman2019-08-297-15/+25
| | | | | | | | | | Add a WASM_SYMBOL_NO_STRIP flag, so that __attribute__((used)) doesn't need to imply exporting. When targeting Emscripten, have WASM_SYMBOL_NO_STRIP imply exporting. Differential Revision: https://reviews.llvm.org/D62542 llvm-svn: 370415
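An illustrative (hypothetical) use of the attribute whose semantics change here:

```cpp
// With this change, used keeps the symbol alive through stripping/GC
// (WASM_SYMBOL_NO_STRIP) but no longer forces it into the Wasm export
// list, except when targeting Emscripten.
__attribute__((used)) static int build_id = 42;   // retained, not exported

int get_build_id() { return build_id; }
```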
* [PowerPC] Support extended mnemonics mffprwz etc.Jinsong Ji2019-08-292-2/+36
| | | | | | | | | | | | | | | | | | | | | Summary: Reported in https://github.com/opencv/opencv/issues/15413. We have several extended mnemonics for Move To/From Vector-Scalar Register instructions, e.g. mffprd, mtfprd, etc. We only support one of them; this patch adds the others. Reviewers: nemanjai, steven.zhang, hfinkel, #powerpc Reviewed By: hfinkel Subscribers: wuzish, qcolombet, hiraditya, kbarton, MaskRay, shchenz, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66963 llvm-svn: 370411
* [AArch64][GlobalISel] Select arithmetic extended register patternsJessica Paquette2019-08-293-36/+224
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This teaches GISel to select patterns which fold an extend plus optional shift into the addressing mode. In particular, adds and subs. Factor out the arith extended register ComplexPatterns in AArch64InstrFormats.td and create GISel equivalents. Add some equivalent functions to the ones in AArch64ISelDAGToDAG: - `selectArithExtendedRegister` - `narrowExtendRegIfNeeded` - `getExtendTypeForInst` `getExtendTypeForInst` includes the checks for loads and stores. This will be used for WRO addressing modes in loads + stores. Teach selectCopy to properly handle subregister copies on the same bank in order to support `narrowExtendRegIfNeeded`. The extended register must be a GPR32, so we need to support same-bank subregister copies. Fix a bug in getSubRegForClass which would cause registers on things like GPR32common to end up getting ssub. Just change the check to look for FPR32 rather than GPR32. For tests: - Add select-arith-extended-reg.mir - Update addsub_ext.ll to include GlobalISel checks Differential Revision: https://reviews.llvm.org/D66835 llvm-svn: 370410
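A hypothetical C++ example of source that produces the extended-register arithmetic patterns being selected; the instructions named in the comments are assumptions based on typical AArch64 codegen, not output quoted from the patch:

```cpp
// The sign-extend of the 32-bit index and the *4 scaling can fold into a
// single extended-register add, roughly: add x0, x0, w1, sxtw #2
int *advance(int *base, int index) {
  return base + index;            // (int64)index << 2 added to the pointer
}

// Zero-extend of a 32-bit operand folded the same way (uxtw form):
unsigned long long add_zext(unsigned long long a, unsigned int b) {
  return a + b;
}
```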
* [X86] Don't emit unreachable stack adjustmentsReid Kleckner2019-08-291-2/+14
| | | | | | | | | | | | | | | | Summary: This is a minor improvement on our past attempts to do this. Fixes PR43155. Reviewers: hans Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66905 llvm-svn: 370409
* Allow '@' to appear in x86 mingw symbolsReid Kleckner2019-08-291-0/+2
| | | | | | | | | | | | | | | | | Summary: There is no reason to differ in assembler behavior here between -msvc and -gnu targets. Without this setting, the text after the '@' is interpreted as a symbol variable, like foo@IMGREL. Reviewers: mstorsjo Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D66974 llvm-svn: 370408
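For context, a hypothetical example of why '@' legitimately appears in i386 MinGW symbol names (stdcall/fastcall decoration); the function names are invented, and the decorated forms in the comments assume an i386-w64-mingw32 target:

```cpp
// stdcall appends '@<bytes-of-args>' and fastcall additionally prepends '@',
// so '@' must be treated as an ordinary symbol character rather than the
// start of an assembler variant suffix like foo@IMGREL.
extern "C" int __attribute__((stdcall)) std_fn(int a, int b) {
  return a + b;   // expected to assemble as _std_fn@8
}

extern "C" int __attribute__((fastcall)) fast_fn(int a, int b) {
  return a - b;   // expected to assemble as @fast_fn@8
}
```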
* Fix the build for MSVC builds using M_PIReid Kleckner2019-08-291-0/+7
| | | | llvm-svn: 370405
* [X86][SSE] combinePMULDQ - pmuldq(x, 0) -> zero vector (PR43159)Simon Pilgrim2019-08-291-3/+5
| | | | | | ISD::isBuildVectorAllZeros permits undef elements to be present, which means we can't return it as a zero vector. PMULDQ/PMULUDQ is an extending multiply so a multiply by zero of the lower 32-bits should result in a zero 64-bit element. llvm-svn: 370404
* AMDGPU/GlobalISel: Legalize sin/cosMatt Arsenault2019-08-292-0/+43
| | | | llvm-svn: 370402
* [InstCombine] reduce duplicated code; NFCSanjay Patel2019-08-291-10/+13
| | | | llvm-svn: 370399
* Revert [MBP] Disable aggressive loop rotate in plain modeJordan Rupprecht2019-08-291-80/+36
| | | | | | | | This reverts r369664 (git commit 51f48295cbe8fa3a44db263b528dd9f7bae7bf9a) It causes many benchmark regressions, internally and in llvm's benchmark suite. llvm-svn: 370398
* Revert enabling MemorySSA.Alina Sbirlea2019-08-292-5/+1
| | | | | | | | Breaks sanitizers bots. Differential Revision: https://reviews.llvm.org/D58311 llvm-svn: 370397