bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message (Collapse)	Author	Age	Files	Lines
*	[AMDGPU] gfx1010 GCNRegBankReassign pass	Stanislav Mekhanoshin	2019-05-01	1	-0/+2
\| \| \| \| \| \| \| \|	Reassign registers to reduce register bank conflicts. Differential Revision: https://reviews.llvm.org/D61344 llvm-svn: 359704
*	[AMDGPU] gfx1010 GCNNSAReassign pass	Stanislav Mekhanoshin	2019-05-01	1	-0/+15
\| \| \| \| \| \| \| \|	Convert NSA into non-NSA images. Differential Revision: https://reviews.llvm.org/D61341 llvm-svn: 359700
*	[GlobalISel] Introduce a CSEConfigBase class to allow targets to define ↵	Amara Emerson	2019-04-15	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	their own CSE configs. Because CodeGen can't depend on GlobalISel, we need a way to encapsulate the CSE configs that can be passed between TargetPassConfig and the targets' custom pass configs. This CSEConfigBase allows targets to create custom CSE configs which is then used by the GISel passes for the CSEMIRBuilder. This support will be used in a follow up commit to allow constant-only CSE for -O0 compiles in D60580. llvm-svn: 358368
*	Reapply [ValueTracking] Support min/max selects in computeConstantRange()	Nikita Popov	2019-04-07	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for min/max flavor selects in computeConstantRange(), which allows us to fold comparisons of a min/max against a constant in InstSimplify. This fixes an infinite InstCombine loop, with the test case taken from D59378. Relative to the previous iteration, this contains some adjustments for AMDGPU med3 tests: The AMDGPU target runs InstSimplify prior to codegen, which ends up constant folding some existing med3 tests after this change. To preserve these tests a hidden -amdgpu-scalar-ir-passes option is added, which allows disabling scalar IR passes (that use InstSimplify) for testing purposes. Differential Revision: https://reviews.llvm.org/D59506 llvm-svn: 357870
*	[AMDGPU] Add MachineDCE pass after RenameIndependentSubregs	Stanislav Mekhanoshin	2019-04-05	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Detect dead lanes can create some dead defs. Then RenameIndependentSubregs will break a REG_SEQUENCE which may use these dead defs. At this point a dead instruction can be removed but we do not run a DCE anymore. MachineDCE was only running before live variable analysis. The patch adds a mean to preserve LiveIntervals and SlotIndexes in case it works past this. Differential Revision: https://reviews.llvm.org/D59626 llvm-svn: 357805
*	[AMDGPU] Pre-allocate WWM registers to reduce VGPR pressure.	Neil Henning	2019-04-01	1	-7/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change incorporates an effort by Connor Abbot to change how we deal with WWM operations potentially trashing valid values in inactive lanes. Previously, the SIFixWWMLiveness pass would work out which registers were being trashed within WWM regions, and ensure that the register allocator did not have any values it was depending on resident in those registers if the WWM section would trash them. This worked perfectly well, but would cause sometimes severe register pressure when the WWM section resided before divergent control flow (or at least that is where I mostly observed it). This fix instead runs through the WWM sections and pre allocates some registers for WWM. It then reserves these registers so that the register allocator cannot use them. This results in a significant register saving on some WWM shaders I'm working with (130 -> 104 VGPRs, with just this change!). Differential Revision: https://reviews.llvm.org/D59295 llvm-svn: 357400
*	AMDGPU: Add additional MIR tests for exec mask optimizations	Matt Arsenault	2019-03-27	1	-3/+11
\| \| \| \| \| \| \| \| \| \|	Also includes one example of how this transform is unsound. This isn't verifying the copies are used in the control flow intrinisic patterns. Also add option to disable exec mask opt pass. Since this pass is unsound, it may be useful to turn it off until it is fixed. llvm-svn: 357091
*	CodeGen: Refactor regallocator command line and target selection	Matt Arsenault	2019-03-19	1	-6/+6
\| \| \| \| \| \| \| \| \| \|	This will allow targets more flexibility to replace the register allocator core passes. In a future commit, AMDGPU will run the core register assignment passes twice, and will also want to disallow using the standard -regalloc option. llvm-svn: 356506
*	[AMDGPU] Add an experimental buffer fat pointer address space.	Neil Henning	2019-03-18	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Add an experimental buffer fat pointer address space that is currently unhandled in the backend. This commit reserves address space 7 as a non-integral pointer repsenting the 160-bit fat pointer (128-bit buffer descriptor + 32-bit offset) that is heavily used in graphics workloads using the AMDGPU backend. Differential Revision: https://reviews.llvm.org/D58957 llvm-svn: 356373
*	AMDGPU: Partially fix default device for HSA	Matt Arsenault	2019-03-17	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are a few different issues, mostly stemming from using generation based checks for anything instead of subtarget features. Stop adding flat-address-space as a feature for HSA, as it should only be a device property. This was incorrectly allowing flat instructions to select for SI. Increase the default generation for HSA to avoid the encoding error when emitting objects. This has some other side effects from various checks which probably should be separate subtarget features (in the cost model and for dealing with the DS offset folding issue). Partial fix for bug 41070. It should probably be an error to try using amdhsa without flat support. llvm-svn: 356347
*	MIR: Allow targets to serialize MachineFunctionInfo	Matt Arsenault	2019-03-14	1	-0/+73
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This has been a very painful missing feature that has made producing reduced testcases difficult. In particular the various registers determined for stack access during function lowering were necessary to avoid undefined register errors in a large percentage of cases. Implement a subset of the important fields that need to be preserved for AMDGPU. Most of the changes are to support targets parsing register fields and properly reporting errors. The biggest sort-of bug remaining is for fields that can be initialized from the IR section will be overwritten by a default initialized machineFunctionInfo section. Another remaining bug is the machineFunctionInfo section is still printed even if empty. llvm-svn: 356215
*	AMDGPU: Handle "uniform-work-group-size" attribute (fix for RADV)	Aakanksha Patil	2019-03-07	1	-1/+3
\| \| \| \| \| \| \| \| \| \|	A previous patch for "uniform-work-group-size" attribute was found to break some RADV and possibly radeon SI tests and had to be retracted. This patch fixes that. Differential Revision: http://reviews.llvm.org/D58993 llvm-svn: 355574
*	AMDGPU: Fix typo	Matt Arsenault	2019-02-28	1	-1/+1
\| \| \| \|	llvm-svn: 355056
*	AMDGPU: Enable function calls by default	Matt Arsenault	2019-02-28	1	-4/+9
\| \| \| \| \| \| \|	Fixes some crashes on illegal call situations which are unfortunately still valid IR. llvm-svn: 355051
*	AMDGPU: Remove debugger related subtarget features	Matt Arsenault	2019-02-21	1	-2/+0
\| \| \| \| \| \|	As far as I know these aren't needed anymore. llvm-svn: 354634
*	[AMDGPU] Enable DPP combiner pass by default.	Valery Pykhtin	2019-02-11	1	-1/+1
\| \| \| \| \| \|	Related revisions: https://reviews.llvm.org/D55444, https://reviews.llvm.org/D55314 llvm-svn: 353691
*	Update the file headers across all of the LLVM projects in the monorepo	Chandler Carruth	2019-01-19	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	to reflect the new license. We understand that people may be surprised that we're moving the header entirely to discuss the new license. We checked this carefully with the Foundation's lawyer and we believe this is the correct approach. Essentially, all code in the project is now made available by the LLVM project under our new license, so you will see that the license headers include that license only. Some of our contributors have contributed code under our old license, and accordingly, we have retained a copy of our old license notice in the top-level files in each project and repository. llvm-svn: 351636
*	[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try	David Stuttard	2019-01-14	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TFE and LWE support requires extra result registers that are written in the event of a failure in order to detect that failure case. The specific use-case that initiated these changes is sparse texture support. This means that if image intrinsics are used with either option turned on, the programmer must ensure that the return type can contain all of the expected results. This can result in redundant registers since the vector size must be a power-of-2. This change takes roughly 6 parts: 1. Modify the instruction defs in tablegen to add new instruction variants that can accomodate the extra return values. 2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE (where the bulk of the work for these instruction types is now done) 3. Extra verification code to catch cases where intrinsics have been used but insufficient return registers are used. 4. Modification to the adjustWritemask optimisation to account for TFE/LWE being enabled (requires extra registers to be maintained for error return value). 5. An extra pass to zero initialize the error value return - this is because if the error does not occur, the register is not written and thus must be zeroed before use. Also added a new (on by default) option to ensure ALL return values are zero-initialized that is required for sparse texture support. 6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO for this to re-enable and handle correctly). There's an additional fix now to avoid a dmask=0 For an image intrinsic with tfe where all result channels except tfe were unused, I was getting an image instruction with dmask=0 and only a single vgpr result for tfe. That is incorrect because the hardware assumes there is at least one vgpr result, plus the one for tfe. Fixed by forcing dmask to 1, which gives the desired two vgpr result with tfe in the second one. The TFE or LWE result is returned from the intrinsics using an aggregate type. Look in the test code provided to see how this works, but in essence IR code to invoke the intrinsic looks as follows: %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 1, i32 0) %v.vec = extractvalue {<4 x float>, i32} %v, 0 %v.err = extractvalue {<4 x float>, i32} %v, 1 This re-submit of the change also includes a slight modification in SIISelLowering.cpp to work-around a compiler bug for the powerpc_le platform that caused a buildbot failure on a previous submission. Differential revision: https://reviews.llvm.org/D48826 Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda Work around for ppcle compiler bug Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b llvm-svn: 351054
*	Revert r348971: [AMDGPU] Support for "uniform-work-group-size" attribute	Aakanksha Patil	2018-12-13	1	-3/+1
\| \| \| \| \| \|	This patch breaks RADV (and probably RadeonSI as well) llvm-svn: 349084
*	[AMDGPU] Support for "uniform-work-group-size" attribute	Aakanksha Patil	2018-12-12	1	-1/+3
\| \| \| \| \| \| \| \|	Updated the annotate-kernel-features pass to support the propagation of uniform-work-group attribute from the kernel to the called functions. Once this pass is run, all kernels, even the ones which initially did not have the attribute, will be able to indicate weather or not they have uniform work group size depending on the value of the attribute. Differential Revision: https://reviews.llvm.org/D50200 llvm-svn: 348971
*	[AMDGPU] Add new Mode Register pass	Tim Corringham	2018-12-10	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A new pass to manage the Mode register. Currently this just manages the floating point double precision rounding requirements, but is intended to be easily extended to encompass all Mode register settings. The immediate motivation comes from the requirement to use the round-to-zero rounding mode for the 16 bit interpolation instructions, where the rounding mode setting is shared between 16 and 64 bit operations. llvm-svn: 348754
*	[Targets] Add errors for tiny and kernel codemodel on targets that don't ↵	David Green	2018-12-07	1	-7/+1
\| \| \| \| \| \| \| \| \| \| \|	support them Adds fatal errors for any target that does not support the Tiny or Kernel codemodels by rejigging the getEffectiveCodeModel calls. Differential Revision: https://reviews.llvm.org/D50141 llvm-svn: 348585
*	[AMDGPU] Partial revert of rL348371: Turn on the DPP combiner by default	Valery Pykhtin	2018-12-06	1	-1/+1
\| \| \| \| \| \| \| \|	Turn the combiner back off as there're failures until the issue is fixed. Differential revision: https://reviews.llvm.org/D55314 llvm-svn: 348487
*	[AMDGPU]: Turn on the DPP combiner by default	Valery Pykhtin	2018-12-05	1	-1/+1
\| \| \| \| \| \|	Differential revision: https://reviews.llvm.org/D55314 llvm-svn: 348371
*	[AMDGPU] Combine DPP mov with use instructions (VOP1/2/3)	Valery Pykhtin	2018-11-30	1	-0/+8
\| \| \| \| \| \| \| \|	Introduces DPP pseudo instructions and the pass that combines DPP mov with subsequent uses. Differential revision: https://reviews.llvm.org/D53762 llvm-svn: 347993
*	Revert r347871 "Fix: Add support for TFE/LWE in image intrinsic"	David Stuttard	2018-11-29	1	-1/+0
\| \| \| \| \| \| \| \| \|	Also revert fix r347876 One of the buildbots was reporting a failure in some relevant tests that I can't repro or explain at present, so reverting until I can isolate. llvm-svn: 347911
*	Add support for TFE/LWE in image intrinsics	David Stuttard	2018-11-29	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TFE and LWE support requires extra result registers that are written in the event of a failure in order to detect that failure case. The specific use-case that initiated these changes is sparse texture support. This means that if image intrinsics are used with either option turned on, the programmer must ensure that the return type can contain all of the expected results. This can result in redundant registers since the vector size must be a power-of-2. This change takes roughly 6 parts: 1. Modify the instruction defs in tablegen to add new instruction variants that can accomodate the extra return values. 2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE (where the bulk of the work for these instruction types is now done) 3. Extra verification code to catch cases where intrinsics have been used but insufficient return registers are used. 4. Modification to the adjustWritemask optimisation to account for TFE/LWE being enabled (requires extra registers to be maintained for error return value). 5. An extra pass to zero initialize the error value return - this is because if the error does not occur, the register is not written and thus must be zeroed before use. Also added a new (on by default) option to ensure ALL return values are zero-initialized that is required for sparse texture support. 6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO for this to re-enable and handle correctly). There's an additional fix now to avoid a dmask=0 For an image intrinsic with tfe where all result channels except tfe were unused, I was getting an image instruction with dmask=0 and only a single vgpr result for tfe. That is incorrect because the hardware assumes there is at least one vgpr result, plus the one for tfe. Fixed by forcing dmask to 1, which gives the desired two vgpr result with tfe in the second one. The TFE or LWE result is returned from the intrinsics using an aggregate type. Look in the test code provided to see how this works, but in essence IR code to invoke the intrinsic looks as follows: %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 1, i32 0) %v.vec = extractvalue {<4 x float>, i32} %v, 0 %v.err = extractvalue {<4 x float>, i32} %v, 1 Differential revision: https://reviews.llvm.org/D48826 Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda llvm-svn: 347871
*	AMDGPU: Don't optimize exec masks at -O0	Matt Arsenault	2018-11-26	1	-1/+2
\| \| \| \|	llvm-svn: 347573
*	[AMDGPU] Add FixupVectorISel pass, currently Supports SREGs in GLOBAL LD/ST	Ron Lieberman	2018-11-16	1	-0/+2
\| \| \| \| \| \| \| \| \|	Add a pass to fixup various vector ISel issues. Currently we handle converting GLOBAL_{LOAD\|STORE}_* and GLOBAL_Atomic_* instructions into their _SADDR variants. This involves feeding the sreg into the saddr field of the new instruction. llvm-svn: 347008
*	Allow subclassing ExternalAA	Matt Arsenault	2018-11-07	1	-7/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows testing AMDGPU alias analysis like any other alias analysis pass. This fixes the existing test pointlessly running opt -O3 when it really just wants to run the one analysis. Before there was no way to test this using -aa-eval with opt, since the default constructed pass is run. The wrapper subclass allows the default constructor to pass the necessary callback. llvm-svn: 346353
*	AMDGPU: Rewrite SILowerI1Copies to always stay on SALU	Nicolai Haehnle	2018-10-31	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: Instead of writing boolean values temporarily into 32-bit VGPRs if they are involved in PHIs or are observed from outside a loop, we use bitwise masking operations to combine lane masks in a way that is consistent with wave control flow. Move SIFixSGPRCopies to before this pass, since that pass incorrectly attempts to move SGPR phis to VGPRs. This should recover most of the code quality that was lost with the bug fix in "AMDGPU: Remove PHI loop condition optimization". There are still some relevant cases where code quality could be improved, in particular: - We often introduce redundant masks with EXEC. Ideally, we'd have a generic computeKnownBits-like analysis to determine whether masks are already masked by EXEC, so we can avoid this masking both here and when lowering uniform control flow. - The criterion we use to determine whether a def is observed from outside a loop is conservative: it doesn't check whether (loop) branch conditions are uniform. Change-Id: Ibabdb373a7510e426b90deef00f5e16c5d56e64b Reviewers: arsenm, rampitec, tpr Subscribers: kzhuravl, jvesely, wdng, mgorny, yaxunl, dstuttard, t-tye, eraman, llvm-commits Differential Revision: https://reviews.llvm.org/D53496 llvm-svn: 345719
*	[AMDGPU] Add a pass to promote bitcast calls	Scott Linder	2018-10-26	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \|	AMDGPU currently only supports direct calls, but at lower optimisation levels it fails to lower statically direct calls which appear indirect due to a bitcast. Add a pass to visit all CallSites and use CallPromotionUtils to "devirtualize" calls. Differential Revision: https://reviews.llvm.org/D52741 llvm-svn: 345382
*	[AMDGPU] Add an AMDGPU specific atomic optimizer.	Neil Henning	2018-10-08	1	-0/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit adds a new IR level pass to the AMDGPU backend to perform atomic optimizations. It works by: - Running through a function and finding atomicrmw add/sub or uses of the atomic buffer intrinsics for add/sub. - If all arguments except the value to be added/subtracted are uniform, record the value to be optimized. - Run through the atomic operations we can optimize and, depending on whether the value is uniform/divergent use wavefront wide operations (DPP in the divergent case) to calculate the total amount to be atomically added/subtracted. - Then let only a single lane of each wavefront perform the atomic operation, reducing the total number of atomic operations in flight. - Lastly we recombine the result from the single lane to each lane of the wavefront, and calculate our individual lanes offset into the final result. Differential Revision: https://reviews.llvm.org/D51969 llvm-svn: 343973
*	AMDGPU: Always run AMDGPUAlwaysInline	Matt Arsenault	2018-10-03	1	-12/+9
\| \| \| \| \| \| \|	Even if calls are enabled, it still needs to be run for forcing inline of functions that use LDS. llvm-svn: 343657
*	AMDGPU: Expand atomicrmw nand in IR	Matt Arsenault	2018-10-02	1	-0/+1
\| \| \| \|	llvm-svn: 343559
*	[AMDGPU] restore r342722 which was reverted with r342743	Sameer Sahasrabuddhe	2018-09-25	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	[AMDGPU] lower-switch in preISel as a workaround for legacy DA Summary: The default target of the switch instruction may sometimes be an "unreachable" block, when it is guaranteed that one of the cases is always taken. The dominator tree concludes that such a switch instruction does not have an immediate post dominator. This confuses divergence analysis, which is unable to propagate sync dependence to the targets of the switch instruction. As a workaround, the AMDGPU target now invokes lower-switch as a preISel pass. LowerSwitch is designed to handle the unreachable default target correctly, allowing the divergence analysis to locate the correct immediate dominator of the now-lowered switch. llvm-svn: 342956
*	revert changes from r342722	Sameer Sahasrabuddhe	2018-09-21	1	-2/+0
\| \| \| \| \| \| \| \| \|	"[AMDGPU] lower-switch in preISel as a workaround for legacy DA" This broke regression tests. The first breakage was noticed here: http://lab.llvm.org:8011/builders/lld-x86_64-freebsd/builds/23549 llvm-svn: 342743
*	[AMDGPU] lower-switch in preISel as a workaround for legacy DA	Sameer Sahasrabuddhe	2018-09-21	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: The default target of the switch instruction may sometimes be an "unreachable" block, when it is guaranteed that one of the cases is always taken. The dominator tree concludes that such a switch instruction does not have an immediate post dominator. This confuses divergence analysis, which is unable to propagate sync dependence to the targets of the switch instruction. As a workaround, the AMDGPU target now invokes lower-switch as a preISel pass. LowerSwitch is designed to handle the unreachable default target correctly, allowing the divergence analysis to locate the correct immediate dominator of the now-lowered switch. Reviewers: arsenm, nhaehnle Reviewed By: nhaehnle Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits, simoll Differential Revision: https://reviews.llvm.org/D52221 llvm-svn: 342722
*	AMDGPU: Stop forcing internalize at -O0	Matt Arsenault	2018-08-31	1	-11/+0
\| \| \| \| \| \| \|	This doesn't really matter if clang is always emitting the visibility as hidden by default. llvm-svn: 341168
*	AMDGPU: Remove remnants of old address space mapping	Matt Arsenault	2018-08-31	1	-1/+0
\| \| \| \|	llvm-svn: 341165
*	run post-RA hazard recognizer pass late	Mark Searles	2018-07-16	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \|	Memory legalizer, waitcnt, and shrink passes can perturb the instructions, which means that the post-RA hazard recognizer pass should run after them. Otherwise, one of those passes may invalidate the work done by the hazard recognizer. Note that this has adverse side-effect that any consecutive S_NOP 0's, emitted by the hazard recognizer, will not be shrunk into a single S_NOP <N>. This should be addressed in a follow-on patch. Differential Revision: https://reviews.llvm.org/D49288 llvm-svn: 337154
*	AMDGPU: Refactor Subtarget classes	Tom Stellard	2018-07-11	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: This is a follow-up to r335942. - Merge SISubtarget into AMDGPUSubtarget and rename to GCNSubtarget - Rename AMDGPUCommonSubtarget to AMDGPUSubtarget - Merge R600Subtarget::Generation and GCNSubtarget::Generation into AMDGPUSubtarget::Generation. Reviewers: arsenm, jvesely Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits Differential Revision: https://reviews.llvm.org/D49037 llvm-svn: 336851
*	Reapply "AMDGPU: Force inlining if LDS global address is used"	Matt Arsenault	2018-07-10	1	-5/+7
\| \| \| \| \| \|	This reverts commit r336623 llvm-svn: 336675
*	Revert "AMDGPU: Force inlining if LDS global address is used"	Vlad Tsyrklevich	2018-07-10	1	-7/+5
\| \| \| \| \| \| \|	This reverts commit r336587, it was causing test failures on the sanitizer bots. llvm-svn: 336623
*	AMDGPU: Force inlining if LDS global address is used	Matt Arsenault	2018-07-09	1	-5/+7
\| \| \| \| \| \| \| \| \| \|	These won't work for the forseeable future. These aren't allowed from OpenCL, but IPO optimizations can make them appear. Also directly set the attributes on functions, regardless of the linkage rather than cloning functions like before. llvm-svn: 336587
*	[AMDGPU] Enable LICM in the BE pipeline	Stanislav Mekhanoshin	2018-06-29	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	This allows to hoist code portion to compute reciprocal of loop invariant denominator in integer division after codegen prepare expansion. Differential Revision: https://reviews.llvm.org/D48604 llvm-svn: 335988
*	AMDGPU: Add pass to lower kernel arguments to loads	Matt Arsenault	2018-06-26	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This replaces most argument uses with loads, but for now not all. The code in SelectionDAG for calling convention lowering is actively harmful for amdgpu_kernel. It attempts to split the argument types into register legal types, which results in low quality code for arbitary types. Since all kernel arguments are passed in memory, we just want the raw types. I've tried a couple of methods of mitigating this in SelectionDAG, but it's easier to just bypass this problem alltogether. It's possible to hack around the problem in the initial lowering, but the real problem is the DAG then expects to be able to use CopyToReg/CopyFromReg for uses of the arguments outside the block. Exposing the argument loads in the IR also has the advantage that the LoadStoreVectorizer can merge them. I'm not sure the best approach to dealing with the IR argument list is. The patch as-is just leaves the IR arguments in place, so all the existing code will still compute the same kernarg size and pointlessly lowers the arguments. Arguably the frontend should emit kernels with an empty argument list in the first place. Alternatively a dummy array could be inserted as a single argument just to reserve space. This does have some disadvantages. Local pointer kernel arguments can no longer have AssertZext placed on them as the equivalent !range metadata is not valid on pointer typed loads. This is mostly bad for SI which needs to know about the known bits in order to use the DS instruction offset, so in this case this is not done. More importantly, this skips noalias arguments since this pass does not yet convert this to the equivalent !alias.scope and !noalias metadata. Producing this metadata correctly seems to be tricky, although this logically is the same as inlining into a function which doesn't exist. Additionally, exposing these loads to the vectorizer may result in degraded aliasing information if a pointer load is merged with another argument load. I'm also not entirely sure this is preserving the current clover ABI, although I would greatly prefer if it would stop widening arguments and match the HSA ABI. As-is I think it is extending < 4-byte arguments to 4-bytes but doesn't align them to 4-bytes. llvm-svn: 335650
*	[AMDGPU] Construct memory clauses before RA	Stanislav Mekhanoshin	2018-05-31	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Memory clauses are formed into bundles in presence of xnack. Their source operands are marked as early-clobber. This allows to allocate distinct source and destination registers within a clause and prevent breaking the clause with s_nop in the hazard recognizer. Clauses are undone before post-RA scheduler to allow some rescheduling, which will not break the clause since artificial edges are created in the dag to keep memory operations together. Yet this allows a better ILP in some cases. Differential Revision: https://reviews.llvm.org/D47511 llvm-svn: 333691
*	AMDGPU: Split AMDGPUTTI into GCNTTI and R600TTI	Tom Stellard	2018-05-30	1	-5/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Reviewers: arsenm, nhaehnle Reviewed By: arsenm Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye Differential Revision: https://reviews.llvm.org/D47359 llvm-svn: 333605
*	AMDGPU: Fix typo in option description	Matt Arsenault	2018-05-29	1	-1/+1
\| \| \| \|	llvm-svn: 333457