field | value | date |
---|---|---|
author | Jay Foad <jay.foad@gmail.com> | 2019-07-19 08:40:37 +0000 |
committer | Jay Foad <jay.foad@gmail.com> | 2019-07-19 08:40:37 +0000 |
commit | 7d06ffff466d50ba7e65c154749bc4aea120a907 | |
tree | 946e4f47b0160eb8359ddc72564ad201f259887b /llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
parent | bde33af85a1bf591fcc6cf9391054b0a4f6c5b46 | |
[AMDGPU] Simplify the exclusive scan used for optimized atomics
Summary:
Change the scan algorithm to use only power-of-two shifts (1, 2, 4, 8,
16, 32) instead of starting off shifting by 1, 2 and 3 and then doing
a 3-way ADD, because:
1. It simplifies the compiler a little.
2. It minimizes VGPR pressure, because each instruction is now of the
form vn = vn + (vn << c), where << is a shift across lanes by c.
3. It is more friendly to the DPP combiner, which currently can't
combine into an ADD3 instruction.
Because of #2 and #3 the end result is improved from this:
v_add_u32_dpp v4, v3, v3 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
v_mov_b32_dpp v1, v3 row_shr:3 row_mask:0xf bank_mask:0xf
v_add3_u32 v1, v4, v5, v1
s_nop 1
v_add_u32_dpp v1, v1, v1 row_shr:4 row_mask:0xf bank_mask:0xe
s_nop 1
v_add_u32_dpp v1, v1, v1 row_shr:8 row_mask:0xf bank_mask:0xc
s_nop 1
v_add_u32_dpp v1, v1, v1 row_bcast:15 row_mask:0xa bank_mask:0xf
s_nop 1
v_add_u32_dpp v1, v1, v1 row_bcast:31 row_mask:0xc bank_mask:0xf
To this:
v_add_u32_dpp v1, v1, v1 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
s_nop 1
v_add_u32_dpp v1, v1, v1 row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
s_nop 1
v_add_u32_dpp v1, v1, v1 row_shr:4 row_mask:0xf bank_mask:0xe
s_nop 1
v_add_u32_dpp v1, v1, v1 row_shr:8 row_mask:0xf bank_mask:0xc
s_nop 1
v_add_u32_dpp v1, v1, v1 row_bcast:15 row_mask:0xa bank_mask:0xf
s_nop 1
v_add_u32_dpp v1, v1, v1 row_bcast:31 row_mask:0xc bank_mask:0xf
That is, two fewer computational instructions (six instead of eight), at
the cost of one extra s_nop in a slot where we could schedule something
else.
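For readers who want to check that the two sequences compute the same scan, here is a minimal host-side sketch in C++. It is not code from the patch: the helper names (wfShr1, rowShr, rowBcast, scanOld, scanNew, scanRef, checkScans) are hypothetical, the operation is assumed to be an integer add with identity 0, and the DPP controls are modelled in simplified form on a 64-lane wavefront made of four 16-lane rows.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 64; // lanes per wavefront (assumed wave64)
constexpr int kRow = 16;   // lanes per DPP row
using Wave = std::array<uint32_t, kLanes>;

// Lane-wise add, standing in for the binary op applied after each DPP move.
Wave add(const Wave &a, const Wave &b) {
  Wave r{};
  for (int i = 0; i < kLanes; ++i) r[i] = a[i] + b[i];
  return r;
}

// Model of wf_sr1: shift the whole wavefront right by one lane; lane 0 gets 0.
Wave wfShr1(const Wave &v) {
  Wave r{};
  for (int i = 1; i < kLanes; ++i) r[i] = v[i - 1];
  return r;
}

// Model of row_shr:n: each lane reads the lane n places to its left within
// its own row. Lanes with no source yield 0 (the identity); the bank masks
// 0xe (n = 4) and 0xc (n = 8) disable exactly those lanes.
Wave rowShr(const Wave &v, int n) {
  Wave r{};
  for (int i = 0; i < kLanes; ++i)
    if (i % kRow >= n) r[i] = v[i - n];
  return r;
}

// Model of row_bcast:15 (group = 16) and row_bcast:31 (group = 32): lanes in
// rows enabled by rowMask read the last lane of the preceding group.
Wave rowBcast(const Wave &v, int group, unsigned rowMask) {
  Wave r{};
  for (int i = 0; i < kLanes; ++i) {
    int src = (i / group) * group - 1;
    bool rowEnabled = ((rowMask >> (i / kRow)) & 1u) != 0;
    if (src >= 0 && rowEnabled) r[i] = v[src];
  }
  return r;
}

// Steps shared by both sequences: row_shr:4, row_shr:8, then the broadcasts.
Wave crossRowTail(Wave x) {
  x = add(x, rowShr(x, 4));
  x = add(x, rowShr(x, 8));
  x = add(x, rowBcast(x, 16, 0xa));
  x = add(x, rowBcast(x, 32, 0xc));
  return x;
}

// Old sequence: shifts by 1, 2 and 3 of the same first value, summed.
Wave scanOld(const Wave &in) {
  const Wave first = wfShr1(in);
  Wave x = first;
  for (int n = 1; n <= 3; ++n) x = add(x, rowShr(first, n));
  return crossRowTail(x);
}

// New sequence: power-of-two shifts only, each step feeding the next.
Wave scanNew(const Wave &in) {
  Wave x = wfShr1(in);
  x = add(x, rowShr(x, 1));
  x = add(x, rowShr(x, 2));
  return crossRowTail(x);
}

// Serial reference: lane i receives the sum of in[0..i-1].
Wave scanRef(const Wave &in) {
  Wave r{};
  uint32_t sum = 0;
  for (int i = 0; i < kLanes; ++i) {
    r[i] = sum;
    sum += in[i];
  }
  return r;
}

// Asserts that both modelled sequences agree with the serial reference.
void checkScans(const Wave &in) {
  assert(scanNew(in) == scanRef(in));
  assert(scanOld(in) == scanRef(in));
}
```

Calling checkScans on any 64-element input exercises both variants. In this model the only difference between them is how the first four lanes of each row's window are accumulated, which matches the first few instructions of the two listings above.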
Reviewers: arsenm, sheredom, critson, rampitec, vpykhtin
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64411
llvm-svn: 366543
Diffstat (limited to 'llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp')
-rw-r--r-- | llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | 18 |
1 files changed, 8 insertions, 10 deletions
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
index 8a92e7d923f..2982549357b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
@@ -376,26 +376,24 @@ void AMDGPUAtomicOptimizer::optimizeAtomic(Instruction &I,
       CallInst *const SetInactive =
           B.CreateIntrinsic(Intrinsic::amdgcn_set_inactive, Ty, {V, Identity});
-      CallInst *const FirstDPP =
+      ExclScan =
           B.CreateIntrinsic(Intrinsic::amdgcn_update_dpp, Ty,
                             {Identity, SetInactive, B.getInt32(DPP_WF_SR1),
                              B.getInt32(0xf), B.getInt32(0xf), B.getFalse()});
-      ExclScan = FirstDPP;
-      const unsigned Iters = 7;
-      const unsigned DPPCtrl[Iters] = {
-          DPP_ROW_SR1, DPP_ROW_SR2, DPP_ROW_SR3, DPP_ROW_SR4,
-          DPP_ROW_SR8, DPP_ROW_BCAST15, DPP_ROW_BCAST31};
-      const unsigned RowMask[Iters] = {0xf, 0xf, 0xf, 0xf, 0xf, 0xa, 0xc};
-      const unsigned BankMask[Iters] = {0xf, 0xf, 0xf, 0xe, 0xc, 0xf, 0xf};
+      const unsigned Iters = 6;
+      const unsigned DPPCtrl[Iters] = {DPP_ROW_SR1, DPP_ROW_SR2,
+                                       DPP_ROW_SR4, DPP_ROW_SR8,
+                                       DPP_ROW_BCAST15, DPP_ROW_BCAST31};
+      const unsigned RowMask[Iters] = {0xf, 0xf, 0xf, 0xf, 0xa, 0xc};
+      const unsigned BankMask[Iters] = {0xf, 0xf, 0xe, 0xc, 0xf, 0xf};
       // This loop performs an exclusive scan across the wavefront, with all lanes
       // active (by using the WWM intrinsic).
       for (unsigned Idx = 0; Idx < Iters; Idx++) {
-        Value *const UpdateValue = Idx < 3 ? FirstDPP : ExclScan;
         CallInst *const DPP = B.CreateIntrinsic(
             Intrinsic::amdgcn_update_dpp, Ty,
-            {Identity, UpdateValue, B.getInt32(DPPCtrl[Idx]),
+            {Identity, ExclScan, B.getInt32(DPPCtrl[Idx]),
              B.getInt32(RowMask[Idx]), B.getInt32(BankMask[Idx]), B.getFalse()});
         ExclScan = buildNonAtomicBinOp(B, Op, ExclScan, DPP);