diff options
Diffstat (limited to 'mlir/docs/Dialects/GPU.md')
-rw-r--r-- | mlir/docs/Dialects/GPU.md | 132 |
1 files changed, 132 insertions, 0 deletions
diff --git a/mlir/docs/Dialects/GPU.md b/mlir/docs/Dialects/GPU.md new file mode 100644 index 00000000000..7dcd8f6053c --- /dev/null +++ b/mlir/docs/Dialects/GPU.md @@ -0,0 +1,132 @@ +# GPU Dialect + +Note: this dialect is more likely to change than others in the near future; use +with caution. + +This dialect provides middle-level abstractions for launching GPU kernels +following a programming model similar to that of CUDA or OpenCL. It provides +abstractions for kernel invocations (and may eventually provide those for device +management) that are not present at the lower level (e.g., as LLVM IR intrinsics +for GPUs). Its goal is to abstract away device- and driver-specific +manipulations to launch a GPU kernel and provide a simple path towards GPU +execution from MLIR. It may be targeted, for example, by DSLs using MLIR. The +dialect uses `gpu` as its canonical prefix. + +## Memory attribution + +Memory buffers are defined at the function level, either in "gpu.launch" or in +"gpu.func" ops. This encoding makes it clear where the memory belongs and makes +the lifetime of the memory visible. The memory is only accessible while the +kernel is launched/the function is currently invoked. The latter is more strict +than actual GPU implementations but using static memory at the function level is +just for convenience. It is also always possible to pass pointers to the +workgroup memory into other functions, provided they expect the correct memory +space. + +The buffers are considered live throughout the execution of the GPU function +body. The absence of memory attribution syntax means that the function does not +require special buffers. Rationale: although the underlying models declare +memory buffers at the module level, we chose to do it at the function level to +provide some structuring for the lifetime of those buffers; this avoids the +incentive to use the buffers for communicating between different kernels or +launches of the same kernel, which should be done through function arguments +instead; we chose not to use `alloca`-style approach that would require more +complex lifetime analysis following the principles of MLIR that promote +structure and representing analysis results in the IR. + +## Operations + +### `gpu.block_dim` + +Returns the number of threads in the thread block (aka the block size) along the +x, y, or z `dimension`. + +Example: + +```mlir + %bDimX = "gpu.block_dim"() {dimension = "x"} : () -> (index) +``` + +### `gpu.block_id` + +Returns the block id, i.e. the index of the current block within the grid along +the x, y, or z `dimension`. + +Example: + +```mlir + %bIdY = "gpu.block_id"() {dimension = "y"} : () -> (index) +``` + +### `gpu.grid_dim` + +Returns the number of thread blocks in the grid along the x, y, or z +`dimension`. + +Example: + +```mlir + %gDimZ = "gpu.grid_dim"() {dimension = "z"} : () -> (index) +``` + +### `gpu.thread_id` + +Returns the thread id, i.e. the index of the current thread within the block +along the x, y, or z `dimension`. + +Example: + +```mlir + %tIdX = "gpu.thread_id"() {dimension = "x"} : () -> (index) +``` + +### `gpu.yield` + +Is a special terminator operation for blocks inside regions in gpu ops. It +returns values to the immediately enclosing gpu op. + +Example: + +```mlir +gpu.yield %f0, %f1 : f32, f32 +``` + +### `gpu.all_reduce` + +The "all_reduce" op reduces the value of every work item across a local +workgroup. The result is equal for all work items of a workgroup. + +For example, both + +```mlir +%1 = "gpu.all_reduce"(%0) ({}) { op = "add" } : (f32) -> (f32) +%2 = "gpu.all_reduce"(%0) ({ +^bb(%lhs : f32, %rhs : f32): + %sum = addf %lhs, %rhs : f32 + "gpu.yield"(%sum) : (f32) -> () +}) : (f32) -> (f32) +``` + +compute the sum of each work item's %0 value. The first version specifies the +accumulation as operation, whereas the second version specifies the accumulation +as code region. The accumulation operation must either be `add` or `mul`. + +Either none or all work items of a workgroup need to execute this op +in convergence. + +### `gpu.barrier` + +The "barrier" op synchronizes all work items of a workgroup. It is used +to coordinate communication between the work items of the workgroup. + +```mlir +gpu.barrier +``` + +waits until all work items in the workgroup have reached this point and all +memory accesses made by these work items prior to the op are visible to all work +items in the workgroup. Data hazards between work items accessing the same +memory can be avoided by synchronizing work items in-between these accesses. + +Either none or all work items of a workgroup need to execute this op +in convergence. |