path: root/mlir/lib/Transforms/AffineDataCopyGeneration.cpp
Commit log (newest first):
* NFC: Replace ValuePtr with Value and remove it now that Value is value-typed. (River Riddle, 2019-12-23; 1 file, -1/+1)
  ValuePtr was a temporary typedef during the transition to a value-typed Value.
  PiperOrigin-RevId: 286945714
* Adjust License.txt file to use the LLVM license (Mehdi Amini, 2019-12-23; 1 file, -13/+4)
  PiperOrigin-RevId: 286906740
* NFC: Introduce new ValuePtr/ValueRef typedefs to simplify the transition to Value being value-typed. (River Riddle, 2019-12-22; 1 file, -1/+1)
  This is an initial step to refactoring the representation of OpResult as proposed in:
  https://groups.google.com/a/tensorflow.org/g/mlir/c/XXzzKhqqF_0/m/v6bKb08WCgAJ
  This change will make it much simpler to incrementally transition all of the existing code to use value-typed semantics.
  PiperOrigin-RevId: 286844725
* Fix a number of Clang-Tidy warnings. (Christian Sigg, 2019-09-23; 1 file, -1/+0)
  PiperOrigin-RevId: 270632324
* NFC - Move explicit copy/dma generation utility out of pass and into LoopUtils (Uday Bondhugula, 2019-09-14; 1 file, -666/+20)
  - turn the copy/dma generation method into a utility in LoopUtils, allowing it to be reused elsewhere
  - no functional/logic change to the pass/utility
  - trim down header includes in the files affected
  Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
  Closes tensorflow/mlir#124
  COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/124 from bondhugula:datacopy 9f346e62e5bd9dd1986720a30a35f302eb4d3252
  PiperOrigin-RevId: 269106088
* NFC: Finish replacing FunctionPassBase/ModulePassBase with OpPassBase. (River Riddle, 2019-09-13; 1 file, -1/+1)
  These were temporary during the generalization of FunctionPass/ModulePass to OpPass.
  PiperOrigin-RevId: 268970259
* pipeline-data-transfer: remove dead tag allocs and improve test coverage for replaceMemRefUsesWith / pipeline-data-transfer (Uday Bondhugula, 2019-09-04; 1 file, -5/+6)
  - address remaining comments from PR tensorflow/mlir#87 for better test coverage for pipeline-data-transfer/replaceAllMemRefUsesWith
  - remove dead tag allocs the same way they are removed for the replaced buffers
  Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
  Closes tensorflow/mlir#106
  COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/106 from bondhugula:followup 9e868666d047e8d43e5f82f43e4093b838c710fa
  PiperOrigin-RevId: 267144774
* Fix affine data copy generation corner cases/bugs (Uday Bondhugula, 2019-09-03; 1 file, -79/+110)
  - the [begin, end) range identified for copying could end in the middle of the block, which makes hoisting invalid in some cases; change the range identification to always end with the end of the block
  - add test cases to exercise these (with fast memory capacity set to the minimum so that single-element memref buffers are generated at the innermost loop)
  - the location of the begin/end of the block range for data copying was being confused with the insertion points for copy-in and copy-out code; in cases where we choose to hoist transfers, these are separate
  - when copy loops are single-iteration ones, promote their bodies at the end of the pass
  - change the default fast memory space to 1 (setting it to zero made the pass generate DMA ops that won't verify in the default case, since the DMA ops check that the src/dest memref spaces differ)
  Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
  Co-Authored-By: Mehdi Amini <joker.eph@gmail.com>
  Closes tensorflow/mlir#88
  COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/88 from bondhugula:datacopy 88697267c45e850c3ced87671e16e4a930c02a42
  PiperOrigin-RevId: 266980911
* NFC: Move AffineOps dialect to the Dialect sub-directory. (River Riddle, 2019-08-20; 1 file, -1/+1)
  PiperOrigin-RevId: 264482571
* NFC: Move LLVMIR, SDBM, and StandardOps to the Dialect/ directory. (River Riddle, 2019-08-19; 1 file, -1/+1)
  PiperOrigin-RevId: 264193915
* Change from llvm::make_unique to std::make_unique (Jacques Pienaar, 2019-08-17; 1 file, -2/+2)
  Switch to the C++14 standard method, as llvm::make_unique has been removed (https://reviews.llvm.org/D66259). Also mark some targets as C++14 to ease next integrates.
  PiperOrigin-RevId: 263953918
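  The change is mechanical. A minimal standalone illustration of the one-for-one replacement (my own sketch, not code from this commit's diff):

  ```c++
  // llvm::make_unique was a polyfill for toolchains predating C++14; once the
  // build requires C++14, the standard std::make_unique (from <memory>)
  // replaces it one-for-one.
  #include <memory>
  #include <string>

  int main() {
    // Before: auto s = llvm::make_unique<std::string>("hello");
    auto s = std::make_unique<std::string>("hello");  // after: standard C++14
    return s->size() == 5 ? 0 : 1;  // exits 0 on success
  }
  ```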
* Express ownership transfer in PassManager API through std::unique_ptr (NFC) (Mehdi Amini, 2019-08-12; 1 file, -4/+4)
  Since raw pointers are always passed around for IR constructs without implying any ownership transfer, it can be error prone to have ownership implicitly transferred the same way. For example, this code can seem harmless:
      Pass *pass = ....
      pm.addPass(pass);
      pm.addPass(pass);
      pm.run(module);
  PiperOrigin-RevId: 263053082
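  A self-contained sketch of why the unique_ptr signature helps, in plain C++ with hypothetical stand-ins for the MLIR classes (not the actual API):

  ```c++
  #include <memory>
  #include <utility>
  #include <vector>

  struct Pass { virtual ~Pass() = default; };  // hypothetical stand-in

  struct PassManager {  // hypothetical stand-in
    // Post-change style: the signature itself documents that the manager
    // takes ownership of the pass.
    void addPass(std::unique_ptr<Pass> pass) {
      passes.push_back(std::move(pass));
    }
    std::vector<std::unique_ptr<Pass>> passes;
  };

  int main() {
    PassManager pm;
    auto pass = std::make_unique<Pass>();
    pm.addPass(std::move(pass));    // ownership transfer is explicit here
    // pm.addPass(std::move(pass)); // a second add is now visibly suspect:
    //                              // `pass` has already been moved from
  }
  ```

  With a raw-pointer addPass, the double-add in the commit message compiles silently and the manager ends up deleting the same pass twice.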
* NFC: Standardize the terminology used for parent ops/regions/etc. (River Riddle, 2019-08-09; 1 file, -2/+2)
  There are currently several different terms used to refer to a parent IR unit in 'get' methods: getParent/getEnclosing/getContaining. This CL standardizes all of these methods to use 'getParent*'.
  PiperOrigin-RevId: 262680287
* AffineDataCopyGeneration: don't use CL flag values inside the pass (Alex Zinenko, 2019-08-02; 1 file, -13/+22)
  The AffineDataCopyGeneration pass relied on command-line flags for internal logic in several places, which made it unusable in a library context (i.e., outside a standalone mlir-opt binary that does the command-line parsing). Define configuration flags in the constructor instead, and set them to command-line-based defaults to maintain the original behavior.
  PiperOrigin-RevId: 261322364
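  A minimal, self-contained sketch of the pattern this commit describes, in plain C++ with hypothetical names rather than the real MLIR pass classes: configuration enters through the constructor, and the command-line flags only feed those constructor arguments inside the mlir-opt tool.

  ```c++
  #include <cstdint>

  // Hypothetical stand-in for the pass: every knob is a constructor-initialized
  // member, so a library client never depends on global cl::opt state.
  struct AffineDataCopyGenerationSketch {
    AffineDataCopyGenerationSketch(unsigned fastMemorySpace,
                                   uint64_t fastMemCapacityBytes)
        : fastMemorySpace(fastMemorySpace),
          fastMemCapacityBytes(fastMemCapacityBytes) {}

    void run() {
      // Internal logic consults the members, never a command-line flag.
    }

    unsigned fastMemorySpace;
    uint64_t fastMemCapacityBytes;
  };

  int main() {
    // A library client configures the pass directly; in a standalone tool, the
    // same arguments would come from the parsed command-line defaults.
    AffineDataCopyGenerationSketch pass(/*fastMemorySpace=*/1,
                                        /*fastMemCapacityBytes=*/32 * 1024);
    pass.run();
  }
  ```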
* Introduce explicit copying optimization by generalizing the DMA generation pass (Uday Bondhugula, 2019-08-01; 1 file, -0/+883)
  Explicit copying to contiguous buffers is a standard technique to avoid conflict misses and TLB misses, and improve hardware prefetching performance. When done in conjunction with cache tiling, it nearly eliminates all cache conflict and TLB misses, and a single hardware prefetch stream is needed per data tile.
  - generalize/extend the DMA generation pass (renamed the data copying pass) to perform either point-wise explicit copies to fast memory buffers or DMAs (depending on a command-line option); all logic is the same as the erstwhile -dma-generate
  - -affine-dma-generate is now renamed -affine-data-copy; when the -dma flag is provided, DMAs are generated, or else explicit copy loops are generated (point-wise) by default
  - point-wise copying could be used for CPUs (or GPUs); some indicative performance numbers with a "C" version of the MLIR when compiled with and without this optimization (about a 2x improvement here), with a matmul on 4096^2 matrices on a single core of an Intel Core i7 Skylake i7-8700K with clang 8.0.0:
        clang -O3:                                  518s
        clang -O3 with MLIR tiling (128x128):       24.5s
        clang -O3 with MLIR tiling + data copying:  12.4s
    (code equivalent to test/Transforms/data-copy.mlir func @matmul)
  - fix some misleading comments
  - change the default fast-mem space to 0 (more intuitive now with the default copy generation using point-wise copies instead of DMAs)

  On a simple 3-d matmul loop nest, code generated with -affine-data-copy:
  ```
  affine.for %arg3 = 0 to 4096 step 128 {
    affine.for %arg4 = 0 to 4096 step 128 {
      %0 = affine.apply #map0(%arg3, %arg4)
      %1 = affine.apply #map1(%arg3, %arg4)
      %2 = alloc() : memref<128x128xf32, 2>
      // Copy-in Out matrix.
      affine.for %arg5 = 0 to 128 {
        %5 = affine.apply #map2(%arg3, %arg5)
        affine.for %arg6 = 0 to 128 {
          %6 = affine.apply #map2(%arg4, %arg6)
          %7 = load %arg2[%5, %6] : memref<4096x4096xf32>
          affine.store %7, %2[%arg5, %arg6] : memref<128x128xf32, 2>
        }
      }
      affine.for %arg5 = 0 to 4096 step 128 {
        %5 = affine.apply #map0(%arg3, %arg5)
        %6 = affine.apply #map1(%arg3, %arg5)
        %7 = alloc() : memref<128x128xf32, 2>
        // Copy-in LHS.
        affine.for %arg6 = 0 to 128 {
          %11 = affine.apply #map2(%arg3, %arg6)
          affine.for %arg7 = 0 to 128 {
            %12 = affine.apply #map2(%arg5, %arg7)
            %13 = load %arg0[%11, %12] : memref<4096x4096xf32>
            affine.store %13, %7[%arg6, %arg7] : memref<128x128xf32, 2>
          }
        }
        %8 = affine.apply #map0(%arg5, %arg4)
        %9 = affine.apply #map1(%arg5, %arg4)
        %10 = alloc() : memref<128x128xf32, 2>
        // Copy-in RHS.
        affine.for %arg6 = 0 to 128 {
          %11 = affine.apply #map2(%arg5, %arg6)
          affine.for %arg7 = 0 to 128 {
            %12 = affine.apply #map2(%arg4, %arg7)
            %13 = load %arg1[%11, %12] : memref<4096x4096xf32>
            affine.store %13, %10[%arg6, %arg7] : memref<128x128xf32, 2>
          }
        }
        // Compute.
        affine.for %arg6 = #map7(%arg3) to #map8(%arg3) {
          affine.for %arg7 = #map7(%arg4) to #map8(%arg4) {
            affine.for %arg8 = #map7(%arg5) to #map8(%arg5) {
              %11 = affine.load %7[-%arg3 + %arg6, -%arg5 + %arg8] : memref<128x128xf32, 2>
              %12 = affine.load %10[-%arg5 + %arg8, -%arg4 + %arg7] : memref<128x128xf32, 2>
              %13 = affine.load %2[-%arg3 + %arg6, -%arg4 + %arg7] : memref<128x128xf32, 2>
              %14 = mulf %11, %12 : f32
              %15 = addf %13, %14 : f32
              affine.store %15, %2[-%arg3 + %arg6, -%arg4 + %arg7] : memref<128x128xf32, 2>
            }
          }
        }
        dealloc %10 : memref<128x128xf32, 2>
        dealloc %7 : memref<128x128xf32, 2>
      }
      %3 = affine.apply #map0(%arg3, %arg4)
      %4 = affine.apply #map1(%arg3, %arg4)
      // Copy out result matrix.
      affine.for %arg5 = 0 to 128 {
        %5 = affine.apply #map2(%arg3, %arg5)
        affine.for %arg6 = 0 to 128 {
          %6 = affine.apply #map2(%arg4, %arg6)
          %7 = affine.load %2[%arg5, %arg6] : memref<128x128xf32, 2>
          store %7, %arg2[%5, %6] : memref<4096x4096xf32>
        }
      }
      dealloc %2 : memref<128x128xf32, 2>
    }
  }
  ```
  With -affine-data-copy -dma:
  ```
  affine.for %arg3 = 0 to 4096 step 128 {
    %0 = affine.apply #map3(%arg3)
    %1 = alloc() : memref<128xf32, 2>
    %2 = alloc() : memref<1xi32>
    affine.dma_start %arg2[%arg3], %1[%c0], %2[%c0], %c128_0 : memref<4096xf32>, memref<128xf32, 2>, memref<1xi32>
    affine.dma_wait %2[%c0], %c128_0 : memref<1xi32>
    %3 = alloc() : memref<1xi32>
    affine.for %arg4 = 0 to 4096 step 128 {
      %5 = affine.apply #map0(%arg3, %arg4)
      %6 = affine.apply #map1(%arg3, %arg4)
      %7 = alloc() : memref<128x128xf32, 2>
      %8 = alloc() : memref<1xi32>
      affine.dma_start %arg0[%arg3, %arg4], %7[%c0, %c0], %8[%c0], %c16384, %c4096, %c128_2 : memref<4096x4096xf32>, memref<128x128xf32, 2>, memref<1xi32>
      affine.dma_wait %8[%c0], %c16384 : memref<1xi32>
      %9 = affine.apply #map3(%arg4)
      %10 = alloc() : memref<128xf32, 2>
      %11 = alloc() : memref<1xi32>
      affine.dma_start %arg1[%arg4], %10[%c0], %11[%c0], %c128_1 : memref<4096xf32>, memref<128xf32, 2>, memref<1xi32>
      affine.dma_wait %11[%c0], %c128_1 : memref<1xi32>
      affine.for %arg5 = #map3(%arg3) to #map5(%arg3) {
        affine.for %arg6 = #map3(%arg4) to #map5(%arg4) {
          %12 = affine.load %7[-%arg3 + %arg5, -%arg4 + %arg6] : memref<128x128xf32, 2>
          %13 = affine.load %10[-%arg4 + %arg6] : memref<128xf32, 2>
          %14 = affine.load %1[-%arg3 + %arg5] : memref<128xf32, 2>
          %15 = mulf %12, %13 : f32
          %16 = addf %14, %15 : f32
          affine.store %16, %1[-%arg3 + %arg5] : memref<128xf32, 2>
        }
      }
      dealloc %11 : memref<1xi32>
      dealloc %10 : memref<128xf32, 2>
      dealloc %8 : memref<1xi32>
      dealloc %7 : memref<128x128xf32, 2>
    }
    %4 = affine.apply #map3(%arg3)
    affine.dma_start %1[%c0], %arg2[%arg3], %3[%c0], %c128 : memref<128xf32, 2>, memref<4096xf32>, memref<1xi32>
    affine.dma_wait %3[%c0], %c128 : memref<1xi32>
    dealloc %3 : memref<1xi32>
    dealloc %2 : memref<1xi32>
    dealloc %1 : memref<128xf32, 2>
  }
  ```
  Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
  Closes tensorflow/mlir#50
  PiperOrigin-RevId: 261221903