- bodies would earlier appear in the order (i, i+3, i+2, i+1) instead of
(i, i+1, i+2, i+3), e.g., for a factor of 4.
- clean up hardcoded test cases
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#170
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/170 from bondhugula:ujam b66b405b2b1894a03b376952e32a9d0292042665
PiperOrigin-RevId: 273613131
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#157
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/157 from bondhugula:quickfix bd1fcd79825fc0bd5b4a3e688153fa0993ab703d
PiperOrigin-RevId: 273316498
callable.
Some dialects have implicit conversions inherent in their modeling, meaning that a call may have a different type than the type that the callable expects. To support this, a hook is added to the dialect interface that allows for materializing conversion operations during inlining when there is a mismatch. A hook is also added to the callable interface to allow for introspecting the expected result types.
PiperOrigin-RevId: 272814379
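A hypothetical sketch of the situation these hooks address, using made-up "foo" ops (not part of any real dialect) for the implicitly converting call and the materialized casts:
```
func @callee(%arg0: i32) -> i32 {
  return %arg0 : i32
}
func @caller(%arg0: i64) -> i64 {
  // The call site's type (i64) differs from the callable's type (i32).
  %0 = "foo.call"(%arg0) {callee = @callee} : (i64) -> i64
  return %0 : i64
}
// When inlining @callee into @caller, the dialect hook would materialize
// explicit conversion ops at the type mismatches, e.g.:
//   %1 = "foo.cast"(%arg0) : (i64) -> i32
//   ... inlined body operates on %1 ...
//   %2 = "foo.cast"(%r) : (i32) -> i64
```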
This allows for the inliner to work on arbitrary call operations. The updated inliner will also work bottom-up through the callgraph, enabling support for multiple levels of inlining.
PiperOrigin-RevId: 272813876
- introduce splat op in standard dialect (currently for int/float/index input
type; the output type can be a vector or statically shaped tensor); see the
sketch after this entry
- implement LLVM lowering (when the result type is a 1-d vector)
- add a constant folding hook for it
- while in Ops.cpp, fix some stale names
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#141
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/141 from bondhugula:splat 48976a6aa0a75be6d91187db6418de989e03eb51
PiperOrigin-RevId: 270965304
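A minimal IR sketch of the new op as described above (the example is mine; exact assembly details are assumed):
```
func @splat_example(%s: f32) -> vector<8xf32> {
  // Broadcast the scalar %s into every element of a 1-d vector.
  %v = splat %s : vector<8xf32>
  return %v : vector<8xf32>
}
```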
- fix store-to-load forwarding for a certain set of cases (where
forwarding shouldn't have happened); use AffineValueMap-difference-based
MemRefAccess equality checking; the utility logic is also greatly
simplified (a sketch of an invalid-forwarding case follows this entry)
- add missing equality/inequality operators for AffineExpr ==/!= ints
- add ==/!= operators on MemRefAccess
Closes tensorflow/mlir#136
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/136 from bondhugula:store-load-forwarding d79fd1add8bcfbd9fa71d841a6a9905340dcd792
PiperOrigin-RevId: 270457011
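A small hypothetical illustration (mine, not from the change) of a case where forwarding must not happen because the store and load accesses are not provably identical:
```
func @no_forwarding(%A: memref<100xf32>, %v: f32) {
  affine.for %i = 0 to 99 {
    affine.store %v, %A[%i] : memref<100xf32>
    // The load accesses a different element (%i + 1) than the store (%i),
    // so the stored value must not be forwarded to it.
    %x = affine.load %A[%i + 1] : memref<100xf32>
  }
  return
}
```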
- allow symbols in index remapping provided for memref replacement
- fix memref normalize crash on cases with layout maps with symbols
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Reported by: Alex Zinenko
Closes tensorflow/mlir#139
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/139 from bondhugula:memref-rep-symbols 2f48c1fdb5d4c58915bbddbd9f07b18541819233
PiperOrigin-RevId: 269851182
- add a canonicalization pattern to compose maps into affine loads/stores;
templatize the pattern and reuse it for affine.apply as well (a small
before/after sketch follows this entry)
- rename getIndices -> getMapOperands() (getIndices is confusing since
these are no longer the indices themselves but operands to the map
whose results are the indices). This also makes the accessor uniform
across affine.apply/load/store. Change arg names on the affine
load/store builders to avoid confusion. Drop an unused, confusing build
method on AffineStoreOp.
- update the incomplete doc comment for canonicalizeMapAndOperands (this was
missed in a previous update).
Addresses issue tensorflow/mlir#121
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#122
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/122 from bondhugula:compose-load-store e71de1771e56a85c4282c10cb43f30cef0701c4f
PiperOrigin-RevId: 269619540
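A sketch of what composing an applied map into an affine load looks like (syntax of this era of MLIR; the example itself is assumed, not taken from the change):
```
#map0 = (d0) -> (d0 + 1)

func @compose(%A: memref<100xf32>, %i: index) -> f32 {
  %idx = affine.apply #map0(%i)
  %v = affine.load %A[%idx] : memref<100xf32>
  return %v : f32
}
// After canonicalization, the applied map is composed into the load's map
// and the affine.apply is dropped:
//   %v = affine.load %A[%i + 1] : memref<100xf32>
```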
When performing A->B->C conversion, an operation may still refer to an operand of A. This makes it necessary to unmap through multiple levels of replacement for a specific value.
PiperOrigin-RevId: 269367859
- take care of symbolic operands with alloc
- add missing check for compose map failure and a test case
- add test cases on strides
- drop an incorrect check for one-to-one-ness
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#132
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/132 from bondhugula:normalize-memrefs 8aebf285fb0d7c19269d85255aed644657e327b7
PiperOrigin-RevId: 269105947
- NFC on any pass/utility logic/output.
- Resolve TODO: the method building loop trip count maps was
creating and deleting affine.apply ops (strictly speaking, transforming
the IR from under the analysis!). Introduce AffineValueMap::difference to
do this correctly (without the need to create any IR).
- Move AffineApplyNormalizer out so that its methods are reusable from
AffineStructures.cpp; add a helper method 'normalize' to it. Fix
AffineApplyNormalizer::renumberOneDim (Issue tensorflow/mlir#89).
- Trim includes on the files touched.
- add a test case for a scenario previously not covered
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#133
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/133 from bondhugula:trip-count-build 7fc34d857f7788f98b641792cafad6f5bd50e47b
PiperOrigin-RevId: 269101118
- add a missing canonicalization pattern to fold memref_cast + dim to
dim (needed to propagate the constant when folding a dynamic shape to
a static one); a small sketch follows this entry
- also fix an outdated/inconsistent comment in StandardOps/Ops.td
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#126
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/126 from bondhugula:quickfix 4566e75e49685c532faffff91d64c5d83d4da524
PiperOrigin-RevId: 269020058
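A sketch of the fold (the example and the exact folded form are assumed here, based on the description above):
```
func @dim_of_cast(%A: memref<4x16xf32>) -> index {
  %0 = memref_cast %A : memref<4x16xf32> to memref<?x16xf32>
  %d = dim %0, 0 : memref<?x16xf32>
  return %d : index
}
// With the new pattern, dim folds through the cast to the static extent:
//   %c4 = constant 4 : index
//   return %c4 : index
```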
PiperOrigin-RevId: 268783645
Use variable captures to make constant folding tests less sensitive to printer/parser implementation details; a small example follows this entry.
See the guidelines at https://github.com/tensorflow/mlir/blob/master/g3doc/TestingGuide.md
PiperOrigin-RevId: 268780812
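A hypothetical FileCheck snippet showing the style this refers to: the SSA name produced by the printer is captured into a variable instead of being hardcoded:
```
// CHECK-LABEL: func @fold_add
// CHECK:   %[[CST:.*]] = constant 5 : i32
// CHECK:   return %[[CST]] : i32
func @fold_add() -> i32 {
  %0 = constant 2 : i32
  %1 = constant 3 : i32
  %2 = addi %0, %1 : i32
  return %2 : i32
}
```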
This defines a set of initial utilities for inlining a region (or a FuncOp), and defines a simple inliner pass for testing purposes.
A new dialect interface is defined, DialectInlinerInterface, that allows dialects to override hooks controlling inlining legality. The interface currently provides the following hooks, but these are just preliminary and should be changed/added to/modified as necessary:
* isLegalToInline
  - Determine if a region can be inlined into a region of this dialect, *or* if an operation of this dialect can be inlined into a given region.
* shouldAnalyzeRecursively
  - Determine if an operation with regions should be analyzed recursively for legality. This allows for child operations to be closed off from the legality checks for operations like lambdas.
* handleTerminator
  - Process a terminator that has been inlined.
This CL adds support for inlining StandardOps, but other dialects will be added in followups as necessary. A minimal before/after sketch follows this entry.
PiperOrigin-RevId: 267426759
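A minimal sketch (my own example) of what the testing inliner does for a standard call, assuming the simple single-block case where the return terminator is replaced by the returned value:
```
func @callee(%arg0: i32) -> i32 {
  %0 = addi %arg0, %arg0 : i32
  return %0 : i32
}
func @caller(%arg0: i32) -> i32 {
  %0 = call @callee(%arg0) : (i32) -> i32
  return %0 : i32
}
// After inlining @callee into @caller:
//   func @caller(%arg0: i32) -> i32 {
//     %0 = addi %arg0, %arg0 : i32
//     return %0 : i32
//   }
```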
for replaceMemRefUsesWith / pipeline-data-transfer
- address remaining comments from PR tensorflow/mlir#87 to improve test coverage for
pipeline-data-transfer/replaceAllMemRefUsesWith
- remove dead tag allocs the same way they are removed for the replaced buffers
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#106
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/106 from bondhugula:followup 9e868666d047e8d43e5f82f43e4093b838c710fa
PiperOrigin-RevId: 267144774
- introduce a utility to convert memrefs with non-identity layout maps to
ones with identity layout maps: convert the type and rewrite/remap all
of its uses (a small sketch follows this entry)
- add this utility to the -simplify-affine-structures pass for testing
purposes
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#104
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/104 from bondhugula:memref-normalize f2c914aa1890e8860326c9e33f9aa160b3d65e6d
PiperOrigin-RevId: 266985317
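A sketch of the normalization on a tiled layout map (my own example; the output shape and indexing are what I would expect from folding the layout map into the type, not copied from the change):
```
// Before: a 1-d memref with a non-identity (tiled) layout map.
func @before(%i: index) -> f32 {
  %A = alloc() : memref<64xf32, (d0) -> (d0 floordiv 4, d0 mod 4)>
  %v = affine.load %A[%i] : memref<64xf32, (d0) -> (d0 floordiv 4, d0 mod 4)>
  return %v : f32
}
// After normalization: an identity-layout memref, with the map folded
// into the shape and into every access.
func @after(%i: index) -> f32 {
  %A = alloc() : memref<16x4xf32>
  %v = affine.load %A[%i floordiv 4, %i mod 4] : memref<16x4xf32>
  return %v : f32
}
```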
- the [begin, end) range identified for copying could end in the middle of
the block, which makes hoisting invalid in some cases. Change the range
identification to always end at the end of the block.
- add a test case to exercise these (with the fast mem capacity set to minimal so
that single-element memref buffers are generated at the innermost loop)
- the location of the begin/end of the block range for data copying was
being confused with the insertion points for copy-in and copy-out code.
In cases where we choose to hoist transfers, these are separate.
- when copy loops are single-iteration ones, promote their bodies at
the end of the pass.
- change the default fast mem space to 1 (setting it to zero made it
generate DMA ops that won't verify in the default case, since the
DMA ops have a check for src/dest memref spaces being different).
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Co-Authored-By: Mehdi Amini <joker.eph@gmail.com>
Closes tensorflow/mlir#88
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/88 from bondhugula:datacopy 88697267c45e850c3ced87671e16e4a930c02a42
PiperOrigin-RevId: 266980911
This interface will allow for providing hooks to interoperate with operation folding. The first hook, 'shouldMaterializeInto', will allow for controlling which region materialized constants are inserted into. The folder generally materializes constants into the top-level isolated region; this hook allows for materializing into a lower level ancestor region if it is more profitable/correct.
PiperOrigin-RevId: 266702972
AffineForOps themselves are pure and can be removed if they have no internal operations.
PiperOrigin-RevId: 266481293
- extend canonicalizeMapAndOperands to propagate constant operands into
the map's expressions (and thus drop those operands); a small sketch
follows this entry.
- canonicalizeMapAndOperands previously only dropped duplicate and
unused operands; however, operands that were constants were
retained.
This change makes IR maps/expressions generated by various
utilities/passes even simpler; it also makes some of the test checks more
accurate and simpler, e.g., '0' instead of symbol(%{{.*}}).
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#107
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/107 from bondhugula:canonicalize-maps c889a51486d14fbf7db489f224f881e7e1ff7d72
PiperOrigin-RevId: 266085289
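A sketch of the effect (my own example, using a #map alias in the syntax of this era):
```
#map = (d0)[s0] -> (d0 + s0)

func @prop_const(%i: index) -> index {
  %c42 = constant 42 : index
  // Before: the constant is passed as a symbol operand of the map.
  %r = affine.apply #map(%i)[%c42]
  return %r : index
}
// After canonicalizeMapAndOperands, the constant is folded into the map
// expression and the operand is dropped:
//   %r = affine.apply (d0) -> (d0 + 42)(%i)
```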
- fix operand mapping while cloning sub-blocks to jam; it was incorrect
for imperfect nests where a def/use was across sub-blocks
- strengthen/generalize the first test case to cover the previously
missed scenario
- clean up the other cases while at it.
Previously, unroll-jamming the following nest
```
affine.for %arg0 = 0 to 2048 {
%0 = alloc() : memref<512x10xf32>
affine.for %arg1 = 0 to 10 {
%1 = affine.load %0[%arg0, %arg1] : memref<512x10xf32>
}
dealloc %0 : memref<512x10xf32>
}
```
would yield
```
%0 = alloc() : memref<512x10xf32>
%1 = affine.apply #map0(%arg0)
%2 = alloc() : memref<512x10xf32>
affine.for %arg1 = 0 to 10 {
%4 = affine.load %0[%arg0, %arg1] : memref<512x10xf32>
%5 = affine.apply #map0(%arg0)
%6 = affine.load %0[%5, %arg1] : memref<512x10xf32>
}
dealloc %0 : memref<512x10xf32>
%3 = affine.apply #map0(%arg0)
dealloc %0 : memref<512x10xf32>
```
instead of
```
module {
affine.for %arg0 = 0 to 2048 step 2 {
%0 = alloc() : memref<512x10xf32>
%1 = affine.apply #map0(%arg0)
%2 = alloc() : memref<512x10xf32>
affine.for %arg1 = 0 to 10 {
%4 = affine.load %0[%arg0, %arg1] : memref<512x10xf32>
%5 = affine.apply #map0(%arg0)
%6 = affine.load %2[%5, %arg1] : memref<512x10xf32>
}
dealloc %0 : memref<512x10xf32>
%3 = affine.apply #map0(%arg0)
dealloc %2 : memref<512x10xf32>
  }
}
```
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#98
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/98 from bondhugula:ujam ddbc853f69b5608b3e8ff9b5ac1f6a5a0bb315a4
PiperOrigin-RevId: 266073460
Refactor replaceAllMemRefUsesWith to split it into two methods: the new
method does the replacement on a single op, and is used by the existing
one.
- make the methods return LogicalResult instead of bool
- Earlier, when replacement failed (due to non-dereferencing uses of the
memref), the set of ops that had already been processed would have
been replaced, leaving the IR in an inconsistent state. Now, a
pass is made over all ops to first check for non-dereferencing
uses, and then replacement is performed. No test cases were affected
because all clients of this method were first checking for
non-dereferencing uses before calling this method (for other reasons).
This isn't true for a use case in another upcoming PR (scalar
replacement); clients can now bail out with consistent IR on failure
of replaceAllMemRefUsesWith. Add a test case.
- multiple dereferencing uses of the same memref in a single op are
possible (we have no such use cases/scenarios), and this has always
remained unsupported. Add an assertion for this.
- minor fix to another pipeline-data-transfer test case.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#87
PiperOrigin-RevId: 265808183
This fixes a bug when folding ops that have inner ops while those inner ops are still being visited.
PiperOrigin-RevId: 265475780
This will allow for naming values the same as existing SSA values for regions attached to operations that are isolated from above. This fits in with how the system already allows separate name scopes for sibling regions. This name shadowing can be enabled in the custom parser of operations by setting the 'enableNameShadowing' flag to true when calling 'parseRegion'.
```
%arg = constant 10 : i32
foo.op {
  %arg = constant 10 : i32
}
```
PiperOrigin-RevId: 264255999
All 'getValue' variants now require that the index is valid, queryable via 'isValidIndex'. 'getSplatValue' now requires that the attribute is a proper splat. This allows for querying these methods on DenseElementAttr with all possible value types; e.g. float, int, APInt, etc. This also allows for removing unnecessary conversions to Attribute that really want the underlying value.
PiperOrigin-RevId: 263437337
via GreedyPatternRewriteDriver::replaceOp.
This fixes a bug where ops inside the parent op are visited even though the parent op has been removed.
PiperOrigin-RevId: 261953580
Explicit copying to contiguous buffers is a standard technique to avoid
conflict misses and TLB misses, and improve hardware prefetching
performance. When done in conjunction with cache tiling, it nearly
eliminates all cache conflict and TLB misses, and a single hardware
prefetch stream is needed per data tile.
- generalize/extend the DMA generation pass (renamed to the data copying pass) to
perform either point-wise explicit copies to fast memory buffers or
DMAs (depending on a command-line option). All logic is the same as in the
erstwhile -dma-generate.
- -affine-dma-generate is now renamed -affine-data-copy; when the -dma flag is
provided, DMAs are generated; otherwise explicit copy loops are generated
(point-wise) by default.
- point-wise copying could be used for CPUs (or GPUs); some indicative
performance numbers follow, with a "C" version of the MLIR compiled with
and without this optimization (about a 2x improvement here).
With a matmul on 4096^2 matrices on a single core of an Intel Core i7
Skylake i7-8700K with clang 8.0.0:
clang -O3: 518s
clang -O3 with MLIR tiling (128x128): 24.5s
clang -O3 with MLIR tiling + data copying: 12.4s
(code equivalent to test/Transforms/data-copy.mlir func @matmul)
- fix some misleading comments.
- change default fast-mem space to 0 (more intuitive now with the
default copy generation using point-wise copies instead of DMAs)
On a simple 3-d matmul loop nest, code generated with -affine-data-copy:
```
affine.for %arg3 = 0 to 4096 step 128 {
affine.for %arg4 = 0 to 4096 step 128 {
%0 = affine.apply #map0(%arg3, %arg4)
%1 = affine.apply #map1(%arg3, %arg4)
%2 = alloc() : memref<128x128xf32, 2>
// Copy-in Out matrix.
affine.for %arg5 = 0 to 128 {
%5 = affine.apply #map2(%arg3, %arg5)
affine.for %arg6 = 0 to 128 {
%6 = affine.apply #map2(%arg4, %arg6)
%7 = load %arg2[%5, %6] : memref<4096x4096xf32>
affine.store %7, %2[%arg5, %arg6] : memref<128x128xf32, 2>
}
}
affine.for %arg5 = 0 to 4096 step 128 {
%5 = affine.apply #map0(%arg3, %arg5)
%6 = affine.apply #map1(%arg3, %arg5)
%7 = alloc() : memref<128x128xf32, 2>
// Copy-in LHS.
affine.for %arg6 = 0 to 128 {
%11 = affine.apply #map2(%arg3, %arg6)
affine.for %arg7 = 0 to 128 {
%12 = affine.apply #map2(%arg5, %arg7)
%13 = load %arg0[%11, %12] : memref<4096x4096xf32>
affine.store %13, %7[%arg6, %arg7] : memref<128x128xf32, 2>
}
}
%8 = affine.apply #map0(%arg5, %arg4)
%9 = affine.apply #map1(%arg5, %arg4)
%10 = alloc() : memref<128x128xf32, 2>
// Copy-in RHS.
affine.for %arg6 = 0 to 128 {
%11 = affine.apply #map2(%arg5, %arg6)
affine.for %arg7 = 0 to 128 {
%12 = affine.apply #map2(%arg4, %arg7)
%13 = load %arg1[%11, %12] : memref<4096x4096xf32>
affine.store %13, %10[%arg6, %arg7] : memref<128x128xf32, 2>
}
}
// Compute.
affine.for %arg6 = #map7(%arg3) to #map8(%arg3) {
affine.for %arg7 = #map7(%arg4) to #map8(%arg4) {
affine.for %arg8 = #map7(%arg5) to #map8(%arg5) {
%11 = affine.load %7[-%arg3 + %arg6, -%arg5 + %arg8] : memref<128x128xf32, 2>
%12 = affine.load %10[-%arg5 + %arg8, -%arg4 + %arg7] : memref<128x128xf32, 2>
%13 = affine.load %2[-%arg3 + %arg6, -%arg4 + %arg7] : memref<128x128xf32, 2>
%14 = mulf %11, %12 : f32
%15 = addf %13, %14 : f32
affine.store %15, %2[-%arg3 + %arg6, -%arg4 + %arg7] : memref<128x128xf32, 2>
}
}
}
dealloc %10 : memref<128x128xf32, 2>
dealloc %7 : memref<128x128xf32, 2>
}
%3 = affine.apply #map0(%arg3, %arg4)
%4 = affine.apply #map1(%arg3, %arg4)
// Copy out result matrix.
affine.for %arg5 = 0 to 128 {
%5 = affine.apply #map2(%arg3, %arg5)
affine.for %arg6 = 0 to 128 {
%6 = affine.apply #map2(%arg4, %arg6)
%7 = affine.load %2[%arg5, %arg6] : memref<128x128xf32, 2>
store %7, %arg2[%5, %6] : memref<4096x4096xf32>
}
}
dealloc %2 : memref<128x128xf32, 2>
}
}
```
With -affine-data-copy -dma:
```
affine.for %arg3 = 0 to 4096 step 128 {
%0 = affine.apply #map3(%arg3)
%1 = alloc() : memref<128xf32, 2>
%2 = alloc() : memref<1xi32>
affine.dma_start %arg2[%arg3], %1[%c0], %2[%c0], %c128_0 : memref<4096xf32>, memref<128xf32, 2>, memref<1xi32>
affine.dma_wait %2[%c0], %c128_0 : memref<1xi32>
%3 = alloc() : memref<1xi32>
affine.for %arg4 = 0 to 4096 step 128 {
%5 = affine.apply #map0(%arg3, %arg4)
%6 = affine.apply #map1(%arg3, %arg4)
%7 = alloc() : memref<128x128xf32, 2>
%8 = alloc() : memref<1xi32>
affine.dma_start %arg0[%arg3, %arg4], %7[%c0, %c0], %8[%c0], %c16384, %c4096, %c128_2 : memref<4096x4096xf32>, memref<128x128xf32, 2>, memref<1xi32>
affine.dma_wait %8[%c0], %c16384 : memref<1xi32>
%9 = affine.apply #map3(%arg4)
%10 = alloc() : memref<128xf32, 2>
%11 = alloc() : memref<1xi32>
affine.dma_start %arg1[%arg4], %10[%c0], %11[%c0], %c128_1 : memref<4096xf32>, memref<128xf32, 2>, memref<1xi32>
affine.dma_wait %11[%c0], %c128_1 : memref<1xi32>
affine.for %arg5 = #map3(%arg3) to #map5(%arg3) {
affine.for %arg6 = #map3(%arg4) to #map5(%arg4) {
%12 = affine.load %7[-%arg3 + %arg5, -%arg4 + %arg6] : memref<128x128xf32, 2>
%13 = affine.load %10[-%arg4 + %arg6] : memref<128xf32, 2>
%14 = affine.load %1[-%arg3 + %arg5] : memref<128xf32, 2>
%15 = mulf %12, %13 : f32
%16 = addf %14, %15 : f32
affine.store %16, %1[-%arg3 + %arg5] : memref<128xf32, 2>
}
}
dealloc %11 : memref<1xi32>
dealloc %10 : memref<128xf32, 2>
dealloc %8 : memref<1xi32>
dealloc %7 : memref<128x128xf32, 2>
}
%4 = affine.apply #map3(%arg3)
affine.dma_start %1[%c0], %arg2[%arg3], %3[%c0], %c128 : memref<128xf32, 2>, memref<4096xf32>, memref<1xi32>
affine.dma_wait %3[%c0], %c128 : memref<1xi32>
dealloc %3 : memref<1xi32>
dealloc %2 : memref<1xi32>
dealloc %1 : memref<128xf32, 2>
}
```
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#50
PiperOrigin-RevId: 261221903
In the backward slice computation, BlockArguments coming from function arguments represent a natural boundary for the traversal and should not trigger llvm_unreachable.
This CL also improves the error message and adds a relevant test.
PiperOrigin-RevId: 260118630
Clipping creates non-affine memory accesses, so use std_load and std_store instead of affine_load and affine_store (a sketch of why follows this entry).
In the future we may also want a fill with the neutral element rather than a clip; this would make the accesses affine if we wanted more analyses and transformations to happen post lowering to pointwise copies.
PiperOrigin-RevId: 260110503
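A hypothetical illustration (mine, not from the change) of why a clipped access cannot be an affine load: the clamped index goes through cmpi/select, which is not an affine function of the loop IVs, so a standard load is required:
```
func @clipped_read(%in: memref<?xf32>, %i: index, %n: index) -> f32 {
  %c1 = constant 1 : index
  %last = subi %n, %c1 : index
  %inbounds = cmpi "slt", %i, %n : index
  // Clamp the index to the last valid element.
  %idx = select %inbounds, %i, %last : index
  // %idx is not an affine function of %i, so affine.load cannot be used.
  %v = load %in[%idx] : memref<?xf32>
  return %v : f32
}
```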
This mode analyzes which operations are legalizable to the given target if a conversion were to be applied, i.e. no rewrites are ever performed even on success. This mode is useful for device partitioning or other utilities that may want to analyze the effect of conversion to different targets before performing it.
The analysis method currently just fills a provided set with the operations that were found to be legalizable. This can be extended in the future to capture more information as necessary.
PiperOrigin-RevId: 259987105
This CL fixes an oversight with dealing with loops in slicing analysis.
The forward slice computation properly propagates through loops but not the backward slice.
Add relevant unit tests.
PiperOrigin-RevId: 259903396
Remove hardcoded SSA names and make use of CHECK-LABEL directives.
PiperOrigin-RevId: 259767803
The loop parallelism detection utility only collects the affine.load and
affine.store operations appearing inside the loop to analyze the access
patterns for the absence of dependences. However, any operation, including
unregistered operations, can appear in the body of an affine loop. If such
an operation has side effects, the result of the parallelism analysis is incorrect.
Conservatively assume affine loops are not parallel in the presence of operations
other than affine.load, affine.store, affine.for, and affine.terminator, since
such operations may have side effects (a small example follows this entry).
This required updating the loop-fusion unit test that relies on parallelism
analysis and was exercising loop fusion in the presence of an unregistered
operation.
PiperOrigin-RevId: 259560935
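A small example (mine) of the kind of loop that is now conservatively treated as not parallel because of the unregistered op in its body:
```
func @maybe_parallel(%A: memref<10xf32>, %B: memref<10xf32>) {
  affine.for %i = 0 to 10 {
    %v = affine.load %A[%i] : memref<10xf32>
    // Unregistered op: it may have side effects, so the loop is
    // conservatively assumed to be not parallel.
    "foo.opaque"(%v) : (f32) -> ()
    affine.store %v, %B[%i] : memref<10xf32>
  }
  return
}
```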
- introduce parseRegionArgumentList (similar to parseOperandList) to parse a
list of region arguments with a delimiter
- allows defining custom parsing for ops with a multiple/variadic number of
region arguments
- use this on the gpu.launch op (although the latter has a fixed number
of region arguments)
- add a test dialect op to test region argument list parsing (with the
no-delimiter case)
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closes tensorflow/mlir#40
PiperOrigin-RevId: 259442536
This CL moves LoopParametricTiling into test/lib as a pass for purely testing purposes.
PiperOrigin-RevId: 259300264
This cl enforces that the conversion of the type signatures for regions, and thus their entry blocks, is handled via ConversionPatterns. A new hook 'applySignatureConversion' is added to the ConversionPatternRewriter to perform the desired conversion on a region. This also means that the handling of rewriting the signature of a FuncOp is moved to a pattern. A default implementation is provided via 'mlir::populateFuncOpTypeConversionPattern'. This removes the hacky implicit 'dynamically legal' status of FuncOp that was present previously, and leaves it up to the user to decide when/how to convert the signature of a function.
PiperOrigin-RevId: 259161999
PiperOrigin-RevId: 258956693
This CL introduces a simple loop utility function which rewrites the bounds and step of a loop so as to become mappable on a regular grid of processors whose identifiers are given by SSA values.
A corresponding unit test is added.
For example, using CUDA terminology, and assuming a 2-d grid with processorIds = [blockIdx.x, threadIdx.x] and numProcessors = [gridDim.x, blockDim.x], the loop:
```
loop.for %i = %lb to %ub step %step {
...
}
```
is rewritten into a version resembling the following pseudo-IR:
```
loop.for %i = %lb + threadIdx.x + blockIdx.x * blockDim.x to %ub
step %gridDim.x * blockDim.x {
...
}
```
PiperOrigin-RevId: 258945942
This CL adapts the recently introduced parametric tiling to have an API matching the tiling
of AffineForOp. The transformation using stripmineSink is more general and produces imperfectly nested loops.
Perfect nesting invariants of the tiled version are obtained by selectively applying hoisting of ops to isolate perfectly nested bands. Such hoisting may fail to produce a perfect loop nest in cases where ForOps transitively depend on enclosing induction variables. In such cases, the API provides a LogicalResult return, but the SimpleParametricLoopTilingPass does not currently use this result.
A new unit test is added with a triangular loop for which the perfect nesting property does not hold. For this example, the old behavior was to produce IR that did not verify (some use was not dominated by its def).
PiperOrigin-RevId: 258928309
This cl begins a large refactoring over how signature types are converted in the DialectConversion infrastructure. The signatures of blocks are now converted on-demand when an operation held by that block is being converted. This allows for handling the case where a region is created as part of a pattern, something that wasn't possible previously.
This cl also generalizes the region signature conversion used by FuncOp to work on any region of any operation. This generalization allows for removing the 'apply*Conversion' functions that were specific to FuncOp/ModuleOp. The implementation currently uses a new hook on TypeConverter, 'convertRegionSignature', but this should ideally be removed in favor of using Patterns. That depends on adding support to the PatternRewriter used by ConversionPattern to allow applying signature conversions to regions, which should be coming in a followup.
PiperOrigin-RevId: 258645733
This explicit tag is useful in several ways:
*) It simplifies how to mark sub-sections of a dialect as explicitly unsupported, e.g. my target supports all operations in the foo dialect except for these select few. This is useful for partial lowerings between dialects.
*) Partial conversions will now verify that operations that were explicitly marked as illegal have been converted. This provides some guarantee that the operations that need to be lowered by a specific pass will be.
PiperOrigin-RevId: 258582879
Multiple (perfectly) nested loops with independent bounds can be combined into
a single loop and then subdivided into blocks of arbitrary size for load
balancing or more efficient parallelism exploitation. However, MLIR wants to
preserve the multi-dimensional multi-loop structure at higher levels of
abstraction. Introduce a transformation that coalesces nested loops with
independent bounds so that they can be further subdivided by tiling; a small
sketch follows this entry.
PiperOrigin-RevId: 258151016
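A sketch of the idea (my own example; the div/mod op names used to recover the original IVs follow the standard dialect of this vintage and are an assumption):
```
func @nest() {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c8 = constant 8 : index
  %c16 = constant 16 : index
  // Two perfectly nested loops with independent bounds.
  loop.for %i = %c0 to %c16 step %c1 {
    loop.for %j = %c0 to %c8 step %c1 {
      "foo.use"(%i, %j) : (index, index) -> ()
    }
  }
  return
}
// After coalescing, a single loop runs over the product of the trip counts
// and the original IVs are recovered with a div/mod pair:
//   loop.for %k = %c0 to %c128 step %c1 {
//     %i = divis %k, %c8 : index
//     %j = remis %k, %c8 : index
//     "foo.use"(%i, %j) : (index, index) -> ()
//   }
```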
These ops should not belong to the std dialect.
This CL extracts them into their own dialect and updates the corresponding conversions and tests.
PiperOrigin-RevId: 258123853
When using a RewritePattern and replacing an operation with an existing value, that value may have already been replaced by something else. This cl ensures that only the final value is used when applying rewrites.
PiperOrigin-RevId: 258058488
This CL splits the lowering of affine to LLVM into 2 parts:
1. affine -> std
2. std -> LLVM
The conversions mostly consist of splitting concerns between the affine and non-affine worlds from the existing conversions.
Short-circuiting of affine `if` conditions was never tested or exercised and is removed in the process; it can be reintroduced later if needed.
LoopParametricTiling.cpp is updated to reflect the newly added ForOp::build.
PiperOrigin-RevId: 257794436
Standard load and store operations are evolving to be separated from the Affine
constructs. Special affine.load/store have been introduced to uphold the
restrictions of the Affine control flow constructs on their operands.
EDSC-produced loads and stores were originally intended to uphold those
restrictions as well so they should use affine.load/store instead of
std.load/store.
PiperOrigin-RevId: 257443307
Change the AsmPrinter to number values breadth-first so that values in adjacent regions can have the same name. This allows for ModuleOp to contain operations that produce results. This also standardizes the special name of region entry arguments to "arg[0-9]+" now that Functions are also operations.
PiperOrigin-RevId: 257225069
Parametric tiling can be used to extract outer loops with fixed number of
iterations. This in turn enables mapping to GPU kernels on a fixed grid
independently of the range of the original loops, which may be unknown
statically, making the kernel adaptable to different sizes. Provide a utility
function that also computes the parametric tile size given the range of the
loop. Exercise the utility function through a simple pass that applies it to
all top-level loop nests. Permutability or parallelism checks must be
performed before calling this utility function in actual passes.
Note that parametric tiling cannot be implemented in a purely affine way,
although it can be encoded using semi-affine maps. The choice to implement it
on standard loops is guided by them being the common representation between
Affine loops, Linalg and GPU kernels.
PiperOrigin-RevId: 257180251
This is an important step in allowing for the top-level of the IR to be extensible. FuncOp and ModuleOp contain all of the necessary functionality, while using the existing operation infrastructure. As an interim step, many of the usages of Function and Module, including the name, will remain the same. In the future, many of these will be relaxed to allow for many different types of top-level operations to co-exist.
PiperOrigin-RevId: 256427100