pub struct WarpExecuteOnLane0Operation<'c> { /* private fields */ }
Expand description
A warp_execute_on_lane_0
operation. Executes operations in the associated region on thread #0 of aSPMD program.
warp_execute_on_lane_0
is an operation used to bridge the gap between
vector programming and SPMD programming model like GPU SIMT. It allows to
trivially convert a region of vector code meant to run on a multiple threads
into a valid SPMD region and then allows incremental transformation to
distribute vector operations on the threads.
Any code present in the region would only be executed on first thread/lane
based on the laneid
operand. The laneid
operand is an integer ID between
[0, warp_size
). The warp_size
attribute indicates the number of lanes in
a warp.
Operands are vector values distributed on all lanes that may be used by the single lane execution. The matching region argument is a vector of all the values of those lanes available to the single active lane. The distributed dimension is implicit based on the shape of the operand and argument. the properties of the distribution may be described by extra attributes (e.g. affine map).
Return values are distributed on all lanes using laneId as index. The vector is distributed based on the shape ratio between the vector type of the yield and the result type. If the shapes are the same this means the value is broadcasted to all lanes. In the future the distribution can be made more explicit using affine_maps and will support having multiple Ids.
Therefore the warp_execute_on_lane_0
operations allow to implicitly copy
between lane0 and the lanes of the warp. When distributing a vector
from lane0 to all the lanes, the data are distributed in a block cyclic way.
For exemple vector<64xf32>
gets distributed on 32 threads and map to
vector<2xf32>
where thread 0 contains vector[0] and vector[1].
During lowering values passed as operands and return value need to be visible to different lanes within the warp. This would usually be done by going through memory.
The region is not isolated from above. For values coming from the parent region not going through operands only the lane 0 value will be accesible so it generally only make sense for uniform values.
Example:
// Execute in parallel on all threads/lanes.
vector.warp_execute_on_lane_0 (%laneid)[32] {
// Serial code running only on thread/lane 0.
...
}
// Execute in parallel on all threads/lanes.
This may be lowered to an scf.if region as below:
// Execute in parallel on all threads/lanes.
%cnd = arith.cmpi eq, %laneid, %c0 : index
scf.if %cnd {
// Serial code running only on thread/lane 0.
...
}
// Execute in parallel on all threads/lanes.
When the region has operands and/or return values:
// Execute in parallel on all threads/lanes.
%0 = vector.warp_execute_on_lane_0(%laneid)[32]
args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
^bb0(%arg0 : vector<128xi32>) :
// Serial code running only on thread/lane 0.
...
vector.yield %1 : vector<32xf32>
}
// Execute in parallel on all threads/lanes.
values at the region boundary would go through memory:
// Execute in parallel on all threads/lanes.
...
// Store the data from each thread into memory and Synchronization.
%tmp0 = memreg.alloc() : memref<128xf32>
%tmp1 = memreg.alloc() : memref<32xf32>
%cnd = arith.cmpi eq, %laneid, %c0 : index
vector.store %v0, %tmp0[%laneid] : memref<128xf32>, vector<4xf32>
some_synchronization_primitive
scf.if %cnd {
// Serialized code running only on thread 0.
// Load the data from all the threads into a register from thread 0. This
// allow threads 0 to access data from all the threads.
%arg0 = vector.load %tmp0[%c0] : memref<128xf32>, vector<128xf32>
...
// Store the data from thread 0 into memory.
vector.store %1, %tmp1[%c0] : memref<32xf32>, vector<32xf32>
}
// Synchronization and load the data in a block cyclic way so that the
// vector is distributed on all threads.
some_synchronization_primitive
%0 = vector.load %tmp1[%laneid] : memref<32xf32>, vector<32xf32>
// Execute in parallel on all threads/lanes.
Implementations§
source§impl<'c> WarpExecuteOnLane0Operation<'c>
impl<'c> WarpExecuteOnLane0Operation<'c>
sourcepub fn as_operation(&self) -> &Operation<'c>
pub fn as_operation(&self) -> &Operation<'c>
Returns a generic operation.
sourcepub fn builder(
context: &'c Context,
location: Location<'c>
) -> WarpExecuteOnLane0OperationBuilder<'c, Unset, Unset, Unset, Unset, Unset>
pub fn builder( context: &'c Context, location: Location<'c> ) -> WarpExecuteOnLane0OperationBuilder<'c, Unset, Unset, Unset, Unset, Unset>
Creates a builder.