Struct melior::dialect::ods::vector::WarpExecuteOnLane0Operation

source ·

pub struct WarpExecuteOnLane0Operation<'c> { /* private fields */ }

Expand description

A warp_execute_on_lane_0 operation. Executes operations in the associated region on thread #0 of aSPMD program.

warp_execute_on_lane_0 is an operation used to bridge the gap between vector programming and SPMD programming model like GPU SIMT. It allows to trivially convert a region of vector code meant to run on a multiple threads into a valid SPMD region and then allows incremental transformation to distribute vector operations on the threads.

Any code present in the region would only be executed on first thread/lane based on the laneid operand. The laneid operand is an integer ID between [0, warp_size). The warp_size attribute indicates the number of lanes in a warp.

Operands are vector values distributed on all lanes that may be used by the single lane execution. The matching region argument is a vector of all the values of those lanes available to the single active lane. The distributed dimension is implicit based on the shape of the operand and argument. the properties of the distribution may be described by extra attributes (e.g. affine map).

Return values are distributed on all lanes using laneId as index. The vector is distributed based on the shape ratio between the vector type of the yield and the result type. If the shapes are the same this means the value is broadcasted to all lanes. In the future the distribution can be made more explicit using affine_maps and will support having multiple Ids.

Therefore the warp_execute_on_lane_0 operations allow to implicitly copy between lane0 and the lanes of the warp. When distributing a vector from lane0 to all the lanes, the data are distributed in a block cyclic way. For exemple vector<64xf32> gets distributed on 32 threads and map to vector<2xf32> where thread 0 contains vector[0] and vector[1].

During lowering values passed as operands and return value need to be visible to different lanes within the warp. This would usually be done by going through memory.

The region is not isolated from above. For values coming from the parent region not going through operands only the lane 0 value will be accesible so it generally only make sense for uniform values.

Example:

// Execute in parallel on all threads/lanes.
vector.warp_execute_on_lane_0 (%laneid)[32] {
  // Serial code running only on thread/lane 0.
  ...
}
// Execute in parallel on all threads/lanes.

This may be lowered to an scf.if region as below:

  // Execute in parallel on all threads/lanes.
  %cnd = arith.cmpi eq, %laneid, %c0 : index
  scf.if %cnd {
    // Serial code running only on thread/lane 0.
    ...
  }
  // Execute in parallel on all threads/lanes.

When the region has operands and/or return values:

// Execute in parallel on all threads/lanes.
%0 = vector.warp_execute_on_lane_0(%laneid)[32]
args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
^bb0(%arg0 : vector<128xi32>) :
  // Serial code running only on thread/lane 0.
  ...
  vector.yield %1 : vector<32xf32>
}
// Execute in parallel on all threads/lanes.

values at the region boundary would go through memory:

// Execute in parallel on all threads/lanes.
...
// Store the data from each thread into memory and Synchronization.
%tmp0 = memreg.alloc() : memref<128xf32>
%tmp1 = memreg.alloc() : memref<32xf32>
%cnd = arith.cmpi eq, %laneid, %c0 : index
vector.store %v0, %tmp0[%laneid] : memref<128xf32>, vector<4xf32>
some_synchronization_primitive
scf.if %cnd {
  // Serialized code running only on thread 0.
  // Load the data from all the threads into a register from thread 0. This
  // allow threads 0 to access data from all the threads.
  %arg0 = vector.load %tmp0[%c0] : memref<128xf32>, vector<128xf32>
  ...
  // Store the data from thread 0 into memory.
  vector.store %1, %tmp1[%c0] : memref<32xf32>, vector<32xf32>
}
// Synchronization and load the data in a block cyclic way so that the
// vector is distributed on all threads.
some_synchronization_primitive
%0 = vector.load %tmp1[%laneid] : memref<32xf32>, vector<32xf32>
// Execute in parallel on all threads/lanes.

Struct melior::dialect::ods::vector::WarpExecuteOnLane0Operation

Implementations§

impl<'c> WarpExecuteOnLane0Operation<'c>

pub fn name() -> &'static str

pub fn as_operation(&self) -> &Operation<'c>

pub fn builder( context: &'c Context, location: Location<'c> ) -> WarpExecuteOnLane0OperationBuilder<'c, Unset, Unset, Unset, Unset, Unset>

pub fn results(&self) -> impl Iterator<Item = OperationResult<'c, '_>>

pub fn laneid(&self) -> Result<Value<'c, '_>, Error>

pub fn args(&self) -> impl Iterator<Item = Value<'c, '_>>

pub fn warp_region(&self) -> Result<RegionRef<'c, '_>, Error>

pub fn warp_size(&self) -> Result<IntegerAttribute<'c>, Error>

pub fn set_warp_size(&mut self, value: IntegerAttribute<'c>)

Trait Implementations§

impl<'c> From<WarpExecuteOnLane0Operation<'c>> for Operation<'c>

fn from(operation: WarpExecuteOnLane0Operation<'c>) -> Self

impl<'c> TryFrom<Operation<'c>> for WarpExecuteOnLane0Operation<'c>

type Error = Error

fn try_from(operation: Operation<'c>) -> Result<Self, Self::Error>

Auto Trait Implementations§

impl<'c> RefUnwindSafe for WarpExecuteOnLane0Operation<'c>

impl<'c> !Send for WarpExecuteOnLane0Operation<'c>

impl<'c> !Sync for WarpExecuteOnLane0Operation<'c>

impl<'c> Unpin for WarpExecuteOnLane0Operation<'c>

impl<'c> UnwindSafe for WarpExecuteOnLane0Operation<'c>

Blanket Implementations§

impl<T> Any for Twhere T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for Twhere T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for Twhere T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for Twhere U: From<T>,

fn into(self) -> U

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

impl<T> Any for T
where T: 'static + ?Sized,

impl<T> Borrow<T> for T
where T: ?Sized,

impl<T> BorrowMut<T> for T
where T: ?Sized,

impl<T, U> Into<U> for T
where U: From<T>,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,