Struct melior::dialect::ods::scf::ForallOperation


pub struct ForallOperation<'c> { /* private fields */ }


A forall operation. Evaluate a block multiple times in parallel.

scf.forall is a target-independent multi-dimensional parallel region application operation. It has exactly one block that represents the parallel body and it takes index operands that specify lower bounds, upper bounds and steps.
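The iteration space described above can be pictured with a small plain-Rust model. This is only a sketch under assumptions: the function name `iteration_points` and the use of `i64` bounds are illustrative and are not part of melior or MLIR, and the model assumes strictly positive steps.

```rust
/// Enumerates every point of a multi-dimensional iteration space in
/// row-major order. Plain-Rust model only; it does not touch MLIR or melior.
/// Assumption: all steps are strictly positive.
fn iteration_points(lower: &[i64], upper: &[i64], step: &[i64]) -> Vec<Vec<i64>> {
    assert!(lower.len() == upper.len() && lower.len() == step.len());
    assert!(step.iter().all(|&s| s > 0), "steps must be positive");

    // Start with a single empty prefix and extend it one dimension at a time.
    let mut points = vec![Vec::new()];
    for dim in 0..lower.len() {
        let mut next = Vec::new();
        for prefix in &points {
            let mut i = lower[dim];
            while i < upper[dim] {
                let mut point = prefix.clone();
                point.push(i);
                next.push(point);
                i += step[dim];
            }
        }
        points = next;
    }
    points
}
```

Each returned point corresponds to one application of the body, i.e. one virtual thread in the terminology used below. The row-major order is an artifact of the model; the op itself does not prescribe an execution order.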

The op also takes a variadic number of tensor operands (shared_outs). The future buffers corresponding to these tensors are shared among all threads. Shared tensors should be accessed via their corresponding block arguments. If multiple threads write to a shared buffer in a racy fashion, these writes will execute in some unspecified order. Tensors that are not shared can be used inside the body (i.e., the op is not isolated from above); however, if a use of such a tensor bufferizes to a memory write, the tensor is privatized, i.e., a thread-local copy of the tensor is used. This ensures that memory side effects of a thread are not visible to other threads (or in the parent body), apart from explicitly shared tensors.

The name “thread” conveys the fact that the parallel execution is mapped (i.e. distributed) to a set of virtual threads of execution, one function application per thread. Further lowerings are responsible for specifying how this is materialized on concrete hardware resources.

An optional mapping is an attribute array that specifies processing units with their dimension and how each one remaps 1-1 to a set of concrete processing element resources (e.g. a CUDA grid dimension or a level of concrete nested async parallelism). It is expressed via any attribute that implements the device mapping interface. It is the responsibility of the lowering mechanism to interpret the mapping attributes in the context of the concrete target the op is lowered to, or to ignore them when the specification is ill-formed or unsupported for a particular target.

The only allowed terminator is scf.forall.in_parallel. scf.forall returns one value per shared_out operand. The actions of the in_parallel terminators specify how to combine the partial results of all parallel invocations into a full value, in some unspecified order. The “destination” of each such op must be a shared_out block argument of the scf.forall op.

The actions involved in constructing the return values are further described by tensor.parallel_insert_slice.

scf.forall acts as an implicit synchronization point.

When the parallel function body has side effects, their order is unspecified across threads.
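As a rough analogy for these semantics, the sketch below uses `std::thread::scope` from the Rust standard library. It is not how scf.forall is lowered and it does not use melior; the function name `parallel_add` and the tiling scheme are assumptions made for illustration. Each scoped thread writes only its own disjoint slice of the output, loosely mirroring the insertion of per-thread results into a shared_outs destination, and the end of the scope acts as the synchronization point.

```rust
use std::thread;

/// Adds `a` and `b` tile by tile, one scoped thread per tile.
/// Each thread writes only its own disjoint slice of `out`, which plays the
/// role of a shared_outs block argument in this model.
fn parallel_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0);
    assert!(a.len() == b.len() && a.len() == out.len());

    thread::scope(|s| {
        for (idx, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = idx * tile;
            let end = start + out_tile.len();
            let (a_tile, b_tile) = (&a[start..end], &b[start..end]);
            // The threads run in an unspecified order relative to each other.
            s.spawn(move || {
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = x + y;
                }
            });
        }
    });
    // All threads have been joined here: the implicit synchronization point.
}
```

Because the slices are disjoint, the result does not depend on the order in which the threads run. Overlapping writes, which the op permits with unspecified ordering, cannot be expressed this way in safe Rust, so the model covers only the race-free case.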

scf.forall can be printed in two different ways depending on whether the loop is normalized or not. The loop is ‘normalized’ when all lower bounds are equal to zero and steps are equal to one. In that case, lowerBound and step operands will be omitted during printing.
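The normalization rule can be stated as a small predicate. The sketch below is a plain-Rust model with hypothetical names (`is_normalized`, `loop_header`) and constant integer bounds; it imitates the two header shapes used in the examples on this page and is not the MLIR printer.

```rust
/// Returns true when every lower bound is 0 and every step is 1,
/// which is the condition the text calls "normalized".
fn is_normalized(lower_bounds: &[i64], steps: &[i64]) -> bool {
    lower_bounds.iter().all(|&lb| lb == 0) && steps.iter().all(|&s| s == 1)
}

/// Builds a loop header string in one of the two printed forms.
/// Illustrative only: real operands may be SSA values rather than constants.
fn loop_header(ivs: &[&str], lower: &[i64], upper: &[i64], steps: &[i64]) -> String {
    let join = |v: &[i64]| {
        v.iter()
            .map(|x| x.to_string())
            .collect::<Vec<_>>()
            .join(", ")
    };
    let ivs = ivs.join(", ");
    if is_normalized(lower, steps) {
        // Lower bounds and steps are omitted in the normalized form.
        format!("scf.forall ({}) in ({})", ivs, join(upper))
    } else {
        format!(
            "scf.forall ({}) = ({}) to ({}) step ({})",
            ivs,
            join(lower),
            join(upper),
            join(steps)
        )
    }
}
```

For instance, under this model bounds `(0, 0)` to `(4, 8)` with steps `(1, 1)` print in the short `in (4, 8)` form, while steps `(2, 2)` force the full `= ... to ... step ...` form.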

Normalized loop example:

//
// Sequential context.
//
%matmul_and_pointwise:2 = scf.forall (%thread_id_1, %thread_id_2) in
    (%num_threads_1, %num_threads_2) shared_outs(%o1 = %C, %o2 = %pointwise)
  -> (tensor<?x?xT>, tensor<?xT>) {
  //
  // Parallel context, each thread with id = (%thread_id_1, %thread_id_2)
  // runs its version of the code.
  //
  %sA = tensor.extract_slice %A[f((%thread_id_1, %thread_id_2))]:
    tensor<?x?xT> to tensor<?x?xT>
  %sB = tensor.extract_slice %B[g((%thread_id_1, %thread_id_2))]:
    tensor<?x?xT> to tensor<?x?xT>
  %sC = tensor.extract_slice %o1[h((%thread_id_1, %thread_id_2))]:
    tensor<?x?xT> to tensor<?x?xT>
  %sD = matmul ins(%sA, %sB) outs(%sC)

  %spointwise = tensor.extract_slice %o2[i((%thread_id_1, %thread_id_2))]:
    tensor<?xT> to tensor<?xT>
  %sE = add ins(%spointwise) outs(%sD)

  scf.forall.in_parallel {
    tensor.parallel_insert_slice %sD into %o1[h((%thread_id_1, %thread_id_2))]:
      tensor<?x?xT> into tensor<?x?xT>

    tensor.parallel_insert_slice %spointwise into %o2[i((%thread_id_1, %thread_id_2))]:
      tensor<?xT> into tensor<?xT>
  }
}
// Implicit synchronization point.
// Sequential context.
//

Loop with loop bounds example:

//
// Sequential context.
//
%pointwise = scf.forall (%i, %j) = (0, 0) to (%dim1, %dim2)
  step (%tileSize1, %tileSize2) shared_outs(%o1 = %out)
  -> (tensor<?x?xT>) {
  //
  // Parallel context.
  //
  %sA = tensor.extract_slice %A[%i, %j][%tileSize1, %tileSize2][1, 1]
    : tensor<?x?xT> to tensor<?x?xT>
  %sB = tensor.extract_slice %B[%i, %j][%tileSize1, %tileSize2][1, 1]
    : tensor<?x?xT> to tensor<?x?xT>
  %sC = tensor.extract_slice %o1[%i, %j][%tileSize1, %tileSize2][1, 1]
    : tensor<?x?xT> to tensor<?x?xT>

  %add = map {"arith.addf"} ins(%sA, %sB) outs(%sC)

  scf.forall.in_parallel {
    tensor.parallel_insert_slice %add into
      %o1[%i, %j][%tileSize1, %tileSize2][1, 1]
      : tensor<?x?xT> into tensor<?x?xT>
  }
}
// Implicit synchronization point.
// Sequential context.
//

Example with mapping attribute:

//
// Sequential context. Here `mapping` is expressed as GPU thread mapping
// attributes
//
%matmul_and_pointwise:2 = scf.forall (%thread_id_1, %thread_id_2) in
    (%num_threads_1, %num_threads_2) shared_outs(...)
  -> (tensor<?x?xT>, tensor<?xT>) {
  //
  // Parallel context, each thread with id = **(%thread_id_2, %thread_id_1)**
  // runs its version of the code.
  //
  scf.forall.in_parallel {
    ...
  }
} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
// Implicit synchronization point.
// Sequential context.
//

Example with privatized tensors:

%t0 = ...
%t1 = ...
%r = scf.forall ... shared_outs(%o = %t0) -> tensor<?xf32> {
  // %t0 and %t1 are privatized. %t0 is definitely copied for each thread
  // because the scf.forall op's %t0 use bufferizes to a memory
  // write. In the absence of other conflicts, %t1 is copied only if there
  // are uses of %t1 in the body that bufferize to a memory read and to a
  // memory write.
  "some_use"(%t0)
  "some_use"(%t1)
}
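Privatization can be modeled in plain Rust as each thread working on its own copy of a tensor that it writes to. The sketch below is an analogy only: the name `privatized_writes` is hypothetical, a `Vec<f32>` stands in for a tensor, and the copy is made unconditionally, whereas bufferization copies only when it detects a conflict.

```rust
use std::thread;

/// Every thread takes a thread-local copy of `t1`, writes its id into the
/// first element of that copy, and returns the copy. The original `t1` is
/// only read, so no thread observes another thread's write.
fn privatized_writes(t1: &[f32], num_threads: usize) -> Vec<Vec<f32>> {
    thread::scope(|s| {
        // Spawn all threads first, then join them in spawn order.
        let handles: Vec<_> = (0..num_threads)
            .map(|id| {
                s.spawn(move || {
                    let mut local = t1.to_vec(); // thread-local copy
                    if let Some(first) = local.first_mut() {
                        *first = id as f32;
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<Vec<f32>>>()
    })
}
```

After the call, the caller's `t1` is unchanged and each returned vector differs only in the element its thread wrote, which is the visibility guarantee the description attributes to privatized tensors.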


Implementations§

impl<'c> ForallOperation<'c>

pub fn name() -> &'static str

pub fn as_operation(&self) -> &Operation<'c>

pub fn builder(context: &'c Context, location: Location<'c>) -> ForallOperationBuilder<'c, Unset, Unset, Unset, Unset, Unset, Unset, Unset, Unset, Unset>

pub fn results(&self) -> impl Iterator<Item = OperationResult<'c, '_>>

pub fn dynamic_lower_bound(&self) -> Result<impl Iterator<Item = Value<'c, '_>>, Error>

pub fn dynamic_upper_bound(&self) -> Result<impl Iterator<Item = Value<'c, '_>>, Error>

pub fn dynamic_step(&self) -> Result<impl Iterator<Item = Value<'c, '_>>, Error>

pub fn outputs(&self) -> Result<impl Iterator<Item = Value<'c, '_>>, Error>

pub fn region(&self) -> Result<RegionRef<'c, '_>, Error>

pub fn static_lower_bound(&self) -> Result<Attribute<'c>, Error>

pub fn set_static_lower_bound(&mut self, value: Attribute<'c>)

pub fn static_upper_bound(&self) -> Result<Attribute<'c>, Error>

pub fn set_static_upper_bound(&mut self, value: Attribute<'c>)

pub fn static_step(&self) -> Result<Attribute<'c>, Error>

pub fn set_static_step(&mut self, value: Attribute<'c>)

pub fn mapping(&self) -> Result<ArrayAttribute<'c>, Error>

pub fn set_mapping(&mut self, value: ArrayAttribute<'c>)

pub fn remove_mapping(&mut self) -> Result<(), Error>

Trait Implementations§

impl<'c> From<ForallOperation<'c>> for Operation<'c>

fn from(operation: ForallOperation<'c>) -> Self

impl<'c> TryFrom<Operation<'c>> for ForallOperation<'c>

type Error = Error

fn try_from(operation: Operation<'c>) -> Result<Self, Self::Error>

Auto Trait Implementations§

impl<'c> RefUnwindSafe for ForallOperation<'c>

impl<'c> !Send for ForallOperation<'c>

impl<'c> !Sync for ForallOperation<'c>

impl<'c> Unpin for ForallOperation<'c>

impl<'c> UnwindSafe for ForallOperation<'c>

Blanket Implementations§

impl<T> Any for T where T: 'static + ?Sized

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>

fn into(self) -> U

impl<T, U> TryFrom<U> for T where U: Into<T>

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>
