pub struct ForallOperation<'c> { /* private fields */ }

A forall operation. Evaluates a block multiple times in parallel.

scf.forall is a target-independent multi-dimensional parallel region application operation. It has exactly one block that represents the parallel body, and it takes index operands that specify lower bounds, upper bounds, and steps.

The op also takes a variadic number of tensor operands (shared_outs). The future buffers corresponding to these tensors are shared among all threads. Shared tensors should be accessed via their corresponding block arguments. If multiple threads write to a shared buffer in a racy fashion, these writes will execute in some unspecified order. Tensors that are not shared can be used inside the body (i.e., the op is not isolated from above); however, if a use of such a tensor bufferizes to a memory write, the tensor is privatized, i.e., a thread-local copy of the tensor is used. This ensures that memory side effects of a thread are not visible to other threads (or in the parent body), apart from explicitly shared tensors.
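The sharing/privatization rule above can be illustrated with an analogy in plain Rust (this is not melior API; `forall_with_privatization` and its buffers are hypothetical). A "shared_out" is split into disjoint per-thread chunks, while a non-shared buffer that a thread writes to is privatized as a thread-local clone, so its mutation stays invisible to other threads and to the parent:

```rust
use std::thread;

// Illustrative analogy (not melior API): `shared_out` plays the role of a
// shared_outs operand, split into disjoint per-thread regions; `scratch`
// plays the role of a non-shared tensor whose writes force privatization.
// Assumes shared_out.len() is divisible by the thread count.
fn forall_with_privatization(shared_out: &mut Vec<i64>, scratch: &Vec<i64>) {
    let num_threads = 4;
    let chunk = shared_out.len() / num_threads;
    thread::scope(|s| {
        for (tid, slice) in shared_out.chunks_mut(chunk).enumerate() {
            // Privatization: each thread mutates its own copy of `scratch`,
            // so the side effect is never visible outside the thread.
            let mut private = scratch.clone();
            s.spawn(move || {
                for x in private.iter_mut() {
                    *x += tid as i64; // write to the thread-local copy only
                }
                for (i, out) in slice.iter_mut().enumerate() {
                    *out = private[i % private.len()]; // write to the shared region
                }
            });
        }
    }); // joining the scope is the implicit synchronization point
}
```

Because each thread writes only its own chunk of `shared_out`, the result is deterministic even though the threads run in an unspecified order, and `scratch` is unchanged afterwards.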

The name “thread” conveys the fact that the parallel execution is mapped (i.e. distributed) to a set of virtual threads of execution, one function application per thread. Further lowerings are responsible for specifying how this is materialized on concrete hardware resources.

The optional mapping attribute is an array that specifies, for each loop dimension, how it remaps 1-1 to a set of concrete processing element resources (e.g. a CUDA grid dimension or a level of concrete nested async parallelism). It is expressed via any attribute that implements the device mapping interface. It is the responsibility of the lowering mechanism to interpret the mapping attributes in the context of the concrete target the op is lowered to, or to ignore them when the specification is ill-formed or unsupported for a particular target.

The only allowed terminator is scf.forall.in_parallel. scf.forall returns one value per shared_out operand. The actions of the in_parallel terminator specify how to combine the partial results of all parallel invocations into a full value, in some unspecified order. The “destination” of each such action must be a shared_out block argument of the scf.forall op.

The actions involved in constructing the return values are further described by tensor.parallel_insert_slice.

scf.forall acts as an implicit synchronization point.

When the parallel function body has side effects, their order is unspecified across threads.
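The combining and synchronization semantics above can be sketched with scoped Rust threads (an analogy, not melior API; `tiled_double` and `tile` are hypothetical names). Each thread computes a tile and writes it into a disjoint region of the shared output, mirroring what parallel insert slices into a shared_out do; the join at the end of the scope plays the role of the implicit synchronization point:

```rust
use std::thread;

// Illustrative analogy (not melior API): each thread produces a partial
// result for its own tile. The per-thread writes land in an unspecified
// order, but because the tiles are disjoint the combined value is
// deterministic once all threads have joined.
fn tiled_double(input: &[i64], tile: usize) -> Vec<i64> {
    let mut output = vec![0i64; input.len()];
    thread::scope(|s| {
        for (in_tile, out_tile) in input.chunks(tile).zip(output.chunks_mut(tile)) {
            s.spawn(move || {
                for (o, i) in out_tile.iter_mut().zip(in_tile) {
                    *o = i * 2; // this thread's partial result
                }
            });
        }
    }); // implicit synchronization: every tile is written before this returns
    output
}
```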

scf.forall can be printed in two different ways depending on whether the loop is normalized or not. The loop is ‘normalized’ when all lower bounds are equal to zero and steps are equal to one. In that case, lowerBound and step operands will be omitted during printing.
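The printer's normalization rule reduces to a simple predicate; a minimal sketch in Rust (`is_normalized` is a hypothetical helper, not part of the melior API):

```rust
// A loop is "normalized" when every lower bound is 0 and every step is 1;
// only then may the printed form omit the lowerBound and step operands.
fn is_normalized(lower_bounds: &[i64], steps: &[i64]) -> bool {
    lower_bounds.iter().all(|&lb| lb == 0) && steps.iter().all(|&st| st == 1)
}
```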

Normalized loop example:

//
// Sequential context.
//
%matmul_and_pointwise:2 = scf.forall (%thread_id_1, %thread_id_2) in
    (%num_threads_1, %num_threads_2) shared_outs(%o1 = %C, %o2 = %pointwise)
  -> (tensor<?x?xT>, tensor<?xT>) {
  //
  // Parallel context, each thread with id = (%thread_id_1, %thread_id_2)
  // runs its version of the code.
  //
  %sA = tensor.extract_slice %A[f((%thread_id_1, %thread_id_2))]:
    tensor<?x?xT> to tensor<?x?xT>
  %sB = tensor.extract_slice %B[g((%thread_id_1, %thread_id_2))]:
    tensor<?x?xT> to tensor<?x?xT>
  %sC = tensor.extract_slice %o1[h((%thread_id_1, %thread_id_2))]:
    tensor<?x?xT> to tensor<?x?xT>
  %sD = matmul ins(%sA, %sB) outs(%sC)

  %spointwise = tensor.extract_slice %o2[i((%thread_id_1, %thread_id_2))]:
    tensor<?xT> to tensor<?xT>
  %sE = add ins(%spointwise) outs(%sD)

  scf.forall.in_parallel {
    tensor.parallel_insert_slice %sD into %o1[h((%thread_id_1, %thread_id_2))]:
      tensor<?x?xT> into tensor<?x?xT>

    tensor.parallel_insert_slice %spointwise into %o2[i((%thread_id_1, %thread_id_2))]:
      tensor<?xT> into tensor<?xT>
  }
}
// Implicit synchronization point.
// Sequential context.
//

Loop with loop bounds example:

//
// Sequential context.
//
%pointwise = scf.forall (%i, %j) = (0, 0) to (%dim1, %dim2)
  step (%tileSize1, %tileSize2) shared_outs(%o = %out)
  -> (tensor<?x?xT>) {
  //
  // Parallel context.
  //
  %sA = tensor.extract_slice %A[%i, %j][%tileSize1, %tileSize2][1, 1]
    : tensor<?x?xT> to tensor<?x?xT>
  %sB = tensor.extract_slice %B[%i, %j][%tileSize1, %tileSize2][1, 1]
    : tensor<?x?xT> to tensor<?x?xT>
  %sC = tensor.extract_slice %o[%i, %j][%tileSize1, %tileSize2][1, 1]
    : tensor<?x?xT> to tensor<?x?xT>

  %add = map {"arith.addf"} ins(%sA, %sB) outs(%sC)

  scf.forall.in_parallel {
    tensor.parallel_insert_slice %add into
      %o[%i, %j][%tileSize1, %tileSize2][1, 1]
      : tensor<?x?xT> into tensor<?x?xT>
  }
}
// Implicit synchronization point.
// Sequential context.
//

Example with mapping attribute:

//
// Sequential context. Here `mapping` is expressed as GPU thread mapping
// attributes
//
%matmul_and_pointwise:2 = scf.forall (%thread_id_1, %thread_id_2) in
    (%num_threads_1, %num_threads_2) shared_outs(...)
  -> (tensor<?x?xT>, tensor<?xT>) {
  //
  // Parallel context, each thread with id = (%thread_id_2, %thread_id_1)
  // runs its version of the code.
  //
   scf.forall.in_parallel {
     ...
  }
} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
// Implicit synchronization point.
// Sequential context.
//

Example with privatized tensors:

%t0 = ...
%t1 = ...
%r = scf.forall ... shared_outs(%o = %t0) -> tensor<?xf32> {
  // %t0 and %t1 are privatized. %t0 is definitely copied for each thread
  // because the scf.forall op's %t0 use bufferizes to a memory
  // write. In the absence of other conflicts, %t1 is copied only if there
  // are uses of %t1 in the body that bufferize to a memory read and to a
  // memory write.
  "some_use"(%t0)
  "some_use"(%t1)
}

Implementations

impl<'c> ForallOperation<'c>

pub fn name() -> &'static str

Returns a name.

pub fn as_operation(&self) -> &Operation<'c>

Returns a generic operation.

pub fn builder(context: &'c Context, location: Location<'c>) -> ForallOperationBuilder<'c, Unset, Unset, Unset, Unset, Unset, Unset, Unset, Unset, Unset>

Creates a builder.

pub fn results(&self) -> impl Iterator<Item = OperationResult<'c, '_>>

pub fn dynamic_lower_bound(&self) -> Result<impl Iterator<Item = Value<'c, '_>>, Error>

pub fn dynamic_upper_bound(&self) -> Result<impl Iterator<Item = Value<'c, '_>>, Error>

pub fn dynamic_step(&self) -> Result<impl Iterator<Item = Value<'c, '_>>, Error>

pub fn outputs(&self) -> Result<impl Iterator<Item = Value<'c, '_>>, Error>

pub fn region(&self) -> Result<RegionRef<'c, '_>, Error>

pub fn static_lower_bound(&self) -> Result<Attribute<'c>, Error>

pub fn set_static_lower_bound(&mut self, value: Attribute<'c>)

pub fn static_upper_bound(&self) -> Result<Attribute<'c>, Error>

pub fn set_static_upper_bound(&mut self, value: Attribute<'c>)

pub fn static_step(&self) -> Result<Attribute<'c>, Error>

pub fn set_static_step(&mut self, value: Attribute<'c>)

pub fn mapping(&self) -> Result<ArrayAttribute<'c>, Error>

pub fn set_mapping(&mut self, value: ArrayAttribute<'c>)

pub fn remove_mapping(&mut self) -> Result<(), Error>

Trait Implementations

impl<'c> From<ForallOperation<'c>> for Operation<'c>

fn from(operation: ForallOperation<'c>) -> Self

Converts to this type from the input type.

impl<'c> TryFrom<Operation<'c>> for ForallOperation<'c>

type Error = Error

The type returned in the event of a conversion error.

fn try_from(operation: Operation<'c>) -> Result<Self, Self::Error>

Performs the conversion.

Auto Trait Implementations

impl<'c> RefUnwindSafe for ForallOperation<'c>

impl<'c> !Send for ForallOperation<'c>

impl<'c> !Sync for ForallOperation<'c>

impl<'c> Unpin for ForallOperation<'c>

impl<'c> UnwindSafe for ForallOperation<'c>

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self). That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.