Intel Cilk Plus

From Bauman National Library
This page was last modified on 1 June 2016, at 17:15.
Intel Cilk Plus
Paradigm imperative, structured, parallel programming
Designed by CS Laboratory at MIT
First appeared 1994
Typing discipline static
License open source code for the original Cilk Cilk and proprietary to Intel Cilk Plus
Website http://cilk.com/
Dialects
Cilk, Cilk++
Influenced by
С

Cilk, Cilk++ and Cilk Plus are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages. Originally developed in the 1990s at the Massachusetts Institute of Technology (MIT) in the group of Charles E. Leiserson, Cilk was later commercialized as Cilk++ by a spinoff company, Cilk Arts. That company was subsequently acquired by Intel, which increased compatibility with existing C and C++ code, calling the result Cilk Plus.

History

The Cilk programming language grew out of three separate projects at the MIT Laboratory for Computer Science:

  • Theoretical work on scheduling multi-threaded applications.
  • StarTech – a parallel chess program built to run on the Thinking Machines Corporation's Connection Machine model CM-5.
  • PCM/Threaded-C – a C-based package for scheduling continuation-passing-style threads on the CM-5

In April 1994 the three projects were combined and christened "Cilk". The name Cilk is not an acronym, but an allusion to "nice threads" (silk) and the C programming language. The Cilk-1 compiler was released in September 1994.

Features

  • Keywords: language additions to C/C++ to specify task parallelism in an application.
  • Reducers: a mechanism to eliminate contention for shared variables in a lock-free manner.
  • #pragma simd: a directive to the compiler that a loop contains data parallelism and should be vectorized.
  • Array Notations: Language additions to C/C++ to specify data parallelism for arrays or sections of arrays.
  • Elemental Functions: an annotation specifying that a function can be called with array or array sections as arguments, as well as scalar values.

The principle behind the design of the Cilk language is that the programmer should be responsible for exposing the parallelism, identifying elements that can safely be executed in parallel; it should then be left to the run-time environment, particularly the scheduler, to decide during execution how to actually divide the work between processors. It is because these responsibilities are separated that a Cilk program can run without rewriting on any number of processors, including one.

Task parallelism: spawn and sync

Cilk's main addition to C are two keywords that together allow writing task-parallel programs.

  • The spawn keyword, when preceding a function call (spawn f(x)), indicates that the function call (f(x)) can safely run in parallel with the statements following it in the calling function. Note that the scheduler is not obligated to run this procedure in parallel; the keyword merely alerts the scheduler that it can do so.
  • A sync statement indicates that execution of the current function cannot proceed until all previously spawned function calls have completed. This is an example of a barrier method.

(In Cilk Plus, the keywords are spelled _Cilk_spawn and _Cilk_sync, or cilk_spawn and cilk_sync if the Cilk Plus headers are included.)

Below is a recursive implementation of the Fibonacci function in Cilk, with parallel recursive calls, which demonstrates the cilk, spawn, and sync keywords. (Cilk program code is not numbered; the numbers have been added only to make the discussion easier to follow.)

 1 cilk int fib(int n)
 2 {
 3     if (n < 2) {
 4         return n;
 5     }
 6     else {
 7        int x, y;
 8  
 9        x = spawn fib(n - 1);
10        y = spawn fib(n - 2);
11  
12        sync;
13  
14        return x + y;
15     }
16 }

If this code was executed by a single processor to determine the value of fib(2), that processor would create a frame for fib(2), and execute lines 01 through 06. On line 07, it would create spaces in the frame to hold the values of x and y. On line 09, the processor would have to suspend the current frame, create a new frame to execute the procedure fib(1), execute the code of that frame until reaching a return statement, and then resume the fib(2) frame with the value of fib(1) placed into fib(2)'s x variable. On the next line, it would need to suspend again to execute fib(0) and place the result in fib(2)'s y variable.

When the code is executed on a multiprocessor machine, however, execution proceeds differently. One processor starts the execution of fib(2); when it reaches line 09, however, the spawn keyword modifying the call to fib(n-1) tells the processor that it can safely give the job to a second processor: this second processor can create a frame for fib(1), execute its code, and store its result in fib(2)'s frame when it finishes; the first processor continues executing the code of fib(2) at the same time. A processor is not obligated to assign a spawned procedure elsewhere; if the machine only has two processors and the second is still busy on fib(1) when the processor executing fib(2) gets to the procedure call, the first processor will suspend fib(2) and execute fib(0) itself, as it would if it were the only processor. Of course, if another processor is available, then it will be called into service, and all three processors would be executing separate frames simultaneously.

(The preceding description is not entirely accurate. Even though the common terminology for discussing Cilk refers to processors making the decision to spawn off work to other processors, it is actually the scheduler which assigns procedures to processors for execution, using a policy called work-stealing, described later.)

If the processor executing fib(2) were to execute line 14 before both of the other processors had completed their frames, it would generate an incorrect result or an error; fib(2) would be trying to add the values stored in x and y, but one or both of those values would be missing. This is the purpose of the sync keyword, which we see in line 12: it tells the processor executing a frame that it must suspend its own execution until all the procedure calls it has spawned off have returned. When fib(2) is allowed to proceed past the sync statement in line 12, it can only be because fib(1) and fib(0) have completed and placed their results in x and y, making it safe to perform calculations on those results.

Work-stealing

The Cilk scheduler uses a policy called "work-stealing" to divide procedure execution efficiently among multiple processors. Again, it is easiest to understand if we look first at how Cilk code is executed on a single-processor machine.

The processor maintains a stack on which it places each frame that it has to suspend in order to handle a procedure call. If it is executing fib(2), and encounters a recursive call to fib(1), it will save fib(2)'s state, including its variables and where the code suspended execution, and put that state on the stack. It will not take a suspended state off the stack and resume execution until the procedure call that caused the suspension, and any procedures called in turn by that procedure, have all been fully executed.

With multiple processors, things of course change. Each processor still has a stack for storing frames whose execution has been suspended; however, these stacks are more like deques, in that suspended states can be removed from either end. A processor can still only remove states from its own stack from the same end that it puts them on; however, any processor which is not currently working (having finished its own work, or not yet having been assigned any) will pick another processor at random, through the scheduler, and try to "steal" work from the opposite end of their stack—suspended states, which the stealing processor can then begin to execute. The states which get stolen are the states that the processor stolen from would get around to executing last.

Inlets

The two remaining Cilk keywords are slightly more advanced, and concern the use of inlets. Ordinarily, when a Cilk procedure is spawned, it can return its results to the parent procedure only by putting those results in a variable in the parent's frame, as we assigned the results of our spawned procedure calls in the example to x and y.

The alternative is to use an inlet. An inlet is a function internal to a Cilk procedure which handles the results of a spawned procedure call as they return. One major reason to use inlets is that all the inlets of a procedure are guaranteed to operate atomically with regards to each other and to the parent procedure, thus avoiding the bugs that could occur if the multiple returning procedures tried to update the same variables in the parent frame at the same time.

  • The inlet keyword identifies a function defined within the procedure as an inlet.
  • The abort keyword can only be used inside an inlet; it tells the scheduler that any other procedures that have been spawned off by the parent procedure can safely be aborted.

Inlets were removed when Cilk became Cilk++, and are not present in Cilk Plus.

Reducers and hyperobjects

Cilk++ added a kind of objects called hyperobjects, that allow multiple strands to share state without race conditions and without using explicit locks. Each strand has a view on the hyperobject that it can use and update; when the strands synchronize, the views are combined in a way specified by the programmer.

The most common type of hyperobject is a reducer, which corresponds to the reduction clause in OpenMP or to the algebraic notion of a monoid. Each reducer has an identity element and an associative operation that combines two values. The archetypal reducer is summation of numbers: the identity element is zero, and the associative reduce operation computes a sum. This reducer is built into Cilk++ and Cilk Plus:

// Compute ∑ foo(i) for i from 0 to N, in parallel.
cilk::reducer_opadd<float> result(0);
cilk_for (int i = 0; i < N; i++)
    result += foo(i);

Other reducers can be used to construct linked lists or strings, and programmers can define custom reducers.

A limitation of hyperobjects is that they provide only limited determinacy. Burckhardt et al. point out that even the sum reducer can result in non-deterministic behavior, showing a program that may produce either 1 or 2 depending on the scheduling order:

void add1(cilk::reducer_opadd<int> &amp;r) { r++; }
// ...
cilk::reducer_opadd<int> r(0);
cilk_spawn add1(r);
if (r == 0) { r++; }
cilk_sync;
output(r.get_value());

Parallel loops

Cilk++ added an additional construct, the parallel loop, denoted cilk_for in Cilk Plus. These loops look like

void loop(int *a, int n)
{
    #pragma cilk grainsize = 100  /* optional */
    cilk_for (int i = 0; i < n; i++) {
        a[i] = f(a[i]);
    }
}

This implements the parallel map idiom: the body of the loop, here a call to f followed by an assignment to the array a, is executed for each value of i from zero to n in an indeterminate order. The optional "grain size" pragma determines the coarsening: no work is spawned when less than a hundred elements need to be processed. The implementation is a divide-and-conquer recursion, as if the programmer had written

static void recursion(int *a, int start, int end)
{
    if (end - start <= 100) {  // The 100 here is the grainsize.
        for (int i = start; i < end; i++) {
            a[i] = f(a[i]);
        }
    }
    else {
        int midpoint = start + (end - start) / 2;
        cilk_spawn recursion(a, start, midpoint);
        recursion(a, midpoint, end);
        cilk_sync;
    }
}

void loop(int *a, int n)
{
    recursion(a, 0, n);
}

Array notation

Intel Cilk Plus adds notation that allow users to express high-level operations on entire arrays or sections of arrays; e.g., an axpy-style function that is ordinarily written

 // y ← α x + y
 void axpy(int n, float alpha, const float *x, float *y)
 {
     for (int i = 0; i < n; i++) {
         y[i] += alpha * x[i];
     }
 }

can in Cilk Plus be expressed as

y[0:n] += alpha * x[0:n];

This notation helps the compiler to effectively vectorize the application. Intel Cilk Plus allows C/C++ operations to be applied to multiple array elements in parallel, and also provides a set of built-in functions that can be used to perform vectorized shifts, rotates, and reductions. Similar functionality exists in Fortran 90; Cilk Plus differs in that it never allocates temporary arrays, so memory usage is easier to predict.

Elemental functions

In Cilk Plus, an elemental function is a regular function which can be invoked either on scalar arguments or on array elements in parallel. They are similar to the kernel functions of OpenCL.

#pragma simdt

This pragma gives the compiler permission to vectorize a loop even in cases where auto-vectorization might fail. It is the simplest way to manually apply vectorization.

References

  1. Official website for Cilk Plus
  2. Cilk Project website at MIT
  3. Intel Cilk Plus Developer Zone

The article is written by Fadeev P.V.