User:Aishwinth/sandbox

This is a user sandbox of Aishwinth. You can use it for testing or practicing edits.
This is not the sandbox where you should draft your assigned article for a dashboard.wikiedu.org course.
To find the right sandbox for your assignment, visit your Dashboard course page and follow the Sandbox Draft link for your assigned article in the My Articles section.

Get Help

This is the user sandbox of Aishwinth. A user sandbox is a subpage of the user's user page. It serves as a testing spot and page development space for the user and is not an encyclopedia article. Create or edit your own sandbox here.

Other sandboxes: Main sandbox | Template sandbox

Finished writing a draft article? Are you ready to request review of it by an experienced editor for possible inclusion in Wikipedia? Submit your draft for review!

Loop-level parallelism is a form of parallelism in software programming that is concerned with extracting parallel tasks from loops. The opportunity for loop-level parallelism often arises in computing programs where data is stored in random access data structures. Where a sequential program will iterate over the data structure and operate on indices one at a time, a program exploiting loop-level parallelism will use multiple threads or processes which operate on some or all of the indices at the same time. Such parallelism provides a speedup to overall execution time of the program, typically in line with Amdahl's law.

Dependencies in code[edit]

Transforming a sequential program into a parallel version requires that dependencies within the code are preserved. There are several types of dependencies that can be found:^[1]^[2]

Type	Notation	Description
True (Flow) Dependence	`S1 ->T S2`	A true dependence between S1 and S2 means that S1 writes to a location later read from by S2
Anti Dependence	`S1 ->A S2`	An anti-dependence between S1 and S2 means that S1 reads from a location later written to by S2.
Output Dependence	`S1 ->O S2`	An output dependence between S1 and S2 means that S1 and S2 write to the same location.
Input Dependence	`S1 ->I S2`	An input dependence between S1 and S2 means that S1 and S2 read from the same location.

Anti-Dependence and Output Dependence can be dealt with by giving each process or thread its own copy of variables (known as privatization). However, true dependence must be preserved, requiring process synchronization.^[1]

Example of True Dependence[edit]

S1: int a, b;
S2: a = 2;
S3: b = a + 40;

S2 ->T S3, meaning that S2 has a true dependence on S3 because S2 writes to the variable a, which S3 reads from.

Example of Anti-dependence[edit]

S1: int a, b = 40;
S2: a = b - 38;
S3: b = -1;

S2 ->A S3, meaning that S2 has an anti-dependence on S3 because S2 reads from the variable b before S3 writes to it.

Example of Output-dependence[edit]

S1: int a, b = 40;
S2: a = b - 38;
S3: a = 2;

S2 ->O S3, meaning that S2 has an output dependence on S3 because both write to the variable a.

Example of Input-dependence[edit]

S1: int a, b, c = 2;
S2: a = c - 1;
S3: b = c + 1;

S2 ->I S3, meaning that S2 has an input dependence on S3 because S2 and S3 both read from variable c.

Dependence in loops[edit]

Loop-carried vs loop-independent dependence[edit]

Loops can have two types of dependence:

Loop-carried dependence
Loop-independent dependence

In loop-independent dependence, loops have inter-iteration dependence but do not have dependence between iterations. Each iteration may be treated as a block and performed in parallel without other synchronization efforts.

In the following example code used for swapping the values of two array of length n, there is a loop-independent dependence of S1[i] ->T S3[i].

for (int i = 1; i < n; i ++) {
    S1: tmp = a[i];
    S2: a[i] = b[i];
    S3: b[i] = tmp;
}

In loop-carried dependence, statements in an iteration of a loop depend on statements in another iteration of the loop. Loop-Carried Dependence uses a modified version of the dependence notation seen earlier. The following denotes a statement carries a true dependence to a later iteration, meaning that one iteration of the loop writes to a location read by a subsequent iteration of the loop.

Example of loop-carried dependence where S1[i] ->T S1[i + 1].

for (int i = 1; i < n; i ++) {
    S1: a[i] = a[i - 1] + 1;
}

Loop carried dependence graph[edit]

A Loop-carried dependence graph graphically shows the loop-carried dependencies between iterations. Each iteration is listed as a node on the graph, and directed edges show the true, anti, and output dependencies between each iteration.

Types of Loop-level parallelism[edit]

Loops are one of the most common sources of parallelism found within code, and can be parallelized in a variety of ways, including:

DOALL Parallelism
DOACROSS Parallelism
DOPIPE Parallelism
DISTRIBUTED Loop
HELIX ^[3]

Each implementation varies slightly in how threads work together to manage data. In addition, parallel tasks must somehow be mapped to a processor. These tasks can either be allocated statically or dynamically. Research has shown that load-balancing can be better achieved through some dynamic allocation algorithms than when done dynamically.^[4]

DOALL parallelism[edit]

DOALL parallelism exists when statements within a loop can be executed independently (situations where there is no loop-carried dependence).^[1] For example, the following code does not read from the array a, and does not update the arrays b, c. No iterations have a dependence on any other iteration.

for (int i = 0; i < n; i++) {
    S1: a[i] = b[i] + c[i];
}

Let's say the time of one execution of S1 be $Ts1$ then the execution time for sequential form of above code is $n*Ts1$ , Now because DOALL Parallelism exists when all iterations are independent, speedup may be achieved by executing all iterations in parallel which gives us an execution time of $Ts1$ (the time taken for one iteration in sequential execution).

The following example, using a simplified pseudocode, shows how a loop might be parallelized to execute each iteration independently.

begin_parallelism();
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    end_parallelism();
}
block();

DOACROSS parallelism[edit]

DOACROSS Parallelism exists where iterations of a loop are parallelized by extracting calculations that can be performed independently and running them simultaneously.^[1]^[5]

Synchronization exists to enforce loop-carried dependence.

Consider the following, synchronous loop with dependence S1[i] ->T S1[i + 1].

for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i] + 1;
}

Each loop iteration performs two actions

Calculate a[i - 1] + b[i] + 1
Assign the value to a[i]

Calculating the value a[i - 1] + b[i] + 1, and then performing the assignment can be decomposed into two lines:

int tmp = b[i] + 1;
a[i] = a[i - 1] + tmp;

The first line, int tmp = b[i] + 1;, has no loop-carried dependence. The loop can then be parallelized by computing the temp value in parallel, and then synchronizing the assignment to a[i].

post(0);
for (int i = 1; i < n; i++) {

    int tmp = b[i] + 1;
    wait(i - 1);

    a[i] = a[i - 1] + tmp;
    post(i);
}

DOPIPE parallelism[edit]

DOPIPE Parallelism implements pipelined parallelism for loop-carried dependence where a loop iteration is distributed over multiple, synchronized loops.^[1] The goal of DOPIPE is to act like an assembly line, where one stage is started as soon as there is sufficient data available for it from the previous stage.^[6]

Consider the following, synchronous code with dependence S1[i] ->T S1[i + 1].

for (int i = 1; i < n; i++) {
    S1: a[i] = a[i - 1] + b[i];
    S2: c[i] = c[i] + a[i];
}

S1 must be executed sequentially, but S2 has no loop-carried dependence. S2 could executed in parallel using DOALL Parallelism after performing all calculations needed by S1 in series. However, the speedup is limited if this is done. A better approach is to parallelize such that the S2 corresponding to each S1 executes when said S1 is finished.

Implementing pipelined parallelism results in the following set of loops, where the second loop may execute for an index as soon as the first loop has finished its corresponding index.

for (int i = 1; i < n; i++) {
    S1: a[i] = a[i - 1] + b[i];
        post(i);
}

for (int i = 1; i < n; i++) {
        wait(i);
    S2: c[i] = c[i] + a[i];
}

DISTRIBUTED loop[edit]

When a loop has a loop-carried dependence, another way to parallelize it is to distribute the loop into several different loops. Statements that are not dependent with each other are separated so that these distributed loops can be executed in parallel. For example, consider the following code.

for (int i = 1; i < n; i ++) {
    S1: a[i] = a[i -1] + b[i];
    S2: c[i] = c[i] + d[i];
}

The loop has a loop carried dependence S1[i] ->T S1[i + 1] but S2 and S1 do not have a loop-carried dependence so we can rewrite the code as follows.

loop1: for (int i = 1; i < n; i ++) {
    S1: a[i] = a[i -1] + b[i];
}
loop2: for (int i = 1; i < n; i ++) {
    S2: c[i] = c[i] + d[i];
}

Note that now loop1 and loop2 can be executed in parallel. Instead of single instruction being performed in parallel on different data as in data level parallelism, here different loops perform different tasks on different data. We call this type of parallelism either function or task parallelism.

References[edit]

^ ^a ^b ^c ^d ^e Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 978-1-4822-1118-4.
^ Goff, Gina. "Practical dependence testing" (PDF). SIGPLAN. Retrieved 13 September 2016.
^ Murphy, Niall. "Discovering and exploiting parallelism in DOACROSS loops" (PDF). University of Cambridge. Retrieved 10 September 2016.
^ Kavi, Krishna. "Parallelization of DOALL and DOACROSS Loops-a Survey". {{cite journal}}: |access-date= requires |url= (help); Cite journal requires |journal= (help); External link in |ref= (help)
^ Unnikrishnan, Priya. A Practical Approach to DOACROSS Parallelization (PDF). pp. 219–231. Retrieved 13 September 2016.
^ "DoPipe: An Effective Approach to Parallelize Simulation" (PDF). Intel. Retrieved 13 September 2016.

Category:Parallel computing Category:Articles with example pseudocode

[Solihin-1] Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 978-1-4822-1118-4.

[2] Goff, Gina. "Practical dependence testing" (PDF). SIGPLAN. Retrieved 13 September 2016.

[Parallelism_in_DOACROSS_Loops-3] Murphy, Niall. "Discovering and exploiting parallelism in DOACROSS loops" (PDF). University of Cambridge. Retrieved 10 September 2016.

[4] Kavi, Krishna. "Parallelization of DOALL and DOACROSS Loops-a Survey". {{cite journal}}: |access-date= requires |url= (help); Cite journal requires |journal= (help); External link in |ref= (help)

[5] Unnikrishnan, Priya. A Practical Approach to DOACROSS Parallelization (PDF). pp. 219–231. Retrieved 13 September 2016.

[6] "DoPipe: An Effective Approach to Parallelize Simulation" (PDF). Intel. Retrieved 13 September 2016.

[1]

[2]

[3]

[4]

[5]

[6]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing