User:Mlokhan/sandbox

Loop Parallelization

Loop parallelization is the process of distributing loop iterations across multiple processing elements (or SMT cores) to achieve faster execution. This can be done manually, by marking with compiler directives which iterations may execute in parallel, or automatically by algorithms[1] that detect parallelism within loop bodies. Programs in scientific applications generally consist of computationally intensive loops that perform a repeated set of operations on data. Detecting and extracting parallelism from these loops can therefore yield significant speedup. Apart from parallelization, the maximum achievable speedup is also bounded by other factors, mainly Amdahl's law, granularity, data distribution and the latency of the memory system.[2]

Dependencies

Parallelization transformations change the execution of a program from one thread to multiple threads without affecting its determinism.[3] This holds only if all the dependencies of the single-threaded version are respected.

Consider two statements P and Q within a loop, where the variable i stores the iteration value.

for(i=1; i<=N; i++)
{
    P(i);
    Q(i);
}

where P(i) denotes the i-th instance of statement P.
Then instance Q(j) depends on P(i) if there exists a memory location M such that:

  1. Both P(i) and Q(j) access memory location M.
  2. P(i) executes before Q(j) in sequential execution.
  3. No other instance overwrites the value of M between P(i) and Q(j).

The dependence is called a loop-carried (inter-iteration) dependence if i ≠ j and a loop-independent (intra-iteration) dependence if i = j.[4]
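
As a minimal illustration (the function and array names are purely illustrative, and the arrays are assumed to be initialized), the following loop contains both kinds of dependence:

void dependence_example(int *a, int *b, int *c, int *d, int N)
{
    for (int i = 1; i < N; i++) {
        a[i] = b[i] + 1;     /* P(i): writes a[i]                              */
        c[i] = a[i-1] * 2;   /* Q(i): reads a[i-1], written by P(i-1) in the
                                previous iteration -> loop-carried dependence  */
        d[i] = a[i] + 5;     /* R(i): reads a[i], written by P(i) in the same
                                iteration -> loop-independent dependence       */
    }
}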

Bernstein's conditions can be used to ascertain whether two program segments are independent and can be executed in parallel.[5]

Types of dependencies

Depending on the order in which the two instances access the memory location (read or write), the dependence can be classified as one of the following (a minimal example follows the list):

  1. Flow dependence (also known as true dependence): P(i) writes M and Q(j) later reads it.
  2. Anti dependence: P(i) reads M and Q(j) later writes it.
  3. Output dependence: both P(i) and Q(j) write M.
  4. Input dependence: both P(i) and Q(j) read M.[6]
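
A minimal sketch showing all four cases within a single iteration (names are illustrative; each comment names the earlier access that creates the dependence):

void dependence_types(int *a, int *b, int *c, int *d, int *e, int *f, int N)
{
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];   /* S1: writes a[i]; reads b[i] and c[i]                 */
        d[i] = a[i] * 2;      /* S2: reads a[i] written by S1    -> flow (true) dep.  */
        b[i] = e[i] - 1;      /* S3: writes b[i] read by S1      -> anti dependence   */
        d[i] = e[i] + 7;      /* S4: writes d[i] written by S2   -> output dependence */
        f[i] = c[i] + 3;      /* S5: reads c[i], also read by S1 -> input dependence  */
    }
}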

Types of parallelization techniques

Depending on the method used to preserve the original dependencies, different communication patterns exist between the threads. Based on this, the techniques can be broadly classified into:

  1. Independent Multi-threading (IMT): no communication between the threads (e.g. DOALL).
  2. Cyclic Multi-threading (CMT): communication exists between threads within the loop body, and the graph with threads as nodes and communication as directed edges is cyclic (e.g. DOACROSS and TLS).
  3. Pipelined Multi-threading (PMT): communication exists between threads within the loop body, and the graph with threads as nodes and communication as directed edges is acyclic (e.g. DOPIPE).[3]

Loop parallelization can be achieved by implementing the following techniques:

  1. Parallelization between iterations
  2. Loop distribution
  3. DOALL parallelization
  4. DOACROSS parallelization
  5. DOPIPE parallelization

Parallelization between iterations

Parallelism can be identified by analyzing which loop iterations can run in parallel, which requires analyzing the loop-level dependencies. The most critical are the true dependencies, since anti and output dependencies can be eliminated using privatization. Consider the following code:

for(i=4; i<=n; i++)
S1: a[i] = a[i-2];

Here there is a loop-carried true dependence S1[i] →T S1[i+2]. However, the odd and the even iterations are independent of each other, so it is possible to execute them in separate loops. The execution can be shown as follows:

for(i=4; i<=n; i+=2)     //This loop runs on processor 1
S1: a[i] = a[i-2];
for(i=5; i<=n; i+=2)     //This loop runs on processor 2
S1: a[i] = a[i-2];

Here the even and odd loops can run simultaneously, although the iterations within each loop still execute sequentially. If the execution time of statement S1 is Ts1, then the sequential execution time for N iterations is N*Ts1. With the parallelization shown above, the execution time reduces to (N/2)*Ts1.[7]
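
One possible way to actually run the two partial loops on different threads is sketched below using OpenMP sections (OpenMP is only one way to express this; the function name is illustrative and the array a is assumed to be initialized):

#include <omp.h>

void split_even_odd(double *a, int n)
{
    #pragma omp parallel sections
    {
        #pragma omp section                 /* even iterations, e.g. on processor 1 */
        for (int i = 4; i <= n; i += 2)
            a[i] = a[i-2];

        #pragma omp section                 /* odd iterations, e.g. on processor 2  */
        for (int i = 5; i <= n; i += 2)
            a[i] = a[i-2];
    }
}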

Loop distribution

Some statements in a loop may be independent of the other statements. Such statements can be executed in parallel. Consider the following code:

for(i=1; i<=n; i++)
{
S1: a[i] = a[i-1] + e[i] + f[i];
S2: c[i] = b[i] + d[i] - b[i]*3;
}

Here statements S1 and S2 are independent of each other (there is neither a loop-independent nor a loop-carried dependence between them), so they can be executed separately. There is a loop-carried dependence within S1 (S1[i] →T S1[i+1]), so S1 has to be executed sequentially. On the other hand, all the iterations of S2 can run in parallel, since there is no loop-carried dependence in S2 (DOALL parallelism). Hence the code can be rewritten as follows:

for(i=1; i<=n; i++)     //This loop runs on one processor. Execution time N*Ts1
{
S1: a[i] = a[i-1] + e[i] + f[i];
}
for(i=1; i<=n; i++)    //This loop runs on n processors. Execution time Ts2
{
S2: c[i] = b[i] + d[i] - b[i]*3;
}

If the execution times of S1 and S2 are Ts1 and Ts2 respectively, then the execution time without parallelization for N iterations would be N*(Ts1 + Ts2). After parallelization, the execution time is max(N*Ts1, Ts2).[7]
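
A sketch of the distributed form using OpenMP is shown below (names are illustrative). Here the S1 loop simply runs sequentially and the independent S2 loop is executed as a DOALL loop; running the two loops concurrently on disjoint processors, as assumed in the timing estimate above, would additionally require nested parallelism or explicit threads:

#include <omp.h>

void distributed_loops(double *a, const double *b, double *c, const double *d,
                       const double *e, const double *f, int n)
{
    /* S1 keeps its loop-carried dependence and runs sequentially. */
    for (int i = 1; i <= n; i++)
        a[i] = a[i-1] + e[i] + f[i];

    /* S2 has no loop-carried dependence: its iterations form a DOALL
       loop and are spread across the available threads.            */
    #pragma omp parallel for
    for (int i = 1; i <= n; i++)
        c[i] = b[i] + d[i] - b[i]*3;
}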

DOALL parallelization

In the best case, there are no loop-carried or loop-independent dependencies at all. In that case, all the iterations can be executed in parallel; this is called DOALL parallelization. Consider the following code:

for(i=0; i<=n; i++)    //This loop runs on n processors. Execution time Ts1
{
S1: c[i] = b[i] + d[i];
}

There are no loop-carried or loop-independent dependencies, so all the iterations can run in parallel. If the execution time of S1 is Ts1, then without parallelization the execution time for N iterations would be N*Ts1. After parallelization, the execution time is Ts1.[7]
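
With compiler directives, a DOALL loop can be expressed directly; a minimal OpenMP sketch (illustrative names) is:

#include <omp.h>

void doall(const double *b, double *c, const double *d, int n)
{
    /* No loop-carried dependence: each iteration writes its own c[i]
       from values it never modifies, so the iterations may execute
       in any order and in parallel.                                  */
    #pragma omp parallel for
    for (int i = 0; i <= n; i++)
        c[i] = b[i] + d[i];
}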
For DOALL parallelization, the speedup increases linearly with the number of processors, provided that the number of processors does not exceed the number of iterations. Provably independent DOALL loops generally do not exist to any significant level in general-purpose programs; however, a fair amount of statistical DOALL parallelism does exist in such applications. Statistical DOALL loops do not show any cross-iteration dependencies during profiling, even though the compiler cannot prove their independence. These loops can be parallelized speculatively using lightweight hardware support to detect mis-speculation and roll back execution if needed.[8]

DOACROSS parallelization

DOACROSS parallelism makes it possible to extract parallelism from loops whose iterations have loop-carried dependencies. Consider the following code:

for(i=1; i<= N; i++)
S: e[i] = e[i-1] + f[i]/d[i];

Here statement S has a loop-carried true dependence S[i] →T S[i+1]. However, the partial result f[i]/d[i] can be computed independently for each iteration. This can be exploited as follows:

for(i=1; i<= N; i++)    //this loop has DOALL parallelism
S1: temp[i] = f[i]/d[i];
for(i=1; i<= N; i++)    // this loop executes sequentially
S2:e[i] = e[i-1] + temp[i];

However, this method creates additional storage overhead because of the array temp[i]. Another way of handling the partial loop-carried dependence is DOACROSS parallelism. In DOACROSS parallelism, each iteration is still a parallel task, but synchronisation is used to ensure that the consumer iteration only reads data that has already been produced by the producer iteration. This can be achieved using wait() and post() synchronisation primitives. The code then becomes:

post(0);
for(i=1; i<= N; i++)
{
S1: temp = f[i]/d[i];
    wait(i-1);
S2: e[i] = e[i-1] + temp;
    post(i);
}

S1 has no loop-carried dependence, while S2 does. The temporary variable temp is now a private scalar rather than a shared array, which removes the storage overhead. post(i) signals that the value of e[i] has been produced and is ready for consumption, and wait(i-1) makes the consumer wait until e[i-1] has been produced.

If Ts1 is the execution time of S1 and Ts2 is the execution time of S2, then without DOACROSS the execution time for N iterations would be N*(Ts1 + Ts2). After parallelization, S1 runs in parallel with the statements of other iterations, whereas S2 is serialized; the execution time is therefore Ts1 + N*Ts2. DOACROSS exploits iteration-level parallelism.[7]

In DOACROSS parallelism, data is synchronized and communicated between cores as it is needed, so there is a communication overhead associated with DOACROSS. The communication latency between processors is one of the major factors that determine the effectiveness of DOACROSS parallelism.
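
The post() and wait() primitives above are abstract. One possible realization is sketched below using OpenMP and a C11 atomic counter in place of point-to-point synchronisation; this is a simplification that busy-waits, and a real implementation would typically use semaphores or similar primitives. schedule(static,1) assigns consecutive iterations to different threads so that the independent S1 parts can overlap:

#include <omp.h>
#include <stdatomic.h>

void doacross(double *e, const double *f, const double *d, int N)
{
    atomic_int last_posted = 0;                    /* plays the role of post(0)      */

    #pragma omp parallel for schedule(static, 1)
    for (int i = 1; i <= N; i++) {
        double temp = f[i] / d[i];                 /* S1: no loop-carried dependence */

        while (atomic_load(&last_posted) < i - 1)  /* wait(i-1): spin until e[i-1]   */
            ;                                      /* has been produced              */

        e[i] = e[i-1] + temp;                      /* S2: serialized part            */
        atomic_store(&last_posted, i);             /* post(i)                        */
    }
}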

DOPIPE parallelization

Consider the following code:

for(i=1; i<=N; i++)
{
S1: e[i] = e[i-1] + f[i];
S2: g[i] = g[i] -e[i];
}

This code can be parallelized by distributing the loop and introducing pipelined parallelism. Right after e[i] is generated by statement S1 in the first loop, the second loop executes statement S2, which reads the just-produced value of e[i]. This type of parallelism is called DOPIPE parallelism, and it can be implemented as shown:

for(i=1; i<=N; i++)    //runs on processor 1
{
S1: e[i] = e[i-1] + f[i];
post(i);
}
for(i=1; i<=N; i++)    //runs on processor 2
{
wait(i);
S2: g[i] = g[i] -e[i];
}

If we assume that the execution times of S1 (Ts1) and S2 (Ts2) are both equal to T, then the execution time using DOPIPE becomes approximately N*T for N iterations, whereas sequential execution takes N*2T.[7]
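
A possible two-thread realization of this pipeline is sketched below, again using OpenMP sections and a C11 atomic counter in place of the abstract post()/wait() primitives. It is a simplification that busy-waits and assumes at least two OpenMP threads, so that the producer and consumer sections actually run concurrently:

#include <omp.h>
#include <stdatomic.h>

void dopipe(double *e, double *g, const double *f, int N)
{
    atomic_int produced = 0;                     /* index of the last e[i] that is ready */

    #pragma omp parallel sections
    {
        #pragma omp section                      /* producer: statement S1 */
        for (int i = 1; i <= N; i++) {
            e[i] = e[i-1] + f[i];
            atomic_store(&produced, i);          /* post(i) */
        }

        #pragma omp section                      /* consumer: statement S2 */
        for (int i = 1; i <= N; i++) {
            while (atomic_load(&produced) < i)   /* wait(i): spin until e[i]  */
                ;                                /* has been produced         */
            g[i] = g[i] - e[i];
        }
    }
}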

DOPIPE can perform better than DOACROSS in applications that exhibit regular dependences.[9][10]

See also

Loop fusion
Loop splitting
Loop-invariant code motion
Interleaving
Cyclic reduction
Automatic parallelization
Parallel computing
Parallel algorithm
Unimodular Loop Transformations[11]
Iteration Traversal Graph

References

  1. ^ Pande, Santosh; Agrawal, Dharma P., eds. (2001). Compiler Optimizations for Scalable Parallel Systems. Berlin: Springer. ISBN 3-540-41945-4.
  2. ^ Puppin, Diego; Tullsen, Dean (2001). "Maximizing TLP with loop-parallelization on SMT". Proc. of the 5th Workshop on Multithreaded Execution, Architecture, and Compilation.
  3. ^ Raman, E. Parallelization Techniques with Improved Dependence Handling. Princeton University.
  4. ^ Banerjee, Utpal (1997). Dependence Analysis. Boston, Mass.: Kluwer Academic Publishers. p. 20. ISBN 9780792398097.
  5. ^ Bernstein, A. J. (1966). "Analysis of Programs for Parallel Processing". IEEE Transactions on Electronic Computers. EC-15 (5).
  6. ^ Banerjee, Utpal (1997). Dependence Analysis. Boston, Mass.: Kluwer Academic Publishers. p. 21. ISBN 9780792398097.
  7. ^ Solihin, Yan (November 24, 2015). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC. ISBN 9781482211184.
  8. ^ Zhong, Hongtao. "Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications". HPCA 2007.
  9. ^ Murphy, Niall. "Discovering and exploiting parallelism in DOACROSS loops". Technical Report 882, University of Cambridge Computer Laboratory.
  10. ^ Pankratius, Victor; Adl-Tabatabai, Ali-Reza; Tichy, Walter, eds. (2010). Fundamentals of Multicore Software Development. Boca Raton, Fla.: CRC Press. ISBN 9781439812730.
  11. ^ Yu, Yijun; D'Hollander, Erik H. (2001). "Loop parallelization using the 3D iteration space visualizer". Journal of Visual Languages & Computing. 12 (2): 163–181.