User:Ewhorton/sandbox


Data parallelism is a form of parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It contrasts with task parallelism, another form of parallelism.

Description

In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.

For instance, consider a 2-processor system (CPUs A and B) in a parallel environment, and suppose we wish to perform a task on some data d. It is possible to tell CPU A to perform that task on one part of d and CPU B on another part simultaneously, thereby reducing the execution time. The data can be assigned using conditional statements, as described below. As a specific example, consider adding two matrices. In a data parallel implementation, CPU A could add all elements from the top half of the matrices, while CPU B could add all elements from the bottom half. Since the two processors work in parallel, the matrix addition would take half the time of performing the same operation in serial on one CPU alone.
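
As an illustration, the following is a minimal sketch (not part of the original example), assuming POSIX threads are available: one thread adds the top half of the rows of two N x N matrices while a second thread adds the bottom half, mirroring the CPU A / CPU B split described above.

#include <pthread.h>

#define N 4
static int a[N][N], b[N][N], c[N][N];        /* c = a + b */

struct range { int first_row, last_row; };   /* the half handled by one thread */

static void *add_half(void *arg) {
    struct range *r = arg;
    for (int i = r->first_row; i < r->last_row; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];     /* same operation, different data */
    return 0;
}

int main(void) {
    pthread_t t1, t2;
    struct range top = {0, N / 2}, bottom = {N / 2, N};
    pthread_create(&t1, 0, add_half, &top);      /* "CPU A": top half */
    pthread_create(&t2, 0, add_half, &bottom);   /* "CPU B": bottom half */
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}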

Data parallelism emphasizes the distributed (parallelized) nature of the data, as opposed to the processing (task parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.

Example

The program expressed in pseudocode below—which applies some arbitrary operation, foo, on every element in the array d—illustrates data parallelism:[nb 1]

if CPU = "a"
    lower_limit := 1
    upper_limit := round(d.length/2)
else if CPU = "b"
    lower_limit := round(d.length/2) + 1
    upper_limit := d.length

for i from lower_limit to upper_limit by 1
    foo(d[i])

If the above example program is executed on a 2-processor system, the runtime environment may execute it as follows:

  • In an SPMD system, both CPUs will execute the code.
  • In a parallel environment, both will have access to d.
  • A mechanism is presumed to be in place whereby each CPU will create its own copy of lower_limit and upper_limit that is independent of the other.
  • The if clause differentiates between the CPUs: CPU "a" evaluates the if condition as true, and CPU "b" evaluates the else if condition as true, so each CPU ends up with its own values of lower_limit and upper_limit.
  • Both CPUs then execute foo(d[i]), but since each CPU has different values of the limits, they operate on different parts of d simultaneously, thereby distributing the task between themselves. This is faster than performing the task on a single CPU.

This concept can be generalized to any number of processors. However, when the number of processors increases, it may be helpful to restructure the program in a similar way (where cpuid is an integer between 1 and the number of CPUs, and acts as a unique identifier for every CPU):

for i from cpuid to d.length by number_of_cpus
    foo(d[i])

For example, on a 2-processor system CPU A (cpuid 1) will operate on odd entries and CPU B (cpuid 2) will operate on even entries.

Dependencies in Parallel Code

Transforming a sequential program into a parallel version requires that dependencies within the code are preserved. There are three types of dependencies that can be found: [1]

  • True dependence (S1 ->T S2): S1 writes to a location that is later read by S2.
  • Anti-dependence (S1 ->A S2): S1 reads from a location that is later written to by S2.
  • Output dependence (S1 ->O S2): S1 and S2 write to the same location.

Anti-Dependence and Output Dependence can be dealt with by giving each process or thread its own copy of variables (known as privatization). However, true dependence must be preserved, requiring process synchronization. [1]
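
A minimal sketch (not from the source), assuming OpenMP is available, of how privatization can be expressed in practice: giving each thread its own copy of the scratch variable tmp removes the anti- and output dependences that a single shared copy would otherwise create (the function scale and its arithmetic are purely illustrative).

#include <omp.h>

void scale(double *a, const double *b, int n) {
    double tmp;
    #pragma omp parallel for private(tmp)   /* each thread gets its own tmp */
    for (int i = 0; i < n; i++) {
        tmp = b[i] * 2.0;    /* write to the private copy */
        a[i] = tmp + 1.0;    /* read from the private copy */
    }
}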

Example of True Dependence

S1: int a, b;
S2: a = 2;
S3: b = a + 40;

S2 ->T S3, meaning that S3 has a true dependence on S2 because S2 writes to the variable a, which S3 later reads.

Example of Anti-Dependence

S1: int a, b = 40;
S2: a = b - 38;
S3: b = -1;

S2 ->A S3, meaning that S3 has an anti-dependence on S2 because S2 reads from the variable b before S3 writes to it.

Example of Output-Dependence

S1: int a, b = 40;
S2: a = b - 38;
S3: a = 2;

S2 ->O S3, meaning that there is an output dependence between S2 and S3 because both write to the variable a.

Loop-Carried vs Loop-Independent Dependence

Loops can have two types of dependence:

  • Loop-Carried Dependence
  • Loop-Independent Dependence

In loop-independent dependence, dependences exist between statements within a single iteration of a loop, but not between iterations. Each iteration may be treated as a block and executed in parallel without additional synchronization.

In the following example code, used for swapping the values of two arrays of length n, there is a loop-independent dependence of S1[i] ->T S3[i].

for (int i = 0; i < n; i++) {
    S1: int tmp = a[i];
    S2: a[i] = b[i];
    S3: b[i] = tmp;
}
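
Because the only dependence in this loop stays within a single iteration, the iterations themselves can run concurrently. A minimal sketch (not from the source), assuming OpenMP, of the parallel form (the wrapper function swap_arrays is purely illustrative):

#include <omp.h>

void swap_arrays(int *a, int *b, int n) {
    #pragma omp parallel for          /* iterations are independent */
    for (int i = 0; i < n; i++) {
        int tmp = a[i];   /* S1: tmp is declared in the loop body, so it is private to each thread */
        a[i] = b[i];      /* S2 */
        b[i] = tmp;       /* S3 */
    }
}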

In loop-carried dependence, statements in one iteration of a loop depend on statements in another iteration of the loop. Loop-carried dependence uses a modified version of the dependence notation seen earlier: S1[i] ->T S1[i + 1] denotes that a statement carries a true dependence to a later iteration, meaning that one iteration of the loop writes to a location read by a subsequent iteration.

Example of loop-carried dependence where S1[i] ->T S1[i + 1].

for (int i = 1; i < n; i ++) {
    S1: a[i] = a[i - 1] + 1;
}

Loop-Carried Dependence Graph

A Loop-Carried Dependence Graph graphically shows the loop-carried dependencies between iterations. Each iteration is listed as a node on the graph, and directed edges show the true, anti, and output dependencies between each iteration.
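
For example, for the loop a[i] = a[i - 1] + 1 shown above, the graph contains one node per iteration and a true-dependence edge from each iteration to the next, forming a simple chain:

i = 1  --T-->  i = 2  --T-->  i = 3  --T-->  ...  --T-->  i = n - 1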

Loop-Level Parallelism

Loops can exhibit four types of parallelism:

  • DOALL Parallelism
  • DOACROSS Parallelism
  • DOPIPE Parallelism
  • DISTRIBUTED Loop

DOALL Parallelism

DOALL Parallelism exists when statements within a loop can be executed independently, i.e. when there is no loop-carried dependence.[1] For example, the following code does not read from the array a and does not update the arrays b and c, so no iteration depends on any other iteration.

for (int i = 0; i < n; i++) {
    S1: a[i] = b[i] + c[i];
}

Let the time of one execution of S1 be TS1. The execution time for the sequential form of the above code is then n x TS1. Because DOALL Parallelism exists when all iterations are independent, a speedup may be achieved by executing all iterations in parallel, giving an execution time of TS1 (the time taken for one iteration of the sequential execution).

The following example, using a simplified pseudocode, shows how a loop might be parallelized to execute each iteration independently.

begin_parallelism();
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    end_parallelism();
}
block();
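
In a real implementation, a DOALL loop maps directly onto a parallel for construct. A minimal sketch (not from the source), assuming OpenMP (the wrapper function vector_add is illustrative):

#include <omp.h>

void vector_add(double *a, const double *b, const double *c, int n) {
    #pragma omp parallel for      /* every iteration may run concurrently */
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];       /* S1: no loop-carried dependence */
    }
}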

DOACROSS Parallelism

DOACROSS Parallelism exists where iterations of a loop are parallelized by extracting calculations that can be performed independently and running them simultaneously.[1] Synchronization is used to enforce the loop-carried dependence.

Consider the following sequential loop with dependence S1[i] ->T S1[i + 1].

for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i] + 1;
}

Each loop iteration performs two actions:

  • Calculate a[i - 1] + b[i] + 1
  • Assign the value to a[i]

Calculating the value a[i - 1] + b[i] + 1, and then performing the assignment can be decomposed into two lines:

int tmp = b[i] + 1;
a[i] = a[i - 1] + tmp;

The first line, int tmp = b[i] + 1;, has no loop-carried dependence. The loop can then be parallelized by computing the tmp value in parallel and then synchronizing the assignment to a[i].

post(0);
for (int i = 1; i < n; i++) {

    int tmp = b[i] + 1;
    wait(i - 1);

    a[i] = a[i - 1] + tmp;
    post(i);
}
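
Here post(i) signals that a[i] has been produced and wait(i - 1) blocks until the previous iteration has signalled, so the independent computation of tmp overlaps across iterations while the dependent assignment stays ordered. The post/wait primitives are left abstract above; one possible realization (an assumption, not part of the source) uses POSIX threads, with the waiting helper named wait_for to avoid clashing with the standard wait function:

#include <pthread.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int posted = -1;              /* highest iteration index posted so far */

void post(int i) {
    pthread_mutex_lock(&lock);
    if (i > posted) posted = i;      /* publish that iteration i is complete */
    pthread_cond_broadcast(&ready);
    pthread_mutex_unlock(&lock);
}

void wait_for(int i) {
    pthread_mutex_lock(&lock);
    while (posted < i)               /* block until iteration i has been posted */
        pthread_cond_wait(&ready, &lock);
    pthread_mutex_unlock(&lock);
}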

DOPIPE Parallelism

DOPIPE Parallelism implements pipelined parallelism for loop-carried dependence where a loop iteration is distributed over multiple, synchronized loops. [1]

Consider the following sequential code with dependence S1[i] ->T S1[i + 1].

for (int i = 1; i < n; i++) {
    S1: a[i] = a[i - 1] + b[i];
    S2: c[i] = c[i] + a[i];
}

S1 must be executed sequentially, but S2 has no loop-carried dependence. S2 could be executed in parallel using DOALL Parallelism after performing all of the calculations needed by S1 in series. However, the speedup is limited if this is done. A better approach is to parallelize the loops such that the S2 corresponding to each iteration executes as soon as the S1 of that iteration has finished.

Implementing pipelined parallelism results in the following set of loops, where the second loop may execute for an index as soon as the first loop has finished its corresponding index.

for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i];
    post(i);
}

for (int i = 1; i < n; i++) {
    wait(i);
    c[i] = c[i] + a[i];
}
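
A minimal sketch (not from the source) of how this pipeline could be run, assuming POSIX threads and the post/wait_for helpers sketched in the DOACROSS section: the first loop runs as a producer thread and the second as a consumer thread.

#include <pthread.h>

#define N 1000
static double a[N], b[N], c[N];

void post(int i);        /* see the sketch in the DOACROSS section */
void wait_for(int i);

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i < N; i++) {
        a[i] = a[i - 1] + b[i];   /* S1: sequential, carries the dependence */
        post(i);                  /* publish that a[i] is ready */
    }
    return 0;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 1; i < N; i++) {
        wait_for(i);              /* block until a[i] has been produced */
        c[i] = c[i] + a[i];       /* S2: no loop-carried dependence */
    }
    return 0;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, 0, producer, 0);
    pthread_create(&t2, 0, consumer, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}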

DISTRIBUTED Loop

When a loop has a loop-carried dependence, another way to parallelize it is to distribute the loop into several different loops: statements that do not depend on one another are separated so that the distributed loops can be executed in parallel. For example, consider the following code.

for (int i = 1; i < n; i ++) {
    S1: a[i] = a[i -1] + b[i];
    S2: c[i] = c[i] + d[i];
}

The loop has a loop-carried dependence S1[i] ->T S1[i + 1], but S2 does not depend on S1, so the code can be rewritten as follows.

loop1: for (int i = 1; i < n; i ++) {
    S1: a[i] = a[i -1] + b[i];
}
loop2: for (int i = 1; i < n; i ++) {
    S2: c[i] = c[i] + d[i];
}

Note that loop1 and loop2 can now be executed in parallel. Instead of a single instruction being performed in parallel on different data, as in data-level parallelism, here different loops perform different tasks on different data. This kind of parallelism is known as function or task parallelism.
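
A minimal sketch (not from the source), assuming OpenMP, of running the two distributed loops as independent tasks (the wrapper function distributed is illustrative):

#include <omp.h>

void distributed(double *a, const double *b, double *c, const double *d, int n) {
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 1; i < n; i++)          /* loop1: kept sequential internally */
            a[i] = a[i - 1] + b[i];

        #pragma omp section
        for (int i = 1; i < n; i++)          /* loop2: independent of loop1 */
            c[i] = c[i] + d[i];
    }
}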

Steps to Parallelization

The process of parallelizing a sequential program can be broken down into four discrete steps:[1]

  • Decomposition: the program is broken down into tasks, the smallest exploitable units of concurrency.
  • Assignment: tasks are assigned to processes.
  • Orchestration: data access, communication, and synchronization of the processes are organized.
  • Mapping: processes are bound to processors.

Programming Model

The choice of programming model affects how a program will be parallelized. Two models are commonly used: shared address space and message passing.

Aspects of shared address space versus message passing, as described by Yan Solihin:[1]

  • Communication: implicit in a shared address space, explicit with message passing.
  • Synchronization: explicit in a shared address space, implicit with message passing.
  • Hardware support: typically required for a shared address space; required for message passing as the number of processors becomes large.
  • Development effort: lower for a shared address space, higher for message passing.
  • Communication granularity: finer for a shared address space, coarser for message passing.

Shared Address Space

In a shared address space model, each processor has access to all of the memory. This simplifies inter-process communication, because communication occurs implicitly by reading and writing shared variables. However, the processes need explicit synchronization to prevent race conditions on regions of shared memory. In addition, the model typically requires a cache coherence protocol, which adds overhead.
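
A minimal sketch (not from the source), assuming POSIX threads: two threads communicate implicitly through a shared counter, while a mutex provides the explicit synchronization needed to avoid a race condition on the shared location.

#include <pthread.h>

static long counter = 0;                          /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* explicit synchronization */
        counter++;                                /* implicit communication */
        pthread_mutex_unlock(&lock);
    }
    return 0;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, 0, work, 0);
    pthread_create(&t2, 0, work, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;                                     /* counter is now 200000 */
}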

Message Passing

With a message passing model, each process lives in its own address space, distinct from that of the other processes. Communication occurs when a process explicitly sends a message to, or receives a message from, another process. Passing messages and copying data between address spaces adds overhead, but it also acts to synchronize the processes and prevents race conditions on memory by default.
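
A minimal sketch (not from the source), assuming an MPI implementation is available (run with at least two processes, e.g. mpirun -np 2): each process holds a private copy of value, and the data moves only through an explicit send/receive pair, which also orders the two processes.

#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);     /* explicit send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* explicit receive */
    }

    MPI_Finalize();
    return 0;
}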

See also

Notes

  1. ^ For some input data (e.g. when d.length evaluates to 1 and round rounds towards zero; this is just an example, as there is no requirement on the type of rounding used), lower_limit will be greater than upper_limit. It is assumed that the loop exits immediately (i.e. zero iterations occur) when this happens.

References

  1. ^ a b c d e f g Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 978-1-4822-1118-4.

Category:Parallel computing Category:Articles with example pseudocode