User:MaxDZ8/Stream processing

Stream processing is a relatively new yet very successful paradigm that allows MIMD data processing at previously unseen efficiency with minimal effort.

Given a set of input data (the input stream), the paradigm is essentially based on defining a series of operations (the kernel) to be applied to each element in the stream. While multiple kernels per stream would be possible in theory, uniform streaming is the only variant that has been successful. Uniform streaming uses a single kernel, applied to all the elements of the stream, although the kernel can of course be changed from one stream to the next.

Stream processing is essentially a compromise. Sacrificing some flexibility in the model allows much easier and faster execution. Depending on the context, a processor design may be tuned for maximum efficiency or trade some of it for flexibility.

Comparison with previous parallel paradigms

Basic computers started from a sequential execution paradigm. Nearly all CPUs are SISD based, which means they conceptually perform only a single operation at a time. As the computing needs of the world evolved, the amount of data to be managed increased very quickly. It became obvious that the sequential programming model could not cope with the increased need for processing power. Various efforts have been spent on finding alternative ways to perform massive amounts of computation, but the only solution was to exploit some level of parallel execution. The result of those efforts was SIMD, a programming paradigm which allows a single instruction to be applied to multiple data items. Most of the time, SIMD was used in a SWAR environment. By using more complicated structures, one could also have MIMD parallelism.

Although those two paradigms were fairly efficient, real-world implementations were plagued by limitations ranging from memory alignment problems to synchronization issues and limited parallelism. Only a few SIMD processors survived as stand-alone components; most were embedded in standard CPUs.

Consider a simple program adding up two arrays containing 100 4-component vectors (i.e. 400 numbers in total).

Conventional, sequential paradigm

for(int i = 0; i < 100 * 4; i++)
    result[i] = source0[i] + source1[i];

This is the naïve method, the one most computer science students would think of first. Variations do exist (such as inner loops, structures and so on) but they ultimately boil down to this.

Parallel SIMD paradigm, packed registers (SWAR)

for(int el = 0; el < 100; el++) // for each vector
    vector_sum(result[el], source0[el], source1[el]);

This is actually oversimplified. It assumes the instruction vector_sum simply works. Although this is what happens with instruction intrinsics, much information is not taken into account here, such as the number of vector components and their data format. This is done for clarity.

Even so, this method reduces the number of decoded instructions from numElements * componentsPerElement to numElements. The number of jump instructions also decreases, because the loop runs fewer times. Another gain lies in the parallel execution of the four mathematical operations, which gives a great speedup.

However, a packed SIMD register holds a fixed amount of data, so it is not possible to get more parallelism than that. The speedup is therefore limited by the assumption we made of performing four parallel operations (note that this is common to both AltiVec and SSE).
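
As a rough illustration of the instruction-count argument, the following C sketch emulates a four-wide packed add in plain scalar code. The type vec4 and the helper add_arrays are names invented for this example, and this vector_sum is only a stand-in: on real hardware its body would be a single packed instruction rather than a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-component packed value; stands in for a SIMD register. */
typedef struct { float c[4]; } vec4;

/* Emulates one packed add. Real SIMD hardware would perform the four
   additions with a single instruction; here they are spelled out. */
void vector_sum(vec4 *result, const vec4 *a, const vec4 *b)
{
    for (int i = 0; i < 4; i++)
        result->c[i] = a->c[i] + b->c[i];
}

/* Adds two arrays of vectors and returns how many packed operations
   were issued: one per vector rather than one per component. */
size_t add_arrays(vec4 *result, const vec4 *source0,
                  const vec4 *source1, size_t count)
{
    size_t issued = 0;
    assert(result != NULL && source0 != NULL && source1 != NULL);
    for (size_t el = 0; el < count; el++) {
        vector_sum(&result[el], &source0[el], &source1[el]);
        issued++;
    }
    return issued;
}
```

For the 100-vector example, add_arrays issues 100 packed operations where the sequential loop needs 400 scalar additions. The factor of four is fixed by the register width, which is exactly the limitation discussed above.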

Parallel Stream paradigm (SIMD/MIMD)

// This is a fictional language for demonstration purposes.
streamElements 100
streamElementFormat 4 numbers
elementKernel "@arg0+@arg1"
result = kernel(source0, source1)

As you can see, the idea is to define the whole set of data instead of each single block. The first two lines describe the set of data. After that, the result is inferred from the sources and the kernel. For simplicity, there is a 1:1 mapping between input and output data, but this need not be the case. Applied kernels can also be much more complex.

This paradigm allows the implementation to "unroll" the loop internally, allowing the model to scale with chip complexity. It has been shown ^[1] that this programming model can easily scale to hundreds of ALUs. Much of this extra power comes from the reduced need to predict complex data access patterns, which do not exist in this paradigm.
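
The fictional listing can be approximated in C to make the division of labour concrete. This is only a sketch: element_t, kernel_fn, add_kernel and stream_apply are names made up for the example, and the loop inside stream_apply runs sequentially here, whereas a stream processor would be free to spread the same range over all of its ALUs.

```c
#include <assert.h>
#include <stddef.h>

/* One stream element: 4 numbers, matching the fictional example. */
typedef struct { float v[4]; } element_t;

/* A kernel maps one element from each input stream to one output
   element. It never sees neighbouring elements, which is what lets an
   implementation run many instances at once. */
typedef element_t (*kernel_fn)(element_t arg0, element_t arg1);

/* The "@arg0+@arg1" kernel. */
element_t add_kernel(element_t arg0, element_t arg1)
{
    element_t out;
    for (int i = 0; i < 4; i++)
        out.v[i] = arg0.v[i] + arg1.v[i];
    return out;
}

/* Applies one kernel to whole streams. The caller describes the data
   set once; how the range is traversed is left to this function. */
void stream_apply(kernel_fn kernel, element_t *result,
                  const element_t *source0, const element_t *source1,
                  size_t stream_elements)
{
    assert(kernel != NULL);
    for (size_t i = 0; i < stream_elements; i++)
        result[i] = kernel(source0[i], source1[i]);
}
```

A call such as stream_apply(add_kernel, result, source0, source1, 100) plays the role of the last line of the fictional listing. Because the caller never writes the loop, the traversal order can change without touching application code.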

While stream processing is a branch of SIMD/MIMD processing, the two must not be confused. Although SIMD implementations can often work in a "streaming" manner, their performance is not comparable: the model envisions a very different usage pattern which by itself allows far greater performance. It has been noted ^[2] that when applied on generic processors such as standard CPUs, only a 1.5x speedup can be reached. By contrast, ad-hoc stream processors easily reach over 10x performance.

Although the model allows various degrees of flexibility, stream processors usually impose some limitations on the kernel or stream size. For example, consumer hardware often lacks the ability to perform high-precision math, does not support complex indirection chains, or limits the number of instructions which can be executed.

Stream processing considerations

Available documentation on stream processing is very scarce as this is written (September 12, 2005); only a few specialized institutions seem to have understood the implied power of the model. Stanford University has been historically involved in a variety of projects on this, beginning with the Stanford Shading Language and deploying a flexible, stand-alone stream processor called Imagine. Both projects revealed that the paradigm has great potential, so a much larger-scale project has been started: Merrimac, a stream-based supercomputer, is now being researched. AT&T also recognized the wide adoption of stream-enhanced processors as GPUs rapidly evolved in both speed and functionality^[3].

Data dependencies and parallelism

A great advantage of the stream programming model lies in the kernel defining independent and local data usage.

Kernel operations define the basic data unit, both as input and output. This allows the hardware to better allocate resources and schedule global I/O. Although usually not exposed in the programming model, I/O operations seem to be much more advanced on stream processors (at least on GPUs). I/O operations are also usually pipelined themselves, while the chip structure can help hide latencies. The definition of the data unit is usually explicit in the kernel, which is expected to have well-defined inputs (possibly using structures, which is encouraged) and outputs. In some environments, output values are fixed (GPUs, for example, have a fixed set of output attributes unless this is relaxed). Having each computing block clearly independent and defined makes it possible to schedule bulk read or write operations, greatly increasing cache and memory bus efficiency.

Data locality is also explicit in the kernel. This concept is usually referred to as kernel locality, and identifies all the values which live only for a single kernel invocation. All temporaries are simply assumed to be local to each kernel invocation, so hardware or software can easily allocate them in fast registers. This is strictly related to the degree of parallelism that can be exploited.

Inside each kernel, producer-consumer relationships can be identified by the usual means, while the relationship between kernels chained one after another is given by the model. This makes scheduling decisions easier: if kernel B requires output from kernel A, A must be completed before B can run (at least on the data unit being used). As this is written (September 27, 2005) there seems to be no way to explicitly control multi-kernel pipelining, although there are hints that the Cell processor allows this, for example by routing data between its SPEs.
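
A minimal C sketch of two chained kernels may help to picture both kernel locality and the producer-consumer ordering. The kernels and the driver run_chain are invented for this example and do nothing useful by themselves; the point is only which values are local and which ordering is actually required.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel A: 'scaled' is a kernel-local temporary. It never outlives
   one invocation, so it can live in a register. */
float kernel_a(float in)
{
    float scaled = in * 2.0f;
    return scaled;
}

/* Kernel B: consumes the output of kernel A for the same element. */
float kernel_b(float from_a)
{
    float shifted = from_a + 1.0f;
    return shifted;
}

/* Chains A and B over a stream. For each element, A must finish before
   B starts; different elements carry no ordering constraint, so the
   iterations could run in any order or in parallel. */
void run_chain(float *out, const float *in, size_t n)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++) {
        float a_result = kernel_a(in[i]);   /* producer */
        out[i] = kernel_b(a_result);        /* consumer */
    }
}
```

Note that the only value crossing a kernel boundary is a_result. Everything else is private to one invocation, which is what makes the scheduling decision so simple.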

Recently, CPU vendors have been pushing for multi-core and multi-threading. While this trend is going to be useful for the average user, there is no chance standard CPUs can reach a stream processor's performance. The parallelism between two kernel instances is similar to thread-level parallelism. Each kernel instance gets data parallelism. Inside each kernel, it is still possible to use instruction-level parallelism. Task parallelism (such as overlapped I/O) can still happen. It is easy to have thousands of kernel instances, but it is simply impossible to have the same number of threads. This is the power of the stream.

Programming model notes

One of the drawbacks of SIMD programming was the issue of Array-of-Structures (AoS) and Structure-of-Arrays (SoA). Programmers often wanted to build data structures with a 'real' meaning, for example:

// A particle in a three dimensional space.
struct particle_t {
    float x, y, z;          // not even an array!
    unsigned char color[3]; // 8 bits per channel, say we care about RGB only
    float size;
    // ... and many other attributes may follow...
};

Those structures were then assembled in arrays to keep things nicely organized. This is AoS. When the structure is laid out in memory, the compiler produces interleaved data: all the structures are contiguous, but there is a constant offset between, say, the "size" attribute of one structure instance and the same attribute of the following instance. The offset depends on the structure definition (and possibly on other things not considered here, such as the compiler's padding policies). There are other problems as well. For example, the three position variables cannot be SIMD-ized that way, because there is no guarantee they will be allocated in contiguous memory. To make sure SIMD operations can work on them, they must be grouped in a 'packed memory location' or at least in an array. Another problem is that both "color" and "xyz" are three-component vector quantities, while SIMD processors usually support 4-component operations only (with some exceptions).
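
The constant offset can be made visible with a few lines of C. This is a sketch under the assumption that the particle structure contains only the attributes listed in the example; particle_stride and read_size are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct particle_t {
    float x, y, z;
    unsigned char color[3];
    float size;
};

/* Byte distance between the same field of two consecutive array
   elements. In an array of structures this is always the structure
   size, padding included. */
size_t particle_stride(void)
{
    return sizeof(struct particle_t);
}

/* Reads the "size" attribute of element i by walking raw memory with
   the stride, the way a strided attribute fetch would. */
float read_size(const struct particle_t *particles, size_t count, size_t i)
{
    float value;
    const unsigned char *base = (const unsigned char *)particles;
    assert(particles != NULL && i < count);
    memcpy(&value,
           base + i * particle_stride() + offsetof(struct particle_t, size),
           sizeof value);
    return value;
}
```

The exact value returned by particle_stride depends on the compiler's padding rules, which is why the code asks for it with sizeof instead of hard-coding a number.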

Problems and limitations of this kind made SIMD acceleration on standard CPUs quite nasty. The proposed solution, SoA, looks as follows:

struct particle_t {
    float *x, *y, *z;
    unsigned char *colorRed, *colorBlue, *colorGreen;
    float *size;
};

For readers not experienced with C, the '*' before each identifier declares a pointer, used here to refer to the first element of an array. For Java programmers, this is roughly equivalent to "[]". The drawback here is that the various attributes may be spread out in memory. To avoid cache misses, we have to update all the "reds", then all the "greens" and then all the "blues". Although this is not so bad after all, it is simply overkill when compared to what most stream processors offer.
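
The channel-by-channel update pattern can be sketched as follows. The structure is renamed particles_soa and given a count field purely for this example, and darkening the colors is an arbitrary operation chosen to have something to compute.

```c
#include <assert.h>
#include <stddef.h>

struct particles_soa {
    float *x, *y, *z;
    unsigned char *colorRed, *colorBlue, *colorGreen;
    float *size;
    size_t count;   /* hypothetical: number of particles in every array */
};

/* Halves one color channel for all particles in one contiguous pass. */
void darken_channel(unsigned char *channel, size_t count)
{
    assert(channel != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        channel[i] = (unsigned char)(channel[i] / 2);
}

/* Processes one channel at a time, so that each pass walks a single
   array front to back instead of jumping between three arrays. */
void darken_all(struct particles_soa *p)
{
    assert(p != NULL);
    darken_channel(p->colorRed, p->count);
    darken_channel(p->colorGreen, p->count);
    darken_channel(p->colorBlue, p->count);
}
```

The code is simple, but the programmer had to reorganize both the data and the loops around the hardware, which is the burden the paragraph above refers to.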

For stream processors, the use of structures is encouraged. From an application point of view, all the attributes can be defined with some flexibility. Taking GPUs as a reference, there is a set of attributes (at least 16) available. For each attribute, the application can state the number of components and the format of the components (but only primitive data types are supported for now). The various attributes are then attached to a memory block, possibly defining a stride between 'consecutive' elements of the same attribute, effectively allowing interleaved data. When the GPU begins the stream processing, it gathers all the various attributes into a single set of parameters (usually this looks like a structure or a "magic global variable"), performs the operations and scatters the results to some memory area for later processing (or retrieval).
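
The gather, compute and scatter sequence can be mimicked in C. This is a loose sketch and not any real graphics API: attribute_t, kernel_input_t, gather and process_element are hypothetical names, only float components are handled, and the "kernel" simply scales a position by a size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute binding: where the data starts, how far apart
   consecutive elements are (in bytes) and how many floats each has. */
typedef struct {
    const unsigned char *base;
    size_t stride;
    size_t components;   /* at most 4 */
} attribute_t;

/* What one kernel invocation sees: all attributes gathered together. */
typedef struct {
    float position[4];
    float size[4];
} kernel_input_t;

/* Copies element 'index' of one attribute into dst, zero-filling the
   components the attribute does not provide. */
void gather(const attribute_t *attr, size_t index, float dst[4])
{
    assert(attr != NULL && attr->components <= 4);
    memset(dst, 0, 4 * sizeof(float));
    memcpy(dst, attr->base + index * attr->stride,
           attr->components * sizeof(float));
}

/* Gathers the attributes of one element, runs a trivial kernel and
   scatters the three results to the output area. */
void process_element(const attribute_t *position, const attribute_t *size,
                     size_t index, float *out)
{
    kernel_input_t in;
    gather(position, index, in.position);
    gather(size, index, in.size);
    for (int c = 0; c < 3; c++)
        out[index * 3 + c] = in.position[c] * in.size[0];
}
```

Because each attribute carries its own base address and stride, the same code works whether the application stores its data interleaved (AoS) or in separate arrays (SoA).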

Summing up, there is more flexibility on the application's side, yet everything looks very organized on the stream processor's side.

Generic processor architecture

Historically, CPUs began implementing various tiers of memory access optimizations because their performance kept increasing while external memory bandwidth grew relatively slowly. Over the years this gap widened, so large amounts of die area were dedicated to hiding memory latencies. Since fetching information and opcodes to the few ALUs is expensive, very little die area is dedicated to actual mathematical machinery (as a rough estimate, consider it to be less than 10%).

A similar architecture exists on stream processors but, thanks to the new programming model, the number of transistors dedicated to management is actually very small.

Beginning from a whole-system point of view, stream processors usually exist in a controlled environment. GPUs exist on an add-in board (this seems to apply to Imagine as well). CPUs do the dirty job of managing system resources, running applications and so on.

The stream processor is usually equipped with a fast, efficient, proprietary memory bus (crossbar switches are now common; multiple buses have been employed in the past). The exact number of memory lanes depends on the market range. As this is written, there are still 64-bit wide interconnections around (entry-level). Most mid-range models use a fast 128-bit crossbar switch matrix (4 or 2 segments), while high-end models deploy huge amounts of memory (currently up to 512 MB) with a slightly slower 256-bit wide crossbar. By contrast, standard IA-32 processors such as the Intel Pentium have only a single 64-bit wide data bus.

Memory access patterns are much more predictable. While arrays do exist, their dimension is fixed at kernel invocation. The closest match to a multiple pointer indirection is an indirection chain, which is however guaranteed to finally read or write from a specific memory area (inside a stream).

Because of the MIMD nature of each stream processor, read/write operations are expected to happen in bulk, so memories are optimized for high bandwidth rather than low latency (this is a difference from Rambus and DDR SDRAM, for example). This also allows for efficient memory bus negotiations.

Most (90%) of a stream processor's work is done on-chip, requiring only 1% of the global data to be stored to memory. This is where knowing the kernel temporaries and dependencies pays off.

Internally, a stream processor features some communication and management circuits, but what is interesting is the Stream Register File (SRF). This is conceptually a large cache in which stream data is stored to be transferred to external memory in bulk. As a cache-like structure for the ALUs, the SRF is shared between all the ALU clusters.

It has been shown that there can be a large number of clusters, because inter-cluster communication is assumed to be rare. Internally, however, each cluster can efficiently exploit a much smaller number of ALUs, because intra-cluster communication is common and thus needs to be highly efficient.

To keep those ALUs fed with data, each ALU is equipped with Local Register Files (LRFs), which are basically its usable registers.

This three-tiered data access pattern makes it easy to keep temporary data away from slow memories, thus making the silicon implementation highly efficient and power-saving.
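
The three tiers can be imitated in software to show where the data travels. In this sketch, which is an analogy and not a model of any real chip, a small staging buffer plays the role of the SRF, local variables play the role of the LRFs, and memcpy stands for a bulk transfer to or from external memory. SRF_ELEMENTS and process_stream are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SRF_ELEMENTS ((size_t)64)   /* hypothetical staging capacity */

/* Computes out[i] = in[i] * in[i] + 1 and returns the number of bulk
   transfers performed against "external memory". */
size_t process_stream(float *out, const float *in, size_t n)
{
    float srf[SRF_ELEMENTS];   /* plays the role of the SRF */
    size_t transfers = 0;
    assert(out != NULL && in != NULL);

    for (size_t start = 0; start < n; start += SRF_ELEMENTS) {
        size_t left = n - start;
        size_t len = left < SRF_ELEMENTS ? left : SRF_ELEMENTS;

        memcpy(srf, in + start, len * sizeof(float));    /* bulk read */
        transfers++;

        for (size_t i = 0; i < len; i++) {
            float squared = srf[i] * srf[i];   /* temporary, LRF analog */
            srf[i] = squared + 1.0f;
        }

        memcpy(out + start, srf, len * sizeof(float));   /* bulk write */
        transfers++;
    }
    return transfers;
}
```

The temporary named squared never reaches the staging buffer or external memory, and external memory is touched only twice per batch regardless of how much work the inner loop does.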

Hardware-in-the-loop issues

Although an order-of-magnitude speedup can reasonably be expected (even from mainstream GPUs when computing in a streaming manner), not all applications benefit from this. Communication latencies are actually the biggest problem. Although PCI Express improved this with full-duplex communications, getting a GPU (and possibly a generic stream processor) to work may take a long time. This means it is usually counter-productive to use them for small datasets. The stream architecture also incurs penalties for small streams, a behaviour which is officially identified as the short stream effect. This basically happens because changing the kernel is a rather expensive operation.
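
The short stream effect can be illustrated with a deliberately simple cost model, in which processing a stream costs a fixed setup (kernel change, communication latency) plus a constant amount per element. Both the model and the function stream_efficiency are assumptions made for this example; real hardware has many more factors.

```c
#include <assert.h>

/* Fraction of the total time spent on useful work, assuming
   total = setup_cost + elements * cost_per_element.
   All three arguments are in the same arbitrary time unit. */
double stream_efficiency(double setup_cost, double cost_per_element,
                         double elements)
{
    double useful;
    assert(setup_cost >= 0.0 && cost_per_element > 0.0 && elements > 0.0);
    useful = elements * cost_per_element;
    return useful / (setup_cost + useful);
}
```

With an invented setup cost of 1000 units and 1 unit per element, a stream of 10 elements spends about 1% of its time on useful work, a stream of 1000 elements spends 50%, and only streams of around a million elements get close to 100%.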

Pipelining is a deeply rooted practice on stream processors. GPUs feature pipelines more than 200 stages long. The cost of switching settings depends on the setting being modified, but it is now considered to be always expensive. Although efforts are being spent on lowering the cost of switching, this is unlikely to happen any time soon. To avoid those problems at various levels of the pipeline, a large number of techniques have been deployed, such as "über shaders" and "texture atlases". Those techniques are game-oriented because of the nature of GPUs, but the concepts are interesting for generic stream processing as well.

Interesting Stream Processors

Imagine, from Stanford University, is a very flexible architecture which has proven to be both fast and energy efficient. Built using 0.15 micron technology, it features 8 clusters of 6 ALUs each, for a total of 48 pipelined ALUs, with 16.9 GB/s of off-chip memory bandwidth. When clocked at 96 MHz, up to 3.5 GFLOPS can be reached (MPEG2 encode) while dissipating only 1.6 W of core power at 12 V. Imagine is, proportionally, an order of magnitude faster than a Pentium 4 while still being more power efficient than DSPs^[4].
GPUs are recognized as widespread, consumer-grade stream processors. Although they are usually limited in hardware functionality, a lot of effort is being spent on optimizing algorithms for this family of processors, which usually have very high horsepower. Several generations are worth noting from a stream processing point of view.
1. Pre-NV2x: no explicit support for stream processing. Kernel operations are usually hidden in the API and provide too little flexibility for general use.
2. NV2x: kernel stream operations are now explicitly under the programmer's control, but only for vertex processing (fragments still use the old paradigms). The lack of branching support severely hampers flexibility, but some algorithms can be run (notably, low-precision fluid simulation).
3. RD3xx: increased performance and precision with limited support for branching/looping in both vertex and fragment processing. The model is now flexible enough to cover many purposes.
4. NV3x: better branching capabilities and precision at the expense of performance.
5. NV4x: currently (September 25, 2005) the state of the art. Very flexible branching support, although some limitations still exist on the number of operations to be executed and on recursion depth. Performance is estimated to be from 20 to 44 GFLOPS.
What most people do not realize is that the Cell processor is actually a generic CPU (the PPE) tightly coupled with a stream processor (the various SPEs). The streaming-like architecture is tuned for flexibility rather than high performance, but from a high-level point of view the model still scales with the number of SPEs. Note, however, that scaling to a different number of SPEs may require recompiling the application (while GPUs do this automatically). Various components can be seen as fitting in a streaming paradigm.

References

^[1] Khailany, Dally, Rixner, Kapasi, Owens and Towles, "Exploring VLSI Scalability of Stream Processors", Stanford and Rice University.
^[2] Gummaraju and Rosenblum, "Stream processing in General-Purpose Processors", Stanford University.
^[3] Venkatasubramanian, "The Graphics Card as a Stream Computer", AT&T Labs - Research.
^[4] Khailany, Dally, Rixner, Kapasi, Owens, Ho Ahn and Mattson, "Programmable Stream Processors", Universities of Stanford, Rice, California (Davis) and Reservoir Labs.