AI accelerator

From Wikipedia, the free encyclopedia
An AI accelerator is (as of 2016) an emerging class of microprocessor[1] or computer system[2] designed to accelerate artificial neural networks, machine vision and other machine learning algorithms for robotics, internet of things and other data-intensive or sensor-driven tasks.[3] They are sometimes manycore designs (mirroring the massively parallel nature of biological neural networks). Many vendor-specific terms exist for devices in this space.

They are distinct from GPUs (which are commonly used for the same role) in that they lack any fixed-function units for graphics and generally focus on low-precision arithmetic.

History of AI acceleration

Computer systems have frequently complemented the CPU with special-purpose accelerators for intensive tasks, most notably graphics, but also sound, video, etc. Over time, various accelerators have appeared that are applicable to AI workloads.

Early attempts

Early on, DSPs (such as the AT&T DSP32C) were used as neural network accelerators, e.g. to accelerate OCR software,[4] and there were attempts to create parallel high-throughput systems for workstations (e.g. TetraSpert in the 1990s, which was a parallel fixed-point vector processor[5]), aimed at various applications including neural network simulations.[6] FPGA-based accelerators were also first explored in the 1990s for both inference[7] and training.[8] ANNA was a neural net CMOS accelerator developed by Yann LeCun.[9] There was another attempt to build a neural net workstation called Synapse-1[10] (not to be confused with the current IBM SyNAPSE project).

Heterogeneous computing

Architectures such as the Cell microprocessor[11] have exhibited features significantly overlapping with AI accelerators: support for packed low-precision arithmetic, a dataflow architecture, and prioritising throughput over latency and "branchy-int" code. This was a move toward heterogeneous computing, with a number of throughput-oriented accelerators intended to assist the CPU with a range of intensive tasks: physics simulation, AI, video encoding/decoding, and certain graphics tasks beyond the GPUs of its time.[not in citation given]

The physics processing unit was yet another attempt to fill the gap between CPU and GPU in PC hardware; however, physics tends to require 32-bit precision and up, whilst much lower precision can be a better tradeoff for AI.[12]

CPUs themselves have gained increasingly wide SIMD units (driven by video and gaming workloads) and increased the number of cores in a bid to eliminate the need for another accelerator, as well as for accelerating application code. These tend to support packed low precision data types.[13]

Use of GPGPU

Innovative software appeared spontaneously, using vertex and pixel shaders for general-purpose computation through rendering APIs by storing non-graphical data in vertex buffers and texture maps (including implementations of convolutional neural networks for OCR[14]).[15] Vendors of graphics processing units subsequently saw the opportunity to expand their market and generalised their shader pipelines with specific support for GPGPU, mostly motivated by the demands of video game physics but also targeting scientific computing.[16]

This killed off the market for a dedicated physics accelerator, superseded Cell in video game consoles,[17] and eventually led to the use of GPUs in running convolutional neural networks such as AlexNet (which exhibited leading performance in the ImageNet Large Scale Visual Recognition Challenge).[18]

As such, as of 2016 GPUs are popular for AI work, and they continue to evolve in a direction to facilitate deep learning, both for training[19] and for inference in devices such as self-driving cars,[20] while gaining additional connective capability for the kind of dataflow workloads AI benefits from (e.g. Nvidia NVLink).[21]

Use of FPGA

Deep learning frameworks are still evolving, making it hard to design custom hardware. Reconfigurable devices like field-programmable gate arrays (FPGA) make it easier to evolve hardware, frameworks and software alongside each other.[7][8][22]

Microsoft has used FPGA chips to accelerate inference.[23][24] This has motivated Intel to purchase Altera with the aim of integrating FPGAs in server CPUs, which would be capable of accelerating AI as well as other tasks.[citation needed]

Emergence of dedicated AI accelerator ASICs

Whilst GPUs and FPGAs perform far better than CPUs for these tasks, a factor of 10 in efficiency[25][26] can still be gained with a more specific design, via an application-specific integrated circuit (ASIC).

Memory access pattern

The memory access pattern of AI calculations differs from that of graphics: the dataflow is more predictable but deeper, and benefits more from the ability to keep temporary variables on-chip (e.g. in scratchpad memory rather than caches). GPUs, by contrast, devote silicon to efficiently handling highly non-linear gather-scatter addressing between texture maps and frame buffers, and to texture filtering, as needed for their primary role in 3D rendering.
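The benefit of keeping operands on-chip can be illustrated with a blocked matrix multiply, a common core of neural network workloads. The following is a minimal sketch under a deliberately simplified cost model, not a description of any particular accelerator: the function names, the tile size, and the idea of counting "loads" as elements copied into a scratchpad are all assumptions made for illustration.

```python
def naive_loads(n):
    """Operand loads from main memory if nothing is reused on-chip:
    one element of A and one of B per multiply-accumulate."""
    return 2 * n ** 3


def matmul_tiled(a, b, tile):
    """Blocked n x n matrix multiply that stages tile x tile blocks
    in a (simulated) scratchpad and counts the elements staged."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Copy one block of A and one block of B "on-chip".
                a_blk = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_blk = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += sum(len(r) for r in a_blk) + sum(len(r) for r in b_blk)
                # Every staged element is now reused many times.
                for i, a_row in enumerate(a_blk):
                    for k, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_blk[k]):
                            c[i0 + i][j0 + j] += a_val * b_val
    return c, loads


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float((i * j) % 5) for j in range(n)] for i in range(n)]
c, loads = matmul_tiled(a, b, tile=4)
print(loads, naive_loads(n))  # 256 staged elements versus 1024 unstaged loads
```

In this model the traffic to main memory falls by a factor equal to the tile size, and the access order is fixed in advance by the loop structure, which is the kind of predictability that lets a scratchpad replace a cache.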


AI researchers are often finding minimal accuracy losses whilst dropping to 16 or even 8 bits,[12] suggesting that a larger volume of low precision arithmetic is a better use of the same bandwidth. Some researchers have even tried using 1-bit precision (i.e. putting the emphasis entirely on spatial information in vision tasks).[27] IBM's design is more radical, dispensing with scalar values altogether and accumulating timed pulses to represent activations stochastically, requiring conversion of traditional representations.[28]
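The effect of reduced precision can be sketched with a dot product, the basic operation inside a neural network layer. The example below is an illustrative assumption rather than the scheme used in any cited work: it applies simple symmetric linear quantization, and the helper names `quantize` and `dot_quantized` are hypothetical.

```python
import random


def quantize(values, bits=8):
    """Symmetric linear quantization to signed integers of the given width.
    Returns the integer codes and the scale needed to recover real values."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1           # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dot_quantized(a, b, bits=8):
    """Dot product computed on integer codes, rescaled once at the end."""
    qa, sa = quantize(a, bits)
    qb, sb = quantize(b, bits)
    acc = sum(x * y for x, y in zip(qa, qb))   # integer multiply-accumulate
    return acc * sa * sb


random.seed(0)
a = [random.uniform(-1, 1) for _ in range(256)]
b = [random.uniform(-1, 1) for _ in range(256)]
exact = sum(x * y for x, y in zip(a, b))
print(exact, dot_quantized(a, b, 8), dot_quantized(a, b, 16))
```

The integer multiply-accumulate in the inner loop is the part that hardware can implement with narrow, cheap arithmetic units; only the final rescaling needs a wider type.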

Slowing of Moore's law

As of 2016, the slowing (and possible end) of Moore's law[29] has led some to suggest refocusing industry efforts on application-led silicon design,[30] whereas in the past, increasingly powerful general-purpose chips were applied to varying applications via software. In this scenario, a diversification of dedicated AI accelerators makes more sense than continuing to stretch GPUs and CPUs.


As of 2016, the field is still in flux and vendors are pushing their own marketing terms for what amounts to an "AI accelerator", in the hope that their designs and APIs will dominate. There is no consensus on the boundary between these devices, nor on the exact form they will take; however, several examples clearly aim to fill this new space, with a fair amount of overlap in capabilities.

In the past when consumer graphics accelerators emerged, the industry eventually adopted Nvidia's self-assigned term, "the GPU",[31] as the collective noun for "graphics accelerators", which had taken many forms before settling on an overall pipeline implementing a model presented by Direct3D.

Potential applications


  • STMicroelectronics at the start of 2017 presented a demonstrator SoC manufactured in a 28 nm process containing a deep CNN accelerator.[36]
  • Cadence Tensilica Vision C5, introduced May 2017, is a neural-network-optimized DSP IP core for SoCs. C5 contains 1024 MAC units.[37]
  • PowerVR 2NX NNA (Neural Net Accelerator), launched September 2017, is an IP core from Imagination Technologies licensed for integration into chips. It supports four to sixteen bits of precision.[38]
  • NM500 is the latest as of 2016 in a series of accelerator chips for Radial Basis Function neural nets from General Vision.[39]
  • Huawei's smartphone chip Kirin 970 contains a dedicated "Neural Processing Unit" (NPU).[40]
  • There is a Neural Engine in the Apple iPhone X's A11 Bionic SoC.[41]
  • Vision processing units
  • Google Tensor processing unit was presented as an accelerator for Google's TensorFlow framework, which is extensively used for convolutional neural networks. It focuses on a high volume of 8-bit precision arithmetic.
  • SpiNNaker is a many-core design combining traditional ARM architecture cores with an enhanced network fabric design specialised for simulating a large neural network.
  • Accelerators for spiking neural networks:
    • Intel Loihi, introduced in September 2017, is an experimental neuromorphic chip containing 130,000 artificial neurons communicating asynchronously using spiking through 130 million artificial synapses.[43]
    • TrueNorth is a manycore design based on spiking neurons rather than traditional arithmetic. The frequency of pulses represents signal intensity. The approach has been criticised,[44] but some results are promising, with large energy savings demonstrated for vision tasks.[45]
    • BrainChip in September 2017 introduced a commercial PCI Express card with a Xilinx Kintex Ultrascale FPGA running neuromorphic neural cores applying pattern recognition on 600 video images per second using 16 watts of power.[46]
    • IIT Madras is designing a spiking neuron accelerator for new RISC-V systems, aimed at big-data analytics in servers.[47]
  • Intel Nervana NNP (Neural Network Processor) (a.k.a. "Lake Crest") was available in samples in October 2017. According to Intel this was the first commercially available chip with a purpose-built architecture for deep learning. Facebook was a partner in the design process.[48][49]
  • Eyeriss is a design aimed explicitly at convolutional neural networks, using a scratchpad and an on-chip network architecture.[50]
  • Adapteva Epiphany is targeted as a coprocessor, featuring a network-on-a-chip scratchpad memory model suitable for a dataflow programming model, which should suit many machine learning tasks.
  • Kalray have demonstrated an MPPA[51] and report efficiency gains over GPUs for convolutional neural nets.
  • Nvidia DGX-1 is based on GPU technology; however, the use of multiple chips forming a fabric via NVLink specialises its memory architecture in a way that is particularly suitable for deep learning.
  • Vathys is building an accelerator that, in contrast to others, uses floating point, not fixed point, a decision which recent papers such as Hill et al, 2016[clarification needed] support.
  • Nvidia Volta augments the GPU with additional 'tensor units' targeted specifically at accelerating calculations for neural networks.[52]
  • Graphcore IPU, a graph-based AI accelerator.[53]
  • DPU, by Wave Computing, a dataflow architecture.[54]
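The idea that the frequency of pulses represents signal intensity, mentioned for the spiking designs in the list above, can be sketched as simple stochastic rate coding. This is an illustrative assumption and not the encoding used by TrueNorth, Loihi or any other listed chip; the function names are hypothetical.

```python
import random


def encode_rate(intensity, steps, rng):
    """Turn an intensity in [0, 1] into a 0/1 spike train in which
    each time step fires with probability equal to the intensity."""
    return [1 if rng.random() < intensity else 0 for _ in range(steps)]


def decode_rate(spikes):
    """Recover the intensity as the observed firing frequency."""
    return sum(spikes) / len(spikes)


rng = random.Random(42)
train = encode_rate(0.25, 10_000, rng)
print(decode_rate(train))  # close to 0.25; precision improves with more steps
```

The sketch shows the tradeoff such a representation makes: each signal needs only single-bit events, but recovering a value precisely requires observing many time steps.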

See also


  1. ^ "Intel unveils Movidius Compute Stick USB AI Accelerator". 
  2. ^ "Inspurs unveils GX4 AI Accelerator". 
  3. ^ "google developing AI processors". google using its own AI accelerators.
  4. ^ "convolutional neural network demo from 1993 featuring DSP32 accelerator". 
  5. ^ "design of a connectionist network supercomputer". 
  6. ^ "The end of general purpose computers (not)". This presentation covers a past attempt at neural net accelerators, notes the similarity to the modern SLI GPGPU processor setup, and argues that general purpose vector accelerators are the way forward (in relation to the RISC-V hwacha project). It argues that NNs are just dense and sparse matrices, one of several recurring algorithms.
  7. ^ a b "Space Efficient Neural Net Implementation" (PDF). 
  8. ^ a b "A Generic Building Block for Hopfield Neural Networks with On-Chip Learning" (PDF). 
  9. ^ Application of the ANNA Neural Network Chip to High-Speed Character Recognition
  10. ^ "SYNAPSE-1: a high-speed general purpose parallel neurocomputer system". 
  11. ^ "Synergistic Processing in Cell's Multicore Architecture". 
  12. ^ a b "Deep Learning with Limited Numerical Precision" (PDF). 
  13. ^ "Improving the performance of video with AVX". 
  14. ^ "microsoft research/pixel shaders/MNIST". 
  15. ^ "how the gpu came to be used for general computation". 
  16. ^ "nvidia tesla microarchitecture" (PDF). 
  17. ^ "End of the line for IBM’s Cell". 
  18. ^ "imagenet classification with deep convolutional neural networks" (PDF). 
  19. ^ "nvidia driving the development of deep learning". 
  20. ^ "nvidia introduces supercomputer for self driving cars". 
  21. ^ "how nvlink will enable faster easier multi GPU computing". 
  22. ^ "FPGA Based Deep Learning Accelerators Take on ASICs". The Next Platform. 2016-08-23. Retrieved 2016-09-07. 
  23. ^ "microsoft extends fpga reach from bing to deep learning". 
  24. ^ "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware" (PDF). 
  25. ^ "Google boosts machine learning with its Tensor Processing Unit". 2016-05-19. Retrieved 2016-09-13. 
  26. ^ "Chip could bring deep learning to mobile devices". 2016-02-03. Retrieved 2016-09-13. 
  27. ^ Rastegari, Mohammad; Ordonez, Vicente; Redmon, Joseph; Farhadi, Ali (2016). "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks". arXiv:1603.05279 [cs.CV].
  28. ^ Diehl, Peter U.; Zarrella, Guido; Cassidy, Andrew; Pedroni, Bruno U.; Neftci, Emre (2016). "Conversion of Artificial Recurrent Neural Networks to Spiking Neural Networks for Low-power Neuromorphic Hardware". arXiv:1601.04187 [cs.NE].
  29. ^ "intels former chief architect - moore's law will be dead within a decade". 
  30. ^ "more than moore" (PDF). 
  31. ^ "NVIDIA launches the Worlds First Graphics Processing Unit, the GeForce 256".
  32. ^ "drive px". 
  33. ^ "design of a machine vision system for weed control" (PDF). 
  34. ^ "qualcomm research brings server class machine learning to every data devices". 
  35. ^ "movidius powers worlds most intelligent drone". 
  36. ^ "A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems" (PDF). 
  37. ^ "Cadence Unveils Industry’s First Neural Network DSP IP for Automotive, Surveillance, Drone and Mobile Markets". 
  38. ^ "The highest performance neural network inference accelerator". 
  39. ^ "NM500, Neuromorphic chip with 576 neurons". 
  40. ^ "HUAWEI Reveals the Future of Mobile AI at IFA 2017". 
  41. ^ "The iPhone X’s new neural engine exemplifies Apple’s approach to AI". The Verge. Retrieved 2017-09-23. 
  42. ^ "The Evolution of EyeQ". 
  43. ^ "Intel’s New Self-Learning Chip Promises to Accelerate Artificial Intelligence". 
  44. ^ "yann lecun on IBM truenorth". argues that spiking neurons have never produced leading quality results, and that 8-16 bit precision is optimal, pushes the competing 'neuflow' design
  45. ^ "IBM cracks open new era of neuromorphic computing". TrueNorth is incredibly efficient: The chip consumes just 72 milliwatts at max load, which equates to around 400 billion synaptic operations per second per watt — or about 176,000 times more efficient than a modern CPU running the same brain-like workload, or 769 times more efficient than other state-of-the-art neuromorphic approaches 
  46. ^ "BrainChip Accelerator". 
  47. ^ "India preps RISC-V Processors - Shakti targets servers, IoT, analytics". The Shakti project now includes plans for at least six microprocessor designs as well as associated fabrics and an accelerator chip 
  48. ^ Kampman, Jeff (17 October 2017). "Intel unveils purpose-built Neural Network Processor for deep learning". Tech Report. Retrieved 18 October 2017. 
  49. ^ "Intel Nervana Neural Network Processors (NNP) Redefine AI Silicon". Retrieved 20 October 2017. 
  50. ^ Chen, Yu-Hsin; Krishna, Tushar; Emer, Joel; Sze, Vivienne (2016). "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks". IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers. pp. 262–263. 
  51. ^ "kalray MPPA" (PDF). 
  52. ^ "Nvidia goes beyond the GPU for AI with Volta". 
  53. ^ "Graphcore Technology". 
  54. ^ "Wave Computing's DPU architecture". 

External links