Michael Gschwind

Born: Vienna, Austria
Nationality: USA
Alma mater: Technische Universität Wien

Michael Karl Gschwind is an American computer scientist who is currently a director and principal engineer at Meta Platforms in Menlo Park, California. He is recognized for his seminal contributions to the design and exploitation of general-purpose programmable accelerators, as an early advocate of sustainability in computer design, and as a prolific inventor.[1]

Accelerators

Gschwind is best known for his contributions to general-purpose programmable accelerators and heterogeneous computing as architect of the Cell Broadband Engine processor used in the Sony PlayStation 3,[2][3] and Roadrunner, the first supercomputer to reach sustained petaflop operation. As Chief Architect for IBM System Architecture, he led the integration of Nvidia GPUs and IBM CPUs to create the Summit and Sierra supercomputers.

AI Acceleration

Gschwind was an early advocate of AI hardware acceleration with GPUs and programmable accelerators. As IBM's Chief Engineer for AI, he led the development of IBM's first AI products and initiated the PowerAI project, which brought to market AI-optimized hardware (originally known as "Minsky systems") and the first prebuilt hardware-optimized AI frameworks. At Facebook, Gschwind led the company-wide adoption of ASIC[4] and GPU inference and AI accelerator enablement for PyTorch, leading the development of Accelerated Transformers[5] (formerly "Better Transformer"[6]) to establish PyTorch as the standard ecosystem for large language models and generative AI. Gschwind is one of the architects of MultiRay, an accelerator-based platform for serving foundation models and the first production system to serve large language models at scale in the industry, serving over 800 billion queries per day in 2022.[7][8] Gschwind is a pioneer and advocate of Sustainable AI.[9]

Supercomputer Design

Gschwind was a chief architect for hardware design and software architecture for several supercomputers, including three top-ranked supercomputer systems: Roadrunner (June 2008 – November 2009), Sequoia (June 2012 – November 2012), and Summit (June 2018 – June 2020).

Roadrunner was a supercomputer built by IBM for the Los Alamos National Laboratory in New Mexico, USA. The US$100-million Roadrunner was designed for a peak performance of 1.7 petaflops. It achieved 1.026 petaflops on May 25, 2008, to become the world's first TOP500 LINPACK sustained 1.0 petaflops system.[10][11] It was also the fourth-most energy-efficient supercomputer in the world on the Supermicro Green500 list, with an operational rate of 444.94 megaflops per watt of power used.

Sequoia was a petascale Blue Gene/Q supercomputer constructed by IBM for the National Nuclear Security Administration as part of the Advanced Simulation and Computing Program (ASC). It was delivered to the Lawrence Livermore National Laboratory (LLNL) in 2011 and was fully deployed in June 2012.[12] Sequoia was dismantled in 2020; its last position on the TOP500 list was #22, in November 2019.

Summit is a supercomputer developed by IBM for use at the Oak Ridge Leadership Computing Facility (OLCF), a facility at the Oak Ridge National Laboratory. It held the number 1 position on the TOP500 list from June 2018 to June 2020.[13][14] Its LINPACK benchmark performance is 148.6 petaFLOPS.[15]

Many-Core Processor Design

Gschwind was an early advocate of many-core processor design to overcome the power and performance limitations of single-processor designs. Gschwind co-authored an analysis of the limitations of frequency scaling which arguably led to an industry-wide transition to many-core designs.[16] Gschwind was a lead architect for several many-core designs, including the first commercial many-core processor, Cell, with 9 cores; BlueGene/Q, with 18 cores; and several enterprise and mainframe processors (POWER7/POWER8/POWER9 with up to 24 cores; z10-z15 with up to 12 cores).

Compiler Technologies

Gschwind has made seminal contributions to compiler technology, with a particular emphasis on pioneering contributions to just-in-time compilation and compilers in supercomputing.

Just-in-time compilation[17][18][19][20][21][22]

Accelerator compilation[23][24]

Compilation and APIs[25][26][27]

AI compilation[28][29]

SIMD Parallel Vector Architecture

Gschwind is a pioneer of SIMD parallel vector architecture to increase the number of operations which can be performed per cycle. To enable efficient compilation, Gschwind proposed the implementation of merged scalar and vector execution units, eliminating the cost of copies between scalar and vectorized code, and simplifying compiler architecture by resolving phase ordering problems in compilers.

The Cell's accelerator cores (Synergistic Processor Units, SPUs) contain a single 128-entry register file with 128 bits per register. Each register may hold either a scalar or a vector of multiple values.[30] The simplified cost model leads to significantly improved vectorization success, improving overall program performance and efficiency.[31]

The vector-scalar approach was also adopted by the IBM Power VSX (Vector Scalar Extension) SIMD instructions,[32] BlueGene/Q vector instructions[33][34] and System/z mainframe vector instruction set,[35][36] the design of all three IBM vector-scalar architectures having been led by Gschwind as Chief Architect for IBM System Architecture.
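
The unified register file described above can be illustrated with a minimal Python sketch. It is a toy model, not the SPU instruction set: the class name `UnifiedRegisterFile`, the four 32-bit lanes per register, and the convention of keeping a scalar in lane 0 are assumptions made for illustration. Only the 128 registers of 128 bits each come from the text.

```python
import struct

NUM_REGS = 128    # registers in the single register file (from the text)
REG_BYTES = 16    # 128 bits per register (from the text)
LANES = 4         # assumption: four 32-bit signed integer lanes


def wrap32(value):
    """Wrap an integer to signed 32-bit range, as fixed-width hardware would."""
    return (value + 2**31) % 2**32 - 2**31


class UnifiedRegisterFile:
    """Toy model: scalars and vectors share one set of 128-bit registers."""

    def __init__(self):
        self.regs = [bytes(REG_BYTES) for _ in range(NUM_REGS)]

    def write_vector(self, reg, lanes):
        self.regs[reg] = struct.pack(">4i", *lanes)

    def read_vector(self, reg):
        return list(struct.unpack(">4i", self.regs[reg]))

    def write_scalar(self, reg, value):
        # Assumption: a scalar occupies lane 0 and the other lanes are zero.
        self.write_vector(reg, [value, 0, 0, 0])

    def read_scalar(self, reg):
        return self.read_vector(reg)[0]

    def add(self, dst, a, b):
        # One lane-wise add serves both scalar and vector code, so no value
        # is ever copied between separate scalar and vector register files.
        lanes = [wrap32(x + y)
                 for x, y in zip(self.read_vector(a), self.read_vector(b))]
        self.write_vector(dst, lanes)


rf = UnifiedRegisterFile()
rf.write_vector(1, [1, 2, 3, 4])
rf.write_vector(2, [10, 20, 30, 40])
rf.add(3, 1, 2)            # vector add
rf.write_scalar(4, 5)
rf.write_scalar(5, 7)
rf.add(6, 4, 5)            # scalar add uses the same operation
print(rf.read_vector(3), rf.read_scalar(6))
```

Because the same `add` handles both cases, a compiler targeting such a model would not need to weigh the cost of moving values between scalar and vector registers when deciding whether to vectorize.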

Background

Gschwind was born in Vienna and obtained his doctorate in Computer Engineering at the Technische Universität Wien in 1996. He joined the IBM Thomas J. Watson Research Center in Yorktown Heights, NY, and also held positions in the IBM Systems product group and at IBM's corporate headquarters in Armonk, NY. At Huawei, Gschwind served as Vice President of Artificial Intelligence and Accelerated Systems. Gschwind is currently a director at Meta Platforms, where he has been responsible for AI Acceleration and AI infrastructure.[citation needed]

References

  1. "Michael Karl Gschwind". www.ppubs.uspto.gov.
  2. David Becker (December 3, 2004). "PlayStation 3 chip goes easy on developers". CNET. Retrieved January 13, 2019.
  3. Scarpino, M. (2008). Programming the cell processor: for games, graphics, and computation. Pearson Education.
  4. First-Generation Inference Accelerator Deployment at Facebook, https://arxiv.org/pdf/2107.04140.pdf
  5. "PyTorch". www.pytorch.org. Retrieved 2023-10-28.
  6. "A BetterTransformer for Fast Transformer Inference". pytorch.org. Retrieved 2023-10-28.
  7. "MultiRay: Optimizing efficiency for large-scale AI models". ai.meta.com. Retrieved 2023-10-28.
  8. MultiRay: An Accelerated Embedding Service for Content Understanding, https://static.sched.com/hosted_files/pytorch2023/60/PyTorch_Conf_2023-Multiray.pdf
  9. Sustainable AI: Environmental Implications, Challenges and Opportunities, https://arxiv.org/pdf/2111.00364.pdf
  10. Gaudin, Sharon (2008-06-09). "IBM's Roadrunner smashes 4-minute mile of supercomputing". Computerworld. Archived from the original on 2008-12-24. Retrieved 2008-06-10.
  11. Fildes, Jonathan (2008-06-09). "Supercomputer sets petaflop pace". BBC News. Retrieved 2008-06-09.
  12. NNSA awards IBM contract to build next generation supercomputer, February 3, 2009
  13. Lohr, Steve (8 June 2018). "Move Over, China: U.S. Is Again Home to World's Speediest Supercomputer". The New York Times. Retrieved 19 July 2018.
  14. "Top 500 List - November 2022". TOP500. November 2022. Retrieved 13 April 2022.
  15. "November 2022 | TOP500 Supercomputer Sites". TOP500. Retrieved 13 April 2022.
  16. Optimizing pipelines for power and performance, MICRO 2002. https://www.researchgate.net/publication/4001353_Optimizing_pipelines_for_power_and_performance
  17. Efficient Instruction Scheduling with Precise Exceptions, https://www.researchgate.net/publication/244186152_Efficient_instruction_scheduling_with_precise_exceptions
  18. Optimizations and oracle parallelism with dynamic translation, https://www.researchgate.net/publication/3830428_Optimizations_and_oracle_parallelism_with_dynamic_translation
  19. Dynamic and Transparent Binary Translation, https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=ee7ad16a1f0c1988e93209d4b56d7ff4e8b68566
  20. Dynamic binary translation and optimization, https://www.researchgate.net/publication/3044344_Dynamic_binary_translation_and_optimization
  21. Advances and future challenges in binary translation and optimization, https://ieeexplore.ieee.org/document/964447
  22. Binary translation and architecture convergence issues for IBM System/390, https://www.researchgate.net/profile/Michael-Gschwind/publication/221235791_Binary_translation_and_architecture_convergence_issues_for_IBM_system390/links/0046352f27d9de5653000000/Binary-translation-and-architecture-convergence-issues-for-IBM-system-390.pdf
  23. Optimizing Compiler for the CELL Processor, Conference on Parallel Architectures and Compilation Techniques (PACT 2005), September 2005. https://dl.acm.org/doi/10.1109/PACT.2005.33
  24. An Open Source Environment for Cell Broadband Engine System Software, https://www.researchgate.net/publication/2961855_An_Open_Source_Environment_for_Cell_Broadband_Engine_System_Software
  25. OpenPOWER: Reengineering a server ecosystem for large-scale data centers, https://old.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-12-day2-epub/HC26.12-7-Dense-Servers-epub/HC26.12.730-%20OpenPower-Gschwind-IBM.pdf
  26. Power Architecture 64-Bit ELF V2 ABI Specification, https://ftp.rtems.org/pub/rtems/people/sebh/ABI64BitOpenPOWERv1.1_16July2015_pub.pdf
  27. Reengineering a server ecosystem for enhanced portability and performance, https://www.researchgate.net/publication/322706081_Reengineering_a_server_ecosystem_for_enhanced_portability_and_performance
  28. First-Generation Inference Accelerator Deployment at Facebook, https://research.facebook.com/publications/first-generation-inference-accelerator-deployment-at-facebook
  29. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation, https://pytorch.org/assets/pytorch2-2.pdf
  30. Synergistic Processing in Cell's Multicore Architecture, IEEE MICRO, https://ieeexplore.ieee.org/document/1624323
  31. Optimizing Compiler for the CELL Processor, Conference on Parallel Architectures and Compilation Techniques (PACT 2005), September 2005. https://dl.acm.org/doi/10.1109/PACT.2005.33
  32. Workload acceleration with the IBM POWER vector-scalar architecture, IBM JRD, https://ieeexplore.ieee.org/abstract/document/7442604
  33. The IBM Blue Gene/Q Compute Chip, https://ieeexplore.ieee.org/document/6109225
  34. Morgan, Timothy Prickett (22 November 2010). "IBM uncloaks 20 petaflops BlueGene/Q super". The Register.
  35. The SIMD accelerator for business analytics on the IBM z13, https://dl.acm.org/doi/10.1147/JRD.2015.2426576
  36. SIMD Processing on IBM z14, z13 and z13s, https://www.ibm.com/downloads/cas/WVPALM0N