Michael Gschwind

Born	Vienna, Austria
Nationality	USA
Alma mater	Technische Universität Wien

Michael Karl Gschwind is an American computer scientist who is currently a director and principal engineer at Meta Platforms in Menlo Park, California. He is recognized for his seminal contributions to the design and exploitation of general-purpose programmable accelerators, as an early advocate of sustainability in computer design, and as a prolific inventor.^[1]

Accelerators

Gschwind is best known for his contributions to general-purpose programmable accelerators and heterogeneous computing as architect of the Cell Broadband Engine processor, used in the Sony PlayStation 3,^[2]^[3] and of Roadrunner, the first supercomputer to reach sustained petaflop operation. As Chief Architect for IBM System Architecture, he led the integration of Nvidia GPUs and IBM CPUs to create the Summit and Sierra supercomputers.

AI Acceleration

Gschwind was an early advocate of AI hardware acceleration with GPUs and programmable accelerators. As IBM's Chief Engineer for AI, he led the development of IBM's first AI products and initiated the PowerAI project, which brought to market AI-optimized hardware (originally known as "Minsky systems") and the first prebuilt hardware-optimized AI frameworks. At Facebook, Gschwind led the company-wide adoption of ASIC^[4] and GPU inference and AI accelerator enablement for PyTorch, leading the development of Accelerated Transformers^[5] (formerly "Better Transformer"^[6]) to establish PyTorch as the standard ecosystem for large language models and generative AI. Gschwind is one of the architects of Multiray, an accelerator-based platform for serving foundation models and the first production system to serve large language models at scale in the industry, serving over 800 billion queries per day in 2022.^[7]^[8] Gschwind is a pioneer and advocate of Sustainable AI.^[9]

Supercomputer Design

Gschwind was a chief architect for hardware design and software architecture for several supercomputers, including three top-ranked systems: Roadrunner (June 2008 – November 2009), Sequoia (June 2012 – November 2012), and Summit (June 2018 – June 2020).

Roadrunner was a supercomputer built by IBM for the Los Alamos National Laboratory in New Mexico, USA. The US$100-million Roadrunner was designed for a peak performance of 1.7 petaflops. It achieved 1.026 petaflops on May 25, 2008, to become the world's first TOP500 LINPACK sustained 1.0 petaflops system.^[10]^[11] It was also the fourth-most energy-efficient supercomputer in the world on the Supermicro Green500 list, with an operational rate of 444.94 megaflops per watt of power used.
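As a rough cross-check of the figures quoted above, the sketch below derives the fraction of peak performance that the LINPACK run achieved and the power draw implied by the efficiency figure. It assumes that the 1.026-petaflop result and the 444.94 megaflops-per-watt figure describe the same run, which the text does not state; the function and constant names are illustrative only.

```python
# Figures quoted in the paragraph above.
PEAK_PFLOPS = 1.7          # designed peak performance, in petaflops
LINPACK_PFLOPS = 1.026     # sustained LINPACK result, in petaflops
MFLOPS_PER_WATT = 444.94   # Green500 efficiency, in megaflops per watt


def fraction_of_peak(sustained_pflops: float, peak_pflops: float) -> float:
    """Return the sustained result as a fraction of the designed peak."""
    return sustained_pflops / peak_pflops


def implied_power_watts(sustained_pflops: float, mflops_per_watt: float) -> float:
    """Return the power draw implied by a sustained rate and an efficiency.

    Assumption: both figures were measured on the same run.
    """
    sustained_mflops = sustained_pflops * 1e9  # 1 petaflop = 1e9 megaflops
    return sustained_mflops / mflops_per_watt


if __name__ == "__main__":
    share = fraction_of_peak(LINPACK_PFLOPS, PEAK_PFLOPS)
    power = implied_power_watts(LINPACK_PFLOPS, MFLOPS_PER_WATT)
    print(f"Fraction of peak: {share:.1%}")
    print(f"Implied power draw: {power / 1e6:.2f} MW")
```

Under that assumption, the run reached about 60% of the designed peak at an implied draw of roughly 2.3 megawatts.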

Sequoia was a petascale Blue Gene/Q supercomputer constructed by IBM for the National Nuclear Security Administration as part of the Advanced Simulation and Computing Program (ASC). It was delivered to the Lawrence Livermore National Laboratory (LLNL) in 2011 and was fully deployed in June 2012.^[12] Sequoia was dismantled in 2020; its last position on the top500.org list was #22, in the November 2019 list.

Summit is a supercomputer developed by IBM for use at the Oak Ridge Leadership Computing Facility (OLCF), a facility at the Oak Ridge National Laboratory. It held the number 1 position on the TOP500 list from June 2018 to June 2020.^[13]^[14] Its LINPACK benchmark result is 148.6 petaFLOPS.^[15]

Many-Core Processor Design

Gschwind was an early advocate of many-core processor design to overcome the power and performance limitations of single-processor designs. He co-authored an analysis of the limitations of frequency scaling which arguably led to an industry-wide transition to many-core designs.^[16] Gschwind was a lead architect for several many-core designs, including Cell, the first commercial many-core processor, with 9 cores; BlueGene/Q, with 18 cores; and several enterprise and mainframe processors (POWER7/POWER8/POWER9 with up to 24 cores; z10-z15 with up to 12 cores).

Compiler Technologies

Gschwind has made seminal contributions to compiler technology, particularly to just-in-time compilation and to compilers for supercomputing. His publications in this area cover the following topics:

Just-in-time compilation^[17]^[18]^[19]^[20]^[21]^[22]

Accelerator compilation^[23]^[24]

Compilation and APIs^[25]^[26]^[27]

AI compilation^[28]^[29]

SIMD Parallel Vector Architecture

Gschwind is a pioneer of SIMD parallel vector architecture to increase the number of operations which can be performed per cycle. To enable efficient compilation, Gschwind proposed the implementation of merged scalar and vector execution units, eliminating the cost of copies between scalar and vectorized code, and simplifying compiler architecture by resolving phase ordering problems in compilers.

The Cell's accelerator cores (Synergistic Processor Units, SPUs) each contain a single register file of 128 registers with 128 bits per register. A register may hold either a scalar or a vector of multiple values.^[30] The simplified cost model leads to significantly improved vectorization success, improving overall program performance and efficiency.^[31]
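The idea of a single register file shared by scalar and vector code can be illustrated with a small model. This is a minimal sketch rather than a description of the actual SPU hardware or instruction set: the class and method names are hypothetical, the 32-bit lane width is an assumption chosen for illustration, and placing scalars in lane 0 is a modeling choice of this sketch.

```python
NUM_REGISTERS = 128   # registers in the unified file (from the text)
REGISTER_BITS = 128   # bits per register (from the text)
LANE_BITS = 32        # assumption: four 32-bit lanes per register
LANES = REGISTER_BITS // LANE_BITS
LANE_MASK = (1 << LANE_BITS) - 1


class UnifiedRegisterFile:
    """Toy model: one register file holds both scalars and vectors."""

    def __init__(self) -> None:
        self.regs = [[0] * LANES for _ in range(NUM_REGISTERS)]

    def write_vector(self, reg: int, values: list[int]) -> None:
        if len(values) != LANES:
            raise ValueError(f"expected {LANES} lane values")
        self.regs[reg] = [v & LANE_MASK for v in values]

    def write_scalar(self, reg: int, value: int) -> None:
        # Modeling choice: a scalar occupies lane 0; other lanes are zeroed.
        self.regs[reg] = [value & LANE_MASK] + [0] * (LANES - 1)

    def add(self, dst: int, a: int, b: int) -> None:
        # One lane-wise add serves scalar and vector operands alike,
        # so no value has to move between separate register files.
        self.regs[dst] = [
            (x + y) & LANE_MASK for x, y in zip(self.regs[a], self.regs[b])
        ]

    def read_scalar(self, reg: int) -> int:
        return self.regs[reg][0]

    def read_vector(self, reg: int) -> list[int]:
        return list(self.regs[reg])


def file_size_bytes() -> int:
    """Total capacity of the register file in bytes."""
    return NUM_REGISTERS * REGISTER_BITS // 8


if __name__ == "__main__":
    rf = UnifiedRegisterFile()
    rf.write_scalar(1, 5)
    rf.write_vector(2, [1, 2, 3, 4])
    rf.add(3, 1, 2)
    print(rf.read_scalar(3), rf.read_vector(3), file_size_bytes())
```

In this model a scalar result can be read from the same register that a vector operation just wrote, which is the property that removes copy costs from the compiler's cost model.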

The vector-scalar approach was also adopted by the IBM Power VSX (Vector Scalar Extension) SIMD instructions,^[32] BlueGene/Q vector instructions^[33]^[34] and System/z mainframe vector instruction set,^[35]^[36] the design of all three IBM vector-scalar architectures having been led by Gschwind as Chief Architect for IBM System Architecture.

Background

Gschwind was born in Vienna and obtained his doctorate in Computer Engineering at the Technische Universität Wien in 1996. He joined the IBM Thomas J. Watson Research Center in Yorktown Heights, NY, and also held positions in the IBM Systems product group and at IBM's corporate headquarters in Armonk, NY. At Huawei, Gschwind served as Vice President of Artificial Intelligence and Accelerated Systems. Gschwind is currently a director at Meta Platforms, where he has been responsible for AI acceleration and AI infrastructure.^[citation needed]

References

[1] "Michael Karl Gschwind". www.ppubs.uspto.gov.

[2] David Becker (December 3, 2004). "PlayStation 3 chip goes easy on developers". CNET. Retrieved January 13, 2019.

[3] Scarpino, M. (2008). Programming the cell processor: for games, graphics, and computation. Pearson Education.

[4] First-Generation Inference Accelerator Deployment at Facebook, https://arxiv.org/pdf/2107.04140.pdf

[5] "PyTorch". www.pytorch.org. Retrieved 2023-10-28.

[6] "A BetterTransformer for Fast Transformer Inference". pytorch.org. Retrieved 2023-10-28.

[7] "MultiRay: Optimizing efficiency for large-scale AI models". ai.meta.com. Retrieved 2023-10-28.

[8] MultiRay: An Accelerated Embedding Service for Content Understanding, https://static.sched.com/hosted_files/pytorch2023/60/PyTorch_Conf_2023-Multiray.pdf

[9] Sustainable AI: Environmental Implications, Challenges and Opportunities, https://arxiv.org/pdf/2111.00364.pdf

[10] Gaudin, Sharon (2008-06-09). "IBM's Roadrunner smashes 4-minute mile of supercomputing". Computerworld. Archived from the original on 2008-12-24. Retrieved 2008-06-10.

[11] Fildes, Jonathan (2008-06-09). "Supercomputer sets petaflop pace". BBC News. Retrieved 2008-06-09.

[12] NNSA awards IBM contract to build next generation supercomputer, February 3, 2009

[13] Lohr, Steve (8 June 2018). "Move Over, China: U.S. Is Again Home to World's Speediest Supercomputer". The New York Times. Retrieved 19 July 2018.

[14] "Top 500 List - November 2022". TOP500. November 2022. Retrieved 13 April 2022.

[15] "November 2022 | TOP500 Supercomputer Sites". TOP500. Retrieved 13 April 2022.

[16] Optimizing pipelines for power and performance, MICRO 2002. https://www.researchgate.net/publication/4001353_Optimizing_pipelines_for_power_and_performance

[17] Efficient Instruction Scheduling with Precise Exceptions, https://www.researchgate.net/publication/244186152_Efficient_instruction_scheduling_with_precise_exceptions

[18] Optimizations and oracle parallelism with dynamic translation, https://www.researchgate.net/publication/3830428_Optimizations_and_oracle_parallelism_with_dynamic_translation

[19] Dynamic and Transparent Binary Translation, https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=ee7ad16a1f0c1988e93209d4b56d7ff4e8b68566

[20] Dynamic binary translation and optimization, https://www.researchgate.net/publication/3044344_Dynamic_binary_translation_and_optimization

[21] Advances and future challenges in binary translation and optimization, https://ieeexplore.ieee.org/document/964447

[22] Binary translation and architecture convergence issues for IBM System/390, https://www.researchgate.net/profile/Michael-Gschwind/publication/221235791_Binary_translation_and_architecture_convergence_issues_for_IBM_system390/links/0046352f27d9de5653000000/Binary-translation-and-architecture-convergence-issues-for-IBM-system-390.pdf

[23] Optimizing Compiler for the CELL Processor, Conference on Parallel Architectures and Compilation Techniques (PACT 2005), September 2005. https://dl.acm.org/doi/10.1109/PACT.2005.33

[24] An Open Source Environment for Cell Broadband Engine System Software, https://www.researchgate.net/publication/2961855_An_Open_Source_Environment_for_Cell_Broadband_Engine_System_Software

[25] OpenPOWER: Reengineering a server ecosystem for large-scale data centers, https://old.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-12-day2-epub/HC26.12-7-Dense-Servers-epub/HC26.12.730-%20OpenPower-Gschwind-IBM.pdf

[26] Power Architecture 64-Bit ELF V2 ABI Specification, https://ftp.rtems.org/pub/rtems/people/sebh/ABI64BitOpenPOWERv1.1_16July2015_pub.pdf

[27] Reengineering a server ecosystem for enhanced portability and performance, https://www.researchgate.net/publication/322706081_Reengineering_a_server_ecosystem_for_enhanced_portability_and_performance

[28] First-Generation Inference Accelerator Deployment at Facebook, https://research.facebook.com/publications/first-generation-inference-accelerator-deployment-at-facebook

[29] PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation, https://pytorch.org/assets/pytorch2-2.pdf

[30] Synergistic Processing in Cell's Multicore Architecture, IEEE MICRO, https://ieeexplore.ieee.org/document/1624323

[31] Optimizing Compiler for the CELL Processor, Conference on Parallel Architectures and Compilation Techniques (PACT 2005), September 2005. https://dl.acm.org/doi/10.1109/PACT.2005.33

[32] Workload acceleration with the IBM POWER vector-scalar architecture, IBM JRD, https://ieeexplore.ieee.org/abstract/document/7442604

[33] The IBM Blue Gene/Q Compute Chip, https://ieeexplore.ieee.org/document/6109225

[34] Morgan, Timothy Prickett (22 November 2010). "IBM uncloaks 20 petaflops BlueGene/Q super". The Register.

[35] The SIMD accelerator for business analytics on the IBM z13, https://dl.acm.org/doi/10.1147/JRD.2015.2426576

[36] SIMD Processing on IBM z14, z13 and z13s, https://www.ibm.com/downloads/cas/WVPALM0N