Michael Gschwind

Born	Vienna, Austria
Nationality	USA
Alma mater	Technische Universität Wien

Michael Karl Gschwind is an American computer scientist at Meta Platforms in Menlo Park, California. He is recognized for his seminal contributions to the design and exploitation of general-purpose programmable accelerators, as an early advocate of sustainability in computer design, and as a prolific inventor.^[1]

Accelerators

Gschwind led hardware and software architecture for the first general-purpose programmable accelerator and is widely recognized for his contributions to heterogeneous computing as architect of the Cell Broadband Engine processor used in the Sony PlayStation 3^[2]^[3] and Roadrunner, the first supercomputer to reach sustained petaflop operation. As Chief Architect for IBM System Architecture, he led the integration of Nvidia GPUs and IBM CPUs to create the Summit and Sierra supercomputers.

Gschwind was an early advocate for accelerator virtualization^[4]^[5] and as IBM System Chief Architect led I/O and accelerator virtualization.^[6]

Gschwind has had a critical influence on the development of accelerator programming models through the development of APIs and best practices for accelerator programming,^[7]^[8]^[9]^[10]^[11] application studies for a diverse range of HPC^[12] and non-HPC applications,^[13] and his work as co-editor of books^[14] and journals^[15] on the practice and experience of programming accelerator-based systems.

AI Acceleration

Gschwind was an early advocate of AI hardware acceleration with GPUs and programmable accelerators. As IBM's Chief Engineer for AI, he led the development of IBM's first AI products and initiated the PowerAI project, which brought to market AI-optimized hardware (codenamed "Minsky") and the first prebuilt hardware-optimized AI frameworks. These frameworks were delivered as the first freely installable, binary package-managed AI software stacks, paving the path for adoption.^[16]

At Facebook, Gschwind demonstrated accelerated Large Language Models (LLMs) for Facebook's first-generation ASIC accelerators and for GPUs, leading the first LLM production deployments at scale for embedding serving for content analysis and platform safety, and for numerous user surfaces such as Facebook Assistant and FB Marketplace, starting in 2020.^[17] Gschwind led the development of, and is one of the architects of, MultiRay, an accelerator-based platform for serving foundation models and the first production system to serve Large Language Models at scale in the industry, serving over 800 billion queries per day in 2022.^[18]^[19]

Gschwind led the company-wide adoption of ASIC accelerators^[20] and Facebook's subsequent "strategic pivot" to GPU inference, deploying GPU inference at scale, a move highlighted by Facebook CEO Mark Zuckerberg in an earnings call. Among the first recommendation models deployed with GPU inference was a Reels video recommendation model, which delivered a 30% user surge within two weeks of deployment, as reported by Zuckerberg in the Q1 2022 earnings call,^[21] and subsequent year-over-year growth from $3B to $10B for Reels.^[22]

Gschwind also led AI Accelerator Enablement for PyTorch with a particular focus on LLM acceleration, leading the development of Accelerated Transformers^[23] (formerly "Better Transformer"^[24]) and partnering with companies such as Hugging Face to drive industry-wide LLM acceleration^[25] and to establish PyTorch 2.0 as the standard ecosystem for Large Language Models and Generative AI.^[26]^[27]^[28]^[29]

Gschwind subsequently led the expansion of LLM acceleration to on-device AI models with ExecuTorch, the PyTorch ecosystem solution for on-device AI, making on-device generative AI feasible for the first time.^[30] ExecuTorch LLM acceleration (across multiple surfaces including NPUs, MPS, and Qualcomm accelerators) delivered significant speedups, making it practical to deploy Llama 3 unmodified on servers and on-device (demonstrated on iOS, Android, and Raspberry Pi 5) at launch, with developers reporting up to 5x-10x speedups over prior on-device AI solutions.^[31]^[32]

Gschwind's contributions to AI software stacks and frameworks, AI accelerators, mobile/embedded on-device AI and low-precision numeric representations came together in torchchat,^[33]^[34] a seminal milestone as the industry's first integrated software stack for servers and on-device AI with support for a broad set of server and embedded/mobile accelerators.

Gschwind is a pioneer and advocate of Sustainable AI.^[35]

Supercomputer Design

Gschwind was a chief architect for hardware design and software architecture for several supercomputers, including three top-ranked supercomputer systems: Roadrunner (June 2008 – November 2009), Sequoia (June 2012 – November 2012), and Summit (June 2018 – June 2020).

Roadrunner was a supercomputer built by IBM for the Los Alamos National Laboratory in New Mexico, USA. The US$100-million Roadrunner was designed for a peak performance of 1.7 petaflops. It achieved 1.026 petaflops on May 25, 2008, to become the world's first TOP500 LINPACK sustained 1.0 petaflops system.^[36]^[37] It was also the fourth-most energy-efficient supercomputer in the world on the Supermicro Green500 list, with an operational rate of 444.94 megaflops per watt of power used.

Sequoia was a petascale Blue Gene/Q supercomputer constructed by IBM for the National Nuclear Security Administration as part of the Advanced Simulation and Computing Program (ASC). It was delivered to the Lawrence Livermore National Laboratory (LLNL) in 2011 and was fully deployed in June 2012.^[38] Sequoia was dismantled in 2020; its last position on the TOP500 list was #22 in November 2019.

Summit is a supercomputer developed by IBM for use at the Oak Ridge Leadership Computing Facility (OLCF), a facility at the Oak Ridge National Laboratory. It held the number 1 position on the TOP500 list from June 2018 to June 2020.^[39]^[40] Its LINPACK benchmark performance is 148.6 petaFLOPS.^[41]

Many-Core Processor Design

Gschwind was an early advocate of many-core processor design to overcome the power and performance limitations of single-processor designs. Gschwind co-authored an analysis of the limitations of frequency scaling which arguably led to an industry-wide transition to many-core designs.^[42] Gschwind was a lead architect for several many-core designs, including the first commercial many-core processor, Cell, with 9 cores; BlueGene/Q with 18 cores; and several enterprise and mainframe processors (POWER7/POWER8/POWER9 with up to 24 cores; z10–z15 with up to 12 cores).

System Reliability

Gschwind coined the term "reliability wall" for obstacles to sustained operation of large-scale systems. He has made major contributions to system-level reliability modeling and improvements, with a particular view to enabling sustained supercomputing system operation. As chief architect of BlueGene/Q, he led system-level reliability and processor design in addition to being the chief ISA architect and QPU vector floating point unit design lead.^[43]^[44]

Gschwind led the first processor and chip-level architectural vulnerability modeling and selective hardening to achieve a target MTBF, first implemented in BlueGene/Q using stacked DICE latches for critical state-holding latches.^[45] To increase system reliability while avoiding the performance and power cost associated with ECC-based designs, Gschwind proposed and led the design of register files and minor buses protected with parity and state recovery. In this approach, error detection is performed in the datapath in parallel with the initiation of compute operations, and a recovery operation is triggered when a soft error is detected. Recovery then proceeds from good state maintained in alternate copies of the register file, which are commonly used to scale the number of register file read ports and to reduce wiring delay from register file reads to execution units.^[46]
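
The parity-with-recovery idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design: the class `ParityRegisterFile`, its two-copy layout, the single even-parity bit per entry and the `inject_fault` helper are all hypothetical names chosen for illustration, and the check runs sequentially here whereas the text describes detection in parallel with the start of the operation.

```python
def parity(value: int) -> int:
    """Even-parity bit: 1 if the number of set bits is odd, else 0."""
    return bin(value).count("1") & 1


class ParityRegisterFile:
    """Hypothetical model: two register file copies, one parity bit per entry."""

    def __init__(self, size: int = 8):
        # Each copy holds (value, parity) pairs; writes update both copies.
        self.copies = [[(0, 0)] * size for _ in range(2)]
        self.recoveries = 0

    def write(self, reg: int, value: int) -> None:
        entry = (value, parity(value))
        for copy in self.copies:
            copy[reg] = entry

    def inject_fault(self, reg: int, bit: int, port: int = 0) -> None:
        # Flip one stored data bit without updating parity (models a soft error).
        value, stored = self.copies[port][reg]
        self.copies[port][reg] = (value ^ (1 << bit), stored)

    def read(self, reg: int, port: int = 0) -> int:
        value, stored = self.copies[port][reg]
        if parity(value) == stored:
            return value
        # Parity mismatch: restore the entry from the alternate copy.
        good_value, good_parity = self.copies[1 - port][reg]
        if parity(good_value) != good_parity:
            raise RuntimeError("both copies corrupted; cannot recover")
        self.copies[port][reg] = (good_value, good_parity)
        self.recoveries += 1
        return good_value


rf = ParityRegisterFile()
rf.write(3, 0b1011)
rf.inject_fault(3, bit=2)          # corrupt copy 0 only
print(rf.read(3), rf.recoveries)   # value is restored from copy 1
```

A single parity bit detects only an odd number of flipped bits and cannot correct anything by itself; in this sketch the correction comes entirely from the redundant copy, which is the trade-off against ECC that the paragraph describes.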

Compiler Technologies

Gschwind has made seminal contributions to compiler technology, with a particular emphasis on pioneering contributions to just-in-time compilation, dynamic optimization, binary translation and compilers in supercomputing.

Just-in-time Compilation

Gschwind was an early proponent of just-in-time compilation and has been a driving force in the field. He has proposed critical improvements for the implementation of JIT-compilation-based systems, with a particular view to dynamic optimization, binary translation and virtual machine implementation. Gschwind's contributions include the implementation of precise exceptions with deferred state materialization,^[47] high-performance computing optimizations such as software pipelining at JIT translation time,^[48]^[49] and hardware/software co-design for binary emulation and dynamic optimization.^[50]^[51]^[52]^[53] Gschwind's seminal contributions to virtual machine design and implementation are reflected in his being the most-cited author in the textbook "Virtual Machines" by Smith and Nair.^[54]

Compilation for Accelerators and Accelerator-based Supercomputers

Gschwind is credited with seminal contributions to compilation for general-purpose programmable accelerators and GPUs, supporting the launch of the nascent discipline as keynote speaker at the first General-Purpose Programmable GPU workshop (GPGPU). His contributions include code partitioning, code optimization and APIs for accelerators.^[55]^[56]^[57]^[58]

His innovations include compiler/hardware co-design for integrated register files, which resolves phase-ordering issues in auto-vectorization between unit assignment and vectorization decisions and simplifies the cost model. The innovation was adopted by general-purpose programmable accelerators, including the Cell SPU and GPUs, and by general-purpose CPU designs, starting with Gschwind's pioneering work for SIMD CPU accelerators.

More recently, his contributions to HPC compilation have included pioneering work in enabling high-performance execution of AI workloads.^[59]^[60]^[61]

System and Compiler APIs

Gschwind led the development of the ELFv2 Power execution environment, which has been broadly adopted for Power execution environments. The new environment updates the APIs and ABIs for object-oriented environments. Departing from traditional Power architecture big-endian data conventions, the ELFv2 ABI and APIs were first launched to support a new little-endian version of Linux on Power. The environment has since been adopted for all Linux versions on Power servers and to support GPU acceleration with Nvidia GPUs, e.g., in the Minsky AI-optimized servers and the Summit and Sierra supercomputers.^[62]^[63]^[64]

SIMD Parallel Vector Architecture

Gschwind is a pioneer of SIMD parallel vector architecture to increase the number of operations which can be performed per cycle. To enable efficient compilation, Gschwind proposed the implementation of merged scalar and vector execution units, eliminating the cost of copies between scalar and vectorized code, and simplifying compiler architecture by resolving phase ordering problems in compilers.

The Cell's accelerator cores (Synergistic Processor Units, SPUs) contain a single register file with 128 entries of 128 bits each. Each register may hold either a scalar or a vector of multiple values.^[65] The simplified cost model leads to significantly improved vectorization success, improving overall program performance and efficiency.^[66]
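
A toy model can make the merged scalar/vector register file more concrete. The following is a minimal sketch, not the SPU instruction set: the class `UnifiedRegisterFile`, the choice of four 32-bit lanes per 128-bit register, and the convention of keeping a scalar in lane 0 are assumptions made for illustration only.

```python
LANES = 4             # assumed: four 32-bit lanes in each 128-bit register
NUM_REGS = 128        # register count taken from the text
MASK32 = 0xFFFFFFFF


class UnifiedRegisterFile:
    """Hypothetical model: one register file shared by scalar and vector code."""

    def __init__(self):
        self.regs = [[0] * LANES for _ in range(NUM_REGS)]

    def load_vector(self, reg, values):
        if len(values) != LANES:
            raise ValueError("expected one value per lane")
        self.regs[reg] = [v & MASK32 for v in values]

    def load_scalar(self, reg, value):
        # Illustrative convention: a scalar occupies lane 0, other lanes are zero.
        self.regs[reg] = [value & MASK32] + [0] * (LANES - 1)

    def add(self, dst, a, b):
        # A single lane-wise add serves both scalar and vector operands.
        self.regs[dst] = [(x + y) & MASK32
                          for x, y in zip(self.regs[a], self.regs[b])]

    def scalar(self, reg):
        # Reading a scalar needs no copy between separate register files.
        return self.regs[reg][0]


rf = UnifiedRegisterFile()
rf.load_vector(1, [1, 2, 3, 4])
rf.load_vector(2, [10, 20, 30, 40])
rf.add(3, 1, 2)                    # vector add
rf.load_scalar(4, 5)
rf.load_scalar(5, 7)
rf.add(6, 4, 5)                    # scalar add through the same operation
print(rf.regs[3], rf.scalar(6))
```

Because scalars and vectors live in the same registers and go through the same operation in this sketch, there is no transfer cost for a compiler to weigh when it decides whether to vectorize, which is the cost-model simplification the paragraph refers to.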

The vector-scalar approach was also adopted by the IBM Power VSX (Vector Scalar Extension) SIMD instructions,^[67] BlueGene/Q vector instructions^[68]^[69] and System/z mainframe vector instruction set,^[70]^[71] the design of all three IBM vector-scalar architectures having been led by Gschwind as Chief Architect for IBM System Architecture.

Service, Education, Diversity, Inclusion and Digital Inclusion

Gschwind is a strong believer in the power of education to help overcome the effects of all types of discrimination and colonialism. He has served as a faculty member at Princeton and TU Wien to advance education. To overcome the effects of colonialism and bridge the digital divide, Gschwind has volunteered in Senegal to contribute to the expansion and improvement of Senegal's education and research network, snRER.

Background

Gschwind was born in Vienna and obtained his doctorate degree in Computer Engineering at the Technische Universität Wien in 1996. He joined the IBM Thomas J. Watson Research Center in Yorktown Heights, NY, and also held positions in the IBM Systems product group and at its corporate headquarters in Armonk, NY. At Huawei, Gschwind served as Vice President of Artificial Intelligence and Accelerated Systems. Gschwind is currently a software engineer at Meta Platforms, where he has been responsible for AI Acceleration and AI infrastructure.^[citation needed]

References

[1] "Michael Karl Gschwind". www.ppubs.uspto.gov.

[2] David Becker (December 3, 2004). "PlayStation 3 chip goes easy on developers". CNET. Retrieved January 13, 2019.

[3] Scarpino, M. (2008). Programming the cell processor: for games, graphics, and computation. Pearson Education.

[4] https://on-demand.gputechconf.com/gtc/2017/presentation/S7320-tim-kaldewey-optimizing-efficiency-of-deep-learning-workloads-through-gpu-virtualization.pdf

[5] Optimizing the efficiency of deep learning through accelerator virtualization, https://ieeexplore.ieee.org/document/8030299

[6] I/O Virtualization and System Acceleration in POWER8, https://old.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.24-Monday-Epub/HC27.24.30-HP-Cloud-Comm-Epub/HC27.24.340-IO-Virtualization-POWER8-Gschwind-IBM.pdf

[7] Gschwind, Michael (2007-06-01). "The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor". International Journal of Parallel Programming. 35 (3): 233–262. doi:10.1007/s10766-007-0035-4. ISSN 1573-7640.

[8] "Integrated execution: A programming model for accelerators". Retrieved 2024-09-04.

[9] Chip Multiprocessing and the Cell Broadband Engine, https://computingfrontiers.org/2006/cf06-gschwind.pdf

[10] CBE Programming Handbook

[11] CBE Programming Tutorial, https://public.dhe.ibm.com/software/dw/cell/CBE_Programming_Tutorial_v3.1.pdf

[12] Shi, Guochun; Kindratenko, Volodymyr; Pratas, Frederico; Trancoso, Pedro; Gschwind, Michael. "Application acceleration with the cell broadband engine". Computing in Science and Engineering. 12 (1): 76–81. doi:10.1109/MCSE.2010.4. ISSN 1521-9615.

[13] Cell GC: using the cell synergistic processor as a garbage collection coprocessor, ACM Virtual Execution Environments, https://dominoweb.draco.res.ibm.com/reports/rc24520.pdf

[14] M. Gschwind, F. Gustavson, J. Prins (eds), High Performance Computing with the Cell Broadband Engine Scientific Programming 2009, https://www.semanticscholar.org/paper/High-Performance-Computing-with-the-Cell-Broadband-Gschwind-Gustavson/c6775765100eb3b9eb7b7bc003a8eba1ca90667f

[15] M. Gschwind, M. Perrone (Eds), Topical Issue On Hybrid Systems IBM Journal of Research and Development 53(5):1-2 September 2009, DOI:10.1147/JRD.2009.5429079

[16] "PowerAI: A Co-Optimized Software Stack for AI on Power". Retrieved 2024-09-04.

[17] "From Ingestion to Deployment for Large Language Models | GTC Digital September 2022 | NVIDIA On-Demand". NVIDIA. Retrieved 2024-09-04.

[18] "MultiRay: Optimizing efficiency for large-scale AI models". ai.meta.com. Retrieved 2023-10-28.

[19] MultiRay: An Accelerated Embedding Service for Content Understanding, https://static.sched.com/hosted_files/pytorch2023/60/PyTorch_Conf_2023-Multiray.pdf

[20] First-Generation Inference Accelerator Deployment at Facebook, https://arxiv.org/pdf/2107.04140.pdf

[21] "Mark Zuckerberg says AI boosts monetization by 30% on Instagram, 40% on Facebook". Yahoo Finance. 2023-04-27. Retrieved 2024-09-04.

[22] Gairola, Ananya. "From $3B to $10B: Meta's AI-Driven Reels Skyrocketed Revenue Growth Beyond Expectations - Meta Platforms (NASDAQ:META)". Benzinga. Retrieved 2024-09-04.

[23] "PyTorch". www.pytorch.org. Retrieved 2023-10-28.

[24] "A BetterTransformer for Fast Transformer Inference". pytorch.org. Retrieved 2023-10-28.

[25] Belkada, Younes (2022-11-21). "BetterTransformer, Out of the Box Performance for Hugging Face Transformers". PyTorch. Retrieved 2024-09-04.

[26] "PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever". PyTorch. Retrieved 2024-09-04.

[27] "Accelerated Generative Diffusion Models with PyTorch 2". PyTorch. Retrieved 2024-09-04.

[28] "Accelerating Large Language Models with Accelerated Transformers". PyTorch. Retrieved 2024-09-04.

[29] PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation, https://pytorch.org/assets/pytorch2-2.pdf

[30] "ExecuTorch Alpha: Taking LLMs and AI to the Edge with Our Community and Partners". PyTorch. Retrieved 2024-09-04.

[31] "Layla v4.6.0 has been published!". Layla. 2024-04-26. Retrieved 2024-09-04.

[32] "⚡️Blazing fast LLama2-7B-Chat on 8GB RAM Android device via Executorch". r/LocalLLaMA. 2024-05-15. Retrieved 2024-09-04.

[33] "Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile". PyTorch. Retrieved 2024-09-04.

[34] pytorch/torchchat, pytorch, 2024-09-04, retrieved 2024-09-04

[35] Sustainable AI: Environmental Implications, Challenges and Opportunities, https://arxiv.org/pdf/2111.00364.pdf

[36] Gaudin, Sharon (2008-06-09). "IBM's Roadrunner smashes 4-minute mile of supercomputing". Computerworld. Archived from the original on 2008-12-24. Retrieved 2008-06-10.

[37] Fildes, Jonathan (2008-06-09). "Supercomputer sets petaflop pace". BBC News. Retrieved 2008-06-09.

[38] NNSA awards IBM contract to build next generation supercomputer, February 3, 2009

[39] Lohr, Steve (8 June 2018). "Move Over, China: U.S. Is Again Home to World's Speediest Supercomputer". The New York Times. Retrieved 19 July 2018.

[40] "Top 500 List - November 2022". TOP500. November 2022. Retrieved 13 April 2022.

[41] "November 2022 | TOP500 Supercomputer Sites". TOP500. Retrieved 13 April 2022.

[42] "Optimizing pipelines for power and performance". Retrieved 2024-09-04.

[43] "Michael Gschwind - ICS 2012 BlueGeneQ keynote presentation". Retrieved 2024-09-04.

[44] US9081501B2, Asaad, Sameh; Bellofatto, Ralph E. & Blocksome, Michael A. et al., "Multi-petascale highly efficient parallel supercomputer", issued 2015-07-14

[45] "SoftBeam: Precise tracking of transient faults and vulnerability analysis at processor design time". Retrieved 2024-09-04.

[46] US7512772B2, Gschwind, Michael Karl & Philhower, Robert, "Soft error handling in microprocessors", issued 2009-03-31

[47] "Efficient instruction scheduling with precise exceptions". Retrieved 2024-09-04.

[48] "Optimizations and oracle parallelism with dynamic translation". Retrieved 2024-09-04.

[49] "Dynamic and Transparent Binary Translation". Retrieved 2024-09-04.

[50] "Dynamic binary translation and optimization". Retrieved 2024-09-04.

[51] "Advances and future challenges in binary translation and optimization". Retrieved 2024-09-04.

[52] Binary translation and architecture convergence issues for IBM System/390, https://www.researchgate.net/profile/Michael-Gschwind/publication/221235791_Binary_translation_and_architecture_convergence_issues_for_IBM_system390/links/0046352f27d9de5653000000/Binary-translation-and-architecture-convergence-issues-for-IBM-system-390.pdf

[53] Advances and future challenges in binary translation and optimization, Proceedings of the IEEE, https://ieeexplore.ieee.org/document/964447

[54] Smith, Nair, Virtual Machines: Versatile Platforms for Systems and Processes, https://www.amazon.com/Virtual-Machines-Versatile-Platforms-Architecture/dp/1558609105

[55] Eichenberger, Alexandre E.; O'Brien, Kathryn; O'Brien, Kevin; Wu, Peng; Chen, Tong; Oden, Peter H.; Prener, Daniel A.; Shepherd, Janice C.; So, Byoungro; Sura, Zehra; Wang, Amy; Zhang, Tao; Zhao, Peng; Gschwind, Michael (2005-09-17). "Optimizing Compiler for the CELL Processor". Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. PACT '05. USA: IEEE Computer Society: 161–172. doi:10.1109/PACT.2005.33. ISBN 978-0-7695-2429-0.

[56] "An Open Source Environment for Cell Broadband Engine System Software". Retrieved 2024-09-04.

[57] Chip Multiprocessing and the Cell Broadband Engine, https://www.computingfrontiers.org/2006/cf06-gschwind.pdf

[58] Gschwind, Michael (2007-06-01). "The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor". International Journal of Parallel Programming. 35 (3): 233–262. doi:10.1007/s10766-007-0035-4. ISSN 1573-7640.

[59] "First-Generation Inference Accelerator Deployment at Facebook". research.facebook.com. Retrieved 2024-09-04.

[60] PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation, https://pytorch.org/assets/pytorch2-2.pdf

[61] "ExecuTorch Alpha: Taking LLMs and AI to the Edge with Our Community and Partners". PyTorch. Retrieved 2024-09-04.

[62] OpenPOWER Reengineering a server ecosystem for large-scale data centers, https://old.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-12-day2-epub/HC26.12-7-Dense-Servers-epub/HC26.12.730-%20OpenPower-Gschwind-IBM.pdf

[63] Power Architecture 64-Bit ELF V2 ABI Specification, https://ftp.rtems.org/pub/rtems/people/sebh/ABI64BitOpenPOWERv1.1_16July2015_pub.pdf

[64] "Reengineering a server ecosystem for enhanced portability and performance". Retrieved 2024-09-04.

[65] "Synergistic Processing in Cell's Multicore Architecture". Retrieved 2024-09-04.

[66] Eichenberger, Alexandre E.; O'Brien, Kathryn; O'Brien, Kevin; Wu, Peng; Chen, Tong; Oden, Peter H.; Prener, Daniel A.; Shepherd, Janice C.; So, Byoungro; Sura, Zehra; Wang, Amy; Zhang, Tao; Zhao, Peng; Gschwind, Michael (2005-09-17). "Optimizing Compiler for the CELL Processor". Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. PACT '05. USA: IEEE Computer Society: 161–172. doi:10.1109/PACT.2005.33. ISBN 978-0-7695-2429-0.

[67] "Workload acceleration with the IBM POWER vector-scalar architecture". Retrieved 2024-09-04.

[68] "The IBM Blue Gene/Q Compute Chip". Retrieved 2024-09-04.

[69] Morgan, Timothy Prickett (22 November 2010). "IBM uncloaks 20 petaflops BlueGene/Q super". The Register.

[70] Schwarz, E. M.; Krishnamurthy, R. B.; Parris, C. J.; Bradbury, J. D.; Nnebe, I. M.; Gschwind, M. (2015-07-01). "The SIMD accelerator for business analytics on the IBM z13". IBM J. Res. Dev. 59 (4–5): 2:1–2:16. doi:10.1147/JRD.2015.2426576. ISSN 0018-8646.

[71] SIMD Processing on IBM z14, z13 and z13s, https://www.ibm.com/downloads/cas/WVPALM0N
