llama.cpp
Original author(s) | Georgi Gerganov
---|---
Developer(s) | Georgi Gerganov and community
Initial release | March 10, 2023[1]
Repository | GitHub
Written in | C++
License | MIT License[2]
llama.cpp is an open-source software library written in C++ that performs inference on various large language models such as Llama.[3] It is co-developed alongside the ggml library, a general-purpose tensor library.[4]
History
llama.cpp began development in March 2023 by Georgi Gerganov as an implementation of the Llama inference code in pure C++ with no Python dependencies. This improved performance on computers without a GPU or other dedicated hardware.[3][5] As of July 2024, it has 61 thousand stars on GitHub.[6] Before llama.cpp, Gerganov worked on a similar library called whisper.cpp,[7] which implemented Whisper, a speech-to-text model by OpenAI.
Architecture
llama.cpp initially could run only on CPUs, but it can now run on GPUs through multiple back-ends, including Vulkan and SYCL. These back-ends make up the GGML tensor library, which is used by the front-end, model-specific llama.cpp code.[8] llama.cpp supports ahead-of-time model quantization, as opposed to on-the-fly quantization.[9]
Supported models
GGUF file format
Filename extension | .gguf
---|---
Magic number | 0x47 0x47 0x55 0x46
Developed by | Georgi Gerganov and community
Initial release | August 22, 2023[10]
Type of format | Machine-learning
The GGUF file format is a binary format used by llama.cpp that stores both tensors and metadata.[11] GGUF files are typically created by converting models developed in another file format by a different machine learning library, such as PyTorch. GGUF is intended to make model files easy and fast to load within llama.cpp and other ggml projects.[12]
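The fixed fields at the start of a GGUF file can be illustrated with a short sketch. This assumes the little-endian header layout documented in the ggml repository (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key-value count); the helper name is ours, not part of any llama.cpp API:

```python
import struct

GGUF_MAGIC = b"GGUF"  # the magic number 0x47 0x47 0x55 0x46 in ASCII

def read_gguf_header(data: bytes):
    # Unpack the fixed-size header: magic, version, tensor count,
    # and metadata key-value count, all little-endian.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# A minimal header for a hypothetical file: version 3, 2 tensors, 3 metadata keys.
header = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 2, 3)
print(read_gguf_header(header))  # → (3, 2, 3)
```

After these 24 fixed bytes, a real file continues with the metadata key-value pairs and tensor information, whose layout depends on the types stored.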
GGUF was created to replace earlier file formats used by the project, which did not include architecture metadata and therefore made it difficult to extend the software without breaking backwards compatibility.[12]
The format focuses on supporting different quantization types, which can reduce memory usage and increase speed at the expense of lower model precision.[13]
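The memory/precision trade-off behind quantization can be sketched with a toy block-wise scheme (a simplified illustration in the spirit of, but not identical to, llama.cpp's actual quantization formats): each block of 32 float32 weights is replaced by one float32 scale plus 32 signed 8-bit integers.

```python
def quantize_block(block):
    # One shared scale per block, chosen so the largest-magnitude weight
    # maps to +/-127; "or 1.0" avoids division by zero for all-zero blocks.
    scale = max(abs(w) for w in block) / 127.0 or 1.0
    return scale, [round(w / scale) for w in block]

def dequantize_block(scale, quants):
    return [scale * q for q in quants]

weights = [0.5, -1.0, 0.25, 0.75] * 8   # a block of 32 float32 weights
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)

# Storage drops from 32 * 4 = 128 bytes to 4 + 32 = 36 bytes,
# at the cost of a small rounding error per weight.
error = max(abs(w - r) for w, r in zip(weights, restored))
```

llama.cpp's real formats go further, packing 4-bit and smaller codes and sometimes storing extra per-block minimums, but the principle of trading precision for memory is the same.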
Supported data types
GGUF supports the common floating-point data formats float32, float16, and bfloat16, as well as 1.5-bit and 2-bit through 8-bit quantized integer types.
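The relationship between float32 and bfloat16 can be shown concretely: bfloat16 keeps float32's sign bit and 8 exponent bits but only the top 7 mantissa bits, so conversion by truncation is just a 16-bit shift. This is a minimal sketch (real converters typically round to nearest rather than truncate; the helper names are ours):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    # bfloat16 is the top 16 bits of the IEEE-754 float32 encoding.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    # Shift back up; the discarded low 16 mantissa bits become zero.
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# Round-tripping loses precision: 3.1415926 comes back as 3.140625.
approx = bf16_bits_to_f32(f32_to_bf16_bits(3.1415926))
```

Because it shares float32's 8 exponent bits, bfloat16 preserves float32's dynamic range, whereas float16 spends more bits on the mantissa (10) and fewer on the exponent (5).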
References
1. "Initial release · ggerganov/llama.cpp@26c0846". GitHub. Retrieved 15 May 2024.
2. "llama.cpp/LICENSE at master · ggerganov/llama.cpp". GitHub.
3. Connatser, Matthew. "How this open source LLM chatbot runner hit the gas on x86, Arm CPUs". theregister.com. Retrieved 15 April 2024.
4. Gerganov, Georgi (17 May 2024). "ggerganov/ggml".
5. Edwards, Benj (13 March 2023). "You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi". arstechnica.com. Retrieved 15 April 2024.
6. "ggerganov/llama.cpp". GitHub.
7. "ggerganov/whisper.cpp". GitHub.
8. Pounder, Les (25 March 2023). "How To Create Your Own AI Chatbot Server With Raspberry Pi 4". tomshardware.com. Retrieved 16 April 2024.
9. Walkowiak, Bartosz; Walkowiak, Tomasz (2024). "Implementation of language models within an infrastructure designed for Natural Language Processing" (PDF). International Journal of Electronics and Telecommunications. 70 (1): 153–159. doi:10.24425/ijet.2024.149525. Retrieved 8 May 2024.
10. "GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp". GitHub.
11. "GGUF". huggingface.co. Retrieved 9 May 2024.
12. "ggml/docs/gguf.md at master · ggerganov/ggml". GitHub.
13. Labonne, Maxime (29 November 2023). "Quantize Llama models with GGUF and llama.cpp". Medium. Towards Data Science. Retrieved 9 May 2024.