llama.cpp
Original author(s) | Georgi Gerganov
---|---
Developer(s) | Georgi Gerganov and community
Initial release | March 10, 2023[1]
Repository | GitHub
Written in | C++
License | MIT License[2]
llama.cpp is an open-source software library written in C++ that performs inference on various large language models such as Llama.[3] It is co-developed alongside the ggml library, a general-purpose tensor library.[4]
History
llama.cpp began development in March 2023 by Georgi Gerganov as an implementation of the Llama inference code in pure C++ with no Python dependencies. This improved performance on computers without a GPU or other dedicated hardware.[3][5] As of July 2024, it has 61 thousand stars on GitHub.[6] Before llama.cpp, Gerganov worked on a similar library called whisper.cpp,[7] which implemented Whisper, a speech-to-text model by OpenAI.
Architecture
llama.cpp initially could run only on CPUs, but it can now run on GPUs through multiple back-ends, including Vulkan and SYCL. These back-ends make up the GGML tensor library, which is used by the front-end, model-specific llama.cpp code.[8] llama.cpp supports ahead-of-time model quantization, as opposed to on-the-fly quantization.[9]
Supported models
GGUF file format
Filename extension | .gguf
---|---
Magic number | 0x47 0x47 0x55 0x46
Developed by | Georgi Gerganov and community
Initial release | August 22, 2023[10]
Type of format | Machine-learning
The GGUF file format is a binary format used by llama.cpp that stores both tensors and metadata.[11] GGUF files are typically created by converting models developed in another file format by a different machine learning library, such as PyTorch. GGUF is intended to make model files easy and fast to load within llama.cpp and other ggml projects.[12]
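The fixed fields at the start of a GGUF file can be illustrated with a short sketch. This assumes the little-endian header layout documented in the ggml repository (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key-value count); the helper name is ours, not part of any llama.cpp API:

```python
import struct

GGUF_MAGIC = b"GGUF"  # the magic number 0x47 0x47 0x55 0x46 in ASCII

def read_gguf_header(data: bytes):
    # Unpack the fixed-size header: magic, version, tensor count,
    # and metadata key-value count, all little-endian.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# A minimal header for a hypothetical file: version 3, 2 tensors, 3 metadata keys.
header = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 2, 3)
print(read_gguf_header(header))  # → (3, 2, 3)
```

After these 24 fixed bytes, a real file continues with the metadata key-value pairs and tensor information, whose layout depends on the types stored.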
GGUF was created to replace earlier file formats used by the project, which did not include architecture metadata and therefore made it difficult to extend the software without breaking backwards compatibility.[12]
The format focuses on supporting different quantization types, which can reduce memory usage and increase speed at the expense of lower model precision.[13]
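The memory/precision trade-off behind quantization can be sketched with a toy block-wise scheme (a simplified illustration in the spirit of, but not identical to, llama.cpp's actual quantization formats): each block of 32 float32 weights is replaced by one float32 scale plus 32 signed 8-bit integers.

```python
def quantize_block(block):
    # One shared scale per block, chosen so the largest-magnitude weight
    # maps to +/-127; "or 1.0" avoids division by zero for all-zero blocks.
    scale = max(abs(w) for w in block) / 127.0 or 1.0
    return scale, [round(w / scale) for w in block]

def dequantize_block(scale, quants):
    return [scale * q for q in quants]

weights = [0.5, -1.0, 0.25, 0.75] * 8   # a block of 32 float32 weights
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)

# Storage drops from 32 * 4 = 128 bytes to 4 + 32 = 36 bytes,
# at the cost of a small rounding error per weight.
error = max(abs(w - r) for w, r in zip(weights, restored))
```

llama.cpp's real formats go further, packing 4-bit and smaller codes and sometimes storing extra per-block minimums, but the principle of trading precision for memory is the same.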
Supported data types
GGUF supports the common floating-point data formats float32, float16, and bfloat16, as well as 1.5-bit and 2-bit through 8-bit quantized integer types.
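The relationship between float32 and bfloat16 can be shown concretely: bfloat16 keeps float32's sign bit and 8 exponent bits but only the top 7 mantissa bits, so conversion by truncation is just a 16-bit shift. This is a minimal sketch (real converters typically round to nearest rather than truncate; the helper names are ours):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    # bfloat16 is the top 16 bits of the IEEE-754 float32 encoding.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    # Shift back up; the discarded low 16 mantissa bits become zero.
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# Round-tripping loses precision: 3.1415926 comes back as 3.140625.
approx = bf16_bits_to_f32(f32_to_bf16_bits(3.1415926))
```

Because it shares float32's 8 exponent bits, bfloat16 preserves float32's dynamic range, whereas float16 spends more bits on the mantissa (10) and fewer on the exponent (5).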
References
1. "Initial release · ggerganov/llama.cpp@26c0846". GitHub. Retrieved 15 May 2024.
2. "llama.cpp/LICENSE at master · ggerganov/llama.cpp". GitHub.
3. Connatser, Matthew. "How this open source LLM chatbot runner hit the gas on x86, Arm CPUs". theregister.com. Retrieved 15 April 2024.
4. Gerganov, Georgi (17 May 2024). "ggerganov/ggml".
5. Edwards, Benj (13 March 2023). "You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi". arstechnica.com. Retrieved 15 April 2024.
6. "ggerganov/llama.cpp". GitHub.
7. "ggerganov/whisper.cpp". GitHub.
8. Pounder, Les (25 March 2023). "How To Create Your Own AI Chatbot Server With Raspberry Pi 4". tomshardware.com. Retrieved 16 April 2024.
9. Walkowiak, Bartosz; Walkowiak, Tomasz (2024). "Implementation of language models within an infrastructure designed for Natural Language Processing" (PDF). International Journal of Electronics and Telecommunications. 70 (1): 153–159. doi:10.24425/ijet.2024.149525. Retrieved 8 May 2024.
10. "GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp". GitHub.
11. "GGUF". huggingface.co. Retrieved 9 May 2024.
12. "ggml/docs/gguf.md at master · ggerganov/ggml". GitHub.
13. Labonne, Maxime (29 November 2023). "Quantize Llama models with GGUF and llama.cpp". Medium. Towards Data Science. Retrieved 9 May 2024.