
llama.cpp

llama.cpp
Original author(s): Georgi Gerganov
Developer(s): Georgi Gerganov and community
Initial release: March 10, 2023[1]
Repository: github.com/ggerganov/llama.cpp
Written in: C++
License: MIT License[2]

llama.cpp is an open-source software library, written in C++, that performs inference on various large language models such as Llama.[3] It is co-developed alongside the ggml library, a general-purpose tensor library.[4]

History

Georgi Gerganov began developing llama.cpp in March 2023 as an implementation of the Llama inference code in pure C++, with no Python dependencies. This improved performance on computers without a GPU or other dedicated hardware.[3][5] As of July 2024, the project had around 61,000 stars on GitHub.[6] Before llama.cpp, Gerganov had worked on a similar library called whisper.cpp,[7] which implemented Whisper, a speech-to-text model by OpenAI.

Architecture

llama.cpp initially could run only on CPUs, but it can now also run on GPUs through several back-ends, including Vulkan and SYCL. These back-ends make up the ggml tensor library, which is used by the model-specific front-end code in llama.cpp.[8] llama.cpp supports ahead-of-time model quantization, as opposed to on-the-fly quantization.[9]
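
How such quantization works can be sketched in a few lines of C++. The following is a simplified, hypothetical illustration of a block-wise 8-bit scheme, conceptually similar to ggml's Q8_0 type (32 weights per block sharing one scale factor); it is not the library's actual implementation:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr int kBlock = 32;          // weights per block

    struct BlockQ8 {
        float  scale;                   // one scale factor per block
        int8_t q[kBlock];               // quantized weights in [-127, 127]
    };

    // Quantize float32 weights once, ahead of time; the result can be
    // serialized to disk and reloaded at inference time.
    std::vector<BlockQ8> quantize(const std::vector<float> &w) {
        std::vector<BlockQ8> out(w.size() / kBlock);
        for (std::size_t b = 0; b < out.size(); ++b) {
            float amax = 0.0f;          // largest magnitude in the block
            for (int i = 0; i < kBlock; ++i)
                amax = std::max(amax, std::fabs(w[b * kBlock + i]));
            out[b].scale = amax / 127.0f;
            const float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
            for (int i = 0; i < kBlock; ++i)
                out[b].q[i] = (int8_t) std::lround(w[b * kBlock + i] * inv);
        }
        return out;
    }

At inference time each weight is recovered approximately as q[i] * scale. Because the scales are computed once from the stored weights rather than during text generation, this is what "ahead-of-time" quantization refers to.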

Supported models

GGUF file format

GGUF
Filename extension: .gguf
Magic number: 0x47 0x47 0x55 0x46 ("GGUF")
Developed by: Georgi Gerganov and community
Initial release: August 22, 2023[10]
Type of format: Machine learning

The GGUF file format is a binary format used by llama.cpp that stores both tensors and metadata.[11] GGUF files are typically created by converting models developed with a different machine-learning library, such as PyTorch. GGUF is intended to make model files fast to load and easy to use within llama.cpp and other ggml projects.[12]

GGUF was created to replace the project's previous file formats, which did not include architecture metadata and therefore made it difficult to extend the software without breaking backwards compatibility.[12]
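
As a concrete illustration, the fixed-size fields at the start of a GGUF file can be read with ordinary C++ I/O. The field layout below (magic number, format version, tensor count, and metadata key-value count, all little-endian) follows the published GGUF specification;[12] the program itself is only a minimal sketch, not part of llama.cpp:

    #include <cstdint>
    #include <cstdio>

    int main(int argc, char **argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
        std::FILE *f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("fopen"); return 1; }

        uint32_t magic = 0, version = 0;
        uint64_t n_tensors = 0, n_kv = 0;
        bool ok = std::fread(&magic,     sizeof magic,     1, f) == 1
               && std::fread(&version,   sizeof version,   1, f) == 1
               && std::fread(&n_tensors, sizeof n_tensors, 1, f) == 1
               && std::fread(&n_kv,      sizeof n_kv,      1, f) == 1;
        std::fclose(f);

        if (!ok || magic != 0x46554747) {   // bytes 'G','G','U','F', little-endian
            std::fprintf(stderr, "not a GGUF file\n");
            return 1;
        }
        std::printf("GGUF v%u: %llu tensors, %llu metadata key-value pairs\n",
                    (unsigned) version,
                    (unsigned long long) n_tensors,
                    (unsigned long long) n_kv);
        return 0;
    }

The metadata key-value pairs that follow the header are where the architecture information discussed above is stored.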

The format focuses on supporting different quantization types, which can reduce memory usage and increase speed, at the expense of lower model precision.[13]
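For example, a model with seven billion parameters stored as 16-bit floats occupies roughly 14 GB (two bytes per weight), while a 4-bit quantization of the same weights needs roughly 3.5 GB, ignoring the small overhead of the per-block scale factors.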

Supported data types

GGUF supports the common floating-point formats float32, float16, and bfloat16, as well as quantized integer types ranging from 1.5-bit to 8-bit.
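
Of the floating-point formats, bfloat16 is the simplest to relate to float32: it keeps the sign bit and the 8-bit exponent but shortens the mantissa to 7 bits, so a conversion can be sketched by taking the upper 16 bits of the float32 bit pattern (a simplified illustration that truncates; careful implementations round to nearest even):

    #include <cstdint>
    #include <cstring>

    // bfloat16 = top 16 bits of the float32 representation.
    uint16_t f32_to_bf16(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        return (uint16_t)(bits >> 16);
    }

    float bf16_to_f32(uint16_t h) {
        uint32_t bits = (uint32_t) h << 16;
        float x;
        std::memcpy(&x, &bits, sizeof x);
        return x;
    }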

References

  1. ^ "Initial release · ggerganov/llama.cpp@26c0846". GitHub. Retrieved 15 May 2024.
  2. ^ "llama.cpp/LICENSE at master · ggerganov/llama.cpp". GitHub.
  3. ^ a b Connatser, Matthew. "How this open source LLM chatbot runner hit the gas on x86, Arm CPUs". theregister.com. Retrieved 15 April 2024.
  4. ^ Gerganov, Georgi (17 May 2024). "ggerganov/ggml". GitHub.
  5. ^ Edwards, Benj (13 March 2023). "You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi". arstechnica.com. Retrieved 15 April 2024.
  6. ^ "ggerganov/llama.cpp". GitHub.
  7. ^ "ggerganov/whisper.cpp". GitHub.
  8. ^ Pounder, Les (25 March 2023). "How To Create Your Own AI Chatbot Server With Raspberry Pi 4". tomshardware.com. Retrieved 16 April 2024.
  9. ^ Walkowiak, Bartosz; Walkowiak, Tomasz (2024). "Implementation of language models within an infrastructure designed for Natural Language Processing" (PDF). International Journal of Electronics and Telecommunications. 70 (1): 153–159. doi:10.24425/ijet.2024.149525. Retrieved 8 May 2024.
  10. ^ "GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp". GitHub.
  11. ^ "GGUF". huggingface.co. Retrieved 9 May 2024.
  12. ^ a b "ggml/docs/gguf.md at master · ggerganov/ggml". GitHub.
  13. ^ Labonne, Maxime (29 November 2023). "Quantize Llama models with GGUF and llama.cpp". Medium. Towards Data Science. Retrieved 9 May 2024.