Name: llama.cpp
Brand: llama.cpp
Rating: 5 (1 reviews)

Review

The reference C++ inference engine for open-source LLMs — the project Ollama / LM Studio / Jan / KoboldCpp / llamafile all wrap, by the developer who invented the GGUF model format the open-weight ecosystem standardised on. Listed at Grade A because `llama.cpp` is the irreducible local-inference layer: no operator on the data path, MIT-licensed, single-binary install, zero account anywhere, native acceleration on CPU + NVIDIA CUDA + AMD ROCm + Apple Metal + Vulkan. The strongest privacy posture available in this directory, jointly with Ollama (which sits one wrapper layer above), and the right answer when you want maximum control over quantisation, context window, batch size, and offload split — the things Ollama abstracts away.

What it is. `llama.cpp` is a single-repo C++ project (`github.com/ggml-org/llama.cpp`, 70k+ stars) that loads quantised GGUF model weights and runs LLM inference against them. It ships:

A CLI (`./llama-cli`) for one-shot prompts and interactive REPL
An HTTP server (`./llama-server`) exposing an OpenAI-compatible `/v1/chat/completions` endpoint
A C++ library you can link from any host language (Python via `llama-cpp-python`, Rust via `llama-rs`, Go, Node, etc.)
The `convert_hf_to_gguf.py` toolchain for converting any Hugging Face model to GGUF
The `quantize` binary that produces every quantisation level from `Q2_K` to `Q8_0` to full FP16 / BF16

Where Ollama is the "consumer app" wrapper, `llama.cpp` is the engine. Every Ollama feature (model serving, prompt templates, GPU offload, function calling) is `llama.cpp` underneath. If you want to skip the wrapper — own your own model files, control quantisation per-deployment, link the library into a custom server — this is the project.

Background. Started in March 2023 by Georgi Gerganov as a port of Meta's LLaMA model to C++, originally targeting Apple Silicon Macs. The project rapidly became the de-facto reference implementation for running open-weight LLMs locally — by mid-2024 the entire local-LLM ecosystem (Ollama, LM Studio, KoboldCpp, GPT4All, llamafile, OpenWebUI's local backend, vLLM's CPU mode) had standardised on Gerganov's GGUF model format and his quantisation algorithms.

`llama.cpp` is maintained by an open-source collective via the `ggml-org` GitHub organisation (~600 contributors as of mid-2025). Gerganov also runs the ggml.ai company (Sofia, Bulgaria) which provides commercial support and contributes inference-engine improvements upstream, but the codebase remains MIT-licensed and community-governed — no CLA, no copyright assignment, no enterprise fork.

What you trust.

Inference is local. Once you have a GGUF on disk, generation runs entirely on your hardware. The binary makes zero network calls during inference. You can run `llama-server` on an air-gapped box and use it forever.
No account, no signup, no registry dependency. You download the source / a release binary from GitHub, build (or unpack the precompiled), point at a GGUF file. No `llama.cpp account` exists. No upstream lookup happens.
MIT licence, ~70k stars, ~600 contributors. Audit anyone's commit at `github.com/ggml-org/llama.cpp`. Bug fixes ship within hours of a serious issue being reported. Multiple companies (Mozilla, ggml.ai, individuals) sponsor maintainers.
Bring your own model weights. GGUF files come from Hugging Face (`hf.co/<org>/<model>-GGUF`), TheBloke's archive, the operator's own conversion of a `safetensors` checkpoint, or any third party you trust. There's no central registry controlling what you can run.
Reproducible builds. The source compiles cleanly with stock `cmake` + `make` on any POSIX system; the `release` tarballs include source + binaries together. You can verify the binary by rebuilding from the tagged source.
Telemetry is zero. No analytics, no usage pings, no error reports leaving your machine. The project has no operator that could collect data even if they wanted to.

Operational specs.

Install — `git clone https://github.com/ggml-org/llama.cpp && cmake -B build && cmake --build build` (~5 min on a modern machine). Or download a release tarball from GitHub. Precompiled Linux / macOS / Windows binaries shipped per-release.
Hardware — minimum 4 GB RAM for tiny (1-3B) models at Q4; ~8 GB for 7B; 16-32 GB for 13B; 64+ GB for 70B. GPU optional. Acceleration paths: NVIDIA CUDA, AMD ROCm + Vulkan, Apple Metal, Intel SYCL, Vulkan (cross-vendor). CPU-only inference works on x86-64 + ARM (including phones via the Android port).
Model formats — GGUF (native, all quantisation levels from `IQ1_S` 1.5-bit to `F32` 32-bit float). Converters available from Hugging Face `safetensors`, PyTorch `.bin`, original `LLaMA` checkpoints.
CLI flags — `./llama-cli -m model.gguf -p "your prompt" -n 256 --temp 0.7 -ngl 35 -c 4096`. `-ngl` controls GPU layer offload (33-99 for full GPU on 7B/Q4). `-c` sets context length. `--cache-type-k Q8_0 --cache-type-v Q8_0` quantises the KV cache for longer context.
Server mode — `./llama-server -m model.gguf --port 8080 --host 0.0.0.0 -c 8192 -ngl 99`. Exposes `/completion`, `/chat/completions` (OpenAI-compatible), `/embeddings`, `/v1/audio/transcriptions` (whisper.cpp integration), and a built-in chat UI at the root.
Python bindings — `pip install llama-cpp-python` gives you the same engine under a Python API; first-class for LangChain, LlamaIndex, custom RAG pipelines.
Sampling — top-k, top-p, min-p, locally-typical, mirostat, XTC, DRY, temperature shaping. Every modern sampler the field has invented, exposed as CLI flags.
Speculative decoding — draft-model acceleration (`--model-draft`) for 1.5-3× faster generation on capable hardware.
Support — GitHub Issues (active, weeks-deep backlog handled), Discord, the project's discussions tab. No commercial SLA on the free tier; `ggml.ai` sells engagements.

Operator philosophy. Gerganov has been explicit that `llama.cpp` is built for "everyone, everywhere" — the explicit goal is to make LLM inference work on whatever hardware the user has, including phones, microcontrollers, and CPU-only servers. The project's first-class support for `IQ1_S` (1.5-bit) and `Q2_K` (2-bit) quantisation is downstream of this: even a 70B model can squeeze into 16 GB of RAM at the cost of some quality. The GGML library underneath is a separately-maintained tensor primitive that other projects (whisper.cpp, stable-diffusion.cpp, bark.cpp) all share — Gerganov is building the open inference substrate rather than any one application.

Grade rationale. Grade A reflects: strongest privacy posture in the directory (inference local, no operator on data path, no account, no telemetry), MIT licence under permissive open-source norms (forkable, auditable, no CLA), broadest hardware support (every consumer accelerator + CPU on every common platform), the foundational engine the entire local-LLM ecosystem builds on (Ollama, LM Studio, Jan, KoboldCpp, llamafile all depend on it — is inherited by the engine downstream), named-operator accountability without operator dependency (Gerganov + ggml.ai publicly identified, but the runtime keeps working if either disappears), no major incident in `r/LocalLLaMA` / `r/MachineLearning` / GitHub issues in the last 12 months, and active maintenance — multiple releases per month. Last verified 2026-05-26.

Useful when:

You want full control over quantisation level, KV-cache type, context length, and GPU offload split — things Ollama abstracts away.
You're integrating LLM inference into a custom application (Electron, FFI binding, embedded system) and need the C++ library rather than a server.
You're optimising for absolute minimum binary size / minimum runtime footprint (ZeroMQ + `llama.cpp` is a viable inference stack in ~5 MB).
You need a specific sampler (mirostat, XTC, DRY) that Ollama doesn't expose.
You want to run on hardware Ollama doesn't support cleanly (Vulkan-only GPUs, Intel SYCL, weird ARM SoCs, the Hetzner CPU box you already pay for).
You're building / running benchmarks and need deterministic per-run control over every inference parameter.

Caveats:

Setup is C++-developer-grade. First-time install is `git clone + cmake + make` and ~5 minutes; for non-developers, Ollama's curl-install + auto-config is the friendlier path to the same underlying engine.
No model registry. You source GGUF files yourself (Hugging Face is the de-facto repository, but verify the uploader; some quantisation jobs introduce subtle quality regressions). Ollama wraps this with `ollama.com/library`; `llama.cpp` doesn't.
API surface is minimal but functional. The built-in chat UI at `localhost:8080/` is basic — fine for testing, not for production end-users. Pair with a real frontend (Open WebUI, custom React, your own thing).
Breaking changes during fast iteration. The project ships multiple releases per month and sometimes breaks model compatibility (older GGUFs need re-quantisation against new versions). For production, pin a release tag and re-test before bumping.
GPU offload requires correct build flags. `cmake -DGGML_CUDA=ON` for NVIDIA, `-DGGML_HIPBLAS=ON` for AMD, `-DGGML_METAL=ON` (default on macOS) for Apple. Forgetting a flag silently CPU-falls-back, which destroys throughput. The README documents this; check before assuming GPU is working.
No vendor support contract on the free tier. Bugs are filed as GitHub Issues; serious users buy engagement from `ggml.ai`. Pair the project with internal triage capability if you depend on inference for revenue.
`llama.cpp`-the-CLI changes API across releases. `./main` was renamed to `./llama-cli` in mid-2024, `./server` to `./llama-server`, etc. Scripts pinned to old binary names need updating against new releases.

llama.cpp

At a glance

Review

Fees

Links

Audit trail — receipts for the editorial claim

Reviews — moderated · rules

Add a review