Name: Ollama
Brand: Ollama
Rating: 5 (1 reviews)

Review

Local-first LLM runtime that pulls quantised open-source model weights and exposes them through an OpenAI-compatible API on `localhost`. Listed at Grade A because Ollama is the canonical "I want OpenAI-quality output without sending my prompts anywhere" answer for the ~80% of users who don't want to compile `llama.cpp` themselves — MIT-licensed, no account anywhere in the flow, no telemetry on the inference path, and the inference itself never leaves your machine. The strongest privacy posture available in this directory because there is no operator to trust on the data path.

What it is. Ollama is a desktop + server application that wraps the `llama.cpp` inference engine in a clean CLI (`ollama run <model>`, `ollama serve`), a model registry (`ollama.com/library` hosts quantised GGUF weights for ~80 popular open-source models — Llama, Mistral, DeepSeek, Qwen, Phi, Gemma, Mixtral, and many more), and an OpenAI-compatible HTTP API on `localhost:11434/v1`. You install it once (~600 MB binary), `ollama pull llama3` downloads the weights (~4-40 GB depending on model size), and `ollama run llama3` drops you into a chat REPL — or you point any OpenAI-SDK consumer (LangChain, Continue.dev, Cursor, Aider, the `openai` Python SDK) at the local endpoint and it works without code changes.

Background. Ollama was started in 2023 by Jeffrey Morgan and Michael Chiang as a Mac-first project (Apple Silicon's unified memory makes consumer-grade LLM inference unusually tractable). It expanded to Linux + Windows within months and now runs on CPU, NVIDIA CUDA, AMD ROCm, and Apple Metal, picking the best available accelerator automatically. The team operates Ollama, Inc. (a Delaware C-corp, SF-based) with venture backing — but the runtime is fully open-source under MIT with the codebase at `github.com/ollama/ollama`, and the company's business model is enterprise support / on-prem deployment, not the consumer CLI.

The registry at `ollama.com/library` is the project's centralised distribution surface — analogous to Docker Hub for model weights. You can also point Ollama at any GGUF file on disk via a `Modelfile` (the project's compact spec for declaring a model + system prompt + parameters), so air-gapped or fully self-hosted registry workflows are first-class.

What you trust.

No account, no signup, no email, anywhere in the install or use flow. Ollama is software you run locally; there is no "Ollama account" to create. The first time you launch the desktop app, it doesn't ask you to register — you `ollama pull <model>` and you're done.
Inference is local. Once a model is pulled to disk, generation runs on your hardware with no network calls during the prompt → completion roundtrip. You can pull the network plug after `ollama pull` finishes and inference continues to work.
Open source under MIT. Runtime + all CLI code at `github.com/ollama/ollama` (95k+ stars, active maintenance). Forkable, auditable, and the `Modelfile` format means you can recreate the same model behaviour from raw GGUF weights without touching Ollama's distribution.
Telemetry is opt-in. The desktop app asks at install time; declining means zero usage data leaves your machine. The CLI / server modes have no telemetry. `ollama --help` documents the relevant env vars (`OLLAMA_TELEMETRY=0` as an additional safeguard).
Built on `llama.cpp`. The underlying inference engine is an open-source MIT-licensed inference engine, MIT-licensed, audited by a large community. Ollama's value-add is packaging + registry + API surface, not the inference path.
No vendor lock-in on model weights. Everything Ollama runs is also runnable directly via `llama.cpp` if Ollama ever disappeared — the GGUF format is open and the `Modelfile` spec is human-readable.

Operational specs.

Install: single binary download for macOS / Linux / Windows from `ollama.com/download`, or `curl -fsSL https://ollama.com/install.sh | sh` on Linux. ~600 MB runtime.
Hardware: minimum 8 GB RAM for 7B-parameter models at Q4 quantisation; 16-32 GB for 13B; 64+ GB for 70B. GPU optional — NVIDIA (CUDA 12+), AMD (ROCm 5.7+), Apple Silicon (Metal, all M-series). Without a GPU, CPU inference works but is 5-20× slower.
Storage: per-model disk usage ranges from ~1 GB (1B-parameter Q4) to ~40 GB (70B Q4) to 240+ GB (405B Q4). Models cache to `~/.ollama/models` (configurable via `OLLAMA_MODELS` env).
Models exposed: ~80 in the public library as of mid-2025 — Llama 3 / 3.1 / 3.2, Llama 4 (when released), Mistral / Mixtral, DeepSeek V2/V3/R1, Qwen 2.5, Phi 3.5, Gemma 2, plus vision (LLaVA, Llama-Vision), embedding (nomic-embed-text), and code (CodeLlama, DeepSeek-Coder) variants.
API: `localhost:11434/v1/chat/completions` (OpenAI-compatible — streaming, JSON-mode, tools, vision where the model supports it), plus Ollama's native `/api/generate` and `/api/chat` endpoints. CORS configurable via `OLLAMA_ORIGINS`.
CLI: `ollama run <model>`, `ollama pull <model>`, `ollama list`, `ollama rm <model>`, `ollama show <model> --modelfile`, `ollama create <name> -f Modelfile`. The Modelfile spec covers system prompts, temperature, context length, stop sequences, and adapter merging.
Self-host registry: optional. You can point Ollama at a private GGUF registry (e.g. inside an air-gapped environment) by setting `OLLAMA_HOST` + serving the registry protocol yourself.
Support: GitHub Issues (~1500 open, active maintainer triage), Discord (large + active), no commercial support contract for the free tier (enterprise tier exists via Ollama, Inc.).

Operator philosophy. Jeffrey Morgan's framing in conference talks is "local inference is the default, not the fallback" — the team's design choices consistently favour latency + privacy over feature completeness on the hosted side. The Modelfile + GGUF approach makes Ollama functionally a packaging layer over `llama.cpp`, which means the project's value depreciates if the hosted-LLM economy gets cheaper / more private (good thing) and accretes if local hardware gets faster (also good thing). The Ollama, Inc. enterprise side is decoupled from the open-source runtime — the CLI doesn't degrade if you don't pay, and there's no "free tier" rate limit (because there's no server to limit).

Grade rationale. Grade A reflects: the strongest privacy posture available (inference is local, no operator on the data path, no account to compromise), open-source under permissive MIT licence (forkable + auditable), named-operator accountability without operator dependency (Ollama, Inc. + Jeffrey Morgan publicly identified, but the runtime keeps working if they vanish — switching to `llama.cpp` directly is the equivalent of changing a wrapper), broad hardware support (every consumer accelerator + CPU fallback), rich model library (~80 open-source models, all the post-2024 frontier-grade open releases), OpenAI-compatible API surface (works as a drop-in for any existing tool), kycnot.me corroboration on the no-KYC posture, no major incident or trust-erosion thread in r/LocalLLaMA / r/MachineLearning / GitHub issues in the last 12 months, and deliberate refusal to add usage telemetry. Last verified 2026-05-26.

Useful when:

You want OpenAI / Claude-quality output for sensitive prompts (medical, legal, security research, financial) and can't accept the prompt being seen by any third party.
You're a developer using Continue.dev / Cursor / Aider / LangChain who wants a free local endpoint that's API-compatible with paid frontier-model setups.
You want to comparison-test open-source models against each other or against hosted vendors without paying per-call.
You have a GPU (or even just Apple Silicon) sitting idle and the marginal inference cost is effectively zero.
You need air-gapped / offline inference for a journalism / research / activist workflow where no network connection is acceptable.
You're building a local-first application (Obsidian plugin, Raycast extension, custom Electron app) and want LLM features without telling your users to get an OpenAI key.

Caveats:

Hardware is your bottleneck. A 7B model at Q4 needs 8 GB RAM minimum to run usefully; a 70B model needs 48-64 GB RAM (or a GPU with that VRAM). If your machine doesn't meet the bar, performance is unusable — Ollama doesn't magically make inference cheap, it just removes the network cost.
Output quality lags hosted frontier. Open-source models in the Ollama library run from "as good as GPT-3.5" (Llama 3 7B) to "approaches GPT-4o" (Llama 3.1 405B, DeepSeek V3) but rarely match Claude Opus / Sonnet on reasoning-heavy tasks. For coding + reasoning, NanoGPT or direct API access to Claude is still measurably better — Ollama's pitch is privacy, not raw capability.
No vendor support contract on the free tier. Bug? File a GitHub issue. Enterprise support exists from Ollama, Inc. but isn't free.
Model weights live on `ollama.com/library` by default. If the registry goes down, `ollama pull <new-model>` breaks until you point at an alternate source — but existing pulled models keep working forever.
Telemetry opt-in is at install time only. If you accept on the first launch, the desktop app sends usage pings. Disabling later requires editing config OR setting `OLLAMA_TELEMETRY=0` as an additional safeguard. The CLI / `ollama serve` modes have no telemetry regardless.
Updates are manual by default (the desktop app prompts; CLI users `brew upgrade ollama` or re-run the install script). New model formats sometimes require a runtime bump.
VRAM accounting is approximate. Ollama will sometimes try to fit a model that's too big for your GPU and fall back to CPU mid-generation, which silently reduces throughput to unusable levels. Watch `ollama ps` to confirm which device is doing inference.

Fees

Free · MIT · runtime + model weights local

Links

WEB https://ollama.com

Sourced from operator pages — verify identity via more than one channel before trusting time-sensitive instructions.

Audit trail — receipts for the editorial claim

● UPSTREAM Up · HTTP 200 · 135ms · checked 1h ago
○ ONION No .onion mirror listed
✎ MANUAL Last manual verification 2026-05-26 (<90d)
⌕ LOG See curator log for Ollama

Reviews — moderated · rules

No community reviews yet. Be the first below.

Add a review

Honest, brand-neutral feedback welcome. A curator approves before it appears here. No JS required.

At a glance