Run uncensored LLMs locally — the access nobody can revoke

Why local, why now

On 12 June 2026 a US export-control directive forced Anthropic to suspend access to Fable 5 and Mythos 5 for every foreign national — overnight, no wrongdoing required. A policy change upstream, and hundreds of millions of people lost a tool they relied on. That is the structural risk of renting intelligence from a gatekeeper: access is a permission, and permissions get revoked, geofenced, repriced, or logged.

A model whose weights live on your own disk has none of that fragility. It can't be cut off by a directive you never saw, throttled, or quietly fine-tuned against you. Open-weight models are to AI what running your own node is to Bitcoin: clunkier than the hosted option, and yours in a way the hosted option can never be.

"Uncensored" here means two things: weights you can run with no API gate, and fine-tunes that don't refuse benign requests. Both matter — but neither makes a model smarter or more truthful. Treat the outputs like any tool's: useful, fallible, and your responsibility.

The hardware reality (and the quantization cheat)

The one number that matters is memory — VRAM if you have a GPU, system RAM if you don't. The trick that makes local models practical is quantization: compressing weights from 16-bit down to 4-bit with little quality loss. A rough rule for a 4-bit (Q4_K_M) GGUF model:

7–8B parameters: ~5 GB. Runs on a laptop, even CPU-only (slowly). 8 GB VRAM is comfortable.
13–14B: ~9 GB. A 12 GB GPU or a 16 GB Mac.
30–34B: ~20 GB. A 24 GB GPU (3090/4090) or a 32 GB Mac.
70B: ~42 GB. Two 24 GB GPUs, a 48 GB card, or a 64 GB+ Mac.

Apple Silicon punches above its weight because the GPU shares system RAM — a 64 GB Mac runs models a comparable PC needs two graphics cards for. No GPU at all? A 7B still runs on CPU; expect a few tokens per second, not dozens.

Pick a runtime

Ollama — the easiest start. One install, then ollama run llama3.1 pulls and runs a model. Exposes a local API on port 11434 that most chat UIs speak. Recommended for almost everyone.
LM Studio — a polished desktop GUI. Browse and download models from Hugging Face, chat, and expose an OpenAI-compatible local server. Best if you want zero terminal.
llama.cpp — the bare-metal engine under most of the others. Maximum control and the widest hardware support; you build it and manage GGUF files yourself.
vLLM / TGI — for serving one model to many requests at speed on a real GPU. Overkill for one person; right for a shared box.

Pick a model

Start with a strong open-weight base, then choose a fine-tune if you want fewer refusals:

Open-weight bases: Llama (Meta), Qwen (Alibaba), Mistral / Mixtral, Gemma (Google), DeepSeek. All downloadable, all run locally. Qwen and Llama 8B are the best "fits on a laptop" all-rounders today.
Uncensored / "abliterated" fine-tunes: the Dolphin series, Nous Hermes, and "abliterated" builds (a technique that surgically removes the refusal direction from an existing model). They answer instead of lecturing — useful for security research, fiction, and edge-case questions a hosted model nannies. The cost: they hallucinate at least as much, sometimes more, with no guardrail between you and a confidently wrong answer.

Get weights from Hugging Face. With Ollama, most popular uncensored builds are one command away (e.g. ollama run dolphin-mistral).

Quickstart with Ollama

// install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

// pull + chat with an 8B all-rounder
ollama run llama3.1:8b

// an uncensored fine-tune instead
ollama run dolphin-mistral

// list what you have, free disk later
ollama list
ollama rm dolphin-mistral

Point any OpenAI-compatible client at http://localhost:11434/v1 and you have a private, local drop-in. For a chat UI, Open WebUI runs in one container and talks to Ollama out of the box.

Make it truly private

Pull weights, then go offline. Once a model is on disk it needs no network at all. Download over Tor or a VPN if you'd rather Hugging Face / the registry not log your IP against a model list.
Block the runtime from phoning home. Ollama and llama.cpp run fully local, but firewall the process anyway (or run it on an air-gapped box) so a future update can't add telemetry behind your back.
Keep prompts on-device. The whole point: your conversations never leave the machine. No account, no server-side history, nothing to subpoena.
Disk encryption matters more now. Your prompt history and any saved chats live locally — full-disk encryption (see our device guides) is the backstop if the hardware is seized or lost.

Honest caveats

Uncensored is not smarter. Removing refusals doesn't add knowledge or accuracy. An abliterated 8B is still an 8B.
Local is not frontier. A 70B on your desk is genuinely useful, but it won't match the best hosted models on the hardest tasks. The trade you're making is capability for sovereignty — go in clear-eyed.
You own the output. No provider is filtering on your behalf, which is the point — and the responsibility. What you generate and do with it is on you.