xmr.club
EN 中文 ES RU
← all guides
guide · long-form explainer

Run uncensored LLMs locally — the access nobody can revoke

On 12 June 2026 a single export-control directive forced Anthropic to suspend Fable 5 and Mythos 5 for every foreign national, overnight. Hosted intelligence is a permission — and permissions get revoked. A model whose weights live on your own disk has none of that fragility. Here is how to run one.

Why local, why now

On 12 June 2026 a US export-control directive forced Anthropic to suspend access to Fable 5 and Mythos 5 for every foreign national — overnight, no wrongdoing required. A policy change upstream, and hundreds of millions of people lost a tool they relied on. That is the structural risk of renting intelligence from a gatekeeper: access is a permission, and permissions get revoked, geofenced, repriced, or logged.

A model whose weights live on your own disk has none of that fragility. It can't be cut off by a directive you never saw, throttled, or quietly fine-tuned against you. Open-weight models are to AI what running your own node is to Bitcoin: clunkier than the hosted option, and yours in a way the hosted option can never be.

"Uncensored" here means two things: weights you can run with no API gate, and fine-tunes that don't refuse benign requests. Both matter — but neither makes a model smarter or more truthful. Treat the outputs like any tool's: useful, fallible, and your responsibility.

The hardware reality (and the quantization cheat)

The one number that matters is memory — VRAM if you have a GPU, system RAM if you don't. The trick that makes local models practical is quantization: compressing weights from 16-bit down to 4-bit with little quality loss. A rough rule for a 4-bit (Q4_K_M) GGUF model:

  • 7–8B parameters: ~5 GB. Runs on a laptop, even CPU-only (slowly). 8 GB VRAM is comfortable.
  • 13–14B: ~9 GB. A 12 GB GPU or a 16 GB Mac.
  • 30–34B: ~20 GB. A 24 GB GPU (3090/4090) or a 32 GB Mac.
  • 70B: ~42 GB. Two 24 GB GPUs, a 48 GB card, or a 64 GB+ Mac.

Apple Silicon punches above its weight because the GPU shares system RAM — a 64 GB Mac runs models a comparable PC needs two graphics cards for. No GPU at all? A 7B still runs on CPU; expect a few tokens per second, not dozens.

Pick a runtime

  • Ollama — the easiest start. One install, then ollama run llama3.1 pulls and runs a model. Exposes a local API on port 11434 that most chat UIs speak. Recommended for almost everyone.
  • LM Studio — a polished desktop GUI. Browse and download models from Hugging Face, chat, and expose an OpenAI-compatible local server. Best if you want zero terminal.
  • llama.cpp — the bare-metal engine under most of the others. Maximum control and the widest hardware support; you build it and manage GGUF files yourself.
  • vLLM / TGI — for serving one model to many requests at speed on a real GPU. Overkill for one person; right for a shared box.

Pick a model

Start with a strong open-weight base, then choose a fine-tune if you want fewer refusals:

  • Open-weight bases: Llama (Meta), Qwen (Alibaba), Mistral / Mixtral, Gemma (Google), DeepSeek. All downloadable, all run locally. Qwen and Llama 8B are the best "fits on a laptop" all-rounders today.
  • Uncensored / "abliterated" fine-tunes: the Dolphin series, Nous Hermes, and "abliterated" builds (a technique that surgically removes the refusal direction from an existing model). They answer instead of lecturing — useful for security research, fiction, and edge-case questions a hosted model nannies. The cost: they hallucinate at least as much, sometimes more, with no guardrail between you and a confidently wrong answer.

Get weights from Hugging Face. With Ollama, most popular uncensored builds are one command away (e.g. ollama run dolphin-mistral).

Quickstart with Ollama

// install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

// pull + chat with an 8B all-rounder
ollama run llama3.1:8b

// an uncensored fine-tune instead
ollama run dolphin-mistral

// list what you have, free disk later
ollama list
ollama rm dolphin-mistral

Point any OpenAI-compatible client at http://localhost:11434/v1 and you have a private, local drop-in. For a chat UI, Open WebUI runs in one container and talks to Ollama out of the box.

Make it truly private

  • Pull weights, then go offline. Once a model is on disk it needs no network at all. Download over Tor or a VPN if you'd rather Hugging Face / the registry not log your IP against a model list.
  • Block the runtime from phoning home. Ollama and llama.cpp run fully local, but firewall the process anyway (or run it on an air-gapped box) so a future update can't add telemetry behind your back.
  • Keep prompts on-device. The whole point: your conversations never leave the machine. No account, no server-side history, nothing to subpoena.
  • Disk encryption matters more now. Your prompt history and any saved chats live locally — full-disk encryption (see our device guides) is the backstop if the hardware is seized or lost.

Honest caveats

  • Uncensored is not smarter. Removing refusals doesn't add knowledge or accuracy. An abliterated 8B is still an 8B.
  • Local is not frontier. A 70B on your desk is genuinely useful, but it won't match the best hosted models on the hardest tasks. The trade you're making is capability for sovereignty — go in clear-eyed.
  • You own the output. No provider is filtering on your behalf, which is the point — and the responsibility. What you generate and do with it is on you.

See also

The sovereignty logic here is the same one behind every listing on this site: tools you control beat tools you rent. See our OPSEC52 series for the broader threat-model work, and /vpns if you want to download weights without your ISP building a profile.