I have a desktop with an RTX 5070 (12GB VRAM) sitting mostly idle when I'm not gaming. It turns out that's enough hardware to run useful language models locally.
Why local?
- Zero API costs: No per-token billing.
- Privacy: Your data never leaves your network.
- Latency: First token in ~200ms on a local network.
- No rate limits: Run as many queries as you want.
The trade-off is model size. With 12GB VRAM, I'm limited to 7B–14B parameter models quantized to 4-bit. That's not GPT-4, but it's good enough for coding assistance, summarization, and chat.
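The size limit follows from simple arithmetic on the weights. Here is a rough sketch, assuming memory is dominated by the weights and ignoring the KV cache and runtime overhead. The helper name `weight_memory_gb` is made up for illustration, and real 4-bit formats store some extra scaling data, so actual files come out somewhat larger.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare 16-bit and 4-bit footprints for the model sizes mentioned above.
for params in (7, 9, 14):
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{params}B: {fp16:.1f} GB at 16-bit, {q4:.1f} GB at 4-bit")
```

By this estimate a 14B model needs about 7 GB at 4-bit, which leaves some of the 12GB for context, while even a 7B model at 16-bit (about 14 GB) would not fit.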
The setup
```bash
# On the desktop (RTX 5070, Linux)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma2:9b
ollama pull qwen2.5:7b

# Test it
ollama run gemma2:9b "Write a Three.js function to load a GLB model"
```
The RTX 5070 handles 9B models at about 30–40 tokens/second. For a 7B model, it hits 50+ tok/s. That's comfortable for interactive use.
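If you want to check throughput on your own hardware, one option is to ask Ollama's local HTTP API for a non-streaming completion and divide the `eval_count` field by `eval_duration` (reported in nanoseconds). This is a minimal sketch, assuming Ollama is listening on its default port 11434. The function names are my own, and this is not necessarily how the figures above were measured.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def tokens_per_second(result: dict) -> float:
    """Generation speed computed from a response's eval_count and eval_duration."""
    # eval_duration is reported in nanoseconds, so convert it to seconds first.
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `tokens_per_second(generate("gemma2:9b", "Explain what a GLB file is"))` gives the generation rate for that single request. Averaging over a few prompts gives a steadier number.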
Exposing it to the network
I don't expose Ollama directly. Instead, I use it through Hermes Agent (an AI CLI agent), which can switch between local and cloud models depending on the task:
```bash
hermes config set model gemma2:9b
hermes "explain this code" < somefile.py
```
For lightweight tasks (summarization, classification), local models work great. For complex reasoning, I fall back to cloud APIs.
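The local-or-cloud split can be pictured as a small routing rule. The sketch below is purely illustrative and is not how Hermes Agent chooses a model. The task labels, the `pick_model` helper, and the `cloud-model` placeholder are all assumptions.

```python
# Task labels and model names here are illustrative assumptions.
LOCAL_MODEL = "gemma2:9b"
CLOUD_MODEL = "cloud-model"  # placeholder, not a real model name

LIGHTWEIGHT_TASKS = {"summarization", "classification"}


def pick_model(task: str) -> str:
    """Route lightweight tasks to the local model and everything else to the cloud."""
    return LOCAL_MODEL if task in LIGHTWEIGHT_TASKS else CLOUD_MODEL


print(pick_model("summarization"))
print(pick_model("complex reasoning"))
```

A fixed allow-list like this is the simplest possible policy. It errs toward the cloud for anything it doesn't recognize, which matches the fallback behavior described above.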
What works well locally
| Task | Model | Quality |
|---|---|---|
| Code explanation | Qwen2.5 7B | Good |
| Bash one-liners | Gemma2 9B | Very good |
| Document summarization | Any 7B+ | Good enough |
| Creative writing | Cloud model | Local output is not great |
| Complex debugging | Cloud model | Needs larger models |
What I learned
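- Quantization is worth it: parameter count matters more than precision. A 4-bit 9B model often outperforms a full-precision 7B model, and at 16-bit the 7B model's weights alone (about 14GB) wouldn't fit in 12GB of VRAM anyway.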
- Context length is the real bottleneck. 12GB VRAM gives you about 8K–16K context depending on the model. Long documents need chunking.
- The GPU doesn't need to be dedicated. Ollama uses CUDA when available and falls back to CPU otherwise. Inference at 30 tok/s uses only about 40% of the GPU, so you can still browse or code while it runs.
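For the chunking mentioned in the list above, a simple approach is to split on words with a small overlap so that context carries across chunk boundaries. This is a minimal sketch: word counts are only a rough proxy for tokens, the default sizes are arbitrary assumptions, and `chunk_words` is a name I made up.

```python
def chunk_words(text: str, max_words: int = 1500, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most max_words words, overlapping by `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk can then be summarized separately and the partial summaries combined in a final pass. Pick `max_words` so that a chunk plus the prompt stays well inside the model's context window.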
If you've got a consumer GPU sitting around, throw Ollama on it. It's one command to install, and you've got your own private AI endpoint.