Running Local LLMs on a Gaming GPU

I have a desktop with an RTX 5070 (12GB VRAM) sitting mostly idle when I'm not gaming. It turns out that's enough hardware to run useful language models locally.

Why local?

Zero API costs: No per-token billing.
Privacy: Your data never leaves your network.
Latency: First token in ~200ms on a local network.
No rate limits: Run as many queries as you want.

The trade-off is model size. With 12GB VRAM, I'm limited to 7B–14B parameter models quantized to 4-bit. That's not GPT-4, but it's good enough for coding assistance, summarization, and chat.
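
To see where the 7B–14B range comes from, here is a back-of-the-envelope sketch of the memory needed for the weights alone. The function name is made up for illustration, and the numbers are a lower bound: real 4-bit formats usually average somewhat more than 4 bits per weight, and the context cache and runtime overhead come on top.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Rough comparison of the model sizes mentioned in this post
for name, params, bits in [
    ("7B at 4-bit", 7, 4),
    ("9B at 4-bit", 9, 4),
    ("14B at 4-bit", 14, 4),
    ("7B at 16-bit", 7, 16),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```

A 14B model at 4-bit comes out around 7 GB, which leaves room for context on a 12GB card, while a 16-bit 7B model already needs about 14 GB and would not fit.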

The setup

# On the desktop (RTX 5070, Linux)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma2:9b
ollama pull qwen2.5:7b

# Test it
ollama run gemma2:9b "Write a Three.js function to load a GLB model"

The RTX 5070 handles 9B models at about 30–40 tokens/second. For a 7B model, it hits 50+ tok/s. That's comfortable for interactive use.
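
If you want to check throughput on your own hardware, one option is Ollama's HTTP API. The sketch below assumes the server is listening on its default address (localhost, port 11434) and derives tokens per second from the eval_count and eval_duration fields that a non-streaming /api/generate reply includes. The function names are my own.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (duration is in nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Calling generate("gemma2:9b", "some prompt") and passing the result to tokens_per_second gives a number you can compare against the figures above. The generated text itself is in the reply's "response" field.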

Exposing it to the network

I don't expose Ollama directly. Instead, I use it through Hermes Agent (an AI CLI agent) that can switch between local and cloud models depending on the task:

hermes config set model gemma2:9b
hermes "explain this code" < somefile.py

For lightweight tasks (summarization, classification), local models work great. For complex reasoning, I fall back to cloud APIs.
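
The local-versus-cloud split can be sketched as a small routing function. This is not how Hermes Agent does it internally. The task labels, the "cloud-model" placeholder, and the 8192-token default are all assumptions made up for illustration.

```python
LOCAL_MODEL = "gemma2:9b"    # one of the models pulled earlier
CLOUD_MODEL = "cloud-model"  # placeholder: substitute whatever cloud model you use

# Hypothetical task labels for the kinds of work that run well locally
LOCAL_TASKS = {"summarization", "classification", "code_explanation", "bash_one_liner"}


def pick_model(task: str, prompt_tokens: int, local_context_limit: int = 8192) -> str:
    """Return the local model for lightweight tasks that fit in context, else the cloud model."""
    if task in LOCAL_TASKS and prompt_tokens <= local_context_limit:
        return LOCAL_MODEL
    return CLOUD_MODEL
```

With these defaults, a 2,000-token summarization job stays local, while complex debugging or a 20,000-token prompt goes to the cloud.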

What works well locally

Task	Model	Quality
Code explanation	Qwen2.5 7B	Good
Bash one-liners	Gemma2 9B	Very good
Document summarization	Any 7B+	Good enough
Creative writing	Cloud	Not great locally
Complex debugging	Cloud	Needs larger models

What I learned

Quantization is worth the trade. A 4-bit 9B model often outperforms a full-precision 7B model, and a 16-bit 7B model needs about 14GB for its weights alone, so it wouldn't fit in 12GB anyway.
Context length is the real bottleneck. 12GB VRAM gives you about 8K–16K context depending on the model. Long documents need chunking.
The GPU doesn't need to be dedicated. On my machine, inference at 30 tok/s uses only about 40% of the GPU, so you can still browse or code while it runs. Ollama uses CUDA when available and falls back to CPU otherwise.
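
For the chunking mentioned above, a minimal sketch looks like this. It assumes roughly 4 characters per token, which is a common rule of thumb for English text and not an exact count, and it overlaps consecutive chunks so a sentence cut at a boundary still appears whole in one of them. The function and parameter names are my own.

```python
def chunk_text(text: str, max_tokens: int = 2000, overlap_tokens: int = 200,
               chars_per_token: int = 4) -> list[str]:
    """Split text into overlapping chunks sized by a rough characters-per-token estimate."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    size = max_tokens * chars_per_token                  # chunk length in characters
    step = size - overlap_tokens * chars_per_token       # distance between chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):                    # this chunk reached the end
            break
    return chunks
```

Each chunk can then be summarized on its own, and the partial summaries combined in a final pass.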

If you've got a consumer GPU sitting around, throw Ollama on it. It's one command to install, and you've got your own private AI endpoint.