Running Local LLMs on a Gaming GPU

I have a desktop with an RTX 5070 (12GB VRAM) sitting mostly idle when I'm not gaming. It turns out that's enough hardware to run useful language models locally.

Why local?

  • Zero API costs: No per-token billing.
  • Privacy: Your data never leaves your network.
  • Latency: First token in ~200ms on a local network.
  • No rate limits: Run as many queries as you want.

The trade-off is model size. With 12GB VRAM, I'm limited to 7B–14B parameter models quantized to 4-bit. That's not GPT-4, but it's good enough for coding assistance, summarization, and chat.

The setup

# On the desktop (RTX 5070, Linux)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma2:9b
ollama pull qwen2.5:7b

# Test it
ollama run gemma2:9b "Write a Three.js function to load a GLB model"

The RTX 5070 handles 9B models at about 30–40 tokens/second. For a 7B model, it hits 50+ tok/s. That's comfortable for interactive use.

Exposing it to the network

I don't expose Ollama directly. Instead, I use it through Hermes Agent (an AI CLI agent) that can switch between local and cloud models depending on the task:

hermes config set model gemma2:9b
hermes "explain this code" < somefile.py

For lightweight tasks (summarization, classification), local models work great. For complex reasoning, I fall back to cloud APIs.

What works well locally

Task Model Quality
Code explanation Qwen2.5 7B Good
Bash one-liners Gemma2 9B Very good
Document summarization Any 7B+ Good enough
Creative writing Not great Stick to cloud
Complex debugging Cloud better Needs larger models

What I learned

  • Quantization matters more than raw parameters. A 4-bit 9B model often outperforms a full-precision 7B model.
  • Context length is the real bottleneck. 12GB VRAM gives you about 8K–16K context depending on the model. Long documents need chunking.
  • The GPU doesn't need to be dedicated. Ollama uses CUDA when available but falls back to CPU. Inference at 30 tok/s barely uses 40% of the GPU, so you can still browse or code while it runs.

If you've got a consumer GPU sitting around, throw Ollama on it. It's one command to install, and you've got your own private AI endpoint.