Why I Started Running AI Models Locally
A few months ago I got tired of pasting sensitive work notes into ChatGPT and hoping for the best. Every time I used a cloud AI for anything remotely private — reading a config file, summarizing an internal document, debugging code that touched credentials — there was that nagging voice in the back of my head asking exactly what was happening with that data. I knew the answer was probably “nothing bad,” but probably is not good enough when it comes to infrastructure you are actually responsible for.
Then I discovered Ollama, and within about 45 minutes I had a capable large language model running entirely on my homelab server, processing requests locally, with zero data ever leaving my network. I have been running it ever since, and it has become one of the most genuinely useful things in my self-hosted stack. This guide walks you through the entire setup — from bare metal to a working local AI with a web interface, API access, and Docker integration.
What Is Ollama?
Ollama is an open-source tool that makes downloading, managing, and running large language models locally about as easy as running Docker containers. Before Ollama, getting a model like Llama or Mistral running locally required wrestling with Python environments, CUDA libraries, quantization tools, and model weights scattered across Hugging Face repositories. It was doable, but it was not something you would set up on a Tuesday afternoon.
Ollama wraps all of that complexity behind a clean CLI and REST API. You pull a model the same way you pull a Docker image, run it with a single command, and get a fully functional local AI that responds to HTTP requests. It supports CPU inference, NVIDIA GPU acceleration, AMD ROCm, and Apple Silicon — so it works on almost any hardware you might have sitting in your homelab.
Supported Models
Ollama supports a wide range of open-source models. Here are the ones I have found most useful:
- llama3.2 (1B/3B) — Meta’s latest generation of small text models. The 3B runs well on CPU-only machines. For image understanding, the separate llama3.2-vision (11B) is great with a modest GPU.
- mistral (7B) — Excellent general-purpose model with strong reasoning. One of the best per-dollar (or per-watt) options.
- qwen2.5-coder (7B/14B) — Alibaba’s code-focused model. Outstanding for code completion, debugging, and explaining code. I use this one constantly.
- deepseek-r1 (7B/14B) — Strong reasoning and math capabilities. Impressive for a local model.
- phi4 (14B) — Microsoft’s latest compact model. Punches well above its weight for reasoning tasks.
- nomic-embed-text — Not a chat model, but a text embedding model for building semantic search and RAG pipelines locally.
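To make the embedding model concrete: a semantic-search pipeline boils down to comparing embedding vectors with cosine similarity. Here is a minimal sketch of that math. The three-element vectors below are made-up stand-ins for the real ones, which come back from Ollama's embedding endpoint as lists of floats (768 dimensions for nomic-embed-text).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-in vectors; real ones are 768-dim floats from the embedding model.
doc_vec = [0.2, 0.8, 0.1]
query_vec = [0.25, 0.75, 0.05]
unrelated_vec = [0.9, -0.1, 0.4]

print(cosine_similarity(doc_vec, query_vec))      # close to 1.0
print(cosine_similarity(doc_vec, unrelated_vec))  # much lower
```

In a real RAG setup you would embed each document chunk once, store the vectors, embed the user's query at question time, and feed the highest-similarity chunks to a chat model as context.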
Hardware Requirements: What Do You Actually Need?
This is the question everyone asks first. The honest answer is: less than you think for basic use, more than you think for great performance. Here is a practical breakdown.
CPU-Only (Minimum)
- Any modern multi-core CPU (Intel Core i5/i7, AMD Ryzen 5/7)
- 16 GB RAM minimum; 32 GB recommended if you want to run 7B+ models comfortably
- Models run in RAM, so RAM capacity is your main constraint
- Inference is slow — expect 3-8 tokens per second on a 7B model without GPU
- Good for: light use, summarization, short conversations
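The token rates above translate directly into wait time, which is worth working out before you commit to CPU-only. A quick back-of-envelope calculation (the token counts are illustrative):

```python
def generation_time_s(tokens: int, tokens_per_s: float) -> float:
    """Rough wall-clock time to generate a response of a given length."""
    return tokens / tokens_per_s

# A ~300-token answer is a few paragraphs of text.
print(generation_time_s(300, 5))   # CPU-only 7B at ~5 tok/s -> 60.0 seconds
print(generation_time_s(300, 45))  # mid-range GPU at ~45 tok/s -> under 7 seconds
```

A minute per answer is fine for batch jobs you kick off and walk away from, but it is painful for interactive chat, which is why the GPU tiers below matter.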
NVIDIA GPU (Recommended for Performance)
- Any NVIDIA GPU with 8 GB VRAM or more (RTX 3060 12GB is the sweet spot)
- RTX 3060 12GB: comfortably runs 7B models, can run 13B with some layers offloaded to CPU
- RTX 3080 10GB / RTX 4070: runs 13B models fully in VRAM
- RTX 3090 24GB / RTX 4090 24GB: handles 30B+ models, excellent for production homelab use
- Inference speed: 30-80+ tokens per second depending on model and GPU
My Setup
I run Ollama on two machines depending on the task. My primary Ollama host is a desktop with an RTX 3060 12GB, Ubuntu 24.04, and 32 GB of DDR4. This handles mistral:7b and qwen2.5-coder:7b entirely in VRAM at a comfortable ~45 tokens/second. For lighter tasks, my Intel NUC (the same one that runs Plex) handles small models like llama3.2:3b on CPU when the desktop is off.
Installing Ollama on Linux
Ollama provides a one-line install script that handles everything including NVIDIA driver detection:
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama as a systemd service that starts automatically on boot. Verify it is running:
systemctl status ollama
You should see it active and listening. By default, Ollama binds to 127.0.0.1:11434. If you want it accessible from other machines on your network (which you almost certainly do for a homelab setup), you need to change the bind address.
Exposing Ollama to Your Local Network
Edit the systemd service override:
sudo systemctl edit ollama
Add these lines:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Now any machine on your network can reach Ollama at http://YOUR_SERVER_IP:11434. Keep this on your local network only — do not expose port 11434 to the internet without authentication in front of it.
Pulling and Running Your First Model
Downloading and running a model works exactly like Docker:
# Pull a model (downloads it to ~/.ollama/models)
ollama pull mistral
# Run an interactive chat session
ollama run mistral
# Run a one-shot prompt
ollama run mistral "Explain what BGP is in two sentences"
# List downloaded models
ollama list
# Remove a model
ollama rm mistral
The first pull takes a few minutes depending on your connection speed and model size. A 7B model in Q4 quantization (the default) is around 4-5 GB. Once downloaded, it is cached locally — subsequent runs are instant.
Model Quantization Variants
You will notice models come in variants like mistral:7b-q4_0, mistral:7b-q8_0, and so on. These refer to quantization levels:
- Q4_0 / Q4_K_M — 4-bit quantization. Smallest file size, lowest VRAM usage, slight quality reduction. Best for constrained hardware.
- Q5_K_M — 5-bit. Good balance of size and quality. My usual choice.
- Q8_0 — 8-bit. Near full-precision quality but doubles the VRAM requirement.
- FP16 — Full 16-bit floating point. Best quality, requires the most VRAM. For a 7B model this is around 14 GB VRAM.
For most homelab use, Q4_K_M or Q5_K_M hits the right balance. The quality difference between Q4 and Q8 is real but often not meaningful for everyday tasks like summarization, code help, and Q&A.
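The sizes above follow from simple arithmetic: parameter count times bits per weight. Here is a rough estimator; the effective bits-per-weight figures are approximations (K-quants store some tensors at higher precision, so real GGUF files run a bit larger, and the KV cache adds more VRAM at runtime).

```python
def approx_model_gb(params_billion: float, bits_per_weight: float) -> float:
    """Back-of-envelope model size: parameters x bits per weight / 8."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Approximate effective bits per weight for common quantization levels.
for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"7B {name}: ~{approx_model_gb(7, bits):.1f} GB")
```

This reproduces the numbers quoted earlier: a 7B model lands around 4 GB at Q4, 5 GB at Q5, and 14 GB at FP16, which is why quantization choice is really a VRAM budget decision.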
Running Ollama in Docker
If you prefer to keep everything containerized (I do for most services), Ollama has an official Docker image. Here is a Docker Compose file for a GPU-enabled setup:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_HOST=0.0.0.0:11434
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

volumes:
  ollama_data:
For CPU-only, drop the runtime: nvidia and NVIDIA_VISIBLE_DEVICES lines. You will also need the NVIDIA Container Toolkit installed on the host for GPU passthrough:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Adding a Web UI: Open WebUI
The CLI is great for testing, but for day-to-day use you want a proper web interface. Open WebUI (formerly Ollama WebUI) is a polished, feature-rich frontend that looks and feels like ChatGPT but runs entirely on your infrastructure.
Add it to your Docker Compose file:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui_data:/app/backend/data
    ports:
      - "3000:8080"
    depends_on:
      - ollama
    restart: unless-stopped
And add open-webui_data: to your volumes section. Bring everything up with docker compose up -d and navigate to http://YOUR_SERVER_IP:3000.
Open WebUI gives you:
- Multi-model conversations — switch between models in the same interface
- Conversation history with search
- System prompt templates you can save and reuse
- File uploads for document analysis (with vision-capable models)
- Multi-user support with separate accounts and usage tracking
- RAG (Retrieval Augmented Generation) support for querying your own documents
- Image generation integration if you run a Stable Diffusion backend
Honestly, Open WebUI is so good it has fully replaced ChatGPT for me on tasks where the 7B/13B local models are competitive. For complex reasoning, I still reach for a frontier model, but for code explanation, summarization, and quick Q&A, local is fast enough and completely private.
Using the Ollama REST API
One of Ollama’s most powerful features is its REST API. Alongside its native endpoints, it exposes an OpenAI-compatible API, which means any application built for ChatGPT can be pointed at your Ollama instance with a one-line config change.
Basic Chat Completion
curl http://localhost:11434/api/chat -d '{
  "model": "mistral",
  "messages": [
    {
      "role": "user",
      "content": "What is the difference between TCP and UDP?"
    }
  ],
  "stream": false
}'
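With "stream": false, the whole response arrives as one JSON object. A short Python sketch of pulling out the answer and the generation speed; the field names match what Ollama returns (eval_duration is in nanoseconds), but the values here are made up for illustration.

```python
import json

# Abbreviated, made-up example of a non-streaming /api/chat response.
raw = json.dumps({
    "model": "mistral",
    "message": {"role": "assistant", "content": "TCP is connection-oriented..."},
    "done": True,
    "eval_count": 120,               # tokens generated
    "eval_duration": 2_500_000_000,  # nanoseconds spent generating them
})

resp = json.loads(raw)
answer = resp["message"]["content"]
tok_per_s = resp["eval_count"] / resp["eval_duration"] * 1e9
print(answer)
print(f"{tok_per_s:.0f} tok/s")  # 48 tok/s for these made-up numbers
```

The eval_count / eval_duration ratio is a handy way to benchmark your own hardware against the tokens-per-second figures quoted earlier.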
OpenAI-Compatible Endpoint
Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
This means you can use Ollama as a drop-in replacement with any tool that supports custom OpenAI endpoints — VS Code extensions like Continue.dev, LangChain, LlamaIndex, and many more.
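To see what "drop-in replacement" means in practice, here is a stdlib-only Python sketch: the request body is the standard OpenAI chat shape, and only the base URL differs from a cloud setup. The chat() helper needs a reachable Ollama instance, so only the payload construction is executed here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # adjust host as needed

def build_payload(model: str, prompt: str) -> dict:
    """Standard OpenAI-style chat payload; the same shape works against any
    OpenAI-compatible endpoint."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str) -> str:
    """POST to the OpenAI-compatible endpoint and return the reply text.
    Not called here because it requires a running Ollama server."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-format responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]

print(build_payload("mistral", "Hello!"))
```

Tools like Continue.dev or LangChain do essentially this under the hood, which is why repointing them at Ollama is a one-line change.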
Integrating Ollama with VS Code for Local Code Completion
This is the integration that genuinely changed how I work. The Continue extension for VS Code connects to Ollama and provides real-time code completion and chat entirely in your editor, privately.
- Install the Continue extension from the VS Code marketplace
- Open ~/.continue/config.json
- Add your Ollama instance as a model provider:
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://YOUR_SERVER_IP:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://YOUR_SERVER_IP:11434"
  }
}
With this configuration, you get tab completion powered by qwen2.5-coder that never leaves your network. The latency is slightly higher than GitHub Copilot on a remote GPU, but it is fast enough to be genuinely useful, and you can use it on proprietary codebases without any concern about data exposure.
Performance Tuning Tips
Set the Number of GPU Layers
By default, Ollama offloads as many layers as possible to the GPU. You can override this with the num_gpu parameter in your model file or API request. If you have limited VRAM, experiment with partial offloading — running some layers on the GPU and the rest on CPU. Even 50% GPU offload dramatically speeds up inference compared to full CPU.
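To build intuition for picking a num_gpu value, here is a rough sketch of the budget math. The per-layer size and layer count are ballpark assumptions for illustration (a 13B model at Q4 is roughly 8 GB spread over about 40 layers), not exact figures for any specific model.

```python
def layers_on_gpu(total_layers: int, layer_gb: float, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM, leaving headroom
    for the KV cache and anything else using the GPU."""
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable / layer_gb))

# Assumed: 13B @ Q4 is ~8 GB over ~40 layers -> ~0.2 GB per layer.
print(layers_on_gpu(40, 0.2, 12.0))  # 12 GB card: all 40 layers fit
print(layers_on_gpu(40, 0.2, 8.0))   # 8 GB card: 32 layers, rest on CPU
```

The resulting layer count is what you would pass as num_gpu; in practice, start from Ollama's automatic choice and only override it if you see out-of-memory errors or unused VRAM.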
Adjust Context Length
The default context window in Ollama is often set conservatively. For tasks requiring long documents, increase it via API:
curl http://localhost:11434/api/chat -d '{
  "model": "mistral",
  "options": {"num_ctx": 8192},
  "messages": [...]
}'
Keep in mind that longer context windows increase VRAM usage.
Keep Models Loaded in Memory
By default, Ollama unloads models after 5 minutes of inactivity to free VRAM. For interactive use you might want to keep a model warm. Set OLLAMA_KEEP_ALIVE=-1 in your environment to keep models loaded indefinitely, or pass "keep_alive": -1 in your API request.
Pulling Multiple Models for Different Tasks
My current model lineup for reference:
# General assistant — fast, great quality
ollama pull mistral:7b-instruct-q5_K_M
# Code tasks — best local coding model in this size class
ollama pull qwen2.5-coder:7b-instruct-q5_K_M
# Reasoning and analysis
ollama pull deepseek-r1:7b
# Text embeddings for RAG
ollama pull nomic-embed-text
Having multiple models available lets you route tasks to the right tool. Open WebUI makes switching between them trivial during a conversation.
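For scripted use, that routing can be a few lines of code. This is a hypothetical helper of my own, not an Ollama feature; the model tags are the ones pulled above.

```python
# Hypothetical routing table mirroring the model lineup in this section.
ROUTES = {
    "chat": "mistral:7b-instruct-q5_K_M",
    "code": "qwen2.5-coder:7b-instruct-q5_K_M",
    "reasoning": "deepseek-r1:7b",
    "embed": "nomic-embed-text",
}

def pick_model(task: str) -> str:
    """Route a task category to a model tag, defaulting to the general model."""
    return ROUTES.get(task, ROUTES["chat"])

print(pick_model("code"))     # qwen2.5-coder:7b-instruct-q5_K_M
print(pick_model("unknown"))  # falls back to mistral
```

Any script that talks to the API can then pass pick_model(task) as the "model" field instead of hard-coding one tag everywhere.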
Is This Worth It for Your Homelab?
Let me be straight about the tradeoffs. Local LLMs in the 7B-13B parameter range are genuinely useful, but they are not GPT-4. For complex multi-step reasoning, nuanced writing, or tasks that require deep world knowledge, a frontier cloud model is still better. The quality gap is real.
But for my actual day-to-day use cases — explaining code I am reading, drafting config files, summarizing long documentation, answering questions about my homelab stack, converting data formats — mistral:7b and qwen2.5-coder:7b handle probably 70% of my AI requests adequately. And they do it instantly, privately, and at zero marginal cost per query.
If you already have a decent GPU in a homelab machine, running Ollama is a no-brainer. If you are CPU-only, the slower inference makes it less ideal for real-time chat but still excellent for batch processing tasks you can kick off and walk away from.
Either way, getting full control over your AI stack is worth it. Ollama makes that easier than it has ever been.