How I Built a Self-Managing Homelab Using Claude AI Routines

Why I Wanted a Homelab That Could Take Care of Itself

I’ve been running a homelab for years. Docker stacks, self-hosted services, monitoring dashboards, cron jobs, Ansible playbooks — the works. But no matter how much I automated, there was always a gap: something breaks at 2am, a container crashes silently, a disk fills up, or I need to spin up a new service and I can’t remember which Compose file pattern I used last time.

The fix I kept reaching for was more documentation, more runbooks, more alerts. But what I actually needed was something that could reason about my lab — not just fire off a webhook when a threshold was breached. That’s where Claude AI routines came in, and honestly, they changed how I think about homelab automation entirely.

This post walks through exactly how I wired up Claude to act as an autonomous homelab assistant — one that can diagnose issues, suggest fixes, execute predefined safe actions, and keep a running log of what it did and why. It’s not magic, but it’s closer than anything I’ve built before.


The Architecture: What “Self-Managing” Actually Means

Let me be clear about scope upfront. “Self-managing” doesn’t mean Claude has root on everything and goes rogue at midnight. It means I’ve built a set of Claude AI routines — scheduled or event-triggered scripts that gather context, reason about it using the Claude API, and take one of a small set of pre-authorized actions. Everything else gets escalated to me as a notification.

The three-layer architecture looks like this:

  1. Data collectors — scripts that pull metrics, logs, and state from my infrastructure
  2. Claude reasoning layer — sends context to the Claude API, gets back analysis and action recommendations
  3. Action executor — a restricted set of safe actions Claude can trigger directly (restart container, send alert, write to log), with everything else escalated to me

If you’ve been following along with AI agents and MCP workflows, this is the same concept applied to infrastructure ops rather than general task automation.
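To make the third layer concrete, here is a minimal sketch of what an action executor check could look like. It assumes the model is asked to reply with a small JSON object; that reply format, the action names in `ALLOWED_ACTIONS`, and the `dispatch` function are hypothetical names chosen for illustration, not part of any SDK.

```python
import json

# Hypothetical action names for illustration; the real allowlist is whatever
# you are comfortable letting automation do unattended.
ALLOWED_ACTIONS = {"restart_container", "send_alert", "write_log"}

def dispatch(reply_text: str) -> dict:
    """Validate a model reply and decide whether to execute or escalate.

    Assumes the model was asked to answer with JSON such as
    {"action": "restart_container", "target": "homepage"}.
    """
    try:
        reply = json.loads(reply_text)
    except json.JSONDecodeError:
        return {"decision": "escalate", "why": "reply was not valid JSON"}
    if not isinstance(reply, dict):
        return {"decision": "escalate", "why": "reply was not a JSON object"}
    action = reply.get("action")
    # Anything that is not a known, pre-authorized action name gets escalated
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return {"decision": "escalate", "why": f"action {action!r} is not pre-authorized"}
    return {"decision": "execute", "action": action, "target": reply.get("target")}
```

The design choice worth noting is that the model's output is treated as untrusted input: the executor never runs a command the model wrote, it only maps a recognized action name onto code that already exists.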


The Tech Stack

  • Python 3.12 — all routines are Python scripts
  • Anthropic Python SDK — for calling Claude
  • Docker + Portainer — my services all run in containers (see the Docker beginner’s guide if you’re new to this)
  • Prometheus + node_exporter — metrics scraping
  • Uptime Kuma — service availability monitoring
  • ntfy.sh — push notifications to my phone
  • cron — scheduling the routines

Setting Up the Claude API Client

Install the SDK and set your API key:

pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

I store the key in a .env file in my homelab scripts directory and source it in my cron environment. Here’s the base client wrapper I use across all routines:

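# claude_helper.py
import os

import anthropic

def ask_claude(system_prompt: str, user_message: str, model: str = "claude-opus-4-8") -> str:
    """Send one system prompt and one user message to Claude; return the reply text."""
    # The API key is read from the environment, so it must be set for cron jobs too
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    message = client.messages.create(
        model=model,
        max_tokens=1024,  # digests and diagnoses are short; cap the reply length
        system=system_prompt,
        messages=[
            {"role": "user", "content": user_message}
        ]
    )
    # The reply is a list of content blocks; the first one holds the text
    return message.content[0].text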

For most homelab routines, claude-haiku-4-5-20251001 is fast and cheap enough, so the scripts below pass it explicitly. I only reach for claude-opus-4-8, the wrapper's default, when I need deep reasoning, like root cause analysis of a multi-service failure cascade.
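One practical wrinkle: cron starts jobs with a minimal environment, so the key from the .env file has to be loaded somehow before `os.environ["ANTHROPIC_API_KEY"]` is read. A minimal sketch of doing that in Python, assuming the file contains simple `KEY=value` lines (optionally prefixed with `export`); `load_env_file` is a hypothetical helper, and a library such as python-dotenv would cover more edge cases.

```python
import os
from pathlib import Path

def load_env_file(path: str) -> None:
    """Load KEY=value lines from a .env-style file into os.environ.

    Assumes simple single-line entries; no variable expansion or
    multi-line values. Variables already set in the environment win.
    """
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding quotes so "sk-ant-..." and sk-ant-... behave the same
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
```

Calling this once at the top of claude_helper.py, with the path to wherever the .env file lives (for example /opt/homelab/routines/.env, which is an assumption here), would make the key available to every routine regardless of how it is launched.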


Routine #1: The Daily Health Digest

Every morning at 7am, this routine fires. It collects system state, hands it to Claude, and sends a plain-English summary to my phone. No more squinting at Grafana before coffee.

#!/usr/bin/env python3
# daily_digest.py

import subprocess, json, requests, os
from claude_helper import ask_claude

NTFY_URL = "https://ntfy.sh/my-homelab-alerts"

def collect_docker_state():
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "json"],
        capture_output=True, text=True
    )
    containers = [json.loads(line) for line in result.stdout.strip().split('\n') if line]
    return containers

def collect_disk_usage():
    result = subprocess.run(["df", "-h", "--output=source,pcent,target"],
                            capture_output=True, text=True)
    return result.stdout

def collect_failed_units():
    result = subprocess.run(["systemctl", "--failed", "--no-legend"],
                            capture_output=True, text=True)
    return result.stdout or "None"

def main():
    containers = collect_docker_state()
    exited = [c for c in containers if c.get("State") == "exited"]
    running = [c for c in containers if c.get("State") == "running"]

    disk_usage = collect_disk_usage()
    failed_units = collect_failed_units()

    context = f"""
Docker containers running: {len(running)}
Docker containers exited/stopped: {len(exited)}
Exited container names: {[c['Names'] for c in exited]}

Disk usage:
{disk_usage}

Failed systemd units:
{failed_units}
"""

    system_prompt = """You are a homelab health assistant. Given system state,
produce a concise morning digest in plain English. Lead with any critical issues
(exited containers, disk >85% full, failed units). If everything looks healthy,
say so briefly. Max 5 bullet points. Be direct."""

    summary = ask_claude(system_prompt, context, model="claude-haiku-4-5-20251001")

    requests.post(NTFY_URL, data=summary.encode("utf-8"),
                  headers={"Title": "Homelab Daily Digest", "Tags": "computer"},
                  timeout=10)  # don't let a hung request stall the cron job
    print(summary)

if __name__ == "__main__":
    main()

The key insight here is that Claude doesn’t just list facts — it interprets them. When three containers are stopped and disk is at 87%, it tells me “your Jellyfin stack is down and you’re close to the disk limit on /data — check logs before it cascades.” That’s worth more than raw metrics.
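One gap in the digest script is that it only counts containers whose state is running or exited. Docker also reports states such as restarting, paused, created, and dead, and a container stuck in a restart loop is exactly the kind of thing a morning digest should surface. A minimal sketch of counting every state, assuming the one-JSON-object-per-line output that `docker ps -a --format json` produces; `summarize_states` is a hypothetical helper name.

```python
import json
from collections import Counter

def summarize_states(ps_output: str) -> Counter:
    """Count containers per State from `docker ps -a --format json` output.

    Expects one JSON object per line; blank lines are ignored.
    """
    states = Counter()
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Containers without a State field are counted as "unknown"
        states[json.loads(line).get("State", "unknown")] += 1
    return states
```

Passing `dict(summarize_states(result.stdout))` into the context string would give Claude the full picture instead of two hand-picked buckets.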


Routine #2: Container Crash Responder

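This one runs every 5 minutes. For each container it finds in the exited state, it fetches the last 50 lines of logs, asks Claude to diagnose the crash, and either auto-restarts the container (if it’s on the safe-restart allowlist) or sends me a detailed alert with Claude’s analysis. As written, the script treats every exited container as a crash, including ones that finished cleanly or that I stopped on purpose, so those should be removed or filtered out.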

#!/usr/bin/env python3
# crash_responder.py

import subprocess, json, requests, os, sys
from claude_helper import ask_claude

NTFY_URL = "https://ntfy.sh/my-homelab-alerts"
SAFE_RESTART = {"uptime-kuma", "vaultwarden", "homepage", "actual-budget"}

def get_exited_containers():
    result = subprocess.run(
        ["docker", "ps", "-a", "--filter", "status=exited", "--format", "json"],
        capture_output=True, text=True
    )
    return [json.loads(l) for l in result.stdout.strip().split('\n') if l]

def get_container_logs(name: str, tail: int = 50) -> str:
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), name],
        capture_output=True, text=True
    )
    return result.stdout + result.stderr

def restart_container(name: str):
    subprocess.run(["docker", "restart", name], check=True)

def main():
    crashed = get_exited_containers()
    if not crashed:
        sys.exit(0)

    for container in crashed:
        name = container["Names"]
        logs = get_container_logs(name)

        system_prompt = """You are a Docker troubleshooting expert.
Analyze the container logs and identify the root cause of the crash.
Be specific: name the error, the likely cause, and one recommended fix.
Keep it under 100 words."""

        analysis = ask_claude(system_prompt, f"Container: {name}\n\nLogs:\n{logs}",
                              model="claude-haiku-4-5-20251001")

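        if name in SAFE_RESTART:
            try:
                restart_container(name)
                msg = f"Auto-restarted {name}.\n\nDiagnosis: {analysis}"
                tags = "arrows_counterclockwise"
            except subprocess.CalledProcessError as exc:
                # A failed restart should reach me instead of killing the loop
                msg = f"Auto-restart of {name} failed ({exc}).\n\nDiagnosis: {analysis}"
                tags = "rotating_light"
        else:
            msg = f"Container {name} crashed — manual action needed.\n\nDiagnosis: {analysis}"
            tags = "rotating_light"

        requests.post(NTFY_URL, data=msg.encode("utf-8"),
                      headers={"Title": f"Container Alert: {name}", "Tags": tags},
                      timeout=10)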

if __name__ == "__main__":
    main()

The SAFE_RESTART allowlist is intentional. I only let Claude auto-restart stateless services where an unexpected restart carries no data-loss risk. Anything with a database or persistent writes gets escalated to me with the analysis — I act on it, but Claude did the diagnostics.
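Because the responder polls every 5 minutes, a container that stays down would be diagnosed and reported on every run. A minimal sketch of filtering for unexpected exits and suppressing repeats, assuming the `Status` field looks like `Exited (137) 2 hours ago` and that the set of already-handled container IDs is persisted between runs (for example in a small JSON file, which is not shown). `exit_code` and `select_new_crashes` are hypothetical helper names.

```python
import re

EXIT_CODE_RE = re.compile(r"Exited \((\d+)\)")

def exit_code(status: str):
    """Extract the exit code from a Status string, or None if it cannot be parsed."""
    match = EXIT_CODE_RE.search(status)
    return int(match.group(1)) if match else None

def select_new_crashes(containers: list, seen_ids: set):
    """Return (fresh, still_exited).

    fresh: containers with a non-zero exit code that were not handled before.
    still_exited: IDs to remember for the next run. Containers that have
    recovered drop out, so a later crash is reported again.
    """
    fresh, still_exited = [], set()
    for c in containers:
        code = exit_code(c.get("Status", ""))
        if code is None or code == 0:
            continue  # clean exit or unparseable status: not treated as a crash
        cid = c.get("ID")
        still_exited.add(cid)
        if cid not in seen_ids:
            fresh.append(c)
    return fresh, still_exited
```

Note that a manual `docker stop` can also leave a non-zero exit code (often 137 or 143), so this narrows the noise rather than eliminating it.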


Routine #3: Disk Space Guardian

Running local LLMs with Ollama means model files can silently bloat your storage. This routine checks disk usage on my NAS mount and Docker volumes, and asks Claude to recommend what to clean up when things get tight.

#!/usr/bin/env python3
# disk_guardian.py

import subprocess, requests, os
from claude_helper import ask_claude

NTFY_URL = "https://ntfy.sh/my-homelab-alerts"
WARN_THRESHOLD = 80  # percent

def get_disk_usage():
    result = subprocess.run(
        ["df", "-h", "--output=source,size,used,avail,pcent,target"],
        capture_output=True, text=True
    )
    return result.stdout

def get_docker_volume_sizes():
    result = subprocess.run(
        ["docker", "system", "df", "-v"],
        capture_output=True, text=True
    )
    return result.stdout

def parse_high_usage_mounts(df_output: str, threshold: int) -> list:
    lines = df_output.strip().split('\n')[1:]
    high = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 5:
            pct_str = parts[4].replace('%', '')
            try:
                if int(pct_str) >= threshold:
                    high.append(line)
            except ValueError:
                pass
    return high

def main():
    df_out = get_disk_usage()
    vol_out = get_docker_volume_sizes()
    high_mounts = parse_high_usage_mounts(df_out, WARN_THRESHOLD)

    if not high_mounts:
        return  # all good, stay quiet

    context = f"""High disk usage detected on the following mounts:
{chr(10).join(high_mounts)}

Docker volume sizes:
{vol_out[:2000]}"""

    system_prompt = """You are a homelab storage advisor. Given disk usage data,
recommend specific cleanup actions — Docker image pruning, log rotation targets,
directories to review. Be concrete and actionable. Max 5 bullet points."""

    advice = ask_claude(system_prompt, context, model="claude-haiku-4-5-20251001")

    msg = f"Disk usage alert.\n\n{advice}"
    requests.post(NTFY_URL, data=msg.encode("utf-8"),
                  headers={"Title": "Disk Space Warning", "Tags": "floppy_disk,warning"},
                  timeout=10)  # don't let a hung request stall the cron job

if __name__ == "__main__":
    main()
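One thing to watch with the threshold check above: `df` also lists pseudo-filesystems, and on systems with snap packages the read-only squashfs loop mounts typically report 100% usage, which would trip the alert on every run. A minimal sketch of filtering those lines out before the threshold check; the set of source names is an assumption that will need adjusting per system, and `drop_pseudo_filesystems` is a hypothetical helper.

```python
# Sources that df reports but that are not real disks. This list is an
# assumption; adjust it to whatever shows up on your own hosts.
PSEUDO_SOURCES = {"tmpfs", "devtmpfs", "udev", "overlay", "shm", "none", "efivarfs"}

def drop_pseudo_filesystems(df_lines: list) -> list:
    """Keep only df lines whose source looks like a real disk or network mount."""
    kept = []
    for line in df_lines:
        parts = line.split()
        if not parts:
            continue
        source = parts[0]
        # /dev/loopN mounts are usually read-only images (for example snaps)
        if source in PSEUDO_SOURCES or source.startswith("/dev/loop"):
            continue
        kept.append(line)
    return kept
```

Applying this to the result of `parse_high_usage_mounts` (or to the raw lines before parsing) keeps the "stay quiet when healthy" behavior intact.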

Routine #4: Weekly Config Drift Checker

This is my favorite one. Once a week, it diffs my current docker-compose files against the versions committed in my Git repo, and asks Claude to flag anything that looks like unintentional drift — a port exposure I added for debugging and forgot to remove, an environment variable that shouldn’t be in production, a volume mount that bypasses my NAS.

#!/usr/bin/env python3
# drift_checker.py

import subprocess, os, requests
from claude_helper import ask_claude

NTFY_URL = "https://ntfy.sh/my-homelab-alerts"
COMPOSE_DIR = "/opt/homelab"

def get_git_diff():
    # check=True makes a git failure (wrong directory, not a repo) raise
    # instead of silently looking like "no drift"
    result = subprocess.run(
        ["git", "diff", "HEAD", "--", "*.yml", "*.yaml"],
        capture_output=True, text=True, check=True,
        cwd=COMPOSE_DIR
    )
    return result.stdout

def main():
    diff = get_git_diff()

    if not diff.strip():
        return  # no drift, stay quiet

    system_prompt = """You are a homelab security and configuration auditor.
Review this git diff of docker-compose files. Flag:
1. Security concerns (exposed ports, privileged containers, missing network isolation)
2. Likely debug artifacts (debug flags, extra volume mounts, commented-out sections)
3. Anything that looks unintentional

Be specific about file and line. Keep response under 200 words."""

    analysis = ask_claude(system_prompt, diff, model="claude-opus-4-8")

    msg = f"Weekly config drift report:\n\n{analysis}"
    requests.post(NTFY_URL, data=msg.encode("utf-8"),
                  headers={"Title": "Config Drift Alert", "Tags": "mag,warning"},
                  timeout=10)  # don't let a hung request stall the cron job

if __name__ == "__main__":
    main()

I use claude-opus-4-8 here because security review deserves the most capable reasoning — the cost of missing a real issue is higher than a few extra cents per week.
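If a week's worth of changes touches many Compose files, the diff can get long enough that a 200-word answer glosses over things. One option is to split the diff per file and review each chunk separately. A minimal sketch, assuming standard `git diff` output where every file section starts with a `diff --git a/<path> b/<path>` header; `split_diff_by_file` is a hypothetical helper, and paths that themselves contain " b/" would confuse this simple parsing.

```python
def split_diff_by_file(diff_text: str) -> dict:
    """Split `git diff` output into {path: diff chunk} using the 'diff --git' headers."""
    chunks = {}
    current = None
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/path b/path
            current = line.strip().split(" b/", 1)[-1]
            chunks[current] = ""
        if current is not None:
            chunks[current] += line
    return chunks
```

Each chunk can then be sent through `ask_claude` on its own, which also makes it easier for the model to be specific about file and line.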


Wiring It All Together with Cron

All routines live in /opt/homelab/routines/ and run under a dedicated homelab-bot user with only the permissions it needs:

# crontab for homelab-bot
# Daily digest at 7am
0 7 * * * /opt/homelab/routines/venv/bin/python /opt/homelab/routines/daily_digest.py

# Container crash responder every 5 minutes
*/5 * * * * /opt/homelab/routines/venv/bin/python /opt/homelab/routines/crash_responder.py

# Disk guardian every 6 hours
0 */6 * * * /opt/homelab/routines/venv/bin/python /opt/homelab/routines/disk_guardian.py

# Weekly drift check on Sunday at 9am
0 9 * * 0 /opt/homelab/routines/venv/bin/python /opt/homelab/routines/drift_checker.py

Each script logs to /var/log/homelab-routines/ so I can audit what Claude decided and why. This is important — you want a paper trail when automation is making decisions about your infrastructure.
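The listings above only print to stdout, so the logging has to be added somewhere. A minimal sketch using the standard library `logging` module, assuming one log file per routine inside /var/log/homelab-routines/; the per-routine file naming and the `get_routine_logger` helper are assumptions for illustration.

```python
import logging
from pathlib import Path

def get_routine_logger(name: str, log_dir: str = "/var/log/homelab-routines") -> logging.Logger:
    """Return a logger that appends timestamped lines to <log_dir>/<name>.log."""
    logger = logging.getLogger(f"homelab.{name}")
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

In the crash responder, for example, something like `log = get_routine_logger("crash_responder")` followed by `log.info("restarted %s: %s", name, analysis)` would record both the action and the reasoning behind it. The homelab-bot user needs write access to the log directory for this to work.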


Cost Reality Check

A common concern with AI automation is cost. Here’s my actual monthly spend running these four routines:

  • Daily digest: ~30 Haiku calls/month → < $0.02
  • Crash responder: Varies, typically 0–10 real crashes/month → < $0.01
  • Disk guardian: ~4 calls/month when triggered → negligible
  • Drift checker: ~4 Opus calls/month → ~$0.10

Total: under $0.15/month. At typical residential electricity rates, that is less than it costs to keep a single Raspberry Pi powered for a month.
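The arithmetic behind estimates like these is simple enough to script for your own workload. A minimal sketch, assuming pricing is quoted per million input and output tokens; the function name and every number in the usage example are placeholders, not actual prices or measured token counts.

```python
def estimate_monthly_cost(calls: int, input_tokens: int, output_tokens: int,
                          input_price_per_mtok: float,
                          output_price_per_mtok: float) -> float:
    """Estimate monthly spend in dollars for one routine.

    input_tokens and output_tokens are per-call averages; prices are
    dollars per million tokens, taken from the current pricing page.
    """
    per_call = (input_tokens * input_price_per_mtok
                + output_tokens * output_price_per_mtok) / 1_000_000
    return calls * per_call

# Placeholder numbers for illustration only
example = estimate_monthly_cost(calls=100, input_tokens=1000, output_tokens=200,
                                input_price_per_mtok=1.0, output_price_per_mtok=5.0)
print(f"${example:.2f} per month")
```

The response object from the Messages API reports actual token usage per call, so logging those numbers for a week gives much better inputs than guessing.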


What’s Next: Extending the System

The routines above are my production setup, but I’ve been experimenting with two extensions:

MCP integration: Connecting Claude directly to my infrastructure via the Model Context Protocol so it can query Prometheus metrics in real-time during a conversation, not just in batch routines. If you’re not familiar with MCP yet, the pattern is the same as what’s covered in the Ansible automation guide — define what Claude is allowed to touch, then let it act within that boundary.

Event-driven triggers: Instead of polling every 5 minutes, hooking the crash responder directly into Docker’s event stream so it fires the moment a container exits. Combined with Claude’s fast response times on Haiku, this gets you near-real-time diagnosis without a polling loop.
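A minimal sketch of what listening to that event stream could look like, using the `docker events` CLI with a filter for `die` events and JSON formatting. This is an experiment-stage illustration rather than a production setup: `parse_die_event` and `watch_container_exits` are hypothetical names, and the parsing assumes the container name and exit code appear under `Actor.Attributes` in each event.

```python
import json
import subprocess

def parse_die_event(line: str):
    """Return (container_name, exit_code) from one JSON event line, or None."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    attrs = event.get("Actor", {}).get("Attributes", {})
    name = attrs.get("name")
    if name is None:
        return None
    code = attrs.get("exitCode")  # reported as a string, e.g. "137"
    return (name, int(code) if code is not None else None)

def watch_container_exits():
    """Yield (name, exit_code) each time a container dies. Blocks indefinitely."""
    proc = subprocess.Popen(
        ["docker", "events", "--filter", "type=container",
         "--filter", "event=die", "--format", "{{json .}}"],
        stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        parsed = parse_die_event(line)
        if parsed:
            yield parsed
```

A long-running process like this would live under systemd rather than cron, and the same allowlist logic from the crash responder would apply to each event it yields.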


Key Takeaways

Building a self-managing homelab with Claude routines isn’t about replacing your judgment — it’s about handling the tedious diagnostic work so you can focus on the interesting stuff. A few principles that kept me out of trouble:

  • Restrict the action surface: Claude only executes from a pre-approved list. Everything else is an alert.
  • Stay quiet when healthy: Routines that fire notifications every time they run create alert fatigue fast. Only notify on anomalies.
  • Match model to task: Haiku for simple triage, Opus for security and root-cause analysis. Cost follows capability.
  • Log everything: When automation makes a decision, write it to a log. Future you will want to know why the container was restarted at 3am.
  • Start small: One routine, fully trusted, is worth more than five half-tested ones.

The homelab is supposed to be fun. Making it smart enough to handle its own fires means more time for the parts that actually are fun — deploying new services, experimenting, breaking things on purpose. That’s the whole point.
