Local LLM Agents (ITAR / CMMC)
A working recipe for driving O.D.I.N. with AI agents without a single outbound packet — no Anthropic API, no OpenAI, no cloud. Every prompt, tool call, and response stays inside your compliance boundary.
This guide pairs ITAR / CMMC mode on the backend with a locally-hosted LLM and the MCP server on the agent host.
Topology
┌─ Operator workstation (air-gapped LAN) ──────────────────────┐
│ │
│ Ollama / llama.cpp Agent client │
│ (Qwen2.5-32B-Instruct) ◄──► (Claude Desktop / Cline / │
│ │ OpenClaw / Continue) │
│ │ │ │
│ │ │ stdio │
│ │ ▼ │
│ │ odin-print-farm-mcp@2 │
│ │ │ │
│ └──────────────────────────────┤ HTTP │
│ ▼ │
│ O.D.I.N. backend │
│ (ODIN_ITAR_MODE=1) │
│ │ │
│ ▼ LAN │
│ Printers │
│ │
└──────────────────────────────────────────────────────────────┘
Zero packets leave this box.
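The zero-egress claim can be enforced at the host level rather than trusted. A minimal nftables sketch with a default-drop output policy; the 192.168.0.0/16 subnet and table name are assumptions, so substitute your actual printer LAN:

```
# /etc/nftables.conf fragment: drop all egress except loopback and the LAN.
# 192.168.0.0/16 is an assumed subnet; replace with your printer network.
table inet egress_guard {
  chain output {
    type filter hook output priority filter; policy drop;
    oif "lo" accept                  # Ollama, MCP server, O.D.I.N. backend
    ip daddr 192.168.0.0/16 accept   # printer LAN
    ct state established,related accept
  }
}
```

With this in place, a misconfigured client that tries to reach a cloud API fails at the kernel instead of leaking a prompt.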
Hardware baseline
| Component | Minimum | Comfortable |
|---|---|---|
| GPU VRAM | 24 GB (single 3090/4090) | 48 GB (2× 3090) |
| System RAM | 32 GB | 64 GB |
| Disk | 50 GB for model weights | 200 GB (multiple models) |
| CPU | 8 cores | 16 cores |
A 32B-parameter model at Q4 quantization fits on one 24 GB card and runs ~20–30 tokens/sec. That's enough for one operator driving the farm interactively.
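The VRAM figures follow from simple arithmetic. A rough sketch, assuming Q4_K_M averages about 4.85 bits per weight (an approximation; actual GGUF sizes vary by tensor layout, and the KV cache needs headroom on top):

```python
# Rough weight-memory estimate for a quantized model.
# 4.85 bits/weight for Q4_K_M is an approximation, not an exact figure.
def weight_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate weight size in GB (decimal)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"32B @ Q4_K_M: {weight_gb(32):.1f} GB weights")   # ~19.4 GB
print(f"72B @ Q4_K_M: {weight_gb(72):.1f} GB weights")   # ~43.7 GB
```

The ~19.4 GB result for a 32B model leaves a few GB of a 24 GB card for the KV cache and CUDA overhead, which is why the table above lists 24 GB as the minimum.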
Model selection
| Model | Params | Quant | VRAM | Notes |
|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 32 B | Q4_K_M | ~20 GB | Recommended. Strong tool use, solid planning, Apache 2.0. |
| Qwen2.5-72B-Instruct | 72 B | Q4_K_M | ~44 GB | Better reasoning; needs 2 GPUs. |
| Llama 3.3 70B-Instruct | 70 B | Q4_K_M | ~42 GB | Meta license, same footprint class as Qwen 72B. |
| Mistral-Small-Instruct-2409 | 22 B | Q5_K_M | ~16 GB | Fits on a 3090 with overhead for other tasks. |
| Qwen2.5-Coder-32B-Instruct | 32 B | Q4_K_M | ~20 GB | Best if you also want coding help on the same box. |
Avoid sub-14B models for driving writes — they hallucinate tool arguments. Use them only for reads (farm_summary, list_jobs) or keep them in dry-run mode.
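One way to hold a small model to reads is a thin gate in front of the MCP server. A sketch only: farm_summary and list_jobs are the read tools named above, but the gate function and the start_job tool name are hypothetical, not part of odin-print-farm-mcp:

```python
# Hypothetical guard: small models may call read-only tools freely;
# anything that mutates printer or job state is blocked unless dry-run.
READ_ONLY_TOOLS = {"farm_summary", "list_jobs"}

def allow_tool_call(tool_name: str, dry_run: bool = False) -> bool:
    """Permit reads always; permit writes only when dry_run is set."""
    return tool_name in READ_ONLY_TOOLS or dry_run

print(allow_tool_call("farm_summary"))             # True
print(allow_tool_call("start_job"))                # False: write blocked
print(allow_tool_call("start_job", dry_run=True))  # True: no real effect
```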
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama serve # listens on 127.0.0.1:11434
Verify:
curl http://127.0.0.1:11434/api/tags
For an air-gapped install, pull the model on a connected staging box, copy ~/.ollama/models/ to the target, and run ollama serve there with no network access.
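The same verification can be scripted. A sketch that parses the /api/tags response and confirms the pulled model is present; the response shape matches Ollama's documented tags endpoint, but the helper names here are ours:

```python
import json
import urllib.request

def model_present(tags_json: dict, name: str) -> bool:
    """Check a parsed /api/tags payload for a model by tag name."""
    return any(m.get("name") == name for m in tags_json.get("models", []))

def check(name: str = "qwen2.5:32b-instruct-q4_K_M") -> bool:
    """Query the local Ollama daemon (requires it to be running)."""
    with urllib.request.urlopen("http://127.0.0.1:11434/api/tags") as r:
        return model_present(json.load(r), name)

# Offline sanity check against a canned payload:
sample = {"models": [{"name": "qwen2.5:32b-instruct-q4_K_M"}]}
print(model_present(sample, "qwen2.5:32b-instruct-q4_K_M"))  # True
```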
Wire the MCP server
Mint an agent:write token in O.D.I.N. (Settings → API Tokens → New Token). Then configure your MCP client.
Claude Desktop
~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"odin": {
"command": "npx",
"args": ["-y", "odin-print-farm-mcp@2"],
"env": {
"ODIN_BASE_URL": "http://127.0.0.1:8000",
"ODIN_API_KEY": "odin_xxxxxxxxxxxxxxxxxxxx"
}
}
}
}
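You can smoke-test the wiring without any agent client by speaking JSON-RPC to the MCP server directly over stdio. A sketch of the standard MCP initialize message; the protocol version string is the one current at the time of writing, so check the MCP spec if the handshake is rejected:

```python
import json

# MCP clients spawn the server as a child process and exchange
# newline-delimited JSON-RPC 2.0 messages over stdin/stdout.
def initialize_msg(msg_id: int = 1) -> dict:
    """Build the standard MCP initialize request."""
    return {
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "smoke-test", "version": "0.0.0"},
        },
    }

# To actually exercise the server (requires npx and a valid token), pipe
# json.dumps(initialize_msg()) + "\n" into:
#   npx -y odin-print-farm-mcp@2
# with ODIN_BASE_URL and ODIN_API_KEY set in the environment.
print(json.dumps(initialize_msg())[:60])
```

A server that answers with its own capabilities confirms the token and backend URL before you involve a model at all.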
Claude Desktop with a local model backend (via a proxy such as LiteLLM or LM Studio) keeps inference on-box, but the client itself is still Anthropic software. If your threat model forbids Anthropic even for the agent loop, use a local-first client instead: