Local LLM Agents (ITAR / CMMC)
A working recipe for driving O.D.I.N. with AI agents without a single outbound packet — no Anthropic API, no OpenAI, no cloud. Every prompt, tool call, and response stays inside your compliance boundary.
This guide pairs ITAR / CMMC mode on the backend with a locally-hosted LLM and the MCP server on the agent host.
Topology
┌─ Operator workstation (air-gapped LAN) ──────────────────────┐
│ │
│ Ollama / llama.cpp Agent client │
│ (Qwen2.5-32B-Instruct) ◄──► (Claude Desktop / Cline / │
│ │ OpenClaw / Continue) │
│ │ │ │
│ │ │ stdio │
│ │ ▼ │
│ │ odin-print-farm-mcp@2 │
│ │ │ │
│ └──────────────────────────────┤ HTTP │
│ ▼ │
│ O.D.I.N. backend │
│ (ODIN_ITAR_MODE=1) │
│ │ │
│ ▼ LAN │
│ Printers │
│ │
└──────────────────────────────────────────────────────────────┘
Zero packets leave this box.
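The zero-egress claim can be enforced at the host level rather than trusted. A minimal nftables sketch with a default-drop output policy; the 192.168.0.0/16 subnet and table name are assumptions, so substitute your actual printer LAN:

```
# /etc/nftables.conf fragment: drop all egress except loopback and the LAN.
# 192.168.0.0/16 is an assumed subnet; replace with your printer network.
table inet egress_guard {
  chain output {
    type filter hook output priority filter; policy drop;
    oif "lo" accept                  # Ollama, MCP server, O.D.I.N. backend
    ip daddr 192.168.0.0/16 accept   # printer LAN
    ct state established,related accept
  }
}
```

With this in place, a misconfigured client that tries to reach a cloud API fails at the kernel instead of leaking a prompt.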
Hardware baseline
| Component | Minimum | Comfortable |
|---|---|---|
| GPU VRAM | 24 GB (single 3090/4090) | 48 GB (2× 3090) |
| System RAM | 32 GB | 64 GB |
| Disk | 50 GB for model weights | 200 GB (multiple models) |
| CPU | 8 cores | 16 cores |
A 32B-parameter model at Q4 quantization fits on one 24 GB card and runs ~20–30 tokens/sec. That's enough for one operator driving the farm interactively.
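The VRAM figures follow from simple arithmetic. A rough sketch, assuming Q4_K_M averages about 4.85 bits per weight (an approximation; actual GGUF sizes vary by tensor layout, and the KV cache needs headroom on top):

```python
# Rough weight-memory estimate for a quantized model.
# 4.85 bits/weight for Q4_K_M is an approximation, not an exact figure.
def weight_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate weight size in GB (decimal)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"32B @ Q4_K_M: {weight_gb(32):.1f} GB weights")   # ~19.4 GB
print(f"72B @ Q4_K_M: {weight_gb(72):.1f} GB weights")   # ~43.7 GB
```

The ~19.4 GB result for a 32B model leaves a few GB of a 24 GB card for the KV cache and CUDA overhead, which is why the table above lists 24 GB as the minimum.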
Model selection
| Model | Params | Quant | VRAM | Notes |
|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 32 B | Q4_K_M | ~20 GB | Recommended. Strong tool use, solid planning, Apache 2.0. |
| Qwen2.5-72B-Instruct | 72 B | Q4_K_M | ~44 GB | Better reasoning; needs 2 GPUs. |
| Llama 3.3 70B-Instruct | 70 B | Q4_K_M | ~42 GB | Meta license, same footprint class as Qwen 72B. |
| Mistral-Small-Instruct-2409 | 22 B | Q5_K_M | ~16 GB | Fits on a 3090 with overhead for other tasks. |
| Qwen2.5-Coder-32B-Instruct | 32 B | Q4_K_M | ~20 GB | Best if you also want coding help on the same box. |
Avoid sub-14B models for driving writes — they hallucinate tool arguments. Use them only for reads (farm_summary, list_jobs) or keep them in dry-run mode.
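One way to hold a small model to reads is a thin gate in front of the MCP server. A sketch only: farm_summary and list_jobs are the read tools named above, but the gate function and the start_job tool name are hypothetical, not part of odin-print-farm-mcp:

```python
# Hypothetical guard: small models may call read-only tools freely;
# anything that mutates printer or job state is blocked unless dry-run.
READ_ONLY_TOOLS = {"farm_summary", "list_jobs"}

def allow_tool_call(tool_name: str, dry_run: bool = False) -> bool:
    """Permit reads always; permit writes only when dry_run is set."""
    return tool_name in READ_ONLY_TOOLS or dry_run

print(allow_tool_call("farm_summary"))             # True
print(allow_tool_call("start_job"))                # False: write blocked
print(allow_tool_call("start_job", dry_run=True))  # True: no real effect
```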
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama serve # listens on 127.0.0.1:11434
Verify:
curl http://127.0.0.1:11434/api/tags
For an air-gapped install, pull the model on a connected staging box, copy ~/.ollama/models/ to the target, and run ollama serve there with no network access.
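The same verification can be scripted. A sketch that parses the /api/tags response and confirms the pulled model is present; the response shape matches Ollama's documented tags endpoint, but the helper names here are ours:

```python
import json
import urllib.request

def model_present(tags_json: dict, name: str) -> bool:
    """Check a parsed /api/tags payload for a model by tag name."""
    return any(m.get("name") == name for m in tags_json.get("models", []))

def check(name: str = "qwen2.5:32b-instruct-q4_K_M") -> bool:
    """Query the local Ollama daemon (requires it to be running)."""
    with urllib.request.urlopen("http://127.0.0.1:11434/api/tags") as r:
        return model_present(json.load(r), name)

# Offline sanity check against a canned payload:
sample = {"models": [{"name": "qwen2.5:32b-instruct-q4_K_M"}]}
print(model_present(sample, "qwen2.5:32b-instruct-q4_K_M"))  # True
```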
Wire the MCP server
Mint an agent:write token in O.D.I.N. (Settings → API Tokens → New Token). Then configure your MCP client.
Claude Desktop
~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"odin": {
"command": "npx",
"args": ["-y", "odin-print-farm-mcp@2"],
"env": {
"ODIN_BASE_URL": "http://127.0.0.1:8000",
"ODIN_API_KEY": "odin_xxxxxxxxxxxxxxxxxxxx"
}
}
}
}
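You can smoke-test the wiring without any agent client by speaking JSON-RPC to the MCP server directly over stdio. A sketch of the standard MCP initialize message; the protocol version string is the one current at the time of writing, so check the MCP spec if the handshake is rejected:

```python
import json

# MCP clients spawn the server as a child process and exchange
# newline-delimited JSON-RPC 2.0 messages over stdin/stdout.
def initialize_msg(msg_id: int = 1) -> dict:
    """Build the standard MCP initialize request."""
    return {
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "smoke-test", "version": "0.0.0"},
        },
    }

# To actually exercise the server (requires npx and a valid token), pipe
# json.dumps(initialize_msg()) + "\n" into:
#   npx -y odin-print-farm-mcp@2
# with ODIN_BASE_URL and ODIN_API_KEY set in the environment.
print(json.dumps(initialize_msg())[:60])
```

A server that answers with its own capabilities confirms the token and backend URL before you involve a model at all.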
Claude Desktop with a local model backend (via a proxy such as LiteLLM or LM Studio) keeps inference on-box, but the client itself is still Anthropic software. If your threat model forbids Anthropic even for the agent loop, use a local-first client instead: