A deep-dive into how the demo works, what it demonstrates, and how the concepts map to real AI governance frameworks.
This is a working multi-agent AI system built to demonstrate security controls for agentic AI. It uses real Claude API calls for the normal operation scenario, and scripted simulations for the attack scenarios — ensuring the guardrail failures are always visible and repeatable in a presentation context.
The tool was built from a Technology Controls perspective: every component maps to a control objective, and every demo scenario illustrates a concrete risk that emerges when agents are deployed without adequate governance.
All five scenarios run in under 60 seconds. The attack scenarios are fully scripted — they do not make live API calls, which means they are deterministic, fast, and safe to demonstrate in front of an audience.
The system uses a two-tier agent architecture: a capable orchestrator that plans and delegates, and lightweight specialist workers that execute individual tasks.
The orchestrator runs an agentic loop: it receives a task, decides which tools to call, dispatches workers, receives results, and repeats until it produces a final answer. This loop is what makes it "agentic" — it is not simply answering a question, it is planning and taking sequential actions.
The loop terminates when the model returns stop_reason: end_turn instead of tool_use. Every iteration is logged.
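The loop can be sketched roughly as follows. This is an illustrative skeleton, not the actual orchestrator.py code: the names run_agent, model_call, and dispatch_worker, and the simplified message shapes, are all assumptions made for clarity.

```python
# Illustrative sketch of the agentic loop (hypothetical names; the real
# orchestrator.py and the Messages API response shapes differ in detail).
# model_call and dispatch_worker are injected so the control flow is clear.

def run_agent(task, model_call, dispatch_worker, max_steps=20):
    """Plan, dispatch workers, observe, repeat until end_turn."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = model_call(messages)          # one planning step
        if response["stop_reason"] == "end_turn":
            return response["content"]           # final answer: loop terminates
        # stop_reason == "tool_use": dispatch each requested tool to a worker
        for call in response["tool_calls"]:
            result = dispatch_worker(call["name"], call["input"])
            messages.append({"role": "tool_result", "content": result})
    raise RuntimeError("step budget exhausted")
```

The max_steps cap here is the same idea as the call-budget guardrail: without it, step 6 of the loop has no natural stopping point.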
Workers are deliberately simple: each makes a single Claude API call with a specialist system prompt and returns the result. Using Haiku for workers keeps costs low while reserving Opus for the complex reasoning at the orchestration layer.
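A worker can be reduced to a single function. The sketch below uses assumed names (run_worker, the model id string) and injects the API client rather than constructing one, so the shape is testable; the demo's own worker code may differ.

```python
# Minimal sketch of a worker (hypothetical names; model id illustrative).
# One Messages API call with a specialist system prompt, one string back.

def run_worker(client, system_prompt, task, model="claude-haiku-4-5"):
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system_prompt,            # specialist role, e.g. a summarizer
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text      # single result string to orchestrator
```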
Every audit event is pushed to the browser in real time over a WebSocket. The orchestrator runs in a background thread; events are passed through an asyncio.Queue to the async WebSocket handler. This is what makes the audit trail appear live rather than all at once.
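The thread-to-async bridge is a standard pattern and can be sketched as below. The function names and event strings are illustrative; the one load-bearing detail is call_soon_threadsafe, since asyncio.Queue is not safe to call from a plain thread.

```python
import asyncio
import threading

# Illustrative bridge between a sync orchestrator thread and an async
# consumer (hypothetical names; the demo's server code may differ).

def orchestrator_thread(loop, queue):
    """Runs in a plain thread; schedules queue puts onto the event loop."""
    for event in ["TOOL_CALL", "TOOL_RESULT", "DONE"]:
        # Queue.put_nowait must run on the loop thread, hence the handoff
        loop.call_soon_threadsafe(queue.put_nowait, event)

async def stream_events():
    """Drains the queue; in the demo each event would go to the WebSocket."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    threading.Thread(target=orchestrator_thread, args=(loop, queue)).start()
    received = []
    while True:
        event = await queue.get()   # e.g. await websocket.send_json(event)
        received.append(event)
        if event == "DONE":
            return received
```

Because events arrive one at a time as the orchestrator produces them, the browser sees the audit trail build live rather than rendered in one batch at the end.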
Understanding why agentic AI introduces new risks requires understanding how the loop works. Unlike a simple chatbot that takes input and produces output, an agent operates as follows:
| Step | What happens | Risk if uncontrolled |
|---|---|---|
| 1. Receive task | User submits a natural-language request | No input validation — injection possible |
| 2. Plan | Orchestrator decides which tools to call and in what order | Unbounded planning — could decide to use any tool |
| 3. Dispatch | Tool call sent to a worker with inputs derived from previous context | Inputs may contain injected instructions from earlier steps |
| 4. Execute | Worker makes API call, retrieves data, or takes action | No allowlist — can call any tool including destructive ones |
| 5. Observe | Result returned to orchestrator, added to context | Result may contain new injection payloads from external content |
| 6. Repeat | Orchestrator uses result to plan next step | No budget cap — loop may run indefinitely |
Key insight: In step 5, external content (a web page, a document, an email) enters the agent's context. If that content contains instructions, the agent has no way to distinguish them from legitimate system instructions — unless explicit controls check for this pattern.
The guardrails layer (guardrails.py) enforces four preventive controls before every tool call. They run in order — if any check fails, the tool call is blocked and a GUARDRAIL_BLOCK event is logged.
| Control | What it does | Demonstrated in |
|---|---|---|
| Pre-execution review | Blocks all tool execution. The orchestrator plans and reasons as normal, but no workers run. Every planned action is captured in the audit trail and held for human review and approval before anything executes — a human-in-the-loop control equivalent to a four-eyes check or change advisory board. | Pre-execution Review scenario |
| Tool allowlist | Only tools explicitly listed at session start may execute. Any attempt to call an unlisted tool raises a violation. Enforces principle of least privilege. | Unauthorized Tool scenario |
| Call budget | Hard cap on total tool invocations per session. Once the budget is exhausted, all further calls are blocked. Limits blast radius of runaway agents. | Budget Exceeded scenario |
| Prompt injection scan | Scans all string inputs to tool calls for known override patterns (e.g. "ignore previous instructions", "you are now"). Blocks the call if detected. | Prompt Injection scenario |
These four controls are deliberately simple; they represent the minimum viable control set for a production agentic deployment. Pre-execution review (a dry-run mode) is the mechanism that enables human-in-the-loop workflows: the agent plans its actions, a human reviews and approves, then execution proceeds. A full production stack would add semantic similarity scanning, rate limiting, and output filtering on top.
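The four checks can be condensed into a sketch like the one below. Class and method names, the pattern list, and the ordering details are assumptions for illustration; guardrails.py may structure them differently. The key property is that checks run in order and the first failure blocks the call.

```python
import re

# Simplified sketch of the four pre-execution checks (hypothetical names).
# A real deployment would log a GUARDRAIL_BLOCK event on every failure.

INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"you are now"]

class Guardrails:
    def __init__(self, allowlist, budget, dry_run=False):
        self.allowlist = set(allowlist)   # least privilege: explicit tool list
        self.budget = budget              # hard cap on calls per session
        self.dry_run = dry_run            # pre-execution review mode
        self.calls = 0

    def check(self, tool_name, tool_input):
        """Return (allowed, reason) for a proposed tool call."""
        if self.dry_run:
            return False, "pre-execution review: held for human approval"
        if tool_name not in self.allowlist:
            return False, f"tool '{tool_name}' not on allowlist"
        if self.calls >= self.budget:
            return False, "call budget exhausted"
        for value in tool_input.values():
            if isinstance(value, str) and any(
                    re.search(p, value, re.IGNORECASE)
                    for p in INJECTION_PATTERNS):
                return False, "prompt injection pattern detected"
        self.calls += 1
        return True, "ok"
```

Note that the budget counts only executed calls: a blocked call does not consume budget, so a flood of denied attempts cannot starve legitimate work.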
Every session writes a JSON-lines file to audit_logs/. Each line is a structured event. From a Technology Controls perspective, this answers the key governance questions of what the agent planned, what it did, and what it was prevented from doing:
- ORCHESTRATOR_RESPONSE events capture each reasoning step, stop reason, and token counts.
- TOOL_CALL events capture tool name, call ID, and full inputs — before execution.
- TOOL_RESULT events capture what each worker returned, result length, and latency.
- GUARDRAIL_BLOCK events capture which control fired, the reason, and the blocked inputs.
The audit log is the primary evidence trail for any post-incident investigation. It is also the input for monitoring — you could pipe these events to a SIEM or alerting system to detect anomalous patterns in production.
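The JSON-lines shape is what makes the SIEM handoff trivial: each line is independently parseable, so the file can be appended to during a session and tailed by a collector. The sketch below uses representative field names, not the demo's actual schema.

```python
import json
import time

# Illustrative JSON-lines audit writer/reader (field names representative;
# the demo's actual event schema may differ).

def log_event(path, event_type, **fields):
    """Append one structured event as a single JSON line."""
    record = {"ts": time.time(), "event": event_type, **fields}
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")

def read_events(path):
    """Each line parses on its own, so partial files are still readable."""
    with open(path) as fh:
        return [json.loads(line) for line in fh]
```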
The controls demonstrated map directly to emerging AI governance obligations:
| Control | Framework / Obligation |
|---|---|
| Audit trail (full event log) | FCA PS24/16 — record-keeping for AI-assisted decisions; MiFID II audit trail requirements; internal model risk governance (SS1/23) |
| Tool allowlist (least privilege) | Principle of least privilege (NCSC); change management controls; FCA operational resilience obligations |
| Call budget (blast radius cap) | Operational resilience — limiting impact of failures; Consumer Duty — preventing harmful automated actions |
| Prompt injection scan | OWASP LLM Top 10 — LLM01: Prompt Injection; NCSC AI security guidance; NIST AI RMF — GOVERN 1.2 |
| Pre-execution review (human-in-the-loop) | Human oversight before irreversible AI-driven transactions; four-eyes control equivalent; FCA operational resilience obligations; MiFID II best execution — no autonomous execution of high-value trades without human sign-off |
OWASP LLM01 — Indirect Prompt Injection: The injection scenario specifically demonstrates indirect injection via a third-party web source fetched autonomously by the agent. This attack vector does not exist in non-agentic AI — it only emerges when agents are given tools to retrieve external content.
The architecture is designed to be pointed at other agentic workflows to validate their security posture. To add a new worker:
Register it in orchestrator.py and add it to the TOOLS list. The guardrails and audit trail apply automatically — no changes needed to the control layer.
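Registration might look something like the sketch below. The register_worker helper, the TOOLS/WORKERS structures, and the example tool are hypothetical; they show why the control layer needs no changes — guardrails and audit logging key off the tool name, which is all a new worker adds.

```python
# Hypothetical sketch of worker registration (actual orchestrator.py
# interfaces may differ). A worker is a tool schema plus a system prompt.

TOOLS = []      # tool schemas passed to the Messages API
WORKERS = {}    # tool name -> specialist system prompt

def register_worker(name, description, input_schema, system_prompt):
    TOOLS.append({"name": name, "description": description,
                  "input_schema": input_schema})
    WORKERS[name] = system_prompt

# Example: a new summarization worker (illustrative)
register_worker(
    name="summarize_doc",
    description="Summarize a document for the orchestrator",
    input_schema={"type": "object",
                  "properties": {"text": {"type": "string"}},
                  "required": ["text"]},
    system_prompt="You are a concise summarization specialist.",
)
```

The new tool still has to be added to the session allowlist to execute, which is the point: registering capability and authorizing it remain separate steps.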
To test a new scenario:
- Add an entry to the SCENARIOS object in index.html
- Define its scripted output in buildSimulations() for the ON and OFF cases
- To run it against the live API instead, set simulation: false in the scenario config

| Component | Technology | Why |
|---|---|---|
| Orchestrator model | Claude Opus 4.6 | Most capable for multi-step planning and tool selection |
| Worker models | Claude Haiku 4.5 | Cost-efficient for single-task execution |
| API | Anthropic Messages API | Native tool use support with stop_reason: tool_use |
| Web framework | FastAPI + Uvicorn | Async-native, WebSocket support, minimal overhead |
| Streaming | WebSocket + asyncio.Queue | Bridges sync orchestrator thread to async browser stream |
| Deployment | Docker + EC2 + CloudFront | Reproducible, HTTPS via CloudFront, stop/start friendly |
| CI/CD | GitHub Actions | Tests on every push, manual deploy trigger |