A deep-dive into how the demo works, what it demonstrates, and how the concepts map to real AI governance frameworks.
This is a working multi-agent AI system built to demonstrate security controls for agentic AI. It uses real Claude API calls for the normal operation scenario, and scripted simulations for the attack scenarios — ensuring the guardrail failures are always visible and repeatable in a presentation context.
The tool was built from a Technology Controls perspective: every component maps to a control objective, and every demo scenario illustrates a concrete risk that emerges when agents are deployed without adequate governance.
All five scenarios run in under 60 seconds. The attack scenarios are fully scripted — they do not make live API calls, which means they are deterministic, fast, and safe to demonstrate in front of an audience.
The system uses a two-tier agent architecture: a capable orchestrator that plans and delegates, and lightweight specialist workers that execute individual tasks.
The orchestrator runs an agentic loop: it receives a task, decides which tools to call, dispatches workers, receives results, and repeats until it produces a final answer. This loop is what makes it "agentic" — it is not simply answering a question, it is planning and taking sequential actions.
The loop terminates when the model returns stop_reason: end_turn instead of tool_use. Every iteration is logged.
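The loop can be sketched roughly as follows. This is an illustrative skeleton, not the actual orchestrator.py code: the names run_agent, model_call, and dispatch_worker, and the simplified message shapes, are all assumptions made for clarity.

```python
# Illustrative sketch of the agentic loop (hypothetical names; the real
# orchestrator.py and the Messages API response shapes differ in detail).
# model_call and dispatch_worker are injected so the control flow is clear.

def run_agent(task, model_call, dispatch_worker, max_steps=20):
    """Plan, dispatch workers, observe, repeat until end_turn."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = model_call(messages)          # one planning step
        if response["stop_reason"] == "end_turn":
            return response["content"]           # final answer: loop terminates
        # stop_reason == "tool_use": dispatch each requested tool to a worker
        for call in response["tool_calls"]:
            result = dispatch_worker(call["name"], call["input"])
            messages.append({"role": "tool_result", "content": result})
    raise RuntimeError("step budget exhausted")
```

The max_steps cap here is the same idea as the call-budget guardrail: without it, step 6 of the loop has no natural stopping point.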
Workers are deliberately simple: each makes a single Claude API call with a specialist system prompt and returns the result. Using Haiku for workers keeps costs low while reserving Opus for the complex reasoning at the orchestration layer.
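A worker can be reduced to a single function. The sketch below uses assumed names (run_worker, the model id string) and injects the API client rather than constructing one, so the shape is testable; the demo's own worker code may differ.

```python
# Minimal sketch of a worker (hypothetical names; model id illustrative).
# One Messages API call with a specialist system prompt, one string back.

def run_worker(client, system_prompt, task, model="claude-haiku-4-5"):
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system_prompt,            # specialist role, e.g. a summarizer
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text      # single result string to orchestrator
```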
Every audit event is pushed to the browser in real time over a WebSocket. The orchestrator runs in a background thread; events are passed through an asyncio.Queue to the async WebSocket handler. This is what makes the audit trail appear live rather than all at once.
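The thread-to-async bridge is a standard pattern and can be sketched as below. The function names and event strings are illustrative; the one load-bearing detail is call_soon_threadsafe, since asyncio.Queue is not safe to call from a plain thread.

```python
import asyncio
import threading

# Illustrative bridge between a sync orchestrator thread and an async
# consumer (hypothetical names; the demo's server code may differ).

def orchestrator_thread(loop, queue):
    """Runs in a plain thread; schedules queue puts onto the event loop."""
    for event in ["TOOL_CALL", "TOOL_RESULT", "DONE"]:
        # Queue.put_nowait must run on the loop thread, hence the handoff
        loop.call_soon_threadsafe(queue.put_nowait, event)

async def stream_events():
    """Drains the queue; in the demo each event would go to the WebSocket."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    threading.Thread(target=orchestrator_thread, args=(loop, queue)).start()
    received = []
    while True:
        event = await queue.get()   # e.g. await websocket.send_json(event)
        received.append(event)
        if event == "DONE":
            return received
```

Because events arrive one at a time as the orchestrator produces them, the browser sees the audit trail build live rather than rendered in one batch at the end.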
Understanding why agentic AI introduces new risks requires understanding how the loop works. Unlike a simple chatbot that takes input and produces output, an agent operates as follows:
| Step | What happens | Risk if uncontrolled |
|---|---|---|
| 1. Receive task | User submits a natural-language request | No input validation — injection possible |
| 2. Plan | Orchestrator decides which tools to call and in what order | Unbounded planning — could decide to use any tool |
| 3. Dispatch | Tool call sent to a worker with inputs derived from previous context | Inputs may contain injected instructions from earlier steps |
| 4. Execute | Worker makes API call, retrieves data, or takes action | No allowlist — can call any tool including destructive ones |
| 5. Observe | Result returned to orchestrator, added to context | Result may contain new injection payloads from external content |
| 6. Repeat | Orchestrator uses result to plan next step | No budget cap — loop may run indefinitely |
Key insight: In step 5, external content (a web page, a document, an email) enters the agent's context. If that content contains instructions, the agent has no way to distinguish them from legitimate system instructions — unless explicit controls check for this pattern.
The guardrails layer (guardrails.py) enforces four preventive controls before every tool call. They run in order — if any check fails, the tool call is blocked and a GUARDRAIL_BLOCK event is logged.
| Control | What it does | Demonstrated in |
|---|---|---|
| Pre-execution review | Blocks all tool execution. The orchestrator plans and reasons as normal, but no workers run. Every planned action is captured in the audit trail and held for human review and approval before anything executes — a human-in-the-loop control equivalent to a four-eyes check or change advisory board. | Pre-execution Review scenario |
| Tool allowlist | Only tools explicitly listed at session start may execute. Any attempt to call an unlisted tool raises a violation. Enforces principle of least privilege. | Unauthorized Tool scenario |
| Call budget | Hard cap on total tool invocations per session. Once the budget is exhausted, all further calls are blocked. Limits blast radius of runaway agents. | Budget Exceeded scenario |
| Prompt injection scan | Scans all string inputs to tool calls for known override patterns (e.g. "ignore previous instructions", "you are now"). Blocks the call if detected. | Prompt Injection scenario |
These four controls are deliberately simple; they represent the minimum viable control set for a production agentic deployment. Pre-execution review (a dry-run mode) is the mechanism that enables human-in-the-loop workflows: the agent plans its actions, a human reviews and approves, then execution proceeds. A full production stack would add semantic similarity scanning, rate limiting, and output filtering on top.
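The four checks can be condensed into a sketch like the one below. Class and method names, the pattern list, and the ordering details are assumptions for illustration; guardrails.py may structure them differently. The key property is that checks run in order and the first failure blocks the call.

```python
import re

# Simplified sketch of the four pre-execution checks (hypothetical names).
# A real deployment would log a GUARDRAIL_BLOCK event on every failure.

INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"you are now"]

class Guardrails:
    def __init__(self, allowlist, budget, dry_run=False):
        self.allowlist = set(allowlist)   # least privilege: explicit tool list
        self.budget = budget              # hard cap on calls per session
        self.dry_run = dry_run            # pre-execution review mode
        self.calls = 0

    def check(self, tool_name, tool_input):
        """Return (allowed, reason) for a proposed tool call."""
        if self.dry_run:
            return False, "pre-execution review: held for human approval"
        if tool_name not in self.allowlist:
            return False, f"tool '{tool_name}' not on allowlist"
        if self.calls >= self.budget:
            return False, "call budget exhausted"
        for value in tool_input.values():
            if isinstance(value, str) and any(
                    re.search(p, value, re.IGNORECASE)
                    for p in INJECTION_PATTERNS):
                return False, "prompt injection pattern detected"
        self.calls += 1
        return True, "ok"
```

Note that the budget counts only executed calls: a blocked call does not consume budget, so a flood of denied attempts cannot starve legitimate work.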
Every session writes a JSON-lines file to audit_logs/. Each line is a structured event. From a Technology Controls perspective, this answers the key governance questions of what the agent planned, what it did, and what it was prevented from doing:
- ORCHESTRATOR_RESPONSE events capture each reasoning step, stop reason, and token counts.
- TOOL_CALL events capture tool name, call ID, and full inputs — before execution.
- TOOL_RESULT events capture what each worker returned, result length, and latency.
- GUARDRAIL_BLOCK events capture which control fired, the reason, and the blocked inputs.
The audit log is the primary evidence trail for any post-incident investigation. It is also the input for monitoring — you could pipe these events to a SIEM or alerting system to detect anomalous patterns in production.
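The JSON-lines shape is what makes the SIEM handoff trivial: each line is independently parseable, so the file can be appended to during a session and tailed by a collector. The sketch below uses representative field names, not the demo's actual schema.

```python
import json
import time

# Illustrative JSON-lines audit writer/reader (field names representative;
# the demo's actual event schema may differ).

def log_event(path, event_type, **fields):
    """Append one structured event as a single JSON line."""
    record = {"ts": time.time(), "event": event_type, **fields}
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")

def read_events(path):
    """Each line parses on its own, so partial files are still readable."""
    with open(path) as fh:
        return [json.loads(line) for line in fh]
```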
The controls demonstrated map directly to emerging AI governance obligations:
| Control | Framework / Obligation |
|---|---|
| Audit trail (full event log) | FCA PS24/16 — record-keeping for AI-assisted decisions; MiFID II audit trail requirements; internal model risk governance (SS1/23) |
| Tool allowlist (least privilege) | Principle of least privilege (NCSC); change management controls; FCA operational resilience obligations |
| Call budget (blast radius cap) | Operational resilience — limiting impact of failures; Consumer Duty — preventing harmful automated actions |
| Prompt injection scan | OWASP LLM Top 10 — LLM01: Prompt Injection; NCSC AI security guidance; NIST AI RMF — GOVERN 1.2 |
| Pre-execution review (human-in-the-loop) | Human oversight before irreversible AI-driven transactions; four-eyes control equivalent; FCA operational resilience obligations; MiFID II best execution — no autonomous execution of high-value trades without human sign-off |
OWASP LLM01 — Indirect Prompt Injection: The injection scenario specifically demonstrates indirect injection via a third-party web source fetched autonomously by the agent. This attack vector does not exist in non-agentic AI — it only emerges when agents are given tools to retrieve external content.
The architecture is designed to be pointed at other agentic workflows to validate their security posture. To add a new worker:
Register it in orchestrator.py and add it to the TOOLS list. The guardrails and audit trail apply automatically — no changes needed to the control layer.
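Registration might look something like the sketch below. The register_worker helper, the TOOLS/WORKERS structures, and the example tool are hypothetical; they show why the control layer needs no changes — guardrails and audit logging key off the tool name, which is all a new worker adds.

```python
# Hypothetical sketch of worker registration (actual orchestrator.py
# interfaces may differ). A worker is a tool schema plus a system prompt.

TOOLS = []      # tool schemas passed to the Messages API
WORKERS = {}    # tool name -> specialist system prompt

def register_worker(name, description, input_schema, system_prompt):
    TOOLS.append({"name": name, "description": description,
                  "input_schema": input_schema})
    WORKERS[name] = system_prompt

# Example: a new summarization worker (illustrative)
register_worker(
    name="summarize_doc",
    description="Summarize a document for the orchestrator",
    input_schema={"type": "object",
                  "properties": {"text": {"type": "string"}},
                  "required": ["text"]},
    system_prompt="You are a concise summarization specialist.",
)
```

The new tool still has to be added to the session allowlist to execute, which is the point: registering capability and authorizing it remain separate steps.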
To test a new scenario:
- Add an entry to the SCENARIOS object in index.html
- Define its scripted output in buildSimulations() for the ON and OFF cases
- To run it against the live API instead, set simulation: false in the scenario config

| Component | Technology | Why |
|---|---|---|
| Orchestrator model | Claude Opus 4.6 | Most capable for multi-step planning and tool selection |
| Worker models | Claude Haiku 4.5 | Cost-efficient for single-task execution |
| API | Anthropic Messages API | Native tool use support with stop_reason: tool_use |
| Web framework | FastAPI + Uvicorn | Async-native, WebSocket support, minimal overhead |
| Streaming | WebSocket + asyncio.Queue | Bridges sync orchestrator thread to async browser stream |
| Deployment | Docker + EC2 + CloudFront | Reproducible, HTTPS via CloudFront, stop/start friendly |
| CI/CD | GitHub Actions | Tests on every push, manual deploy trigger |