I Built a Multi-Agent AI System on AWS Without LangChain

Every time I needed to orchestrate multiple AI agents, the default answer was: install LangChain, wire up LangGraph, add a few abstractions, and hope that the framework doesn't change its API next week. I've seen this pattern in dozens of projects. It works. But it also means your production system depends on a Python library maintained by a startup, not on the AWS primitives you're already paying for and already understand.

So I built a production-ready multi-agent orchestration platform using 100% native AWS services. No LangChain. No LangGraph. No third-party orchestration layer — intentionally.

romanceresnak/aws-bedrock-multi-agent

Production-ready multi-agent system · Terraform + Python · fully deployable


The Problem

Manual AI orchestration is painful. You have:

Routing logic living in application code that nobody documents
No quality gate between the AI and the user — bad answers go through
No human fallback for edge cases
Vendor lock-in to Python frameworks that version-bump without mercy

And image generation is usually bolted on as an afterthought — a different API call, a different auth context, a different error surface. This system handles all of it through a single entry point.

The Solution

A fully-native AWS multi-agent system where:

Bedrock Agents handle orchestration via the Supervisor pattern
OpenSearch Serverless is the vector store for RAG
Lambda handles compute — no containers, no clusters
Amazon A2I gives humans a review loop on low-confidence answers
Nova Canvas generates images on demand through an agent action group

The entire infrastructure is reproducible via Terraform. The Bedrock resources (Knowledge Base, agents) are provisioned via Python scripts because the Terraform AWS provider doesn't fully support Bedrock Agents yet.

Architecture: 13-Step Flow

One query. One entry point. Multiple specialists. Automated grading. Human review. All native.

How It Works

Query Rewriting

Before the Supervisor even sees the query, I clean it up. Claude Haiku rewrites it — removing pronouns, expanding abbreviations, resolving ambiguity:

# query_rewrite/handler.py:34-52
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": f"Rewrite this query for maximum clarity. Return ONLY the rewritten query.\n\nQuery: {original_query}"
        }]
    })
)

Haiku is cheap. This step costs fractions of a cent and massively improves downstream retrieval quality.
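To make the round trip concrete, here is a minimal sketch of building the request body and pulling the rewritten text out of the reply. The helper names are hypothetical (they are not taken from the repository), and the response shape assumes the Anthropic Messages format on Bedrock, where the text sits in the first `content` block.

```python
import json

REWRITE_INSTRUCTION = (
    "Rewrite this query for maximum clarity. "
    "Return ONLY the rewritten query.\n\nQuery: {query}"
)


def build_rewrite_body(original_query: str, max_tokens: int = 512) -> str:
    """Serialize a Messages-API request body for invoke_model (hypothetical helper)."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": REWRITE_INSTRUCTION.format(query=original_query),
        }],
    })


def extract_rewritten_query(raw_body, fallback: str) -> str:
    """Return the first text block of the reply, or the fallback if it is empty."""
    payload = json.loads(raw_body)
    blocks = payload.get("content", [])
    text = next((b.get("text", "") for b in blocks if b.get("type") == "text"), "")
    return text.strip() or fallback
```

In a handler this would be used as `extract_rewritten_query(response["body"].read(), original_query)`, so an empty model reply degrades to the user's original wording instead of an empty query.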

Supervisor Pattern

The Supervisor is Claude 3.5 Sonnet v2 with SUPERVISOR orchestration type. It knows about two collaborators:

# 03_create_supervisor.py:98-115
agent_collaborators=[
    {
        "agentDescriptor": {"aliasArn": rag_alias_arn},
        "collaboratorName": "rag-specialist",
        "collaborationInstruction": "Use for any query requiring factual information from company documents, policies, or reports.",
        "relayConversationHistory": "TO_COLLABORATOR"
    },
    {
        "agentDescriptor": {"aliasArn": image_alias_arn},
        "collaboratorName": "image-specialist",
        "collaborationInstruction": "Use when the user explicitly requests an image, visual, diagram, or picture.",
        "relayConversationHistory": "TO_COLLABORATOR"
    }
]

relayConversationHistory: TO_COLLABORATOR is the key detail — sub-agents see the full conversation context, not just the current turn.
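How the list is sent to Bedrock is not shown in the excerpt. One plausible approach, sketched below under that assumption, is to call the `bedrock-agent` operation `associate_agent_collaborator` once per entry against the supervisor's DRAFT version. The function name and the injected `client` parameter are illustrative only.

```python
def associate_collaborators(client, supervisor_agent_id, collaborators,
                            agent_version="DRAFT"):
    """Associate each collaborator with the supervisor and return their names.

    `client` is expected to behave like boto3.client("bedrock-agent").
    """
    associated = []
    for spec in collaborators:
        client.associate_agent_collaborator(
            agentId=supervisor_agent_id,
            agentVersion=agent_version,
            agentDescriptor=spec["agentDescriptor"],
            collaboratorName=spec["collaboratorName"],
            collaborationInstruction=spec["collaborationInstruction"],
            # Without an explicit value, do not relay history.
            relayConversationHistory=spec.get("relayConversationHistory", "DISABLED"),
        )
        associated.append(spec["collaboratorName"])
    return associated
```

Passing the client in keeps the function easy to exercise with a stub before touching a real account.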

After running the setup scripts, all three agents are provisioned and in the Prepared state.

RAG Specialist

The RAG Specialist is Claude 3 Sonnet associated with a Bedrock Knowledge Base. Citations are mandatory — the instruction enforces it:

Always cite the source document when providing information.
Format citations as: [Source: s3://bucket/prefix/filename]
Do not answer from general knowledge. Only use retrieved documents.

The Knowledge Base uses semantic chunking (max 300 tokens, 95% breakpoint threshold) instead of fixed-size chunks. This means chunk boundaries follow natural language structure rather than arbitrary token counts.
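As a rough illustration, the chunking settings above map onto the `vectorIngestionConfiguration` structure that a Knowledge Base data source accepts. The helper name is hypothetical, and `buffer_size=0` is my assumption; the article does not say which buffer size the project uses.

```python
def semantic_chunking_config(max_tokens=300, breakpoint_percentile=95, buffer_size=0):
    """Build a vectorIngestionConfiguration dict for semantic chunking."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                # Upper bound on the size of a single chunk.
                "maxTokens": max_tokens,
                # Percentile of sentence-to-sentence dissimilarity at which to split.
                "breakpointPercentileThreshold": breakpoint_percentile,
                # Neighbouring sentences to include when comparing (assumed value).
                "bufferSize": buffer_size,
            },
        }
    }
```

The resulting dict would be passed as the `vectorIngestionConfiguration` argument when creating the data source.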

Image Generation

The Image Specialist has an action group that calls a Lambda. That Lambda invokes Nova Canvas:

# image_generation_action/handler.py:61-82
response = bedrock_runtime.invoke_model(
    modelId="amazon.nova-canvas-v1:0",
    body=json.dumps({
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {
            "text": prompt,
            "negativeText": "low quality, blurry, distorted, pixelated"
        },
        "imageGenerationConfig": {
            "width": 1024,
            "height": 1024,
            "cfgScale": 7.0
        }
    })
)

The image lands in S3 under generated/ with a 7-day lifecycle policy. The response includes both a presigned URL (1-hour expiry) and base64 inline.
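The storage step could look roughly like the sketch below. It assumes the parsed Nova Canvas reply carries base64 PNG data in an `images` list, and that `s3` behaves like `boto3.client("s3")`. The function name, key format, and return keys are illustrative, not the repository's.

```python
import base64
import uuid


def store_generated_image(s3, bucket, model_payload,
                          prefix="generated/", expires_in=3600):
    """Upload the first generated image and return a presigned URL plus the base64 data."""
    image_b64 = model_payload["images"][0]
    key = f"{prefix}{uuid.uuid4().hex}.png"
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=base64.b64decode(image_b64),
        ContentType="image/png",
    )
    # One-hour link by default, matching the expiry described above.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
    return {"s3_key": key, "presigned_url": url, "image_base64": image_b64}
```

Returning both forms lets a caller render the image inline immediately while still having a link it can hand to other clients.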

The Backend

OpenSearch Serverless

Three policy types required — encryption, network, and data access:

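# opensearch.tf:54-72 (data access for all principals)
# NOTE: collection_name and the *_arn names below are shorthand; valid HCL needs var./local./resource references.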
resource "aws_opensearchserverless_access_policy" "kb_access" {
  name  = "${local.name_prefix}-kb-access"
  type  = "data"
  policy = jsonencode([{
    Rules = [
      { Resource = ["index/${collection_name}/*"], Permission = ["aoss:*"], ResourceType = "index" },
      { Resource = ["collection/${collection_name}"], Permission = ["aoss:*"], ResourceType = "collection" }
    ],
    Principal = [bedrock_kb_role_arn, bedrock_agent_role_arn, local.caller_arn]
  }])
}

Note: The vector index (bedrock-knowledge-base-index) is created automatically when the Knowledge Base is provisioned. Don't try to pre-create it in Terraform — it'll conflict.

The Knowledge Base as provisioned, shown in the console: Titan Text Embeddings v2, semantic chunking, OpenSearch Serverless backend.

S3

Two buckets with different protection strategies:

Bucket	Strategy	Reason
docs	prevent_destroy = true	Source of truth — never delete
images	force_destroy = true	Ephemeral — presigned URLs expire anyway

The docs bucket has versioning enabled. The images bucket has a 7-day lifecycle rule on the generated/ prefix.

Lambda Functions

Function	Model	Memory	Timeout	Purpose
query-rewrite	Claude Haiku	128 MB	60s	Query disambiguation
grader	Claude Haiku	128 MB	60s	Quality scoring 1-5
image-generation-action	Nova Canvas	512 MB	120s	Image generation + S3 upload
orchestrator	—	256 MB	300s	Top-level entry point
invoke-rag-agent	—	128 MB	60s	Supervisor → RAG agent bridge

The 512 MB for image generation is non-negotiable — base64 encoding a 1024×1024 image in Lambda will OOM at 128 MB.

IAM

Three distinct roles, minimal permissions:

Bedrock Agent Role: FM access, S3 read on docs bucket, OpenSearch read, Lambda invoke (action groups), A2I submit
Bedrock KB Role: Titan embeddings invoke, S3 read on docs, OpenSearch write
Lambda Execution Role: Bedrock invoke, S3 write on images bucket, invoke peer Lambdas

No wildcard bedrock:*. Each role gets exactly the model ARNs it needs.
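To illustrate what scoping to specific model ARNs can look like, here is a small sketch that builds such a policy document. The helper is hypothetical and the project defines its policies in Terraform, so treat this as the shape of the JSON rather than the project's code. Foundation-model ARNs have an empty account field.

```python
def invoke_model_policy(region, model_ids):
    """Return an IAM policy document allowing InvokeModel on the listed models only."""
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel"],
            "Resource": resources,
        }],
    }
```

For example, `invoke_model_policy("eu-west-1", ["amazon.nova-canvas-v1:0"])` yields a statement that would cover the image Lambda's model call and nothing else.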

Quality Gate: The Grader

Every response goes through a grader before reaching the user. Claude Haiku evaluates on a 1-5 rubric:

# grader/handler.py:41-70
GRADING_PROMPT = """
Evaluate this AI response on a scale of 1-5.

Query: {query}
Response: {response}

Scoring criteria:
- 5: Accurate, complete, well-cited, directly answers the question
- 4: Accurate and complete, minor gaps
- 3: Mostly accurate, some missing context
- 2: Partially accurate or incomplete — RETRY RECOMMENDED
- 1: Incorrect, hallucinated, or unhelpful — RETRY REQUIRED

Return ONLY valid JSON:
{{"score": <1-5>, "reasoning": "<brief explanation>", "should_retry": <true|false>}}
"""

Retry logic in the orchestrator:

# orchestrator.py:55-73
for attempt in range(MAX_RETRIES):  # MAX_RETRIES = 3, i.e. three attempts in total
    response = invoke_supervisor(rewritten_query, session_id)
    grade = invoke_grader(query, response)

    # Stop as soon as the grader accepts the response
    if not grade["should_retry"]:
        break

    # Out of attempts: continue with the most recent response instead of failing
    if attempt == MAX_RETRIES - 1:
        logger.warning("Max retries reached, proceeding with best response")
        break
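As excerpted, the loop leaves the most recent response in `response` when retries run out. If you want the highest-scoring attempt instead, a variation like the following sketch would do it. The function name and the injected callables are my own assumptions, chosen to keep the example self-contained.

```python
def answer_with_retries(invoke_supervisor, invoke_grader, original_query,
                        rewritten_query, session_id, max_attempts=3):
    """Return the (response, grade) pair with the highest score seen so far."""
    best = None
    for _ in range(max_attempts):
        response = invoke_supervisor(rewritten_query, session_id)
        grade = invoke_grader(original_query, response)
        # Keep the earlier attempt on ties so results stay stable.
        if best is None or grade["score"] > best[1]["score"]:
            best = (response, grade)
        if not grade["should_retry"]:
            break
    return best
```

The trade-off is one extra tuple held in memory in exchange for never returning an attempt that scored lower than an earlier one.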

If JSON parsing fails, the grader falls back to score 3 — good enough to proceed, not good enough to call a success.
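A minimal sketch of that fallback follows. The function name, the fallback reasoning text, and the choice to treat an out-of-range score as a parse failure are assumptions of mine; only the fallback score of 3 comes from the description above.

```python
import json

FALLBACK_GRADE = {
    "score": 3,
    "reasoning": "grader output could not be parsed",
    "should_retry": False,
}


def parse_grade(raw_text):
    """Parse the grader's JSON reply, falling back to a neutral score of 3."""
    try:
        data = json.loads(raw_text)
        score = int(data["score"])
        if not 1 <= score <= 5:
            raise ValueError("score out of range")
        return {
            "score": score,
            "reasoning": str(data.get("reasoning", "")),
            # If the flag is missing, derive it from the rubric (1-2 means retry).
            "should_retry": bool(data.get("should_retry", score <= 2)),
        }
    except (ValueError, KeyError, TypeError):
        # json.JSONDecodeError is a ValueError, so malformed JSON lands here too.
        return dict(FALLBACK_GRADE)
```

Because the fallback sets `should_retry` to false, a malformed grader reply never triggers another supervisor call on its own.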

Human Review: Amazon A2I

Responses with score ≥ 3 are submitted to A2I for async human review:

# orchestrator.py:94-119
a2i_response = a2i_client.start_human_loop(
    HumanLoopName=f"review-{session_id}-{int(time.time())}",
    FlowDefinitionArn=os.environ["A2I_FLOW_DEFINITION_ARN"],
    HumanLoopInput={
        "InputContent": json.dumps({
            "query": original_query,
            "response": agent_response,
            "auto_grade": grade["score"],
            "reasoning": grade["reasoning"]
        })
    },
    DataAttributes={"ContentClassifiers": ["FreeOfPersonallyIdentifiableInformation"]}
)
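One detail worth guarding against: A2I restricts `HumanLoopName` to lowercase letters, digits, and hyphens, with a maximum of 63 characters, so a raw session ID may not be safe to interpolate. The helper below is a hypothetical sketch of normalizing the name before calling `start_human_loop`.

```python
import re
import time


def human_loop_name(session_id, now=None):
    """Build a name of the form review-<slug>-<timestamp> within A2I's limits."""
    timestamp = str(int(now if now is not None else time.time()))
    # Collapse anything outside [a-z0-9] into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", session_id.lower()).strip("-")
    # Leave room for the "review-" prefix, the timestamp, and one joining hyphen.
    max_slug = 63 - len("review-") - len(timestamp) - 1
    slug = slug[:max_slug].strip("-") or "session"
    return f"review-{slug}-{timestamp}"
```

For instance, `human_loop_name("User_42/ABC", now=1700000000)` gives `review-user-42-abc-1700000000`.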

The API returns HTTP 202 immediately. The human review happens asynchronously. The HumanLoopArn is returned in the response metadata so callers can poll status if needed.
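For callers that do want to poll, a sketch along these lines would work. It assumes `a2i_client` behaves like `boto3.client("sagemaker-a2i-runtime")`, whose `describe_human_loop` call takes the loop name (not the ARN) and reports a `HumanLoopStatus`. The function name and polling defaults are illustrative.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_human_loop(a2i_client, loop_name, poll_seconds=30,
                        max_polls=20, sleep=time.sleep):
    """Poll until the human loop reaches a terminal state or polls run out.

    Returns the last status seen, which may still be non-terminal.
    """
    status = None
    for _ in range(max_polls):
        reply = a2i_client.describe_human_loop(HumanLoopName=loop_name)
        status = reply["HumanLoopStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    return status
```

Human review can take hours, so in practice an event-driven notification is likely a better fit than a long-running poller; this is only meant to show the status check itself.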

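The A2I Flow Definition is created manually in the AWS Console and is not part of this project's Terraform code (the AWS provider does ship an aws_sagemaker_flow_definition resource if you want to automate it). The ARN goes into .env after creation.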

It Actually Works

In testing, the RAG Specialist answers directly from the Knowledge Base and cites the source document path.

Image generation tested directly in the Bedrock console — the agent returns a presigned S3 URL and the orchestration trace shows 2 steps.

Cost Analysis

Component	Pricing Model	Estimated Monthly
OpenSearch Serverless	2 OCU minimum	~$350
Claude 3.5 Sonnet v2 (Supervisor)	Per token	~$20–60
Claude 3 Sonnet (Sub-agents)	Per token	~$10–30
Claude Haiku (Rewrite + Grader)	Per token	~$2–5
Nova Canvas	$0.01–$0.08/image	Depends on usage
Lambda	Per invocation	~$0 (free tier)
API Gateway	Per request	~$1
S3	Per GB + requests	~$2
Total at low volume		~$390–450/month

OpenSearch Serverless dominates. If cost is a blocker, consider switching to Aurora Serverless v2 with pgvector — it scales to zero when idle. The trade-off is manual embedding management.
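The arithmetic behind the table can be checked in a few lines. The $0.24 per OCU-hour rate is my assumption of a typical list price, so check current pricing for your region. The line items are the low and high estimates from the table, with Nova Canvas left out because it depends entirely on usage.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def opensearch_floor_usd(ocu_count=2, usd_per_ocu_hour=0.24):
    """Monthly cost of the minimum OCU allocation, billed even when idle."""
    return ocu_count * usd_per_ocu_hour * HOURS_PER_MONTH


# (low, high) monthly estimates in USD from the cost table.
LINE_ITEMS = {
    "OpenSearch Serverless": (350, 350),
    "Claude 3.5 Sonnet v2": (20, 60),
    "Claude 3 Sonnet": (10, 30),
    "Claude Haiku": (2, 5),
    "Lambda": (0, 0),
    "API Gateway": (1, 1),
    "S3": (2, 2),
}


def monthly_range(items):
    """Sum the low and high estimates across all line items."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high
```

Under these inputs the fixed OpenSearch floor comes to about $350 and the summed range is $385 to $448, which rounds to the ~$390–450 quoted above. The vector store accounts for roughly 80 to 90 percent of the bill at low volume.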

The model tier strategy is intentional:

Haiku for rewrite and grading: cheap, fast, narrowly scoped tasks
Sonnet for specialists: balanced quality vs. cost
Claude 3.5 Sonnet v2 only at the top level: one reasoning step, highest quality where it counts

Deployment

Infrastructure first, Bedrock resources second. Order matters because the Python scripts need Terraform outputs.

Step 1 — Infrastructure

terraform init
terraform apply
# Outputs: bucket names, OpenSearch collection ARN, Lambda ARNs, role ARNs
# Copy to .env

Step 2 — Knowledge Base

python 01_create_knowledge_base.py
# Creates KB with Titan Embed v2
# Runs initial ingestion from S3
# Appends KB_ID to .env

Step 3 — Sub-agents

python 02_create_subagents.py
# Creates rag-specialist-agent + alias
# Creates image-generation-agent + alias
# Appends alias ARNs to .env

Step 4 — Supervisor

python 03_create_supervisor.py
# Creates multi-agent-supervisor-v2
# Associates collaborators
# Appends supervisor alias ARN to .env

Step 5 — A2I Flow

Create manually in the AWS Console under Augmented AI. Add the ARN to .env.

Step 6 — Test

python 04_test_invoke.py --query "What equipment does the company provide for remote work?"

Expected output:

{
  "response": "Based on the company's remote work policy, employees receive: ...",
  "grade": {"score": 5, "should_retry": false},
  "human_loop_arn": "arn:aws:sagemaker:eu-west-1:...",
  "trace_events": ["rag-specialist-agent invoked"]
}

What's Next

Streaming responses — the current architecture is synchronous; adding WebSocket support via API Gateway v2 is the natural next step
Multi-modal RAG — extend the Knowledge Base to index images alongside documents
Agent memory — persistent conversation state across sessions using DynamoDB
Custom A2I task UI — the default review interface works, but a custom template with side-by-side query/response layout improves reviewer throughput
Cost dashboard — tag-based Cost Explorer views using the Project=multi-agent-bedrock tag are already in place; a proper FinOps report is the logical next step

Try It

Deployment follows the same order every time. The only manual step is the A2I Flow Definition; everything else is scripted.

terraform apply → KB → sub-agents → supervisor → A2I (manual) → test

The .env.example file documents every required variable. Terraform outputs populate most of them automatically. The Python scripts append generated IDs after each step.
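Since each step appends to .env, a quick pre-flight check before testing can catch a skipped step. The sketch below is hypothetical and not part of the repository: it reads variable names from the text of .env.example and reports which ones are still empty in a given environment mapping.

```python
def missing_env_keys(example_text, env):
    """Return the keys listed in .env.example that are unset or empty in `env`."""
    required = []
    for line in example_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        required.append(line.split("=", 1)[0].strip())
    return [key for key in required if not env.get(key)]
```

Calling it with the contents of .env.example and `os.environ` (after loading .env) returns an empty list only when every documented variable has a value.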

If you want to test RAG without standing up the full pipeline, the Knowledge Base can be queried directly via the Bedrock console — the test interface in the AWS Console works out of the box once ingestion completes.

Dr. Roman Čerešňák