Autonomous Multi-Agent Orchestration

Three AI agents. Shared persistent memory. Nine hours of autonomous operation. Context never exceeded 9K tokens. Speed never degraded. Zero errors.

pip install agentazall · Download Results (743 KB)
Three ravens — The Architect, The Developer, The Reviewer — connected by recall, send, and remember
199 rounds, 8 hours 46 minutes continuous
402 memories stored, shared across 3 agents
52 Python files written, 2,543 lines of real migration code
$0 cloud API cost, all inference on local GPUs

Every multi-agent framework uses the context window as long-term memory. That works for a 10-minute demo. It does not work for a 9-hour autonomous run.

We watched a sliding-window orchestrator collapse: by round 25, the Architect's speed dropped from 96 tokens/second to 2. By round 30, agents were contradicting their own decisions from 20 minutes earlier — because those decisions had been evicted from context.

The fix was simple: stop putting conversation history in the context window. Only the last round goes in. Everything else is a tool call to recall(). The context stays at 3–9K tokens forever. Speed stays constant. Knowledge grows unbounded.

This is what AgentAZAll was built for.

Memory-First Architecture

Three design decisions that make 9-hour runs possible

1. Context Stays Small

Each agent sees only three things: its system prompt (~800 tokens), the current phase instruction (~200 tokens), and what every agent said in the previous round (~2–6K tokens).

No conversation history. No sliding window. No memory injection. The context is 3–9K tokens on every turn — whether it's round 1 or round 199.

Context-first collapsed at 2 tok/s by round 25. Memory-first sustained 100+ tok/s through round 199.

# What each agent sees every turn:
[System prompt — role, tools]     ~800 tok
[Phase instruction]               ~200 tok
[Last round's messages]         ~2-6K tok
                              ────────────
                        Total:  3-9K tok  ← always
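The per-turn assembly above can be sketched in a few lines. This is an illustrative reconstruction, not AgentAZAll's internal API; the class and field names are assumptions. The key property is that the prompt is rebuilt from scratch every round, so it never grows with history:

```python
class Agent:
    """Minimal stand-in for an orchestrated agent (illustrative)."""
    def __init__(self, name, system_prompt):
        self.name = name
        self.system_prompt = system_prompt

def build_context(agent, phase_instruction, last_round_messages):
    """Assemble one turn's full prompt: system prompt (~800 tok),
    phase instruction (~200 tok), last round only (~2-6K tok)."""
    messages = [{"role": "system", "content": agent.system_prompt}]
    messages.append({"role": "user", "content": phase_instruction})
    for msg in last_round_messages:  # previous round, nothing older
        messages.append({"role": "user",
                         "content": f"[{msg['agent']}] {msg['text']}"})
    return messages
```

Because older rounds are simply never appended, round 199's prompt is built the same way as round 1's.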

2. Memory Grows Forever

Agents call recall() and remember() as tools — the LLM decides when to store and retrieve, not the orchestrator.

Memories are plain text files on the filesystem. No database. No vector store. No embeddings. You can cat, grep, and git commit them.
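A filesystem-backed remember/recall pair can be this small. The sketch below assumes the agents/&lt;name&gt;/remember/&lt;title&gt;.txt layout shown in the listing below and uses plain substring matching; the actual tool implementations may differ:

```python
from pathlib import Path

def remember(agent: str, title: str, text: str, root: str = "agents") -> Path:
    """Store a memory as a plain text file you can cat, grep, and commit."""
    path = Path(root) / agent / "remember" / f"{title}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path

def recall(agent: str, query: str, root: str = "agents") -> list[str]:
    """Substring search over titles and bodies -- no database, no embeddings."""
    hits = []
    for f in sorted((Path(root) / agent / "remember").glob("*.txt")):
        body = f.read_text()
        if query.lower() in body.lower() or query.lower() in f.stem.lower():
            hits.append(body)
    return hits
```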

402 memories stored over 199 rounds. 77 by the Architect, 93 by the Developer, 154 by the Reviewer. The distribution emerged naturally from their roles.

agents/
  architect/remember/
    db-choice.txt
    phase2_decisions.txt
    auth-decisions.txt
    ...                    77 files
  developer/remember/
    cosgn00c-overview.txt
    customer-model.txt
    ...                    93 files
  reviewer/remember/
    field-mappings.txt
    cics-transaction.txt
    ...                   154 files

3. Roles Enforce Discipline

Only the Developer can write files. The Architect designs. The Reviewer validates. Tool access is role-gated — calling write_file as the Architect returns an error, not a result.

This prevents the chaos of multiple agents overwriting each other's work. The Developer made 153 write_file calls to produce 52 unique files — that's iteration, exactly what a human developer does.

# Role-based tool access

Architect:
  recall, remember, read_file, list_files

Developer:
  recall, remember, read_file, list_files,
  write_file, run_python  ← exclusive

Reviewer:
  recall, remember, read_file, list_files
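Role gating reduces to an allow-list check at dispatch time. The names below are assumptions for illustration, not AgentAZAll's internals; the point is that a disallowed tool call returns an error string the model can see, rather than executing:

```python
COMMON = {"recall", "remember", "read_file", "list_files"}

TOOLS_BY_ROLE = {
    "architect": COMMON,
    "developer": COMMON | {"write_file", "run_python"},  # exclusive
    "reviewer":  COMMON,
}

def dispatch(role: str, tool: str, handler, *args):
    """Run a tool call only if the role's allow-list permits it."""
    if tool not in TOOLS_BY_ROLE[role]:
        # The model reads this error and must route the work elsewhere.
        return f"error: tool '{tool}' not available to role '{role}'"
    return handler(*args)
```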

The Experiment

COBOL-to-Python migration of AWS CardDemo

Three identical NVIDIA Nemotron-3-Nano-30B-A3B models (30B total, 3B active per step). Same weights, different role prompts. Running on an AMD EPYC server with 8 GPUs (242 GB VRAM).

Source material: AWS CardDemo — a real open-source COBOL/CICS credit card system. 29 programs, 29 copybooks, ~50,000 lines. EXEC CICS READ, PIC 9(09) COMP-3, pseudo-conversational patterns. The real thing.

Model            Nemotron-3-Nano-30B-A3B Q8 × 3
Active params    3B per inference (MoE)
GPUs used        6 of 8 (~120 GB VRAM)
Inference        llama.cpp, CUDA 12.6, flash attention
Context config   65,536 tokens per instance
Context used     2–9K (memory-first)
Phases           7 migration phases + open discussion
Completion tok   3,510,906
Prompt tok       20,788,680
Tool calls       1,599
Errors           0

Speed: Start vs. End

The whole point of memory-first — speed doesn't degrade

Metric                 Context-First (v2)   Memory-First (v3)
Context at round 10    ~15K tokens          ~4K tokens
Context at round 50    ~60K+ (overflow)     ~5K tokens
Context at round 199   impossible           ~6K tokens
Speed at round 10      95 tok/s             98 tok/s
Speed at round 50      2 tok/s              104 tok/s
Speed at round 199     —                    101 tok/s

What They Built

The agents produced a real project structure: SQLAlchemy models from COBOL copybooks, FastAPI routes from CICS programs, Alembic migrations, Pydantic schemas, batch processors, and their own test suite.

The customer.py model includes a from_cobol_record() classmethod that maps every COBOL field to its Python equivalent — preserving leading zeros on PIC 9(09), parsing YYYYMMDD to date, and converting Y/N indicators to bool.
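A sketch of what such a classmethod can look like for a fixed-width COBOL record. The offsets, field names, and widths here are invented for illustration and do not reflect CardDemo's actual copybook layout:

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass
class Customer:
    cust_id: str   # PIC 9(09): keep as str to preserve leading zeros
    dob: date      # YYYYMMDD in the fixed-width record
    active: bool   # Y/N indicator

    @classmethod
    def from_cobol_record(cls, record: str) -> "Customer":
        """Map one fixed-width COBOL record to a Python object."""
        return cls(
            cust_id=record[0:9],                                   # zero-padded id
            dob=datetime.strptime(record[9:17], "%Y%m%d").date(),  # YYYYMMDD -> date
            active=record[17] == "Y",                              # Y/N -> bool
        )
```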

The acpt_persistence.py (9.8 KB) mirrors CICS READ/WRITE/REWRITE semantics, including the ws_change_has_occurred flag that gates updates — a direct translation of COBOL working-storage patterns.

Without being told to, the agents also produced reconcile_outputs.py (12 KB) — a script that compares COBOL batch output against Python batch output field-by-field. This is exactly what a real migration cutover requires.

output/
  main.py
  scheduler.py
  batch_record_builder.py
  models/
    customer.py, account.py
    user_security.py, transaction_detail.py
    billing_statement.py, audit_log.py
  services/
    account_update.py
    account_update_validator.py
    transaction_manager.py
    pseudo_conversational_validator.py
  repositories/
    account_repo.py
    transaction_detail_repo.py
    acct/acpt_persistence.py  ← 9.8 KB
  tests/
    test_batch_performance.py
    test_batch_layout.py
    test_batch_rollback.py
    test_scheduler_cron.py
    test_decimal_precision.py
  app/routes/auth.py
  app/schemas/auth.py
  alembic/versions/  ← 3 migrations
  scripts/
    reconcile_outputs.py  ← 12 KB
  ... 52 files total
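The core of a field-by-field reconciliation fits in a short function. This is a minimal sketch in the spirit of reconcile_outputs.py, not its actual contents; the generated script is larger and reads real batch files:

```python
def reconcile(cobol_rows, python_rows, key="cust_id"):
    """Compare two batch outputs row-by-row and field-by-field.

    Returns a list of (key, field, cobol_value, python_value) mismatches;
    an empty list means the outputs agree."""
    diffs = []
    py_by_key = {r[key]: r for r in python_rows}
    for c in cobol_rows:
        p = py_by_key.get(c[key])
        if p is None:
            diffs.append((c[key], "<missing>", c, None))
            continue
        for field, value in c.items():
            if p.get(field) != value:
                diffs.append((c[key], field, value, p.get(field)))
    return diffs
```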

20 Lines to Get Started

Define agents, set a task, run. Memory-first by default.

# pip install agentazall
from agentazall import Agent, Orchestrator

endpoint = "http://localhost:8080/v1/chat/completions"

architect = Agent("architect",
    role="Design the solution. Be specific about file structure and APIs.",
    endpoint=endpoint)

developer = Agent("developer",
    role="Write the code. Follow the Architect's design exactly.",
    endpoint=endpoint, can_write=True)

reviewer  = Agent("reviewer",
    role="Review for bugs, security issues, and design violations.",
    endpoint=endpoint)

orch = Orchestrator(agents=[architect, developer, reviewer])
orch.set_task("Build a FastAPI REST API for a todo app with SQLite")
orch.run(max_rounds=30)

print(f"Files: {orch.stats.files_written}")
print(f"Memories: {orch.stats.memories_stored}")

Works with any OpenAI-compatible endpoint: llama.cpp, vLLM, Ollama, LM Studio, OpenRouter. No openai SDK required — pure stdlib urllib.
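A pure-stdlib chat call against any OpenAI-compatible endpoint can look like this. Sketch only, assuming the standard /v1/chat/completions request and response shape; AgentAZAll's internal client may differ:

```python
import json
import urllib.request

def chat(endpoint: str, messages: list[dict], model: str = "local") -> str:
    """POST a chat request to an OpenAI-compatible server, no SDK needed."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    req = urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```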

Per-Agent Breakdown

From a single uninterrupted 8h 46m run on 2026-03-17/18

Agent       Tokens      Avg Speed   Tool Calls   Memories
Architect   1,048,820    97 tok/s          379         77
Developer   1,267,555   137 tok/s          606         93
Reviewer    1,194,531   104 tok/s          614        154

The Context Window Is Not Memory

Install

pip install agentazall

# Store a memory
agentazall remember --text "PostgreSQL chosen" --title "db-choice"

# Recall it tomorrow
agentazall recall --query "database"

Zero dependencies. Works offline. Memories are text files you own.

Verify

# Download the 9-hour run results
curl -O https://agentazall.ai/experiments/\
carddemo-cobol-migration/\
carddemo-agentazall-results.zip

# 52 Python files, 402 memories, full log
unzip carddemo-agentazall-results.zip

Every number on this page comes from a single uninterrupted run. Verify it yourself.