AutoEvals: Building a Multi-Agent System for Automated AI Agent Evaluation

As AI agents become more sophisticated, evaluating their capabilities becomes increasingly challenging. How do you systematically test an agent that can write code, search the web, manage files, and make complex decisions? This is the problem AutoEvals solves - it's an AI-driven deep research system that automatically analyzes agent codebases and generates comprehensive evaluation suites.

In this post, we'll dive deep into how AutoEvals works, exploring its innovative skills-based architecture, deep agents framework, and the orchestration system that brings everything together.

The Problem: Evaluating AI Agents at Scale

Traditional software testing approaches fall short when applied to AI agents. Agents exhibit emergent behaviors, make multi-step decisions, and interact with external tools in ways that are difficult to predict. Manual evaluation is time-consuming and doesn't scale.

AutoEvals addresses this by using AI to analyze AI - deploying a team of specialized agents that:

Analyze the target agent's architecture and code structure
Identify behavioral patterns and capabilities
Design 20-50 rigorous evaluation test cases
Generate appropriate graders (code-based, model-based, or human)

Architecture Overview

Before diving into the components, let's visualize how everything fits together:

Quick Start Guide

Prerequisites

Python 3.11+
uv package manager
Either Anthropic API key or AWS credentials (see options below)

Step 1: Install

git clone <repository-url>
cd AutoEvals
uv sync

Step 2: Choose Your Model Provider

Option A: Anthropic API (Default)

export ANTHROPIC_API_KEY=your-api-key

# Run analysis
uv run python -m auto_evals --target /path/to/agent

Option B: AWS Bedrock (Auto-configured)

# Configure AWS credentials (if not already configured)
export AWS_REGION=us-east-1
aws configure

# Run with --aws-sandbox flag (automatically uses Bedrock model)
uv run python -m auto_evals --target /path/to/agent --aws-sandbox

# AWS Bedrock model is auto-selected: bedrock:us.anthropic.claude-sonnet-4-5-20250929-v1:0

Step 3: Run Analysis

Analyze a local codebase:

uv run python -m auto_evals --target /path/to/your/agent

Analyze a GitHub repository:

uv run python -m auto_evals --target https://github.com/owner/repo

# For private repos, set:
export GITHUB_TOKEN=your-github-token

Use AWS Bedrock Code Interpreter:

# Automatically uses Bedrock model + AWS Code Interpreter sandbox
uv run python -m auto_evals --target https://github.com/owner/repo --aws-sandbox

# Optional: Specify Code Interpreter role
export CODE_INTERPRETER_ROLE_ARN=arn:aws:iam::123456789:role/CodeInterpreterRole

Custom configuration:

# Specify output file
uv run python -m auto_evals --target /path/to/agent --output my_evals.json

# Enable debug logging
uv run python -m auto_evals --target /path/to/agent --debug

# Explicitly set model (overrides auto-selection)
uv run python -m auto_evals --target /path/to/agent --model anthropic:claude-opus-4-5-20251101

Step 4: Review Results

The tool generates eval_suite.json containing:

Findings: Research insights organized by category and importance
Eval Cases: 20-50 evaluation test cases with graders
Metadata: Runtime, model used, configuration

View statistics:

# Count eval cases
cat eval_suite.json | jq '.eval_suite.eval_cases | length'

# View findings by importance
cat eval_suite.json | jq '.findings[] | select(.importance=="critical") | .title'

# List eval case categories
cat eval_suite.json | jq '.eval_suite.eval_cases[] | .category' | sort | uniq -c

Example output structure:

{
  "target_path": "https://github.com/owner/repo",
  "generated_at": "2026-02-07T10:30:00Z",
  "findings": [
    {
      "category": "architecture",
      "title": "Agent uses LangGraph with ReAct pattern",
      "importance": "high",
      "eval_relevance": "Should test multi-step reasoning and tool usage"
    }
  ],
  "eval_suite": {
    "name": "Eval Suite for owner/repo",
    "eval_cases": [
      {
        "task_id": "capability_001",
        "name": "Multi-step file search",
        "category": "capability",
        "input_prompt": "Find all Python files that import 'langgraph'",
        "expected_behavior": "Agent should search recursively and return file list",
        "graders": [...]
      }
    ]
  },
  "metadata": {
    "runtime_seconds": 45.2,
    "model": "bedrock:us.anthropic.claude-sonnet-4-5-20250929-v1:0"
  }
}

Deep Agents: More Than Just LLM Wrappers

The core innovation in AutoEvals is the Deep Agents framework. Unlike simple LLM wrappers that make a single call, each agent in AutoEvals is a full autonomous system capable of multi-step reasoning and tool usage.

What Makes an Agent "Deep"?

A deep agent has:

Persistent State: Using LangGraph's AgentState with message accumulation
Tool Binding: The LLM can call tools and receive results
Iterative Execution: A workflow loop that continues until the task is complete
Planning Capability: The ability to break down complex tasks

Here's the core workflow:

The Agent State

Each agent maintains state using LangGraph's TypedDict:

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]

The add_messages reducer ensures that tool results and agent responses accumulate properly across iterations.

Creating Deep Agents

The create_deep_agent() function is the heart of the system:

def create_deep_agent(
    model: str,
    tools: list,
    system_prompt: str,
    subagents: dict = None,
    skills: list = None,
    name: str = "agent"
) -> CompiledGraph:
    # Parse model string (e.g., "anthropic:claude-sonnet-4-5-20250929")
    # Extract tools from skills
    # Combine skill contexts with system prompt
    # Create LangGraph workflow
    # Return compiled agent

The key insight is that subagents are themselves deep agents. When the orchestrator creates subagents, each one gets its own full LangGraph workflow with tool access and multi-turn capability.

Skills: The Cognitive Load Solution

One of the biggest challenges with tool-using agents is cognitive overload. Give an agent 20+ tools, and it struggles to pick the right one.

AutoEvals solves this with the Skills abstraction:

What is a Skill?

A Skill is a Pydantic model that encapsulates:

class Skill(BaseModel):
    name: str                    # Skill identifier
    description: str             # High-level capability description
    tools: list[Callable]        # Underlying tool functions
    context: str = ""            # Detailed usage instructions

    def get_tools(self) -> list:
        """Returns tools for agent binding"""

    def get_context(self) -> str:
        """Returns formatted context for system prompt"""

Benefits of Skills

Reduced Cognitive Load: Agents see 4 skills instead of 20 tools
Progressive Disclosure: Detailed instructions only load when needed
Token Efficiency: Context added only for used skills
Composability: New skills added without cluttering agent context

The Four Core Skills

1. Code Analysis Skill

Provides tools for understanding codebases:

list_files: Directory traversal with filtering
read_file: Content access with size limits
analyze_code_structure: AST parsing for Python files
search_code: Regex pattern matching
find_patterns: Detect common patterns (error handling, async, state management)

2. GitHub Analysis Skill

Remote repository analysis:

get_repository_info: Fetch metadata via GitHub API
get_file_from_github: Raw content access
clone_repository: Local cloning with GitPython
list_repository_files: API-based file listing

3. Sandbox Execution Skill

Safe code execution for analysis:

run_python_code: Execute Python with AST validation
run_code_analysis: Automated structure/pattern analysis
generate_grader_code: Create evaluation graders

4. Dynamic Agent Spawner Skill

On-demand agent creation:

spawn_agent: Create specialized agents for unique analysis needs

Folder-Based Skills

Beyond programmatic skills, AutoEvals supports loading skills from markdown files:

skills/
  code-analysis/
    SKILL.md           # Contains YAML frontmatter + markdown context
  pattern-detection/
    SKILL.md
  eval-design/
    SKILL.md

Each SKILL.md contains:

YAML frontmatter: name, description, triggers
Markdown content: Detailed instructions and examples

This makes it easy to extend the system without writing Python code.

The Orchestrator: Coordinating the Team

The orchestrator is the conductor of this multi-agent symphony. It's a deep agent with four subagents, each specialized for a different aspect of analysis.

Orchestrator Workflow

The Four Subagents

Architecture Analyst

Focuses on project structure:

Entry points and main modules
Dependency analysis
Configuration patterns
Module organization

Pattern Analyst

Identifies code patterns:

Error handling strategies
Async/await usage
State management approaches
Tool definitions and LLM calls

Behavior Analyst

Understands capabilities:

Input/output formats
Decision-making logic
External integrations
Edge case handling

Eval Designer

Creates the evaluation suite:

Capability tests (what the agent should do)
Regression tests (things that shouldn't break)
Edge case tests (boundary conditions)

Task Delegation

The orchestrator uses a special task() tool to delegate work:

@tool
def task(name: str, task_description: str) -> str:
    """Delegate a task to a subagent."""
    subagent = subagents[name]
    result = subagent.invoke({
        "messages": [HumanMessage(content=task_description)]
    })
    return result["messages"][-1].content

This creates true hierarchical agent coordination - the orchestrator plans, delegates, and synthesizes, while subagents execute specialized analysis.

Tools: The Foundation Layer

At the bottom of the stack are the actual tools that interact with code and files.

Code Tools

Built with LangChain's @tool decorator:

@tool
def analyze_code_structure(file_path: str) -> CodeStructure:
    """Analyze Python file structure using AST parsing."""
    # Returns: classes, functions, imports, decorators, complexity

Key outputs are Pydantic models for type safety:

FileInfo: Path, name, extension, size
CodeStructure: Classes, functions, imports, complexity
PatternMatch: Pattern name, location, code snippet, confidence

Sandbox Execution

The sandbox provides safe code execution:

Local Sandbox (CodeExecutor):

AST-based validation blocks dangerous operations
Allowlist of safe modules (json, re, pathlib, etc.)
Restricted execution context

AWS Bedrock Sandbox:

Managed sandboxing via AWS API
Used when --aws-sandbox flag is set
Better isolation for production use

Data Flow: End to End

Let's trace a complete execution:

flowchart TB
    subgraph Input
        cli["CLI: uv run python -m auto_evals --target /path"]
    end

    subgraph Setup
        parse["Parse Arguments"]
        model["Select Model<br/>(Anthropic or Bedrock)"]
        skills["Build Skills"]
        agents["Create Subagents"]
        orch["Create Orchestrator"]
    end

    subgraph Execution
        invoke["orchestrator.invoke()"]
        delegate["Delegate to Subagents"]
        tools["Execute Tools"]
        synthesize["Synthesize Findings"]
        design["Design Eval Cases"]
    end

    subgraph Output
        parse_out["Parse JSON Response"]
        validate["Pydantic Validation"]
        save["Save eval_suite.json"]
    end

    cli --> parse --> model --> skills --> agents --> orch
    orch --> invoke --> delegate --> tools --> synthesize --> design
    design --> parse_out --> validate --> save

The Evaluation Suite Output

The final output is a comprehensive evaluation suite:

{
  "name": "Agent Evaluation Suite",
  "target_agent": "/path/to/agent",
  "eval_cases": [
    {
      "task_id": "cap-001",
      "name": "File Creation Capability",
      "category": "capability",
      "input_prompt": "Create a new file called test.txt with 'Hello World'",
      "expected_behavior": "Agent creates file with correct content",
      "graders": [
        {
          "grader_type": "CODE_BASED",
          "implementation": "assert os.path.exists('test.txt')"
        }
      ],
      "metric": "BINARY",
      "num_trials": 5
    }
  ]
}

Each eval case includes:

Category: capability, regression, or edge_case
Graders: CODE_BASED (deterministic), MODEL_BASED (LLM judgment), or HUMAN
Metrics: pass@k, pass^k, partial_credit, binary
Trial count: For statistical validity

Key Architectural Innovations

1. Skills as First-Class Citizens

Rather than drowning agents in tools, skills provide semantic groupings with contextual knowledge. This dramatically improves tool selection accuracy.

2. Deep Agents at Every Level

Subagents aren't simple functions - they're full agents capable of multi-step reasoning. This enables complex analysis that couldn't be done in a single LLM call.

3. Dynamic Agent Spawning

The orchestrator can create specialized agents on-demand, adapting to unique codebase characteristics without predefined subagent explosion.

4. Provider Abstraction

The provider:model format allows seamless switching between Anthropic API and AWS Bedrock, supporting different deployment scenarios.

5. Hierarchical Coordination

True task delegation where the orchestrator plans and synthesizes while specialists execute - mirroring how human teams work.

6. Smart Context Management for Large Repositories

One of the biggest challenges in agent-based code analysis is context overflow - when analyzing repositories with thousands of files (especially those containing node_modules or other large dependency directories), the raw file list alone can exceed the model's context window.

AutoEvals solves this with a multi-layered approach:

Automatic Directory Exclusion

The list_files tool automatically excludes common large directories that don't contain relevant source code:

EXCLUDED_DIRECTORIES = {
    "node_modules",    # JavaScript dependencies
    ".git",            # Git internals
    "__pycache__",     # Python bytecode
    ".venv", "venv",   # Virtual environments
    "dist", "build",   # Build outputs
    ".next", ".nuxt",  # Framework caches
    "coverage",        # Test coverage reports
    "target",          # Rust/Java builds
    # ... and more
}

This prevents agents from getting overwhelmed by directories that can contain tens of thousands of files.

File Limits with Smart Defaults

The list_files tool enforces a default limit of 500 files per query. When this limit is reached, the agent receives a warning suggesting to use extension filtering or additional exclusions:

@tool
def list_files(
    directory: str,
    extensions: Optional[list[str]] = None,  # Filter by file type
    max_files: Optional[int] = None,          # Default: 500
    exclude_dirs: Optional[list[str]] = None, # Additional exclusions
) -> list[FileInfo]:

Prompt-Guided Progressive Exploration

The subagent prompts are designed to guide agents toward efficient exploration strategies:

Progressive Discovery: Start with top-level structure, then drill down
- First: List files with extensions=[".py"] to focus on source code
- Then: Read README.md and key entry points
- Finally: Explore specific modules of interest
Code Execution for Heavy Analysis: For large repositories, agents write Python scripts that:
- Analyze the codebase structure
- Filter and summarize results
- Return only relevant information
Targeted Reading: Read specific files rather than listing all:
- Start with: README.md, main.py, pyproject.toml
- Use search_code() to find specific patterns
- Use find_patterns() for pattern detection

Example exploration code that agents can execute in the sandbox:

from pathlib import Path
from collections import defaultdict

def summarize_project(path):
    """Efficient project summarization for large repos."""
    py_files = []
    for f in Path(path).rglob("*.py"):
        # Skip excluded directories
        if any(excl in f.parts for excl in
               ["node_modules", "__pycache__", ".venv"]):
            continue
        py_files.append(f)

    return {
        "total_py_files": len(py_files),
        "top_level_dirs": [d.name for d in Path(path).iterdir()
                          if d.is_dir() and not d.name.startswith(".")],
        "key_files": [str(f.relative_to(path)) for f in py_files[:20]]
    }

This approach provides a 98%+ reduction in context usage compared to naive file listing, enabling analysis of repositories with 10,000+ files without hitting context limits.

Running AutoEvals

Getting started is straightforward:

# Install dependencies
uv sync

# Analyze a local agent
uv run python -m auto_evals --target /path/to/your/agent

# Analyze a GitHub repository
uv run python -m auto_evals --target https://github.com/user/agent-repo

# Use AWS Bedrock sandbox
uv run python -m auto_evals --target /path/to/agent --aws-sandbox

Conclusion

AutoEvals demonstrates how multi-agent systems can tackle complex analysis tasks that would overwhelm a single agent. By combining:

Skills-based architecture for cognitive efficiency
Deep agents for multi-step reasoning
Hierarchical orchestration for coordination
Specialized subagents for focused analysis

The system can automatically generate comprehensive evaluation suites for AI agents of any complexity.

The key insight is that the same capabilities that make AI agents powerful - tool use, planning, and multi-step execution - can be turned inward to analyze and evaluate other agents. It's AI helping us understand AI.

AutoEvals is built with LangGraph, Pydantic, and supports both Anthropic API and AWS Bedrock for model inference.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
blog		blog
docs		docs
img		img
src/auto_evals		src/auto_evals
tests		tests
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

AutoEvals: Building a Multi-Agent System for Automated AI Agent Evaluation

The Problem: Evaluating AI Agents at Scale

Architecture Overview

Quick Start Guide

Prerequisites

Step 1: Install

Step 2: Choose Your Model Provider

Step 3: Run Analysis

Step 4: Review Results

Deep Agents: More Than Just LLM Wrappers

What Makes an Agent "Deep"?

The Agent State

Creating Deep Agents

Skills: The Cognitive Load Solution

What is a Skill?

Benefits of Skills

The Four Core Skills

1. Code Analysis Skill

2. GitHub Analysis Skill

3. Sandbox Execution Skill

4. Dynamic Agent Spawner Skill

Folder-Based Skills

The Orchestrator: Coordinating the Team

Orchestrator Workflow

The Four Subagents

Architecture Analyst

Pattern Analyst

Behavior Analyst

Eval Designer

Task Delegation

Tools: The Foundation Layer

Code Tools

Sandbox Execution

Data Flow: End to End

The Evaluation Suite Output

Key Architectural Innovations

1. Skills as First-Class Citizens

2. Deep Agents at Every Level

3. Dynamic Agent Spawning

4. Provider Abstraction

5. Hierarchical Coordination

6. Smart Context Management for Large Repositories

Automatic Directory Exclusion

File Limits with Smart Defaults

Prompt-Guided Progressive Exploration

Running AutoEvals

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages