Skip to content

madhurprash/AutoEvals

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AutoEvals: Building a Multi-Agent System for Automated AI Agent Evaluation

As AI agents become more sophisticated, evaluating their capabilities becomes increasingly challenging. How do you systematically test an agent that can write code, search the web, manage files, and make complex decisions? This is the problem AutoEvals solves - it's an AI-driven deep research system that automatically analyzes agent codebases and generates comprehensive evaluation suites.

In this post, we'll dive deep into how AutoEvals works, exploring its innovative skills-based architecture, deep agents framework, and the orchestration system that brings everything together.

website

The Problem: Evaluating AI Agents at Scale

Traditional software testing approaches fall short when applied to AI agents. Agents exhibit emergent behaviors, make multi-step decisions, and interact with external tools in ways that are difficult to predict. Manual evaluation is time-consuming and doesn't scale.

AutoEvals addresses this by using AI to analyze AI - deploying a team of specialized agents that:

  1. Analyze the target agent's architecture and code structure
  2. Identify behavioral patterns and capabilities
  3. Design 20-50 rigorous evaluation test cases
  4. Generate appropriate graders (code-based, model-based, or human)

Architecture Overview

Before diving into the components, let's visualize how everything fits together:

arch

Quick Start Guide

Prerequisites

  • Python 3.11+
  • uv package manager
  • Either Anthropic API key or AWS credentials (see options below)

Step 1: Install

git clone <repository-url>
cd AutoEvals
uv sync

Step 2: Choose Your Model Provider

Option A: Anthropic API (Default)

export ANTHROPIC_API_KEY=your-api-key

# Run analysis
uv run python -m auto_evals --target /path/to/agent

Option B: AWS Bedrock (Auto-configured)

# Configure AWS credentials (if not already configured)
export AWS_REGION=us-east-1
aws configure

# Run with --aws-sandbox flag (automatically uses Bedrock model)
uv run python -m auto_evals --target /path/to/agent --aws-sandbox

# AWS Bedrock model is auto-selected: bedrock:us.anthropic.claude-sonnet-4-5-20250929-v1:0

Step 3: Run Analysis

Analyze a local codebase:

uv run python -m auto_evals --target /path/to/your/agent

Analyze a GitHub repository:

uv run python -m auto_evals --target https://github.com/owner/repo

# For private repos, set:
export GITHUB_TOKEN=your-github-token

Use AWS Bedrock Code Interpreter:

# Automatically uses Bedrock model + AWS Code Interpreter sandbox
uv run python -m auto_evals --target https://github.com/owner/repo --aws-sandbox

# Optional: Specify Code Interpreter role
export CODE_INTERPRETER_ROLE_ARN=arn:aws:iam::123456789:role/CodeInterpreterRole

Custom configuration:

# Specify output file
uv run python -m auto_evals --target /path/to/agent --output my_evals.json

# Enable debug logging
uv run python -m auto_evals --target /path/to/agent --debug

# Explicitly set model (overrides auto-selection)
uv run python -m auto_evals --target /path/to/agent --model anthropic:claude-opus-4-5-20251101

Step 4: Review Results

The tool generates eval_suite.json containing:

  • Findings: Research insights organized by category and importance
  • Eval Cases: 20-50 evaluation test cases with graders
  • Metadata: Runtime, model used, configuration

View statistics:

# Count eval cases
cat eval_suite.json | jq '.eval_suite.eval_cases | length'

# View findings by importance
cat eval_suite.json | jq '.findings[] | select(.importance=="critical") | .title'

# List eval case categories
cat eval_suite.json | jq '.eval_suite.eval_cases[] | .category' | sort | uniq -c

Example output structure:

{
  "target_path": "https://github.com/owner/repo",
  "generated_at": "2026-02-07T10:30:00Z",
  "findings": [
    {
      "category": "architecture",
      "title": "Agent uses LangGraph with ReAct pattern",
      "importance": "high",
      "eval_relevance": "Should test multi-step reasoning and tool usage"
    }
  ],
  "eval_suite": {
    "name": "Eval Suite for owner/repo",
    "eval_cases": [
      {
        "task_id": "capability_001",
        "name": "Multi-step file search",
        "category": "capability",
        "input_prompt": "Find all Python files that import 'langgraph'",
        "expected_behavior": "Agent should search recursively and return file list",
        "graders": [...]
      }
    ]
  },
  "metadata": {
    "runtime_seconds": 45.2,
    "model": "bedrock:us.anthropic.claude-sonnet-4-5-20250929-v1:0"
  }
}

Deep Agents: More Than Just LLM Wrappers

The core innovation in AutoEvals is the Deep Agents framework. Unlike simple LLM wrappers that make a single call, each agent in AutoEvals is a full autonomous system capable of multi-step reasoning and tool usage.

What Makes an Agent "Deep"?

A deep agent has:

  1. Persistent State: Using LangGraph's AgentState with message accumulation
  2. Tool Binding: The LLM can call tools and receive results
  3. Iterative Execution: A workflow loop that continues until the task is complete
  4. Planning Capability: The ability to break down complex tasks

Here's the core workflow:

agent

The Agent State

Each agent maintains state using LangGraph's TypedDict:

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]

The add_messages reducer ensures that tool results and agent responses accumulate properly across iterations.

Creating Deep Agents

The create_deep_agent() function is the heart of the system:

def create_deep_agent(
    model: str,
    tools: list,
    system_prompt: str,
    subagents: dict = None,
    skills: list = None,
    name: str = "agent"
) -> CompiledGraph:
    # Parse model string (e.g., "anthropic:claude-sonnet-4-5-20250929")
    # Extract tools from skills
    # Combine skill contexts with system prompt
    # Create LangGraph workflow
    # Return compiled agent

The key insight is that subagents are themselves deep agents. When the orchestrator creates subagents, each one gets its own full LangGraph workflow with tool access and multi-turn capability.

Skills: The Cognitive Load Solution

One of the biggest challenges with tool-using agents is cognitive overload. Give an agent 20+ tools, and it struggles to pick the right one.

AutoEvals solves this with the Skills abstraction:

skills

What is a Skill?

A Skill is a Pydantic model that encapsulates:

class Skill(BaseModel):
    name: str                    # Skill identifier
    description: str             # High-level capability description
    tools: list[Callable]        # Underlying tool functions
    context: str = ""            # Detailed usage instructions

    def get_tools(self) -> list:
        """Returns tools for agent binding"""

    def get_context(self) -> str:
        """Returns formatted context for system prompt"""

Benefits of Skills

  1. Reduced Cognitive Load: Agents see 4 skills instead of 20 tools
  2. Progressive Disclosure: Detailed instructions only load when needed
  3. Token Efficiency: Context added only for used skills
  4. Composability: New skills added without cluttering agent context

The Four Core Skills

1. Code Analysis Skill

Provides tools for understanding codebases:

  • list_files: Directory traversal with filtering
  • read_file: Content access with size limits
  • analyze_code_structure: AST parsing for Python files
  • search_code: Regex pattern matching
  • find_patterns: Detect common patterns (error handling, async, state management)

2. GitHub Analysis Skill

Remote repository analysis:

  • get_repository_info: Fetch metadata via GitHub API
  • get_file_from_github: Raw content access
  • clone_repository: Local cloning with GitPython
  • list_repository_files: API-based file listing

3. Sandbox Execution Skill

Safe code execution for analysis:

  • run_python_code: Execute Python with AST validation
  • run_code_analysis: Automated structure/pattern analysis
  • generate_grader_code: Create evaluation graders

4. Dynamic Agent Spawner Skill

On-demand agent creation:

  • spawn_agent: Create specialized agents for unique analysis needs

Folder-Based Skills

Beyond programmatic skills, AutoEvals supports loading skills from markdown files:

skills/
  code-analysis/
    SKILL.md           # Contains YAML frontmatter + markdown context
  pattern-detection/
    SKILL.md
  eval-design/
    SKILL.md

Each SKILL.md contains:

  • YAML frontmatter: name, description, triggers
  • Markdown content: Detailed instructions and examples

This makes it easy to extend the system without writing Python code.

The Orchestrator: Coordinating the Team

The orchestrator is the conductor of this multi-agent symphony. It's a deep agent with four subagents, each specialized for a different aspect of analysis.

Orchestrator Workflow

sqflow

The Four Subagents

Architecture Analyst

Focuses on project structure:

  • Entry points and main modules
  • Dependency analysis
  • Configuration patterns
  • Module organization

Pattern Analyst

Identifies code patterns:

  • Error handling strategies
  • Async/await usage
  • State management approaches
  • Tool definitions and LLM calls

Behavior Analyst

Understands capabilities:

  • Input/output formats
  • Decision-making logic
  • External integrations
  • Edge case handling

Eval Designer

Creates the evaluation suite:

  • Capability tests (what the agent should do)
  • Regression tests (things that shouldn't break)
  • Edge case tests (boundary conditions)

Task Delegation

The orchestrator uses a special task() tool to delegate work:

@tool
def task(name: str, task_description: str) -> str:
    """Delegate a task to a subagent."""
    subagent = subagents[name]
    result = subagent.invoke({
        "messages": [HumanMessage(content=task_description)]
    })
    return result["messages"][-1].content

This creates true hierarchical agent coordination - the orchestrator plans, delegates, and synthesizes, while subagents execute specialized analysis.

Tools: The Foundation Layer

At the bottom of the stack are the actual tools that interact with code and files.

Code Tools

Built with LangChain's @tool decorator:

@tool
def analyze_code_structure(file_path: str) -> CodeStructure:
    """Analyze Python file structure using AST parsing."""
    # Returns: classes, functions, imports, decorators, complexity

Key outputs are Pydantic models for type safety:

  • FileInfo: Path, name, extension, size
  • CodeStructure: Classes, functions, imports, complexity
  • PatternMatch: Pattern name, location, code snippet, confidence

Sandbox Execution

The sandbox provides safe code execution:

Local Sandbox (CodeExecutor):

  • AST-based validation blocks dangerous operations
  • Allowlist of safe modules (json, re, pathlib, etc.)
  • Restricted execution context

AWS Bedrock Sandbox:

  • Managed sandboxing via AWS API
  • Used when --aws-sandbox flag is set
  • Better isolation for production use

Data Flow: End to End

Let's trace a complete execution:

flowchart TB
    subgraph Input
        cli["CLI: uv run python -m auto_evals --target /path"]
    end

    subgraph Setup
        parse["Parse Arguments"]
        model["Select Model<br/>(Anthropic or Bedrock)"]
        skills["Build Skills"]
        agents["Create Subagents"]
        orch["Create Orchestrator"]
    end

    subgraph Execution
        invoke["orchestrator.invoke()"]
        delegate["Delegate to Subagents"]
        tools["Execute Tools"]
        synthesize["Synthesize Findings"]
        design["Design Eval Cases"]
    end

    subgraph Output
        parse_out["Parse JSON Response"]
        validate["Pydantic Validation"]
        save["Save eval_suite.json"]
    end

    cli --> parse --> model --> skills --> agents --> orch
    orch --> invoke --> delegate --> tools --> synthesize --> design
    design --> parse_out --> validate --> save
Loading

The Evaluation Suite Output

The final output is a comprehensive evaluation suite:

{
  "name": "Agent Evaluation Suite",
  "target_agent": "/path/to/agent",
  "eval_cases": [
    {
      "task_id": "cap-001",
      "name": "File Creation Capability",
      "category": "capability",
      "input_prompt": "Create a new file called test.txt with 'Hello World'",
      "expected_behavior": "Agent creates file with correct content",
      "graders": [
        {
          "grader_type": "CODE_BASED",
          "implementation": "assert os.path.exists('test.txt')"
        }
      ],
      "metric": "BINARY",
      "num_trials": 5
    }
  ]
}

Each eval case includes:

  • Category: capability, regression, or edge_case
  • Graders: CODE_BASED (deterministic), MODEL_BASED (LLM judgment), or HUMAN
  • Metrics: pass@k, pass^k, partial_credit, binary
  • Trial count: For statistical validity

Key Architectural Innovations

1. Skills as First-Class Citizens

Rather than drowning agents in tools, skills provide semantic groupings with contextual knowledge. This dramatically improves tool selection accuracy.

2. Deep Agents at Every Level

Subagents aren't simple functions - they're full agents capable of multi-step reasoning. This enables complex analysis that couldn't be done in a single LLM call.

3. Dynamic Agent Spawning

The orchestrator can create specialized agents on-demand, adapting to unique codebase characteristics without predefined subagent explosion.

4. Provider Abstraction

The provider:model format allows seamless switching between Anthropic API and AWS Bedrock, supporting different deployment scenarios.

5. Hierarchical Coordination

True task delegation where the orchestrator plans and synthesizes while specialists execute - mirroring how human teams work.

6. Smart Context Management for Large Repositories

One of the biggest challenges in agent-based code analysis is context overflow - when analyzing repositories with thousands of files (especially those containing node_modules or other large dependency directories), the raw file list alone can exceed the model's context window.

AutoEvals solves this with a multi-layered approach:

Automatic Directory Exclusion

The list_files tool automatically excludes common large directories that don't contain relevant source code:

EXCLUDED_DIRECTORIES = {
    "node_modules",    # JavaScript dependencies
    ".git",            # Git internals
    "__pycache__",     # Python bytecode
    ".venv", "venv",   # Virtual environments
    "dist", "build",   # Build outputs
    ".next", ".nuxt",  # Framework caches
    "coverage",        # Test coverage reports
    "target",          # Rust/Java builds
    # ... and more
}

This prevents agents from getting overwhelmed by directories that can contain tens of thousands of files.

File Limits with Smart Defaults

The list_files tool enforces a default limit of 500 files per query. When this limit is reached, the agent receives a warning suggesting to use extension filtering or additional exclusions:

@tool
def list_files(
    directory: str,
    extensions: Optional[list[str]] = None,  # Filter by file type
    max_files: Optional[int] = None,          # Default: 500
    exclude_dirs: Optional[list[str]] = None, # Additional exclusions
) -> list[FileInfo]:

Prompt-Guided Progressive Exploration

The subagent prompts are designed to guide agents toward efficient exploration strategies:

  1. Progressive Discovery: Start with top-level structure, then drill down

    • First: List files with extensions=[".py"] to focus on source code
    • Then: Read README.md and key entry points
    • Finally: Explore specific modules of interest
  2. Code Execution for Heavy Analysis: For large repositories, agents write Python scripts that:

    • Analyze the codebase structure
    • Filter and summarize results
    • Return only relevant information
  3. Targeted Reading: Read specific files rather than listing all:

    • Start with: README.md, main.py, pyproject.toml
    • Use search_code() to find specific patterns
    • Use find_patterns() for pattern detection

Example exploration code that agents can execute in the sandbox:

from pathlib import Path
from collections import defaultdict

def summarize_project(path):
    """Efficient project summarization for large repos."""
    py_files = []
    for f in Path(path).rglob("*.py"):
        # Skip excluded directories
        if any(excl in f.parts for excl in
               ["node_modules", "__pycache__", ".venv"]):
            continue
        py_files.append(f)

    return {
        "total_py_files": len(py_files),
        "top_level_dirs": [d.name for d in Path(path).iterdir()
                          if d.is_dir() and not d.name.startswith(".")],
        "key_files": [str(f.relative_to(path)) for f in py_files[:20]]
    }

This approach provides a 98%+ reduction in context usage compared to naive file listing, enabling analysis of repositories with 10,000+ files without hitting context limits.

Running AutoEvals

Getting started is straightforward:

# Install dependencies
uv sync

# Analyze a local agent
uv run python -m auto_evals --target /path/to/your/agent

# Analyze a GitHub repository
uv run python -m auto_evals --target https://github.com/user/agent-repo

# Use AWS Bedrock sandbox
uv run python -m auto_evals --target /path/to/agent --aws-sandbox

Conclusion

AutoEvals demonstrates how multi-agent systems can tackle complex analysis tasks that would overwhelm a single agent. By combining:

  • Skills-based architecture for cognitive efficiency
  • Deep agents for multi-step reasoning
  • Hierarchical orchestration for coordination
  • Specialized subagents for focused analysis

The system can automatically generate comprehensive evaluation suites for AI agents of any complexity.

The key insight is that the same capabilities that make AI agents powerful - tool use, planning, and multi-step execution - can be turned inward to analyze and evaluate other agents. It's AI helping us understand AI.


AutoEvals is built with LangGraph, Pydantic, and supports both Anthropic API and AWS Bedrock for model inference.

About

A framework to generate automated evals for your agentic application.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages