As AI agents become more sophisticated, evaluating their capabilities becomes increasingly challenging. How do you systematically test an agent that can write code, search the web, manage files, and make complex decisions? This is the problem AutoEvals solves - it's an AI-driven deep research system that automatically analyzes agent codebases and generates comprehensive evaluation suites.
In this post, we'll dive deep into how AutoEvals works, exploring its innovative skills-based architecture, deep agents framework, and the orchestration system that brings everything together.
Traditional software testing approaches fall short when applied to AI agents. Agents exhibit emergent behaviors, make multi-step decisions, and interact with external tools in ways that are difficult to predict. Manual evaluation is time-consuming and doesn't scale.
AutoEvals addresses this by using AI to analyze AI - deploying a team of specialized agents that:
- Analyze the target agent's architecture and code structure
- Identify behavioral patterns and capabilities
- Design 20-50 rigorous evaluation test cases
- Generate appropriate graders (code-based, model-based, or human)
Before diving into the components, let's visualize how everything fits together:
- Python 3.11+
- uv package manager
- Either Anthropic API key or AWS credentials (see options below)
git clone <repository-url>
cd AutoEvals
uv sync
Option A: Anthropic API (Default)
export ANTHROPIC_API_KEY=your-api-key
# Run analysis
uv run python -m auto_evals --target /path/to/agent
Option B: AWS Bedrock (Auto-configured)
# Configure AWS credentials (if not already configured)
export AWS_REGION=us-east-1
aws configure
# Run with --aws-sandbox flag (automatically uses Bedrock model)
uv run python -m auto_evals --target /path/to/agent --aws-sandbox
# AWS Bedrock model is auto-selected: bedrock:us.anthropic.claude-sonnet-4-5-20250929-v1:0
Analyze a local codebase:
uv run python -m auto_evals --target /path/to/your/agent
Analyze a GitHub repository:
uv run python -m auto_evals --target https://github.com/owner/repo
# For private repos, set:
export GITHUB_TOKEN=your-github-token
Use AWS Bedrock Code Interpreter:
# Automatically uses Bedrock model + AWS Code Interpreter sandbox
uv run python -m auto_evals --target https://github.com/owner/repo --aws-sandbox
# Optional: Specify Code Interpreter role
export CODE_INTERPRETER_ROLE_ARN=arn:aws:iam::123456789:role/CodeInterpreterRole
Custom configuration:
# Specify output file
uv run python -m auto_evals --target /path/to/agent --output my_evals.json
# Enable debug logging
uv run python -m auto_evals --target /path/to/agent --debug
# Explicitly set model (overrides auto-selection)
uv run python -m auto_evals --target /path/to/agent --model anthropic:claude-opus-4-5-20251101
The tool generates eval_suite.json containing:
- Findings: Research insights organized by category and importance
- Eval Cases: 20-50 evaluation test cases with graders
- Metadata: Runtime, model used, configuration
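The same three sections can be inspected from Python as well as from the command line. The sketch below is a minimal example that assumes the top-level layout shown in this post's example output (`findings`, `eval_suite.eval_cases`, `metadata`); the function name `summarize_suite` and the sample data are hypothetical and not part of AutoEvals.

```python
from collections import Counter


def summarize_suite(report: dict) -> dict:
    """Summarize a parsed eval_suite.json document (layout assumed from the post)."""
    cases = report.get("eval_suite", {}).get("eval_cases", [])
    findings = report.get("findings", [])
    return {
        "num_eval_cases": len(cases),
        "cases_by_category": dict(Counter(c.get("category", "unknown") for c in cases)),
        "critical_findings": [
            f["title"] for f in findings if f.get("importance") == "critical"
        ],
        "model": report.get("metadata", {}).get("model"),
    }


# Hypothetical in-memory stand-in for json.load() on a real eval_suite.json.
sample_report = {
    "findings": [
        {"title": "Tool errors are swallowed", "importance": "critical"},
        {"title": "Agent uses ReAct pattern", "importance": "high"},
    ],
    "eval_suite": {
        "eval_cases": [
            {"task_id": "capability_001", "category": "capability"},
            {"task_id": "edge_001", "category": "edge_case"},
        ]
    },
    "metadata": {"model": "anthropic:example-model"},
}

summary = summarize_suite(sample_report)
print(summary)
```

For a real run, replace `sample_report` with the result of `json.load()` on the generated file.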
View statistics:
# Count eval cases
cat eval_suite.json | jq '.eval_suite.eval_cases | length'
# View findings by importance
cat eval_suite.json | jq '.findings[] | select(.importance=="critical") | .title'
# List eval case categories
cat eval_suite.json | jq '.eval_suite.eval_cases[] | .category' | sort | uniq -c
Example output structure:
{
"target_path": "https://github.com/owner/repo",
"generated_at": "2026-02-07T10:30:00Z",
"findings": [
{
"category": "architecture",
"title": "Agent uses LangGraph with ReAct pattern",
"importance": "high",
"eval_relevance": "Should test multi-step reasoning and tool usage"
}
],
"eval_suite": {
"name": "Eval Suite for owner/repo",
"eval_cases": [
{
"task_id": "capability_001",
"name": "Multi-step file search",
"category": "capability",
"input_prompt": "Find all Python files that import 'langgraph'",
"expected_behavior": "Agent should search recursively and return file list",
"graders": [...]
}
]
},
"metadata": {
"runtime_seconds": 45.2,
"model": "bedrock:us.anthropic.claude-sonnet-4-5-20250929-v1:0"
}
}
The core innovation in AutoEvals is the Deep Agents framework. Unlike simple LLM wrappers that make a single call, each agent in AutoEvals is a full autonomous system capable of multi-step reasoning and tool usage.
A deep agent has:
- Persistent State: Using LangGraph's AgentState with message accumulation
- Tool Binding: The LLM can call tools and receive results
- Iterative Execution: A workflow loop that continues until the task is complete
- Planning Capability: The ability to break down complex tasks
Here's the core workflow:
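A minimal, library-free sketch of such a loop is shown here. It is not the AutoEvals implementation: the names `Decision`, `run_agent`, and `scripted_decide` are hypothetical, and a scripted function stands in for the LLM so the control flow (decide, call a tool, feed the result back, stop when there is a final answer) is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Decision:
    """What the stand-in model wants to do next."""
    tool: Optional[str] = None    # tool to call, or None when the task is done
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None  # final answer, used when tool is None


def run_agent(decide, tools, task, max_iters=10):
    """Loop until the policy returns a final answer or the budget runs out."""
    messages = [("human", task)]
    for _ in range(max_iters):
        decision = decide(messages)
        if decision.tool is None:
            messages.append(("ai", decision.answer or ""))
            return messages
        # Execute the requested tool and feed the result back into the history.
        result = tools[decision.tool](**decision.args)
        messages.append(("tool", str(result)))
    raise RuntimeError("agent did not finish within max_iters")


def scripted_decide(messages):
    """Hypothetical policy: call one tool, then answer using its result."""
    if messages[-1][0] == "human":
        return Decision(tool="count_words", args={"text": "deep agents loop"})
    return Decision(answer=f"word count: {messages[-1][1]}")


history = run_agent(
    scripted_decide,
    {"count_words": lambda text: len(text.split())},
    "Count the words",
)
print(history)
```

The `max_iters` guard is a design choice of this sketch: an iterative agent needs some stopping condition besides "the model says it is done".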
Each agent maintains state in a TypedDict that serves as the LangGraph state schema:
class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]
The add_messages reducer ensures that tool results and agent responses accumulate properly across iterations.
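To make the reducer idea concrete, here is a standard-library sketch of how a reducer attached through `Annotated` can be looked up and applied when a state update arrives. It is a simplified stand-in, not LangGraph's code: `append_messages`, `SketchState`, and `apply_update` are hypothetical names, and the real `add_messages` does more than append (for example, it can update an existing message that has the same ID).

```python
from typing import Annotated, TypedDict, get_type_hints


def append_messages(existing: list, new: list) -> list:
    """Simplified reducer: concatenate instead of overwriting."""
    return [*existing, *new]


class SketchState(TypedDict):
    messages: Annotated[list, append_messages]  # merged with the reducer
    step: int                                   # no reducer: last write wins


def apply_update(state: dict, update: dict, schema=SketchState) -> dict:
    """Merge an update into the state, using per-key reducers where declared."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata:
            merged[key] = metadata[0](state.get(key, []), value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["human: analyze repo"], "step": 0}
state = apply_update(state, {"messages": ["ai: calling list_files"], "step": 1})
state = apply_update(state, {"messages": ["tool: 12 files"]})
print(state)
```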
The create_deep_agent() function is the heart of the system:
def create_deep_agent(
    model: str,
    tools: list,
    system_prompt: str,
    subagents: dict | None = None,
    skills: list | None = None,
    name: str = "agent",
) -> CompiledGraph:
    # Parse model string (e.g., "anthropic:claude-sonnet-4-5-20250929")
    # Extract tools from skills
    # Combine skill contexts with system prompt
    # Create LangGraph workflow
    # Return compiled agent
    ...
The key insight is that subagents are themselves deep agents. When the orchestrator creates subagents, each one gets its own full LangGraph workflow with tool access and multi-turn capability.
One of the biggest challenges with tool-using agents is cognitive overload. Give an agent 20+ tools, and it struggles to pick the right one.
AutoEvals solves this with the Skills abstraction:
A Skill is a Pydantic model that encapsulates:
class Skill(BaseModel):
    name: str               # Skill identifier
    description: str        # High-level capability description
    tools: list[Callable]   # Underlying tool functions
    context: str = ""       # Detailed usage instructions

    def get_tools(self) -> list:
        """Returns tools for agent binding"""

    def get_context(self) -> str:
        """Returns formatted context for system prompt"""
- Reduced Cognitive Load: Agents see 4 skills instead of 20 tools
- Progressive Disclosure: Detailed instructions only load when needed
- Token Efficiency: Context added only for used skills
- Composability: New skills added without cluttering agent context
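The sketch below illustrates how skills might be flattened into the two things an agent actually receives: a tool list and a system prompt. It uses a plain dataclass rather than the Pydantic model from the post so that it runs with the standard library alone; `SkillSketch`, `assemble_agent_inputs`, the prompt layout, and the placeholder tool functions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SkillSketch:
    """Dataclass stand-in for the Skill model described above."""
    name: str
    description: str
    tools: list[Callable] = field(default_factory=list)
    context: str = ""

    def get_tools(self) -> list:
        return list(self.tools)

    def get_context(self) -> str:
        header = f"## Skill: {self.name}\n{self.description}"
        return f"{header}\n{self.context}" if self.context else header


def assemble_agent_inputs(system_prompt: str, skills: list) -> tuple:
    """Return (prompt, tools): skill contexts appended, tools flattened."""
    tools = [tool for skill in skills for tool in skill.get_tools()]
    prompt = "\n\n".join([system_prompt, *(s.get_context() for s in skills)])
    return prompt, tools


# Placeholder tool functions; the real tools do actual work.
def list_files(directory): return []
def read_file(path): return ""
def run_python_code(code): return ""


skills = [
    SkillSketch("code_analysis", "Explore a codebase.", [list_files, read_file],
                context="Start with list_files, then read key files."),
    SkillSketch("sandbox", "Run analysis code safely.", [run_python_code]),
]
prompt, tools = assemble_agent_inputs("You analyze agent codebases.", skills)
print(prompt)
```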
Provides tools for understanding codebases:
- list_files: Directory traversal with filtering
- read_file: Content access with size limits
- analyze_code_structure: AST parsing for Python files
- search_code: Regex pattern matching
- find_patterns: Detect common patterns (error handling, async, state management)
Remote repository analysis:
- get_repository_info: Fetch metadata via GitHub API
- get_file_from_github: Raw content access
- clone_repository: Local cloning with GitPython
- list_repository_files: API-based file listing
Safe code execution for analysis:
- run_python_code: Execute Python with AST validation
- run_code_analysis: Automated structure/pattern analysis
- generate_grader_code: Create evaluation graders
On-demand agent creation:
spawn_agent: Create specialized agents for unique analysis needs
Beyond programmatic skills, AutoEvals supports loading skills from markdown files:
skills/
code-analysis/
SKILL.md # Contains YAML frontmatter + markdown context
pattern-detection/
SKILL.md
eval-design/
SKILL.md
Each SKILL.md contains:
- YAML frontmatter: name, description, triggers
- Markdown content: Detailed instructions and examples
This makes it easy to extend the system without writing Python code.
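As an illustration of the file format, the following sketch splits a SKILL.md string into frontmatter and body. It is deliberately simplified and is not the AutoEvals loader: it only understands flat `key: value` lines, so list-valued fields such as triggers would need a real YAML parser. The function name `parse_skill_markdown` and the sample file contents are hypothetical.

```python
def parse_skill_markdown(text: str) -> tuple:
    """Split a SKILL.md string into (frontmatter dict, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at all
    try:
        # Index of the closing '---' delimiter.
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated frontmatter: treat everything as body
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical SKILL.md contents.
sample = "\n".join([
    "---",
    "name: code-analysis",
    "description: Explore and summarize a codebase",
    "---",
    "# Code Analysis",
    "",
    "Start with the entry points.",
])

meta, body = parse_skill_markdown(sample)
print(meta)
print(body)
```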
The orchestrator is the conductor of this multi-agent symphony. It's a deep agent with four subagents, each specialized for a different aspect of analysis.
Focuses on project structure:
- Entry points and main modules
- Dependency analysis
- Configuration patterns
- Module organization
Identifies code patterns:
- Error handling strategies
- Async/await usage
- State management approaches
- Tool definitions and LLM calls
Understands capabilities:
- Input/output formats
- Decision-making logic
- External integrations
- Edge case handling
Creates the evaluation suite:
- Capability tests (what the agent should do)
- Regression tests (things that shouldn't break)
- Edge case tests (boundary conditions)
The orchestrator uses a special task() tool to delegate work:
@tool
def task(name: str, task_description: str) -> str:
    """Delegate a task to a subagent."""
    subagent = subagents[name]
    result = subagent.invoke({
        "messages": [HumanMessage(content=task_description)]
    })
    return result["messages"][-1].content
This creates true hierarchical agent coordination - the orchestrator plans, delegates, and synthesizes, while subagents execute specialized analysis.
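One detail worth considering in a delegation tool is what happens when the orchestrator asks for a subagent that does not exist. The sketch below shows one possible approach, returning an explanatory string instead of raising, so the calling model can correct itself. It is an assumption about a reasonable design, not a description of AutoEvals; `make_task_tool` and the stand-in subagents (plain functions rather than compiled graphs) are hypothetical.

```python
from typing import Callable


def make_task_tool(subagents: dict) -> Callable:
    """Build a task() function bound to a registry of subagent callables."""

    def task(name: str, task_description: str) -> str:
        subagent = subagents.get(name)
        if subagent is None:
            # Report the problem as text so the orchestrator can retry.
            available = ", ".join(sorted(subagents))
            return f"Unknown subagent '{name}'. Available: {available}"
        return subagent(task_description)

    return task


# Stand-in subagents: each takes a task description and returns a report.
registry = {
    "architecture": lambda desc: f"[architecture] analyzed: {desc}",
    "patterns": lambda desc: f"[patterns] analyzed: {desc}",
}
task = make_task_tool(registry)

ok_result = task("patterns", "find error handling")
bad_result = task("security", "audit secrets")
print(ok_result)
print(bad_result)
```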
At the bottom of the stack are the actual tools that interact with code and files.
Built with LangChain's @tool decorator:
@tool
def analyze_code_structure(file_path: str) -> CodeStructure:
    """Analyze Python file structure using AST parsing."""
    # Returns: classes, functions, imports, decorators, complexity
Key outputs are Pydantic models for type safety:
- FileInfo: Path, name, extension, size
- CodeStructure: Classes, functions, imports, complexity
- PatternMatch: Pattern name, location, code snippet, confidence
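For readers unfamiliar with AST-based analysis, here is a small sketch built on Python's standard `ast` module. It extracts class, function, and import names from source text. The `StructureSummary` dataclass and `summarize_source` function are hypothetical stand-ins; the actual `CodeStructure` model also covers decorators and complexity, which this sketch omits.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class StructureSummary:
    classes: list = field(default_factory=list)
    functions: list = field(default_factory=list)
    imports: list = field(default_factory=list)


def summarize_source(source: str) -> StructureSummary:
    """Collect class, function, and imported module names from Python source."""
    summary = StructureSummary()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            summary.classes.append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            summary.functions.append(node.name)
        elif isinstance(node, ast.Import):
            summary.imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            summary.imports.append(node.module or ".")
    return summary


sample_source = """
import json
from pathlib import Path

class Agent:
    async def run(self):
        return json.dumps({"cwd": str(Path.cwd())})

def main():
    return Agent()
"""

structure = summarize_source(sample_source)
print(structure)
```

Because the source is parsed rather than executed, this kind of analysis is safe to run on untrusted code.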
The sandbox provides safe code execution:
Local Sandbox (CodeExecutor):
- AST-based validation blocks dangerous operations
- Allowlist of safe modules (json, re, pathlib, etc.)
- Restricted execution context
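The following sketch shows the general shape of AST-based validation with a module allowlist. It is an illustration under stated assumptions, not the `CodeExecutor` implementation: the allowlist contains only the three modules named above, the blocked-call set is invented for the example, and a static check like this is easy to bypass on its own, so it should not be treated as a complete security boundary.

```python
import ast

ALLOWED_MODULES = {"json", "re", "pathlib"}  # modules named in the post
BLOCKED_CALLS = {"eval", "exec", "compile", "open", "__import__"}  # illustrative


def find_violations(source: str) -> list:
    """Return human-readable reasons the code should not be executed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            # Compare only the top-level package name against the allowlist.
            if module.split(".")[0] not in ALLOWED_MODULES:
                violations.append(f"import of '{module}' is not allowed")
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            violations.append(f"call to '{node.func.id}' is not allowed")
    return violations


safe = find_violations("import json\nprint(json.dumps({}))")
unsafe_import = find_violations("import os\nos.system('ls')")
unsafe_call = find_violations("eval('1 + 1')")
print(safe, unsafe_import, unsafe_call)
```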
AWS Bedrock Sandbox:
- Managed sandboxing via AWS API
- Used when the --aws-sandbox flag is set
- Better isolation for production use
Let's trace a complete execution:
flowchart TB
subgraph Input
cli["CLI: uv run python -m auto_evals --target /path"]
end
subgraph Setup
parse["Parse Arguments"]
model["Select Model<br/>(Anthropic or Bedrock)"]
skills["Build Skills"]
agents["Create Subagents"]
orch["Create Orchestrator"]
end
subgraph Execution
invoke["orchestrator.invoke()"]
delegate["Delegate to Subagents"]
tools["Execute Tools"]
synthesize["Synthesize Findings"]
design["Design Eval Cases"]
end
subgraph Output
parse_out["Parse JSON Response"]
validate["Pydantic Validation"]
save["Save eval_suite.json"]
end
cli --> parse --> model --> skills --> agents --> orch
orch --> invoke --> delegate --> tools --> synthesize --> design
design --> parse_out --> validate --> save
The final output is a comprehensive evaluation suite:
{
"name": "Agent Evaluation Suite",
"target_agent": "/path/to/agent",
"eval_cases": [
{
"task_id": "cap-001",
"name": "File Creation Capability",
"category": "capability",
"input_prompt": "Create a new file called test.txt with 'Hello World'",
"expected_behavior": "Agent creates file with correct content",
"graders": [
{
"grader_type": "CODE_BASED",
"implementation": "assert os.path.exists('test.txt')"
}
],
"metric": "BINARY",
"num_trials": 5
}
]
}
Each eval case includes:
- Category: capability, regression, or edge_case
- Graders: CODE_BASED (deterministic), MODEL_BASED (LLM judgment), or HUMAN
- Metrics: pass@k, pass^k, partial_credit, binary
- Trial count: For statistical validity
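The post does not define pass@k and pass^k, so the sketch below assumes their common meanings: pass@k is the probability that at least one of k sampled trials succeeds, and pass^k is the probability that all k succeed. Both are estimated here from n recorded trials with c successes by counting k-sized subsets; the function names are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that a random k-subset of n trials contains at least one success."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that a random k-subset of n trials consists only of successes."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# Five trials (matching num_trials: 5 in the example), three of which passed.
at_2 = pass_at_k(n=5, c=3, k=2)
hat_2 = pass_hat_k(n=5, c=3, k=2)
print(f"pass@2 = {at_2:.2f}, pass^2 = {hat_2:.2f}")
```

The two numbers answer different questions: pass@k rewards an agent that can succeed at least occasionally, while pass^k rewards one that succeeds consistently.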
Rather than drowning agents in tools, skills provide semantic groupings with contextual knowledge. This dramatically improves tool selection accuracy.
Subagents aren't simple functions - they're full agents capable of multi-step reasoning. This enables complex analysis that couldn't be done in a single LLM call.
The orchestrator can create specialized agents on demand, adapting to unique codebase characteristics without requiring an ever-growing set of predefined subagents.
The provider:model format allows seamless switching between Anthropic API and AWS Bedrock, supporting different deployment scenarios.
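A small parsing sketch for this format is shown below. One detail is visible in the Bedrock identifier used in this post, which itself ends in `:0`: the string should be split on the first colon only. The names `ModelSpec`, `parse_model_string`, and `KNOWN_PROVIDERS` are hypothetical, and the provider set is simply the two providers mentioned here.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    provider: str
    model_id: str


KNOWN_PROVIDERS = {"anthropic", "bedrock"}  # assumption: providers named in the post


def parse_model_string(spec: str) -> ModelSpec:
    """Split 'provider:model' on the first colon and validate the provider."""
    provider, sep, model_id = spec.partition(":")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider {provider!r}")
    return ModelSpec(provider, model_id)


bedrock = parse_model_string("bedrock:us.anthropic.claude-sonnet-4-5-20250929-v1:0")
anthropic = parse_model_string("anthropic:claude-sonnet-4-5-20250929")
print(bedrock)
print(anthropic)
```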
True task delegation where the orchestrator plans and synthesizes while specialists execute - mirroring how human teams work.
One of the biggest challenges in agent-based code analysis is context overflow - when analyzing repositories with thousands of files (especially those containing node_modules or other large dependency directories), the raw file list alone can exceed the model's context window.
AutoEvals solves this with a multi-layered approach:
The list_files tool automatically excludes common large directories that don't contain relevant source code:
EXCLUDED_DIRECTORIES = {
    "node_modules",      # JavaScript dependencies
    ".git",              # Git internals
    "__pycache__",       # Python bytecode
    ".venv", "venv",     # Virtual environments
    "dist", "build",     # Build outputs
    ".next", ".nuxt",    # Framework caches
    "coverage",          # Test coverage reports
    "target",            # Rust/Java builds
    # ... and more
}
This prevents agents from getting overwhelmed by directories that can contain tens of thousands of files.
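A sketch of how such exclusions can be applied efficiently is shown below: pruning `dirnames` in place during `os.walk` means excluded directories are never descended into, rather than being filtered out afterwards. The function `list_source_files`, its return shape, and the `truncated` flag are assumptions made for illustration; the sketch also includes a per-query cap, using the 500-file default the post describes for list_files.

```python
import os
import tempfile
from pathlib import Path

DEFAULT_EXCLUDES = {"node_modules", ".git", "__pycache__", ".venv", "venv"}


def list_source_files(root, extensions=None, max_files=500, exclude_dirs=None):
    """Return (relative_paths, truncated), skipping excluded directories."""
    excluded = DEFAULT_EXCLUDES | set(exclude_dirs or ())
    found = []
    for current, dirnames, filenames in os.walk(root):
        # In-place pruning stops os.walk from entering excluded directories.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded)
        for name in sorted(filenames):
            if extensions and Path(name).suffix not in extensions:
                continue
            if len(found) >= max_files:
                return found, True  # cap reached: caller should narrow the query
            found.append(Path(current, name).relative_to(root).as_posix())
    return found, False


# Build a throwaway tree to demonstrate pruning and the cap.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("main.py", "pkg/util.py", "node_modules/dep/index.py"):
        target = Path(tmp, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("# placeholder\n")
    all_py, all_truncated = list_source_files(tmp, extensions=[".py"])
    capped, capped_truncated = list_source_files(tmp, extensions=[".py"], max_files=1)

print(all_py, all_truncated)
print(capped, capped_truncated)
```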
The list_files tool enforces a default limit of 500 files per query. When the limit is reached, the agent receives a warning that suggests extension filtering or additional exclusions:
@tool
def list_files(
    directory: str,
    extensions: Optional[list[str]] = None,    # Filter by file type
    max_files: Optional[int] = None,           # Default: 500
    exclude_dirs: Optional[list[str]] = None,  # Additional exclusions
) -> list[FileInfo]:
    ...
The subagent prompts are designed to guide agents toward efficient exploration strategies:
- Progressive Discovery: Start with top-level structure, then drill down
  - First: List files with extensions=[".py"] to focus on source code
  - Then: Read README.md and key entry points
  - Finally: Explore specific modules of interest
- Code Execution for Heavy Analysis: For large repositories, agents write Python scripts that:
  - Analyze the codebase structure
  - Filter and summarize results
  - Return only relevant information
- Targeted Reading: Read specific files rather than listing all:
  - Start with: README.md, main.py, pyproject.toml
  - Use search_code() to find specific patterns
  - Use find_patterns() for pattern detection
Example exploration code that agents can execute in the sandbox:
from pathlib import Path

def summarize_project(path):
    """Efficient project summarization for large repos."""
    py_files = []
    for f in Path(path).rglob("*.py"):
        # Skip excluded directories
        if any(excl in f.parts for excl in
               ["node_modules", "__pycache__", ".venv"]):
            continue
        py_files.append(f)
    return {
        "total_py_files": len(py_files),
        "top_level_dirs": [d.name for d in Path(path).iterdir()
                           if d.is_dir() and not d.name.startswith(".")],
        "key_files": [str(f.relative_to(path)) for f in py_files[:20]]
    }
This approach provides a 98%+ reduction in context usage compared to naive file listing, enabling analysis of repositories with 10,000+ files without hitting context limits.
Getting started is straightforward:
# Install dependencies
uv sync
# Analyze a local agent
uv run python -m auto_evals --target /path/to/your/agent
# Analyze a GitHub repository
uv run python -m auto_evals --target https://github.com/user/agent-repo
# Use AWS Bedrock sandbox
uv run python -m auto_evals --target /path/to/agent --aws-sandbox
AutoEvals demonstrates how multi-agent systems can tackle complex analysis tasks that would overwhelm a single agent. It combines:
- Skills-based architecture for cognitive efficiency
- Deep agents for multi-step reasoning
- Hierarchical orchestration for coordination
- Specialized subagents for focused analysis
Together, these allow the system to automatically generate comprehensive evaluation suites for AI agents of any complexity.
The key insight is that the same capabilities that make AI agents powerful - tool use, planning, and multi-step execution - can be turned inward to analyze and evaluate other agents. It's AI helping us understand AI.
AutoEvals is built with LangGraph, Pydantic, and supports both Anthropic API and AWS Bedrock for model inference.




