8 changes: 6 additions & 2 deletions .gitignore
@@ -176,11 +176,11 @@ cache
*.csv
materials

embodichain/toolkits/outputs/*
embodichain/toolkits/outputs/*

embodichain/database/tmp/*
embodichain/database/train/*
embodichain/database/agent_generated_content/

3rdparty/
Log/
@@ -200,4 +200,8 @@ wandb/
embodichain/VERSION

# benchmark results
scripts/benchmark/rl/reports/*
scripts/benchmark/rl/reports/*

# generated scene assets
*_gym_project/
backup_*/
3 changes: 1 addition & 2 deletions configs/gym/pour_water/gym_config.json
@@ -278,8 +278,7 @@
"use_videos": true
}
}
},
"control_parts": ["left_arm", "left_eef", "right_arm", "right_eef"]
}
},
"robot": {
"uid": "CobotMagic",
5 changes: 2 additions & 3 deletions configs/gym/pour_water/gym_config_simple.json
@@ -203,7 +203,7 @@
"mode": "modify",
"name": "robot/qpos",
"params": {
"joint_ids": [6, 13]
"joint_ids": [12, 13, 14, 15]
}
}
},
@@ -227,8 +227,7 @@
"use_videos": true
}
}
},
"control_parts": ["left_arm", "left_eef", "right_arm", "right_eef"]
}
},
"robot": {
"uid": "CobotMagic",
150 changes: 150 additions & 0 deletions docs/source/features/failure_monitor_recovery.md
@@ -0,0 +1,150 @@
# Failure-Monitor-Recovery Graph

This document describes the graph-based monitor and recovery pipeline in
EmbodiChain. The runtime assumes failures come from the outside world or the
environment. The agent does not execute error-injection functions. Error
descriptions are prompt context only, used by the recovery agent to choose
useful monitors and recovery policies.

The core flow is:

```text
TaskAgent
  -> nominal atomic-action graph
RecoveryAgent
  -> lightweight recovery bindings
CompileAgent
  -> explicit compiled recovery graph
AgentTaskGraph
  -> execute nominal edges, switch to recovery edges when monitors trigger
```

## Agent Roles

`TaskAgent` produces one deterministic nominal graph. It writes
`agent_task_graph.json` with semantic nodes, executable edges, a start node, and
a goal node.

`RecoveryAgent` writes `agent_recovery_spec.json`. This is the LLM-facing file.
It contains only `recovery_bindings`: compact mappings from a nominal edge to a
failure monitor template and an ordered recovery policy. It does not contain
recovery node ids, recovery edge ids, recovery branches, or Python code.

`CompileAgent` expands the task graph and recovery spec into
`agent_compiled_graph.json`. The compiled artifact contains:

- `task_graph`: the nominal graph from `TaskAgent`
- `recovery_spec`: the compact authoring spec from `RecoveryAgent`
- `recovery_graph`: explicit recovery nodes, recovery edges, and recovery
branches generated by the compiler
- `metadata`: schema and recovery flags

Before expansion, `CompileAgent` runs a deterministic normalization pass over
the recovery spec. It canonicalizes common aliases and fills simple fields from
the nominal edge when the value is unambiguous. It makes a single LLM
canonicalization call only if the compact spec still contains unresolved
semantic intent.

`AgentTaskGraph` executes the compiled graph, running each edge through one
`drive(...)` call.

## Recovery Spec

The LLM-facing recovery spec is intentionally small:

```json
{
  "task": "dual_pour_water",
  "recovery_bindings": [
    {
      "edge_id": "e01_grasp_objects",
      "failure_name": "object_moved_before_grasp",
      "monitors": [
        {"type": "object_moved", "objects": ["cup", "bottle"], "threshold": 0.02}
      ],
      "recovery": [
        {
          "type": "regrasp_both",
          "arms": {"left_arm": "cup", "right_arm": "bottle"},
          "pre_grasp_dis": 0.1,
          "force_valid": true
        }
      ],
      "merge": "target",
      "repeat_until_success": true
    }
  ]
}
```

Supported monitor templates are:

- `object_moved`: for stationary objects that unexpectedly shift
- `hold_lost`: for objects that should already be held by an arm

Supported recovery steps are:

- `regrasp`
- `regrasp_both`
- `retry_failed_edge`
- `replay_edge`
- `action`: a direct atomic-action spec, for cases that no template covers

The compiler converts these bindings into explicit graph topology. It generates
stable recovery nodes and edges, attaches monitor sequences, and adds reusable
self-recovery branches when `repeat_until_success` is true.
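
A rough sketch of what such an expansion could look like for a single binding is shown here. The id scheme (`__failed`, `__r0_...`), the `on_failure` field, and the shape of `nominal_edge` are all assumptions made for illustration; the real compiler output may differ.

```python
def expand_binding(binding, nominal_edge):
    """Expand one compact binding into explicit recovery edges and a branch.

    All ids and field names here are hypothetical, not the compiler's real schema.
    """
    edge_id = binding["edge_id"]
    # "merge" names which end of the failed nominal edge the branch rejoins.
    merge_node = nominal_edge[binding.get("merge", "target")]
    steps = binding["recovery"]

    nodes = [f"{edge_id}__failed"]
    edges = []
    for i, step in enumerate(steps):
        is_last = i == len(steps) - 1
        target = merge_node if is_last else f"{edge_id}__after_r{i}"
        if not is_last:
            nodes.append(target)
        # Ids derive only from the edge id, step index, and step type,
        # so recompiling the same spec yields the same ids.
        rec_id = f"{edge_id}__r{i}_{step['type']}"
        edges.append({
            "id": rec_id,
            "source": nodes[i],
            "target": target,
            "step": step,
            # A self-pointing failure target makes the edge repeatable.
            "on_failure": rec_id if binding.get("repeat_until_success") else None,
        })

    branch = {
        "failed_edge": edge_id,
        "failure_name": binding["failure_name"],
        "monitors": binding["monitors"],
        "recovery_edges": [e["id"] for e in edges],
        "merge_node": merge_node,
    }
    return {"recovery_nodes": nodes, "recovery_edges": edges, "recovery_branches": [branch]}
```

Deriving ids deterministically from the binding is one simple way to get stable topology across recompiles.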

## Runtime Semantics

`AgentTaskGraph.run(...)` follows the single nominal start-to-goal chain. Each
edge is executed by one `drive(...)` call.

During any monitored edge, nominal or recovery, `drive(...)` checks monitor
sequences after each environment step. Monitor functions use unified semantics:
`True` means the failure condition has triggered.

If no monitor triggers, the graph moves to the edge target. If a monitor
triggers, the graph switches to the matching recovery branch and executes its
recovery edges.
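
The per-step check described above might be sketched as follows. This is not the real `drive(...)` signature, which this document does not show; `drive_sketch`, `step_env`, and the dict-of-callables monitor format are hypothetical stand-ins.

```python
def drive_sketch(step_env, monitors, num_steps):
    """Step the environment; return the first triggered monitor name, else None.

    step_env: zero-argument callable that advances the environment and returns its state.
    monitors: dict mapping a monitor name to a callable(state) -> bool,
              where True means the failure condition has triggered.
    """
    for _ in range(num_steps):
        state = step_env()
        # Monitors are checked after every environment step.
        for name, check in monitors.items():
            if check(state):
                return name
    return None


def next_node(edge, triggered, branch_entry):
    """Pick the next node: the edge target on success, the branch entry on failure."""
    return edge["target"] if triggered is None else branch_entry[(edge["id"], triggered)]
```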

Recovery is repeatable. The compiler can make a recovery edge monitor itself and
point back to itself. That creates a reusable recovery edge: the same recovery
behavior repeats until its monitor no longer triggers or the graph reaches
`max_transitions`.
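
To make the repeat-until-clear behavior and the `max_transitions` bound concrete, here is a small sketch. The edge table format (`next`, `on_failure`) and the `execute_edge` callback are assumptions for illustration, not the actual `AgentTaskGraph` data structures.

```python
def run_with_recovery(execute_edge, start_edge, edges, max_transitions):
    """Follow edges until the goal is reached or the transition budget runs out.

    execute_edge: callable(edge_id) -> bool, True if a monitor triggered.
    edges: dict of edge_id -> {"next": edge id or None (goal), "on_failure": edge id}.
    Returns (trace of executed edge ids, whether the goal was reached).
    """
    current, transitions, trace = start_edge, 0, []
    while current is not None and transitions < max_transitions:
        trace.append(current)
        triggered = execute_edge(current)
        # A recovery edge whose on_failure is itself simply runs again.
        current = edges[current]["on_failure"] if triggered else edges[current]["next"]
        transitions += 1
    return trace, current is None
```

With a self-pointing recovery edge, the loop ends either when the monitor stays quiet or when the budget is exhausted, whichever comes first.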

Recovery-edge monitors are chosen by the compiler from the recovery step type.
For example, a `hold_lost` binding that recovers by `regrasp` will not monitor
the regrasp edge with `hold_lost`, because the object may not be held until the
grasp completes. The compiler uses object-displacement monitors for regrasp
steps, then uses hold-loss monitors again for movement/retry steps where the
object should already be held.
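
One way to picture this selection is a lookup from recovery step type to monitor template. The grouping below is an assumption drawn from the paragraph above: regrasp steps get `object_moved`, and `retry_failed_edge` and `replay_edge` are treated as the movement/retry steps that get `hold_lost`. How the compiler handles `action` steps is not stated here, so the sketch returns `None` for them.

```python
# Hypothetical grouping of step types; not the compiler's actual tables.
REGRASP_STEPS = {"regrasp", "regrasp_both"}
HELD_OBJECT_STEPS = {"retry_failed_edge", "replay_edge"}


def monitors_for_policy(recovery_steps):
    """Return one monitor template name (or None) per recovery step."""
    chosen = []
    for step in recovery_steps:
        if step["type"] in REGRASP_STEPS:
            # The object is not held yet, so watch for displacement instead.
            chosen.append("object_moved")
        elif step["type"] in HELD_OBJECT_STEPS:
            # The object should already be held during these steps.
            chosen.append("hold_lost")
        else:
            chosen.append(None)
    return chosen
```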

For hold-loss failures inside a later recovery step, the compiler can insert a
small repair edge, such as regrasping the cup, then retry the failed recovery
edge and resume the remaining recovery path.

## Validation Rules

`graph_spec.py` validates the compiled graph before execution:

- the nominal graph is one deterministic start-to-goal chain
- every edge id is unique
- every node reference is defined
- every nominal and recovery edge has at least one arm action
- recovery branches reference existing nominal and recovery edges
- branch recovery edges are path-contiguous
- a branch merges only to the failed edge source or target
- executable error functions and legacy recovery action lists are rejected
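
A few of these rules can be sketched as plain checks over a graph dict. The layout assumed below (`nodes`, `edges`, `start`, `goal`, with `source` and `target` on each edge) is a guess at the schema, and `validate_nominal_chain` is a hypothetical name, not a function from `graph_spec.py`.

```python
def validate_nominal_chain(graph):
    """Return a list of error strings; an empty list means these checks passed."""
    errors = []
    nodes = set(graph["nodes"])
    edges = graph["edges"]

    # Every edge id is unique.
    ids = [e["id"] for e in edges]
    if len(ids) != len(set(ids)):
        errors.append("duplicate edge id")

    # Every node reference is defined.
    for e in edges:
        for end in ("source", "target"):
            if e[end] not in nodes:
                errors.append(f"edge {e['id']} references undefined node {e[end]}")

    # Deterministic: at most one outgoing edge per node.
    outgoing = {}
    for e in edges:
        if e["source"] in outgoing:
            errors.append(f"node {e['source']} has more than one outgoing edge")
        outgoing[e["source"]] = e

    # Walk from start along outgoing edges; the walk must use every edge
    # and end at the goal.
    node, used = graph["start"], 0
    while node in outgoing and used < len(edges):
        node = outgoing[node]["target"]
        used += 1
    if node != graph["goal"] or used != len(edges):
        errors.append("nominal graph is not a single start-to-goal chain")
    return errors
```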

`expand_recovery_spec(...)` separately validates the LLM-facing spec. It rejects
compiled graph keys such as `recovery_nodes`, `recovery_edges`, and
`recovery_branches`, because those are compiler output, not LLM output.
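
The key rejection described above could look roughly like this. `check_llm_spec` is a hypothetical helper, not the body of `expand_recovery_spec(...)`, and the exact error type is an assumption.

```python
# Keys that belong to compiler output and must not appear in the LLM-facing spec.
COMPILED_ONLY_KEYS = {"recovery_nodes", "recovery_edges", "recovery_branches"}


def check_llm_spec(spec):
    """Raise ValueError if the spec carries compiler-only keys or lacks bindings."""
    leaked = sorted(COMPILED_ONLY_KEYS.intersection(spec))
    if leaked:
        raise ValueError(f"compiler-only keys in recovery spec: {leaked}")
    if not isinstance(spec.get("recovery_bindings"), list):
        raise ValueError("recovery spec must contain a 'recovery_bindings' list")
    return spec
```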

## Current Artifacts

For a generated task such as `DualPourWater`, the graph workflow produces:

- `agent_task_graph.json`
- `agent_recovery_spec.json`
- `agent_compiled_graph.json`

The old Python-code generation artifact is no longer part of the active
pipeline.