fix(csa-acp,hooks): meaningful-activity timeout + post-merge target cleanup (#854)#958
Conversation
…#854) Issue #757 added an initial-response stall detector to the legacy-CLI codex-exec transport. The ACP codex path had no parity, so a stuck codex-acp session hung until the outer wall-clock timeout (~1800s) killed it. Port the stall classifier + retry to the codex-acp transport: - Stall signal = no AgentMessageChunk / AgentThought / PlanUpdate / ToolCall* event within the configured stall window after ACP init. - Retry budget reuses [tools.codex] initial_response_stall_secs. Closes #854. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…854 round 2) Round-1 classifier matched any exit=137 / empty-output / timeout-shaped-summary / no-events result. Deterministic codex startup failures (auth missing, config error) emit stderr before the outer watchdog fires, so they were falsely retried as stalls and had their summaries overwritten to `codex_acp_initial_stall`, hiding the real cause. Strip the synthetic timeout footer from stderr and require the remainder be empty before classifying as stall - same pattern as the gemini classifier (#912). Closes #854. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…round 3) Round-1 narrowed the initial-response timeout to only reset on "eligible" ACP events (AgentMessageChunk / AgentThought / PlanUpdate / ToolCall*). stderr output still refreshes last_activity, but the watchdog ignored it, so a codex startup that prints auth/setup text on stderr before its first AgentMessage got force-killed after the deadline — then the round-2 stall classifier refused to re-classify (because stderr is non-empty), producing a spurious hard failure. Separate the two signals: timeout expiry uses last_activity (any output, stderr OR eligible event), while the eligible-event filter remains in place for state transitions and stall classification. Closes #854. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…leanup (#854 round 4) csa-acp — Round-3 used last_activity for the pre-first-response timeout, which fixed stderr liveness but accepted protocol-only ACP notifications (AvailableCommandsUpdate etc.) as activity. A codex ACP process that emits only protocol chatter before its first AgentMessage kept the timer permanently fresh, so the retry/ classification path never engaged. Track a separate last_meaningful_activity timestamp, refreshed only by stderr bytes and eligible ACP events (AgentMessageChunk / AgentThought / PlanUpdate / ToolCall*). Protocol-only notifications are ignored for timeout purposes but still counted for normal liveness after the first eligible event arrives. hooks — After just install succeeds in post-merge-rebuild.sh, delete the resolved ./target contents to reclaim space on the mirror disk while preserving the ./target symlink itself. Scoped via find -mindepth 1 so the symlink stays intact. Closes #854. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge audit trail — parser FP + user-directive overrideFinal review session
Decision: merge per severity-tiered criteria (findings=0, severity=0). Rule 010 tension: Per CLAUDE.md instruction hierarchy (User's explicit instructions > Superpowers skills > Default rules), the user's explicit directive to "删除 ./target/* 但不删除软链接" takes priority. Context: user's Merging with Closes #854. |
There was a problem hiding this comment.
Code Review
This pull request implements a more granular activity tracking mechanism for ACP sessions to support initial response timeouts, particularly for the 'codex' tool. It introduces a last_meaningful_activity timestamp that is updated only by specific agent events (like messages or tool calls), allowing the system to distinguish between protocol-level chatter and actual progress. The changes also include logic in the executor to detect and retry initial stalls for codex. Review feedback correctly identifies code duplication where the event filtering logic was copied into the executor crate; it is recommended to make the original function public and reuse it to maintain a single source of truth.
| Other(String), | ||
| } | ||
|
|
||
| pub(crate) fn event_counts_as_initial_response(event: &SessionEvent) -> bool { |
There was a problem hiding this comment.
To allow this function to be reused in csa-executor and avoid code duplication, please make it public. You may also need to ensure the client module is public.
| pub(crate) fn event_counts_as_initial_response(event: &SessionEvent) -> bool { | |
| pub fn event_counts_as_initial_response(event: &SessionEvent) -> bool { |
| fn event_counts_as_codex_acp_initial_response(event: &SessionEvent) -> bool { | ||
| matches!( | ||
| event, | ||
| SessionEvent::AgentMessage(_) | ||
| | SessionEvent::AgentThought(_) | ||
| | SessionEvent::PlanUpdate(_) | ||
| | SessionEvent::ToolCallStarted { .. } | ||
| | SessionEvent::ToolCallCompleted { .. } | ||
| ) | ||
| } |
There was a problem hiding this comment.
| || result | ||
| .events | ||
| .iter() | ||
| .any(event_counts_as_codex_acp_initial_response) |
Summary
Two related changes on this branch:
1. csa-acp — codex initial-response stall detection (Closes #854)
Port initial-response stall classifier to codex-acp transport. Iterated through 4 rounds of review:
c54e86bf— ported stall detector + retry6be39199— gated classifier on empty pre-timeout stderr (symmetric to gemini gemini ACP initial-stall classifier can overwrite specific stderr failure #912)562c2205— switched watchdog timer tolast_activity(respect stderr liveness)9c53af31— introduced separatelast_meaningful_activitythat excludes protocol-only notifications (AvailableCommandsUpdate etc.) but includes stderr + eligible eventsNet: codex ACP stalls are now classifiable and retryable without spurious kills on live-but-slow startups.
2. hooks — post-merge target cleanup
After
just installsucceeds inscripts/hooks/post-merge-rebuild.sh, clear./target/contents viafind target/ -mindepth 1 -delete. The./targetsymlink itself (pointing to secondary disk on obj host) is preserved; only the resolved contents are removed. Per explicit user directive to reclaim mirror-disk space post-merge.Review trail
01KPNPT8PD41CJ5P8H62Q4SWFYon9c53af31:decision=failbutfindings=[]andseverity_countsall zero (known parser FP pattern).Test plan
cargo test --workspacepassed in round-4 commit sessionjust pre-commitpassed🤖 Generated with Claude Code