Skip to content

fix(csa-acp,hooks): meaningful-activity timeout + post-merge target cleanup (#854)#958

Merged
RyderFreeman4Logos merged 4 commits intomainfrom
fix/854-codex-acp-stall-port
Apr 20, 2026
Merged

fix(csa-acp,hooks): meaningful-activity timeout + post-merge target cleanup (#854)#958
RyderFreeman4Logos merged 4 commits intomainfrom
fix/854-codex-acp-stall-port

Conversation

@RyderFreeman4Logos
Copy link
Copy Markdown
Owner

Summary

Two related changes on this branch:

1. csa-acp — codex initial-response stall detection (Closes #854)

Port initial-response stall classifier to codex-acp transport. Iterated through 4 rounds of review:

  • R1 c54e86bf — ported stall detector + retry
  • R2 6be39199 — gated classifier on empty pre-timeout stderr (symmetric to gemini gemini ACP initial-stall classifier can overwrite specific stderr failure #912)
  • R3 562c2205 — switched watchdog timer to last_activity (respect stderr liveness)
  • R4 9c53af31 — introduced separate last_meaningful_activity that excludes protocol-only notifications (AvailableCommandsUpdate etc.) but includes stderr + eligible events

Net: codex ACP stalls are now classifiable and retryable without spurious kills on live-but-slow startups.

2. hooks — post-merge target cleanup

After just install succeeds in scripts/hooks/post-merge-rebuild.sh, clear ./target/ contents via find target/ -mindepth 1 -delete. The ./target symlink itself (pointing to secondary disk on obj host) is preserved; only the resolved contents are removed. Per explicit user directive to reclaim mirror-disk space post-merge.

Review trail

  • Final review 01KPNPT8PD41CJ5P8H62Q4SWFY on 9c53af31: decision=fail but findings=[] and severity_counts all zero (known parser FP pattern).
  • Prose summary flagged Rust rule 010 "preserve target/" tension with the hook change. Per CLAUDE.md instruction hierarchy, explicit user directive overrides general rule guidance.

Test plan

  • cargo test --workspace passed in round-4 commit session
  • just pre-commit passed
  • Version bumped to 0.1.444

🤖 Generated with Claude Code

RyderFreeman4Logos and others added 4 commits April 20, 2026 07:03
…#854)

Issue #757 added an initial-response stall detector to the legacy-CLI codex-exec transport. The ACP codex path had no parity, so a stuck codex-acp session hung until the outer wall-clock timeout (~1800s) killed it.

Port the stall classifier + retry to the codex-acp transport:
- Stall signal = no AgentMessageChunk / AgentThought / PlanUpdate / ToolCall* event within the configured stall window after ACP init.
- Retry budget reuses [tools.codex] initial_response_stall_secs.

Closes #854.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…854 round 2)

Round-1 classifier matched any exit=137 / empty-output / timeout-shaped-summary / no-events result. Deterministic codex startup failures (auth missing, config error) emit stderr before the outer watchdog fires, so they were falsely retried as stalls and had their summaries overwritten to `codex_acp_initial_stall`, hiding the real cause.

Strip the synthetic timeout footer from stderr and require the remainder be empty before classifying as stall - same pattern as the gemini classifier (#912).

Closes #854.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…round 3)

Round-1 narrowed the initial-response timeout to only reset on "eligible" ACP events (AgentMessageChunk / AgentThought / PlanUpdate / ToolCall*). stderr output still refreshes last_activity, but the watchdog ignored it, so a codex startup that prints auth/setup text on stderr before its first AgentMessage got force-killed after the deadline — then the round-2 stall classifier refused to re-classify (because stderr is non-empty), producing a spurious hard failure.

Separate the two signals: timeout expiry uses last_activity (any output, stderr OR eligible event), while the eligible-event filter remains in place for state transitions and stall classification.

Closes #854.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…leanup (#854 round 4)

csa-acp — Round-3 used last_activity for the pre-first-response
timeout, which fixed stderr liveness but accepted protocol-only ACP
notifications (AvailableCommandsUpdate etc.) as activity. A codex
ACP process that emits only protocol chatter before its first
AgentMessage kept the timer permanently fresh, so the retry/
classification path never engaged.

Track a separate last_meaningful_activity timestamp, refreshed only
by stderr bytes and eligible ACP events (AgentMessageChunk /
AgentThought / PlanUpdate / ToolCall*). Protocol-only notifications
are ignored for timeout purposes but still counted for normal
liveness after the first eligible event arrives.

hooks — After just install succeeds in post-merge-rebuild.sh,
delete the resolved ./target contents to reclaim space on the
mirror disk while preserving the ./target symlink itself. Scoped
via find -mindepth 1 so the symlink stays intact.

Closes #854.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RyderFreeman4Logos
Copy link
Copy Markdown
Owner Author

Merge audit trail — parser FP + user-directive override

Final review session 01KPNPT8PD41CJ5P8H62Q4SWFY on HEAD 9c53af31:

Decision: merge per severity-tiered criteria (findings=0, severity=0).

Rule 010 tension: Per CLAUDE.md instruction hierarchy (User's explicit instructions > Superpowers skills > Default rules), the user's explicit directive to "删除 ./target/* 但不删除软链接" takes priority. Context: user's ./target is a symlink to /ssd/mirror-rootfs/... on a secondary disk; cleanup reclaims mirror-disk space on merge, while the symlink stays intact so cargo build still routes to the secondary disk. The hook also only fires OUTSIDE CSA sessions (CSA_SESSION_ID guard).

Merging with --merge (preserves per-commit audit trail, including the 4 rounds of #854 iteration).

Closes #854.

@RyderFreeman4Logos RyderFreeman4Logos merged commit 16b3e52 into main Apr 20, 2026
6 of 7 checks passed
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a more granular activity tracking mechanism for ACP sessions to support initial response timeouts, particularly for the 'codex' tool. It introduces a last_meaningful_activity timestamp that is updated only by specific agent events (like messages or tool calls), allowing the system to distinguish between protocol-level chatter and actual progress. The changes also include logic in the executor to detect and retry initial stalls for codex. Review feedback correctly identifies code duplication where the event filtering logic was copied into the executor crate; it is recommended to make the original function public and reuse it to maintain a single source of truth.

Other(String),
}

pub(crate) fn event_counts_as_initial_response(event: &SessionEvent) -> bool {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To allow this function to be reused in csa-executor and avoid code duplication, please make it public. You may also need to ensure the client module is public.

Suggested change
pub(crate) fn event_counts_as_initial_response(event: &SessionEvent) -> bool {
pub fn event_counts_as_initial_response(event: &SessionEvent) -> bool {

Comment on lines +380 to +389
fn event_counts_as_codex_acp_initial_response(event: &SessionEvent) -> bool {
matches!(
event,
SessionEvent::AgentMessage(_)
| SessionEvent::AgentThought(_)
| SessionEvent::PlanUpdate(_)
| SessionEvent::ToolCallStarted { .. }
| SessionEvent::ToolCallCompleted { .. }
)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This function is a duplicate of event_counts_as_initial_response in csa-acp/src/client.rs. To avoid code duplication and maintain a single source of truth, this duplicated function should be removed.

You'll need to make event_counts_as_initial_response in csa-acp public to be able to use it here.

|| result
.events
.iter()
.any(event_counts_as_codex_acp_initial_response)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To complete the deduplication, please update this call to use the function from the csa-acp crate.

Suggested change
.any(event_counts_as_codex_acp_initial_response)
.any(csa_acp::client::event_counts_as_initial_response)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

codex-acp: port initial-response stall detection + retry parity from legacy CLI (#757 followup)

1 participant