fix(session): atomic reconciliation path, cleanup on failure (#654)#655
Conversation
|
/gemini review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 45c7c0d42c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if read_session_daemon_pid(session_dir).is_some_and(is_process_alive) { | ||
| return true; |
There was a problem hiding this comment.
Validate daemon.pid ownership before skipping reconciliation
This branch now treats any live PID in daemon.pid as proof the session is still running, but daemon.pid can outlive the original daemon and later refer to a reused PID from an unrelated process. In that case reconciliation is skipped forever, leaving a dead Active session unreconciled (no synthetic result and no retirement). The previous ToolLiveness::has_live_process path guarded against this by checking session context; this helper should preserve that validation.
Useful? React with 👍 / 👎.
| let Ok(content) = fs::read_to_string(&path) else { | ||
| return false; | ||
| }; | ||
| extract_lock_pid(&content).is_some_and(is_process_alive) |
There was a problem hiding this comment.
Filter lock-file PIDs by session context and staleness
The lock-file path now considers a session alive whenever any PID parsed from locks/*.lock currently exists, without checking that the PID still belongs to this session or that the lock file is recent. Stale lock files combined with PID reuse can therefore suppress reconciliation indefinitely for dead sessions. csa_process::tool_liveness::find_session_pid previously included context/recency checks, and this replacement should keep equivalent safeguards.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request updates the workspace version to 0.1.260 and introduces a robust reconciliation locking mechanism in the cli-sub-agent crate. Key changes include the implementation of advisory file locking to prevent concurrent session reconciliation, atomic state persistence using temporary files, and improved liveness detection for tool processes. Feedback was provided regarding the brittle manual string parsing used to extract PIDs from lock files, suggesting a more robust JSON deserialization approach instead.
| fn extract_lock_pid(lock_content: &str) -> Option<u32> { | ||
| let pid_key_pos = lock_content.find("\"pid\"")?; | ||
| let tail = &lock_content[pid_key_pos..]; | ||
| let colon_pos = tail.find(':')?; | ||
| let number = tail[colon_pos + 1..] | ||
| .chars() | ||
| .skip_while(|ch| ch.is_ascii_whitespace()) | ||
| .take_while(|ch| ch.is_ascii_digit()) | ||
| .collect::<String>(); | ||
| number.parse::<u32>().ok() | ||
| } |
There was a problem hiding this comment.
The current implementation for extracting the PID from lock file content is brittle as it relies on string searching. This can fail if the JSON is formatted differently or if another field's value contains the string \"pid\".
A more robust approach is to parse the content as JSON using serde_json, which is already a dependency. This can be done cleanly by defining a local struct for deserialization.
fn extract_lock_pid(lock_content: &str) -> Option<u32> {
#[derive(serde::Deserialize)]
struct LockFileContent {
pid: u32,
}
serde_json::from_str::<LockFileContent>(lock_content).ok().map(|data| data.pid)
}There was a problem hiding this comment.
Code Review
This pull request increments the workspace version to 0.1.260 and introduces a robust session reconciliation mechanism in the cli-sub-agent crate. Key improvements include the implementation of advisory file locking on Unix to prevent concurrent reconciliation, atomic persistence of session states and results using temporary files, and enhanced liveness detection for tool processes. The review feedback highlights several areas for improvement: the current process liveness and file locking implementations are Linux-specific and will fail or allow race conditions on Windows, and the manual string parsing used to extract PIDs from lock files should be replaced with a more robust JSON deserialization approach.
| #[cfg(not(unix))] | ||
| { | ||
| std::path::Path::new(&format!("/proc/{pid}/stat")).exists() | ||
| } |
There was a problem hiding this comment.
The non-Unix implementation of is_process_alive relies on /proc/{pid}/stat, which is specific to Linux and not available on other non-Unix platforms like Windows. This will cause liveness checks to fail on Windows.
To ensure cross-platform compatibility, you should use platform-specific APIs. For Windows, you can use OpenProcess from windows-sys.
#[cfg(windows)]
{
use windows_sys::Win32::System::Threading::{OpenProcess, PROCESS_QUERY_LIMITED_INFORMATION};
use windows_sys::Win32::Foundation::{CloseHandle, GetLastError, ERROR_ACCESS_DENIED};
// SAFETY: Calling OpenProcess is safe. We check the handle and close it.
let handle = unsafe { OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, 0, pid) };
if !handle.is_null() {
// Process exists.
unsafe { CloseHandle(handle) };
return true;
}
// If OpenProcess fails, check if it's because we don't have access,
// which still implies the process exists.
if unsafe { GetLastError() } == ERROR_ACCESS_DENIED {
return true;
}
false
}
#[cfg(not(any(unix, windows)))]
{
// The /proc check is Linux-specific. For other non-unix, non-windows
// platforms, we don't have a generic way to check process liveness.
false
}| #[cfg(not(unix))] | ||
| { | ||
| let _ = trigger; | ||
| Ok(Some((session_dir, ReconcileLock { file }))) | ||
| } |
There was a problem hiding this comment.
The acquire_reconcile_lock function does not implement any locking mechanism on non-Unix platforms. This could lead to race conditions between concurrent reconciliation processes on Windows. To ensure atomicity, consider implementing file locking for Windows using LockFileEx from the windows-sys crate.
| fn extract_lock_pid(lock_content: &str) -> Option<u32> { | ||
| let pid_key_pos = lock_content.find("\"pid\"")?; | ||
| let tail = &lock_content[pid_key_pos..]; | ||
| let colon_pos = tail.find(':')?; | ||
| let number = tail[colon_pos + 1..] | ||
| .chars() | ||
| .skip_while(|ch| ch.is_ascii_whitespace()) | ||
| .take_while(|ch| ch.is_ascii_digit()) | ||
| .collect::<String>(); | ||
| number.parse::<u32>().ok() | ||
| } |
There was a problem hiding this comment.
This implementation for parsing the PID from the lock file content is brittle as it relies on string searching. A small change in the lock file format (e.g., extra whitespace) could break it. Since serde_json is already a dependency in this crate, consider using it for a more robust parsing approach.
fn extract_lock_pid(lock_content: &str) -> Option<u32> {
#[derive(serde::Deserialize)]
struct LockFile {
pid: u32,
}
serde_json::from_str::<LockFile>(lock_content).ok().map(|l| l.pid)
}Addresses PR #655 round-2 findings: - codex P1 x2: restore session context + staleness check for daemon.pid / lock-file PID - gemini-code-assist high: parse lock file via serde_json, not substring search Choice A (restore helpers) was taken. Sub-agent logic in session_cmds_reconcile.rs was replaced with ToolLiveness::has_live_process from csa-process. ToolLiveness was strengthened to: 1. Use serde_json for robust PID extraction. 2. Require BOTH context match AND staleness for lock files (&& instead of ||). 3. Exclude reconcile.lock from tool discovery. Tests added for PID reuse (via spawned sleep), stale locks, and complex JSON. Tests moved to separate session_cmds_reconcile_tests.rs to satisfy 800-line limit. Fixed existing tests in csa-process and cli-sub-agent broken by stronger context check.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request increments the workspace version to 0.1.260 and introduces a file-based locking mechanism using libc::flock to synchronize session reconciliation. It refactors session state and result persistence to use atomic file operations via temporary files and replaces manual PID parsing with serde_json for better reliability. Review feedback identifies a critical regression in liveness detection logic where a change from || to && could break functionality on non-Linux platforms or for busy processes. Additionally, the logic for skipping reconciliation locks contains a filename stem mismatch, and the transition to NamedTempFile may unintentionally restrict file permissions to 0600.
| } | ||
|
|
||
| process_matches_session_context(pid, tool_name, session_dir) || recent_file.unwrap_or(false) | ||
| process_matches_session_context(pid, tool_name, session_dir) && recent_file.unwrap_or(true) |
There was a problem hiding this comment.
Changing the logic from || to && here is a regression that breaks liveness detection in two ways:
- Platform Support: On platforms without
/proc(like macOS or Windows),process_matches_session_contextalways returnsfalse, meaningpid_matches_session_contextwill now always returnfalse, breaking liveness detection entirely. - Robustness: On Linux, if a tool is busy and hasn't updated its lock file for more than 60 seconds, it will be incorrectly identified as dead even if the process is still running and matches the context.
It should be reverted to || and use unwrap_or(false) to ensure that matching the process context is sufficient for liveness, while falling back to the 'recent' check when context matching is unavailable or inconclusive.
| process_matches_session_context(pid, tool_name, session_dir) && recent_file.unwrap_or(true) | |
| process_matches_session_context(pid, tool_name, session_dir) || recent_file.unwrap_or(false) |
| if stem.is_some_and(|s| s == "reconcile") { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
There are two issues with this check:
- Stem Mismatch: For a file named
.reconcile.lock,path.file_stem()returns.reconcile. The checks == "reconcile"will fail to match. - Directory Mismatch:
find_session_pidonly iterates over thelocks/subdirectory, butacquire_reconcile_lockcreates the lock file in the parentsession_dir.
To correctly ignore the reconciliation lock, this check should be moved or updated to look for .reconcile.lock in the appropriate directory.
| if stem.is_some_and(|s| s == "reconcile") { | |
| continue; | |
| } | |
| if stem.is_some_and(|s| s == ".reconcile") { | |
| continue; | |
| } |
| fn persist_session_state_atomically(session_dir: &Path, session: &MetaSessionState) -> Result<()> { | ||
| let state_path = session_dir.join("state.toml"); | ||
| let contents = toml::to_string_pretty(session).context("Failed to serialize session state")?; | ||
| let mut temp_file = tempfile::NamedTempFile::new_in(session_dir).with_context(|| { |
There was a problem hiding this comment.
Using tempfile::NamedTempFile for atomic persistence changes the file permissions to 0600 (private to the user). While this is more secure, it is a change from the previous behavior where state.toml and result.toml likely followed the default umask (e.g., 0644). If these files need to be readable by other users or processes in a shared environment, you should explicitly set the permissions before persisting.
… permissions (#655) Addresses PR #655 round-3 narrow findings: - fix(csa-process): revert && to || in pid_matches_session_context to restore non-Linux support and busy-tool fallback. - fix(csa-process): delete dead reconcile lock skip in find_session_pid (option C). reconcile.lock lives in session_dir, while loop iterates locks/. - fix(cli-sub-agent): set explicit 0644 permissions before persisting state.toml and result.toml to avoid 0600 regression from tempfile. - Update tests to age lock files when testing context-match failure.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the workspace version to 0.1.260 and implements a reconciliation locking mechanism to prevent race conditions. It introduces atomic session state persistence using temporary files and refactors lock file PID extraction in the csa-process crate to use serde_json. Feedback identifies several cross-platform compatibility issues, specifically noting that the new locking implementation and multiple test cases rely on Unix-specific utilities and syscalls, which will cause failures or bypass protections on Windows.
| #[test] | ||
| fn ensure_terminal_result_for_dead_active_session_reconciles_even_with_reused_pid_in_daemon_pid() { |
There was a problem hiding this comment.
This test uses the sleep command (line 334), which is typically not available on Windows. Additionally, subsequent tests in this file use backdate_tree, which is explicitly defined only for Unix (line 94). These tests will fail to compile or run on Windows. Consider marking them with #[cfg(unix)].
| #[test] | |
| fn ensure_terminal_result_for_dead_active_session_reconciles_even_with_reused_pid_in_daemon_pid() { | |
| #[cfg(unix)] | |
| #[test] | |
| fn ensure_terminal_result_for_dead_active_session_reconciles_even_with_reused_pid_in_daemon_pid() { |
| #[test] | ||
| fn ensure_terminal_result_for_dead_active_session_reconciles_even_with_stale_lock_file() { |
There was a problem hiding this comment.
This test calls backdate_tree, which is only defined for Unix platforms. This will cause a compilation error on Windows. Please restrict this test to Unix.
| #[test] | |
| fn ensure_terminal_result_for_dead_active_session_reconciles_even_with_stale_lock_file() { | |
| #[cfg(unix)] | |
| #[test] | |
| fn ensure_terminal_result_for_dead_active_session_reconciles_even_with_stale_lock_file() { |
| #[cfg(not(unix))] | ||
| { | ||
| let _ = trigger; | ||
| Ok(Some((session_dir, ReconcileLock { file }))) | ||
| } |
There was a problem hiding this comment.
The reconciliation lock is currently a no-op on non-Unix platforms. This leaves Windows users vulnerable to the race conditions between concurrent reconciliation processes that this PR aims to address. Since fd-lock is already a workspace dependency, consider using it for a cross-platform advisory lock instead of a platform-specific flock implementation.
| let mut child = std::process::Command::new("sh") | ||
| .arg("-c") | ||
| .arg("sleep 60 # codex") | ||
| .spawn() | ||
| .unwrap(); |
There was a problem hiding this comment.
Using sh -c in tests makes them dependent on a Unix-like environment, causing failures on Windows. Since these tests are intended to verify the new context-aware liveness logic, consider marking them with #[cfg(unix)] or using a more portable way to spawn a dummy process with a specific command line.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the workspace version to 0.1.260 and introduces a cross-platform advisory locking mechanism using fd-lock to prevent concurrent session reconciliation. It also improves the robustness of session state persistence by making it atomic via temporary files and refactors PID extraction to use serde_json. A high-severity issue was identified in the locking implementation where the use of Box::leak causes permanent memory and file descriptor leaks, potentially leading to resource exhaustion in long-running processes.
| // Cross-platform advisory locking via fd-lock. | ||
| // Use Box::leak to provide the 'static lifetime for the guard since ReconcileLock | ||
| // needs to own it and be returned to the caller. | ||
| let lock = Box::leak(Box::new(fd_lock::RwLock::new(file))); |
There was a problem hiding this comment.
The use of Box::leak here creates a permanent memory and file descriptor leak every time a reconciliation lock is successfully acquired. Since the fd_lock::RwLock (which owns the fs::File) is leaked to provide a 'static lifetime for the guard, the underlying file descriptor is never closed. In a long-running process or daemon, this will eventually lead to file descriptor exhaustion and increased memory usage.
Consider using a locking implementation that allows for an owned guard or doesn't require a separate lock object with a longer lifetime. For example, the fs4 crate (already present in the workspace) provides a lock_exclusive method directly on std::fs::File which is released when the file is dropped, avoiding the need for Box::leak and guards tied to a separate lock object.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the project version to 0.1.260 and introduces a file-based locking mechanism using flock on Unix to prevent concurrent session reconciliation. The reconciliation logic has been refactored to ensure atomic state updates via temporary files and improved error handling that rolls back synthetic results if state persistence fails. Additionally, PID extraction from lock files was modernized to use JSON deserialization. Review feedback highlights an inconsistency in how different rollback failures are reported to the caller and a logic discrepancy in the Windows stub for the reconciliation lock where the code skips reconciliation entirely despite comments suggesting it should proceed unprotected.
| result_path.display() | ||
| ) | ||
| })?; | ||
| return Ok(DeadActiveSessionReconciliation::NoChange); |
There was a problem hiding this comment.
There is an inconsistency in how failures are handled during the reconciliation rollback. If apply_phase_event fails (line 175), the function returns Ok(DeadActiveSessionReconciliation::NoChange) after cleaning up the synthetic result. However, if persist_session fails (line 192), it returns an Err after the same cleanup. Since both cases result in a successful rollback to the original Active state (durably), it would be more consistent to return Ok(NoChange) for both, or Err for both if any unexpected failure during the reconciliation process should be reported as an error to the caller.
| #[cfg(not(unix))] | ||
| { | ||
| // Windows reconcile is unprotected against concurrent races. Acceptable | ||
| // because Windows is not a CI target for this project. If Windows support | ||
| // becomes a goal, replace this stub with a proper LockFileEx-based impl. | ||
| Ok(None) | ||
| } |
There was a problem hiding this comment.
The implementation for non-Unix platforms returns Ok(None), which causes the caller to skip reconciliation entirely (returning DeadActiveSessionReconciliation::NoChange). This contradicts the comment stating that reconciliation is "unprotected against concurrent races" on Windows. If the intention is to allow reconciliation to proceed without locking on Windows, this should return Ok(Some(...)) with a dummy lock. If the intention is to disable it (perhaps because process liveness checks are not yet implemented for Windows), the comment should be updated to reflect that reconciliation is currently unsupported on Windows.
…oncile - Find 1: Make rollback error handling consistent by propagating error on phase transition failure (Option Y). - Find 2: Fix non-Unix lock stub to return Some(no-op guard) so reconciliation proceeds on Windows (Option A). - Update tests to reflect error propagation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a reconciliation lock mechanism to prevent concurrent reconciliation of dead active sessions and implements atomic session state persistence. It also refactors PID extraction from lock files to use serde_json and updates liveness tests to use actual spawned processes for more realistic validation. Feedback suggests ensuring the new lock file is ignored by liveness checks to prevent false activity signals and recommends using the fd-lock crate instead of raw libc::flock for more idiomatic and cross-platform advisory locking.
| _trigger: &str, | ||
| ) -> Result<Option<(PathBuf, ReconcileLock)>> { | ||
| let session_dir = get_session_dir(project_root, session_id)?; | ||
| let lock_path = session_dir.join(".reconcile.lock"); |
There was a problem hiding this comment.
The PR description mentions that .reconcile.lock should be ignored by liveness checks. However, ToolLiveness::has_recent_session_write_signal (in crates/csa-process/src/tool_liveness.rs) currently scans the session directory and will pick up modifications to this lock file as a liveness signal. This could cause sessions to be incorrectly perceived as active during reconciliation, potentially interfering with idle timeouts. Consider updating has_recent_session_write_signal to explicitly ignore this lock file, similar to how it handles .liveness.snapshot.
| let fd = file.as_raw_fd(); | ||
| // SAFETY: fd is a valid open file descriptor owned by `file`. | ||
| let rc = unsafe { libc::flock(fd, libc::LOCK_EX | libc::LOCK_NB) }; |
There was a problem hiding this comment.
…itch to fd-lock closure pattern
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the workspace version to 0.1.260 and introduces several improvements to session reconciliation and file persistence. Key changes include the implementation of a file-based lock (fd-lock) to prevent concurrent reconciliation of the same session, and the adoption of atomic write patterns for session state and result files using temporary files. Additionally, PID extraction from lock files was refactored to use serde_json for better reliability, and ToolLiveness checks were updated to ignore reconciler artifacts. I have no feedback to provide as there are no review comments to address.
Summary
Comprehensive hardening of the dead-Active reconciliation path in
session_cmds_reconcile.rs, driven by issue #654 (syntheticresult.tomlnot cleaned up whenapply_phase_event(Retired)fails).Instead of the narrow round-4 finding, this PR enumerates all partial-state windows and makes the path atomic with rollback:
result.tomlunlessstate.tomlhas been durably persisted asRetiredwith the matchingtermination_reason.apply_phase_eventfailure previously leftresult.toml+Active; fix: remove synthetic result on transition failure; skip cooldown marker on failure.result.toml+ stale state; fix: atomically persiststate.tomlbefore cooldown marker, remove synthetic result on persistence failure..reconcile.lock, ignored by liveness checks.Review (gemini-cli, tier-4-critical): PASS on first run (session
01KNX87SS2DZQDQED0D139FHPM).Closes #654.
Test plan
cargo test -p cli-sub-agentgreen (new failure-injection tests)just pre-commitgreencsa review --tier tier-4-criticalverdict PASS (first run)🤖 Generated with Claude Code