Skip to content

perf: optimize checkpoint flow with 9 targeted improvements#1292

Open
svarlamov wants to merge 4 commits into
mainfrom
ckpt-perf-may-5-opts-follow-up
Open

perf: optimize checkpoint flow with 9 targeted improvements#1292
svarlamov wants to merge 4 commits into
mainfrom
ckpt-perf-may-5-opts-follow-up

Conversation

@svarlamov
Copy link
Copy Markdown
Member

@svarlamov svarlamov commented May 6, 2026

Summary

  • Add parent-dir keyed repo discovery cache to avoid redundant directory walks for sibling files
  • Hoist create_dir_all out of per-file async loop and skip existing blob writes (content-addressed dedup)
  • Wrap dirty_files in Arc to avoid deep-cloning all file contents on every checkpoint
  • Eliminate redundant second Myers diff by deriving line stats from attribution diff ops (with CRLF normalization)
  • Avoid entries.clone() by moving entries into Checkpoint and iterating checkpoint.entries for metrics
  • JSONL append-only instead of full read/prune/rewrite on every checkpoint (lazy compaction at 200KB)
  • Short-circuit hash migration when no 7-char hashes exist (common case for modern data)
  • Reverse iteration in build_previous_file_state_maps to skip cloning attributions for overwritten entries

Test plan

  • Full test suite passes (4,549 tests, 0 failures)
  • All CRLF-specific tests pass (7 tests validating line stats aren't inflated)
  • task lint and task fmt clean
  • Checkpoint perf tests validate p95 < 200ms SLA

🤖 Generated with Claude Code


Open in Devin Review

devin-ai-integration[bot]

This comment was marked as resolved.

svarlamov and others added 3 commits May 11, 2026 22:45
- Add parent-dir keyed repo discovery cache to avoid redundant directory
  walks for sibling files in build_checkpoint_files
- Hoist create_dir_all out of per-file async loop in save_current_file_states
- Skip blob write when content-addressed file already exists
- Wrap dirty_files in Arc to avoid deep-cloning all file contents
- Eliminate redundant second Myers diff by deriving line stats from
  attribution diff ops (with CRLF normalization)
- Avoid entries.clone() by moving entries into Checkpoint and iterating
  checkpoint.entries for metrics
- JSONL append-only instead of full read/prune/rewrite on every checkpoint
  (lazy compaction at 200KB threshold)
- Short-circuit hash migration when no 7-char hashes exist (common case)
- Reverse iteration in build_previous_file_state_maps to avoid cloning
  attributions for entries that would be overwritten

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The positional comparison in Replace blocks would produce inflated stats
when CRLF changes co-occurred with insertions/deletions that shifted
line positions. Replace with a sub-diff on normalized slices within the
Replace block so the diff algorithm correctly identifies which lines
actually changed regardless of line-ending differences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Proves that the sub-diff approach in Replace blocks correctly handles
CRLF→LF conversion combined with deletions/insertions that shift line
positions — the exact scenario flagged by Devin's review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@svarlamov svarlamov force-pushed the ckpt-perf-may-5-opts-follow-up branch from 71370f7 to f715011 Compare May 11, 2026 22:47
devin-ai-integration[bot]

This comment was marked as resolved.

When read_all_checkpoints() fails during compaction (e.g. corrupted
JSON line from interrupted write), unwrap_or_default() would return
an empty Vec and write_all_checkpoints would erase all data. Now we
skip compaction entirely on read failure, preserving existing data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant