Skip to content

fix: force C locale for internal git#1294

Open
nalajcie wants to merge 1 commit into
git-ai-project:mainfrom
nalajcie:fix/internal-git-c-locale
Open

fix: force C locale for internal git#1294
nalajcie wants to merge 1 commit into
git-ai-project:mainfrom
nalajcie:fix/internal-git-c-locale

Conversation

@nalajcie
Copy link
Copy Markdown

@nalajcie nalajcie commented May 7, 2026

Summary

Force LC_ALL=C on internal git subprocesses so parser-facing git output is stable across user locales, and normalize C-quoted --numstat path tokens before matching them.

Problem

Some internal flows parse git stderr to distinguish expected conditions from real failures. One example is authorship notes fetch handling, where a missing remote refs/notes/ai ref should be treated as "no remote notes yet" instead of a command failure.

That parsing currently assumes English git output. I observed local test failures when git emitted localized Polish stderr for the missing notes ref, e.g. fatal: nie znaleziono zdalnej referencji refs/notes/ai, which did not match the existing English missing-ref checks.

Forcing LC_ALL=C fixes the localized stderr problem, but it also makes Git C-quote non-ASCII filenames in git diff --numstat output. The unified diff parsers already normalize quoted Git path tokens via unescape_git_path; the numstat parsers did not, so non-ASCII filenames could fail status pathspec post-filtering or ignore matching in status/range stats.

Fix

  • Set LC_ALL=C centrally for internal git command execution in src/git/repository.rs.
  • Keep GIT_EXTERNAL_DIFF and GIT_DIFF_OPTS cleanup in the same helper so internal git command environment normalization remains in one place.
  • Unescape --numstat filename tokens before status pathspec post-filtering.
  • Unescape --numstat filename tokens before status and range-authorship ignore matching.
  • Add focused regression tests for C-quoted non-ASCII --numstat paths.

I kept the current line-oriented --numstat parsing instead of switching to -z, because the existing parser structure only needed the same Git path-token normalization already used by unified diff parsing. A NUL-separated parser would be a larger behavioral rewrite for the same issue.

Testing

The new focused regression tests failed before the unescape fix:

  • commands::status::tests::numstat_post_filter_matches_c_quoted_non_ascii_paths
  • commands::status::tests::numstat_ignore_matcher_matches_c_quoted_non_ascii_paths
  • authorship::range_authorship::tests::numstat_ignore_matcher_matches_c_quoted_non_ascii_paths

Validated after the fix with:

  • task test TEST_FILTER=numstat_ NO_CAPTURE=true
  • task test TEST_FILTER=status_ignore NO_CAPTURE=true
  • task test TEST_FILTER=range_authorship NO_CAPTURE=true
  • task test TEST_FILTER=utf8_filenames NO_CAPTURE=true (101/102 passed; one daemon completion timeout passed on exact retry)
  • task test TEST_FILTER=test_rtl_with_ltr_mixed_filename_in_worktree NO_CAPTURE=true
  • task fmt
  • task lint

Earlier locale-specific validation:

  • task test TEST_FILTER=internal_git_command_forces_c_locale NO_CAPTURE=true
  • task test TEST_FILTER=fetch_notes::test_fetch_notes_no_remote_notes NO_CAPTURE=true
  • task test TEST_FILTER=fetch_notes NO_CAPTURE=true
  • task test TEST_FILTER=ci_local_skip_fetch NO_CAPTURE=true
  • task test TEST_FILTER=internal_machine_commands::test_fetch_and_push_authorship_notes_internal_commands_json NO_CAPTURE=true

Open in Devin Review

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 7, 2026

CLA assistant check
All committers have signed the CLA.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

Open in Devin Review

@svarlamov svarlamov assigned heapwolf and svarlamov and unassigned heapwolf May 7, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

@nalajcie
Copy link
Copy Markdown
Author

nalajcie commented May 8, 2026

the windows CI seems a little bit flaky, those changes passed in my fork:

Note: this is a side issue I've found during implementing submodule support (nalajcie#2), but I'll test it locally a little bit before submitting official PR. On this PR - the windows (daemon) tests failed (the same issue: lost connection with the worker), but passed upon rerun.

@nalajcie nalajcie force-pushed the fix/internal-git-c-locale branch 3 times, most recently from 3fb6997 to 28dd32a Compare May 8, 2026 11:45
Set LC_ALL=C on internal git subprocesses so parser-facing output
stays stable across user locales.

C locale makes git --numstat C-quote non-ASCII pathnames, so
unescape numstat filename tokens before pathspec filtering and
ignore matching in status and range stats.

This keeps missing remote notes detection and other git stderr parsing
on existing English code paths without matching translated messages.

Co-Authored-By: GitHub Copilot <copilot@github.com>
@nalajcie nalajcie force-pushed the fix/internal-git-c-locale branch from 28dd32a to bab2f02 Compare May 8, 2026 13:33
@svarlamov
Copy link
Copy Markdown
Member

Thank you @nalajcie -- I do think that Devin finding was a real issue so good to see that patched with regression tests. Windows CI is indeed a bit flaky, but if you rebase on latest main it should be mostly addressed

@svarlamov svarlamov assigned heapwolf and unassigned svarlamov May 12, 2026
@svarlamov
Copy link
Copy Markdown
Member

Thanks for the contribution again @nalajcie

Added @heapwolf to help take a review pass and should go out probably in the release after next

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 9 additional findings in Devin Review.

Open in Devin Review

Comment on lines +2611 to +2620
} else {
// Unquoted token: consume until whitespace
while let Some(&ch) = chars.peek() {
if ch.is_whitespace() {
break;
}
token.push(ch);
chars.next();
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 parse_two_path_tokens splits on spaces within unquoted filenames, breaking diff --git header parsing for files with spaces

The new parse_two_path_tokens function splits unquoted tokens on any whitespace character. Git does NOT C-quote paths that only contain spaces (space 0x20 is in the printable ASCII range and doesn't trigger quote_c_style), so a file named hello world.txt produces the header diff --git a/hello world.txt b/hello world.txt. The old code line.split(" b/").last() correctly extracted hello world.txt, but parse_two_path_tokens("a/hello world.txt b/hello world.txt") returns tokens ("a/hello", "world.txt"), yielding the wrong path world.txt. This causes current_diff_file in run_diff_tree_with_hunks (src/authorship/rebase_authorship.rs:2837) to be set incorrectly, so hunks are tracked against a nonexistent file path, leading to silently lost attribution data during rebase for any file whose name contains spaces.

Prompt for agents
The parse_two_path_tokens function in src/authorship/rebase_authorship.rs:2579-2633 incorrectly splits unquoted path tokens on whitespace. Git only C-quotes paths containing non-printable characters, backslashes, double-quotes, or non-ASCII bytes — NOT spaces. So filenames with spaces (e.g. 'hello world.txt') appear unquoted in diff --git headers.

The diff --git header format is inherently ambiguous for unquoted paths with spaces. Git itself resolves this by relying on the --- and +++ lines, not the diff --git line. One approach to fix this:

1. For quoted paths (starting with "), the current logic is correct.
2. For unquoted paths, you cannot simply split on whitespace. Instead, leverage the fact that in diff --git headers, the a-side and b-side paths are identical (for non-rename cases). You can try to find the split point by looking for the pattern where the second half starts with 'b/' and the remainder equals the first half with 'a/' prefix replaced. For example, scan for ' b/' separators and check if a/X matches b/X.
3. Alternatively, fall back to the old approach of splitting on ' b/' for the unquoted case only, while keeping the quoted-path parsing for C-quoted headers.

The function is called from parse_diff_git_new_path at line 2567, which is used in run_diff_tree_with_hunks at line 2837 to set current_diff_file for hunk tracking during rebase.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants