fix: force C locale for internal git#1294
Conversation
|
the windows CI seems a little bit flaky, those changes passed in my fork: Note: this is a side issue I've found during implementing submodule support (nalajcie#2), but I'll test it locally a little bit before submitting official PR. On this PR - the windows (daemon) tests failed (the same issue: lost connection with the worker), but passed upon rerun. |
3fb6997 to
28dd32a
Compare
Set LC_ALL=C on internal git subprocesses so parser-facing output stays stable across user locales. C locale makes git --numstat C-quote non-ASCII pathnames, so unescape numstat filename tokens before pathspec filtering and ignore matching in status and range stats. This keeps missing remote notes detection and other git stderr parsing on existing English code paths without matching translated messages. Co-Authored-By: GitHub Copilot <copilot@github.com>
28dd32a to
bab2f02
Compare
|
Thank you @nalajcie -- I do think that Devin finding was a real issue so good to see that patched with regression tests. Windows CI is indeed a bit flaky, but if you rebase on latest |
| } else { | ||
| // Unquoted token: consume until whitespace | ||
| while let Some(&ch) = chars.peek() { | ||
| if ch.is_whitespace() { | ||
| break; | ||
| } | ||
| token.push(ch); | ||
| chars.next(); | ||
| } | ||
| } |
There was a problem hiding this comment.
🟡 parse_two_path_tokens splits on spaces within unquoted filenames, breaking diff --git header parsing for files with spaces
The new parse_two_path_tokens function splits unquoted tokens on any whitespace character. Git does NOT C-quote paths that only contain spaces (space 0x20 is in the printable ASCII range and doesn't trigger quote_c_style), so a file named hello world.txt produces the header diff --git a/hello world.txt b/hello world.txt. The old code line.split(" b/").last() correctly extracted hello world.txt, but parse_two_path_tokens("a/hello world.txt b/hello world.txt") returns tokens ("a/hello", "world.txt"), yielding the wrong path world.txt. This causes current_diff_file in run_diff_tree_with_hunks (src/authorship/rebase_authorship.rs:2837) to be set incorrectly, so hunks are tracked against a nonexistent file path, leading to silently lost attribution data during rebase for any file whose name contains spaces.
Prompt for agents
The parse_two_path_tokens function in src/authorship/rebase_authorship.rs:2579-2633 incorrectly splits unquoted path tokens on whitespace. Git only C-quotes paths containing non-printable characters, backslashes, double-quotes, or non-ASCII bytes — NOT spaces. So filenames with spaces (e.g. 'hello world.txt') appear unquoted in diff --git headers.
The diff --git header format is inherently ambiguous for unquoted paths with spaces. Git itself resolves this by relying on the --- and +++ lines, not the diff --git line. One approach to fix this:
1. For quoted paths (starting with "), the current logic is correct.
2. For unquoted paths, you cannot simply split on whitespace. Instead, leverage the fact that in diff --git headers, the a-side and b-side paths are identical (for non-rename cases). You can try to find the split point by looking for the pattern where the second half starts with 'b/' and the remainder equals the first half with 'a/' prefix replaced. For example, scan for ' b/' separators and check if a/X matches b/X.
3. Alternatively, fall back to the old approach of splitting on ' b/' for the unquoted case only, while keeping the quoted-path parsing for C-quoted headers.
The function is called from parse_diff_git_new_path at line 2567, which is used in run_diff_tree_with_hunks at line 2837 to set current_diff_file for hunk tracking during rebase.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
Force
LC_ALL=Con internal git subprocesses so parser-facing git output is stable across user locales, and normalize C-quoted--numstatpath tokens before matching them.Problem
Some internal flows parse git stderr to distinguish expected conditions from real failures. One example is authorship notes fetch handling, where a missing remote
refs/notes/airef should be treated as "no remote notes yet" instead of a command failure.That parsing currently assumes English git output. I observed local test failures when git emitted localized Polish stderr for the missing notes ref, e.g.
fatal: nie znaleziono zdalnej referencji refs/notes/ai, which did not match the existing English missing-ref checks.Forcing
LC_ALL=Cfixes the localized stderr problem, but it also makes Git C-quote non-ASCII filenames ingit diff --numstatoutput. The unified diff parsers already normalize quoted Git path tokens viaunescape_git_path; the numstat parsers did not, so non-ASCII filenames could fail status pathspec post-filtering or ignore matching in status/range stats.Fix
LC_ALL=Ccentrally for internal git command execution insrc/git/repository.rs.GIT_EXTERNAL_DIFFandGIT_DIFF_OPTScleanup in the same helper so internal git command environment normalization remains in one place.--numstatfilename tokens before status pathspec post-filtering.--numstatfilename tokens before status and range-authorship ignore matching.--numstatpaths.I kept the current line-oriented
--numstatparsing instead of switching to-z, because the existing parser structure only needed the same Git path-token normalization already used by unified diff parsing. A NUL-separated parser would be a larger behavioral rewrite for the same issue.Testing
The new focused regression tests failed before the unescape fix:
commands::status::tests::numstat_post_filter_matches_c_quoted_non_ascii_pathscommands::status::tests::numstat_ignore_matcher_matches_c_quoted_non_ascii_pathsauthorship::range_authorship::tests::numstat_ignore_matcher_matches_c_quoted_non_ascii_pathsValidated after the fix with:
task test TEST_FILTER=numstat_ NO_CAPTURE=truetask test TEST_FILTER=status_ignore NO_CAPTURE=truetask test TEST_FILTER=range_authorship NO_CAPTURE=truetask test TEST_FILTER=utf8_filenames NO_CAPTURE=true(101/102 passed; one daemon completion timeout passed on exact retry)task test TEST_FILTER=test_rtl_with_ltr_mixed_filename_in_worktree NO_CAPTURE=truetask fmttask lintEarlier locale-specific validation:
task test TEST_FILTER=internal_git_command_forces_c_locale NO_CAPTURE=truetask test TEST_FILTER=fetch_notes::test_fetch_notes_no_remote_notes NO_CAPTURE=truetask test TEST_FILTER=fetch_notes NO_CAPTURE=truetask test TEST_FILTER=ci_local_skip_fetch NO_CAPTURE=truetask test TEST_FILTER=internal_machine_commands::test_fetch_and_push_authorship_notes_internal_commands_json NO_CAPTURE=true