Harden code-review offline scorer and judge candidate gating by gggdttt · Pull Request #681 · microsoft/BC-Bench

gggdttt · 2026-06-24T08:51:28Z

AB#640100

macro_precision now averages only over tasks where the agent commented, so silence is no longer scored as perfect precision (recall/F1 still penalize silence)
unknown severities log a loud warning instead of silently coercing to MEDIUM
match_comments supports file-only candidate gating (line_tolerance=None); the real pipeline now lets the LLM judge be the semantic gate with line distance as a tiebreak
leaderboard bootstraps the macro-F1 CI over tasks (per_task_f1) so a single run yields a meaningful interval
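
The macro_precision change above can be illustrated with a minimal sketch. The `TaskResult` dataclass, its field names, and the zero fallbacks are assumptions made for illustration, not the actual BC-Bench types or conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical per-task counts; the real BC-Bench types may differ.
    matched: int    # generated comments matched to an expected comment
    generated: int  # comments the agent produced
    expected: int   # comments in the ground truth


def macro_precision(results: list[TaskResult]) -> float:
    # Average precision only over tasks where the agent commented at all,
    # so a silent task no longer contributes a "perfect" precision.
    commented = [r for r in results if r.generated > 0]
    if not commented:
        return 0.0  # assumption: no comments anywhere scores 0
    return sum(r.matched / r.generated for r in commented) / len(commented)


def macro_recall(results: list[TaskResult]) -> float:
    # Recall still averages over every task with expected comments,
    # so staying silent is penalized here.
    scored = [r for r in results if r.expected > 0]
    if not scored:
        return 0.0
    return sum(r.matched / r.expected for r in scored) / len(scored)


results = [
    TaskResult(matched=1, generated=2, expected=2),  # agent commented
    TaskResult(matched=0, generated=0, expected=3),  # agent stayed silent
]
print(macro_precision(results))  # 0.5
print(macro_recall(results))     # 0.25
```

The silent task drops out of the precision average but still pulls recall down, which is the behavior the description asks for.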


- new codereview_judge_calibration module: 16 hand-labeled AL/BC (expected, candidate) pairs with human match/non-match verdicts; non-match cases share file+line so the judge must discriminate on the issue, not location
- score_calibration() reports judge precision/recall/accuracy; run_calibration() runs the live judge over the set
- new 'bcbench evaluate judge-calibration' CLI command for CI; fails if accuracy drops below a threshold
- expose judge_verdicts() (per-pair verdicts) and harden the judge subprocess to decode output as utf-8 (was crashing on non-cp1252 bytes on Windows)
- regression tests: pure scoring (always run) + opt-in live judge accuracy check (BCBENCH_RUN_JUDGE_CALIBRATION). Measured judge today: P=1.0 R=1.0 acc=1.0 over the set
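
The subprocess hardening mentioned above might look roughly like the following. This is a sketch under assumptions: `run_and_decode` is a hypothetical helper, and the real judge invocation and error handling in the PR may differ. It relies only on the standard `subprocess.run` parameters `encoding` and `errors`.

```python
import subprocess
import sys


def run_and_decode(cmd: list[str]) -> str:
    # Decode explicitly as UTF-8 instead of relying on the platform default
    # (often cp1252 on Windows); undecodable bytes are replaced, not raised on.
    completed = subprocess.run(
        cmd,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    if completed.returncode != 0:
        # Surface both streams so a failure is diagnosable from the message.
        raise RuntimeError(
            f"command failed ({completed.returncode}): "
            f"stdout={completed.stdout!r} stderr={completed.stderr!r}"
        )
    return completed.stdout


# The child process writes the UTF-8 bytes for U+2713 (a check mark).
child = "import sys; sys.stdout.buffer.write(bytes([0xE2, 0x9C, 0x93]))"
output = run_and_decode([sys.executable, "-c", child])
print(output == "\u2713")
```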


haoranpb · 2026-06-24T14:27:01Z

I will convert this to draft explicitly, given the title @gggdttt

Add 8 adversarial pairs (4 match / 4 non-match) where expected and candidate share file+line and are semantically adjacent, to stress the judge's discrimination. Live judge (gpt-5.3-codex) still scores precision=recall=accuracy=1.000 (TP=13 FP=0 TN=11 FN=0).

haoranpb

Great idea with the calibration!

haoranpb · 2026-06-25T06:21:30Z

+        alias = _SEVERITY_ALIASES.get(normalized)
+        if alias is not None:
+            return alias
+        logger.warning("Unknown severity %r; defaulting to %s. Use one of %s or a known alias.", value, cls.MEDIUM.value, [s.value for s in cls])


Instead of falling back to a default, raising an error might be the better option: it gives you better observability into the underlying problem. Or you might find that this never happens in practice.
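
A strict variant of the parsing could look like the sketch below. The `Severity` members other than medium and critical, the `parse` method name, and the alias table are assumptions for illustration, not the repository's actual definitions.

```python
from enum import Enum


class Severity(str, Enum):
    # "medium" and "critical" appear in this PR; the other members are assumed.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

    @classmethod
    def parse(cls, value: str) -> "Severity":
        normalized = value.strip().lower()
        try:
            return cls(normalized)
        except ValueError:
            pass
        alias = _SEVERITY_ALIASES.get(normalized)
        if alias is not None:
            return alias
        # Fail loudly instead of coercing to MEDIUM.
        raise ValueError(
            f"Unknown severity {value!r}; use one of "
            f"{[s.value for s in cls]} or a known alias."
        )


# Illustrative aliases only; the real alias table is not shown in the diff.
_SEVERITY_ALIASES = {"warning": Severity.MEDIUM, "blocker": Severity.CRITICAL}

print(Severity.parse(" High "))
print(Severity.parse("warning"))
```

With this shape, an unexpected value stops the run at the point where it enters the pipeline rather than surfacing later as a skewed score.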

haoranpb · 2026-06-25T06:22:20Z

                context.entry.expected_comments,
                generated_comments,
-                context.entry.match_line_tolerance,
+                line_tolerance=None,


Do we then want to remove the line_tolerance from the dataset?

haoranpb · 2026-06-25T06:28:01Z

+    model: str = JUDGE_MODEL,
+) -> list[bool]:


Consider making JUDGE_RESULT_FILE an input parameter and moving its definition to config.py

haoranpb · 2026-06-25T06:30:14Z

+from pathlib import Path
+
+from bcbench.dataset.codereview import ReviewComment, Severity
+from bcbench.evaluate.codereview_judge import JUDGE_MODEL, judge_verdicts


JUDGE_MODEL could also be moved into config.py

haoranpb · 2026-06-25T06:30:49Z

+@dataclass(frozen=True)
+class JudgeCalibrationCase:
+    expected: ReviewComment
+    candidate: ReviewComment
+    should_match: bool
+    note: str
+
+
+@dataclass(frozen=True)
+class JudgeCalibrationReport:
+    total: int
+    true_positives: int
+    false_positives: int
+    true_negatives: int
+    false_negatives: int
+    precision: float
+    recall: float
+    accuracy: float
+    misclassified_notes: list[str]


move to types.py

haoranpb · 2026-06-25T06:31:33Z

+    JudgeCalibrationCase(
+        expected=_c("src/Sales/SalesPost.Codeunit.al", 142, "Calling Commit() inside the repeat loop can leave partial data if a later iteration fails."),
+        candidate=_c("src/Sales/SalesPost.Codeunit.al", 145, "Move the Commit() out of the loop; committing per iteration breaks atomicity."),
+        should_match=True,
+        note="commit-in-loop, paraphrased + different line",
+    ),
+    JudgeCalibrationCase(
+        expected=_c("src/Inventory/ItemAvail.Codeunit.al", 58, "Add SetLoadFields before FindSet so the whole record isn't loaded."),
+        candidate=_c("src/Inventory/ItemAvail.Codeunit.al", 58, "Use SetLoadFields to limit the columns read for this query."),
+        should_match=True,
+        note="setloadfields performance, same issue",
+    ),
+    JudgeCalibrationCase(
+        expected=_c("src/Finance/Payment.Codeunit.al", 77, "Currency code 'USD' is hardcoded; read it from setup instead."),
+        candidate=_c("src/Finance/Payment.Codeunit.al", 77, "Don't hardcode the currency — pull it from the configuration record."),
+        should_match=True,
+        note="hardcoded currency, same issue",


Move these to a JSONL file or similar; the Python code can then be simplified to just load them
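
One possible shape for that loader is sketched below. The `Comment` and `Case` dataclasses are simplified stand-ins for `ReviewComment` and `JudgeCalibrationCase`, and the JSONL field names and file name are assumptions.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Comment:
    # Simplified stand-in for ReviewComment; the real fields may differ.
    file: str
    line: int
    body: str


@dataclass(frozen=True)
class Case:
    # Simplified stand-in for JudgeCalibrationCase.
    expected: Comment
    candidate: Comment
    should_match: bool
    note: str


def load_cases(path: Path) -> list[Case]:
    cases: list[Case] = []
    with path.open(encoding="utf-8") as handle:
        for raw in handle:
            if not raw.strip():
                continue  # tolerate blank lines
            row = json.loads(raw)
            cases.append(
                Case(
                    expected=Comment(**row["expected"]),
                    candidate=Comment(**row["candidate"]),
                    should_match=bool(row["should_match"]),
                    note=row["note"],
                )
            )
    return cases


# Round-trip one case from the diff through a temporary JSONL file.
row = {
    "expected": {"file": "src/Inventory/ItemAvail.Codeunit.al", "line": 58,
                 "body": "Add SetLoadFields before FindSet so the whole record isn't loaded."},
    "candidate": {"file": "src/Inventory/ItemAvail.Codeunit.al", "line": 58,
                  "body": "Use SetLoadFields to limit the columns read for this query."},
    "should_match": True,
    "note": "setloadfields performance, same issue",
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "judge_calibration.jsonl"
    path.write_text(json.dumps(row) + "\n", encoding="utf-8")
    cases = load_cases(path)
print(len(cases), cases[0].should_match)
```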

haoranpb · 2026-06-25T06:32:22Z

+    precision = tp / (tp + fp) if (tp + fp) else 1.0
+    recall = tp / (tp + fn) if (tp + fn) else 1.0
+    accuracy = (tp + tn) / len(cases) if cases else 0.0


There is an existing function in metrics.py that calculates precision and recall

haoranpb · 2026-06-25T06:33:12Z

+    When ``line_tolerance`` is ``None`` the line distance never blocks a pair: any same-file finding
+    is an eligible candidate and the distance acts only as an assignment tiebreak. This is the mode
+    used ahead of the LLM judge, which is the authoritative semantic gate. A numeric ``line_tolerance``
+    keeps the older hard structural gate (used for judge-free scoring).


I see it is always passed as None now; consider simply dropping the parameter. The code below could then be simplified.
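
For reference, file-only gating with line distance as a tiebreak could be sketched as follows. This is a greedy illustration under assumptions, not the PR's `match_comments`: the `Comment` type, the `match_file_only` name, and the keyword-based judge stub are all hypothetical.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    # Simplified stand-in for ReviewComment.
    file: str
    line: int
    body: str


def match_file_only(
    expected: list[Comment],
    generated: list[Comment],
    judge: Callable[[Comment, Comment], bool],
) -> list[tuple[int, int]]:
    """Return (expected_index, generated_index) pairs accepted by the judge."""
    used: set[int] = set()
    pairs: list[tuple[int, int]] = []
    for e_idx, exp in enumerate(expected):
        # Any unused same-file comment is eligible; distance only orders them.
        candidates = sorted(
            (g for g, gen in enumerate(generated) if gen.file == exp.file and g not in used),
            key=lambda g: abs(generated[g].line - exp.line),
        )
        for g_idx in candidates:
            if judge(exp, generated[g_idx]):  # the judge is the semantic gate
                used.add(g_idx)
                pairs.append((e_idx, g_idx))
                break
    return pairs


def toy_judge(exp: Comment, gen: Comment) -> bool:
    # Stand-in for the LLM judge: a crude keyword check.
    return "commit" in exp.body.lower() and "commit" in gen.body.lower()


expected = [Comment("a.al", 10, "Commit inside the loop")]
generated = [
    Comment("a.al", 12, "Unused variable"),          # nearest, but a different issue
    Comment("a.al", 40, "Move the Commit out"),      # far away, same issue
    Comment("b.al", 10, "Commit inside the loop"),   # different file, never eligible
]
print(match_file_only(expected, generated, toy_judge))  # [(0, 1)]
```

The nearest comment is tried first but rejected by the judge, so the match falls to the more distant same-file comment; a hard line tolerance would have blocked that pair.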

haoranpb · 2026-06-25T06:36:29Z

+        # Bootstrap the equal-weight headline over tasks (pooled per-task F1 across runs) so the CI
+        # is meaningful even with a single run. Fall back to the per-run macro F1 if tasks were not


We will probably always display results from multiple runs (e.g. 5), especially since each run is relatively cheap (~15 mins). I'm not sure a "CI for a single run" is a scenario worth considering for us
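
For context, a percentile bootstrap over per-task F1 values can be sketched as below. The function name, resample count, and fixed seed are assumptions; the leaderboard's actual implementation is not shown in this hunk.

```python
import random
import statistics


def bootstrap_ci(
    per_task_f1: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-task F1 scores."""
    if not per_task_f1:
        raise ValueError("need at least one per-task F1 value")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    # Resample tasks with replacement and record the macro F1 of each resample.
    means = sorted(
        statistics.fmean(rng.choices(per_task_f1, k=len(per_task_f1)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


f1_scores = [0.0, 0.5, 0.67, 1.0, 0.4, 0.8, 0.0, 0.33]
low, high = bootstrap_ci(f1_scores)
print(f"macro F1 = {statistics.fmean(f1_scores):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Because the resampling unit is the task, the interval reflects variation across tasks; with several runs, the per-task values from all runs can be pooled into the same list.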

haoranpb · 2026-06-25T06:37:01Z

+) -> None:
+    """Run the LLM judge over the hand-labeled calibration set and report its precision/recall.
+
+    Use this in CI to catch judge drift before it silently distorts code-review scores.


Can also be used locally for ad-hoc checks

I also don't think this is called during CI atm

wenjiefan added 2 commits June 24, 2026 10:32

gggdttt changed the title ~~Harden code-review offline scorer and judge candidate gating~~ [Draft]Harden code-review offline scorer and judge candidate gating Jun 24, 2026

wenjiefan added 5 commits June 24, 2026 14:11

Record invalid result instead of aborting batch when review.json is missing

3949a2b

Revert to failing fast when review.json is missing

67e5b96

Strengthen code-review prompt to mandate review.json with explicit schema

f434110

Add critical to code-review severity options in prompt

b49d3c8

Surface copilot stdout/stderr in judge subprocess failures

5d88ea9

haoranpb marked this pull request as draft June 24, 2026 14:27

gggdttt marked this pull request as ready for review June 24, 2026 19:15

gggdttt changed the title ~~[Draft]Harden code-review offline scorer and judge candidate gating~~ Harden code-review offline scorer and judge candidate gating Jun 24, 2026

gggdttt marked this pull request as draft June 24, 2026 19:15

gggdttt marked this pull request as ready for review June 24, 2026 19:16

haoranpb mentioned this pull request Jun 25, 2026

Publish code-review vanilla baseline (claude-haiku-4-5, repeat=5) #684

Open

haoranpb reviewed Jun 25, 2026
