Harden code-review offline scorer and judge candidate gating#681
Harden code-review offline scorer and judge candidate gating#681gggdttt wants to merge 8 commits into
Conversation
- macro_precision now averages only over tasks where the agent commented, so silence is no longer scored as perfect precision (recall/F1 still penalize silence) - unknown severities log a loud warning instead of silently coercing to MEDIUM - match_comments supports file-only candidate gating (line_tolerance=None); the real pipeline now lets the LLM judge be the semantic gate with line distance as a tiebreak - leaderboard bootstraps the macro-F1 CI over tasks (per_task_f1) so a single run yields a meaningful interval
- new codereview_judge_calibration module: 16 hand-labeled AL/BC (expected, candidate) pairs with human match/non-match verdicts; non-match cases share file+line so the judge must discriminate on the issue, not location - score_calibration() reports judge precision/recall/accuracy; run_calibration() runs the live judge over the set - new 'bcbench evaluate judge-calibration' CLI command for CI; fails if accuracy drops below a threshold - expose judge_verdicts() (per-pair verdicts) and harden the judge subprocess to decode output as utf-8 (was crashing on non-cp1252 bytes on Windows) - regression tests: pure scoring (always run) + opt-in live judge accuracy check (BCBENCH_RUN_JUDGE_CALIBRATION). Measured judge today: P=1.0 R=1.0 acc=1.0 over the set
|
I will covert this to draft explicitly given the title @gggdttt |
Add 8 adversarial pairs (4 match / 4 non-match) where expected and candidate share file+line and are semantically adjacent, to stress the judge's discrimination. Live judge (gpt-5.3-codex) still scores precision=recall=accuracy=1.000 (TP=13 FP=0 TN=11 FN=0).
haoranpb
left a comment
There was a problem hiding this comment.
Great idea with the calibration!
| alias = _SEVERITY_ALIASES.get(normalized) | ||
| if alias is not None: | ||
| return alias | ||
| logger.warning("Unknown severity %r; defaulting to %s. Use one of %s or a known alias.", value, cls.MEDIUM.value, [s.value for s in cls]) |
There was a problem hiding this comment.
Instead of having a default, raise and error out might be the better option, then you have better observability on the potential underlying problem. Or you might realize that this doesn't happen at all in practice
| context.entry.expected_comments, | ||
| generated_comments, | ||
| context.entry.match_line_tolerance, | ||
| line_tolerance=None, |
There was a problem hiding this comment.
Do we then want to remove the line_tolerance from the dataset?
| model: str = JUDGE_MODEL, | ||
| ) -> list[bool]: |
There was a problem hiding this comment.
Consider moving JUDGE_RESULT_FILE as an input parameter, and move it to the config.py
| from pathlib import Path | ||
|
|
||
| from bcbench.dataset.codereview import ReviewComment, Severity | ||
| from bcbench.evaluate.codereview_judge import JUDGE_MODEL, judge_verdicts |
There was a problem hiding this comment.
JUDGE_MODEL can also be moved into somewhere in config.py
| @dataclass(frozen=True) | ||
| class JudgeCalibrationCase: | ||
| expected: ReviewComment | ||
| candidate: ReviewComment | ||
| should_match: bool | ||
| note: str | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class JudgeCalibrationReport: | ||
| total: int | ||
| true_positives: int | ||
| false_positives: int | ||
| true_negatives: int | ||
| false_negatives: int | ||
| precision: float | ||
| recall: float | ||
| accuracy: float | ||
| misclassified_notes: list[str] |
| JudgeCalibrationCase( | ||
| expected=_c("src/Sales/SalesPost.Codeunit.al", 142, "Calling Commit() inside the repeat loop can leave partial data if a later iteration fails."), | ||
| candidate=_c("src/Sales/SalesPost.Codeunit.al", 145, "Move the Commit() out of the loop; committing per iteration breaks atomicity."), | ||
| should_match=True, | ||
| note="commit-in-loop, paraphrased + different line", | ||
| ), | ||
| JudgeCalibrationCase( | ||
| expected=_c("src/Inventory/ItemAvail.Codeunit.al", 58, "Add SetLoadFields before FindSet so the whole record isn't loaded."), | ||
| candidate=_c("src/Inventory/ItemAvail.Codeunit.al", 58, "Use SetLoadFields to limit the columns read for this query."), | ||
| should_match=True, | ||
| note="setloadfields performance, same issue", | ||
| ), | ||
| JudgeCalibrationCase( | ||
| expected=_c("src/Finance/Payment.Codeunit.al", 77, "Currency code 'USD' is hardcoded; read it from setup instead."), | ||
| candidate=_c("src/Finance/Payment.Codeunit.al", 77, "Don't hardcode the currency — pull it from the configuration record."), | ||
| should_match=True, | ||
| note="hardcoded currency, same issue", |
There was a problem hiding this comment.
move to a jsonl file or something, then python code can be much more simplified to only load them
| precision = tp / (tp + fp) if (tp + fp) else 1.0 | ||
| recall = tp / (tp + fn) if (tp + fn) else 1.0 | ||
| accuracy = (tp + tn) / len(cases) if cases else 0.0 |
There was a problem hiding this comment.
there is an existing function under metrics.py to calculate precision and recall
| When ``line_tolerance`` is ``None`` the line distance never blocks a pair: any same-file finding | ||
| is an eligible candidate and the distance acts only as an assignment tiebreak. This is the mode | ||
| used ahead of the LLM judge, which is the authoritative semantic gate. A numeric ``line_tolerance`` | ||
| keeps the older hard structural gate (used for judge-free scoring). |
There was a problem hiding this comment.
I see it is always passed as None now, consider simply drop it. Code below can be simplified because of it
| # Bootstrap the equal-weight headline over tasks (pooled per-task F1 across runs) so the CI | ||
| # is meaningful even with a single run. Fall back to the per-run macro F1 if tasks were not |
There was a problem hiding this comment.
We probably will always display results from multiple runs (e.g. 5), especially it's relatively cheap to do this (~15 mins per run). Not sure if "CI for single run" is a scenario worth considering to us
| ) -> None: | ||
| """Run the LLM judge over the hand-labeled calibration set and report its precision/recall. | ||
|
|
||
| Use this in CI to catch judge drift before it silently distorts code-review scores. |
There was a problem hiding this comment.
Can also be used locally for ad-hoc checks
There was a problem hiding this comment.
I also don't think this is called during CI atm
AB#640100
macro_precision now averages only over tasks where the agent commented, so silence is no longer scored as perfect precision (recall/F1 still penalize silence)
unknown severities log a loud warning instead of silently coercing to MEDIUM
match_comments supports file-only candidate gating (line_tolerance=None); the real pipeline now lets the LLM judge be the semantic gate with line distance as a tiebreak
leaderboard bootstraps the macro-F1 CI over tasks (per_task_f1) so a single run yields a meaningful interval