Skip to content

Harden code-review offline scorer and judge candidate gating#681

Open
gggdttt wants to merge 8 commits into
mainfrom
private/wenjiefan/code-review-improve-eval-methodology
Open

Harden code-review offline scorer and judge candidate gating#681
gggdttt wants to merge 8 commits into
mainfrom
private/wenjiefan/code-review-improve-eval-methodology

Conversation

@gggdttt

@gggdttt gggdttt commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

AB#640100

  • macro_precision now averages only over tasks where the agent commented, so silence is no longer scored as perfect precision (recall/F1 still penalize silence)

  • unknown severities log a loud warning instead of silently coercing to MEDIUM

  • match_comments supports file-only candidate gating (line_tolerance=None); the real pipeline now lets the LLM judge be the semantic gate with line distance as a tiebreak

  • leaderboard bootstraps the macro-F1 CI over tasks (per_task_f1) so a single run yields a meaningful interval

wenjiefan added 2 commits June 24, 2026 10:32
- macro_precision now averages only over tasks where the agent commented, so silence is no longer scored as perfect precision (recall/F1 still penalize silence)

- unknown severities log a loud warning instead of silently coercing to MEDIUM

- match_comments supports file-only candidate gating (line_tolerance=None); the real pipeline now lets the LLM judge be the semantic gate with line distance as a tiebreak

- leaderboard bootstraps the macro-F1 CI over tasks (per_task_f1) so a single run yields a meaningful interval
- new codereview_judge_calibration module: 16 hand-labeled AL/BC (expected, candidate) pairs with human match/non-match verdicts; non-match cases share file+line so the judge must discriminate on the issue, not location

- score_calibration() reports judge precision/recall/accuracy; run_calibration() runs the live judge over the set

- new 'bcbench evaluate judge-calibration' CLI command for CI; fails if accuracy drops below a threshold

- expose judge_verdicts() (per-pair verdicts) and harden the judge subprocess to decode output as utf-8 (was crashing on non-cp1252 bytes on Windows)

- regression tests: pure scoring (always run) + opt-in live judge accuracy check (BCBENCH_RUN_JUDGE_CALIBRATION). Measured judge today: P=1.0 R=1.0 acc=1.0 over the set
@gggdttt gggdttt changed the title Harden code-review offline scorer and judge candidate gating [Draft]Harden code-review offline scorer and judge candidate gating Jun 24, 2026
@haoranpb

haoranpb commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

I will covert this to draft explicitly given the title @gggdttt

@haoranpb haoranpb marked this pull request as draft June 24, 2026 14:27
@gggdttt gggdttt marked this pull request as ready for review June 24, 2026 19:15
@gggdttt gggdttt changed the title [Draft]Harden code-review offline scorer and judge candidate gating Harden code-review offline scorer and judge candidate gating Jun 24, 2026
@gggdttt gggdttt marked this pull request as draft June 24, 2026 19:15
@gggdttt gggdttt marked this pull request as ready for review June 24, 2026 19:16
Add 8 adversarial pairs (4 match / 4 non-match) where expected and
candidate share file+line and are semantically adjacent, to stress the
judge's discrimination. Live judge (gpt-5.3-codex) still scores
precision=recall=accuracy=1.000 (TP=13 FP=0 TN=11 FN=0).

@haoranpb haoranpb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea with the calibration!

alias = _SEVERITY_ALIASES.get(normalized)
if alias is not None:
return alias
logger.warning("Unknown severity %r; defaulting to %s. Use one of %s or a known alias.", value, cls.MEDIUM.value, [s.value for s in cls])

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a default, raise and error out might be the better option, then you have better observability on the potential underlying problem. Or you might realize that this doesn't happen at all in practice

context.entry.expected_comments,
generated_comments,
context.entry.match_line_tolerance,
line_tolerance=None,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we then want to remove the line_tolerance from the dataset?

Comment on lines +141 to +142
model: str = JUDGE_MODEL,
) -> list[bool]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving JUDGE_RESULT_FILE as an input parameter, and move it to the config.py

from pathlib import Path

from bcbench.dataset.codereview import ReviewComment, Severity
from bcbench.evaluate.codereview_judge import JUDGE_MODEL, judge_verdicts

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUDGE_MODEL can also be moved into somewhere in config.py

Comment on lines +22 to +40
@dataclass(frozen=True)
class JudgeCalibrationCase:
expected: ReviewComment
candidate: ReviewComment
should_match: bool
note: str


@dataclass(frozen=True)
class JudgeCalibrationReport:
total: int
true_positives: int
false_positives: int
true_negatives: int
false_negatives: int
precision: float
recall: float
accuracy: float
misclassified_notes: list[str]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to types.py

Comment on lines +49 to +65
JudgeCalibrationCase(
expected=_c("src/Sales/SalesPost.Codeunit.al", 142, "Calling Commit() inside the repeat loop can leave partial data if a later iteration fails."),
candidate=_c("src/Sales/SalesPost.Codeunit.al", 145, "Move the Commit() out of the loop; committing per iteration breaks atomicity."),
should_match=True,
note="commit-in-loop, paraphrased + different line",
),
JudgeCalibrationCase(
expected=_c("src/Inventory/ItemAvail.Codeunit.al", 58, "Add SetLoadFields before FindSet so the whole record isn't loaded."),
candidate=_c("src/Inventory/ItemAvail.Codeunit.al", 58, "Use SetLoadFields to limit the columns read for this query."),
should_match=True,
note="setloadfields performance, same issue",
),
JudgeCalibrationCase(
expected=_c("src/Finance/Payment.Codeunit.al", 77, "Currency code 'USD' is hardcoded; read it from setup instead."),
candidate=_c("src/Finance/Payment.Codeunit.al", 77, "Don't hardcode the currency — pull it from the configuration record."),
should_match=True,
note="hardcoded currency, same issue",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to a jsonl file or something, then python code can be much more simplified to only load them

Comment on lines +217 to +219
precision = tp / (tp + fp) if (tp + fp) else 1.0
recall = tp / (tp + fn) if (tp + fn) else 1.0
accuracy = (tp + tn) / len(cases) if cases else 0.0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an existing function under metrics.py to calculate precision and recall

Comment on lines +87 to +90
When ``line_tolerance`` is ``None`` the line distance never blocks a pair: any same-file finding
is an eligible candidate and the distance acts only as an assignment tiebreak. This is the mode
used ahead of the LLM judge, which is the authoritative semantic gate. A numeric ``line_tolerance``
keeps the older hard structural gate (used for judge-free scoring).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it is always passed as None now, consider simply drop it. Code below can be simplified because of it

Comment on lines +147 to +148
# Bootstrap the equal-weight headline over tasks (pooled per-task F1 across runs) so the CI
# is meaningful even with a single run. Fall back to the per-run macro F1 if tasks were not

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably will always display results from multiple runs (e.g. 5), especially it's relatively cheap to do this (~15 mins per run). Not sure if "CI for single run" is a scenario worth considering to us

) -> None:
"""Run the LLM judge over the hand-labeled calibration set and report its precision/recall.

Use this in CI to catch judge drift before it silently distorts code-review scores.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also be used locally for ad-hoc checks

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think this is called during CI atm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants