Added evaluation of multiple validators together #66
rkritika1508 wants to merge 3 commits into main from
Conversation
📝 Walkthrough

Added two new files for the multiple-validators evaluation: a JSON configuration file specifying dataset paths, the output location, and three validators (uli_slur_match, pii_remover, ban_list), and a Python script that loads the config, reads a CSV dataset, calls a Guardrails HTTP API for each row with the configured validators, and outputs predictions with error handling.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script as run.py
    participant CSV as Input Dataset
    participant API as Guardrails HTTP API
    participant Output as Output CSV
    Script->>Script: Load config.json
    Script->>CSV: Read rows from dataset
    loop For each row in dataset
        Script->>Script: Build request payload<br/>(request_id, validators, text)
        Script->>API: POST with auth token
        alt Success
            API-->>Script: Return {data: {safe_text}}
            Script->>Script: Extract safe_text
        else HTTP/Network Error
            API-->>Script: Error response
            Script->>Script: Return REQUEST_ERROR
        else JSON Decode Error
            Script->>Script: Return JSON_ERROR
        end
        Script->>Output: Append row with response
    end
    Script->>Output: Write aggregated results to CSV
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🧹 Nitpick comments (2)
backend/app/api/routes/guardrails.py (1)
48-53: LGTM – the conditional guard is correct and well-motivated.

The `any()` short-circuit correctly avoids an unnecessary DB round-trip when all `BanListSafetyValidatorConfig` validators already have `banned_words` populated (as in the evaluation script, where they're inlined). The inner `_resolve_ban_list_banned_words` also retains its own guard (`banned_words is not None`), so behavior is unchanged for callers that always supplied `banned_words`.

One existing gap worth noting: `_resolve_ban_list_banned_words` is invoked before `_validate_with_guard`'s try/except, so any exception from `ban_list_crud.get` (e.g. record not found, DB error) escapes unhandled and falls through to FastAPI's default 500 handler rather than returning a structured `APIResponse`. This pre-dates this PR, but wrapping the call (or moving ban-list resolution inside `_validate_with_guard`'s try block) would close that gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/api/routes/guardrails.py` around lines 48-53: The call to _resolve_ban_list_banned_words(payload, session) can raise exceptions from ban_list_crud.get which escape the current try/except in _validate_with_guard; wrap the resolution call in the same error handling as _validate_with_guard (or move the call inside _validate_with_guard's try block) so DB errors are caught and converted to the same structured APIResponse/error handling path. Specifically, ensure exceptions from ban_list_crud.get (raised via _resolve_ban_list_banned_words) are caught and handled consistently with _validate_with_guard rather than propagating to FastAPI (reference symbols: _resolve_ban_list_banned_words, _validate_with_guard, ban_list_crud.get, payload.validators, BanListSafetyValidatorConfig).

backend/app/evaluation/multi_validator_whatsapp/run.py (1)
16-16: `float()` conversion at module level will crash on a malformed env var.

`TIMEOUT_SECONDS = float(os.getenv("GUARDRAILS_TIMEOUT_SECONDS", "60"))` raises `ValueError` at import time if the variable is set to a non-numeric string (e.g. a mistyped value), giving a confusing traceback instead of a meaningful error.

🔧 Proposed fix
```diff
-TIMEOUT_SECONDS = float(os.getenv("GUARDRAILS_TIMEOUT_SECONDS", "60"))
+_raw_timeout = os.getenv("GUARDRAILS_TIMEOUT_SECONDS", "60")
+try:
+    TIMEOUT_SECONDS = float(_raw_timeout)
+except ValueError:
+    raise ValueError(
+        f"GUARDRAILS_TIMEOUT_SECONDS must be a number, got: {_raw_timeout!r}"
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/multi_validator_whatsapp/run.py` at line 16, The module-level float conversion of TIMEOUT_SECONDS from GUARDRAILS_TIMEOUT_SECONDS can raise ValueError at import time; change it to a safe parse: read os.getenv("GUARDRAILS_TIMEOUT_SECONDS") into a string, attempt float() inside a try/except (or a small helper like parse_timeout_env/get_timeout_seconds) and on failure fall back to the default 60 (and optionally log a warning) so import won't crash; assign the resulting numeric value to TIMEOUT_SECONDS.
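The fallback-style parse described in this prompt (default to 60 with a warning instead of raising) could look like the following; the helper name `get_timeout_seconds` is one of the names the prompt itself suggests, and the rest is a sketch:

```python
import logging
import os

logger = logging.getLogger(__name__)

DEFAULT_TIMEOUT_SECONDS = 60.0


def get_timeout_seconds(default: float = DEFAULT_TIMEOUT_SECONDS) -> float:
    """Parse GUARDRAILS_TIMEOUT_SECONDS, falling back to `default` on a
    malformed value so a typo cannot crash the script at import time."""
    raw = os.getenv("GUARDRAILS_TIMEOUT_SECONDS")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        logger.warning(
            "Invalid GUARDRAILS_TIMEOUT_SECONDS=%r; falling back to %s", raw, default
        )
        return default


TIMEOUT_SECONDS = get_timeout_seconds()
```

Unlike the diff in the comment above, this variant never raises, which keeps the module importable even with a misconfigured environment.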
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/multi_validator_whatsapp/run.py`:
- Around line 28-32: The current "ban_list" template contains a hardcoded
"banned_words": ["sonography"] which restricts use; change run.py to accept a
banned-words list from the CLI (or env var) and build the template dynamically
instead of embedding the literal array. Specifically, add a CLI flag (e.g.,
--banned-words) or read an env var, parse it into a list, then replace the
static "banned_words" entry in the "ban_list" template with that parsed list
before the validator is used; update any code that references the "ban_list"
template to consume the newly built template.
- Around line 58-61: The current code in run.py that extracts safe_text
(safe_text = body.get("data", {}).get("safe_text") and returns "" when missing)
conflates missing/validation-failed responses with legitimately empty-string
outputs; change the return to a clear sentinel instead of "" — e.g., return None
(preferred) or the string "VALIDATION_FAILED" from the call_guardrails flow
where safe_text is absent, and update any CSV/export logic that writes these
results to map that sentinel to a distinct marker (e.g., "VALIDATION_FAILED") so
missing/validation failures are distinguishable from empty valid outputs; ensure
references to safe_text and call_guardrails are updated accordingly.
- Around line 86-90: In main(), instead of raising ValueError when invalid
validator names are found (the block that checks selected_validators or
unknown), call parser.error(...) with the same message so argparse prints a
usage and exits with code 2; use the existing parser object and include the
supported values (f"{', '.join(VALIDATOR_TEMPLATES.keys())}") in the error text
so the behavior for selected_validators/unknown mirrors argparse's idiomatic
validation.
- Around line 43-44: The payload in run.py currently hardcodes "organization_id"
and "project_id" to 1; update main() to accept CLI args (e.g., --organization-id
and --project-id) with env-var fallbacks similar to API_URL and TIMEOUT_SECONDS,
parse them to integers, pass them into call_guardrails, and replace the literal
values in the payload construction (the dictionary with keys "organization_id"
and "project_id") so calls target the correct tenant instead of always tenant 1.
🧹 Nitpick comments (3)
backend/app/evaluation/multiple_validators/run.py (3)
86-106: Consider adding a progress indicator for long-running evaluations.

For large datasets, the sequential API calls could take a significant amount of time. Adding a simple progress indicator (e.g., using `tqdm` or printing progress every N rows) would improve user experience during execution.

```diff
+from tqdm import tqdm
+
 df = pd.read_csv(dataset_path)
 rows = []
-for _, row in df.iterrows():
+for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing"):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/multiple_validators/run.py` around lines 86 - 106, The long-running loop over df.iterrows() that calls call_guardrails for each row should show progress; update the loop in run.py (the section building rows and invoking call_guardrails) to use a progress indicator (e.g., wrap the iterator with tqdm or print progress every N rows), ensuring you import tqdm if used and preserve existing behavior of collecting into rows and reading columns "Text", "ID", and "Validators_present"; keep call_guardrails, validators_payload, organization_id, project_id, and args.auth_token usage unchanged while adding the progress display.
42-58: Consider adding retry logic for transient failures.

For evaluation runs with many rows, transient network issues or rate limiting could cause individual requests to fail. Adding a simple retry with backoff would improve reliability.

♻️ Suggested implementation with tenacity

```diff
+from tenacity import retry, stop_after_attempt, wait_exponential
+
+@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
+def _make_request(url: str, headers: dict, payload: dict, timeout: float) -> httpx.Response:
+    response = httpx.post(url, headers=headers, json=payload, timeout=timeout)
+    response.raise_for_status()
+    return response
+
 def call_guardrails(...) -> str:
     ...
     try:
-        response = httpx.post(
-            API_URL,
-            headers=headers,
-            json=payload,
-            timeout=TIMEOUT_SECONDS,
-        )
-        response.raise_for_status()
+        response = _make_request(API_URL, headers, payload, TIMEOUT_SECONDS)
         body = response.json()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/multiple_validators/run.py` around lines 42 - 58, The HTTP call block using httpx.post(API_URL, headers=headers, json=payload, timeout=TIMEOUT_SECONDS) should be made resilient with retry and backoff: wrap the request/response.raise_for_status()/response.json() logic in a retry loop (or use tenacity) that retries on transient httpx.HTTPError, timeouts, or 5xx status codes with exponential backoff and a limited number of attempts, and only return error strings like "REQUEST_ERROR" or "JSON_ERROR" after retries are exhausted; keep the existing behavior for extracting body.get("data", {}).get("safe_text") and returning "" when safe_text is None. Ensure the retry references the same symbols (API_URL, TIMEOUT_SECONDS, httpx.post, response.raise_for_status) and preserves parsing of JSON and exception handling semantics.
18-20: Consider adding error handling for config loading.

If the config file doesn't exist or contains invalid JSON, the error messages from the raw exceptions may not be user-friendly. For a CLI tool, a clearer error message would improve usability.

♻️ Suggested improvement

```diff
 def load_config(config_path: Path) -> dict:
+    if not config_path.exists():
+        raise FileNotFoundError(f"Config file not found: {config_path}")
-    with open(config_path) as f:
-        return json.load(f)
+    try:
+        with open(config_path) as f:
+            return json.load(f)
+    except json.JSONDecodeError as e:
+        raise ValueError(f"Invalid JSON in config file {config_path}: {e}") from e
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/multiple_validators/run.py` around lines 18 - 20, The load_config function currently opens and json.load()s without guarding against missing files or invalid JSON; wrap the file open/json.load calls in a try/except that catches FileNotFoundError, PermissionError and json.JSONDecodeError and re-raise or exit with a clear, user-friendly message that includes the config_path and the underlying error text (e.g., "Failed to load config at {config_path}: {error}"); keep the exception types (json.JSONDecodeError) and function name load_config as anchors so callers and tests continue to work.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: edb8223a-3954-4a6c-ba7c-3579d02a476f
📒 Files selected for processing (2)
backend/app/evaluation/multiple_validators/config.json
backend/app/evaluation/multiple_validators/run.py
✅ Files skipped from review due to trivial changes (1)
- backend/app/evaluation/multiple_validators/config.json
Summary
Target issue is #67.
Explain the motivation for making this change. What existing problem does the pull request solve?
With a custom config:
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit