STT Evaluation: Gemini Files API instead of signed URLs #685
AkhileshNegi wants to merge 6 commits into main from
Conversation
📝 Walkthrough

Refactors the STT batch flow to upload audio to the Gemini File API (instead of generating signed S3 URLs), adds helpers to upload and delete Gemini files, changes create_stt_batch_requests to accept file URIs and explicit MIME types, and performs post-run Gemini-file cleanup after batch completion.
Sequence Diagram

sequenceDiagram
participant App as STT Batch Service
participant S3 as AWS S3
participant GeminiFile as Gemini File API
participant GeminiBatch as Gemini Batch API
App->>S3: Download audio bytes
S3-->>App: Audio bytes + inferred MIME
App->>GeminiFile: Upload audio (bytes, mime_type)
GeminiFile-->>App: file_uri, file_name
Note over App: Aggregate file_uris & mime_types (thread pool)
App->>GeminiBatch: create_stt_batch_requests(file_uris, mime_types, prompt)
GeminiBatch-->>App: Batch request payloads / job IDs
App->>GeminiBatch: Submit & poll jobs
GeminiBatch-->>App: STT results
App->>GeminiFile: delete_files(gemini_audio_file_ids)
GeminiFile-->>App: deletion confirmations / partial failures
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Codecov Report: ❌ Patch coverage is
🧹 Nitpick comments (1)
backend/app/crud/stt_evaluations/batch.py (1)
97-134: Add a type hint for mime_type in the tuple return and handle a potential None value.

The _upload_to_gemini function returns mime_type, which can be None on error, but on success at line 145 mime_types.append(mime_type) appends it without a null check. While the conditional if file_uri: guards this path, the type system doesn't guarantee mime_type is non-None when file_uri is truthy. Consider adding an explicit check for defensive coding:
🛡️ Suggested defensive check
  for completed_task in as_completed(upload_tasks):
      sample, file_uri, file_name, mime_type, error = completed_task.result()
-     if file_uri:
+     if file_uri and mime_type and file_name:
          file_uris.append(file_uri)
          mime_types.append(mime_type)
          sample_keys.append(str(sample.id))
          gemini_file_names.append(file_name)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/crud/stt_evaluations/batch.py` around lines 97 - 134, Update the _upload_to_gemini return type to explicitly mark mime_type as optional (e.g., tuple[STTSample, str | None, str | None, str | None, str | None]) and ensure callers treat mime_type as possibly None: when handling the returned (sample, file_uri, file_name, mime_type, error) from _upload_to_gemini, only append mime_type to mime_types (or use a safe fallback like "audio/mpeg") after checking mime_type is not None and file_uri is truthy; adjust the code paths around upload_provider.upload_audio_file, file_uri checks and mime_types.append to perform this defensive null check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4e89169a-6920-4f23-8db5-666d9930b908
📒 Files selected for processing (4)

- backend/app/core/batch/gemini.py
- backend/app/crud/stt_evaluations/batch.py
- backend/app/crud/stt_evaluations/cron.py
- backend/app/tests/core/batch/test_gemini.py
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/crud/stt_evaluations/batch.py (1)
196-234: ⚠️ Potential issue | 🟠 Major: Clean up uploaded Gemini files when batch submission never succeeds.

If every start_batch_job() call raises, gemini_file_names never gets written into any batch job config, so the later cleanup in poll_stt_run() has nothing to delete. That leaves all uploaded audio orphaned until Gemini's TTL expires.

♻️ Proposed fix
- if not batch_jobs:
-     raise Exception("Batch submission failed for all models")
+ if not batch_jobs:
+     if gemini_file_names:
+         upload_provider.delete_files(gemini_file_names)
+     raise Exception("Batch submission failed for all models")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/crud/stt_evaluations/batch.py` around lines 196 - 234, The loop can fail for every model leaving gemini_file_names orphaned; after the loop when batch_jobs is empty, ensure you clean up the uploaded Gemini files before raising. Add a call (and log) to the existing cleanup helper—or implement a small cleanup routine—to delete gemini_file_names (referencing gemini_file_names) in the same function (where batch_jobs and first_batch_job_id are checked) so poll_stt_run() has nothing left to delete later; do this cleanup just before raising the Exception to avoid leaving uploaded audio behind.
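A minimal sketch of the proposed behavior, with delete_files standing in for the provider's cleanup call (names here are illustrative, not the repository's actual API):

```python
from typing import Callable


def submit_guard(
    batch_jobs: list[str],
    gemini_file_names: list[str],
    delete_files: Callable[[list[str]], None],
) -> None:
    """If no batch job was created, delete the already-uploaded Gemini files
    before raising, so audio is not left orphaned until the file TTL expires."""
    if not batch_jobs:
        if gemini_file_names:
            # Cleanup happens before the raise; poll_stt_run() will never
            # see these files, so no one else would delete them.
            delete_files(gemini_file_names)
        raise RuntimeError("Batch submission failed for all models")
```

Running cleanup before the raise keeps the failure path self-contained instead of relying on the polling job to notice orphans.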
🧹 Nitpick comments (1)
backend/app/tests/core/batch/test_gemini.py (1)
15-24: Please make the shared pytest fixtures typed factories.

These new fixtures return concrete instances directly and omit annotations, which makes the setup less reusable across cases and doesn't match the repo's testing conventions.

As per coding guidelines: **/*.py: Always add type hints to all function parameters and return values in Python code; backend/app/tests/**/*.py: Use factory pattern for test fixtures in backend/app/tests/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/core/batch/test_gemini.py` around lines 15 - 24, The fixtures currently return concrete instances without type hints or factory flexibility; change both to typed factory fixtures: make mock_genai_client a pytest fixture that returns a Callable[[], MagicMock] with explicit typing (from typing import Callable) that when called creates and returns a MagicMock, and make provider a pytest fixture that returns a Callable[[MagicMock], GeminiBatchProvider] (or Callable[[], GeminiBatchProvider] if you prefer it to construct its own mock) with full type hints and a docstring; update signatures to include return types and parameter types, reference the existing symbols mock_genai_client, provider, GeminiBatchProvider, MagicMock and pytest.fixture so tests can call the factory to get fresh instances instead of reusing a single concrete object.
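A typed factory fixture along the lines the reviewer describes could look like this; the upload return shape is an assumption for illustration, not the SDK's exact response:

```python
from typing import Callable
from unittest.mock import MagicMock

import pytest


def make_genai_client() -> MagicMock:
    """Build a fresh mocked genai client; each call returns a new instance."""
    client = MagicMock()
    # Assumed response shape: an object exposing `uri` and `name`.
    client.files.upload.return_value = MagicMock(
        uri="files/abc123", name="files/abc123"
    )
    return client


@pytest.fixture
def mock_genai_client() -> Callable[[], MagicMock]:
    """Factory fixture: tests call the returned factory to get an
    independent mock client per case instead of sharing one object."""
    return make_genai_client
```

Tests then call the factory (`client = mock_genai_client()`) so assertions on one case's mock can never leak into another.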
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/core/batch/gemini.py`:
- Around line 345-350: In the call to self._upload_to_gemini within batch
processing (the uploaded_file assignment), change the invalid mime_type="jsonl"
to mime_type="application/json" so the Gemini Files API accepts the upload;
update the mime_type argument passed into _upload_to_gemini in that call site
(uploaded_file = self._upload_to_gemini(...)) to use "application/json".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f9b617a-e689-4871-b3db-46c3de3626e7
📒 Files selected for processing (4)

- backend/app/core/batch/gemini.py
- backend/app/crud/stt_evaluations/batch.py
- backend/app/crud/stt_evaluations/cron.py
- backend/app/tests/core/batch/test_gemini.py
for completed_task in as_completed(upload_tasks):
    result = completed_task.result()
    if result.file_uri:
        file_uris.append(result.file_uri)
        mime_types.append(result.mime_type)
        sample_keys.append(str(result.sample.id))
        gemini_file_names.append(result.file_name)
    else:
-       failed_samples.append((sample, error))
+       failed_samples.append((result.sample, result.error))
        logger.error(
-           f"[start_stt_evaluation_batch] Failed to generate signed URL | "
-           f"sample_id: {sample.id}, error: {error}"
+           f"[start_stt_evaluation_batch] Failed to upload to Gemini | "
+           f"sample_id: {result.sample.id}, error: {result.error}"
        )

if failed_samples:
    logger.warning(
-       f"[start_stt_evaluation_batch] Signed URL failures | "
+       f"[start_stt_evaluation_batch] Gemini upload failures | "
        f"run_id: {run.id}, failed_count: {len(failed_samples)}, "
-       f"succeeded_count: {len(signed_urls)}"
+       f"succeeded_count: {len(file_uris)}"
    )

- if not signed_urls:
-     raise Exception("Failed to generate signed URLs for any audio files")
+ if not file_uris:
+     raise Exception("Failed to upload audio files to Gemini for any samples")
Don’t silently drop samples whose Gemini upload fails.
This path only logs failed uploads and then builds the batch from the successful subset. Since process_completed_stt_batch() later creates STTResult rows only for returned batch responses, those samples never get a persisted failure and the run can finish “completed” with missing results. Either create failed result rows here or abort the run when any upload fails.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/crud/stt_evaluations/batch.py` around lines 154 - 176, The
current loop logs failed Gemini uploads but drops those samples, causing missing
STTResult rows; update the upload-failure path in start_stt_evaluation_batch so
that for each entry in failed_samples you persist a failure STTResult (or
equivalent model) tied to the sample and run with the error message/state
instead of silently skipping them; use the same persistence API used by
process_completed_stt_batch (or the STTResult creation helper) so downstream
logic sees explicit failed results, and only raise/abort if you prefer the
alternative behavior (document choice) rather than leaving samples unrecorded.
⚠️ Outside diff range comments (1)
backend/app/core/batch/gemini.py (1)
504-518: ⚠️ Potential issue | 🟡 Minor: Clarify URI vs file-name examples in STT request docs.

The docstring says file_uris, but example values like "files/abc123" look like file names, not URIs. Please align the wording/examples to avoid misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/core/batch/gemini.py` around lines 504 - 518, The docstring for create_stt_batch_requests is inconsistent about whether parameter file_uris are full URIs or simple file names; update the docstring to be explicit and align the example: either (A) state that file_uris must be Gemini-style relative URIs (e.g., "files/abc123") and keep the example, or (B) change the example to a full URI (e.g., "https://.../files/abc123") if you intend full URLs. Edit the parameter description for file_uris and the example block so they consistently show the chosen format and ensure mime_types and prompt wording remain unchanged.
♻️ Duplicate comments (2)
backend/app/crud/stt_evaluations/batch.py (1)
154-176: ⚠️ Potential issue | 🟠 Major: Don't silently drop samples when Gemini upload fails.

This path logs upload failures but still submits batches for the successful subset only. Those failed samples won't produce STTResult rows, so runs can complete with missing per-sample outcomes. Persist explicit failed results for failed_samples (or fail the run before submission).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/crud/stt_evaluations/batch.py` around lines 154 - 176, The current loop logs Gemini upload failures into failed_samples but then proceeds to submit only the successful uploads, leaving missing STTResult rows; update the post-upload handling in start_stt_evaluation_batch to persist explicit failure results for each entry in failed_samples (or alternatively abort the run): iterate failed_samples and create a failed STTResult (or call the existing helper that writes STTResult rows) for each sample with status/error fields populated (include sample id, run id, error message from result.error and an appropriate failure status), and ensure any downstream submission logic uses the persisted results (or raise an exception to stop the run) so failures are not silently dropped.

backend/app/core/batch/gemini.py (1)
347-351: ⚠️ Potential issue | 🟠 Major: Use a supported MIME type for JSONL uploads.

mime_type="jsonl" is likely to fail Files API MIME validation; use a standard type (application/json) for batch JSONL uploads.

Suggested fix

uploaded_file = self._upload_to_gemini(
    content=content,
    suffix=".jsonl",
-   mime_type="jsonl",
+   mime_type="application/json",
    display_name=f"batch-input-{int(time.time())}",
)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/core/batch/gemini.py` around lines 347 - 351, The upload uses an unsupported mime_type string "jsonl" which may fail the Gemini Files API validation; update the call to _upload_to_gemini (where uploaded_file is created) to pass a standard MIME type such as "application/json" (keep suffix=".jsonl" and display_name as-is) so the Files API accepts the JSONL batch input.
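Relatedly, for the walkthrough's "inferred MIME" step on the audio side, the standard library covers common extensions; this is a hedged sketch, and the audio/mpeg fallback is an assumption rather than the repository's actual default:

```python
import mimetypes


def guess_audio_mime(key: str, fallback: str = "audio/mpeg") -> str:
    """Guess a MIME type from the object key's extension, falling back to a
    configurable default when the extension is missing or not an audio type."""
    mime, _ = mimetypes.guess_type(key)
    if mime and mime.startswith("audio/"):
        return mime
    return fallback
```

Guessing from the S3 key keeps the upload path from ever passing None as a MIME type to the Files API.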
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 18a9403c-08ef-4940-87c0-243f4819a32e
📒 Files selected for processing (3)

- backend/app/core/batch/gemini.py
- backend/app/crud/stt_evaluations/batch.py
- backend/app/crud/stt_evaluations/cron.py
Summary
Target issue is #672
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run fastapi run --reload app/main.py or docker compose up in the repository root and test.

Notes
New Features
Improvements
Testing
Tested with a 53-file dataset for STT Evaluation; all transcriptions were successful.