Feat: Implement robust 3-point audio language auto-detect with majority voting when no metadata, and add manual CLI override #4
Conversation
Pull request overview
Adds a more reliable audio-language detection path when container metadata is missing, so the translation pipeline can still be triggered correctly (plus a manual CLI override).
Changes:
- Add `--audio-language` CLI flag to manually override the detected/metadata audio language.
- Add WhisperX-based audio sampling language detection (3×30s clips with majority vote) when metadata is missing.
- Wire the detection/override into the audiosync workflow; add the `ffmpeg-python` dependency.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| requirements.txt | Adds ffmpeg-python dependency for probing/extracting audio samples. |
| anchor/utils/whisper.py | Introduces detect_audio_language_whisper() using 3-point sampling + majority vote. |
| anchor/utils/args.py | Adds --audio-language CLI argument. |
| anchor/core/audiosync/audiosync.py | Uses override → metadata → Whisper-based sampling to determine audio language and decide whether to translate. |
```python
if detected_languages:
    # Counter counts occurrences. most_common(1) returns e.g., [('en', 2)]
    # We then extract just the language code from that nested result.
    winner = Counter(detected_languages).most_common(1)[0][0]
    return winner
else:
    # Absolute fallback if ALL three clips were purely silent
    return "en"
```
This hard-codes English as a fallback when no language is detected (silent clips / probe failures). Returning 'en' can incorrectly force translation/model selection for non-English audio; it’s safer to return None/"unknown" and let the rest of the pipeline keep Whisper in auto-detect mode (or explicitly handle an unknown audio language).
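A minimal sketch of the suggested behavior (the helper name here is hypothetical, not from the PR): return `None` instead of `"en"` so callers can keep Whisper in auto-detect mode.

```python
from collections import Counter

def pick_language(detected_languages):
    """Majority vote over per-clip detections; None when nothing usable was detected."""
    if not detected_languages:
        # All clips were silent or probing failed: report "unknown" so
        # downstream code can leave Whisper in auto-detect mode instead
        # of wrongly assuming English.
        return None
    # most_common(1) returns e.g. [('en', 2)]; take the language code.
    return Counter(detected_languages).most_common(1)[0][0]
```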
```python
# Load the model ONCE outside the loop to save massive amounts of time
model = whisperx.load_model("base", device, compute_type=compute_type, asr_options={"without_timestamps": True})
```
detect_audio_language_whisper loads an additional WhisperX model but doesn’t explicitly free it. On CUDA this can leave VRAM allocated until GC runs and can contribute to OOM when the main sync model is loaded next. Consider del model (and torch.cuda.empty_cache() when applicable) before returning, or run this detection model on CPU explicitly.
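A sketch of that cleanup pattern, with hypothetical callables standing in for `whisperx.load_model(...)` and the sampling loop; the `try/finally` guarantees the detection model is released before the main sync model loads.

```python
import gc

def detect_then_free(load_model, run_detection):
    """Run detection with a temporary model, then release it eagerly.

    load_model / run_detection are hypothetical callables standing in for
    whisperx.load_model(...) and the 3-clip sampling loop.
    """
    model = load_model()
    try:
        result = run_detection(model)
    finally:
        del model          # drop the last reference instead of waiting for GC
        gc.collect()
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()  # return cached VRAM to the driver
        except ImportError:
            pass  # CPU-only environment: nothing to free
    return result
```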
```python
    "--audio-language",
    type=str,
    help="Manually specify the video's audio language (e.g. 'en', 'sv') bypassing metadata."
)
```
The closing parenthesis for this parser.add_argument(...) call is unindented compared to the rest of the file’s argument definitions, which makes the block harder to read and easy to mis-edit. Please indent it consistently with the surrounding parser.add_argument calls.
```python
if getattr(args, "audio_language", None):
    audio_lang = args.audio_language.lower()
    console.print(f"[dim]🌐 Audio language manually overridden to: [bold cyan]{audio_lang.upper()}[/bold cyan][/dim]")
```
The CLI override is lowercased but not normalized. If a user passes a 3-letter code like eng/swe (common in container metadata) it won’t match the 2-letter codes used elsewhere (e.g., NLLB mapping and *.en model selection). Consider normalizing the override with the existing normalize_language_code() helper so both 2- and 3-letter inputs behave consistently.
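A rough sketch of what that normalization could look like; the mapping table below is a hypothetical stand-in for the repo's actual `normalize_language_code()` helper, covering only a few common ISO 639-2 container codes.

```python
# Hypothetical subset mapping of 3-letter container codes (ISO 639-2,
# both bibliographic and terminological variants) to the 2-letter codes
# used by the NLLB mapping and *.en model selection.
_ISO_639_2_TO_1 = {
    "eng": "en", "swe": "sv",
    "ger": "de", "deu": "de",
    "fre": "fr", "fra": "fr",
}

def normalize_language_code(code):
    """Return a 2-letter code for known 3-letter inputs; pass others through."""
    code = code.strip().lower()
    return _ISO_639_2_TO_1.get(code, code)
```

With this in place the override path becomes `audio_lang = normalize_language_code(args.audio_language)`, so `eng` and `en` behave identically downstream.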
```python
        meta_lang = get_audio_language(vid)
        if meta_lang:
            console.print(f"[dim]🌐 Metadata language detected: [bold cyan]{meta_lang.upper()}[/bold cyan][/dim]")
    # Detection & Language Override
```
Indentation is inconsistent here: the # Detection & Language Override line is less indented than the surrounding statements inside the for loop, which will end the loop block early and make the next indented line (audio_lang = None) raise an IndentationError/SyntaxError. Align this comment (and any related code) to the same indentation level as the other statements inside the loop.
Hi!
This is a fix for my reported issue #3 (comment)
This PR adds an auto-detect feature for files missing language metadata. It uses WhisperX to sample three separate 30-second clips (at 30%, 50%, and 70% of the video, purposefully avoiding intros/studio logos) and uses majority voting to determine the correct language.
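The clip placement described above can be sketched roughly like this (the function name and the clamping behavior are assumptions for illustration, not the PR's exact code):

```python
def sample_offsets(duration_s, fractions=(0.30, 0.50, 0.70), clip_len_s=30.0):
    """Start times (seconds) for the detection clips.

    The fractions deliberately skip the start of the video to avoid
    intros/studio logos; each start is clamped so a full 30 s clip
    still fits inside the file.
    """
    offsets = []
    for f in fractions:
        start = duration_s * f
        start = max(0.0, min(start, duration_s - clip_len_s))
        offsets.append(start)
    return offsets
```

Each offset would then be fed to an ffmpeg-python extraction and a WhisperX language-detection pass, with the three results majority-voted.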
It also adds an --audio-language flag for manual overrides.
It works really nicely from what I've been able to test, and it doesn't add much time at all (none if language metadata already exists, of course).
Let me know what you think