Add optional TwelveLabs Pegasus cloud captioning backend (--backend pegasus) #16
Open
mohit-twelvelabs wants to merge 1 commit into
…able from the CLI via --backend pegasus, as a video-native alternative to the local Qwen-VL/Gemma 4 frame-sampling pipeline. VisionCaptioner normally captions a video by extracting a handful of stills (extract_video_frames, default 8) and feeding them to a local VLM, which discards most of the motion and temporal ordering in a clip; Pegasus instead ingests the whole video server-side and reasons over the actual action and sequence of events, which is often a better fit when captioning video training data (LoRA/fine-tune datasets) where motion matters.
The integration is implemented as a new pegasus_backend.py PegasusEngine that duck-types the subset of the QwenEngine interface the CLI relies on (find_files / load_model / unload_model / generate_batch plus the is_gguf and model sentinels), so it slots into the existing caption loop without touching any local-model code path. load_model() here just constructs an authenticated twelvelabs.TwelveLabs client (reading TWELVELABS_API_KEY from the environment) and returns the same (success, message) tuple the local engine does; generate_batch() uploads each local video as a direct TwelveLabs asset, waits for it to reach the "ready" state, then calls client.analyze(model_name=..., video=VideoContext_AssetId(...), prompt=..., max_tokens=...) and applies the existing trigger-word prepend convention. Images are returned as a per-item "Error: ..." note (Pegasus is video-only; route images through the local backend) using the same error-string contract the CLI already checks before writing a .txt, so mixed folders degrade cleanly.
The backend is fully opt-in and non-breaking: the default remains --backend local, the twelvelabs SDK is imported lazily inside load_model/generate_batch so users who never select Pegasus need no extra dependency, and the dependency is listed only as a commented optional line in requirements.txt.
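The duck-typed shape described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual PegasusEngine: the class name PegasusEngineSketch, the client_factory injection point, the extension sets, the "trigger, caption" join format, and the error wording are all hypothetical. The upload and analyze step is deliberately left as a placeholder rather than guessed at, and the TwelveLabs(api_key=...) constructor call is assumed from the SDK's usual usage.

```python
import os
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".webm"}   # assumed extension list
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}  # assumed extension list


class PegasusEngineSketch:
    """Illustrative stand-in for a cloud engine that mimics a local engine's interface."""

    is_gguf = False  # sentinel the CLI is described as checking
    model = None     # stays None until load_model() succeeds

    def __init__(self, client_factory=None):
        # client_factory is a hypothetical injection point so this sketch can
        # run (and be tested) without the twelvelabs package installed.
        self._client_factory = client_factory
        self.client = None

    def find_files(self, folder):
        exts = VIDEO_EXTS | IMAGE_EXTS
        return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in exts)

    def load_model(self):
        """Return a (success, message) tuple, like the local engine is said to."""
        api_key = os.environ.get("TWELVELABS_API_KEY")
        if not api_key:
            return False, "TWELVELABS_API_KEY is not set"
        try:
            factory = self._client_factory
            if factory is None:
                # Lazy import: only users who select this backend need the SDK.
                from twelvelabs import TwelveLabs
                factory = TwelveLabs
            self.client = factory(api_key=api_key)
        except Exception as exc:
            return False, f"Could not create client: {exc}"
        self.model = "cloud"
        return True, "Pegasus client ready"

    def unload_model(self):
        self.client = None
        self.model = None

    def _caption_video(self, path, prompt):
        # Placeholder: the PR describes uploading the file as an asset, waiting
        # until it is ready, then calling client.analyze(...). Not reproduced here.
        raise NotImplementedError("upload + analyze step omitted in this sketch")

    def generate_batch(self, paths, prompt, trigger_word=""):
        """Return one string per input; failures become 'Error: ...' strings."""
        results = []
        for path in map(Path, paths):
            if path.suffix.lower() in IMAGE_EXTS:
                results.append("Error: Pegasus is video-only; use the local backend for images")
                continue
            try:
                caption = self._caption_video(path, prompt).strip()
            except Exception as exc:  # per-item capture keeps the batch going
                results.append(f"Error: {exc}")
                continue
            # Assumed prepend format: "<trigger>, <caption>".
            results.append(f"{trigger_word}, {caption}" if trigger_word else caption)
        return results
```

Returning an error string per item, instead of raising, is what lets a caller skip the .txt write for that one file while still processing the rest of a mixed folder.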
cli.py gained --backend {local,pegasus} and --tl-model arguments plus a dedicated caption-mode branch for the cloud engine (one video per analyze call rather than the local batch, since Pegasus bills per request and processes whole files); --max-tokens is clamped up to the model's server-side minimum of 512 to avoid a 400.
Documented the new arguments and an end-to-end example in commandline_interface.md and added a feature bullet to README.md, both pointing to the free key at https://twelvelabs.io.
Added tests/test_pegasus_backend.py: eleven no-network unit tests that mock the SDK (file discovery, mask/skip handling, missing-key and client-construction paths, the max_tokens clamp, trigger-word prepend, image rejection, and per-item error capture) plus one live connectivity test gated on TWELVELABS_API_KEY (skipped when unset) that asserts a Marengo text embedding returns 512 dimensions. Full suite passes (176 passed, 1 skipped).
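As a rough illustration of the argument wiring and the clamp described above, the following sketch uses only argparse. The flag names and the pegasus1.5 default come from the PR text; the helper names build_parser and effective_max_tokens, the constant name, and the --max-tokens default of 256 are assumptions made for the example and may not match cli.py.

```python
import argparse

PEGASUS_MIN_MAX_TOKENS = 512  # server-side minimum quoted in the PR text


def build_parser():
    """Hypothetical parser covering only the arguments this PR discusses."""
    parser = argparse.ArgumentParser(description="caption CLI sketch")
    parser.add_argument("--backend", choices=["local", "pegasus"], default="local")
    parser.add_argument("--tl-model", default="pegasus1.5")
    parser.add_argument("--max-tokens", type=int, default=256)  # default is an assumption
    return parser


def effective_max_tokens(backend, requested):
    """Clamp up to the Pegasus minimum; leave the local backend's value untouched."""
    if backend == "pegasus":
        return max(requested, PEGASUS_MIN_MAX_TOKENS)
    return requested


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.backend, args.tl_model, effective_max_tokens(args.backend, args.max_tokens))
```

Clamping upward on the client side, rather than rejecting the value, keeps an existing --max-tokens setting usable when switching backends; the trade-off is that the request may return more tokens than the user asked for.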
Hi! I'm Mohit, I work at TwelveLabs (@mohit-twelvelabs).
This adds an optional cloud captioning backend powered by TwelveLabs Pegasus, a video-native understanding model, alongside the existing local Qwen-VL / Gemma 4 engines.
What it adds
- pegasus_backend.py with a PegasusEngine that duck-types the subset of QwenEngine the CLI uses (find_files / load_model / unload_model / generate_batch + the is_gguf / model sentinels), so it drops into the existing caption loop with no changes to local-model code.
- CLI arguments --backend {local,pegasus} (default local) and --tl-model (default pegasus1.5).
- Each video is uploaded as a TwelveLabs asset; the backend then calls analyze(...) and writes the caption with the same trigger-word / .txt conventions the local engine uses.
Why it helps this project
VisionCaptioner currently captions video by sampling ~8 frames and feeding them to a local VLM, which drops most of the motion and temporal ordering in a clip. Pegasus ingests the whole video server-side and reasons over the actual action and sequence of events — often a better fit for captioning video training datasets where motion matters. It also needs no GPU/torch, so it's usable on machines that can't run the larger local models.
Opt-in and non-breaking
- The default remains --backend local; nothing changes unless you ask for Pegasus.
- The twelvelabs SDK is imported lazily, so users who never select it need no new dependency. It's listed only as a commented optional line in requirements.txt.
How it was tested
- tests/test_pegasus_backend.py: 11 no-network unit tests mocking the SDK (file discovery, skip/mask handling, missing-key + client construction, max_tokens clamp to the model's 512 minimum, trigger-word prepend, image rejection, per-item error capture).
- One live connectivity test gated on TWELVELABS_API_KEY (skipped without it) asserting a 512-dim Marengo embedding.
- Checked the analyze request wiring against the live API (requests reach the model and validate server-side); a full end-to-end caption on a long video can be slow, so the unit tests cover the wiring deterministically.
Usage:
You can grab a free API key at https://twelvelabs.io — there's a generous free tier.