Add optional TwelveLabs Pegasus cloud captioning backend (--backend pegasus)#16

Open: mohit-twelvelabs wants to merge 1 commit into Brekel:main from mohit-twelvelabs:feat/twelvelabs-integration

@mohit-twelvelabs

Hi! I'm Mohit, I work at TwelveLabs (@mohit-twelvelabs).

This adds an optional cloud captioning backend powered by TwelveLabs Pegasus, a video-native understanding model, alongside the existing local Qwen-VL / Gemma 4 engines.

What it adds

  • New pegasus_backend.py with a PegasusEngine that duck-types the subset of QwenEngine the CLI uses (find_files / load_model / unload_model / generate_batch + the is_gguf/model sentinels), so it drops into the existing caption loop with no changes to local-model code.
  • CLI flags --backend {local,pegasus} (default local) and --tl-model (default pegasus1.5).
  • It uploads each local video as a TwelveLabs asset, waits for it to be ready, then calls analyze(...) and writes the caption with the same trigger-word / .txt conventions the local engine uses.
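
The duck-typing idea above can be sketched roughly as follows. This is a minimal illustration, not the code in pegasus_backend.py: the method names come from the list above, but the signatures, the `CaptionEngine` protocol, `StubCloudEngine`, and `run_caption_loop` are all hypothetical, and no network calls are made.

```python
from typing import List, Protocol, Tuple


class CaptionEngine(Protocol):
    """Hypothetical view of the engine surface a caption loop relies on."""

    is_gguf: bool
    model: object

    def find_files(self, folder: str) -> List[str]: ...
    def load_model(self) -> Tuple[bool, str]: ...
    def unload_model(self) -> None: ...
    def generate_batch(self, paths: List[str], prompt: str) -> List[str]: ...


class StubCloudEngine:
    """Stand-in for a cloud engine; it only mimics the interface."""

    is_gguf = False  # sentinel: not a GGUF model
    model = None     # sentinel: nothing loaded yet

    def __init__(self, files):
        self._files = list(files)

    def find_files(self, folder):
        # A real engine would scan `folder`; the stub returns fixed names.
        return list(self._files)

    def load_model(self):
        self.model = object()  # placeholder for an authenticated client
        return True, "client ready"

    def unload_model(self):
        self.model = None

    def generate_batch(self, paths, prompt):
        return [f"caption for {p}" for p in paths]


def run_caption_loop(engine: CaptionEngine, folder: str, prompt: str) -> dict:
    """Drive any engine that satisfies the protocol, local or cloud."""
    ok, message = engine.load_model()
    if not ok:
        raise RuntimeError(message)
    try:
        paths = engine.find_files(folder)
        return dict(zip(paths, engine.generate_batch(paths, prompt)))
    finally:
        engine.unload_model()
```

Because the loop depends only on the shared method names, swapping the local engine for a cloud one needs no change to the loop itself, which is the property the PR relies on.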

Why it helps this project

VisionCaptioner currently captions video by sampling ~8 frames and feeding them to a local VLM, which drops most of the motion and temporal ordering in a clip. Pegasus ingests the whole video server-side and reasons over the actual action and sequence of events — often a better fit for captioning video training datasets where motion matters. It also needs no GPU/torch, so it's usable on machines that can't run the larger local models.

Opt-in and non-breaking

  • Default stays --backend local; nothing changes unless you ask for Pegasus.
  • The twelvelabs SDK is imported lazily, so users who never select it need no new dependency. It's listed only as a commented optional line in requirements.txt.
  • Pegasus is video-only; images return a clear per-item error so they're skipped (route images through the local backend). Mixed folders degrade cleanly.
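
A rough sketch of the two behaviours above, the lazy import and the per-item error for images. The helper names (`load_client`, `caption_or_error`), the extension set, and the exact error wording are assumptions for illustration; only the `(success, message)` return shape and the "Error: ..." prefix come from this PR.

```python
import importlib
import os
from pathlib import Path

# Assumed set of video extensions; the real backend may differ.
VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".webm"}


def load_client(module_name="twelvelabs", env_var="TWELVELABS_API_KEY"):
    """Import the optional SDK only when the cloud backend is selected.

    Returns a (success, message) tuple, mirroring the local engine's contract.
    """
    if not os.environ.get(env_var):
        return False, f"{env_var} is not set"
    try:
        importlib.import_module(module_name)  # deferred until this call
    except ImportError:
        return False, f"Optional dependency '{module_name}' is not installed"
    return True, "SDK imported"


def caption_or_error(path, caption_video):
    """Caption one file, turning failures into an 'Error: ...' string.

    `caption_video` is a hypothetical callable that does the upload/analyze
    work; any exception it raises is captured per item so the batch continues.
    """
    if Path(path).suffix.lower() not in VIDEO_EXTS:
        return "Error: Pegasus is video-only; use the local backend for images"
    try:
        return caption_video(path)
    except Exception as exc:
        return f"Error: {exc}"
```

Keeping the import inside the function means a user on the default local backend never triggers it, so the missing package is only reported when someone actually asks for the cloud backend.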

How it was tested

  • tests/test_pegasus_backend.py: 11 no-network unit tests mocking the SDK (file discovery, skip/mask handling, missing-key + client construction, max_tokens clamp to the model's 512 minimum, trigger-word prepend, image rejection, per-item error capture).
  • 1 live connectivity test gated on TWELVELABS_API_KEY (skipped without it) asserting a 512-dim Marengo embedding.
  • Full suite: 176 passed, 1 skipped. I also verified the Pegasus analyze request wiring against the live API (requests reach the model and validate server-side); a full end-to-end caption on a long video can be slow, so the unit tests cover the wiring deterministically.
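
For readers unfamiliar with the no-network style, a test of the max_tokens clamp might look roughly like this. It is a sketch, not a copy of tests/test_pegasus_backend.py: `clamp_max_tokens`, `prepend_trigger`, and `analyze_video` are hypothetical helpers, the ", " separator is an assumption, and the `video` argument is a plain placeholder rather than the SDK's asset wrapper. The keyword names passed to `analyze` follow the call described in this PR.

```python
from unittest.mock import MagicMock

PEGASUS_MIN_TOKENS = 512  # server-side minimum stated in this PR


def clamp_max_tokens(requested: int, minimum: int = PEGASUS_MIN_TOKENS) -> int:
    """Raise the requested value to the minimum; larger values pass through."""
    return max(int(requested), minimum)


def prepend_trigger(caption: str, trigger: str = "", sep: str = ", ") -> str:
    """Prefix the caption with a trigger word (separator is assumed)."""
    return f"{trigger}{sep}{caption}" if trigger else caption


def analyze_video(client, video, prompt, max_tokens, model_name="pegasus1.5"):
    """Forward one request to the client with a clamped token budget."""
    return client.analyze(
        model_name=model_name,
        video=video,
        prompt=prompt,
        max_tokens=clamp_max_tokens(max_tokens),
    )


def test_clamp_reaches_client():
    client = MagicMock()  # stands in for the SDK client; no network I/O
    analyze_video(client, video="asset-placeholder",
                  prompt="Describe the action.", max_tokens=128)
    # The mock records the call, so the clamped value can be checked directly.
    assert client.analyze.call_args.kwargs["max_tokens"] == 512
```

Mocking the client keeps the test deterministic and fast, which matches the rationale given above for covering the wiring with unit tests rather than a slow end-to-end run.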

Usage:

export TWELVELABS_API_KEY="tlk_..."
python cli.py --folder /path/to/videos --backend pegasus --prompt "Describe the action." --max-tokens 512

You can grab a free API key at https://twelvelabs.io — there's a generous free tier.

…able from the CLI via --backend pegasus, as a video-native alternative to the local Qwen-VL/Gemma 4 frame-sampling pipeline. VisionCaptioner normally captions a video by extracting a handful of stills (extract_video_frames, default 8) and feeding them to a local VLM, which discards most of the motion and temporal ordering in a clip; Pegasus instead ingests the whole video server-side and reasons over the actual action and sequence of events, which is often a better fit when captioning video training data (LoRA/fine-tune datasets) where motion matters.

The integration is implemented as a new pegasus_backend.py PegasusEngine that duck-types the subset of the QwenEngine interface the CLI relies on (find_files / load_model / unload_model / generate_batch plus the is_gguf and model sentinels), so it slots into the existing caption loop without touching any local-model code path. load_model() here just constructs an authenticated twelvelabs.TwelveLabs client (reading TWELVELABS_API_KEY from the environment) and returns the same (success, message) tuple the local engine does; generate_batch() uploads each local video as a direct TwelveLabs asset, waits for it to reach the "ready" state, then calls client.analyze(model_name=..., video=VideoContext_AssetId(...), prompt=..., max_tokens=...) and applies the existing trigger-word prepend convention. Images are returned as a per-item "Error: ..." note (Pegasus is video-only; route images through the local backend) using the same error-string contract the CLI already checks before writing a .txt, so mixed folders degrade cleanly.

The backend is fully opt-in and non-breaking: the default remains --backend local, the twelvelabs SDK is imported lazily inside load_model/generate_batch so users who never select Pegasus need no extra dependency, and the dependency is listed only as a commented optional line in requirements.txt.

cli.py gained --backend {local,pegasus} and --tl-model arguments plus a dedicated caption-mode branch for the cloud engine (one video per analyze call rather than the local batch, since Pegasus bills per request and processes whole files); --max-tokens is clamped up to the model's server-side minimum of 512 to avoid a 400. Documented the new arguments and an end-to-end example in commandline_interface.md and added a feature bullet to README.md, both pointing to the free key at https://twelvelabs.io.

Added tests/test_pegasus_backend.py: eleven no-network unit tests that mock the SDK (file discovery, mask/skip handling, missing-key and client-construction paths, the max_tokens clamp, trigger-word prepend, image rejection, and per-item error capture) plus one live connectivity test gated on TWELVELABS_API_KEY (skipped when unset) that asserts a Marengo text embedding returns 512 dimensions. Full suite passes (176 passed, 1 skipped).