Add optional TwelveLabs Pegasus cloud captioning backend (--backend pegasus) by mohit-twelvelabs · Pull Request #16 · Brekel/VisionCaptioner

mohit-twelvelabs · 2026-06-25T20:39:40Z

Hi! I'm Mohit, and I work at TwelveLabs (@mohit-twelvelabs).

This adds an optional cloud captioning backend powered by TwelveLabs Pegasus, a video-native understanding model, alongside the existing local Qwen-VL / Gemma 4 engines.

What it adds

New pegasus_backend.py with a PegasusEngine that duck-types the subset of QwenEngine the CLI uses (find_files / load_model / unload_model / generate_batch + the is_gguf/model sentinels), so it drops into the existing caption loop with no changes to local-model code.
CLI flags --backend {local,pegasus} (default local) and --tl-model (default pegasus1.5).
It uploads each local video as a TwelveLabs asset, waits for it to be ready, then calls analyze(...) and writes the caption with the same trigger-word / .txt conventions the local engine uses.
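For reviewers who want a picture of the duck-typing idea, here is a minimal sketch, not the code in pegasus_backend.py. The method and attribute names (find_files, load_model, unload_model, generate_batch, is_gguf, model) come from the list above; the signatures, the StubCloudEngine class, the VIDEO_EXTS set and the run_caption_loop helper are hypothetical and only illustrate how an engine with the same surface can be swapped into one loop.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical extension list, used only for this illustration.
VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".webm"}


class StubCloudEngine:
    """Stand-in engine exposing the same names the CLI is said to use."""

    is_gguf = False  # sentinel: this engine is not a GGUF model

    def __init__(self) -> None:
        self.model = None  # sentinel: None means "not loaded"

    def find_files(self, folder: str) -> List[str]:
        # Keep only files whose extension looks like a video.
        return sorted(
            os.path.join(folder, name)
            for name in os.listdir(folder)
            if os.path.splitext(name)[1].lower() in VIDEO_EXTS
        )

    def load_model(self) -> Tuple[bool, str]:
        # A real cloud engine would build an API client here.
        self.model = object()
        return True, "client ready"

    def unload_model(self) -> None:
        self.model = None

    def generate_batch(self, paths: List[str], prompt: str, max_tokens: int) -> List[str]:
        # Placeholder captions; no network call is made in this sketch.
        return [f"caption for {os.path.basename(p)}" for p in paths]


def run_caption_loop(engine, folder: str, prompt: str, max_tokens: int = 512) -> Dict[str, str]:
    """Drive any engine that offers the four methods above."""
    ok, message = engine.load_model()
    if not ok:
        raise RuntimeError(message)
    try:
        files = engine.find_files(folder)
        captions = engine.generate_batch(files, prompt, max_tokens)
        return dict(zip(files, captions))
    finally:
        engine.unload_model()
```

Because run_caption_loop only touches those names, any object that provides them can be passed in without changes to the loop.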

Why it helps this project

VisionCaptioner currently captions video by sampling ~8 frames and feeding them to a local VLM, which drops most of the motion and temporal ordering in a clip. Pegasus ingests the whole video server-side and reasons over the actual action and sequence of events — often a better fit for captioning video training datasets where motion matters. It also needs no GPU/torch, so it's usable on machines that can't run the larger local models.

Opt-in and non-breaking

Default stays --backend local; nothing changes unless you ask for Pegasus.
The twelvelabs SDK is imported lazily, so users who never select it need no new dependency. It's listed only as a commented optional line in requirements.txt.
Pegasus is video-only; images return a clear per-item error so they're skipped (route images through the local backend). Mixed folders degrade cleanly.
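A rough sketch of the two opt-in mechanisms described above, assuming nothing about the actual implementation beyond what this description states: the SDK is imported only when the backend is selected, and unsupported or failing items come back as an "Error: ..." string that the caller checks before writing a .txt. The helper names (lazy_import_sdk, caption_or_error, should_write) and the IMAGE_EXTS set are hypothetical.

```python
import importlib
import os

# Hypothetical list of image extensions, for illustration only.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def lazy_import_sdk(module_name: str = "twelvelabs"):
    """Import the optional SDK only when the cloud backend is requested."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"The pegasus backend needs the optional '{module_name}' package."
        ) from exc


def caption_or_error(path: str, caption_video) -> str:
    """Return a caption, or an 'Error: ...' string for this one item."""
    ext = os.path.splitext(path)[1].lower()
    if ext in IMAGE_EXTS:
        return "Error: Pegasus is video-only; route images through the local backend."
    try:
        return caption_video(path)
    except Exception as exc:  # capture per item so one failure does not stop the run
        return f"Error: {exc}"


def should_write(result: str) -> bool:
    """Mirror of the error-string check: skip writing when the item failed."""
    return not result.startswith("Error:")
```

With this shape, a mixed folder yields captions for videos and skipped entries for images, and a machine without the SDK only sees an error if Pegasus is actually selected.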

How it was tested

tests/test_pegasus_backend.py: 11 no-network unit tests mocking the SDK (file discovery, skip/mask handling, missing-key + client construction, max_tokens clamp to the model's 512 minimum, trigger-word prepend, image rejection, per-item error capture).
1 live connectivity test gated on TWELVELABS_API_KEY (skipped without it) asserting a 512-dim Marengo embedding.
Full suite: 176 passed, 1 skipped. I also verified the Pegasus analyze request wiring against the live API (requests reach the model and validate server-side); a full end-to-end caption on a long video can be slow, so the unit tests cover the wiring deterministically.
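To make two of the tested behaviours concrete, here is a small sketch of a max_tokens clamp and a trigger-word prepend. The 512 minimum is the figure stated above; the function names and the "trigger, caption" separator are assumptions for illustration and may differ from what the engine actually does.

```python
# Minimum stated in this PR description for the model's max_tokens.
PEGASUS_MIN_MAX_TOKENS = 512


def clamp_max_tokens(requested: int, minimum: int = PEGASUS_MIN_MAX_TOKENS) -> int:
    """Raise values below the minimum so the request is not rejected."""
    return max(int(requested), minimum)


def prepend_trigger(caption: str, trigger_word: str) -> str:
    """Put the trigger word in front of the caption (separator is assumed)."""
    caption = caption.strip()
    if not trigger_word:
        return caption
    return f"{trigger_word}, {caption}"
```

Both helpers are pure functions, which is what lets tests of this kind run without any network access.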

Usage:

export TWELVELABS_API_KEY="tlk_..."
python cli.py --folder /path/to/videos --backend pegasus --prompt "Describe the action." --max-tokens 512

You can grab a free API key at https://twelvelabs.io — there's a generous free tier.

…able from the CLI via --backend pegasus, as a video-native alternative to the local Qwen-VL/Gemma 4 frame-sampling pipeline. VisionCaptioner normally captions a video by extracting a handful of stills (extract_video_frames, default 8) and feeding them to a local VLM, which discards most of the motion and temporal ordering in a clip; Pegasus instead ingests the whole video server-side and reasons over the actual action and sequence of events, which is often a better fit when captioning video training data (LoRA/fine-tune datasets) where motion matters. The integration is implemented as a new pegasus_backend.py PegasusEngine that duck-types the subset of the QwenEngine interface the CLI relies on (find_files / load_model / unload_model / generate_batch plus the is_gguf and model sentinels), so it slots into the existing caption loop without touching any local-model code path. load_model() here just constructs an authenticated twelvelabs.TwelveLabs client (reading TWELVELABS_API_KEY from the environment) and returns the same (success, message) tuple the local engine does; generate_batch() uploads each local video as a direct TwelveLabs asset, waits for it to reach the "ready" state, then calls client.analyze(model_name=..., video=VideoContext_AssetId(...), prompt=..., max_tokens=...) and applies the existing trigger-word prepend convention. Images are returned as a per-item "Error: ..." note (Pegasus is video-only — route images through the local backend) using the same error-string contract the CLI already checks before writing a .txt, so mixed folders degrade cleanly. The backend is fully opt-in and non-breaking: the default remains --backend local, the twelvelabs SDK is imported lazily inside load_model/generate_batch so users who never select Pegasus need no extra dependency, and the dependency is listed only as a commented optional line in requirements.txt. 
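The upload, wait, analyze sequence described above can be sketched as follows. This is an illustration with injected callables, not the SDK calls themselves: upload, get_status and analyze are hypothetical stand-ins for whatever the client exposes, the "ready" state is taken from the text above, and the "failed" state, timeout and polling interval are assumptions.

```python
import time


def wait_until_ready(get_status, timeout_s: float = 600.0, poll_s: float = 5.0,
                     sleep=time.sleep) -> None:
    """Poll get_status() until it returns "ready", or give up."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "ready":
            return
        if status == "failed":  # assumed terminal state, not confirmed by the source
            raise RuntimeError("asset processing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"asset not ready after {timeout_s} s")
        sleep(poll_s)


def caption_video(path: str, prompt: str, upload, get_status, analyze,
                  max_tokens: int = 512, sleep=time.sleep) -> str:
    """One video per call: upload, wait for readiness, then request a caption."""
    asset_id = upload(path)
    wait_until_ready(lambda: get_status(asset_id), sleep=sleep)
    return analyze(asset_id, prompt, max_tokens)
```

Passing the three operations in as callables keeps the control flow testable with fakes, which matches the no-network testing approach described in this PR.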
cli.py gained --backend {local,pegasus} and --tl-model arguments plus a dedicated caption-mode branch for the cloud engine (one video per analyze call rather than the local batch, since Pegasus bills per request and processes whole files); --max-tokens is clamped up to the model's server-side minimum of 512 to avoid a 400. Documented the new arguments and an end-to-end example in commandline_interface.md and added a feature bullet to README.md, both pointing to the free key at https://twelvelabs.io. Added tests/test_pegasus_backend.py: eleven no-network unit tests that mock the SDK (file discovery, mask/skip handling, missing-key and client-construction paths, the max_tokens clamp, trigger-word prepend, image rejection, and per-item error capture) plus one live connectivity test gated on TWELVELABS_API_KEY (skipped when unset) that asserts a Marengo text embedding returns 512 dimensions. Full suite passes (176 passed, 1 skipped).
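For reference, the two new arguments could be declared with argparse roughly like this. It is a sketch rather than the contents of cli.py: the flag names and the local and pegasus1.5 defaults come from the description, while build_parser and the defaults shown for --prompt and --max-tokens are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Caption media in a folder (sketch).")
    parser.add_argument("--folder", required=True, help="Folder containing media files")
    parser.add_argument("--prompt", default="Describe the action.")  # assumed default
    parser.add_argument("--max-tokens", type=int, default=512)  # assumed default
    # Restricting choices makes argparse reject unknown backends up front.
    parser.add_argument("--backend", choices=["local", "pegasus"], default="local")
    parser.add_argument("--tl-model", default="pegasus1.5")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.backend, args.tl_model, args.max_tokens)
```

Keeping local as the default means existing invocations parse exactly as before, and only an explicit --backend pegasus reaches the cloud branch.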

mohit-twelvelabs wants to merge 1 commit into Brekel:main from mohit-twelvelabs:feat/twelvelabs-integration
