Fully Open Framework for Democratized Multimodal Training
🤗 2.0 Models · Datasets · Technical Report
🤗 1.5 Models · Datasets · Technical Report
- 2026-04-30: Released LLaVA-OneVision-2.0 — next-generation multimodal model, with new LLaVA-OneVision-2.0-VideoCaption and LLaVA-OneVision-2.0-Spatial datasets.
- 2026-02-10: Released OneVision-Encoder — codec-aligned vision encoders, with Technical Report.
- 2025-12-11: Released RL recipe for LLaVA-OneVision-1.5, with Project, Code, Data, and Model.
- 2025-09-30: Released the LLaVA-OneVision-1.5 Technical Report.
LLaVA-OneVision-2.0 is the next-generation release of the LLaVA-OneVision family — a fully open 8B multimodal model that unifies image, long-form video, and spatial understanding under a single architecture, with the entire pipeline (data, encoders, training, checkpoints, logs) released end-to-end.
Forget uniform patchification. OneVision-Encoder and OneVision-Encoder-Lang are HEVC-style vision transformers that treat video like a codec stream — selecting only motion- and residual-rich patches and sampling dense frames sparsely instead of sparse frames densely. The result is dramatically longer temporal coverage under the same token budget, where prior ViT backbones simply run out of context.
Most open multimodal models still live in a 2D, single-image world. LLaVA-OneVision-2.0-8B-Instruct breaks out of it — one model, native resolution, no task-specific adapters, no hidden tricks.
- Long video — multi-frame reasoning with efficient codec-aligned inference
- 3D-aware spatial reasoning — depth, layout, object relations
- Documents, OCR, charts — structured visual inputs at native resolution
New open-source SOTA across a broad suite of multimodal benchmarks.
Four datasets ship with the LLaVA-OneVision family — two new for 2.0, two carried forward from 1.5:
- LLaVA-OneVision-2.0-VideoCaption — extremely dense video captions
- LLaVA-OneVision-2.0-Spatial — 3D-aware spatial reasoning
- LLaVA-OneVision-1.5-Mid-Training-85M — 85M concept-balanced mid-training corpus
- LLaVA-OneVision-1.5-Instruct — full instruction-tuning mixture
And unlike most "open" releases, everything ships alongside them: encoder weights, training code, configs, and full training logs. Reproducible end to end.
Standard video pipelines uniformly sample a handful of frames and process every patch — most of it static background. We borrow from HEVC: keep I-frames dense, keep only motion- and residual-rich patches from P-frames. Same 54-token budget, 18 frames instead of 6 — 3× the temporal range, no extra LLM context, no input-type adapters.
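The I-frame / P-frame selection described above can be sketched as follows. This is a simplified illustration, not the repo's actual implementation: `select_codec_patches`, the patch size, and the keep ratio are all hypothetical names and values, and real HEVC residuals come from the codec rather than raw frame differences.

```python
import numpy as np

def select_codec_patches(frames, patch=14, keep_ratio=0.25):
    """Illustrative HEVC-style patch selection (hypothetical helper):
    keep every patch of the first (I-) frame, and for each later (P-)
    frame keep only the patches whose residual against the previous
    frame is largest. `frames` is a (T, H, W, C) uint8 array with
    H and W divisible by `patch`. Returns (frame_idx, patch_idx) pairs."""
    T, H, W, C = frames.shape
    gh, gw = H // patch, W // patch
    # Patchify to (T, gh*gw, patch*patch*C), patches in row-major order.
    patches = frames.reshape(T, gh, patch, gw, patch, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(T, gh * gw, -1)

    kept = [(0, i) for i in range(gh * gw)]          # dense I-frame
    k = max(1, int(keep_ratio * gh * gw))
    for t in range(1, T):
        # L1 residual of each patch against the previous frame.
        residual = np.abs(patches[t].astype(np.int16)
                          - patches[t - 1].astype(np.int16)).sum(axis=1)
        top = np.argsort(residual)[-k:]              # motion/residual-rich
        kept.extend((t, int(i)) for i in top)
    return kept
```

With `keep_ratio=0.25`, each P-frame contributes a quarter of a full frame's tokens, which is how a fixed token budget can cover roughly 3× more frames.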
Most multimodal stacks ship a different tokenizer per input type — one path for images, another for video, a third for multi-image. We don't. Image, uniform frames, and codec-aligned tokens all flow into the same OneVision-Encoder under a shared (t, h, w) position scheme. No task-specific tokenizers, no per-modality routing.
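A minimal sketch of such a shared (t, h, w) scheme, under the assumption that it simply enumerates a spatiotemporal grid; `thw_positions` is an illustrative name, not the repo's API:

```python
def thw_positions(num_frames, grid_h, grid_w):
    """Hypothetical shared (t, h, w) position scheme: every visual token,
    whether from a single image (num_frames=1), uniformly sampled frames,
    or codec-selected patches, receives the same kind of 3-tuple id."""
    return [(t, h, w)
            for t in range(num_frames)
            for h in range(grid_h)
            for w in range(grid_w)]

# An image is just the t == 0 slice of the video scheme, so no
# per-modality tokenizer or routing is required:
image_pos = thw_positions(1, 2, 2)   # [(0,0,0), (0,0,1), (0,1,0), (0,1,1)]
video_pos = thw_positions(3, 2, 2)   # 12 ids over the same (h, w) grid
```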
We train LLaVA-OneVision-2.0 in four compact stages:
- Bootstrap video ability from LLaVA-OneVision-1.5 with short 30s video captions.
- Instruction tune with large multimodal instruction data and 30–180s video captions.
- Extend to long videos with 10–15 min captions and public video instruction data.
- Refine codec, spatial, and tracking skills with denser long-video sampling, point tracking, and 4M spatial samples.
The curriculum mixes LLaVA-OneVision-1.5 data, FineVision, and new in-house video caption/spatial datasets; we do not synthesize any video instruction data.
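The four stages above can be summarized as a curriculum table; the stage names, mixture keys, and structure below are purely illustrative, not a real config from the repo:

```python
# Hypothetical summary of the four-stage curriculum; every field name
# and value here is illustrative only.
CURRICULUM = [
    {"stage": 1, "goal": "bootstrap video ability",
     "video_len": "<=30s", "data": ["LLaVA-OneVision-1.5", "short video captions"]},
    {"stage": 2, "goal": "instruction tuning",
     "video_len": "30-180s", "data": ["multimodal instructions", "video captions"]},
    {"stage": 3, "goal": "long-video extension",
     "video_len": "10-15min", "data": ["long-video captions", "public video instructions"]},
    {"stage": 4, "goal": "codec / spatial / tracking refinement",
     "video_len": "long, denser sampling", "data": ["point tracking", "4M spatial samples"]},
]
```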
| Model | HF Link | Training Log |
|---|---|---|
| LLaVA-OneVision-2.0-8B-Instruct | — | — |
| LLaVA-OneVision-2.0-4B-Instruct | — | — |
| LLaVA-OneVision-1.5-4B-Instruct | 🤗 HF / 4B-Instruct | 📈 TensorBoard |
| LLaVA-OneVision-1.5-8B-Instruct | 🤗 HF / 8B-Instruct | 📈 TensorBoard |
| OneVision-Encoder | 🤗 HF / OneVision-Encoder | — |
| OneVision-Encoder-Lang | 🤗 HF / OneVision-Encoder-Lang | — |
| Description | Link | Status |
|---|---|---|
| LLaVA-OneVision-2.0-VideoCaption | 🤗HF / VideoCaption | Available |
| LLaVA-OneVision-2.0-Spatial | 🤗HF / Spatial | Available |
| LLaVA-OneVision-1.5-Mid-Training-85M | 🤗HF / Mid-Training 85M | Available |
| LLaVA-OneVision-1.5-Instruct | 🤗HF / Instruct-Data | Available |
Thanks so much to all of our amazing contributors!
If you find LLaVA-OneVision useful in your research, please consider citing the following related papers:
@article{LLaVA-OneVision-2.0,
  title={LLaVA-OneVision-2.0},
  author={llava-onevision contributors},
  journal={arXiv preprint},
  year={2026}
}
@article{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  journal={arXiv preprint},
  year={2025}
}
@article{tang2026onevisionencoder,
title={OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
author={Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
journal={arXiv preprint arXiv:2602.08683},
year={2026}
}
@article{lillava,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}
We extend our sincere gratitude to the AIAK team of the Baige AI computing platform from Baidu AI Cloud for providing their training framework. AIAK-Training-LLM and AIAK-Megatron significantly accelerated our training and were instrumental in achieving our research goals. For full AIAK support, contact Baidu Cloud.
We acknowledge the support of Synvo AI for contributing part of the data annotation in this work, and we also thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:
- LLaVA: Large Language-and-Vision Assistant — LLaVA
- LLaVA-NeXT: Next-generation multi-modal assistant — LLaVA-NeXT
- lmms-eval: A standardized evaluation framework for Large Multimodal Models — lmms-eval
- Megatron-LM: Efficient, scalable training for large language models — Megatron-LM
- Qwen2.5-VL: Strong vision-language foundation model — Qwen2.5-VL
- InternVL: Open-source large-scale vision-language foundation model — InternVL
- Qwen3: Next-generation Qwen LLM — Qwen
- MetaCLIP: Scalable contrastive pretraining — MetaCLIP
- FineVision: Open Data Is All You Need — FineVision