
LLaVA-OneVision-2.0

Fully Open Framework for Democratized Multimodal Training

🤗 2.0 Models · Datasets · Technical Report

🤗 1.5 Models · Datasets · Technical Report



Introduction

LLaVA-OneVision-2.0 is the next-generation release of the LLaVA-OneVision family — a fully open 8B multimodal model that unifies image, long-form video, and spatial understanding under a single architecture, with the entire pipeline (data, encoders, training, checkpoints, logs) released end-to-end.

🎬 Codec-Aligned Vision Encoders

Forget uniform patchification. OneVision-Encoder and OneVision-Encoder-Lang are HEVC-style vision transformers that treat video like a codec stream: instead of processing a few uniformly sampled frames densely, they sample frames densely and keep only the motion- and residual-rich patches. The result is dramatically longer temporal coverage under the same token budget, in regimes where prior ViT backbones simply run out of context.

🧊 One Model, Every Modality

Most open multimodal models still live in a 2D, single-image world. LLaVA-OneVision-2.0-8B-Instruct breaks out of it — one model, native resolution, no task-specific adapters, no hidden tricks.

  • Long video — multi-frame reasoning with efficient codec-aligned inference
  • 3D-aware spatial reasoning — depth, layout, object relations
  • Documents, OCR, charts — structured visual inputs at native resolution

New open-source SOTA across a broad suite of multimodal benchmarks.

🚀 Fully Open, Reproducible from Day One

Four datasets ship with the LLaVA-OneVision family — two new for 2.0, two carried forward from 1.5:

  • LLaVA-OneVision-2.0-VideoCaption — extremely dense video captions
  • LLaVA-OneVision-2.0-Spatial — 3D-aware spatial reasoning
  • LLaVA-OneVision-1.5-Mid-Training-85M — 85M concept-balanced mid-training corpus
  • LLaVA-OneVision-1.5-Instruct — full instruction-tuning mixture

And unlike most "open" releases, everything ships alongside them: encoder weights, training code, configs, and full training logs. Reproducible end to end.

Evaluation Results

LLaVA-OneVision-2.0 Benchmark Comparison

Figure 4: codec-aligned sampling compared with uniform frame sampling across video benchmarks

Method

Codec-Style Patch Selection

Codec-Style Patch Selection: same 54-token budget, 3× more temporal range than uniform sampling

Standard video pipelines uniformly sample a handful of frames and process every patch — most of it static background. We borrow from HEVC: keep I-frames dense, keep only motion- and residual-rich patches from P-frames. Same 54-token budget, 18 frames instead of 6 — 3× the temporal range, no extra LLM context, no input-type adapters.
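A minimal sketch of the idea, assuming raw-pixel residuals: the patch size, GOP length, and per-frame top-k below are illustrative placeholders, not the actual LLaVA-OneVision-2.0 selector or budget, and the real pipeline's residual measure and GOP structure may differ.

```python
import torch

def codec_style_patch_selection(frames, patch=16, gop=6, p_frame_topk=8):
    """frames: (T, C, H, W) video clip.

    Returns a list of (t, h, w) patch indices: I-frames (the first frame of each
    GOP) keep every patch, P-frames keep only the `p_frame_topk` patches with the
    largest residual energy relative to the previous frame.
    """
    T, C, H, W = frames.shape
    gh, gw = H // patch, W // patch
    # Per-patch residual energy: mean squared difference to the previous frame.
    residual = (frames[1:] - frames[:-1]).pow(2)                        # (T-1, C, H, W)
    residual = residual.unfold(2, patch, patch).unfold(3, patch, patch)
    residual = residual.mean(dim=(1, 4, 5))                             # (T-1, gh, gw)

    kept = []
    for t in range(T):
        if t % gop == 0:                                                # I-frame: dense
            kept += [(t, h, w) for h in range(gh) for w in range(gw)]
        else:                                                           # P-frame: sparse
            energy = residual[t - 1].flatten()
            top = energy.topk(min(p_frame_topk, energy.numel())).indices
            kept += [(t, int(i) // gw, int(i) % gw) for i in top]
    return kept

# 18 frames at 64x64 with 16x16 patches: 3 dense I-frames + 15 sparse P-frames.
positions = codec_style_patch_selection(torch.randn(18, 3, 64, 64))
print(len(positions))
```

Because P-frames contribute only their most informative patches, many more frames fit under a fixed token budget than with uniform dense sampling.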

One Encoder, Every Modality

Multi-modal vision input: image, uniform frames, or codec-aligned tokens all feed the same OneVision-Encoder with shared (t, h, w) positions

Most multimodal stacks ship a different tokenizer per input type — one path for images, another for video, a third for multi-image. We don't. Image, uniform frames, and codec-aligned tokens all flow into the same OneVision-Encoder under a shared (t, h, w) position scheme. No task-specific tokenizers, no per-modality routing.
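A hedged illustration of the shared position scheme: one (t, h, w) table covers a still image (t = 0), uniformly sampled frames, and codec-selected patches, so a single encoder can consume all three. Grid sizes and the helper name are assumptions for the sketch, not the released implementation.

```python
import torch

def thw_positions(kind, *, grid=(14, 14), num_frames=1, kept_patches=None):
    """Return an (N, 3) tensor of (t, h, w) positions for one visual input."""
    gh, gw = grid
    if kind == "image":                        # single dense grid at t = 0
        frames = 1
    elif kind == "uniform_video":              # every sampled frame dense
        frames = num_frames
    elif kind == "codec_video":                # only the kept (t, h, w) patches
        return torch.tensor(kept_patches)
    else:
        raise ValueError(kind)
    t, h, w = torch.meshgrid(
        torch.arange(frames), torch.arange(gh), torch.arange(gw), indexing="ij"
    )
    return torch.stack([t, h, w], dim=-1).reshape(-1, 3)

# All three input types yield positions with the same layout, ready for one encoder.
img_pos = thw_positions("image")
vid_pos = thw_positions("uniform_video", num_frames=6)
codec_pos = thw_positions("codec_video", kept_patches=[(0, 3, 5), (7, 1, 2)])
```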

Four-Stage Training Curriculum

We train LLaVA-OneVision-2.0 in four compact stages:

  1. Bootstrap video ability from LLaVA-OneVision-1.5 with short 30s video captions.
  2. Instruction tune with large multimodal instruction data and 30–180s video captions.
  3. Extend to long videos with 10–15 min captions and public video instruction data.
  4. Refine codec, spatial, and tracking skills with denser long-video sampling, point tracking, and 4M spatial samples.

The curriculum mixes LLaVA-OneVision-1.5 data, FineVision, and new in-house video caption/spatial datasets; we do not synthesize any video instruction data.
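As a rough schematic only, the curriculum can be summarized as a staged data plan; the stage contents below paraphrase the list above, while the actual mixtures, proportions, and hyperparameters live in the released training configs.

```python
# Illustrative summary of the four-stage curriculum (not the real config files).
CURRICULUM = [
    {"stage": 1, "init_from": "LLaVA-OneVision-1.5",
     "data": ["short video captions (~30s clips)"]},
    {"stage": 2,
     "data": ["large multimodal instruction mixture (incl. FineVision)",
              "video captions (30-180s clips)"]},
    {"stage": 3,
     "data": ["long-video captions (10-15 min)",
              "public video instruction data"]},
    {"stage": 4,
     "data": ["denser long-video sampling", "point tracking",
              "~4M spatial samples"]},
]
```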

Models

| Model | HF Link | Training Log |
|---|---|---|
| LLaVA-OneVision-2.0-8B-Instruct | | |
| LLaVA-OneVision-2.0-4B-Instruct | | |
| LLaVA-OneVision-1.5-4B-Instruct | 🤗 HF / 4B-Instruct | 📈 TensorBoard |
| LLaVA-OneVision-1.5-8B-Instruct | 🤗 HF / 8B-Instruct | 📈 TensorBoard |
| OneVision-Encoder | 🤗 HF / OneVision-Encoder | |
| OneVision-Encoder-Lang | 🤗 HF / OneVision-Encoder-Lang | |
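
A minimal loading sketch, assuming the released checkpoints expose a transformers-compatible interface via trust_remote_code; the repo id, model class, and dtype/device options below are assumptions, so check the model card for the exact usage and chat template.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

repo = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"   # assumed HF repo id
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",     # pick the checkpoint's native dtype
    device_map="auto",      # requires `accelerate`
)
```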

Datasets

| Description | Link | Status |
|---|---|---|
| LLaVA-OneVision-2.0-VideoCaption | 🤗 HF / VideoCaption | Available |
| LLaVA-OneVision-2.0-Spatial | 🤗 HF / Spatial | Available |
| LLaVA-OneVision-1.5-Mid-Training-85M | 🤗 HF / Mid-Training 85M | Available |
| LLaVA-OneVision-1.5-Instruct | 🤗 HF / Instruct-Data | Available |
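
A hedged example of pulling one of the released datasets with the `datasets` library; the repo id below is an assumption, so substitute the exact id linked in the table above.

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus just to inspect a sample.
ds = load_dataset("lmms-lab/LLaVA-OneVision-1.5-Instruct-Data",
                  split="train", streaming=True)
print(next(iter(ds)).keys())
```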

LLaVA-OneVision Data Distribution

Contributors

Thanks so much to all of our amazing contributors!

anxiangsir · yiyexy · fdcp · wideyard · Lornatang · chengzheng345 · Luodian · kcz358 · killTheHostage · mathCrazyy · wkzhang636 · yunglechao · RobitYadda · fengshikun · GeoffreyChen777 · didizhu-judy · yshenaw · Yangsenqiao · YunyaoYan · FeilongTangmonash · Jinghao-Guo

Citation

If you find LLaVA-OneVision useful in your research, please consider citing the following related papers:

@inproceedings{LLaVA-OneVision-2.0,
  title={LLaVA-OneVision-2.0},
  author={llava-onevision contributors},
  booktitle={arXiv},
  year={2026}
}

@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arXiv},
  year={2025}
 }

@article{tang2026onevisionencoder,
  title={OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
  author={Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
  journal={arXiv preprint arXiv:2602.08683},
  year={2026}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}

Acknowledgement

We extend our sincere gratitude to the AIAK team of the Baige AI computing platform from Baidu AI Cloud for providing an exceptional training framework. The capabilities of AIAK-Training-LLM and AIAK-Megatron significantly accelerated our training, and these frameworks have been instrumental in achieving our research goals. To get full AIAK support, you can contact Baidu Cloud.

We acknowledge Synvo AI for contributing part of the data annotation in this work, and we also thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:

  • LLaVA: Large Language-and-Vision Assistant — LLaVA
  • LLaVA-NeXT: Next-generation multi-modal assistant — LLaVA-NeXT
  • lmms-eval: A standardized evaluation framework for Large Multimodal Models — lmms-eval
  • Megatron-LM: Efficient, scalable training for large language models — Megatron-LM
  • Qwen2.5-VL: Strong vision-language foundation model — Qwen2.5-VL
  • InternVL: Open-source large-scale vision-language foundation model — InternVL
  • Qwen3: Next-generation Qwen LLM — Qwen
  • MetaCLIP: Scalable contrastive pretraining — MetaCLIP
  • FineVision: Open Data Is All You Need — FineVision