
LLaVA-OneVision-2.0

Fully Open Framework for Democratized Multimodal Training

🤗 2.0 Models · Datasets · Technical Report

🤗 1.5 Models · Datasets · Technical Report



Introduction

LLaVA-OneVision-2.0 is the next-generation release of the LLaVA-OneVision family — a fully open 8B multimodal model that unifies image, long-form video, and spatial understanding under a single architecture, with the entire pipeline (data, encoders, training, checkpoints, logs) released end-to-end.

🎬 Codec-Aligned Vision Encoders

Forget uniform patchification. OneVision-Encoder and OneVision-Encoder-Lang are HEVC-style vision transformers that treat video like a codec stream: instead of processing a few uniformly sampled frames densely, they sample frames densely and keep only the motion- and residual-rich patches. The result is dramatically longer temporal coverage under the same token budget, in regimes where prior ViT backbones simply run out of context.

🧊 One Model, Every Modality

Most open multimodal models still live in a 2D, single-image world. LLaVA-OneVision-2.0-8B-Instruct breaks out of it — one model, native resolution, no task-specific adapters, no hidden tricks.

  • Long video — multi-frame reasoning with efficient codec-aligned inference
  • 3D-aware spatial reasoning — depth, layout, object relations
  • Documents, OCR, charts — structured visual inputs at native resolution

New open-source SOTA across a broad suite of multimodal benchmarks.

🚀 Fully Open, Reproducible from Day One

Four datasets ship with the LLaVA-OneVision family — two new for 2.0, two carried forward from 1.5:

  • LLaVA-OneVision-2.0-VideoCaption — extremely dense video captions
  • LLaVA-OneVision-2.0-Spatial — 3D-aware spatial reasoning
  • LLaVA-OneVision-1.5-Mid-Training-85M — 85M concept-balanced mid-training corpus
  • LLaVA-OneVision-1.5-Instruct — full instruction-tuning mixture

And unlike most "open" releases, everything ships alongside them: encoder weights, training code, configs, and full training logs. Reproducible end to end.

Evaluation Results

LLaVA-OneVision-2.0 Benchmark Comparison

Figure 4: codec-aligned sampling compared with uniform frame sampling across video benchmarks

Method

Codec-Style Patch Selection

Codec-Style Patch Selection: same 54-token budget, 3× more temporal range than uniform sampling

Standard video pipelines uniformly sample a handful of frames and process every patch — most of it static background. We borrow from HEVC: keep I-frames dense, keep only motion- and residual-rich patches from P-frames. Same 54-token budget, 18 frames instead of 6 — 3× the temporal range, no extra LLM context, no input-type adapters.
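A minimal sketch of the idea, assuming raw-pixel residuals: the patch size, GOP length, and per-frame top-k below are illustrative placeholders, not the actual LLaVA-OneVision-2.0 selector or budget, and the real pipeline's residual measure and GOP structure may differ.

```python
import torch

def codec_style_patch_selection(frames, patch=16, gop=6, p_frame_topk=8):
    """frames: (T, C, H, W) video clip.

    Returns a list of (t, h, w) patch indices: I-frames (the first frame of each
    GOP) keep every patch, P-frames keep only the `p_frame_topk` patches with the
    largest residual energy relative to the previous frame.
    """
    T, C, H, W = frames.shape
    gh, gw = H // patch, W // patch
    # Per-patch residual energy: mean squared difference to the previous frame.
    residual = (frames[1:] - frames[:-1]).pow(2)                        # (T-1, C, H, W)
    residual = residual.unfold(2, patch, patch).unfold(3, patch, patch)
    residual = residual.mean(dim=(1, 4, 5))                             # (T-1, gh, gw)

    kept = []
    for t in range(T):
        if t % gop == 0:                                                # I-frame: dense
            kept += [(t, h, w) for h in range(gh) for w in range(gw)]
        else:                                                           # P-frame: sparse
            energy = residual[t - 1].flatten()
            top = energy.topk(min(p_frame_topk, energy.numel())).indices
            kept += [(t, int(i) // gw, int(i) % gw) for i in top]
    return kept

# 18 frames at 64x64 with 16x16 patches: 3 dense I-frames + 15 sparse P-frames.
positions = codec_style_patch_selection(torch.randn(18, 3, 64, 64))
print(len(positions))
```

Because P-frames contribute only their most informative patches, many more frames fit under a fixed token budget than with uniform dense sampling.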

One Encoder, Every Modality

Multi-modal vision input: image, uniform frames, or codec-aligned tokens all feed the same OneVision-Encoder with shared (t, h, w) positions

Most multimodal stacks ship a different tokenizer per input type — one path for images, another for video, a third for multi-image. We don't. Image, uniform frames, and codec-aligned tokens all flow into the same OneVision-Encoder under a shared (t, h, w) position scheme. No task-specific tokenizers, no per-modality routing.
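A hedged illustration of the shared position scheme: one (t, h, w) table covers a still image (t = 0), uniformly sampled frames, and codec-selected patches, so a single encoder can consume all three. Grid sizes and the helper name are assumptions for the sketch, not the released implementation.

```python
import torch

def thw_positions(kind, *, grid=(14, 14), num_frames=1, kept_patches=None):
    """Return an (N, 3) tensor of (t, h, w) positions for one visual input."""
    gh, gw = grid
    if kind == "image":                        # single dense grid at t = 0
        frames = 1
    elif kind == "uniform_video":              # every sampled frame dense
        frames = num_frames
    elif kind == "codec_video":                # only the kept (t, h, w) patches
        return torch.tensor(kept_patches)
    else:
        raise ValueError(kind)
    t, h, w = torch.meshgrid(
        torch.arange(frames), torch.arange(gh), torch.arange(gw), indexing="ij"
    )
    return torch.stack([t, h, w], dim=-1).reshape(-1, 3)

# All three input types yield positions with the same layout, ready for one encoder.
img_pos = thw_positions("image")
vid_pos = thw_positions("uniform_video", num_frames=6)
codec_pos = thw_positions("codec_video", kept_patches=[(0, 3, 5), (7, 1, 2)])
```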

Four-Stage Training Curriculum

We train LLaVA-OneVision-2.0 in four compact stages:

  1. Bootstrap video ability from LLaVA-OneVision-1.5 with short 30s video captions.
  2. Instruction tune with large multimodal instruction data and 30–180s video captions.
  3. Extend to long videos with 10–15 min captions and public video instruction data.
  4. Refine codec, spatial, and tracking skills with denser long-video sampling, point tracking, and 4M spatial samples.

The curriculum mixes LLaVA-OneVision-1.5 data, FineVision, and new in-house video caption/spatial datasets; we do not synthesize any video instruction data.
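As a rough schematic only, the curriculum can be summarized as a staged data plan; the stage contents below paraphrase the list above, while the actual mixtures, proportions, and hyperparameters live in the released training configs.

```python
# Illustrative summary of the four-stage curriculum (not the real config files).
CURRICULUM = [
    {"stage": 1, "init_from": "LLaVA-OneVision-1.5",
     "data": ["short video captions (~30s clips)"]},
    {"stage": 2,
     "data": ["large multimodal instruction mixture (incl. FineVision)",
              "video captions (30-180s clips)"]},
    {"stage": 3,
     "data": ["long-video captions (10-15 min)",
              "public video instruction data"]},
    {"stage": 4,
     "data": ["denser long-video sampling", "point tracking",
              "~4M spatial samples"]},
]
```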

Models

| Model | HF Link | Training Log |
|---|---|---|
| LLaVA-OneVision-2.0-8B-Instruct | | |
| LLaVA-OneVision-2.0-4B-Instruct | | |
| LLaVA-OneVision-1.5-4B-Instruct | 🤗 HF / 4B-Instruct | 📈 TensorBoard |
| LLaVA-OneVision-1.5-8B-Instruct | 🤗 HF / 8B-Instruct | 📈 TensorBoard |
| OneVision-Encoder | 🤗 HF / OneVision-Encoder | |
| OneVision-Encoder-Lang | 🤗 HF / OneVision-Encoder-Lang | |
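
A minimal loading sketch, assuming the released checkpoints expose a transformers-compatible interface via trust_remote_code; the repo id, model class, and dtype/device options below are assumptions, so check the model card for the exact usage and chat template.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

repo = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"   # assumed HF repo id
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",     # pick the checkpoint's native dtype
    device_map="auto",      # requires `accelerate`
)
```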

Datasets

| Description | Link | Status |
|---|---|---|
| LLaVA-OneVision-2.0-VideoCaption | 🤗 HF / VideoCaption | Available |
| LLaVA-OneVision-2.0-Spatial | 🤗 HF / Spatial | Available |
| LLaVA-OneVision-1.5-Mid-Training-85M | 🤗 HF / Mid-Training 85M | Available |
| LLaVA-OneVision-1.5-Instruct | 🤗 HF / Instruct-Data | Available |
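
A hedged example of pulling one of the released datasets with the `datasets` library; the repo id below is an assumption, so substitute the exact id linked in the table above.

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus just to inspect a sample.
ds = load_dataset("lmms-lab/LLaVA-OneVision-1.5-Instruct-Data",
                  split="train", streaming=True)
print(next(iter(ds)).keys())
```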

LLaVA-OneVision Data Distribution

Contributors

Thanks so much to all of our amazing contributors!

anxiangsir · yiyexy · fdcp · wideyard · Lornatang · chengzheng345 · Luodian · kcz358 · killTheHostage · mathCrazyy · wkzhang636 · yunglechao · RobitYadda · fengshikun · GeoffreyChen777 · didizhu-judy · yshenaw · Yangsenqiao · YunyaoYan · FeilongTangmonash · Jinghao-Guo

Citation

If you find LLaVA-OneVision useful in your research, please consider citing the following related papers:

@inproceedings{LLaVA-OneVision-2.0,
  title={LLaVA-OneVision-2.0},
  author={llava-onevision contributors},
  booktitle={arXiv},
  year={2026}
}

@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arXiv},
  year={2025}
 }

@article{tang2026onevisionencoder,
  title={OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
  author={Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
  journal={arXiv preprint arXiv:2602.08683},
  year={2026}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}

Acknowledgement

We extend our sincere gratitude to the AIAK team of the Baige AI computing platform from Baidu AI Cloud for providing an exceptional training framework. The capabilities of AIAK-Training-LLM and AIAK-Megatron significantly accelerated our training, and these frameworks have been instrumental in achieving our research goals. To get full AIAK support, you can contact Baidu Cloud.

We acknowledge Synvo AI for contributing part of the data annotation in this work, and we also thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:

  • LLaVA: Large Language-and-Vision Assistant — LLaVA
  • LLaVA-NeXT: Next-generation multi-modal assistant — LLaVA-NeXT
  • lmms-eval: A standardized evaluation framework for Large Multimodal Models — lmms-eval
  • Megatron-LM: Efficient, scalable training for large language models — Megatron-LM
  • Qwen2.5-VL: Strong vision-language foundation model — Qwen2.5-VL
  • InternVL: Open-source large-scale vision-language foundation model — InternVL
  • Qwen3: Next-generation Qwen LLM — Qwen
  • MetaCLIP: Scalable contrastive pretraining — MetaCLIP
  • FineVision: Open Data Is All You Need — FineVision