From Pixels to Semantics: A Novel MLLM-Driven Approach for Explainable Tampered Text Detection

💡 Introduction

We propose TVSIP, a novel explainable framework that combines low-level visual artifact detection with high-level semantic analysis for tampered text verification. It aims to leverage MLLMs to enhance the pixel-level localization ability of expert models while providing detailed and reliable tampering analysis, including image description, tampered text detection, localization, and explanation.
We present TextDDLE, a meticulously curated benchmark that facilitates both training and evaluation of tampered text analysis capabilities. Created through a systematic pipeline utilizing GPT-4o with expert verification, TextDDLE supports the four fundamental tasks of tampering analysis.
Extensive experiments demonstrate that semantic clues notably improve model performance and robustness in the TTD task. TVSIP offers strong robustness to image degradation and excellent generalization to unseen scenarios.

📥 Download

Dataset

Because TextDDLE-PT is very large, it is hosted separately from the other subsets.

Dataset	Link
TextDDLE-PT	BaiduYun:p8q6
TextDDLE w/o PT	BaiduYun:5yre

Note:

The TextDDLE dataset is available for non-commercial research purposes only. Scholars or organizations interested in using the dataset may submit an application through our online platform:

🔗 SCUT DLVC Lab Dataset Access Portal → Apply for TextDDLE

We will give you the decompression password after your application has been received and approved.
The original data was collected from public sources such as the Internet, and its copyright remains with the original providers. The collated and annotated dataset is for non-commercial use only and is currently licensed only to universities and research institutions. To apply, fill in the application form as required on the dataset’s official website. The applicant must be a full-time employee of a university or research institute and must sign the application form. To ease review, we recommend affixing an official seal (a seal of a secondary-level department is acceptable).
All users must follow all use conditions; otherwise, the authorization will be revoked.

Model Zoo

Model	Checkpoint
Locator	BaiduYun:4ake
Pretrained Interpreter	BaiduYun:ibv9
Fine-tuned Interpreter	BaiduYun:avw5

Inference Results of TVSIP

You can download all inference results of TVSIP from BaiduYun:j3jb.

⚒️ Environment

git clone https://github.com/SCUT-DLVCLab/TVSIP.git
cd TVSIP
conda create --name tvsip --file requirements.txt
conda activate tvsip

🔥 Training

Data preparation

Download the TextDDLE dataset into the datasets folder.
Move JSON files in TextDDLE to the data folder.
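
The second data-preparation step can be scripted. The following is a minimal sketch, not part of the repository: the helper name move_json_files is hypothetical, and it assumes the dataset was unpacked to datasets/TextDDLE and that the JSON files should sit flat (without subfolders) in data. Check the repository's actual layout before relying on it.

```python
import shutil
from pathlib import Path


def move_json_files(src_dir, dst_dir):
    """Move every *.json file found under src_dir into dst_dir (flat).

    Hypothetical helper: the README only says to move the JSON files,
    so the recursive search and the flat target layout are assumptions.
    Returns the destination paths, sorted.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    moved = []
    # Collect the listing first so the tree is not modified while walking it.
    for json_path in sorted(src.rglob("*.json")):
        target = dst / json_path.name
        if target.exists():
            # A flat move can collide on file names; never overwrite silently.
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(json_path), str(target))
        moved.append(target)
    return sorted(moved)
```

From the repository root, this would be called as move_json_files("datasets/TextDDLE", "data"). Images and other non-JSON files are left where they are.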

For Locator:

bash tools/train_locator.sh

For Interpreter:

bash tools/train_interpreter_stage1.sh

You can also skip the pretraining step (stage 1) and fine-tune directly (stage 2):

bash tools/train_interpreter_stage2.sh

Note: Since visual expert models (i.e., the low-level vision clue branch of the Locator) are not the focus of this work, we directly use the predictions of a trained SegFormer. You can download the inference results of the expert models from BaiduYun:j3jb.

🚀 Inference

For the high-level semantic clue branch of Locator:

bash tools/infer_locator.sh

For Interpreter:

bash tools/infer_interpreter.sh

📅 Fusion and Evaluation

For Locator:

bash tools/evaluation_for_locator.sh

You can also obtain the Locator's final fusion results from this step.
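
This README does not state which metrics the Locator evaluation reports. Pixel-level IoU, precision, recall, and F1 are commonly used for tampering localization, so the sketch below shows how they are computed for one predicted mask against its ground truth. It is an illustration under that assumption, not the repository's evaluation code; the function name pixel_metrics and the nested-list mask format are hypothetical.

```python
def pixel_metrics(pred_mask, gt_mask):
    """Return (iou, precision, recall, f1) for two same-sized binary masks.

    Masks are nested lists (rows of 0/1 values); 1 marks a tampered pixel.
    Illustrative only: the metrics used by the repository are an assumption.
    """
    if len(pred_mask) != len(gt_mask):
        raise ValueError("masks must have the same height")

    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        if len(pred_row) != len(gt_row):
            raise ValueError("masks must have the same width")
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1  # tampered pixel correctly predicted
            elif p:
                fp += 1  # predicted tampered, actually authentic
            elif g:
                fn += 1  # tampered pixel that was missed

    def safe_div(num, den):
        # Define the ratio as 0.0 when the denominator is empty.
        return num / den if den else 0.0

    iou = safe_div(tp, tp + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return iou, precision, recall, f1
```

For real images, the same counts are usually computed with array operations rather than Python loops; the loop form is kept here only to make each term explicit.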

For Interpreter:

bash tools/evaluation_for_interpreter.sh
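
The scripts from the Training, Inference, and Fusion and Evaluation sections can be chained in the order they are documented. The driver below is a minimal sketch rather than something shipped with the repository: build_pipeline and run_pipeline are hypothetical names, and only the script paths come from this README. It defaults to a dry run that prints the commands without executing them.

```python
import subprocess

# Script paths as listed in this README, in documented order.
TRAIN_STEPS = [
    "tools/train_locator.sh",
    "tools/train_interpreter_stage1.sh",  # pretraining, may be skipped
    "tools/train_interpreter_stage2.sh",  # fine-tuning
]
INFER_STEPS = ["tools/infer_locator.sh", "tools/infer_interpreter.sh"]
EVAL_STEPS = [
    "tools/evaluation_for_locator.sh",
    "tools/evaluation_for_interpreter.sh",
]


def build_pipeline(skip_pretraining=False, train=True):
    """Return the commands to run, each as an argument list for subprocess."""
    steps = []
    if train:
        for script in TRAIN_STEPS:
            if skip_pretraining and script.endswith("stage1.sh"):
                continue
            steps.append(script)
    steps += INFER_STEPS + EVAL_STEPS
    return [["bash", script] for script in steps]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, check=True)
```

For example, run_pipeline(build_pipeline(train=False), dry_run=False) would run inference and evaluation only, which fits the case where the checkpoints from the Model Zoo are used instead of training.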

📫 Contact

If you have any questions, feel free to contact me at eegtxu@mail.scut.edu.cn.

💙 Acknowledgement

📜 License

The code and dataset are released under the CC BY-NC-ND 4.0 license and may be used and distributed for non-commercial research purposes only.

⛔️ Copyright

This repository can only be used for non-commercial research purposes.
For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
Copyright 2025, Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology.

✒️ Citation

If you find this paper helpful, please consider giving this repo a ⭐ and citing:

@inproceedings{xu2025pixels,
  title={From Pixels to Semantics: A Novel MLLM-Driven Approach for Explainable Tampered Text Detection},
  author={Xu, Guitao and Yi, Ziqi and Zhang, Peirong and Cao, Jiahuan and Wu, Shihang and Jin, Lianwen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={757--766},
  year={2025}
}
