This repo contains the code for VGGT-SLAM 2.0 (located here) and VGGT-SLAM 1.0 (located on the version1.0 branch of this repo).
Clone VGGT-SLAM:
```bash
git clone https://github.com/MIT-SPARK/VGGT-SLAM
cd VGGT-SLAM
conda create -n vggt-slam python=3.11
conda activate vggt-slam
```
This step will automatically download all third-party packages including Perception Encoder, SAM 3, and our fork of VGGT. More details on the license for Perception Encoder can be found here, for SAM 3 can be found here, and for VGGT can be found here. Note that we only use SAM 3 and Perception Encoder for optional open-set 3D object detection.
```bash
chmod +x setup.sh
./setup.sh
```
Run the following command, replacing the image path with your folder of images:
```bash
python main.py --image_folder /path/to/image/folder --max_loops 1 --vis_map
```
This will create a visualization in viser which shows the incremental construction of the map.
As an example, we provide a folder of test images in office_loop.zip, which will generate the following map. Using the default parameters will
result in a single loop closure towards the end of the trajectory. Unzip the folder and set its path as the argument for --image_folder, e.g.,
```bash
unzip office_loop.zip
```
and then run the command below:
```bash
python3 main.py --image_folder office_loop --max_loops 1 --vis_map
```
Use the --run_os flag to enable 3D open-set object detection. This will prompt the user for text queries and plot a 3D bounding box of each detection on the map in viser. The office loop scene does not have many interesting objects, but some example queries are "coffee machine", "sink", "printer", "cone", and "refrigerator". For example scenes with more interesting objects, check out the Clio apartment and cubicle scenes, which can be downloaded from here.
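Conceptually, the plotted 3D box is just the spatial extent of the map points matched to the text query. A minimal sketch of that idea with an axis-aligned box (`detection_box` is a hypothetical helper for illustration, not the repo's implementation):

```python
import numpy as np

def detection_box(points: np.ndarray, mask: np.ndarray):
    """Axis-aligned 3D bounding box (min corner, max corner) of the
    map points assigned to one open-set detection.

    points: (n, 3) array of map points.
    mask:   (n,) boolean array marking points matched to the query.
    """
    selected = points[mask]  # (k, 3) points matched to the text query
    return selected.min(axis=0), selected.max(axis=0)
```

In viser, such a box could then be drawn from its center `(lo + hi) / 2` and extents `hi - lo`.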
To quickly test on a custom dataset, you can record a trajectory with a cell phone and convert the MOV file to a folder of images with:
```bash
mkdir <desired_location>/img_folder
```
And then, run the command below:
```bash
ffmpeg -i /path/to/video.MOV -vf "fps=10" <desired_location>/img_folder/frame_%04d.jpg
```
Note that while vertical cell phone videos can work, horizontal videos are recommended to avoid images being cropped.
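For a constant-frame-rate clip, the fps=10 filter amounts to keeping roughly every (src_fps / 10)-th frame. A pure-Python sketch of that sampling (`frames_to_keep` is a hypothetical helper; ffmpeg itself works on timestamps):

```python
def frames_to_keep(n_frames: int, src_fps: float, dst_fps: float) -> list[int]:
    """Source-frame indices retained when resampling a constant-frame-rate
    clip from src_fps to dst_fps (nearest-frame sampling)."""
    duration = n_frames / src_fps
    n_out = int(duration * dst_fps)  # number of output frames
    step = src_fps / dst_fps         # source frames per output frame
    return [round(i * step) for i in range(n_out)]
```

For a one-second 30 fps clip, `frames_to_keep(30, 30, 10)` keeps indices 0, 3, 6, ..., 27.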
See main.py or run python main.py --help to view all parameters.
For visualizing larger datasets, displaying all 3D points in Viser can slow down or crash the visualizer. One way to mitigate this is to sparsify the point cloud sent to Viser, which can be done with --vis_voxel_size 0.005. Increasing this value decreases the number of displayed points. Note that this does not affect the number of points stored or used internally by VGGT-SLAM.
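The idea behind voxel sparsification is to keep one representative point per occupied voxel. A minimal numpy illustration of that idea, not VGGT-SLAM's actual implementation:

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Keep one representative point per occupied voxel.

    points: (n, 3) array; voxel_size: edge length in the map's units.
    """
    # Map each point to an integer voxel coordinate.
    voxels = np.floor(points / voxel_size).astype(np.int64)
    # Keep the first point that falls into each distinct voxel.
    _, keep = np.unique(voxels, axis=0, return_index=True)
    return points[np.sort(keep)]
```

A larger `voxel_size` merges more points into each voxel, so fewer points survive, which mirrors the effect of increasing --vis_voxel_size.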
To automatically run evaluation on TUM and 7-Scenes datasets, first install the datasets using the provided download instructions from MASt3R-SLAM. Set the dataset download location by setting abs_dir in the bash scripts evals/eval_tum.sh and evals/eval_7scenes.sh.
To run on TUM, run ./evals/eval_tum.sh <w> and then run python evals/process_logs_tum.py --submap_size <w> to analyze and print the results, where w is the submap size, for example:
```bash
./evals/eval_tum.sh 32
python evals/process_logs_tum.py --submap_size 32
```
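A standard accuracy metric for this kind of evaluation is absolute trajectory error (ATE) RMSE after aligning the estimated trajectory to ground truth with a similarity transform, since monocular reconstruction is scale-ambiguous. A minimal sketch using Umeyama alignment (for illustration; not necessarily the exact procedure in the eval scripts):

```python
import numpy as np

def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """ATE RMSE between (n, 3) estimated and ground-truth positions,
    after optimal Sim(3) (scale + rotation + translation) alignment."""
    # Center both trajectories.
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance (Umeyama).
    U, S, Vt = np.linalg.svd(G.T @ E)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    # Optimal scale, since monocular SLAM is scale-ambiguous.
    s = np.trace(np.diag(S) @ D) / np.sum(E ** 2)
    aligned = s * (R @ E.T).T + mu_g
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

An estimate that differs from ground truth only by a rigid motion and a global scale should score (near) zero under this metric.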
To visualize the maps as they are being constructed, add --vis_map inside the bash scripts. This will update the viser map each time a submap is updated.
- May 2025: VGGT-SLAM 1.0 is released
- August 2025: SL(4) optimization is integrated into the official GTSAM repo
- September 2025: VGGT-SLAM 1.0 accepted to NeurIPS 2025
- November 2025: VGGT-SLAM 1.0 Featured in MIT News article
- January 2026: VGGT-SLAM 2.0 is released
- Release real-time code. This code enables plugging in a RealSense camera and incrementally constructing a map as the camera explores a scene. This has been tested on a Jetson Thor onboard a robot.
- Add optional code to sparsify the visualized map, as visualizing large point cloud maps can slow down the code.
This work was supported in part by the NSF Graduate Research Fellowship Program under Grant 2141064, the ARL DCIST program, and the ONR RAPID program.
If our code is helpful, please cite our papers as follows:
```bibtex
@article{maggio2025vggt-slam,
  title={VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold},
  author={Maggio, Dominic and Lim, Hyungtae and Carlone, Luca},
  journal={Advances in Neural Information Processing Systems},
  volume={39},
  year={2025}
}

@article{maggio2025vggt-slam2,
  title={VGGT-SLAM 2.0: Real-time Dense Feed-forward Scene Reconstruction},
  author={Maggio, Dominic and Carlone, Luca},
  journal={arXiv preprint arXiv:2601.19887},
  year={2026}
}
```

