# VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

> #### Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, and Xiaojie Jin
>
> Correspondence, Project Lead
>
> Beijing Jiaotong University, University of Science and Technology of China, ByteDance Seed
[**Paper**](https://arxiv.org/abs/2501.09781) | **Project Page** | [**Installation**](#installation) | [**Training**](#training) | [**Inference**](#inference) | **Video-GoBench**

## :fire: News

* **[2025.1]** We release the code and dataset.

# Highlight

👉 We explore, for the first time, whether video generation models can learn knowledge, and observe two key findings: i) merely observing videos suffices to learn complex tasks, and ii) compact representations of visual changes greatly enhance knowledge learning.

👉 We propose VideoWorld, which leverages a latent dynamics model to represent multi-step visual changes, boosting both the efficiency and effectiveness of knowledge acquisition.

👉 We construct Video-GoBench, a large-scale video-based Go dataset for training and evaluation, facilitating future research on knowledge learning from pure videos.

# Introduction

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models such as large language models (LLMs). We develop *VideoWorld*, an autoregressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities on video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning, and planning capabilities, and (2) the representation of visual changes is crucial for knowledge learning. To improve both the efficiency and efficacy of knowledge learning, we introduce the latent dynamics model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level on Video-GoBench with just a 300-million-parameter model, without relying on the search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models on CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models to be open-sourced for further research.

# Video

[![VideoWorld Demo](./figs/ytb_new.png)](https://www.youtube.com/watch?v=y_TT4dtIPXA "VideoWorld Demo")

# Architecture

Overview of the proposed VideoWorld model architecture. (Left) Overall architecture. (Right) The proposed latent dynamics model (LDM). First, the LDM compresses the visual changes from each frame to its subsequent H frames into compact and informative latent codes. Then, an autoregressive transformer seamlessly integrates the output of the LDM with the next-token prediction paradigm.
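To make that data flow concrete, here is a minimal PyTorch sketch of the idea, not the repository's implementation: the class name, feature dimensions, and plain MLP encoder are illustrative assumptions, and the real LDM operates on images and (judging by the released checkpoint names) quantizes its codes so the transformer can predict them as tokens, whereas this sketch keeps everything continuous for brevity.

```python
import torch
import torch.nn as nn

class ToyLatentDynamicsModel(nn.Module):
    """Illustrative sketch: for each frame, compress the visual change to
    each of its next H frames into one compact latent code per offset."""

    def __init__(self, frame_dim=256, code_dim=32, horizon=3):
        super().__init__()
        self.horizon = horizon
        # Shared encoder over (current frame, future frame) feature pairs.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 512),
            nn.GELU(),
            nn.Linear(512, code_dim),
        )

    def forward(self, frames):
        # frames: (B, T, frame_dim) per-frame features of an unlabeled clip
        B, T, _ = frames.shape
        codes = []
        for h in range(1, self.horizon + 1):
            # Pair frame t with frame t+h and encode the change between them.
            pair = torch.cat(
                [frames[:, : T - self.horizon],
                 frames[:, h : T - self.horizon + h]],
                dim=-1,
            )
            codes.append(self.encoder(pair))  # (B, T-H, code_dim)
        # One code per future offset: (B, T-H, H, code_dim). The autoregressive
        # transformer is trained to predict these codes alongside frame tokens.
        return torch.stack(codes, dim=2)

ldm = ToyLatentDynamicsModel()
clip = torch.randn(2, 8, 256)   # two 8-frame feature sequences
print(ldm(clip).shape)          # torch.Size([2, 5, 3, 32])
```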
# Installation

### Setup Environment

```
conda create -n videoworld python=3.10 -y
conda activate videoworld
pip install --upgrade pip
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
```

### Install VideoWorld

```
git clone https://github.com/bytedance/VideoWorld.git
cd VideoWorld
bash install.sh
```

# Inference

### Go Battle

VideoWorld relies on the KataGo Go engine. We provide scripts to facilitate battles against our model; install KataGo to engage in these matches:

```
cd VideoWorld  # this VideoWorld is a subdirectory of the repository root
bash install_katago.sh
```

Alternatively, follow the official installation instructions: https://github.com/lightvector/KataGo

We provide a version of the weights for playing against humans at https://huggingface.co/maverickrzw/VideoWorld-GoBattle.
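If you prefer to fetch the checkpoint programmatically, a sketch along the following lines should work with the `huggingface_hub` package; the file name below is an assumption taken from the path the battle script expects, so verify it on the model page.

```python
# Hypothetical helper: download the Go-battle checkpoint with huggingface_hub
# (pip install huggingface_hub). The file name is an assumption based on the
# path expected by battle_vs_human.sh -- verify it on the model page.
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="maverickrzw/VideoWorld-GoBattle",
    filename="go_battle.pth",              # assumed file name
    local_dir="./VideoWorld/work_dirs",
)
print("checkpoint saved to", ckpt)
```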
Use the script to start a match:

```
# Please place the weights at: ./VideoWorld/work_dirs/go_battle.pth
cd VideoWorld  # this VideoWorld is a subdirectory of the repository root
bash ./tools/battle_vs_human.sh
```

### Robotics

Download the CALVIN dataset following the official instructions and organize it as follows:

```
├── VideoWorld
│   ├── VideoWorld
│   │   └── data
│   │       └── calvin
```

Testing requires the CALVIN environment configuration. We have automated the installation of CALVIN in the install.sh script. If any issues arise, please refer to the official installation instructions: https://github.com/mees/calvin

```
cd VideoWorld  # this VideoWorld is a subdirectory of the repository root
# Since we only tested the tasks of opening drawers, pushing
# blocks, and switching lights, the original CALVIN test task
# construction script needs to be replaced.
rm ../calvin/calvin_models/calvin_agent/evaluation/multistep_sequences.py
cp ./tools/calvin_utils/multistep_sequences.py ../calvin/calvin_models/calvin_agent/evaluation/
bash ./tools/calvin_test.sh
```

# Training

Our training consists of two stages: LDM training and autoregressive transformer training. We use the CALVIN robotic environment as an example to demonstrate how to initiate training.

### Stage 1: LDM Training

Download the CALVIN dataset following the official instructions and organize it as follows:

```
├── VideoWorld
│   ├── LDM
│   │   ├── data
│   │   │   └── calvin
│   │   └── work_dirs
│   │       └── magvit_init.pth
│   └── VideoWorld
│       └── work_dirs
│           ├── calvin_fsq.pth
│           └── Intern_300m
```

To initiate LDM training, use the script ./LDM/tools/calvin_ldm_train.sh. Training requires the [MAGVIT weights](https://huggingface.co/maverickrzw/VideoWorld-GoBattle/tree/main) we pre-trained on natural-image reconstruction as initialization; please place them at the path specified above. Once training completes, the latent codes for the training set are automatically saved to ./LDM/work_dirs/calvin_ldm_results.pth, and a UMAP visualization of these codes is also generated.

```
cd LDM
bash ./tools/calvin_ldm_train.sh
```

### Stage 2: Next Token Prediction

We incorporate the LDM-generated latent codes into the configuration, then initiate the autoregressive transformer training. Please store the dataset in the same path used for inference, and place the [tokenizer weights](https://huggingface.co/maverickrzw/VideoWorld-GoBattle/tree/main) in the location specified above.

```
cd VideoWorld
bash ./tools/calvin_train.sh
```

# Citation

If you find this project useful in your research, please consider citing:

```
@misc{ren2025videoworldexploringknowledgelearning,
      title={VideoWorld: Exploring Knowledge Learning from Unlabeled Videos},
      author={Zhongwei Ren and Yunchao Wei and Xun Guo and Yao Zhao and Bingyi Kang and Jiashi Feng and Xiaojie Jin},
      year={2025},
      eprint={2501.09781},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.09781},
}
```