# VideoRoPE
**Repository Path**: wcyl/VideoRoPE
## Basic Information
- **Project Name**: VideoRoPE
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-18
- **Last Updated**: 2025-12-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
[ICML 2025 (Oral)]
Official implementation of **VideoRoPE: What Makes for Good Video Rotary Position Embedding?**
- **Authors**: [Xilin Wei*](https://github.com/Wiselnn570), [Xiaoran Liu*](https://scholar.google.de/citations?user=Qe6F4J4AAAAJ&hl=en), [Yuhang Zang](https://yuhangzang.github.io), [Xiaoyi Dong](https://lightdxy.github.io), [Pan Zhang](https://panzhang0212.github.io/), [Yuhang Cao](https://scholar.google.com/citations?user=sJkqsqkAAAAJ&hl=en), Jian Tong, [Haodong Duan](https://kennymckormick.github.io/), [Qipeng Guo](https://scholar.google.com/citations?user=k3mPGKgAAAAJ&hl=en), [Jiaqi Wang](https://myownskyw7.github.io/), [Xipeng Qiu](https://xpqiu.github.io/en.html), [Dahua Lin](http://dahua.site/)
- **Institutes**: Fudan University; Shanghai AI Laboratory; Shanghai Innovation Institute
- **Resources**: [[Paper](https://arxiv.org/pdf/2502.05173)] [[Project Page](https://wiselnn570.github.io/VideoRoPE/)] [[Hugging Face](https://huggingface.co/collections/Wiselnn/videorope-what-makes-for-good-video-rotary-position-embeddi-67ca90664c8e169422449c56)]
## Highlights
- **Four Key Positional Encoding Schemes:** We present an analysis of four key properties essential for RoPE when applied to video. Motivated by this analysis, we propose **VideoRoPE**, which combines **Low-frequency Temporal Allocation (LTA)**, a **Diagonal Layout (DL)**, and **Adjustable Temporal Spacing (ATS)** to satisfy all four properties. A minimal, illustrative sketch of the resulting index layout follows this list.
- **A Challenging Video Haystack Retrieval Benchmark:** We introduce the challenging **V-NIAH-D** task to expose the drawbacks of current position embedding designs regarding frequency allocation. Our findings reveal that existing Video LLMs are easily misled by frequency-based distractors.
- **Excellent Performance:** Extensive experiments demonstrate that VideoRoPE consistently achieves superior performance compared to other RoPE variants. For example, VideoRoPE outperforms the previous M-RoPE on long video retrieval (+12.4 on V-NIAH, +12.4 on V-NIAH-D), video understanding (+2.9 on LongVideoBench, +4.5 on MLVU, +1.7 on Video-MME), and hallucination (+11.9 on VideoHallucer) benchmarks.
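To make the Diagonal Layout and Adjustable Temporal Spacing concrete, here is a minimal, illustrative sketch of how 3D (temporal, x, y) position indices could be laid out. This is *not* the repository's implementation: the function name, the centring offsets, and the `delta` value standing in for the adjustable temporal spacing are hypothetical, and LTA (which governs how rotary frequencies are split across the three axes) is not shown; the exact index arithmetic follows the paper.

```python
import torch

def videorope_style_indices(num_prefix_text, T, H, W, delta=2.0):
    """Illustrative (t, x, y) indices for [prefix text tokens] + [T*H*W video patches]."""
    t_ids, x_ids, y_ids = [], [], []

    # Text tokens keep the same index on all three axes (they lie on the diagonal).
    for p in range(num_prefix_text):
        t_ids.append(float(p)); x_ids.append(float(p)); y_ids.append(float(p))

    start = float(num_prefix_text)
    for frame in range(T):
        # Adjustable Temporal Spacing: the temporal index advances by `delta` per frame.
        t = start + delta * frame
        for i in range(H):
            for j in range(W):
                t_ids.append(t)
                # Diagonal Layout: spatial indices stay centred on the temporal index,
                # so that surrounding text tokens continue the diagonal.
                x_ids.append(t + (i - (H - 1) / 2))
                y_ids.append(t + (j - (W - 1) / 2))

    return torch.tensor([t_ids, x_ids, y_ids])  # shape: (3, num_tokens)

pos = videorope_style_indices(num_prefix_text=4, T=3, H=2, W=2, delta=2.0)
print(pos.shape)  # torch.Size([3, 16])
```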
## News
**[2025/7/2]** VideoRoPE++ is released with the **training-free extrapolation method YaRN-V** and the **comprehensive V-RULER benchmark**. See [paper](https://github.com/Wiselnn570/VideoRoPE/blob/main/VideoRoPE_plus.pdf) *(currently on hold at arXiv, temporarily hosted here)* and [code](https://github.com/Wiselnn570/VideoRoPE/tree/main/videorope_plus).
**[2025/6/7]** VideoRoPE is selected as an ICML 2025 **Oral**!
**[2025/3/7]** The V-NIAH-D benchmark, checkpoints, and training data have been released on [Hugging Face](https://huggingface.co/collections/Wiselnn/videorope-what-makes-for-good-video-rotary-position-embeddi-67ca90664c8e169422449c56).
**[2025/3/7]** The training code has been added to the repository; please check it out.
**[2025/2/14]** Code and [Project Page](https://wiselnn570.github.io/VideoRoPE/) are released!
## Todo
- [x] VideoRoPE Implementation with *transformers*
- [x] VideoRoPE Implementation with *vLLM*
- [x] V-NIAH-D Release
- [x] Checkpoints Release
- [x] Evaluation Code Release
- [x] VideoRoPE++ Paper Release
- [x] VideoRoPE++ Code Release
- [ ] VideoRoPE++ V-RULER Hugging Face Release
## Usage
- Required Package Versions
```
transformers 4.45.2
vllm 0.6.3.post2.dev171+g890ca360
```
- The VideoRoPE implementation (in both the *transformers* and vLLM code) is marked with **#!**; you can locate it quickly with Ctrl+F.
- For *transformers* inference (a fuller, hedged input-preparation sketch follows this list):
```python
with torch.inference_mode():
    generated_ids = model.generate(
        ...,  # the usual inputs and generation kwargs
        which_rope=which_rope,
        scale_factor=scale_factor
    )
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
generated_text = output_text[0]
```
- For vLLM inference (the setup around this snippet is sketched after this list):
```python
mm_data['which_rope'] = which_rope
mm_data['scale_factor'] = scale_factor
llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}
with torch.no_grad():
    outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
```
## Train
To verify the superiority of VideoRoPE, we use the diverse and high-quality video dataset [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) for video fine-tuning. To balance training efficiency and long-video comprehension, we randomly select 136K videos with durations under 2 minutes and 18K videos with durations between 2 and 3 minutes.
Once the data is prepared, one can fine-tune the model following the training data format of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory):
```sh
cd LLaMA-Factory
sh multi_gpu_sft_slurm.sh
```
*Note that, to align with the training format of Qwen2-VL, our adjustments are mainly in `LLaMA-Factory/src/llamafactory/data/mm_plugin.py`.*
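For reference, here is a hedged sketch of what one training sample could look like in LLaMA-Factory's sharegpt-style video format (mirroring its `data/mllm_video_demo.json`); the field names, paths, and output file name are illustrative and should be checked against the LLaMA-Factory docs and the released training data.

```python
import json

# One illustrative sample; "<video>" marks where the video tokens are inserted.
sample = {
    "messages": [
        {"role": "user", "content": "<video>What is happening in this clip?"},
        {"role": "assistant", "content": "A person is assembling a bookshelf ..."},
    ],
    "videos": ["data/videos/example.mp4"],  # placeholder path
}

with open("videorope_sft.json", "w") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```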
## Citation
If you find our work helpful for your research, please consider giving this repository a star and citing our papers:
```bibtex
@inproceedings{wei2025videorope,
title={VideoRoPE: What Makes for Good Video Rotary Position Embedding?},
author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Tong, Jian and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and others},
booktitle={International Conference on Machine Learning},
year={2025}
}
@misc{wei2025videoropepp,
author = {Xilin Wei and Xiaoran Liu and Yuhang Zang and Shengyuan Ding and Xiaoyi Dong and Yuhang Cao and Haodong Duan and Qipeng Guo and Jiaqi Wang and Xipeng Qiu and Dahua Lin},
title = {VideoRoPE++: Towards Better Video Rotary Position Embedding},
year = {2025},
howpublished = {\url{https://github.com/Wiselnn570/VideoRoPE/blob/main/videorope_plus/VideoRoPE_plus.pdf}},
doi={10.5281/zenodo.16529245}
}
```
## Acknowledgments
- [transformers](https://github.com/huggingface/transformers): the codebase we built upon. Thanks for their wonderful work.
- [vLLM](https://github.com/vllm-project/vllm): an excellent open-source codebase for high-throughput and memory-efficient inference. Thanks for their wonderful work.
- [Qwen2-VL](https://github.com/QwenLM/Qwen2.5-VL): the amazing open-sourced multimodal large language model!
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): Wonderful job in facilitating LLMs & VLMs training.