
# TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

TokenSwift is a novel framework designed to substantially accelerate the generation of ultra-long sequences, up to 100K tokens, while maintaining the target model's inherent quality.

| Highlights | Description | Emoji |
|------------------|----------------------------------------------|-------|
| ⚡ **Speed** | 3× faster than vanilla Transformers | ⏩ |
| 🎯 **Lossless** | Matches the original model's output quality | ✅ |
| 📈 **Scalability** | Linear time complexity for 100K+ sequences | 📏 |
| 🛠️ **Plug & Play** | Works with most HuggingFace models | 🤗 |

---

## ✨ News

- [2025.5.2] 🔥🔥 Our paper has been accepted to ICML 2025!
- [2025.3.19] 🔥🔥 Released the finetuned [QwQ-32B](https://huggingface.co/TokenSwift/TokenSwift-QwQ-32B) model with 3 $\times$ acceleration. Check out the [inference guide](#inference) for deployment.
- [2025.2.28] 🔥🔥 Released the finetuned [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/TokenSwift/TokenSwift-DeepSeek-R1-Distill-Qwen-32B) model with 3 $\times$ acceleration. Check out the [inference guide](#inference) for deployment.
- [2025.2.27] Paper released on [arXiv](https://arxiv.org/abs/2502.18890).

---

## 📦 Demo

https://github.com/user-attachments/assets/5094fca7-0b12-470c-a7b6-456d254855d1

---

## ✨ Star History

[![Star History Chart](https://api.star-history.com/svg?repos=bigai-nlco/TokenSwift&type=Date)](https://www.star-history.com/#bigai-nlco/TokenSwift&Date)

---

## 📖 Table of contents

- [Introduction](#introduction)
- [Installation](#installation)
  - [Method 1: With pip](#method-1-with-pip)
  - [Method 2: From the source (recommended)](#method-2-from-the-source-recommended)
- [Getting Started](#getting-started)
  - [Models Download](#models-download)
  - [Inference](#inference)
- [Training Guide (Optional)](#training-guide-optional)
  - [Datasets Download](#datasets-download)
  - [How to Train](#how-to-train)
- [Citation](#citation)
- [Acknowledgment](#acknowledgment)

---

## Introduction

We propose **TokenSwift**, a novel framework that achieves **lossless acceleration** for ultra-long sequence generation (up to 100K tokens) while **reducing computation time from hours to minutes**.

*Illustration of the TokenSwift framework. First, the target model (LLM), equipped with a partial KV cache and three linear layers, outputs 4 logits in a single forward pass, and tree-based attention is applied to construct candidate tokens. Second, the top-k candidate 4-grams are retrieved accordingly. These candidates compose the draft tokens, which are fed into the LLM with the full KV cache to generate target tokens. Verification is performed by checking whether the draft tokens match the target tokens exactly. Finally, we randomly select one of the longest valid drafts, and update the n-gram table and KV cache accordingly.*

This repository contains:

- ✅ **100% reproducibility** for all experiments
- 📊 Benchmark scripts for sequence lengths of 20K/40K/60K/80K/100K
- 🤖 Pre-trained model adapters for any model architecture

*Visualization of our acceleration performance vs. baseline methods.*
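To make the verification and selection step from the framework illustration above more concrete, here is a minimal, self-contained Python sketch: a draft is valid up to the last position at which it matches the target model's own tokens exactly, and one of the longest valid drafts is chosen at random. All names and the toy token IDs are illustrative assumptions, not the repository's actual implementation (see `main.py` for that).

```python
import random

def accepted_length(draft, target):
    """Length of the longest prefix on which draft and target tokens agree exactly."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n

def select_draft(drafts, target):
    """Pick one of the longest exactly-matching drafts at random (illustrative only)."""
    lengths = [accepted_length(d, target) for d in drafts]
    best = max(lengths)
    winners = [d[:best] for d, length in zip(drafts, lengths) if length == best]
    return random.choice(winners), best

# Toy example: three candidate 4-gram drafts verified against the target tokens.
target = [17, 42, 7, 99]
drafts = [[17, 42, 7, 5], [17, 42, 7, 99], [17, 3, 8, 1]]
chosen, n_accepted = select_draft(drafts, target)
print(chosen, n_accepted)  # -> [17, 42, 7, 99] 4
```

In the actual pipeline, the accepted tokens are appended to the output, and the n-gram table and KV cache are updated before the next drafting round, as described above.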
---

## Installation

### Method 1: With pip

```bash
pip install tokenswift
```

### Method 2: From the source (recommended)

```bash
git clone https://github.com/bigai-nlco/TokenSwift.git
cd TokenSwift
conda create -n tokenswift python=3.11
conda activate tokenswift
conda install nvidia::cuda-nvcc
pip install -r requirements.txt
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
```

---

## Getting Started

### Models Download

| Model Name | Download Link |
|------------|---------------|
| TokenSwift-Yarn-Llama-2-7b-128k | [HuggingFace](https://huggingface.co/TokenSwift/TokenSwift-Yarn-Llama-2-7b-128k) |
| TokenSwift-Llama-3.1-8B | [HuggingFace](https://huggingface.co/TokenSwift/TokenSwift-Llama-3.1-8B) |
| TokenSwift-Qwen2.5-1.5B | [HuggingFace](https://huggingface.co/TokenSwift/TokenSwift-Qwen2.5-1.5B) |
| TokenSwift-Qwen2.5-7B | [HuggingFace](https://huggingface.co/TokenSwift/TokenSwift-Qwen2.5-7B) |
| TokenSwift-Qwen2.5-14B | [HuggingFace](https://huggingface.co/TokenSwift/TokenSwift-Qwen2.5-14B) |
| TokenSwift-DeepSeek-R1-Distill-Qwen-32B | [HuggingFace](https://huggingface.co/TokenSwift/TokenSwift-DeepSeek-R1-Distill-Qwen-32B) |
| TokenSwift-QwQ-32B | [HuggingFace](https://huggingface.co/TokenSwift/TokenSwift-QwQ-32B) |

### Inference

Take LLaMA3.1-8B as an example:

```bash
torchrun --master-port 1111 --nproc_per_node=1 main.py \
    --model_type llama3_1 \
    --ckpt_path your_checkpoint_path \
    --prefill_len 4096 \
    --retrival_max_budget 4096 \
    --gen_len 102400 \
    --gamma 4 \
    --min_p 0.1 \
    --temperature 1.0 \
    --tree_decoding \
    --ngram_topk 20 \
    --penalty 1.2 \
    --penalty_length 1024 \
    --prompt_id 0
```

For other models, you can run the corresponding scripts in the `infer_scripts/` folder. For example:

```bash
bash infer_scripts/r1_qwen_32b.sh
```

---

## Training Guide (Optional)

### Datasets Download

From the [PG-19](https://huggingface.co/datasets/deepmind/pg19) training set, data longer than 8K tokens is filtered according to each model's tokenizer. Alternatively, download the processed training datasets from [llama2-pg19](https://huggingface.co/datasets/TokenSwift/llama2_pg19_train_data), [llama3.1-pg19](https://huggingface.co/datasets/TokenSwift/llama3.1_pg19_train_data), or [qwen2.5-pg19](https://huggingface.co/datasets/TokenSwift/qwen2.5_pg19_train_data).

### How to Train

Take LLaMA3.1-8B as an example:

```bash
torchrun --master-port 1111 --nproc_per_node=4 train/train_legacy.py \
    --model_name_or_path /your_model_path/Meta-Llama-3.1-8B \
    --llama_type llama3_1 \
    --data_path /your_data_path/llama3_1_pg19_8k_data \
    --output_dir /your_checkpoint_path/adapter_ckpts_llama3_1 \
    --max_steps 200 \
    --per_device_train_batch_size 3 \
    --gradient_accumulation_steps 10 \
    --save_steps 200 \
    --learning_rate 5e-3 \
    --weight_decay 0.1 \
    --warmup_steps 50 \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --report_to tensorboard \
    --bf16 True \
    --medusa_heads 3 \
    --remove-unused-columns false
```
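The `--medusa_heads 3` flag corresponds to the three extra linear prediction heads mentioned in the introduction, which together with the base LM head yield the 4 logits produced per forward pass (matching `--gamma 4` at inference time). The snippet below is only a minimal conceptual sketch of such Medusa-style heads, assuming plain linear projections over the final hidden state; the class and argument names are illustrative and this is not the repository's actual module.

```python
import torch
import torch.nn as nn

class DraftHeads(nn.Module):
    """Illustrative Medusa-style draft heads: extra linear projections over the last hidden state."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_heads)]
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, hidden_size) taken from the last position of the base model.
        # Returns (batch, num_heads, vocab_size); combined with the base LM head's logits,
        # this provides the multiple logits per forward pass used for drafting.
        return torch.stack([head(hidden_states) for head in self.heads], dim=1)

# Toy usage with dummy sizes, just to show the shapes involved.
heads = DraftHeads(hidden_size=64, vocab_size=1000, num_heads=3)
extra_logits = heads(torch.randn(2, 64))
print(extra_logits.shape)  # torch.Size([2, 3, 1000])
```

As a side note on the configuration above, `--nproc_per_node=4`, `--per_device_train_batch_size 3`, and `--gradient_accumulation_steps 10` give an effective batch of 3 × 10 × 4 = 120 sequences per optimizer step.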
For other models, you can run the corresponding scripts in the `train/scripts/` folder. For example:

```bash
cd train
bash scripts/train_R1_qwen2_5_32b.sh
```

---

## Citation

If you are interested in our work or use our library, please cite:

```bibtex
@misc{tokenswift,
      title={From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens},
      author={Tong Wu and Junzhe Shen and Zixia Jia and Yuxuan Wang and Zilong Zheng},
      year={2025},
      eprint={2502.18890},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18890},
}
```

---

## Acknowledgment

This codebase is influenced by remarkable projects from the LLM community, including [Medusa](https://github.com/FasterDecoding/Medusa/tree/main) and [TriForce](https://github.com/Infini-AI-Lab/TriForce).