![bytedance](./assets/bytedance.jpeg)

# Ultra Sparse Memory Network


Ultra Sparse Memory Network (UltraMem) is a sparse model similar to MoE, but with significantly lower memory access, which makes inference fast. However, UltraMem has only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap.
*Figure: Overall structure of the three sparse layers.*

*Figure: Flow of Tucker-Decomposed Query-Key Retrieval (TDQKR).*

*Figure: Comprehensive performance comparison across different open-sourced models and benchmarks.*
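As a rough mental model of why memory access stays low, the sketch below is a minimal, self-contained PyTorch illustration of a product-key-style memory layer, the family of techniques UltraMem builds on. It is not the implementation in this repo (see `olmo/ultramem_layer_v2.py` for the real layer, whose TDQKR scoring refines the simple additive score combination used here); the class and argument names are invented for illustration, and the default sizes only loosely mirror the `mem_knum`/`mem_kdim`/`mem_vdim`/`mem_knn` flags in the training command below.

```python
# Illustrative sketch only: NOT the UltraMem/UltraMemV2 layer shipped in this repo.
import torch
import torch.nn.functional as F


class SparseMemorySketch(torch.nn.Module):
    """Product-key-style sparse memory layer (simplified for illustration)."""

    def __init__(self, d_model=768, k_num=360, k_dim=192, v_dim=192, knn=32):
        super().__init__()
        # Two small sub-key tables; their Cartesian product indexes k_num**2 value slots.
        self.keys_row = torch.nn.Parameter(torch.randn(k_num, k_dim // 2) * 0.02)
        self.keys_col = torch.nn.Parameter(torch.randn(k_num, k_dim // 2) * 0.02)
        # Large value table, but only a handful of rows are read per token.
        self.values = torch.nn.EmbeddingBag(k_num * k_num, v_dim, mode="sum")
        self.query_proj = torch.nn.Linear(d_model, k_dim)
        self.out_proj = torch.nn.Linear(v_dim, d_model)
        self.k_num, self.knn = k_num, knn

    def forward(self, x):                      # x: (n_tokens, d_model)
        q = self.query_proj(x)
        q_row, q_col = q.chunk(2, dim=-1)      # score each half against one sub-key table
        s_row = q_row @ self.keys_row.T        # (n_tokens, k_num)
        s_col = q_col @ self.keys_col.T
        # Keep only the top-knn candidates per axis, so at most knn*knn grid scores are
        # formed instead of k_num**2: this is what keeps memory access sparse.
        top_r, idx_r = s_row.topk(self.knn, dim=-1)
        top_c, idx_c = s_col.topk(self.knn, dim=-1)
        grid = top_r.unsqueeze(-1) + top_c.unsqueeze(-2)              # (n_tokens, knn, knn)
        flat_idx = (idx_r.unsqueeze(-1) * self.k_num + idx_c.unsqueeze(-2)).flatten(1)
        scores, pick = grid.flatten(1).topk(self.knn, dim=-1)         # final top-knn slots
        slots = flat_idx.gather(1, pick)
        weights = F.softmax(scores, dim=-1)
        out = self.values(slots, per_sample_weights=weights)          # sparse value gather
        return self.out_proj(out)


# Example: one forward pass over a batch of 4 token vectors.
layer = SparseMemorySketch()
print(layer(torch.randn(4, 768)).shape)        # torch.Size([4, 768])
```

The key property this sketch tries to show is that the value table can be made very large while each token still reads only `knn` value vectors, which is where the low memory access (and fast inference) of memory-layer models comes from.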
## Installation

Please follow the OLMoE setup instructions: https://github.com/allenai/OLMoE

## Training

**Key files**

```
olmo/
├── models.py             # main model
├── memory_plus_layer.py  # Memory+ layer (by Meta)
├── ultramem_layer.py     # UltraMem layer
└── ultramem_layer_v2.py  # UltraMemV2 layer
```

**Dataset**

We use the same dataset as OLMoE. You can download it from [here](https://huggingface.co/datasets/allenai/OLMoE-mix-0924). Then use the following command to convert the dataset into the format our model expects:

```
dolma tokens \
    --documents ${PATH_TO_DOWNLOADED_DATA} \
    --destination ${PATH_WHERE_TO_SAVE_TOKENIZED_DATA} \
    --tokenizer.name_or_path 'allenai/gpt-neox-olmo-dolma-v1_5' \
    --max_size '2_147_483_648' \
    --seed 0 \
    --tokenizer.eos_token_id 50279 \
    --tokenizer.pad_token_id 1 \
    --processes ${NUMBER_OF_CPU_CORES_TO_USE}
```

**Training**

You can run the training script below, which trains a 227M/1.2B UltraMemV2 model. Note that we only implement DDP with the fused kernel; it is straightforward to extend to Megatron with some additional synchronization and communication. (A back-of-the-envelope look at what the `mem_*` flags below imply for memory-table size and per-token access appears at the end of this README.)

```
sh launch.sh ${CONFIG_PATH} \
    --save_folder=${SAVE_DIR} \
    --run_name=${run_name} \
    --save_overwrite=true \
    --mount_common_hdfs=true \
    --fsdp.sharding_strategy=NO_SHARD \
    --canceled_check_interval=9999999 \
    --global_indices_file=${CODE_DIR}/global_indices.npy \
    --load_path=${CUR_CKPT_PATH} \
    --model.init_std=0.02282177322938192 \
    --model.init_fn="full_megatron" \
    --model.d_model=768 \
    --model.n_layers=20 \
    --model.n_heads=12 \
    --model.n_kv_heads=12 \
    --model.weight_tying=true \
    --max_duration=5e11T \
    --scheduler.t_warmup=1e10 \
    --scheduler.t_max=5e11 \
    --device_train_microbatch_size=3 \
    --global_train_batch_size=768 \
    --save_interval=1000 \
    --eval_interval=1000 \
    --save_num_checkpoints_to_keep=-1 \
    --console_log_interval=10 \
    \
    --model.block_type='sequential' \
    --model.mlp_hidden_size=4992 \
    \
    --optimizer.mem_value_lr_times=4.0 \
    --optimizer.mem_value_lr_max_steps_rate=1.0 \
    --model.mem_insert_way='full' \
    --model.mem_knum=360 \
    --model.mem_kdim=192 \
    --model.mem_vdim=192 \
    --model.mem_pre_vdim=192 \
    --model.mem_knn=32 \
    --model.mem_head=1 \
    --model.mem_share_ratio=0.5 \
    --model.mem_type='ultramem_v2' \
    --model.mem_value_expand_time=1 \
    \
    --distributed_strategy=ddp \
    --model.init_device='cuda' \
    --optimizer.metrics_log_interval=50 \
    --model.mem_log_interval=50 \
    --save_interval_unsharded=1000
```

You can also port the code into your own project; it should be easy to run :)

## Citing

If you find this work helpful or use it in your research, please consider citing our papers:

```bibtex
@article{huang2025ultra,
  title={UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning},
  author={Huang, Zihao and Bao, Yu and Min, Qiyang and Chen, Siyan and Guo, Ran and Huang, Hongzhi and Zhu, Defa and Zeng, Yutao and Wu, Banggu and Zhou, Xun and Qiao, Siyuan},
  journal={arXiv},
  year={2025}
}
```

```bibtex
@article{huang2024ultra,
  title={Ultra-Sparse Memory Network},
  author={Huang, Zihao and Min, Qiyang and Huang, Hongzhi and Zhu, Defa and Zeng, Yutao and Guo, Ran and Zhou, Xun},
  journal={ICLR 2025},
  year={2024}
}
```

## Acknowledgement

Our open-sourced work is mainly based on OLMoE; thanks for their work.
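As referenced in the Training section, here is the back-of-the-envelope arithmetic for the `mem_*` flags in the launch command. It assumes a product-key-style layout in which `mem_knum` sub-keys per axis index a `mem_knum x mem_knum` grid of value slots (the same assumption as the sketch near the top of this README); consult `olmo/ultramem_layer_v2.py` for the authoritative meaning of each flag.

```python
# Rough arithmetic only; ASSUMES mem_knum sub-keys per axis indexing a
# mem_knum x mem_knum grid of value slots (product-key-style layout).
mem_knum, mem_vdim, mem_knn, mem_head = 360, 192, 32, 1   # values from the launch command

value_slots = mem_knum * mem_knum          # 129,600 candidate value vectors
value_params = value_slots * mem_vdim      # ~24.9M parameters in one memory value table
read_per_token = mem_knn * mem_head        # only 32 value vectors fetched per token
read_fraction = read_per_token / value_slots

print(f"value slots:       {value_slots:,}")
print(f"value parameters:  {value_params / 1e6:.1f}M")
print(f"values read/token: {read_per_token} ({read_fraction:.4%} of the table)")
```

Under this assumption, each token touches well under 0.1% of the memory table, which is the sense in which the memory layer is "ultra sparse".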