# Disaggregated Hybrid Parallelism with Ray
A framework for training vision-language models using disaggregated hybrid parallelism, where each model component adopts its optimal parallelization strategy independently.
## Overview
Modern vision-language models (VLMs) present a fundamental training challenge: they consist of **heterogeneous components** with vastly different computational and memory characteristics, making uniform parallelization strategies inefficient.
### The Heterogeneity Problem
Vision-language models combine two components with fundamentally mismatched characteristics:
**Vision Encoder** (Small Model, Long Sequences)
- **Parameters**: ~500M-6B (relatively small)
- **Sequences**: Up to 65k tokens per high-resolution image
- **Computation**: Fast per iteration (smaller model, often sparse attention)
- **Bottleneck**: Activation memory scales with sequence length (B×S×H)
**Text Decoder** (Large Model)
- **Parameters**: 7B-72B (10-100× larger than vision)
- **Computation**: Slower per iteration (much larger model, dense attention)
- **Bottleneck**: Model states scale with parameter count (roughly 20 bytes per parameter, i.e. M×20 bytes, with mixed precision + AdamW)
This asymmetry creates a fundamental challenge: **the vision encoder is memory-bound by activations, while the text decoder is memory-bound by parameters**.
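A rough back-of-envelope comparison, using the figures quoted above, makes the asymmetry concrete (exact numbers depend on activation checkpointing and implementation details):

```python
# Back-of-envelope memory comparison using the figures quoted above.
# Actual usage depends on activation checkpointing, attention kernels,
# and framework overheads.

GiB = 1024 ** 3

# Vision encoder: activations dominate. A single bf16 activation tensor of
# shape (B, S, H) with B=1, S=65_536 tokens, H=1280 already costs:
activation_bytes = 1 * 65_536 * 1280 * 2
print(f"one (B, S, H) activation tensor: {activation_bytes / GiB:.2f} GiB")
# Several such tensors are saved per layer, across 32 layers, so activations
# reach tens of GiB on a single GPU at long sequence lengths.

# Text decoder: model states dominate. With mixed precision + AdamW,
# roughly 20 bytes per parameter (weights, gradients, fp32 optimizer states):
model_state_bytes = 32e9 * 20  # 32B-parameter decoder
print(f"32B decoder model states: {model_state_bytes / GiB:.0f} GiB")
```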
### Why Uniform Parallelization Fails
Existing distributed training frameworks (DeepSpeed, Megatron-LM) assume **uniform parallelization strategies** across all model components. This creates severe inefficiencies:
**Tensor Parallelism (TP):**
- ✅ Efficient for large text models (partitions parameters across GPUs)
- ❌ **Inefficient for small vision encoders**: Small hidden dimensions make communication overhead dominate computation time, creating a bottleneck
**Data Parallelism / ZeRO3 (parameter sharding):**
- ✅ Works well for short sequences that fit in memory
- ❌ **Critical limitation**: Minimum batch size per GPU = 1, causing OOM errors with long sequences (65k tokens)
- ❌ **Real-world impact**: Researchers must downscale images (896×896 instead of 3584×3584); since sequence length scales with pixel count, this is a (3584/896)² = **16× reduction in trainable sequence length** compared to hardware capability
### Our Solution: Disaggregated Hybrid Parallelism (DHP)
This repository implements **disaggregated hybrid parallelism (DHP)** using Ray, allowing each model component to adopt its optimal parallelization strategy:
**✅ Vision Encoder: Sequence Parallelism (SP) + ZeRO1 (data parallel + optimizer states sharding)**
- Implements DeepSpeed-Ulysses style sequence parallelism [1] to split long sequences across GPUs
- Activation memory scales linearly with GPU count: B×S×H/N per GPU
- Avoids tensor parallelism overhead for small models
- Communication pattern: All-to-All operations enable efficient attention across partitioned sequences (a minimal sketch appears at the end of this section)
- Particularly effective with sparse attention patterns (see Qwen2.5-VL results below)
**✅ Text Decoder: Tensor Parallelism (TP)**
- Efficiently partitions large models: model memory = M×20/N per GPU (mixed precision + AdamW)
- Dense computation amortizes TP communication cost
- Essential for fitting 7B-72B parameter models in memory
**Key Insight**: Heterogeneous models need heterogeneous parallelization strategies. By disaggregating components and choosing the parallelism of each independently, we achieve better resource utilization and enable previously infeasible training scenarios.
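To make the vision encoder's All-to-All communication pattern concrete, here is a minimal DeepSpeed-Ulysses-style sketch of sequence-parallel attention, following [1]. Function names and shapes are illustrative assumptions, not the repository's actual implementation; it assumes the sequence length and head count are divisible by the SP degree.

```python
import torch
import torch.distributed as dist

def _all_to_all(x: torch.Tensor, scatter_dim: int, gather_dim: int, group) -> torch.Tensor:
    """Scatter x along scatter_dim across the SP group and gather along gather_dim."""
    world_size = dist.get_world_size(group)
    inputs = [t.contiguous() for t in torch.chunk(x, world_size, dim=scatter_dim)]
    outputs = [torch.empty_like(inputs[0]) for _ in range(world_size)]
    dist.all_to_all(outputs, inputs, group=group)
    return torch.cat(outputs, dim=gather_dim)

def ulysses_attention(q, k, v, attn_fn, sp_group):
    """q, k, v: local shards of shape [B, S/P, num_heads, head_dim].

    First All-to-All: trade sequence sharding for head sharding, so every rank
    sees the full sequence for 1/P of the heads. Second All-to-All: trade back,
    returning a [B, S/P, num_heads, head_dim] output shard.
    """
    q, k, v = (_all_to_all(t, scatter_dim=2, gather_dim=1, group=sp_group) for t in (q, k, v))
    out = attn_fn(q, k, v)  # full-sequence attention on a subset of heads
    return _all_to_all(out, scatter_dim=1, gather_dim=2, group=sp_group)
```

Because the vision encoder's window attention is local on most layers, this All-to-All is only needed on the few full-attention layers, which is part of why SP is cheap here (see the Qwen2.5-VL case study below).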
## How Ray Enables Disaggregated Hybrid Parallelism
Traditional distributed training frameworks (DeepSpeed, Megatron-LM) use a **single SPMD (Single Program Multiple Data) process group** where all model components must use the same parallelization strategy. This creates a fundamental constraint: **you cannot apply different parallelization strategies to different parts of the model**.
**Traditional Single SPMD Group:**
```
GPU 0: Process(rank=0) → Full Model (Vision + Text)
GPU 1: Process(rank=1) → Full Model (Vision + Text)
GPU 2: Process(rank=2) → Full Model (Vision + Text)
GPU 3: Process(rank=3) → Full Model (Vision + Text)
↓
Single torch.distributed process group
Uniform parallelization strategy enforced across ALL components
(e.g., if using TP, BOTH vision and text must use TP)
```
Ray breaks this constraint by enabling **multiple independent SPMD groups**, each with its own parallelization strategy and resource allocation:
**Ray-Orchestrated Multiple SPMD Groups:**
```
Ray Driver (Central Orchestrator)
├── Vision ActorGroup: Independent SPMD Group (SP)
│ ├── VisionActor(rank=0) → Vision Encoder only
│ ├── VisionActor(rank=1) → Vision Encoder only
│ ├── VisionActor(rank=2) → Vision Encoder only
│ └── VisionActor(rank=3) → Vision Encoder only
│ ↓
│ SP process group: All-to-All for sequence parallel attention
│
└── Text ActorGroup: Independent SPMD Group (TP)
├── TextActor(rank=0) → Text Decoder only
├── TextActor(rank=1) → Text Decoder only
├── TextActor(rank=2) → Text Decoder only
└── TextActor(rank=3) → Text Decoder only
↓
TP process group: All-Reduce for activations (forward) and gradients (backward)
```
### Key Components
**ActorGroup**: Collection of Ray actors forming an independent SPMD group with its own `torch.distributed` process group. Supports GPU collocation (multiple actors per GPU) and provides APIs for synchronous (`execute_all`) and asynchronous (`execute_all_async`) execution across all actors.
**VisionTrainer / TextTrainer**: Specialized trainer classes with decomposed methods (`forward_step`, `backward_step`, `optimizer_step`) instead of monolithic training loops. Each trainer initializes its component's model with the appropriate parallelization strategy (SP for vision, TP for text).
**Ray Driver**: Orchestrates the training loop by invoking methods on ActorGroups. Coordinates cross-component data transfer via Ray ObjectRefs, enabling automatic data movement between vision and text actors.
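The following is a minimal sketch of how an `ActorGroup` along these lines could be built on Ray actors. Method names follow the description above; constructor arguments and the fractional-GPU default are assumptions, and the repository's real implementation differs:

```python
import ray

class ActorGroup:
    """A collection of Ray actors forming one independent SPMD group."""

    def __init__(self, config, trainer_cls, num_actors, num_gpus_per_actor=0.5):
        # Fractional num_gpus lets two groups (e.g. vision and text) be
        # collocated on the same physical GPUs.
        remote_cls = ray.remote(num_gpus=num_gpus_per_actor)(trainer_cls)
        # Each trainer is expected to set up its own torch.distributed process
        # group inside its own rendezvous, giving this group a private world.
        self.actors = [
            remote_cls.remote(config, rank=rank, world_size=num_actors)
            for rank in range(num_actors)
        ]

    def execute_all_async(self, method, *args, **kwargs):
        # Returns one ObjectRef per actor without blocking.
        return [getattr(actor, method).remote(*args, **kwargs) for actor in self.actors]

    def execute_all(self, method, *args, **kwargs):
        # Synchronous variant: block until every actor finishes.
        return ray.get(self.execute_all_async(method, *args, **kwargs))
```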
### Training Loop Example
```python
# Create independent actor groups with different parallelization strategies
vision_group = ActorGroup(config, VisionTrainer, num_actors=4)  # SP
text_group = ActorGroup(config, TextTrainer, num_actors=4)      # TP

for iteration in range(num_iterations):
    # Forward: Vision (SP) → Text (TP) via Ray ObjectRefs
    vision_refs = vision_group.execute_all_async("forward_step", iteration)
    text_refs = text_group.execute_all_async("forward_step", vision_refs, iteration)
    losses = ray.get(text_refs)

    # Backward: Text → Vision
    text_grad_refs = text_group.execute_all_async("backward_step")
    vision_backward_refs = vision_group.execute_all_async("backward_step", text_grad_refs)
    ray.get(vision_backward_refs)

    # Independent optimizer updates
    vision_group.execute_all("optimizer_step")
    text_group.execute_all("optimizer_step")
```
**Key Benefits**: Vision and text groups use different parallelization strategies (SP vs TP) on the same GPUs through collocation. Ray ObjectRefs handle cross-component data transfer automatically. Easy to reconfigure `num_actors` or strategies per component without framework changes.
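The automatic data movement via ObjectRefs can be seen in isolation with two toy actors (illustrative only; the real vision and text trainers exchange embeddings and gradients this way):

```python
import ray

@ray.remote
class Producer:
    def forward_step(self):
        # Stands in for the vision encoder's output embeddings.
        return [1.0, 2.0, 3.0]

@ray.remote
class Consumer:
    def forward_step(self, vision_out):
        # When an ObjectRef is passed as a top-level argument, Ray resolves it
        # and ships the value to this actor before the method runs.
        return sum(vision_out)

producer, consumer = Producer.remote(), Consumer.remote()
ref = producer.forward_step.remote()                # data stays in the object store
print(ray.get(consumer.forward_step.remote(ref)))   # 6.0, no manual transfer code
```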
## Case Study: Qwen2.5-VL
### Architecture Details
*Figure 1: Qwen2.5-VL architecture showing the vision encoder and the text decoder. The vision encoder processes high-resolution images (up to 3584×3584) into variable-length sequences (N = 1k-65k tokens). The text input contains N image placeholder tokens alongside 512 text tokens. Vision embeddings are scattered into these N placeholder positions, creating a combined sequence of N vision tokens + 512 text tokens processed by the language model.*
**Vision Encoder** (Custom ViT - 670M parameters)
- **Architecture**: 32 layers, 1280 hidden dimension, 16 attention heads
- **Window Attention**: 112-token windows, full attention only on layers {7, 15, 23, 31}
- **Sequences**: 1k-65k tokens from high-resolution images (up to 3584×3584) or videos
- **Computation**: Fast (~0.2-0.7s per iteration for 4-16k tokens)
**Text Decoder** (Qwen2.5 - 7B/32B/72B parameters)
- **7B**: 28 layers, 3584 hidden dimension, 4 KV heads
- **32B**: 64 layers, 5120 hidden dimension, 8 KV heads
- **72B**: 80 layers, 8192 hidden dimension, 8 KV heads
- **Sequences**: Vision tokens (up to 16k) + text tokens (~512) processed together
- **Computation**: Slower (~1-3s per iteration) due to model size
### Why Disaggregated Hybrid Parallelism Works for Qwen2.5-VL
Qwen2.5-VL's architecture is well-suited for disaggregated hybrid parallelism: the vision encoder's **small hidden size (1280)** makes TP inefficient, while **sparse window attention** (only 4 of 32 layers need cross-GPU All-to-All) makes SP highly efficient. The text decoder's **large parameter count (7B-72B)** requires TP's parameter partitioning to fit in memory.
## Experimental Results
### Throughput Comparison: Qwen2.5-VL-32B (8×H100)
We compare three parallelization strategies across varying sequence lengths:

*Figure 2: Throughput comparison of three parallelization strategies (SP+TP, pure TP, ZeRO3) on Qwen2.5-VL-32B across varying vision sequence lengths (1k–65k tokens). SP+TP consistently outperforms pure TP and enables training at extreme sequence lengths where ZeRO3 hits OOM beyond 16k tokens. The batch size upper bound was set to 8 (corresponding to a micro batch size of 1, which is the minimum batch size supported by ZeRO3). When batch size 8 was not feasible due to memory limits, we used the largest batch size that could fit on the GPUs. ZeRO3 results were obtained using the implementation from the Qwen-VL repository, while the TP baseline was implemented in-house without using Ray.*
**Key Findings:**
1. **SP+TP achieves 1.26-1.37× speedup over pure TP** consistently across all sequence lengths (1k-65k tokens)
2. **SP+TP enables training at extreme sequence lengths where ZeRO3 fails**: ZeRO3 hits OOM at 16k+ tokens due to replicated activations, while DHP distributes activation memory across GPUs
3. **ZeRO3 has sequence-length dependent performance**: Optimal at ~9k tokens where it overlaps communication, but struggles with short sequences (insufficient computation) and long sequences (OOM). SP+TP maintains consistent performance across all lengths
**Configuration**: Qwen2.5-VL-32B, 8× H100 GPUs, BF16 mixed precision, Flash Attention 2, activation checkpointing enabled.
## How to Use
### Environment Setup
These are the environments we have tested and verified the system on. Other configurations may also work but are not officially tested.
- Python 3.12
- CUDA 12.6
- NVIDIA A100/H100 GPUs
  - Qwen2.5-VL 7B model: 4 GPUs
  - Qwen2.5-VL 32B model: 8 GPUs
#### Installation Steps
1. **Clone the repository**
```bash
git clone https://github.com/ray-project/multimodal-training
cd multimodal-training
```
2. **Install Python dependencies**
```bash
# Install PyTorch with CUDA support (adjust CUDA version as needed)
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126
# Install Ray, DeepSpeed (custom branch), and other dependencies
pip install -r requirements.txt
```
3. **Start Ray cluster (Optional)**
You don't need to start a Ray cluster when running on your local machine. If you’d like to run on a Ray cluster, please refer to the [Ray documentation](https://docs.ray.io/en/latest/ray-core/starting-ray.html) for instructions on how to set one up.
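For a quick check, standard Ray behavior (not specific to this repository) is that `ray.init()` starts a local instance, while `ray.init(address="auto")` attaches to an already-running cluster:

```python
import ray

ray.init()                      # local machine: starts a Ray instance in-process
# ray.init(address="auto")      # on a cluster started with `ray start`: attach to it

print(ray.cluster_resources())  # e.g. available CPUs/GPUs across the cluster
ray.shutdown()
```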
### Data Preparation
The framework uses a simplified, LLaVA-style data format that is widely adopted in the VLM community. This format requires two components:
1. **Annotation file** (JSON or JSONL) containing conversation data and image references
2. **Image directory** containing the actual image files
#### Annotation File Format
The annotation file should contain conversation examples in the following structure:
```json
[
  {
    "image": "relative/path/to/image.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nDescribe this image."
      },
      {
        "from": "gpt",
        "value": "A man with a red helmet on a small moped on a dirt road."
      }
    ]
  },
  ...
]
```
**Format Details:**
- **File format**: Either JSON (array of objects) or JSONL (one JSON object per line)
- **`image`**: Relative path to the image file (relative to the `data_path` specified in config)
- **`conversations`**: List of conversation turns between human and model
- **`from`**: Either `"human"` (user) or `"gpt"` (assistant)
- **`value`**: Text content. Use the `<image>` token to indicate where the image should be inserted
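As an illustration of the format above (not the repository's actual data pipeline), a loader that accepts both JSON and JSONL and checks the required fields might look like this; the paths are placeholders:

```python
import json
from pathlib import Path

def load_annotations(annotation_path: str) -> list[dict]:
    path = Path(annotation_path)
    if path.suffix == ".jsonl":
        # JSONL: one JSON object per line
        records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    else:
        # JSON: a single array of objects
        records = json.loads(path.read_text())
    for rec in records:
        assert "image" in rec and "conversations" in rec, f"malformed record: {rec}"
    return records

records = load_annotations("/path/to/laion_pop/laion_pop_train.jsonl")
# `image` is resolved relative to the data_path configured for the dataset.
image_file = Path("/path/to/laion_pop/images") / records[0]["image"]
```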
#### Tested Datasets
We have tested the codebase with:
- [COCO](https://cocodataset.org/) - Image captioning dataset
- [LAION POP](https://laion.ai/blog/laion-pop/) - High-resolution images with detailed descriptions
You'll need to convert the original annotation formats to the conversation format shown above.
### Training Configuration
The model and training settings are specified using a YAML configuration file. See [example configuration](configs/sample.yaml) for full details.
#### Key Configuration Sections
Below is a snippet illustrating the main configuration parameters:
```yaml
...
# Vision model configuration
vision:
  model_type: "qwen2_5_vl"                    # Options: "qwen2_5_vl"
  model_name: "Qwen/Qwen2.5-VL-32B-Instruct"  # Path to pretrained model
  parallelism: "sequence"                     # Options: "none", "tensor", "sequence", "deepspeed"
...
# Text model configuration
text:
  model_type: "qwen2_5_vl"                    # Options: "qwen2_5_vl"
  model_name: "Qwen/Qwen2.5-VL-32B-Instruct"  # Path to pretrained model
  parallelism: "tensor"                       # Options: "none", "tensor", "deepspeed", "autotp"
  autotp_size: null                           # Optional tensor parallel size when using AutoTP (defaults to world size)
...
# Training configuration
training:
  batch_size: 2
  learning_rate: 5e-05
  no_checkpoint: true                         # Disable checkpointing (default: false)
  checkpoint_dir: "/tmp/checkpoints"          # Directory to save/load checkpoints
...
# Data configuration
data:
  datasets:
    - laion_pop_train
  data_registry:
    mscoco2017_train_captions:
      annotation_path: "/path/to/mscoco2017/annotations/coco2017_train.json"
      data_path: "/path/to/mscoco2017"
    mscoco2017_val_captions:
      annotation_path: "/path/to/mscoco2017/annotations/coco2017_val.json"
      data_path: "/path/to/mscoco2017"
    laion_pop_train:
      annotation_path: "/path/to/laion_pop/laion_pop_train.jsonl"
      data_path: "/path/to/laion_pop/images"
    laion_pop_val:
      annotation_path: "/path/to/laion_pop/laion_pop_val.jsonl"
      data_path: "/path/to/laion_pop/images"
  num_workers: 4
  pin_memory: true
  data_flatten: true
  min_pixels: 802816
  max_pixels: 802816
```
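Because the launch command below passes `--config-path` and `--config-name` in Hydra style, a config file like the one above can be sanity-checked locally, for example with OmegaConf (a tooling assumption, not a requirement of this repository):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/sample.yaml")
print(cfg.vision.parallelism, cfg.text.parallelism)  # e.g. "sequence", "tensor"

# Every dataset listed under data.datasets should have a registry entry.
for name in cfg.data.datasets:
    assert name in cfg.data.data_registry, f"{name} missing from data_registry"
```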
### Running Training
You can run training by specifying the configuration file:
```bash
python -m python.train_ray \
--config-path=../configs \
--config-name=sample
```
To save checkpoints during training, set `no_checkpoint` to `false` in the training configuration section. Checkpoints will be saved to the specified `checkpoint_dir`.
Checkpoints are saved at the end of each epoch.
If you want to resume training from a checkpoint, specify the `checkpoint_dir` where the checkpoints are stored. The training script will automatically load the latest checkpoint from that directory.
To use the pretrained weights for Qwen2.5-VL, please refer to [Loading Pretrained Weights](LOAD_PRETRAINED_WEIGHTS.md).
To reproduce the experiments described above, please refer to [Experiments](experiments/EXPERIMENTS.md).
## References
[1] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminabadi, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models." *arXiv:2309.14509*, 2023. https://arxiv.org/abs/2309.14509
## Citation
If you use this codebase in your research, please cite:
```bibtex
@software{ray-multimodal-training,
  title={Disaggregated Hybrid Parallelism with Ray},
  author={Masahiro Tanaka and Amjad Almahairi and Nithin Chalapathi and Richard Liaw and Praveen Gorthy and Matthew Deng},
  year={2025},
  url={https://github.com/ray-project/multimodal-training}
}
```
## Acknowledgments
- Thanks to the Qwen team for their excellent open-source models
- Built on top of DeepSpeed, PyTorch Distributed, and Ray
## Contact
For questions or issues, please [open an issue](https://github.com/ray-project/multimodal-training/issues).