# vllm-gfx906

**Repository Path**: bjxamo_admin/vllm-gfx906

## Basic Information

- **Project Name**: vllm-gfx906
- **Description**: vLLM for AMD gfx906 GPUs, e.g. Radeon VII / MI50 / MI60
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: gfx906/main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-18
- **Last Updated**: 2025-05-18

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

vLLM for gfx906
===================

This is a modified version of vLLM that works with (and only works with) AMD gfx906 GPUs such as the Radeon VII / Radeon Pro VII / Instinct MI50 / Instinct MI60.

This modified version of vLLM does two things:

1. Makes vLLM run on gfx906 GPUs.
2. Optimizes the quantization GEMV kernels for gfx906 GPUs.

NEWS
-------------------

2025-05-02: Updated vLLM to 0.8.5.

Upstream vLLM 0.8.5 has many issues on the ROCm platform that have already been fixed in its main branch; I cherry-picked those fixes. I also fixed the garbled output produced by GPTQ desc_act=True models.

2025-04-29: I have fixed the GGUF batched-request performance issue. GGUF is now usable, but still not as fast as GPTQ.

The original llama.cpp GEMV kernel contains optimizations for batched requests, but those optimizations were removed in vLLM, which falls back to the GEMM kernel instead. As a result, GGUF models in vLLM are very slow with batched requests. I have updated the GGUF GEMV kernel to support batched requests by doubling the number of thread blocks. This method differs from the original llama.cpp GEMV kernel, but it runs faster and is easier to implement.

I also added some autotune configs with higher `num_stages` to `triton_flash_attention.py`. In my tests, gfx906 GPUs benefit from this parameter. Incidentally, I noticed that `num_stages` is often set higher than 1 for NVIDIA GPUs, but not for AMD GPUs.
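To make the `num_stages` point concrete, here is a minimal, self-contained Triton sketch. It is not the actual attention kernel from `triton_flash_attention.py`; the kernel body, block sizes and warp counts are illustrative assumptions, and only the idea of exposing extra configs with `num_stages` > 1 to `@triton.autotune` mirrors the change described above.

```python
# Hypothetical example: extra autotune candidates with num_stages > 1.
# The kernel itself is a trivial elementwise scale, chosen only to keep
# the sketch short; it is NOT code from this repository.
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4, num_stages=1),
        # Additional candidates with higher num_stages, which gfx906
        # can benefit from according to the note above.
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4, num_stages=2),
        triton.Config({"BLOCK_SIZE": 2048}, num_warps=8, num_stages=2),
    ],
    key=["n_elements"],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)


def scale(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    # BLOCK_SIZE is supplied by whichever autotune config wins.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    scale_kernel[grid](x, out, n_elements)
    return out
```

On the first call for a given `n_elements`, Triton benchmarks every listed config and caches the fastest one, so the higher-`num_stages` variants are only selected when they actually win on the hardware at hand.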
2025-04-28: Updated ROCm to 6.3, torch to 2.7.0 and triton to 3.3.0.

2025-04-22: I have fixed the GPTQ Int4/Int8 GEMV kernel by changing the dot-product accumulator type from FP16 to FP32 to avoid overflow during accumulation. Thanks to the fdot2 intrinsic introduced in Vega 12/20, using FP32 accumulators remains fast and guarantees no overflow.

2025-04-21: Updated vLLM to v0.8.4.

2025-04-20: The reason Qwen2 GPTQ models output an endless stream of "!!!!!!!!!!!!!!!!!!!" has been identified. For details, see: https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4

In short: this is not an issue with the model file itself. When the input prompt is very short, the prefill computation looks more like GEMV than GEMM, so vLLM dispatches to a different CUDA kernel, which has issues with prefill workloads. There are two workarounds:

1. Prepend some junk data to the prompt to make it longer.
2. Adjust the threshold that distinguishes GEMM from GEMV.

In the original post, the author recommends changing the threshold to 0. However, this significantly hurts performance on gfx906 GPUs. Instead, I changed the threshold from 50 (for 4-bit) / 24 (for 8-bit) to 8/8. I tested several models with very short inputs, and it worked fine without any noticeable performance difference (when the number of concurrent requests is ≤ 8).

2025-04-19: I attempted to optimize AWQ by adding `@triton.autotune` to awq_triton.py. This improved performance by about 50%, but it is still very slow on gfx906 GPUs.

I also tried it on an NVIDIA Turing GPU. The original awq_triton.py is slow on Turing too, but this autotune technique improves its performance enough to match the default AWQ CUDA implementation. This is somewhat ironic: I was optimizing awq_triton.py for AMD GPUs and failed to make it usable there, yet I made it usable on NVIDIA GPUs first.

2025-04-01: Optimized the GEMV kernel for GGUF q4_1 and q8_0 quantization, achieving a 10%~20% performance improvement.

NOTES
-------------------

GPTQ is the first-choice quantization format.

GGUF quantization is functional and should work. I recommend q4_1 for the best 4-bit single-batch decode performance. K-quants (e.g. q4_K / q6_K) should work. I have not tested I-quants (e.g. IQ4 / IQ3) yet.

If you are running an unquantized bfloat16 model, add `--dtype float16` to the parameters for better performance.

INSTALL
-------------------

You MUST install triton-gfx906 v3.3.0+gfx906 first, see: https://github.com/nlzy/triton-gfx906/tree/v3.3.0+gfx906

```
cd vllm-gfx906
python3 -m venv vllmenv
source vllmenv/bin/activate
pip3 install 'torch==2.7' torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
pip3 install -r requirements/rocm-build.txt
pip3 install -r requirements/rocm.txt
pip3 install --no-build-isolation .
```

CREDITS
-------------------

https://github.com/Said-Akbar/vllm-rocm