# vllm-gfx906

**Repository Path**: bjxamo_admin/vllm-gfx906

## Basic Information

- **Project Name**: vllm-gfx906
- **Description**: vLLM for AMD gfx906 GPUs, e.g. Radeon VII / MI50 / MI60
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: gfx906/main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-18
- **Last Updated**: 2025-05-18

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

vLLM for gfx906
===================

This is a modified version of vLLM that works with (and only works with) AMD gfx906 GPUs such as the Radeon VII / Radeon Pro VII / Instinct MI50 / Instinct MI60.

This modified version of vLLM does two things:

1. Makes vLLM run on gfx906 GPUs.
2. Optimizes the quantization GEMV kernels for gfx906 GPUs.

NEWS
-------------------

2025-05-02: Updated vLLM to 0.8.5.

Upstream vLLM 0.8.5 has many issues on the ROCm platform that have already been fixed in its main branch; I cherry-picked those fixes. I also fixed the garbled output produced by GPTQ desc_act=True models.

2025-04-29: I have fixed the GGUF batched-request performance issue. GGUF is now usable, but still not as fast as GPTQ.

The original llama.cpp GEMV kernel contains optimizations for batched requests, but those optimizations were removed in vLLM, which falls back to the GEMM kernel instead. As a result, GGUF models in vLLM are very slow with batched requests. I have updated the GGUF GEMV kernel to support batched requests by doubling the number of thread blocks. This method differs from the original llama.cpp GEMV kernel, but it runs faster and is easier to implement.

I also added some autotune configs with higher `num_stages` to `triton_flash_attention.py`. In my tests, gfx906 GPUs benefit from this parameter. Incidentally, I noticed that `num_stages` is often set higher than 1 for NVIDIA GPUs, but not for AMD GPUs.
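To make the `num_stages` point concrete, here is a minimal, self-contained Triton sketch. It is not the actual attention kernel from `triton_flash_attention.py`; the kernel body, block sizes and warp counts are illustrative assumptions, and only the idea of exposing extra configs with `num_stages` > 1 to `@triton.autotune` mirrors the change described above.

```python
# Hypothetical example: extra autotune candidates with num_stages > 1.
# The kernel itself is a trivial elementwise scale, chosen only to keep
# the sketch short; it is NOT code from this repository.
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4, num_stages=1),
        # Additional candidates with higher num_stages, which gfx906
        # can benefit from according to the note above.
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4, num_stages=2),
        triton.Config({"BLOCK_SIZE": 2048}, num_warps=8, num_stages=2),
    ],
    key=["n_elements"],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)


def scale(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    # BLOCK_SIZE is supplied by whichever autotune config wins.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    scale_kernel[grid](x, out, n_elements)
    return out
```

On the first call for a given `n_elements`, Triton benchmarks every listed config and caches the fastest one, so the higher-`num_stages` variants are only selected when they actually win on the hardware at hand.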
2025-04-28: Updated ROCm to 6.3, torch to 2.7.0 and triton to 3.3.0.

2025-04-22: I have fixed the GPTQ Int4/Int8 GEMV kernel by changing the dot-product accumulator type from FP16 to FP32 to avoid overflow during accumulation. Thanks to the fdot2 intrinsic introduced in Vega 12/20, using FP32 accumulators remains fast and guarantees no overflow.

2025-04-21: Updated vLLM to v0.8.4.

2025-04-20: The reason Qwen2 GPTQ models output an endless stream of "!!!!!!!!!!!!!!!!!!!" has been identified. For details, see: https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4

In short: this is not an issue with the model file itself. When the input prompt is very short, the prefill computation looks more like GEMV than GEMM, so vLLM dispatches to a different CUDA kernel, which has issues with prefill workloads. There are two workarounds:

1. Prepend some junk data to the prompt to make it longer.
2. Adjust the threshold that distinguishes GEMM from GEMV.

In the original post, the author recommends changing the threshold to 0. However, this significantly hurts performance on gfx906 GPUs. Instead, I changed the threshold from 50 (for 4-bit) / 24 (for 8-bit) to 8/8. I tested several models with very short inputs, and it worked fine without any noticeable performance difference (when the number of concurrent requests is ≤ 8).

2025-04-19: I attempted to optimize AWQ by adding `@triton.autotune` to awq_triton.py. This improved performance by about 50%, but it is still very slow on gfx906 GPUs.

I also tried it on an NVIDIA Turing GPU. The original awq_triton.py is slow on Turing too, but this autotune technique improves its performance enough to match the default AWQ CUDA implementation. This is somewhat ironic: I was optimizing awq_triton.py for AMD GPUs and failed to make it usable there, yet I made it usable on NVIDIA GPUs first.

2025-04-01: Optimized the GEMV kernel for GGUF q4_1 and q8_0 quantization, achieving a 10%~20% performance improvement.

NOTES
-------------------

GPTQ is the first-choice quantization format.

GGUF quantization is functional and should work. I recommend q4_1 for the best 4-bit single-batch decode performance. K-quants (e.g. q4_K / q6_K) should work. I have not tested I-quants (e.g. IQ4 / IQ3) yet.

If you are running an unquantized bfloat16 model, add `--dtype float16` to the parameters for better performance.

INSTALL
-------------------

You MUST install triton-gfx906 v3.3.0+gfx906 first, see: https://github.com/nlzy/triton-gfx906/tree/v3.3.0+gfx906

```
cd vllm-gfx906
python3 -m venv vllmenv
source vllmenv/bin/activate
pip3 install 'torch==2.7' torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
pip3 install -r requirements/rocm-build.txt
pip3 install -r requirements/rocm.txt
pip3 install --no-build-isolation .
```

CREDITS
-------------------

https://github.com/Said-Akbar/vllm-rocm