
Re: Want some advice !



Hi Kohei,

Thanks for the updates! See my comments below:

On Fri, 2025-08-08 at 21:32 +0900, 千代航平 wrote:
> Hi, I'm tackling the GPU version of vLLM and have run into some difficult parts.
> 
> I read and tried to understand the vLLM build system (https://github.com/vllm-project/vllm); it uses two external projects (FlashMLA and vllm-flash-attn).
> These two dependencies are cloned from the internet during the build, in this part:
> 
> cmake/external_projects/vllm_flash_attn.cmake
>  
> ```cmake
> FetchContent_Declare(
>         vllm-flash-attn
>         GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
>         GIT_TAG dc9d410b3e2d6534a4c70724c2515f4def670a22
>         GIT_PROGRESS TRUE
>         # Don't share the vllm-flash-attn build between build types
>         BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
> )
> ```

Generally we have to patch out all FetchContent downloads in the CMake build to make
sure nothing is downloaded from the internet during the build. This is required by
Debian policy.
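For what it's worth, CMake also has cache switches that cut off all FetchContent downloads without patching each declaration individually; something like the following would be a starting point (the local path below is hypothetical, and vLLM's build may still need per-project patches on top of it):

```cmake
# Refuse every FetchContent download attempt at configure time.
set(FETCHCONTENT_FULLY_DISCONNECTED ON CACHE BOOL "" FORCE)

# Point one specific dependency at a local, pre-extracted source tree.
# The variable suffix is the FetchContent_Declare() name, uppercased.
set(FETCHCONTENT_SOURCE_DIR_VLLM-FLASH-ATTN
    "${CMAKE_SOURCE_DIR}/debian/vendor/vllm-flash-attn" CACHE PATH "" FORCE)
```

The same variables can also be passed on the dh_auto_configure command line with -D instead of a patch.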

PyTorch needed a lot of patching work to strip these downloads and third-party
directories. Its solution is messy:
https://salsa.debian.org/deeplearning-team/pytorch/-/blob/master/debian/patches/1020-dirtyhack.patch?ref_type=heads

For a cleanly organized example, please see onnxruntime, e.g.,
https://salsa.debian.org/deeplearning-team/onnxruntime/-/blob/master/debian/patches/system-benchmark.patch?ref_type=heads


> First, for FlashMLA the cmake comment says "Only build FlashMLA kernels if we are building for something compatible with", so we can skip the first one.

This sentence is vague. Is FlashMLA used by vLLM by default (especially in their
official binary release)? And what does "something compatible with" mean?

> Second, vllm-flash-attn is a bit weird: https://github.com/vllm-project/flash-attention.git
> I successfully built the original version of flash-attn (https://github.com/Dao-AILab/flash-attention); however, vllm-flash-attn is a forked version and includes some files specific to vLLM (such as the vllm_flash_attn
> directory).

Generally, when cmake downloads something like vllm-flash-attn, we should separately
package it and patch cmake to use the system copy. However, since vllm-flash-attn is a
fork, it is very likely that only vllm will ever use it. The simplest reasonable way to
handle vllm-flash-attn is an embedded source copy, like this:

https://salsa.debian.org/deeplearning-team/pytorch/-/tree/master/debian/pocketfft-9d3ab05a7fffbc71a492bc6a17be034e83e8f0fe?ref_type=heads

After importing this embedded copy, make sure to update d/copyright for it, and
patch cmake to redirect the download to this embedded source code copy.
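As a sketch of that redirect (the path is hypothetical; adjust it to wherever the embedded copy lands in the source package), the FetchContent_Declare in cmake/external_projects/vllm_flash_attn.cmake could be patched to drop the GIT_* options and use the local tree directly:

```cmake
FetchContent_Declare(
        vllm-flash-attn
        # Use the embedded source copy shipped in the Debian source package
        # instead of cloning from GitHub at build time.
        SOURCE_DIR ${CMAKE_SOURCE_DIR}/debian/vendor/vllm-flash-attn
        # Don't share the vllm-flash-attn build between build types
        BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
)
```

When SOURCE_DIR points at an existing directory and no download method is given, FetchContent uses that directory as-is and downloads nothing.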

> In this case, how can I treat these files? Do I need to package flash-attn and vllm-flash-attn separately?

vllm-flash-attn can be embedded. The original version of flash-attn is not used by
vllm, so we don't have to touch it.
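For the d/copyright update, the embedded copy would get its own Files stanza along these lines (the path, copyright holders, and license below are placeholders; verify them against the actual vllm-flash-attn sources before uploading):

```
Files: debian/vendor/vllm-flash-attn/*
Copyright: <upstream authors of vllm-project/flash-attention>
License: <verify against the fork's LICENSE file>
```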

