
Re: Want some advice!



Thanks for the reply.

> vllm-flash-attn can be embedded. The original version of flash-attn is not used by
> vllm, so we don't have to touch it.

Makes sense, I will follow this approach.

> I think it means “when building for Hopper GPUs”? Then we might skip it for now.

Yes, it is used for the sm90 architecture, so it is an optimization for Hopper GPUs.
 
Regards

On Sat, Aug 9, 2025 at 16:29 Shengqi Chen <harry@debian.org> wrote:
Hi Kohei,

It’s great to see your progress!
I have some additional comments on top of Mo's:

> On Aug 9, 2025, at 07:43, M. Zhou <cdluminate@riseup.net> wrote:
>
> Hi Kohei,
>
> Thanks for the updates! See my comments below:
>
> On Fri, 2025-08-08 at 21:32 +0900, 千代航平 wrote:
>> Hi, I'm tackling the GPU version of vllm and have run into some difficult parts.
>>
>> I read and tried to understand the vllm build system (https://github.com/vllm-project/vllm); it uses two external projects (FlashMLA and vllm-flash-attn).
>> These two dependencies are cloned from the internet during the build, in this part:
>>
>> cmake/external_projects/vllm_flash_attn.cmake
>>
>> ```
>> FetchContent_Declare(
>>          vllm-flash-attn
>>          GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
>>          GIT_TAG dc9d410b3e2d6534a4c70724c2515f4def670a22
>>          GIT_PROGRESS TRUE
>>          # Don't share the vllm-flash-attn build between build types
>>          BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
>>  )
>> ```
>
> Generally we have to patch out all the "FetchContent" stuff in cmake to make sure
> nothing is downloaded from the internet. This is required by Debian policy.
>
> PyTorch has done a lot of patching work to strip these downloads and third-party directories.
> Its solution is messy:
> https://salsa.debian.org/deeplearning-team/pytorch/-/blob/master/debian/patches/1020-dirtyhack.patch?ref_type=heads
>
> For a cleanly organized example, please see onnxruntime, e.g.,
> https://salsa.debian.org/deeplearning-team/onnxruntime/-/blob/master/debian/patches/system-benchmark.patch?ref_type=heads

Just a comment — patching out FetchContent_MakeAvailable is indeed sufficient.
Leaving FetchContent_Declare is typically harmless.
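
For illustration, a patch in that style could end up looking roughly like this
(untested sketch; VLLM_FLASH_ATTN_SRC_DIR and the paths are placeholders I made
up for wherever the local copy lives):

```
# Sketch for cmake/external_projects/vllm_flash_attn.cmake: keep the
# FetchContent_Declare() untouched and only replace the call that actually
# triggers the download.
if(VLLM_FLASH_ATTN_SRC_DIR)
  # Build from a local source tree supplied by the packager.
  add_subdirectory(${VLLM_FLASH_ATTN_SRC_DIR} ${CMAKE_BINARY_DIR}/vllm-flash-attn)
else()
  FetchContent_MakeAvailable(vllm-flash-attn)
endif()
```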

Also, Debian's debhelper now passes -DFETCHCONTENT_FULLY_DISCONNECTED=ON to cmake in
dh_auto_configure [1] (if you are using the correct dh helper). So leaving them
as-is should also be safe now; nonetheless you can always patch that (which
might lead to some rebasing work when importing new upstream releases).

[1]: https://salsa.debian.org/debian/debhelper/-/blob/main/lib/Debian/Debhelper/Buildsystem/cmake.pm?ref_type=heads
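
If you go that route, the main thing left for the packaging is to point the
declared content at the local copy. CMake's documented override for this is the
FETCHCONTENT_SOURCE_DIR_<ucName> cache variable; a minimal sketch (the
debian/vendor path is just a placeholder for wherever the embedded copy gets
imported):

```
# With FETCHCONTENT_FULLY_DISCONNECTED=ON nothing is downloaded, so tell
# FetchContent where the already-populated vllm-flash-attn sources live:
set(FETCHCONTENT_SOURCE_DIR_VLLM-FLASH-ATTN
    "${CMAKE_SOURCE_DIR}/debian/vendor/vllm-flash-attn"
    CACHE PATH "Local copy of vllm-flash-attn used instead of the git download")
# FetchContent_MakeAvailable(vllm-flash-attn) then simply runs
# add_subdirectory() on that directory.
```

The same variable can also be passed on the cmake command line at configure
time instead of being set in a patch.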

>
>> First, for FlashMLA the cmake code says "Only build FlashMLA kernels if we are building for something compatible with", so we can skip this first one.
>
> This sentence is vague. By default, is the flash mla used for vLLM? (especially for their
> official binary release) What does "something compatible with" mean?

I think it means “when building for Hopper GPUs”? Then we might skip it for now.
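
In other words, the FlashMLA part should only be reached when an sm90 target is
requested, conceptually something like the following (illustrative only, not
the exact upstream code; the CUDA_ARCHS variable and the "9.0a" arch string are
assumptions on my side):

```
# Only wire up the FlashMLA kernels when a Hopper (sm90) architecture was
# requested; otherwise this whole section is skipped.
if("9.0a" IN_LIST CUDA_ARCHS)
  message(STATUS "Building FlashMLA kernels for Hopper (sm90)")
  # ... FetchContent / kernel targets for FlashMLA would go here ...
else()
  message(STATUS "Skipping FlashMLA kernels (no sm90 target)")
endif()
```

Whether the download itself sits behind the same guard is worth double-checking
in the upstream cmake.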

>> Second, vllm-flash-attn is a bit weird: https://github.com/vllm-project/flash-attention.git
>> I successfully built the original version of flash-attn (https://github.com/Dao-AILab/flash-attention); however, vllm-flash-attn is a forked version and includes some files specific to vllm (such as the vllm_flash_attn
>> directory).
>
> Generally, when cmake downloads vllm-flash-attn, we should separately package it and patch
> cmake to use the system copy. However, vllm-flash-attn is a fork, and it is very
> likely that only vllm is going to use this fork. The simplest reasonable way to handle
> vllm-flash-attn is an embedded source copy like this:
>
> https://salsa.debian.org/deeplearning-team/pytorch/-/tree/master/debian/pocketfft-9d3ab05a7fffbc71a492bc6a17be034e83e8f0fe?ref_type=heads
>
> After importing this embedded copy, make sure to update the d/copyright for it, and
> patch cmake to redirect the download to this embedded source code copy.
>
>> In this case, how should I treat these files? Do I need to package flash-attn and vllm-flash-attn separately?
>
> vllm-flash-attn can be embedded. The original version of flash-attn is not used by
> vllm, so we don't have to touch it.
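
To make the "redirect the download" part concrete: once the embedded copy is
imported (say under debian/vendor/, just a placeholder path), the declaration
in cmake/external_projects/vllm_flash_attn.cmake can be patched to point at it
instead of the git repository, assuming FetchContent_Declare accepts a
pre-populated SOURCE_DIR the same way ExternalProject does:

```
# Sketch of a patched declaration: no GIT_REPOSITORY/GIT_TAG, so nothing is
# fetched from the network; the embedded source tree is used directly.
FetchContent_Declare(
        vllm-flash-attn
        SOURCE_DIR ${CMAKE_SOURCE_DIR}/debian/vendor/vllm-flash-attn
        # Don't share the vllm-flash-attn build between build types
        BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
)
```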

Thanks,
Shengqi Chen
