Re: pytorch and CUDA

To: Andrius Merkys <merkys@debian.org>, debian-ai@lists.debian.org
Subject: Re: pytorch and CUDA
From: "M. Zhou" <lumin@debian.org>
Date: Fri, 24 Feb 2023 11:38:12 -0500
Message-id: <759f1c56ab9d9a9cfbf5a08d8d1224be94c79649.camel@riseup.net>
In-reply-to: <212db2bc-f992-6c02-b03f-7343d9d5d6d4@debian.org>
References: <5c98d10d-1587-a99c-f16d-e16aca612a14@debian.org> <e40ebe5d421c9026dd53324e3375a07421a5d0a2.camel@debian.org> <4ca35fe98d94b5d4cae13125719dd572222b0a28.camel@riseup.net> <3e9f9aa7-96f0-8169-f101-ac2e559b87d2@debian.org> <63f01c35495a193d01b56535c96e23abe04b000c.camel@debian.org> <212db2bc-f992-6c02-b03f-7343d9d5d6d4@debian.org>

On Fri, 2023-02-24 at 16:02 +0200, Andrius Merkys wrote:
> Hi,
> 
> On 2023-02-20 16:08, M. Zhou wrote:
> > That branch uses the same source as src:pytorch.
> > I really dislike duplicating the same source multiple times.
> 
> OK, but I probably should use something other than gbp, as gbp complains:
> 
> $ gbp buildpackage --git-ignore-branch
> gbp:info: Creating 
> /home/andrius/debian-packages/pytorch_1.13.1+dfsg.orig.tar.gz
> gbp:error: Cannot find pristine tar commit for archive 
> 'pytorch_1.13.1+dfsg.orig.tar.gz'

That's because Aron forgot to push the +dfsg pristine-tar. I've imported
that pristine tar from the archive and pushed it to the git repo.
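
To pick up the new pristine-tar data in an existing checkout, something
along these lines should work. This is only a sketch: the remote name
"origin" and the helper name refresh_pristine_tar are assumptions, not
something taken from the repository.

```shell
# Hypothetical helper; "origin" as the remote name is an assumption.
refresh_pristine_tar() {
    # Update the local pristine-tar branch from the remote.
    git fetch origin pristine-tar:pristine-tar || return 1
    # Check that the +dfsg tarball is now known to pristine-tar.
    pristine-tar list | grep -F 'pytorch_1.13.1+dfsg.orig.tar.gz'
}
```

Run refresh_pristine_tar inside the packaging checkout; if it prints the
tarball name, retrying the same gbp buildpackage command should get past
the "Cannot find pristine tar commit" error.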

To build the cuda variant locally, you will also need to rebuild the
following packages on your own:

~/sbuild-arch ppc64el \
        --extra-package=../../nvidia-cudnn.pkg/ \
        --extra-package=../../nvidia-nccl.pkg/ \
        --extra-package=../../tensorpipe.pkg/ \
        --extra-package=../../nvidia-cutlass.pkg/

All nvidia-* packages can be found under the nvidia-team.
tensorpipe needs to be rebuilt from its `cuda`
branch to enable CUDA support.

src:gloo also needs to be rebuilt against cuda for cuda support,
but I chose to skip it by exporting USE_GLOO=OFF in d/rules
to reduce my workload.

Then everything is ready. I have gone through this path on ppc64el,
and it ends with a linker overflow error, possibly
due to the CUDA fat binaries. Maybe I should get rid of some old
CUDA compute capabilities like 3.X-5.X.
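
A minimal, untested sketch of that idea, assuming the build honours
upstream's TORCH_CUDA_ARCH_LIST environment variable (a semicolon-separated
list of compute capabilities). The helper name filter_arch_list and the
capability list are made up for illustration; they are not what the
package currently sets.

```shell
# Hypothetical helper: keep only compute capabilities whose major
# version is at least the given minimum.
filter_arch_list() {
    min_major="$1"; shift
    out=""
    for cap in "$@"; do
        major="${cap%%.*}"          # "7.5" -> "7"
        if [ "$major" -ge "$min_major" ]; then
            out="${out:+$out;}$cap" # join with ';'
        fi
    done
    printf '%s\n' "$out"
}

# Illustrative input list only; drops the 3.X and 5.X entries.
TORCH_CUDA_ARCH_LIST="$(filter_arch_list 6 3.5 5.0 6.0 7.0 8.0)"
export TORCH_CUDA_ARCH_LIST
echo "$TORCH_CUDA_ARCH_LIST"
```

Fewer target architectures means smaller fat binaries, which may or may
not be enough to avoid the overflow.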

My ppc64el builder has got 8 cores and 16GB of RAM (+16GB swap).
The cpu version of pytorch takes about 1 hour to build. The cuda version
takes roughly 6 hours to build.

I have no amd64 device within my easy reach that is capable
of building this brutal thing -- amd64 is untested.
