Re: Any input for some talk about usage of Debian in HPC

To: debian-med@lists.debian.org
Subject: Re: Any input for some talk about usage of Debian in HPC
From: Tony Travis <tony.travis@minke-informatics.co.uk>
Date: Sun, 19 May 2024 15:31:02 +0100
Message-id: <5a5bd0b6-6ac8-42fd-9509-ba3d988094c6@minke-informatics.co.uk>
In-reply-to: <ZkneqPqtYOuUXbBV@an3as.eu>
References: <ZkneqPqtYOuUXbBV@an3as.eu>

On 19/05/2024 12:12, Andreas Tille wrote:

> Hi,
>
> I have an invitation to have some talk with the title
>
>     Debian GNU/Linux for Scientific Research
>
> Abstract:
>
>     Over the past decade, Enterprise Linux has dominated large-scale
>     research computing infrastructure. However, recent developments have
>     sparked increased interest in community-led alternatives. Debian
>     GNU/Linux, a long-standing choice among researchers for supporting
>     scientific work, is experiencing a renewed interest for High-Throughput
>     Computing (HTC) and High-Performance Computing (HPC) applications.  This
>     presentation will provide an overview of how Debian is being utilized to
>     support scientific research and will include a case study showcasing the
>     migration of HTC operations from Enterprise Linux 7 (EL7) to Debian.
>
> While I could talk about Debian Science and Debian Med in general it
> would be cool to reference to some real life examples where Debian is
> used in Science and what might be the reason to use Debian.


Hi, Andreas.

The Sanger Centre in the UK use Ubuntu + OpenStack + Ceph:

https://www.sanger.ac.uk/group/core-software-services/

I realise that it's not Debian, but it is based on Debian. I went there many years ago when they were running Debian on DEC Alpha AXPs, but they moved to CentOS because many other Academic HPC centres were using it, including ours when I worked at the University of Aberdeen.

This was not a good experience, and they decided to change to Ubuntu mainly because of the support provided by Canonical for OpenStack and Ceph. However, in my opinion, CentOS/RHEL is not a good platform for bioinformatics because the 'Enterprise' approach stifles innovation.

You can't ignore the host OS when you talk about HPC applications, and the HEP (High Energy Physics) community put a lot of effort into developing good node provisioning systems and job-scheduling for HPC. Consequently, there was a significant bias towards support for HEP applications running under CentOS and less support for bioinformatics.

This was partly the motivation underlying our development of Bio-Linux in order to provide biologists with an alternative platform running on their own hardware instead of struggling to get the IT department to port the software they wanted to use to CentOS. In that respect the Debian-Med project was fundamentally important in helping biologists do their work outside of the centrally managed 'Enterprise' oriented IT policy imposed on us by Universities and Research Institutes.

The Sanger Centre provide a centrally managed HPC that is 'biologist-friendly' and, I think, is an excellent model of how it should be done. However, it does not support the view that Debian should be the HPC OS because the main reason they chose Ubuntu was the commercial support for OpenStack and Ceph provided by Canonical.

> I personally would like to stress the "we package what we use" aspect
> and the "we mentor upstream to merge competence of the program with
> packaging skills" idea.  Any input would be welcome to cover more ideas.

As you might remember, I built and I advocate the use of 'departmental' or 'research-group' clusters. These are much more powerful than an individual biologist's personal laptop, but are under the administrative control of the department or research group that funded their purchase.

In the past, I've used various HPC node-provisioning, cluster filesystem and job submission systems running under one version or another of Bio-Linux, now using your "med-bio" meta-package to provide bioinformatics software instead of the discontinued Bio-Linux packages.


However, I've recently set up a 3-node 'Proxmox-VE' cluster:

https://www.proxmox.com/en/proxmox-virtual-environment/overview


[Proxmox is a GPL server management system based on Debian]

I'm using the Proxmox cluster for a bioinformatics-in-schools project with the University of Edinburgh:

https://4273pi.org/


I'm also planning to use it for a new project with the IAEA in Vienna.

I think that giving biologists the choice of running the software they want under the OS they choose is very important when innovation is the priority of an organisation rather than centralisation of IT systems to reduce cost. You can, of course, use Proxmox-VE as the node-provisioning and shared filesystem of an HPC cluster. Or, simply provide biologists with VMs running their OS of choice, administered by themselves, e.g. a Bio-Linux VM or vanilla Debian etc.


Finally, don't forget about Amdahl's Law:

https://en.wikipedia.org/wiki/Amdahl%27s_law

There is really no such thing as an HPC or HTC 'application', because it's the underlying resource management system of an HPC cluster that provides the 'HP'. In my experience, most bioinformatics applications are 'embarrassingly' parallel and in this case processes do not communicate with each other. The 'HP' is achieved by managing the workflow efficiently using e.g. "Slurm" or "[Sun]Grid-Engine".
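
As a rough illustration of what 'embarrassingly' parallel means here, the following Python sketch runs one independent task per input sequence. The function names, the GC-content calculation and the thread pool are all assumptions made for the example: on a real cluster each task would be a separate job handed to the scheduler (e.g. a Slurm job array), and the pool below only stands in for that scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in one sequence (stand-in for a real analysis)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def run_independent_tasks(sequences, max_workers=4):
    """Run one task per sequence; no task reads or waits for another's result."""
    # Hypothetical stand-in for a scheduler: the pool only hands out work.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, whatever order tasks finish in.
        return list(pool.map(gc_content, sequences))


print(run_independent_tasks(["ACGT", "GGCC", "ATAT"]))  # [0.5, 1.0, 0.0]
```

Because the tasks never exchange data, throughput depends almost entirely on how well the work is distributed, not on the application code itself.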

MPI-parallel processes are subject to diminishing returns as the number of processes increases, as described by Amdahl's Law. Other 'Map-Reduce' workflows are also 'embarrassingly' parallel. GPGPU workflows, while parallel on a given device, are not HPC in the conventional sense, despite nVidia's claims to the contrary: they are GPGPU-accelerated applications that are, typically, executed as embarrassingly parallel workflows across clusters of servers with GPGPUs. All 'SLI' does is combine GPGPUs locally on a server which, again, is not conventional HPC.
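
The diminishing returns can be made concrete with a small Python sketch of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelised and n is the number of processes. The 95% parallel fraction used below is purely an assumed figure for illustration, not a measurement of any real application.

```python
def amdahl_speedup(parallel_fraction: float, n_processes: int) -> float:
    """Theoretical speedup from Amdahl's Law for a fixed problem size."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processes)


# Assumed example: 95% of the run time parallelises perfectly.
for n in (1, 8, 64, 512):
    print(f"{n:4d} processes: {amdahl_speedup(0.95, n):5.2f}x")
```

With p = 0.95 the speedup can never exceed 1 / (1 - p) = 20, however many processes are added, because the serial 5% always has to run on its own.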

All of these technologies have fantastic potential, but I've been very frustrated and disappointed many times by how difficult it is to utilise GPGPU for bioinformatics. It's only recently that e.g. Oxford Nanopore Technologies and PacBio have made effective use of GPGPU for base-calling long reads. The basic problem is that existing 'classic' bioinformatics analyses require re-writing to execute in parallel and it is only a few new bioinformatics algorithms that are worthwhile designing for GPGPU.


Bye,

  Tony.

--
Minke Informatics Limited, Registered in Scotland - Company No. SC419028
Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
tel. +44(0)19755 63548                    http://minke-informatics.co.uk
mob. +44(0)7985 078324        mailto:tony.travis@minke-informatics.co.uk
