[replying on the debian-med list with permission. Please keep Martin and Milot CC'd as they do not subscribe]
I am a developer on the MMseqs2 team. I saw your tweet about the AWS ARM64 machines earlier and checked on Debian Salsa whether it would be a lot of work to enable ARM64 support with the next release, as we worked on that recently.
Hey Milot, thanks for your email!
I saw that Debian's MMseqs2 now uses SIMDe to abstract away different architectures. While this is a very cool technical achievement, I am very uncomfortable with it unless it is properly integrated into and monitored by our CI regression testing.
During ARM64 development I found that there are a lot of subtle issues that can result in differing sensitivity between architectures (e.g. ARM64's default unsigned char type causes issues, there are many crashes on 32-bit ARM). I am also worried that our two most important platforms (SSE4.1 and AVX2) might suffer from performance regressions.
Interesting! On Debian we have to provide binaries that respect the architecture baseline. That means no SSE- or SSE2-only binaries on i386/i686 and no SSE3+-only binaries on AMD64. That's why we compile mmseqs2 multiple times, so there is a version that doesn't violate the baseline, along with versions that should match the highest level of SIMD support available on the user's CPU.
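To illustrate the idea, here is a minimal sketch of how a wrapper could pick among such builds at run time. It is not the actual Debian packaging script: the function names, the per-level binary names and the /usr/lib/mmseqs2 path are all made up for illustration, and it assumes Linux's /proc/cpuinfo on x86 (where SSE3 is reported as "pni").

```shell
#!/bin/sh
# Hypothetical sketch only: names and paths below are illustrative assumptions.

# Print the highest SIMD level found in a space-separated CPU flag list,
# or "baseline" if none of the known levels is present.
pick_simd_level() {
    flags=" $1 "
    # Ordered from most to least capable; Linux reports SSE3 as "pni".
    for level in avx2 sse4_1 ssse3 pni; do
        case "$flags" in
            *" $level "*) echo "$level"; return 0 ;;
        esac
    done
    echo baseline
}

# Print the flag list of the first CPU entry (empty if /proc/cpuinfo is missing).
cpu_flags() {
    sed -n 's/^flags[[:space:]]*: //p' /proc/cpuinfo 2>/dev/null | head -n 1
}

# A wrapper would then end with something like (hypothetical install layout):
#   exec "/usr/lib/mmseqs2/mmseqs-$(pick_simd_level "$(cpu_flags)")" "$@"
```

Taking the flag list as an argument keeps the selection logic separate from the /proc/cpuinfo parsing, so it can be checked with hand-written flag strings on any machine.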
We will have ARM64 and hopefully also PPC64LE support in the next release. I would suggest either waiting and using our upstream code, or submitting a PR with your changes to us so we can see how to integrate everything correctly.
Sure, happy to send the patches! I meant to, but hadn't gotten around to it yet.
Also, I would be very glad if you could integrate the full regression suite to check that all architectures produce consistent results. You can run the regression suite by calling the following from the repository:
git submodule update --init
./util/regression/run_regression.sh ./path-to-mmseqs-binary scratch-directory
Oh yeah, would love to! Except we need all the upstream sources in a single tarball, which git submodules + GitHub releases make difficult. So if you could add a pure source tarball (with all git submodules) to https://github.com/soedinglab/MMseqs2/releases
that would be appreciated!
We had refactored this test suite to make it as easy as possible to use for Shayan, who had initially proposed to package MMseqs2 for Debian. The test subfolder is badly named and contains scratch scripts for feature development. They don't do anything useful for testing, such as finding performance or sensitivity drops.
Thanks for your work and best regards,
Thank you for sharing your work under an F/OSS license and for your contributions to Open Science!