Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Bug#956418: src:glibc: Please provide optimized builds for ARMv8.1
From: Noah Meyerhans <nmeyerha@amzn.com>
Date: Fri, 10 Apr 2020 13:16:31 -0700
Message-id: <20200410201631.GA23377@amazon.com>
Reply-to: Noah Meyerhans <nmeyerha@amzn.com>, 956418@bugs.debian.org

Package: src:glibc
Version: 2.30-4
Severity: wishlist
X-Debbugs-CC: debian-arm@lists.debian.org

The ARMv8.1 spec, as implemented by the ARM Neoverse N1 processor,
introduces a set of atomic instructions [1] that result in significant
performance improvements for multithreaded applications.  Sample code
demonstrating the
performance improvements is attached.  When run on a 16-core Neoverse N1
host with glibc 2.30-4, runtimes vary significantly, ranging from lows
around 250ms to highs around 15 seconds.  When linked against glibc rebuilt
with support for these instructions, runtimes are consistently <50ms.
Significant performance impact has also been observed in less contrived
cases (MariaDB and Postgres), but I don't have a repro to share.
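
As a rough illustration of the kind of operation affected (this is not part
of the attached test, and the names counter and bump are hypothetical), take
an atomic read-modify-write written with C11 <stdatomic.h>.  Built for
ARMv8.0, GCC typically turns this into a load-exclusive/store-exclusive
retry loop; built with the ARMv8.1 atomics, it can become a single
instruction such as LDADD:

```c
#include <stdatomic.h>

/* Hypothetical shared counter, for illustration only. */
static atomic_int counter;

/* Atomically add one to the counter and return the previous value. */
static int bump(void) {
  return atomic_fetch_add(&counter, 1);
}
```

Comparing the output of "gcc -O2 -S" with and without -march=armv8.1-a
should show the difference.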

GCC provides two ways to enable support for these instructions at build
time.  The simplest, and least disruptive, is to enable -moutline-atomics
globally in the arm64 glibc build.  As described at [2], this option enables
runtime checks for the availability of the atomic instructions.  If found,
they are used; otherwise ARMv8.0-compatible code is used.  The drawback of
this option is that the check happens at runtime, thus introducing some
overhead on all arm64 installations.
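
The runtime check can be pictured roughly as follows.  This is only a sketch
of the idea, not the code GCC emits.  It assumes Linux, where the kernel
advertises the atomics through the HWCAP_ATOMICS bit of AT_HWCAP; the
function name have_lse_atomics is hypothetical:

```c
#include <sys/auxv.h>

/* Bit 8 of AT_HWCAP on Linux/arm64; normally provided by the system
 * headers there.  Defined here only so the sketch builds elsewhere. */
#ifndef HWCAP_ATOMICS
# define HWCAP_ATOMICS (1UL << 8)
#endif

/* Return 1 if the CPU and kernel advertise the ARMv8.1 atomic
 * instructions, 0 otherwise (always 0 on non-arm64 builds). */
static int have_lse_atomics(void) {
#ifdef __aarch64__
  return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
  return 0;
#endif
}
```

A check of this kind only has to be evaluated once, but every atomic
operation still pays for a call and a branch on the cached result.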

The second option is to provide libraries built with explicit support for
the ARMv8.1-A spec via the -march=armv8.1-a flag.  This option is also
described at [2].  This build would be incompatible with earlier versions of
the spec, so it would need to be provided in a location where the dynamic
linker will automatically discover it if it is usable (e.g.
/lib/aarch64-linux-gnu/atomics/).  This does not incur any runtime overhead,
but obviously involves an additional libc build, and the corresponding
complexity and disk space utilization.  I'm not sure if this is an option
that the glibc maintainers are interested in pursuing.
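
Whether a given object was built with the atomics enabled unconditionally
can be checked at compile time.  A small sketch, assuming the compiler
defines the ACLE macro __ARM_FEATURE_ATOMICS under -march=armv8.1-a (recent
GCC releases do, but treat that as an assumption for older compilers); the
function name atomics_build_mode is hypothetical:

```c
/* Describe how atomics were selected when this file was compiled. */
static const char *atomics_build_mode(void) {
#if defined(__ARM_FEATURE_ATOMICS)
  /* Built with e.g. -march=armv8.1-a: atomics are emitted inline and
   * the result will not run on ARMv8.0 CPUs. */
  return "inline ARMv8.1 atomics";
#elif defined(__aarch64__)
  /* Baseline arm64 build, with or without -moutline-atomics. */
  return "ARMv8.0-compatible";
#else
  return "not arm64";
#endif
}
```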

I've tested both options and found them to be acceptable on ARMv8.1
(Neoverse N1) and ARMv8.0 (Cortex-A72) CPUs.  I can provide bulk test run
data for the various configuration permutations if you'd like to see more.

I can provide patches or merge requests implementing either option, at least
as a starting point, if you'd like to see them.

Thanks!
noah

1. https://static.docs.arm.com/ddi0557/a/DDI0557A_b_armv8_1_supplement.pdf
   Section B1
2. https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html

/*
 * Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"). You may
 * not use this file except in compliance with the License. A copy of the
 * License is located at
 *
 *      http://aws.amazon.com/apache2.0/
 *
 * or in the "license" file accompanying this file. This file is distributed
 * on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
 * express or implied. See the License for the specific language governing
 * permissions and limitations under the License.
*/

/* Build with:
 * gcc -O2 -o a.out a.c -lpthread -DITER=1000 -DTHREADS=64
*/

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>

#ifndef ITER
# define ITER 1000
#endif
#ifndef THREADS
# define THREADS 3
#endif

#if THREADS < 1
# error "THREADS is supposed to be at least 1"
#endif

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_ptr = 0;

typedef struct stats_s {
  uint64_t min, max;
  int times;
  uint64_t total;
  uint64_t flips;
} stats_t;

stats_t stats[THREADS + 1];
pthread_t threads[THREADS];

#ifdef __aarch64__
static uint64_t cpu_shift() {
  uint64_t shift = 0;
  __asm__ __volatile__ ("mrs %0,cntfrq_el0; clz %w0, %w0":"=&r"(shift));
  return shift;
}
#endif

static uint64_t gettime() {
#ifdef __aarch64__
  uint64_t ret = 0;
  __asm__ __volatile__ ("isb; mrs %0,cntvct_el0":"=r"(ret));
  return ret << cpu_shift();

#elif defined __x86_64__
  uint64_t a, d;
  __asm__ __volatile__ ("rdtsc" : "=a" (a), "=d" (d));
  return ((uint64_t)a + ((uint64_t)d << 32));
#endif

  return 0;
}

static void init_stats() {
  int i;
  for (i = 0; i <= THREADS; i++) {
    stats_t *s = &stats[i];
    s->min = UINT64_MAX;  /* so that the first sample always becomes the minimum */
    s->max = 0;
    s->times = 0;
    s->total = 0;
    s->flips = 0;
  }
}

static void print_stat(int i) {
  stats_t *s = &stats[i];
  float average = (float) s->total / s->times;
  /* uint64_t fields are printed with PRIu64 from <inttypes.h>. */
  if (i == THREADS)
    fprintf(stdout, "server: min=%" PRIu64 ", max=%" PRIu64 ", average=%f, mutexes_locked=%d, flips=%" PRIu64 "\n", s->min, s->max, average, s->times, s->flips);
  else
    fprintf(stdout, "thread %d: min=%" PRIu64 ", max=%" PRIu64 ", average=%f, mutexes_locked=%d, flips=%" PRIu64 "\n", i, s->min, s->max, average, s->times, s->flips);
}

static void print_stats() {
  int i;
  for (i = 0; i <= THREADS; i++)
    print_stat(i);
}

static void update_stats(stats_t *s, uint64_t time) {
  ++s->times;
  if (time < s->min)
    s->min = time;
  if (time > s->max)
    s->max = time;
  s->total += time;
}

static void fun(int check, int set, stats_t *stat) {
  int loop = 1;
  while (loop) {
    uint64_t start = gettime();
    pthread_mutex_lock (&lock);
    if (shared_ptr == check) {
      loop = 0;
      ++stat->flips;
      shared_ptr = set;
    }
    pthread_mutex_unlock (&lock);
    update_stats(stat, gettime() - start);
  }
}

static void *tf (void *arg)
{
  /* Each worker receives a pointer to its own stats slot, so it does not
   * have to look itself up in threads[], which pthread_create may not
   * have filled in yet when the new thread starts running. */
  stats_t *stat = arg;

  /* Run until canceled. */
  while(1)
    fun(1, 0, stat);
  return NULL;
}

int main (int argc, char **argv) {
  int i;

  /* Initialize the stats before any worker thread can update them. */
  init_stats();

  for (i = 0; i < THREADS; i++) {
    if (pthread_create (&threads[i], NULL, tf, &stats[i]) != 0)
      {
        puts ("pthread_create failed");
        exit (1);
      }
  }

  for (i = 0; i < ITER; i++)
    fun(0, 1, &stats[THREADS]);

  for (i = 0; i < THREADS; i++) {
    if (pthread_cancel (threads[i]) != 0)
      {
        puts ("pthread_cancel failed");
        exit (1);
      }
  }

  print_stats();
  return 0;
}
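
On architectures other than arm64 and x86_64, gettime() above returns 0, so
the reported timings are meaningless there.  A possible fallback, sketched
under the assumption that POSIX clock_gettime() is available (the name
gettime_portable is hypothetical, and the result is in nanoseconds rather
than counter ticks):

```c
#ifndef _POSIX_C_SOURCE
# define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical portable replacement for gettime(): monotonic time in
 * nanoseconds, or 0 if the clock cannot be read. */
static uint64_t gettime_portable(void) {
  struct timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
    return 0;
  return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}
```

This goes through the C library rather than reading a counter register
directly, so it is likely to add more overhead per sample than the inline
assembly versions.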
