Re: rv-manda-01 might have hardware issues?



On 2023-08-03 12:35, Aurelien Jarno wrote:
> On 2023-08-03 12:07, Aurelien Jarno wrote:
> > On 2023-08-03 13:01, Adrian Bunk wrote:
> > > On Thu, Aug 03, 2023 at 11:12:49AM +0200, Aurelien Jarno wrote:
> > > > On 2023-08-02 18:32, Adrian Bunk wrote:
> > > > > Hi,
> > > > > 
> > > > > while there is a (rare)
> > > > >   semop(1): encountered an error: Invalid argument
> > > > > error that happens on all buildds, rv-manda-01 seems
> > > > > to have issues unique to this buildd:
> > > > > https://buildd.debian.org/status/fetch.php?pkg=softhsm2&arch=riscv64&ver=2.6.1-2.1&stamp=1690878571&raw=0
> > > > > https://buildd.debian.org/status/fetch.php?pkg=mmseqs2&arch=riscv64&ver=14-7e284%2Bds-2&stamp=1690917698&raw=0
> > > > > https://buildd.debian.org/status/fetch.php?pkg=ocaml-dune&arch=riscv64&ver=3.9.1-1%2Bb1&stamp=1690975265&raw=0
> > > > > https://buildd.debian.org/status/fetch.php?pkg=libint&arch=riscv64&ver=1.2.1-6&stamp=1690989462&raw=0
> > > > > 
> > > > > This happens only on rv-manda-01, and my guess would be that this might 
> > > > > be a hardware problem (e.g. a nonworking fan).
> > > > > 
> > > > 
> > > > This is unfortunately not limited to rv-manda-01 and also appeared on
> > > > the other buildds, so I really doubt it's a hardware issue:
> > > > 
> > > > https://buildd.debian.org/status/fetch.php?pkg=freewnn&arch=riscv64&ver=1.1.1%7Ea021%2Bcvs20130302-7&stamp=1690541230&raw=0
> > > > https://buildd.debian.org/status/fetch.php?pkg=vnlog&arch=riscv64&ver=1.36-2&stamp=1690741628&raw=0
> > > > https://buildd.debian.org/status/fetch.php?pkg=audit&arch=riscv64&ver=1%3A3.1.1-1%2Bb1&stamp=1690705512&raw=0
> > > > https://buildd.debian.org/status/fetch.php?pkg=libnl3&arch=riscv64&ver=3.7.0-0.2&stamp=1690668687&raw=0
> > > > https://buildd.debian.org/status/fetch.php?pkg=globus-authz&arch=riscv64&ver=4.6-2&stamp=1690813082&raw=0
> > > > https://buildd.debian.org/status/fetch.php?pkg=krb5&arch=riscv64&ver=1.20.1-2%2Bb1&stamp=1690796233&raw=0
> > > >...
> > > 
> > > These are the semop(1) issue, which as I said happens on all buildds.
> > > 
> > > 
> > > rv-manda-01 had mysterious FTBFS that did not appear when the package 
> > > was retried:
> > > 
> > > ocaml-dune:
> > > ...
> > > cd _boot && /usr/bin/ocamlopt.opt -c -g -no-alias-deps -w -49-6 -alert -unstable -I +threads dune_rules__Coq_stanza.mli
> > > Fatal error: exception Failure("lexing: empty token")
> > > ...
> > > 
> > > proj:
> > > ...
> > > In file included from /usr/include/features.h:490,
> > >                  from /usr/include/errno.h:25,
> > >                  from /<<PKGBUILDDIR>>/src/projections/wag3.cpp:3:
> > > /usr/include/riscv64-linux-gnu/bits/stdio2.h:244:14: error: expected string-literal before ‘^=’ token
> > >   244 | extern char *__REDIRECT (__fgets_unlocked_alias,
> > >       |              ^~~~~~~~~~
> > > ...
> > > 
> > > 
> > > rv-manda-01 also had several cases of the kind of gcc ICEs that are
> > > clear buildd problems:
> > > 
> > > mmseqs2 (similar in softhsm2 and libint):
> > > ...
> > > /usr/include/c++/13/bits/stl_algo.h:1830:5: internal compiler error: in add_regs_to_insn_regno_info, at lra.cc:1502
> > > ...
> > > The bug is not reproducible, so it is likely a hardware or OS problem.
> > > ...
> > 
> > OK, I am afraid that we just have to shut down this buildd and wait for
> > new hardware to be available.
> 
> Alternatively, I wonder if it could be the following issue, which never
> got solved, and can appear or disappear depending on the random values
> used by the kernel:
> 
> https://yhbt.net/lore/all/20200710191250.GA2242132@aurel32.net/T/

It appears that all the issues started after the host was upgraded
from kernel 6.3.7-1 to 6.4.4-1. We have been slowly upgrading
the buildds to that kernel as liburing's testsuite was causing a
reproducible kernel oops with kernel 6.1.15-1 (but not confirmed on
6.3.7-1). It appears that Fedora has also started to encounter similar
issues "recently", and that they are currently using a 6.4 kernel, but
they don't have more details at this stage.

For now, I have just rebooted the buildd on kernel 6.3.7-1 and re-enabled
it. Let's carefully monitor the situation, for that specific
buildd but also the others, to see if the issue happens again.

Cheers
Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                     http://aurel32.net

