Bug#993978: linux-image-5.10.0-8-arm64: host hangs after some time of use



Hi Paul,

On Thu, Sep 09, 2021 at 11:00:26AM +0200, Salvatore Bonaccorso wrote:
> Hi Paul,
> 
> On Thu, Sep 09, 2021 at 09:37:02AM +0200, Paul Gevers wrote:
> > Package: src:linux
> > Version: 5.10.46-4
> > X-Debbugs-CC: debian-ci@lists.debian.org, aron@debian.org
> > Severity: serious
> > Justification: data loss
> > 
> > Hi,
> > 
> > As discussed over IRC, here is the bug report for one of the hanging
> > arm64 hosts we have for ci.debian.net.
> > 
> > Since the upgrade of our hosts to bullseye (days before the bullseye
> > release) we have been experiencing random loss of access to our hosts.
> > For the hosts that have some form of out-of-band access, I tried to use
> > that to see what's going on, but at AWS our account doesn't have the
> > right permissions to use the serial port out-of-band access, and none of
> > the other forms that I tried worked on any of the hosts where I have
> > some form of out-of-band access.
> > 
> > Since the bullseye release I've already rebooted hosts (externally
> > triggered) dozens of times, and for those hosts that don't allow
> > rebooting (AWS again) I had to reprovision the hosts.
> > 
> > All the architectures (amd64, arm64, ppc64el and s390x) that we have
> > experience these hangs. I'm absolutely not claiming that the root cause
> > is the same, but on buster we didn't experience this (our s390x host
> > never worked on buster so I don't claim a regression there), so there is
> > a pattern. However, the symptoms don't look completely the same everywhere.
> > 
> > On one of our arm64 hosts (which we call ci-worker-armel-01) I found the
> > attached logging as the final logs in the journal.
> 
> I suspect it's the same issue as fixed by
> https://git.kernel.org/linus/ad9f151e560b016b6ad3280b48e42fa11e1a5440
> upstream,
> https://lore.kernel.org/lkml/000000000000ef07b205c3cb1234@google.com/
> 
> The fix landed in 5.13-rc7 (was backported to 5.12.13 as well, but not
> 5.10.y). It seems it requires more work to address it as well in
> 5.10.y.
> 
> Asked upstream in
> https://lore.kernel.org/lkml/YTkj4xH2Ol075+Ge@eldamar.lan/

The needed patches are now there:
https://lore.kernel.org/stable/20210909140337.29707-1-fw@strlen.de/
and queued for the next 5.10.y upload (so I expect us to have those at
the latest in our first bullseye point release).

I will try to cherry-pick those; if you can check that they fix the
issue, that would be great.

Regards,
Salvatore

