
Re: autopkgtesting and regressions that slip into the baseline



Hi Iain,

On 08-08-18 14:06, Iain Lane wrote:
> On Fri, Aug 03, 2018 at 11:36:48AM +0200, Paul Gevers wrote:
>> I think your question was something like "issues are only found late by
>> indirect reverse dependencies, by which time there is not much to do
>> but accept the situation. Does Debian also experience that?"
> 
> That's right. The way I put it I think was "we are bad at pinning the
> blame for test regressions at the right place".

Often (though far from always) I leave that up to the maintainers of the
packages involved. I have filed numerous bugs against both the package
that starts to fail and the package that happens to trigger the failure
first.

>> First, I do find issues like that and I create bugs (hopefully with the
>> right severity) as I create bugs for most regressions that aren't
>> related to unsolved bugs in our infrastructure.
> 
> That sounds like really valuable work, thanks for doing that. Am I right
> in thinking that this is mostly through manual examination of each
> particular failure?

See also [1]. That is correct. But in nearly all cases, what I do is
quickly try to spot the error and see if it already makes sense (in quite
a few cases it does, although I don't think it is above 50%). Then I
check whether the apt fallback triggered, the changelog of the new
package and the status of the tests in unstable (all of which can be done
in a couple of minutes). Typically, submitting the bug correctly costs
the most time (I still have to improve my scripts to generate a proper
template with the info I want). Most of my time actually goes into
communicating with people, e.g. about the right technical solution;
versioned Breaks and versioned Depends are often not fully on the radar
and are currently not 100% supported by the autopkgtest framework. Bug
896023 is a nasty one in that respect, which should be fixed in britney;
I am working on a patch to add multiple packages from unstable at the
same time (I want the apt fallback to be disabled for Debian testing when
that lands).
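
To illustrate what I mean by a versioned Breaks (a minimal sketch with
made-up package names, not taken from the actual bug): the binary stanza
of the new library in debian/control could carry something like

  Package: libfoo1
  Depends: ${shlibs:Depends}, ${misc:Depends}
  Breaks: bar (<< 2.0-1)

so that apt (and britney) know that versions of bar before 2.0-1 cannot
be installed together with the new libfoo1, and the two packages have to
migrate, and hence be tested, together.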

>> Second, because Debian works with a different baseline than Ubuntu (our
>> baseline is the current situation in testing, rather than all past
>> results for a package), the bug report is there but the baseline is
>> updated automatically so the problem for gating "goes away".
> 
> Indeed. I think this is part of the situation that I don't find 100%
> comfortable.
> 
> I can see the justification: we missed when the regression came in, and
> it's not fair to impose penalties on maintainers/packages that didn't
> cause the problem. That is arguably a sensible way to deal with the
> current reality. But I think it should be considered a workaround for
> the problem that I'm specifying.

Well, although I agree with your statement, it is also a consequence of
how Debian currently works (even though we all agree that we want to move
to gating, much like Ubuntu). As we are only delaying migration,
regressions are eventually *allowed* to enter, unless the bug has RC
status.

> Ideally we would have 0 flaky tests (which mess everything else up)

I have automatic retries in place, so flaky tests are not such a big
problem in Debian. Also, all flaky tests with bugs filed get a
force-badtest hint from me.
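
(For those unfamiliar with the mechanism: such a hint is, roughly, a
single line in britney's hints files of the form

  force-badtest foo/1.2-3

with a hypothetical package foo and version 1.2-3, telling britney to
ignore the failing autopkgtest result for that package/version when
judging migration candidates.)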

> and
> every test regression would exactly pin the blame on the change(s?) that
> caused it.

If only we didn't have that stupid (but currently needed) apt fallback.
Also, because of dependencies, you may only be able to migrate a group of
packages in lock-step, and then any package in the group can actually be
the cause of the regression. I don't think that, without rebuilding (and
manual work), you'll be able to spot for every regression which package
actually caused the issue. I agree we can do better than we do now.

> If we get better at catching things at the right time, the
> testing-as-baseline thing won't be so necessary - because it will be
> really abnormal for something to go from green to red in testing: we'll
> have noticed it in unstable when it actually happened.

No. Unstable is broken on a regular basis. Transitions are often the
cause of that, especially in combination with apt's behavior of only
considering candidate versions. I.e. from the Depends and availability
point of view you would be able to find a solution, except apt doesn't
want to install it.

> I know that achieving this perfectly is probably not possible, so steps
> towards this state are what I would be looking for.

Full ack on that.

>> I think a real solution may be to test not only the direct reverse
>> dependencies, but also indirect reverse dependencies. If I am correct,
>> ci.debian.net (with 10 workers for unstable and testing) has been doing
>> that for the unstable archive for years already, so this is probably
>> less problematic than it sounds, although I am a bit worried about the
>> time it sometimes takes. (We could add more workers of course.)
> 
> I think that's the right idea, but I'm a bit worried about whether it
> scales. There are some extreme cases like glibc and perl - just testing
> their direct reverse dependencies on Ubuntu for us results in more than
> 1,000
> test requests (we test on 6 arches) and if more than one of these big
> packages is uploaded close together in time we already end up with a
> multi-day backlog of pending tests. I can't exactly remember what our
> capacity is, but on x86 I'm sure it's more than 10 instances - more like
> 40 (shared between i386 and amd64).
> 
> I'd need to do some analysis to figure out what kind of numbers we'd be
> talking about here.

Maybe Antonio can comment on our unstable situation. I am pretty sure
that we sometimes fall behind, yes, and I also know we have 10 workers in
total, amd64 only. Since we prioritized submissions over the debci
(unstable) tests, I haven't seen much backlog, except indeed in the glibc
case.

>> Related note, we test all packages in a pure testing environment at
>> least once a week, so if needed, figuring out what changed is easier
>> than for you (you noted that there may have been quite some time between
>> tests in Ubuntu), although we aren't doing that yet.
> 
> We discussed this between a small group after the BoF had finished, and
> had the idea to do the same in Ubuntu too. Something like this: keep a
> low priority queue constantly filled with all packages that have tests
> in 'testing' (which we call 'the release') which is consumed from
> whenever the other queues are empty. If we do this, then there should be
> a reasonably fresh baseline available to look at whenever something
> flips to red.

Indeed. There are currently about 8600 packages with tests in testing.
That means about one test per minute when looping over all of them in a
week (which is what I do). (@Antonio) It would be even better if I could
run those at a lower priority than the current gating tests, but at the
moment this isn't really an issue.
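
(For the arithmetic: a week is 7 * 24 * 60 = 10080 minutes, and
8600 / 10080 is roughly 0.85, i.e. close to one test per minute.)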

Did you further discuss Ian's ideas about bisecting a regression?

Paul

[1] https://bugs.debian.org/cgi-bin/pkgreport.cgi?users=debian-ci@lists.debian.org
