Re: avoid friday deployments?



Hi everyone,

First, thanks for your responses. I'll reply to all of them in one batch
to keep things simple.

On 2017-06-30 13:23:02, Holger Levsen wrote:
> there are cultures on earth where fridays are holidays and/or saturdays
> and/or sundays are normal workdays. that said…

Of course, I meant to address that with the timezone comment, but
obviously I overlooked this completely...

> I don't see any point in holding off fixes for known issues any longer.
> On the contrary, I think releasing those fixes earlier is better. 
>
> Releasing embargos prior to weekends might not be the nicest thing to do,
> but fixed packages for widely known issues should be released ASAP.

The problem is that we often have issues pending for a while. A few days
won't make much of a difference. The two issues I was just working on
were sudo and puppet, both of which I consider to be very critical
infrastructure. Let's take them one by one as case studies.

The sudo bug (CVE-2017-1000368) has been publicly known and pending in
LTS for a month now. It was actually triaged as "let's wait for more
issues to pile up" and it's only because I took the initiative of
backporting the patch that it was fixed at all. The impact of the
security issue is limited (it is local, so it needs a multi-user
system): it's a privilege escalation that is a little tricky to leverage.

Still, considering the high impact, I figured it was important to fix,
but I do not think it was urgent to release the fix, especially on a
friday. The patch affected a critical part of the code where sudo
inspects the tty and sets up the session. A failure on my part to do the
right thing here would be catastrophic, so it seemed reasonable to leave
more time for testing and review. Furthermore, because there is no test
suite in sudo, it's basically impossible for me to test all sudo use
cases, since there are so many (sudo-ldap anyone?). Arguably, this
change shouldn't impact the more exotic features of sudo, but who
knows...

Then the puppet bug (CVE-2017-2295). This vulnerability was announced on
May 11th and fixed in jessie on May 25th. The fix in wheezy is bold: we
change the serialization format from YAML to PSON on the wire for the
clients *and* deny the old YAML format to older clients on the
server. Deployments need to update all machines at once; otherwise, once
the updated master is deployed, their puppet manifests will stop running
on out-of-date clients.

The same argument applies here as with sudo: the issue has been public
for a while and we still haven't fixed it as of the end of June, so I
figured it could wait over the weekend for me to think about it again
and to leave time for people to object to the change.

I uploaded both packages today.

On 2017-06-30 09:25:05, Roberto C. Sánchez wrote:
> On Fri, Jun 30, 2017 at 08:56:17AM -0400, Antoine Beaupré wrote:
>> 
>> What do people think of such a policy? Should I refrain from uploading
>> those three packages today and complete that work monday?
> 
> I don't think it makes much difference in the end, so long as you are
> available and willing to work any follow-up issues.  I think some people
> find it easier to commit large blocks of weekend time to deal with
> unexpected issues.  Some weeks I am quite busy and so I will take this
> approach.  Other times I have more flexibility during the week so I will
> prefer to deploy during the week so as to not disturb my weekend time.

I agree this is the right approach when you're the person responsible for the
machines affected. But the problem in this situation is that even if I'm
available to fix issues with the packages during the weekend, I won't be
the one picking up the scraps with *actual* infrastructure over the
weekend.

In other words, it doesn't matter if (say) I upload a regression fix for
sudo on saturday. Someone running nightly upgrades will still have lost
access to all their machines over the weekend, or will be woken up by a
pager yelling because puppet failed.

To give a concrete example of what happened to me on the receiving end
of those updates, at my previous job: our VoIP server was upgraded
during the night and all of a sudden, no calls would go through. Worse:
any call would actually crash the server (DSA-2605-2, maybe?). I was
on call, so I had to trace back what had happened and downgrade the
server in an emergency. I also worked with the security team to test the
new update... I don't quite remember if it happened over the weekend, but
it certainly disrupted my workflow, and since then I have been far more
sensitive to the impact of those updates on my fellow sysadmins.

[...]

On 2017-06-30 15:35:07, David Ayers wrote:
> If you were talking about deploying new features, refactorization or
> similar work, I'd tend to agree.  But this is LTS support.  I assume you
> are fixing vulnerabilities.  You should carefully consider exposing
> systems longer than necessary.

Sometimes we need to be creative in the way we fix those issues,
unfortunately. The puppet update is a good example: I had to actually
switch on a new feature to fix the bug, so this *is* a major change, and
it's not that simple.

The sad truth is that we *are* exposing systems longer than necessary
already. The question is what constitutes a valid reason for this. For
me, avoiding disruption is one. Our most common reason for delaying
updates, however, is the lack of workforce and I find *that*
unfortunate. :)

Thank you again for all the feedback. All in all, I do not believe we
should have a hard rule against friday deployments, but I do think it's
something we should keep in mind, especially when more time may be
useful to think about major updates.

A.

-- 
We must learn to live together as brothers or perish together as fools.
                        - Martin Luther King, Jr.

