Re: End of hypocrisy ?

To: Debian User <debian-user@lists.debian.org>
Subject: Re: End of hypocrisy ?
From: Tom H <tomh0665@gmail.com>
Date: Tue, 5 Aug 2014 09:04:50 -0400
Message-id: <CAOdo=SzoiPO=VfVp2f4CM8vvZF93vcZ1-yM4aStzK6+7yjeUZg@mail.gmail.com>
In-reply-to: <CAAr43iPTW4MjxXXd3xTLTALMyBspTsL=u0WVN0q1AZvfxNGaCA@mail.gmail.com>
References: <20140721053040.GC18817@rail.eu.org> <53DF6256.5080307@affinityvision.com.au> <CAOdo=SxKfsc6pMntDLC7+XfFq_AfqpadUeU_d11QNO=8PRLR1w@mail.gmail.com> <53DF9A92.9050406@affinityvision.com.au> <CAOdo=SzePe6tACV76NLfH4ZVzCrnA3zuM58Hxx1b_qGzyO7ZvA@mail.gmail.com> <53DFE2B8.8060106@rail.eu.org> <53DFEAAC.6000703@affinityvision.com.au> <CAOdo=SweELVZb=kQzy0e1uOgWcpZUkfObU7N8XbF1qheuMcZKA@mail.gmail.com> <CAAr43iPTW4MjxXXd3xTLTALMyBspTsL=u0WVN0q1AZvfxNGaCA@mail.gmail.com>

On Mon, Aug 4, 2014 at 7:21 PM, Joel Rees <joel.rees@gmail.com> wrote:
> On Tue, Aug 5, 2014 at 6:20 AM, Tom H <tomh0665@gmail.com> wrote:
>> On Mon, Aug 4, 2014 at 4:18 PM, Andrew McGlashan
>> <andrew.mcglashan@affinityvision.com.au> wrote:
>>> On 5/08/2014 5:44 AM, Erwan David wrote:
>>>> On 04/08/2014 21:34, Tom H wrote:
>>>>>
>>>>> Suppose that you have a 16-node cluster, some patches were applied to
>>>>> the systems overnight, a mistake was made, and you have to correct
>>>>> this mistake on all of the systems during trading hours. Once you get
>>>>> all the OKs that are needed for this kind of emergency change, the
>>>>> head of the trading desk that uses that cluster calls you and says
>>>>> "I'm going to be on the line for as long as you're working on our
>>>>> system." So you fix one node, reboot it, make sure that it's back in
>>>>> the cluster and doing its job, and fix another, etc. You can be sure
>>>>> that everyone's happier that the systems boot quickly and that the
>>>>> cluster was running with 15 rather than 16 nodes for as few minutes as
>>>>> possible (because you can be sure that the fact that this cluster
>>>>> wasn't running at full capacity for X minutes will come up in
>>>>> managerial meetings, both in IT ones and in IT-Business ones).
>>>
>>> The argument here is likely that the upgrade should have been tested on
>>> a test cluster FIRST and perhaps extensively -- if you have that many
>>> servers in play, you should have a development, test and production
>>> environment to work with and very stringent change control methods in place.
>>
>> Come on! Changes go through dev and uat before being rolled out to
>> prod. The night-shift sysadmin who made the changes screwed up. It
>> happens...
>
> When the operating system itself tries to hold the night-shift admin
> by the hand, we have serious problems.
>
> Current trading systems are completely wrong. It's no surprise if they
> can't get the failover part right, either.

The init system isn't baby-sitting the sysadmin and it has nothing to
do with trading system failover.

It's a question of having to correct a configuration error one node at
a time, while the other nodes keep doing whatever they're meant to be
doing, and of rebooting each node as quickly as possible.
