Re: Dreamhost dumps Debian
Clint Byrum <email@example.com> writes:
> Perhaps you missed the blog post details?
> "About ten months ago, we realized that the next installation of Debian
> was upcoming, and after upgrading about 20,000 machines since Debian 6
> (aka Squeeze) was released, we got pretty tired."
> Even if the script is _PERFECT_ and handles all of the changes in
> wheezy, just scheduling downtime and doing basic sanity checks on 20,000
> machines would require an incredible effort. If you started on release
> day, and finished 2-3 machines per hour without taking any weekend days
> off, you would just barely finish in time for oldstable to reach EOL. I
> understand that they won't be done in a linear fashion, and some will
> truly be a 5 minute upgrade/reboot, but no matter how you swing it you
> are talking about a very expensive change.
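As a rough check on the quoted estimate, the arithmetic can be sketched in
a few lines of Python. The round-the-clock schedule (24 hours a day) is my
assumption; the quoted text gives only the per-hour rate and the
no-weekends condition. The function name is hypothetical.

```python
import math


def days_to_upgrade(machines: int, per_hour: float,
                    hours_per_day: float = 24.0) -> int:
    """Whole days needed to upgrade `machines` one after another.

    hours_per_day defaults to 24, which is an assumption, not something
    stated in the quoted message.
    """
    return math.ceil(machines / (per_hour * hours_per_day))


if __name__ == "__main__":
    for rate in (2, 3):
        print(f"{rate} machines/hour: {days_to_upgrade(20_000, rate)} days")
```

Under that assumption the serial approach lands somewhere between roughly
nine and fourteen months, which is in the same range as the period for
which oldstable has traditionally kept receiving security support after a
new stable release.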
A few comments here from an enterprise administration perspective:
First, if you have 20,000 machines, it's highly unlikely that each system
will be a special snowflake. In that environment, you're instead talking
about large swaths of systems that are effectively identical. You
therefore don't have to repeat your sanity checking on each individual
system, just on representatives of the class, while using your configuration
management system to ensure that all the systems in a class are identical.
And in many cases you won't have to arrange downtime at all (because the
systems are part of redundant pools).
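To make the "representatives of the class" idea concrete, here is a
minimal sketch, assuming you can export a host-to-class mapping from your
configuration management system. The inventory format, the host names, and
the function name are all hypothetical; no particular tool is implied.

```python
from collections import defaultdict


def pick_representatives(host_classes: dict[str, str],
                         per_class: int = 1) -> dict[str, list[str]]:
    """Pick the first `per_class` hosts (by name) from each class.

    host_classes maps a host name to its class label, e.g. "web" or "db".
    Only the returned hosts need the full manual sanity check; the rest
    are assumed identical because configuration management enforces it.
    """
    by_class: dict[str, list[str]] = defaultdict(list)
    for host, cls in sorted(host_classes.items()):
        by_class[cls].append(host)
    return {cls: hosts[:per_class] for cls, hosts in by_class.items()}


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {"web01": "web", "web02": "web", "db01": "db"}
    print(pick_representatives(inventory))
```

The number of systems you check by hand then scales with the number of
classes, not with the number of machines.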
Second, with 20,000 machines, there is no way that I would upgrade the
systems in place. Debian's upgrade support is very important for individual
systems, personal desktops, and smaller-scale environments, but once
you're at the point of several dozen systems, I would stop doing upgrades.
At Stanford, we have a general policy that we rebuild systems from FAI for
new Debian releases. All local data is kept isolated from the operating
system (or, ideally, not even on that system, which is the most common
case -- data is on separate database servers or on the network file
system) so that you can just wipe the disk, build a new system on the
current stable, and put the data back on (after performing whatever
related upgrade process you need to perform). There's up-front
development required for your new service model for the new operating
system release, which you validate outside of production, and then the
production rollout is mechanical system rebuilds (which usually take under
10 minutes with FAI and are parallelizable).
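For a sense of what parallel rebuilds do to the schedule, here is a rough
back-of-the-envelope sketch. The concurrency of 100 simultaneous rebuilds
is purely an assumed figure for illustration, and the function name is
hypothetical; the 10 minutes per rebuild is the upper bound mentioned
above. It ignores validation time and any limits on install server
capacity.

```python
import math


def rebuild_wall_clock_minutes(systems: int,
                               minutes_per_rebuild: int = 10,
                               concurrency: int = 100) -> int:
    """Wall-clock minutes to rebuild `systems` in fixed-size batches.

    Each batch rebuilds `concurrency` systems at once and takes
    `minutes_per_rebuild` minutes; batches run back to back.
    """
    batches = math.ceil(systems / concurrency)
    return batches * minutes_per_rebuild


if __name__ == "__main__":
    minutes = rebuild_wall_clock_minutes(20_000)
    print(f"{minutes} minutes, about {minutes / 60:.1f} hours")
```

With those assumed numbers the mechanical part of the rollout is measured
in hours of rebuild time, not months, though in practice you would spread
it out around redundancy and maintenance windows.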
My personal opinion is that if someone is scripting an upgrade of 20,000
systems and running it on those systems one-by-one, they're doing things
at the wrong scale and with the wrong tools for that sort of environment.
Russ Allbery (firstname.lastname@example.org) <http://www.eyrie.org/~eagle/>