Re: Dreamhost dumps Debian
Excerpts from Russ Allbery's message of 2013-08-27 13:47:01 -0700:
> Clint Byrum <firstname.lastname@example.org> writes:
> > Perhaps you missed the blog post details?
> > "About ten months ago, we realized that the next installation of Debian
> > was upcoming, and after upgrading about 20,000 machines since Debian 6
> > (aka Squeeze) was released, we got pretty tired."
> > Even if the script is _PERFECT_ and handles all of the changes in
> > wheezy, just scheduling downtime and doing basic sanity checks on 20,000
> > machines would require an incredible effort. If you started on release
> > day, and finished 2-3 machines per hour without taking any weekend days
> > off, you would just barely finish in time for oldstable to reach EOL. I
> > understand that they won't be done in a linear fashion, and some will
> > truly be a 5 minute upgrade/reboot, but no matter how you swing it you
> > are talking about a very expensive change.
> A few comments here from an enterprise administration perspective:
> First, if you have 20,000 machines, it's highly unlikely that each system
> will be a special snowflake. In that environment, you're instead talking
> about large swaths of systems that are effectively identical. You
> therefore don't have to repeat your sanity checking on each individual
> system, just on representatives of the class, while using your configuration
> management system to ensure that all the systems in a class are identical.
> And in many cases you won't have to arrange downtime at all (because the
> systems are part of redundant pools).
Dreamhost is a hosting company. It actually is quite possible that all
20,000 machines mentioned are unique snowflakes in this case. Though
it is probably more likely that there are at most 10,000 unique machines,
with some customers having only one, but others having 3 or more.
(would be great if one of them were on this thread to comment.. ;)
> Second, with 20,000 machines, there is no way that I would upgrade the
> systems. Debian's upgrade support is very important for individual
> systems, personal desktops, and smaller-scale environments, but even when
> you're at the point of several dozen systems, I would stop doing upgrades.
> At Stanford, we have a general policy that we rebuild systems from FAI for
> new Debian releases. All local data is kept isolated from the operating
> system (or, ideally, not even on that system, which is the most common
> case -- data is on separate database servers or on the network file
> system) so that you can just wipe the disk, build a new system on the
> current stable, and put the data back on (after performing whatever
> related upgrade process you need to perform). There's up-front
> development required for your new service model for the new operating
> system release, which you validate outside of production, and then the
> production rollout is mechanical system rebuilds (which usually take under
> 10 minutes with FAI and are parallelizable).
I was actually thinking this too, and by upgrade I mean "the software
that serves each specific job is now wheezy". In-place upgrades for
20,000 machines would definitely be an incredible explosion of entropy.
How long does FAI take to make a new machine? If it is more than 30 minutes
then you need at least two FAI builds going all the time to finish on time.
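
For what it's worth, here is the back-of-the-envelope arithmetic behind those numbers as a minimal sketch. The 365-day deadline is my assumption for roughly how long oldstable stays supported after a release, the 20 to 30 minute figures are just "2-3 machines per hour" inverted, and the function names are made up for illustration. It assumes rebuilds run around the clock with no gaps.

```python
import math

MACHINES = 20_000      # figure quoted from the blog post
DEADLINE_DAYS = 365    # assumption: about one year until oldstable reaches EOL


def days_needed(machines, minutes_per_machine, parallel=1):
    """Wall-clock days to get through every machine, running 24 hours a day."""
    total_minutes = machines * minutes_per_machine / parallel
    return total_minutes / (24 * 60)


def streams_needed(machines, minutes_per_machine, deadline_days):
    """Smallest number of parallel rebuild streams that meets the deadline."""
    available_minutes = deadline_days * 24 * 60
    return math.ceil(machines * minutes_per_machine / available_minutes)


if __name__ == "__main__":
    # 2-3 machines per hour is 30 down to 20 minutes per machine.
    for minutes in (20, 30):
        days = days_needed(MACHINES, minutes)
        streams = streams_needed(MACHINES, minutes, DEADLINE_DAYS)
        print(f"{minutes} min/machine: {days:.0f} days serial, "
              f"{streams} stream(s) to finish in {DEADLINE_DAYS} days")
```

At 30 minutes per machine a single stream overshoots a year, which is where "at least two" comes from. Real rollouts won't be this linear, so treat it as a lower bound on effort rather than a schedule.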
> My personal opinion is that if someone is scripting an upgrade to 20,000
> systems and running it on those systems one-by-one, they're doing things
> at the wrong scale and with the wrong tools for that sort of environment.
I wasn't clear: I don't mean you'll do each one as a special snowflake
in-place. I mean, 20,000 machines is simply a lot of machines to
manage. No matter what, upgrading or replacing the OS all within a 1
year schedule that you do not control and cannot fully predict, is a