
Re: Kernel Live Patching



Hi Aleksey,

I'm sorry, I don't have useful answers to your questions, so if that
is all you're after you may as well skip this email.

On Thu, Jun 28, 2018 at 10:53:58AM -0400, Gene Heskett wrote:
> Given the history of ksplice, and my innate paranoia, I don't have a pole 
> long enough to reach it. You shouldn't either.

In my opinion this is just fear of a process that is not understood.
At another time and place, you could be cautioning people not to
travel faster than 40mph because they will surely suffocate, or
warning that photographic reproductions of a person's image trap
their soul.

Live patching is a technique. It has trade-offs, like everything
else.

> If something is patched and a reboot is needed to make it 100% 
> functional, and you can't stand the thought of 2 minutes downtime while 
> its rebooting, its time to mirror your app to a second machine and 
> configure an automatic failover.

In some (maybe even many) scenarios this is absolutely true.

As a thought experiment, though, imagine you have a server with
1,000 services on it, each of them, through virtualisation,
belonging to a different entity (customer, organisation, user,
whatever). We'll call the entities users for simplicity.

Each user pays $5 a month for their service to run on this platform.
There is no redundancy. If the user wants redundancy then the user
can purchase more services and implement it themselves. Because most
users do not see that as a priority, most users do not. The users
accept that there will be occasional, inevitable downtime, because
they don't want to double their costs (or more) to guard against
what is a relatively rare event.

So, this platform, it's raking in $5k a month per server.

Then there's a kernel update for a serious security flaw and that
requires a reboot. You as operator of the platform schedule
maintenance and the users endure 5 minutes of outage.

Your competitor works out that they can pay someone to produce live
kernel patches for $100 a month per server and does so. Your
competitor also has 1,000 users per server, so they're raking in $5k
per server, less $100 to pay the live patching company. $4,900 a
month. They don't reboot their servers causing outages for their
users, they just live patch¹.

Your users find out that your competitor's service is exactly the
same as yours, with the same features and price, but they've heard
that it's a lot more available! 20% of your users move to your
competitor.

You're still making $5k per server but you have 20% fewer servers
because 20% of your users left. Your competitor is still making
$4,900 per server but they have 20% more servers, because they took
on a bunch of new users. Your competitor is crushing you.

You try to explain to your users that they could build in the
availability they need by running multiple instances of their
services and architecting it so that it can survive failure of a
percentage of the instances. Your users ask you why you expect them
to pay more, do more, know more and make things more complex, when
it is an inarguable fact that your competitor has a more available
service than yours for the same price they pay now. Looks like
you're going to have to either copy your competitor, or lower your
prices. You can talk until you are blue in the face about there
being no free lunch, while your user is paying your competitor
what they used to pay you, only your competitor's lunch offering is
better.

As I say, you are absolutely correct that for some scenarios it is
right and proper to build the resilience into the app. However, what
I suggest you have missed is that very few use cases require that
level of engineering. Most of the users of any software at all
exist in a place with much more lax requirements where it is *nice*
when things don't fail, but they aren't interested in building N
copies of it and altering the software so that it can make use of
that distributed nature. Just making it incrementally better is a
big thing at scale, especially if that doesn't cost much. Hint: live
patching services don't tend to cost $100 per server per month.

So we are left in a situation where a lot of things are running on
"platforms" and the users like it when the platforms are highly
available. Why isn't everything engineered to have near-perfect
availability? That's possible, right? Yes, but it costs. Motor
vehicles aren't space shuttles; the process around their design and
manufacture does not attempt to make them near-perfect, merely good
enough for what they cost. If a simple alteration to the process can
save X lives per year while not costing very much, then it makes
sense, and no one scoffs that, because it still falls short of
perfection, nothing should be done at all.

The thing about live kernel patching is that it's private to the
kernel. The rest of the system doesn't necessarily know that
anything has happened. When you start to build highly available
distributed services you tend to find that you need to alter the way
you do things in order to make it work with the distributed nature
of the service. Where there is a partition between the people
running the platform and the people running the service (as
suggested in the example scenario I gave above), it becomes more
likely that the platform operator actually cannot run things in a
manner that suits every variety of service owner, and vice versa.
Remaining as generic as possible results in the largest market
possible.
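
To give a flavour of what that looks like from the kernel's side: a
live patch is, at heart, a kernel module that redirects calls from a
vulnerable function to a fixed replacement. The following is only a
rough sketch modelled on the kernel's samples/livepatch example (the
exact API has changed between kernel versions, so treat the details
as illustrative rather than something to build against):

  /*
   * Illustrative sketch only, modelled on samples/livepatch in the
   * kernel source.  It swaps out cmdline_proc_show() (the function
   * behind /proc/cmdline) for a patched copy on a running system.
   */
  #include <linux/module.h>
  #include <linux/kernel.h>
  #include <linux/seq_file.h>
  #include <linux/livepatch.h>

  /* The replacement that will run instead of the original. */
  static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
  {
          seq_printf(m, "%s\n", "this has been live patched");
          return 0;
  }

  /* Which function to replace, and with what. */
  static struct klp_func funcs[] = {
          {
                  .old_name = "cmdline_proc_show",
                  .new_func = livepatch_cmdline_proc_show,
          }, { }
  };

  /* A NULL object name means the function lives in vmlinux itself. */
  static struct klp_object objs[] = {
          {
                  .funcs = funcs,
          }, { }
  };

  static struct klp_patch patch = {
          .mod = THIS_MODULE,
          .objs = objs,
  };

  static int livepatch_init(void)
  {
          /* Older kernels needed klp_register_patch() first. */
          return klp_enable_patch(&patch);
  }

  static void livepatch_exit(void)
  {
  }

  module_init(livepatch_init);
  module_exit(livepatch_exit);
  MODULE_LICENSE("GPL");
  MODULE_INFO(livepatch, "Y");

Once a module like that is loaded, the only outward sign is an entry
under /sys/kernel/livepatch/; every process on the box carries on
exactly as before, which is the "private to the kernel" property I
mean.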

That is why techniques like live kernel patching increasingly do
have a place, even though it may horrify purists.

> There are some OS's that can do that, QNX comes to mind, but they
> aren't free. Even the QNX microkernel has a dead time of 15 or 20
> seconds for a full reload of everything else.

Also, there are further trade-offs here in that you'd then be
deploying software in a quite specialist environment, rather than on
plain old well-understood Linux. While there are some who need,
want and are capable of working with QNX, there's vastly more market
in hosting things on Linux.

> I think the applicable keyword here is TANSTAAFL. Its a universal law, 
> and there are no shortcuts around it.  IOW, if you think the lunch is 
> free, check the price of the beer.

The price of live patching at the moment is that for every kernel
update, someone has to work out the corresponding live patch. That
work is not free and that is why various organisations charge for
it.

The Linux kernel cannot go unpatched upstream, so the costs of
generating patches are borne by the Linux kernel project. By
contrast, most end-users' kernels *can* go without a live patch, so
those interested in using live patches currently need to pay or
employ someone to generate them.

No one is suggesting that anyone is expecting to get anything for
free, so the choice of using live patches, or live migrating things,
or any other technique to increase availability of a service is just
a matter of choosing your poison. It's not black magic and no one
needs to spit and say an oath whenever its name is mentioned.

I am not currently aware of anyone providing free kernel live
patches. These are things you pay for, from sources like Red Hat,
Canonical, CloudLinux and Oracle, or hire staff to produce for you,
or make yourself.

Perhaps one day the process will become so simple that volunteers
in a project like Debian could do it for free, nearly as fast as the
regular binary package updates come out. We're not there yet and
information on how to do it seems quite scarce, which is why people
are currently paying for it.

Cheers,
Andy

¹ At this point some may say, "well, if the users ran their services
  in virtual machines then the VMs could be migrated to
  already-patched hardware without the users actually noticing. No
  need for this live patching stuff!"

  That's true but I think it's getting bogged down in details. A
  person could equally say, "this live migration thing is the
  Devil's work; just make every application distributed so it can
  survive failure, or else you don't really care about the
  application!"

  Like live patching, live migration is a technique that has its
  trade-offs and will not be suitable for all scenarios.

