Debian Google Compute Engine kernel improvements, now and future
(Please keep me and my colleague CC'ed, as I'm not subscribed to
debian-kernel and he's on neither list.)
First, really good news - I just pushed Debian wheezy images for
Google Compute Engine using the standard Debian kernel included within
the image, instead of the Google-injected kernel we used before.
Rather than pv_grub, we use whatever bootloader you want - it's
preconfigured with Debian's normal grub2 setup, with all the usual
config bits like update-grub and /etc/default/grub working. Hardware
is virtio-based, including virtio-scsi for the disk driver and
virtio-net for network.
The image is the newest one in the debian-cloud project, named
debian-7-wheezy-v20131120. gcutil addinstance --image=debian-7 should
get it. We've already upstreamed all the necessary code to
build-debian-cloud. The only difference is that if you let the tool
add the image directly to Google Compute Engine by passing a value to
--gce-project, you'll also need to specify --gce-kernel=""; otherwise,
add the image manually and pass --preferred_kernel="" to gcutil addimage. I'll
send another patch shortly to change the default Compute Engine kernel
to the null string.
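
For anyone who wants to try this by hand, here is a rough dry-run sketch of the two gcutil invocations mentioned above. It only prints the command lines rather than running them. The instance name, image name, and tarball location are hypothetical placeholders, and the positional argument order for addimage (name, then source tarball) is my assumption about gcutil's usage; only --image=debian-7 and --preferred_kernel="" come from the text above.

```shell
#!/bin/sh
# Dry run: prints the gcutil command lines instead of executing them.
# INSTANCE, IMAGE_NAME and IMAGE_TARBALL are hypothetical placeholders.
set -eu
INSTANCE="my-wheezy-vm"
IMAGE_NAME="my-wheezy-image"
IMAGE_TARBALL="gs://my-bucket/my-wheezy-image.tar.gz"

# Start an instance from the newest matching image in debian-cloud.
addinstance_cmd() {
    printf 'gcutil addinstance %s --image=debian-7\n' "$1"
}

# Register a self-built image with an empty preferred kernel, so the
# kernel inside the image is booted through grub2.
# Argument order (name, then tarball) is an assumption.
addimage_cmd() {
    printf 'gcutil addimage %s %s --preferred_kernel=""\n' "$1" "$2"
}

addinstance_cmd "$INSTANCE"
addimage_cmd "$IMAGE_NAME" "$IMAGE_TARBALL"
```

Review the printed lines and paste them into a shell once the names match your project.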
During the process of developing this, we encountered several issues
which will take time to fix in Debian stable but are worth fixing for
the best performance in our environment, especially under high load.
Examples: a memory leak in virtio-scsi (Debian bug #730138), missing
multi-queue networking support, and ext4 stall bugs. None of these fixes
are Google-specific in any way, and none of them are needed for casual
usage to work smoothly.
We have some thoughts on how to address this, and wanted to give
Debian a chance to comment before we move forward.
All needed fixes to currently known kernel problems are in the
wheezy-backports kernel, but shipping the primary supported image with
a kernel backported on a best-effort volunteer basis (not the Debian
security team) is hard for both Google and Debian to view as ideal.
Google cares enough about handling this right to put some man-hours
into both short-term and long-term solutions.
Short-term proposal: one of my colleagues is planning to upload a
debian-7-wheezy-backports-vYYYYMMDD (date TBD) image to the
debian-cloud project. It is built with build-debian-cloud but adds
backports to sources.list.d and an apt preference so that linux-image-*
is pulled from backports; I think a slightly newer gsutil (Google
Cloud Storage CLI) got in there too. We'll continue to build, test,
and support images with the standard Debian kernel as well as
backports-kernel images, allowing users/customers to easily choose
whether they want more reliable security updates via stable or better
reliability and performance via the backports kernel.
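
To make the backports setup above concrete, here is a minimal sketch of what the two apt files might look like. This is not the actual build-debian-cloud change. TARGET is a hypothetical variable for the image's root filesystem (it defaults to a scratch directory), and the file names, the mirror URL, and the pin priority of 500 are all my assumptions.

```shell
#!/bin/sh
# Sketch only: TARGET is a hypothetical path to the image's root filesystem.
# It defaults to a scratch directory so the result can be inspected safely.
set -eu
TARGET="${TARGET:-./image-root}"
mkdir -p "$TARGET/etc/apt/sources.list.d" "$TARGET/etc/apt/preferences.d"

# Enable wheezy-backports (mirror URL and file name are assumptions).
cat > "$TARGET/etc/apt/sources.list.d/backports.list" <<'EOF'
deb http://http.debian.net/debian wheezy-backports main
EOF

# Raise only the kernel image packages from the backports default
# priority (100) to 500, the same as stable, so that the newer
# backports version is the one selected.
cat > "$TARGET/etc/apt/preferences.d/backports-kernel" <<'EOF'
Package: linux-image-*
Pin: release a=wheezy-backports
Pin-Priority: 500
EOF
```

Pinning only linux-image-* keeps the rest of the system on stable. A backports kernel may still need newer versions of a few supporting packages, which this sketch does not pin.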
Long-term proposal: We're still figuring out the best way to handle
this long-term, since we only identified this issue in the last week
or so. Certainly we'd like to avoid having a long-term duality in
images. To complement our internal discussions, we welcome thoughts
from Debian on this point.
Both long-term and short-term, when fixes are suitable for inclusion
in a stable update, we'll of course push them forward that way - we've
already done so with #730138. Still, that does take longer due to the
nature of stable, and it's better not to leave users/customers with no
way to avoid crashing VMs, filesystem stalls, or the like.
Does this sound good? Let us know soon if there are concerns.