[ Please note the cross-post and Reply-To ]

Hi folks,

As promised, here's a (very long!) report of what was discussed and
agreed at the 3-day Debian Cloud Sprint earlier this month in Seattle.
We used Gobby (gobby.debian.org Sprints/CloudSprint2016) to help us
track things during the sprint. For posterity, I've taken verbatim
copies of the docs created there and attached them to the wiki page
for the sprint [1].

This has taken a long time to write; I hope it's useful for people!

[1] https://wiki.debian.org/Sprints/2016/DebianCloudNov2016

TLDR Summary
############

 * We've got a useful cloud team prepared to work together, and lots
   of work to do.
 * We want to produce a reasonable set of Official Debian images for
   the major cloud providers, both to release with Stretch (and be
   updated regularly to cope with security updates) and for "testing".
 * We spent a long time looking at tools to build images, and
   *mostly* agreed that FAI is the best option to move forward with.
   Various people agreed to start evaluating it and report back.
 * Various other work is needed: publishing images, improving the web
   site and helping users to find our images, and maintaining and
   improving the packaging of various cloud tools that we need/want.
 * We want our images to represent what people reasonably expect to
   get from Debian; this may mean some changes in our images and also
   some things that we think should be changed in the broader Debian
   context (e.g. unattended-upgrades).
 * We had an incredibly productive get-together, and because of that
   we're planning to make this a regular event. It's already been
   suggested that we meet again next year.

Quite a number of actions were agreed - see the ACTIONS SUMMARY
section at the bottom for a collected list.

Cloud sprint in Seattle, 2nd to 4th November 2016
#################################################

Hosted by Zach Marano at the Google offices - thanks!

Present:
========

(on-site)
 * James Bromberger (JEB)
 * Emmanuel Kasper (marcello^/manu)
 * Steve McIntyre (Sledge/93sam)
 * Martin Zobel-Helas (zobel)
 * Bastian Blank (waldi)
 * Sam Hartman (hartmans)
 * Jimmy Kaplowitz (Hydroxide/jimmy)
 * Marcin Kulisz (kuLa)
 * Thomas Lange (Mrfai)
 * Manoj Srivastava (manoj/srivasta@{debian.org,google.com,ieee.org})
   Affiliation: Debian/Google
 * Zach Marano - Google (zmarano)
 * David Duncan - Amazon (davdunc)
 * Tomasz Rybak (serpent)
 * Noah Meyerhans (noahm)
 * Stephen Zarkos - Microsoft (???)

(irc/hangout at various points)
 * liw
 * hug
 * damjan

Agenda
======

We started by discussing and agreeing the (initial!) agenda for the
meeting. It was re-arranged and expanded significantly as we went
along and more discussion points came up.

Wednesday
---------

 * What does it mean to run in a cloud environment?
   + Priority of our users vs. technical priorities
 * In-depth look at how Debian runs in major clouds (AWS, Azure, GCE,
   Oracle, on premise, ... etc.)
 * Define an official Debian cloud image.
   + Legal and trademark issues
   + Test suite for images
   + Official "Debian-based" images (for container platforms and
     other variations)
   + Versioning 'Cloud Images' and critical hole patching policy
     (examples: Dirty COW, heartbleed, etc.)
   + SDKs and access for platforms (including release cycle mismatch
     vs cloud)

Thursday
--------

 * Decide if we want to split into WGs for the discussion items below
 * Look into the different build processes of different image types
   (maybe short presentations)
 * Introspect the various image build tools and whittle the list down.
 * Cloud-init maintenance
 * Test suite
 * Rebuilding and customisations
 * Current and future architectures
 * (Human) Language support and i18n

Friday
------

 * Supporting services (for platforms, mirrors, finding things)
 * Ideally, come to consensus on many of the open-ended issues that
   have been talked about at DebConf and on the debian-cloud list.
 * Better handling/publishing/advertising of cloud images by Debian
   + Getting updated packages into Debian - stable updates, -updates,
     backports
   + Website changes (better promote Debian Cloud images)
 * AOB
 * Going out to the computer museum?

Wednesday 2016-11-02
####################

Everybody introduced themselves and listed their affiliations as
appropriate. Most were DDs/DMs with various reasons to be interested
in the Cloud effort in Debian; we also had representatives from the
three major Cloud providers based in Seattle: Microsoft (Azure),
Google (GCE), and Amazon (AWS).

The initial proposed agenda in the wiki was a good start. We went
through it and re-organised; it doubled in size, so we prioritised
the order in case we ran out of time. Two themes came up repeatedly:
the need for customisation of images, and avoiding vendor lock-in.

What does it mean to run in a cloud environment?
================================================

Cloud is basically a disposable/elastic compute resource. There are
many types of usage, anywhere from long-term to really short (mere
hours). Most usage has been server-related, but desktop in the cloud
is a thing too. Debian supports many architectures, many languages,
many packages. This can be useful for many people in the cloud too!

For most people, a cloud instance has to be fast to boot. Things are
often charged per unit time, so nobody's happy if a system takes a
long time to come up to the point where it's useful.

For Debian cloud images, we're thinking of basically three different
models for users that we should be supporting:

 a) User takes an official image as a base, customises it and saves
    it as a new image to be run in the cloud
 b) User generates their own image from scratch, using the same
    tooling we use to generate our images
 c) User launches (and maybe customises) an existing image, without
    saving it for later (cloud-init, etc.)

In the cloud, is it a fair way to think of things that all the user
systems will effectively be Debian derivatives?

What types of images are likely to be wanted? Some people will want a
minimal image, at least to start with. We should expect the need for
a full-featured image for some users. For further ideas, we can look
at Ubuntu and other vendors to see what they're doing. Ubuntu's image
finder might help here. Advanced users can help themselves, but they
still need a good starting point: we should provide a solid base and
not discourage them.

Which commercial cloud providers should we be thinking about? The
obvious top three were all present at the meeting. Others certainly
exist, but nobody present had much experience to work from. Openstack
is another obvious target, but is apparently also difficult to target
with a single image type: Openstack has multiple virtualisation
backends (LXC, KVM) and various ways of presenting images and network
to the users, which can be awkward.

Some of us were initially hoping to maybe just provide a single image
that would work portably across (most) providers, but this is clearly
now unrealistic. The various platforms are too different, and there
will be different conflicting optimisations and configurations that
will be needed.
In a lot of cases, different software will need to be installed
depending on the platform.

 * In some AWS setups Xen tries to initialize non-existent
   framebuffer devices that it thinks it has, leading to a 30-second
   latency in boot time. Should people blacklist the driver in all
   images?
 * In many common cases, people will want/need extra agents in their
   images that are provided by the platform. This kind of thing can
   be done by cloud-init, but there are issues there (see below).

At a simpler level, different platforms look different in terms of
hardware setup. Things like disk setup can vary a lot; UEFI and
Secure Boot support are more examples.

Agents
------

The topic of agents is a major one. Each of the major cloud providers
has software agents that do various things. Some of them are very
important in some circumstances. For example, Google won't provide a
default username/password for base images, so to allow configuration
for login there needs to be an agent installed. It will also deal
with things like ssh keys, and adding routes needed to support
multiple NICs etc.

Typically, agents are not required *per se* for running images on any
of the common platforms, but they provide important additional
services. We need an easy way to allow users to opt in to them,
otherwise they will just switch to some other distribution/image.

Packaging these agents shouldn't be too hard - the providers
generally understand that they need to be Free for Debian and other
Linux distros to package them. But the harder problem for us in
Debian is to keep those packages up to date, as they're quite fast
moving. Getting them updated in stable will need effort, yet people
want *stable* images to run in the cloud too. The backend APIs they
call can change quickly, and new features are regularly being added
(e.g. support for new machine types). What's the best solution? More
on this later.

Various services that agents typically provide:

 a) monitoring
 b) hardware (mostly networking) management
 c) security related (SSH key injection, etc.)
 d) deployment (e.g. AWS DevOps)
 e) other (e.g. EC2 Run Command)

As these would be fast-changing packages, updates are hard. Testing
and unstable are not useful for most users, as they're too risky
(breaking stuff, transitions). Backports is similarly probably not
ideal. We think that the updates archive (what used to be called
volatile) is probably the best answer, but we will need to discuss
that with other people too.

Cloud-init
----------

Cloud-init is meant to provide a generic solution to some of the same
problems as the agents, and more besides. It can install specific
software at boot time and configure things, potentially allowing a
generic image to work in multiple setups. Unfortunately it has a
number of problems that cause people to dislike it.

First, it's really slow. GCE can boot a Jessie image to a prompt in
~3 seconds; cloud-init is massively slower than this.

The cloud providers' experience is that users don't know much about
cloud-init, and the user experience is not great. The providers have
to deal with supporting their images, and this is a problem for them.
Providers much prefer to have their software installed directly in
their images already, then run "apt update" at boot in case of
updates. Cloud-init also only runs at boot; it's much preferred (at
least for GCE and Azure) to have long-running daemons which better
integrate with the platform to allow for all sorts of config. AWS has
a similar agent, but it only makes changes at boot.
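(For flavour: cloud-init is driven by user data passed in at instance
creation. A minimal #cloud-config sketch - illustrative only, not
something from the sprint - showing the "install and configure
software at boot" use described above:)

  #cloud-config
  # update the package lists, then install a package at first boot
  package_update: true
  packages:
    - openssh-server
  # arbitrary commands run once, late in boot
  runcmd:
    - echo "first boot finished at $(date)" >> /var/log/first-boot.log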
Other targets
-------------

As another consideration for our images, it's now common for people
to want to use containers with their cloud images too. Some simply
want to use cloud machines as a base, running their workloads in
various container setups on top of those: Docker, Mesos, Kubernetes,
ECS, ???

Another target is Vagrant, which manages VM images typically used to
give a reproducible environment for various purposes. Vagrant support
for cloud providers is an option to move things from developer
machines to the cloud.

In-depth look at Debian in the major clouds
===========================================

People gave demos of Debian running on various clouds.

AWS (James)
-----------

Images are built on EC2 instances. You need to create a snapshot of
the volume - so it's easy to work from an EC2 instance. There is a
volume import API, but it's not clear that it's usable; James used it
only once.

All regions have Debian images, including China (behind the Firewall)
and the Government Cloud (you need to be ITAR-certified for access).

Package mirroring is done using a CDN (CloudFront), now including
IPv6. James has custom header handling in the CDN for expiring files
differently depending on type. HTTPS will work, but depends on the
apt-transport-https package to be useful. Statistics: about 1TB/day
current usage.

In the image itself, there's a bit of a mess with a billing code (?);
there's no possibility of removing it, even though it is not useful
for community images. There are restrictions on usage: snapshot,
clone volume, start. Can users do these things themselves?

James has been rebuilding stable images to match point releases and
to pick up important security fixes. Testing builds should be done
regularly too. Debian images are popular: there are 22k accounts
subscribed. AMI images are uploaded to the Marketplace, and James
maintains a spreadsheet with detailed information. It should be
possible to automate that, and we'll look into it.

In the AMI interface, Ubuntu images are shown directly in the
"preferred AMI" list, as Canonical work directly with Amazon to
support that. For Debian images, we need to look in the Marketplace.

On instance creation, there is no encryption by default. Is that
related to the billing code? There's no direct support for
key-refresh over time.

Vagrant (Emmanuel)
------------------

There are both Wheezy and Jessie images, using the vagrant repository
or a custom one. They use VirtualBox, so there is synchronization of
directories; this is enabled by default on boot, and can also be used
during the instance's lifetime.

The build system uses packer to build the image. It depends on the
latest official Debian ISO image, validating the checksum. The output
includes JSON with description metadata. There is a Makefile to call
packer with appropriate parameters and metadata. It uses Debian
Installer in the background to generate the image. There's a test
suite - vagrant up and install the package. Vagrant doesn't need root
during processing, but needs kvm or a similar kernel module.

Azure (Martin, Bastian and Steve Z)
-----------------------------------

They are currently doing automated builds in Jenkins, building
Wheezy, Jessie and Stretch daily and uploading to an Azure publishing
account. These builds are then replicated to the public Azure
network, across all the regions. The current build system is a
modification of the build-openstack-debian-image script, written in
Bash. Customisation is difficult. The standard image is 30GB.
There is a mirror network running inside Azure to support the images,
allowing them to keep update traffic completely inside the Azure
network. Max capacity is 2.5GB/s.

Anybody can upload images to Azure, but only authenticated people can
publish. The release process is manual. The daily images are not
shown in the marketplace, but are visible through the Azure API if
users want to find them. Daily images like this are removed after 14
days. The more permanent public images are manually published to (and
removed from) the marketplace using a JSON interface. It's
recommended that ordinary users should use the published images - the
daily builds are available for developers who know what they're
doing.

In the Classic Azure UI, there isn't good discoverability for Debian.
That UI is in maintenance mode and there's a new Azure portal
coming(?) which will allow easy search for Debian images. It allows
for configuration of resource providers with Azure Resource Manager,
and there's a templating system. There's no SSH key management in the
portal. Resource groups give the ability to better manage resources.
Boot diagnostics and Guest OS diagnostics help with debugging issues,
but using them needs an Azure daemon.

Supporting (CLI) tools are available in many languages: node.js,
Python, Go. Data about images includes offer, SKU and version
(including a keyword: latest); use a URN on the command line to
identify images uniquely.

The Azure agent is open source and published on github - see
https://github.com/Azure/WALinuxAgent
It manages SSH keys for login, and diagnostics. It can live side by
side with cloud-init, and the choice of which to use is made during
image creation. To use cloud-init in Azure, you need the latest
version, with many issues fixed.

There are Debian 7 images available; these use a kernel from
backports to work well. Best performance for Jessie images also
really wants a backports kernel, to get the latest drivers for things
like 40G networking.

There's a big test suite for the Azure images - yay! The images
should be compatible with Azure Stack, the on-premise Azure cloud.

GCE (Zach Marano)
-----------------

Google people are using bootstrap-vz to generate their GCE images of
Jessie, and the code and manifest are in the upstream git repo. Their
images run 3 daemons by default, for clock sync, credentials control
and IP forwarding(?). All their tools are on github, but they're not
packaged ready to go into Debian yet. They're unhappy with the amount
of effort needed to maintain "proper" packages for all the different
distros they support, and have been using FPM to generate "good
enough" packages for their use.

Image builds are usually run on a VM inside GCE - there's a script
using bootstrap-vz which starts a VM and prepares everything.

Debian Jessie is the default image provided by GCE! The GCE SDK is
baked into GCE images. Google default to publishing images monthly,
but will publish more frequently if needed for security updates. They
have just one Debian image (stable). They used to have an oldstable
image too, but are deprecating it; LTS does not cover their needs
well enough to be able to support oldstable.

They have a test suite for their images, including performance tests.
The tests are not yet open-sourced, as they're integrated with
internal Google tools and infrastructure, but they are planning to
open them.

Google's setup is global - they don't have the same "regions" setup
as other large providers.

Ubuntu build (and support?) their own images for GCE, rather than
Google doing it.
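(Aside: anybody wanting to see what Google currently publishes can
use the GCE SDK mentioned above to list the public images; a quick
sketch, assuming the gcloud tool is installed and authenticated:)

  $ gcloud compute images list | grep -i debian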
Google are unhappy about cloud-init updates not happening. They need
the updates, but don't want to use backports. Zigo has been pushing
for updates in stable, but with no success so far. Maybe we need more
people asking, or a better way of asking? So far Debian cloud use has
not been seen as important; maybe we need to make more noise about
cloud users to help convince people and get the necessary packages
uploaded.

Scaleway
--------

ARM/ARM64 cloud appliances.

Docker
------

There's an example script showing how to create an image, using
debootstrap. A docker image is based around a minimal setup for
hosting a single application; a docker file describes how to build
that image.

There's already an image named "Debian" on DockerHub. We don't know
exactly who the publisher of this image is. It might be a DD, but
they're not a member of the debian-cloud team, and the image has not
been maintained for some time. It might be Tianon Gravi - see
http://joeyh.name/blog/entry/docker_run_debian/ . Can *anyone*
publish something named "Debian"? There are potential legal/trademark
issues here...

Official Debian cloud images
============================

We've had a lot of discussions over the last couple of years. A
discussion on the mailing list started in November 2015 [2], and
another in March 2016 [3]. The latter included a suggested schedule
for producing good, tested official cloud images. We're quite a long
way behind that schedule...

[2] https://lists.debian.org/debian-cloud/2015/11/msg00005.html
[3] https://lists.debian.org/debian-cloud/2016/03/msg00042.html

Non-controversial proposals
---------------------------

Steve listed the requirements for official Debian images that had
been proposed by the Debian CD and trademarks teams, and nobody
disagreed with them:

 * All the software included needs to be in the Debian archive, in
   main
 * Stable release images will include stable-updates, but by default
   will not have stable-backports. (We might have extra backports
   images too for people that want them.)
 * No extra archives included
 * Built by DDs, on Debian infrastructure. Build scripts and config
   public, controlled in Debian.
 * Published on Debian infrastructure, and also uploaded to
   appropriate cloud providers.
 * Signed checksums published too
 * Testing, with public test logs.
 * The test suite should (must?) also be public.

As we're delayed, we may have only a very limited (or no?) test suite
for Stretch, but this will be a requirement for Buster.

Azure images can have the rootfs built on Debian hardware, then
uploaded to Azure infrastructure where some last steps are performed.

Discussion of keys and security
-------------------------------

For Secure Boot, we should not store keys in cloud infrastructure.
The plan is for Debian keys to never leave ftp-master - in fact, to
live in an HSM attached to ftp-master.

How far will cloud providers want to go with signing components of
our images? Sign grub, sign the kernel, sign the entire image? We're
planning on the first two so far. None of the providers are yet
requiring a signed bootstrap like this. Signing should not be
required before the Stretch release; work is in progress to get UEFI
Secure Boot for Debian, and we have a shim submitted to Microsoft for
signing.

Currently our CD images are not signed using an HSM; there's some
work needed to do that, and Steve has related ideas.

Tools for official images
-------------------------

We really want to be able to use one tool set to build on all clouds
if at all possible, but it's understood this might be too late for
Stretch.
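(For concreteness, the archive setup implied by the non-controversial
proposals above - stable plus stable-updates and security, and
nothing else - is just the standard sources.list; a sketch for
Stretch:)

  deb http://deb.debian.org/debian stretch main
  deb http://deb.debian.org/debian stretch-updates main
  deb http://security.debian.org/ stretch/updates main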
We want a test suite to catch bugs in images, and discrepancies
between different images/tools.

We need a *stable* toolset, i.e. we want to be able to reproduce our
images on demand. This may need us to use snapshots to get older
versions of packages. We might get to the point of fully reproducible
builds - i.e. the same checksum - or we might never be able to get
there due to things like different filesystem UUIDs in images. We can
still work in that direction, though.

Is it possible to have just one toolset? Cloud providers might have
different needs, but the providers present didn't suggest that this
would be a problem.

Will the cloud providers use official images?
---------------------------------------------

Amazon does not care; anybody can create and publish images. Anything
above the provided VM is the customer's responsibility; buyer beware.

In GCE, the default image is currently Debian. But whether they will
use a Debian official image or their own is not known yet. They do
care that the image runs well in their cloud.

Azure has endorsed distributions, and a suite of tests to validate
images.

Unattended-Upgrades?
--------------------

Should we include unattended-upgrades in the images? The user
experience from Debian desktop installations is to *not* have
unattended upgrades, but the cloud is not the desktop. GCE Debian
images include unattended updates; their customers expect updates,
e.g. for ssl, the kernel, etc.

Should this be an option on a running instance? Or should there be
two images: minimal without unattended-upgrades, and base with it?
If unattended-upgrades is enabled, some people need the ability to
disable it, e.g. for database servers, Tomcat, etc. It can take a
long time to stop/start services during a security upgrade, e.g. for
glibc. Control via cloud-init could be an option, with the ability to
set it through user data. Or, unattended-upgrades has a configuration
file that could be managed via debconf.

General agreement that unattended-upgrades is the right way to go *by
default*, so long as people know how to disable it for their own use
cases. For consistency, if we want to use unattended-upgrades, this
should also be in Debian-Installer, with default answer "YES".

*** ACTION: Sledge to push unattended-upgrades on debian-devel.

Potential issue: non-https may leave users vulnerable to traffic
analysis revealing which packages they have installed (which might be
problematic in some jurisdictions).

Provider-specific agents?
-------------------------

What should we do about official images including provider-specific
agents? They should be included, assuming that the agents are in
main. This is going to lead to specific images per provider - that's
fairly obvious now.

Other config
------------

There are a few other things that are changed in some of our existing
cloud images, not very consistently:

 * Amazon images have some tweaked kernel settings to extend the
   range of TCP ports to use as ephemeral ports (see the sketch
   below). This can allow individual cloud instances to act as more
   effective servers.
 * Misc performance tuning settings: per cloud, but even per instance
   type in some cases.
 * Disabling IPv6 can help in some cases where a cloud provider
   doesn't support it, reducing delays and timeouts. But it can
   confuse some users.

In future, we must document what we're changing *and why*: because of
users' needs, because of cloud provider technical needs, etc. We'd
like to be consistent across our images where possible; where there
are strong reasons to make changes, then (again!) these must be
documented.
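(As an illustration of the ephemeral-port tweak mentioned above, a
sysctl drop-in along these lines would widen the local port range;
the file name and values here are illustrative, not the actual
settings from the Amazon images:)

  # /etc/sysctl.d/30-ephemeral-ports.conf (hypothetical)
  # Widen the range of local ports available for outgoing connections
  net.ipv4.ip_local_port_range = 1024 65535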
Should we have SSH installed by default? Yes!

Legal and trademark issues
--------------------------

There have been quite a number of conversations in this area over the
years. There are obvious potential risks associated with unofficial
"Debian" cloud images. They could be infected with trojans or other
malware, or even simply badly tuned, with various problems. Each of
these risks diluting the Debian name. Our trademark needs to be
protected here.

What names/descriptions should we use, and allow other people to use?
"Official Debian" and "related to Debian", or "Debian-based"?
"Debian provided by XXX" or "Debian provided by Debian"?

If we're making official Debian images, then we don't want to stop
others from making their own Debian images, nor from distributing
them. This is Free Software, after all. However, what we *do* want is
for users to know what people have done to make those images, and
that information should be clearly available. Where any "Debian"
image differs from the official images, the list of changes should be
clear.

Testing Cloud images
--------------------

We're agreed that we need test suite(s) to test the images we make.
https://wiki.debian.org/Testing%20Debian%20Cloud%20Images

We don't have one yet, and we need one. We had one for Debian CDs
(from GSoC), but it has rotted over the years. We need tests to make
sure:

 (a) images work
 (b) they meet our policies

See gobby.debian.org /Sprints/CloudSprint2016/TestIdeas for an
*initial* set of test ideas.

Docker image
------------

Official Docker images should follow the cloud images policy, and
should also be built on Debian infrastructure. We should contact the
current author of the "Debian" Docker image (Tianon Gravi).

Versioning 'Cloud Images' and rebuilds policy
=============================================

When do we build and rebuild?
-----------------------------

On top of worrying about unattended-upgrades, we also need to
consider when we trigger rebuilds of our published cloud images. Even
if users will be getting updated packages shortly after boot, it's
much more efficient if large numbers of instances (millions?) are
already up to date and don't need to install updates. Plus, the
kernel and core packages can't be easily updated without a reboot -
for a cloud instance without state, that doesn't help!

So, we need to rebuild the stable images (at least) when there are
critical holes that need patching, e.g. fixes for things like Dirty
COW and heartbleed. Or should we rebuild on every change? Should we
build daily? Quite a lot of discussion here...!

As the cloud images are going to be fairly minimal (by definition!),
we shouldn't expect to see (comparatively) very many updates to the
packages included. Considering the cost of updates on a large number
of instances, we should just rebuild our stable images on *all*
changes, at least once per day. Steve already has a cron job which
(daily) checks our Openstack image to see if there are any security
updates needed, and we could do similar for other images too.

Going further, simply rebuilding all the images daily is not a
problem if we want to, and might be useful to help find problems,
even if we don't publish the output. The difference is in signing the
images. For the *release* images, a human signs them to give some
assurance that there's a minimum level of quality. On daily/weekly
testing installer images, we're using a different key to sign things
at the moment, and that's just a local key on pettersson, the current
CD/image build box. That should be fixed...
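(Going back to the daily update check mentioned above: a minimal
sketch of that kind of check - not Steve's actual cron job - assuming
a raw image with a single partition starting at sector 2048 that can
be loop-mounted on the build box:)

  #!/bin/sh
  # Hypothetical daily check: does the published image need a rebuild?
  set -e
  mount -o loop,offset=$((2048*512)) debian-image.raw /mnt
  chroot /mnt apt-get update -qq
  # Simulate an upgrade; any "Inst" lines mean a rebuild is due
  chroot /mnt apt-get -s upgrade | grep '^Inst' \
      || echo "image is up to date"
  umount /mnt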
*** ACTION: Sledge to start organising an HSM for the CD build
    machine.

We need automated testing: from experience, manually checking images
for a few weeks just leads to boredom, and then problems stop being
caught after a while.

How do we version/label our images?
-----------------------------------

Current versions published are MAJ.MIN.BUILDNUM - we use the last
component in case of errors in the build process. We could also use
timestamps for that (YYYYMMDD or similar). Some discussion about
that. To make it easier for users to see when an image has been
built, we agreed that we should include the timestamp.

*** ACTION: Sledge to switch the version of the openstack image to
    include a timestamp and look at adding a changelog. CVEs fixed
    should be included in the image's changelog.

We had some more discussion about labelling in AWS and GCE. It's
clear that codenames are sometimes difficult for users to follow, and
we should make a point of using proper version numbers everywhere.
Codenames are OK, but as *extra* metadata at best. debian-cd
deliberately does not use version numbers on testing CDs (to make it
easy for people to distinguish them from stable images); we're
planning to continue this for cloud image builds too. We may end up
with potentially long image names, but that's fine. Names should be
searchable, but typically don't need to be typed.

SDKs and access to platforms
----------------------------

There's a common issue here - all of the cloud platforms have
rapidly-evolving SDKs which don't work well with getting them
packaged and working for Debian stable releases. This *is* fixable...

What used to be the "volatile" archive is now "updates". Currently
only a small number of packages go through that; we need to follow
that route too. This isn't necessarily going to be easy - there are
several different cloud providers and potentially many different sets
of language bindings for the APIs. Should we just package things, or
also plan to keep them current in -updates? Should the Cloud team
take a more active role in maintaining packages? We may have to, if
this is going to work in general.

As a separate problem, packaging is hard work for people outside of
Debian, like the cloud providers themselves. It's doable for them to
keep track of Debian policy etc. for one distro, but it's much harder
to work sensibly across distros and have things work well. For
example, Google are currently using FPM to help provide packages of
their tools and SDKs for a wide variety of distros. The packages are
hacky and horrible and would never meet Debian standards, but they
*basically* work.

There's real tension here - Debian wants high-quality packages that
integrate well. We take care that all (or at least most) of our
packages work well together and don't conflict with each other, and
to know what is required (and provided) by packages. Debian policy is
a market differentiator, and our tools are there to enforce policy
and help with that. There are potential new solutions such as snap to
"solve" (or work around) packaging problems, but they're not great -
they typically side-step the issues by just bundling everything.

If we want to get all the pieces into Stretch, time is running out.
What do we do later, with updates? An additional apt source for
images is not possible - we've already agreed that's not allowed for
official images. This would also cause problems in the future if some
provider decided to put a new version of glibc in their local repo.
Backports is also not a sensible option for updates here.
It allows for newer versions of software to be used, but there can be
delays if things get caught up in library transitions in
unstable->testing.

*** ACTION: Sledge to talk with the Release Team at the Cambridge
    Mini-DebConf to see how we can work together in the best way.

We'd like to have the various cloud provider agents in Stretch if at
all possible; there's still a lot of work to be done yet, so it's
going to be tight. For some providers, it might cause big problems to
have old, stale versions of their current software in stable - they'd
rather not have them in stable at all. In those cases, we'd recommend
that the packages still get uploaded, but are kept from migrating by
opening an appropriate RC bug. Example: cloud-init is broken in
stable (but we're hoping to get that fixed).

Assuming this all works, we will then build official images using
stable and stable-updates.

So, where are the providers up to with SDKs etc.?

 * In AWS, the basic stuff is already done - all they *need* is
   cloud-init. The rest of their agents etc. are nice to have, but
   optional.
 * Azure SDKs are similar - mostly convenience, but really nice to
   have.
 * GCE has a weekly release cadence for SDKs, but backwards
   compatibility is preserved. Old SDKs do not break, but get no new
   features (e.g. new regions).

As a start here, solving the packaging problems for the daemons
should help with solving the issues with SDKs. The agents normally do
not depend on the SDKs - they are feature-dependent, but not
code-dependent.

For Azure, some of their SDKs are already packaged by the "Azure
team" - see
https://qa.debian.org/developer.php?login=pkg-azure-team@lists.alioth.debian.org
We need to reach out to them to share efforts.

We strongly encourage cloud providers to provide help with SDKs and
packaging!

Thursday 2016-11-03
###################

We pondered splitting up into different working groups at this point,
but it seemed that almost all the people present were interested in
both build tools and testing ideas, so there was no point in
splitting up!

Build tools
===========

What are we wanting to build and publish?
-----------------------------------------

 * We want to build images that will be useful for the ordinary user.
 * We want to be building them from a public manifest file /
   configuration.
 * We expect to be publishing build logs for those images, alongside
   / linked near the images themselves.
 * Publish images and logs on "images.debian.org" /
   "cloud.debian.org" and *also* upload them to the cloud providers
   to be directly useful there as well.
 * (As already said elsewhere) We'll be building our images on Debian
   infrastructure, and doing some (preferably automated) testing of
   them there too.
 * Uploads of official images will be done by humans, after verifying
   test results.

We had some discussion regarding managing credentials for uploads
etc. Should credentials be held on Debian machines? Long-term
credentials are risky; we'll need to be careful about how we manage
things here.

For Debian images we can see 3 target groups:

 * People directly using the images we publish in the cloud
 * People using our images as a base for customisation
 * More advanced users using our tools to make their own images and
   publishing those

The tool(s) we choose need to support all those use cases, and maybe
more in future.

Methodology
-----------

We analysed the different tools that people are already using for
making various images. Sam suggested a taxonomy for classifying the
tools:

 1. Whether the tool uses D-I or not
 2. Whether the tool uses a VM (creates a VM) to work in
 3. Whether the tool has customization hooks/support built in, or
    needs its code hacked on for customization (e.g. bootstrap-vz
    supports customization plugins, whereas
    build-openstack-debian-image is a simpler script which needs
    modifications to change behaviour)
 4. Whether the tool bootstraps into a mounted filesystem, or creates
    a tarball that later tools work with (e.g. for importing into a
    cloud provider).

Almost all the known tools use debootstrap, but some use D-I. This
can make quite a difference to the output, and maybe(?) to the time
taken to run.

As a process: we discussed each of the common tools in detail,
covering all the good and bad points that people had. The following
includes some detail of the technical analysis; it was recognised
that the issues raised were of course likely to be fixable, but this
was an honest evaluation of the current state.

bootstrap-vz
------------

James demonstrated bootstrap-vz running in AWS to generate an image.
The process is: attach a volume, install Debian in a chroot, create a
snapshot, register it.

bootstrap-vz has support for adding plugins, which is good! Simple
tools are great, but image building is not a simple task and it helps
to be able to extend tools. However, there's some disagreement about
how easy plugins are to add / write. There's scope to do lots of
things in plugins, but the interfaces are definitely not as clean as
they could/should be. For example: just adding a subtask is not
enough - you need to add it to the code building manifests. Multiple
plugins already exist, but probably not enough to cover everybody's
needs. There are some strange dependencies. This is fixable, but
difficult for a casual user who may just want to make simple changes
to an existing image.

There are concerns about documentation and error-handling:

 * There's little documentation for the end user, only really for
   developers so far. We want to support end users tweaking and
   rebuilding images too. The man page is "incoherent".
 * It does not give proper error messages, but just gives a Python
   backtrace. This is not uncommon with a lot of Python tools; it's
   very hostile to new users rather than developers.

The code is in Python, but it's messy. There's quite a mix of shell
code still in there, left over from the rewrite: the first version of
the tool was written in shell, then converted later.

*Most* of the configuration data is in the manifests, but not all.
Some of the configuration is directly in the Python code (e.g. NTP
servers), and the base lists of packages to install are hard-coded.
The configuration is not very clean/orthogonal - things like the
username are defined in multiple places through the code base.
There's no templating or inheritance in the manifests, which is a
major concern. The mixture of code and configuration makes
bootstrap-vz hard to audit. We're aiming to build images that are as
consistent as possible, but that will be difficult to do as it
stands. Example: changing the cloud provider changes the list of
installed packages - not via a manifest, but via the bootstrap-vz
code directly, or even via a plugin. The code also makes assumptions
that could cause problems, e.g. building for EC2 assumes that the
build is running on an EC2 instance. (There was a worry about calls
to "pip" to install extra code, but that's just an example plugin to
demonstrate how to use "pip", nothing more.)

There are tests in bootstrap-vz. Yay!
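(For reference, bootstrap-vz manifests are YAML. A minimal EC2-style
manifest looks roughly like this - reconstructed from memory of the
upstream docs, so treat the details as approximate:)

  ---
  name: debian-{system.release}-{system.architecture}-{%y}{%m}{%d}
  provider:
    name: ec2
  bootstrapper:
    workspace: /target
  system:
    release: jessie
    architecture: amd64
    bootloader: pvgrub
    charmap: UTF-8
    locale: en_US
    timezone: UTC
  volume:
    backing: ebs
    partitions:
      type: none
      root:
        filesystem: ext4
        size: 8GiB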
Another concern is maintenance - we'd really like a core tool for
cloud images to be team-maintained. The single upstream for
bootstrap-vz (Anders) has declared that he doesn't want to maintain
it any longer; the cloud team would probably need to become the new
upstream.

Currently we know that bootstrap-vz is used by AWS, GCE and Oracle.
It's not used for Azure or Openstack images so far. Why? There's a
dislike of some of the design ("bootstrap-vz is a lot of magic") and
for some things simple shell scripts and a chroot are easier to
follow and audit. But shell scripts are much harder to extend
scalably and to add easy customisation to.

Plugins can do anything in the bootstrap-vz codebase - they are code
and can directly modify all kinds of state. bootstrap-vz itself is a
state machine; it builds a graph of the plugins and travels along it.
There's little uniformity in the existing plugins so far.

build-openstack-debian-image
----------------------------

We briefly looked at this. It's used (obviously!) for building
openstack images, but the current Azure images are also using a
patched version of the script. It's a fairly simple shell script
which wraps debootstrap. That means it's quite easy to follow what's
going on in terms of configuration, *but* it also has little direct
support for customisation. That's a design feature, but it's not what
we really want for a generic tool that will support all our use
cases.

FAI
---

Is FAI an appropriate tool? Thomas presented it, describing the
latest work he has added: fai-diskimage. More details at
http://fai-project.org/

FAI is designed around config spaces and classes. Classes describe
distributions (e.g. Debian, CentOS) and extra features; they have
priorities which allow for inheritance and extensibility in the
configuration. The base class is FAIBASE; all machines (images)
inherit information from this. An FAI build can either generate a
base tarball in each build, or it can be told to use a
(human-generated) cache (basefile) to save time. There is easy
support for customization scripts.

Invocations:

  # fai-diskimage -v -u X -S size -c CLASS,CLASS /disk/file.raw

to build an image, then

  # fai-kvm -Vu 5 /disk/file.raw

to run the image under kvm to test.

FAI has built-in support for a flexible range of partitioning setups.
It does not use kpartx; there is no need for it for now. FAI runs as
root; for official image builds we'd run it in a VM (or container?).
It uses debconf preseeding to configure packages (rather than running
d-i), and it also runs various scripts to control config inside the
target images.

The code is (very) mature, with good documentation. It supports
package installation using apt, aptitude and yum; which one to use is
configurable. It's possible to use shell variables to configure this
and many other things. Configuration can come from the command line,
a database, or other sources.

Classes can be logically joined, e.g. one class might be dependent on
other class(es). Classes have priorities, allowing for inheritance
and/or overrides. There's a hooks directory for hooks that can be
called at various points. Files in the image can be created, deleted
and modified as needed. There's a useful "ainsl" helper that's better
than "echo" for lots of things. Files can also be generated from
templates, and trees of files can be injected into target images.

FAI is mostly written in shell, with Perl helper scripts for various
things like partitioning and package maintenance. The design has two
abstraction levels. Code and configuration are cleanly separated, by
design.
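(For a flavour of what the config space looks like: a hypothetical
MYCLOUD class - our example name, not one of FAI's shipped classes -
might consist of a package list plus a customisation script, roughly
along these lines:)

  # package_config/MYCLOUD - packages to install for this class
  PACKAGES install
  openssh-server
  cloud-init

  # scripts/MYCLOUD/10-apt - customisation script run during the build
  #! /bin/bash
  # $target is where the image's filesystem is mounted while building
  ainsl $target/etc/apt/sources.list \
      "deb http://deb.debian.org/debian stretch-updates main"
  # copy a class-specific file from the config space's files/ tree
  fcopy /etc/cloud/cloud.cfg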
Classes are typically created by advanced users, then used by
ordinary users. To use FAI well takes some effort to grok the config
space - estimated to take about an hour for most people. FAI is
powerful and flexible; it can do more than disk images, but this is
what we're looking at for now.

Updating FAI to support a new Debian release takes a few hours: you
just need to make a few builds to test all the changes. For a new
deployment or image type, there's a decision that needs to be made
regarding the number of classes to create and/or use.

There's a concern about how to debug and understand the FAI code
itself - do users need to read the code and debug shell scripts?
There is a limited test suite - a task called "test". There aren't
any regression tests, and image testing is basically manual. FAI has
some classes for cloud images; users can also create their own
classes. It's not clear how to do volume resizing.

If we used FAI to build cloud images, it would only be for Stretch
onwards. Nobody wants to switch to a new tool for Jessie images.

*** ACTION: Sam should be able to provide deeper feedback on using
    FAI to build images by the end of next week.

FAI doesn't really have a dry-run mode to see what would happen in an
image build. Also, configuration is split among many files, and we'd
probably end up using many classes. Verifying changes would mean
building with the appropriate options and checking whether the build
runs successfully. UEFI and GPT have not been tested yet.

vmdebootstrap
-------------

vmdebootstrap is a much simpler tool. It has a great test suite. Most
of the control is via the command line, similar(?) to pbuilder
(distribution, mirror, architecture). There are multiple wrapper
scripts provided as examples for various image targets. There's scope
for one "customise" script to be called after most of the work is
done, to tweak the image output.

vmdebootstrap uses debootstrap for its core functionality; this can
be problematic in terms of resolving dependencies when adding more
packages.

Right now, vmdebootstrap is probably too simple to be used as the
tool for building images. It could maybe be used as a basis, in
conjunction with other scripts. Sam has written a Python3 library to
help with that, but more tools would be needed too. We've already
agreed that we want easy customisation for our image building
toolchain; there is little of that here yet.

The test suite is written in "yarn", a tool with its own meta
language. It looks very good - could we extract it and use it to help
test other code?

All of the configuration for vmdebootstrap needs to be provided as
command line parameters; the config file is just a shell script
wrapping it. vmdebootstrap has useful support for building images for
foreign architectures, using the --foreign argument; it uses qemu
with a binfmt handler to run any foreign binaries.

The vmdebootstrap authors might not want to extend it to suit our
needs; they want to keep it simple. We might be able to work with a
wrapper, but this would mean that we would need to write quite a lot
of code and work around limitations. Overall, vmdebootstrap is
ill-suited, but its test suite is great. There's very little
auditability - it just depends on debootstrap to do the right thing.

Conclusions
-----------

Of all of the tools considered, FAI seems to be the best fit for our
needs. This came as a surprise to a few people, but there were no
objections. Multiple users (and developers) of other tools are open
to evaluating FAI.
General concerns about tools and images
---------------------------------------

We want a generic tool to build images. It needs to work reliably,
and it must be Free Software in Debian. There was some discussion
about whether the tool should work with non-free software too, and we
agreed that's fine *as an option*, but for our normal usage it should
only be pulling packages from the Debian main archive. If the tool
itself needs something from outside of Debian main, that prevents us
from calling it "official". As a general principle, a tool can be in
main if it can be used to build DFSG-free images. If users want to
use it to build non-free images, that's their choice.

An example raised here was ANI on EC2 - Advanced Networking. It's not
in the archive (yet). At the moment that means Debian is not on a par
with other operating systems that do provide that feature. We're
potentially telling users to use another OS instead if they need that
feature, but this is nothing new.

cloud-init
==========

Version 0.7.8-1 has been uploaded and is currently in unstable and
testing. It fixed a few bugs, including some issues with the fragile
test suite.

Upstream are Ubuntu people. In theory, cloud-init is good for
providing uniformity between distributions. But there are different
versions and forks shipped in distributions, so that uniformity is
not necessarily real. There appear to be major issues with the
Canonical CLA that's in place, and a number of people are not getting
their patches upstream.

Google don't like cloud-init as it's slow: it typically adds 5
seconds to boot time, and that matters to them. Their own Debian
Jessie image without cloud-init can boot up in ~3 seconds. They are
not eager to add cloud-init, at least until there is uniformity.
Upstream does not really care. Current cloud-init is ~20k lines of
Python. It's object-oriented, which probably adds overhead too.

So, we have a question: is it time to fork to work around the
Canonical CLA? Maybe - time to talk with other users and see if we
can consolidate patches etc.

There's a forked rewrite of cloud-init in Go which is claimed to be
much faster. Should we use that version instead, maybe? It's
maintained by CoreOS people, and we're not sure it would be any
better. Upstream are accepting patches, but seem to have gone quiet;
we suspect they're pushing for a replacement tool called "ignition"
instead. There are also concerns about having Go-based programs in
base: they're compiled but statically linked. There is a possibility
to dynamically link, but only on amd64. :-(

Overall, cloud-init is designed to control the configuration of cloud
images at runtime. It has a lot of functionality, and any replacement
should offer all that functionality if it's going to be successful.

*** ACTION: Bastian, Marcin and Jimmy volunteer to be a new upstream
    team for cloud-init in case it comes to that.

*** ACTION: Various cloud platform folks will talk to Scott (Ubuntu
    upstream) about moving to this new upstream to work around the
    CLA that's blocking people.

There's also an open debian-release bug asking for an updated version
of cloud-init in the next point release
(http://bugs.debian.org/789214). We should look into that too.

Test Suites
===========

We've already agreed that we want a test suite, and it should be
automated as much as possible. It's time to start thinking about the
details!

What do we want to test? We want to check that things match our
policies. What are they?
We came up with an *initial* set of tests in a separate gobby doc,
attached in the wiki [4].

[4] https://wiki.debian.org/Sprints/2016/DebianCloudNov2016?action=AttachFile&do=view&target=TestIdeas.txt

How do we manage tests? We build the images, and the cloud providers
will run their own tests, but we shouldn't be waiting on third
parties to test our images - that makes no sense. We will need to be
running our own tests first. However, we should get information when
a cloud provider's tests fail - and then add a matching case to our
test suite. We might not get the exact source code of their tests,
but we should at least get descriptions, so that we're able to
reproduce them. For Stretch, we may only have a minimal set of tests;
we might have to rely on cloud providers for running tests then.

There are two things we need to think about: the tests themselves and
the framework to run them. The framework may depend on the cloud
provider. Tests should cover different variants of images too - on
different machine types, with various disk options, etc.

To run the test suite on various clouds, we'll need the ability to
programmatically register an image, start an instance, etc. For this
we'll need SDK or API access from the chosen language. For AWS we
have libraries, e.g. boto. For Google we don't have the SDK in Debian
yet. There is Apache libcloud, or the Google-provided projects on
github: https://github.com/GoogleCloudPlatform . The Google SDK, for
which we have an RFP, is command-line tools, not an API access
library: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=759578

Versions and updates (again)
============================

We'll build images for stable and point releases. After important
fixes we should rebuild - but this might be problematic: on AWS we
need to give out a new AMI ID each time. We'll want not only security
updates, but also feature updates for agents and/or SDKs. So we don't
need to keep all of the images in lockstep, but then we'll need to
provide a detailed changelog for the images, to let users know why
there are differences in timestamps. We want to update kernels.
Feature releases via updates make it more probable for things to
break. Also, having dozens of instances just spinning while updating
may mean measurable costs for users.

We agreed to do weekly builds for testing, with weekly and monthly
retention, to avoid having too many images. Should we publish all of
them on the various clouds? Different providers might have different
meanings for what it means to have an image published - let's see how
that works out.

When we have better testing and building automation, it'll make it
easier for users to build images too, reproducing what we're doing.

Do we need special tools for transforming images into the formats
acceptable to cloud providers? We believe qemu should be able to deal
with most of them.

Current and future architectures
================================

At the moment most people are only targeting amd64. What other
architectures are out there? There are some ARM clouds
(e.g. Scaleway). IBM has a Power cloud. Linaro has arm64 images for
developers. Steve promises to have an OpenStack arm64 cloud before
Stretch. Currently there's only a limited number of architectures out
there, but there might be more in the future. As long as we have
generic tooling (and Debian has many architectures) we should be able
to deal with that.

How do we boot arm64 cloud images? UEFI with grub-uefi is the correct
way for arm64 servers.

Do we want to have multi-arch enabled by default? It should be
customizable, i.e.
users should be able to generate such an image. We checked with the
providers, and all said that users are not asking them for 32-bit
support. Maybe we should mirror 32-bit repositories too for the cloud
Debian mirrors?

Aside: we might want to have some policy about sunsetting some of the
architectures. 32-bit architectures are getting less relevant; we've
already seen that some packages just won't build on 32-bit machines
any more. This is a wider issue in Debian...

Human language support and localised images
===========================================

By default we're using the "C" locale for all images. How do we allow
users to switch language, e.g. through cloud-init?

We have quite good language support, but do we need to have it in the
cloud? The general feeling is that for many people this won't matter
too much, but it's difficult to know. We really expect that Chinese
users will care more than non-English European users, for example.

Do our current images have all locales installed? It doesn't look
like it. We should probably be installing locales-all at least.

At least some of the cloud providers provide a non-English UI (web
frontend) for selecting images and starting instances. What do other
distros do here? Amazon Linux includes localizations and the ability
to change language through cloud-init: it's just one image with many
different locales. Ubuntu is English-only.

*** ACTION: James to ask on the mailing list(s) if anybody wants more
    languages for the cloud image(s).

Other things to consider: which mirror to use? We need to consider
this for users in China, for example.

Sam suggested that we could ask some questions during first login.
This would break automation, but might it be useful for some (really
specific) needs? We agreed that this should not happen by default!
It's probably not worth the engineering effort.

Friday 2016-11-04
#################

Supporting services (for platforms, mirrors, finding things)
============================================================

There's a whole host of supporting services that need to work for
cloud images to work well...

Mirrors
-------

The providers aren't sure what to do in terms of mirror support for
their cloud images. The http redirect mirror httpredir.debian.org is
deprecated. When somebody has thousands of instances all trying to
auto-update in parallel, we'd like to avoid killing mirrors! The
right answer might vary by provider, of course.

 * Google have very fast connections from their cloud sites, such
   that external mirrors are often faster than internally-provided
   ones. This was a surprise!
 * Azure has an internal mirror network, with 25TB of storage.
 * Amazon have a mirror CDN, with caching headers set up with
   different expiry times to deal with different refresh needs
   (e.g. the packages themselves (*.debs) vs. Packages files).

Both the ordinary mirrors and the security mirrors need covering.

The best solution is almost certainly the one that works well with
the lowest maintenance possible; there is not enough manpower to do
much more.

James has tried to use S3 for the internal Amazon mirrors. It was
really slow to populate (6h); that's why he moved to a CDN instead.
It used to require extra instances to rewrite headers, but now
CloudFront can set TTLs directly? A Google Cloud CDN could serve
content from inside GCE, or there could be a simple instance with a
redirect? Disk space is not really a concern for mirroring. Low
maintenance and monitoring are much higher priorities.
The mirror's trace file should be monitored to see whether we're in
sync; the official mirror script updates it last, so it can serve as
a canary. Official mirrors see 4 pushes a day. For signature
checking, signatures cannot be older than 10 days - so mirrors must
sync at least every 7 days.

deb.debian.org is the recommended replacement for
httpredir.debian.org, backed by 2 commercial CDNs (Fastly and
CloudFront/Amazon). Stretch's version of apt can directly use the CDN
behind it, without needing a redirect (as was needed by older
versions of apt). Fastly has a peering connection to GCE; Bastian can
pass on the documentation he has.

Martin showed some traffic stats for ftp.debian.org; it has a 10G
connection on a university network and is seeing roughly 200Mbps on
average. James shared his setup of Apache for redirects and expiry
times; the doc is attached to the wiki page [5]. We looked at the
stats for the AWS mirror traffic. It's seeing 500 requests per minute
to the interception header host, with 1TB of transfer per day. There
are details about which files are requested the most - maybe we could
get useful information from those statistics?

[5] https://wiki.debian.org/Sprints/2016/DebianCloudNov2016?action=AttachFile&do=view&target=CloudFrontProxyInterceptionConfig.txt

We also have a CDN with HTTPS enabled, using a security-cdn
certificate. Once Google has set up a CDN, we can integrate it into
our network too.

People are still worried about Hashsum mismatch problems from apt;
for now we need to configure small TTLs in the CDNs to avoid clients
getting stale index files. The problem will disappear with the
Stretch version of apt - yay!

Finding things
--------------

We looked at the options here. Ubuntu's image finder is a great
example of how to do things: cloud-images.ubuntu.com. They have a
home page with links to all images of all releases. They also have
manifests for the various architectures: amd64, IBM Z, arm32, etc.
The link to AWS goes directly to the launch wizard, which is quite
useful. The Oracle link went to the marketplace instead, but that
might be because we were not logged in to Oracle cloud when testing.

We should also provide JSON details of our images so users can
automate working with them.

*** ACTION: Martin to ask Colin Watson if the code for the Ubuntu
    image locator (cloud-images) is available, or Martin will write a
    similar thing for Debian.

What do users expect from an image finder page? A list of cloud
providers? Distributions and versions? We should have a separate page
with stable and daily images, with logos at the top to make it easier
for users to see what we offer.

Should we try for Debian support in Juju? Should we provide base
files, pre-generated when we build images? It would make life easier
for users not on Debian systems.

Better handling/publishing/advertising of cloud images by Debian
----------------------------------------------------------------

We need some nice(!) web pages for showing our images. An example for
parsing JSON files:
https://msdn.microsoft.com/library/cc836466(v=vs.94).aspx

What do we need other than an image finder? James publishes AMI IDs
via signed emails and in the Debian wiki
(e.g. https://wiki.debian.org/Cloud/AmazonEC2Image/Jessie). But
anybody can edit the wiki and change the IDs to something malicious.
Should we lock down those wiki pages so that only certain people can
edit them? An image finder should help here too.

We agreed that we should set up cloud.debian.org as another alias for
{cdimage,get}.d.o. Maybe we should register debian.cloud too
(i.e. in the .cloud TLD)?
Finding things
--------------

We looked at the options here. Ubuntu's image finder is a great example of how to do things: cloud-images.ubuntu.com. They have a home page with links to the images for all releases, and manifests for the various architectures: amd64, IBM Z, arm32, etc. The link to AWS goes directly to the launch wizard, which is quite useful. The Oracle link went to the marketplace instead, but that might be because we were not logged in to Oracle cloud when testing. We should also provide JSON details of our images so that users can automate working with them.

*** ACTION: Martin to ask Colin Watson if the code for the Ubuntu image locator (cloud-images) is available, or Martin will write a similar thing for Debian.

What do users expect from an image finder page? A list of cloud providers? Distributions and versions? We should have a separate page with stable and daily images, with logos at the top to make it easier for users to see what we offer. Should we try for Debian support in Juju? Should we provide base files, pre-generated when we build images? That would make life easier for users not on Debian systems.

Better handling/publishing/advertising of cloud images by Debian
----------------------------------------------------------------

We need some nice(!) web pages for showing our images. Example for parsing JSON files: https://msdn.microsoft.com/library/cc836466(v=vs.94).aspx

What do we need other than an image finder? James publishes AMI IDs via signed emails and in the Debian wiki (e.g. https://wiki.debian.org/Cloud/AmazonEC2Image/Jessie), but anybody can edit the wiki and change the IDs to something malicious. Should we lock down those wiki pages so only certain people can edit them? An image finder should help here too.

We agreed that we should set up cloud.debian.org as another alias for {cdimage,get}.d.o. Maybe we should register debian.cloud too (i.e. the .cloud TLD)? We looked - it's reserved, and we assume we'd need to speak with SPI about trademark issues.
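As an illustration of why machine-readable image details matter, here's a sketch of the kind of lookup a user could then script instead of scraping wiki pages. The manifest URL and JSON schema below are entirely hypothetical - we haven't designed either yet:

#!/usr/bin/env python3
# Resolve "the newest Debian image for a given release/provider/region"
# from a (hypothetical) published JSON manifest.
import json
from urllib.request import urlopen

MANIFEST_URL = "https://cloud.debian.org/images/manifest.json"  # hypothetical

def latest_image(release, provider, region):
    with urlopen(MANIFEST_URL) as resp:
        images = json.load(resp)
    # Assume each entry looks something like:
    #   {"release": "stretch", "provider": "aws", "region": "eu-west-1",
    #    "image_id": "ami-...", "published": "2016-11-04T12:00:00Z"}
    candidates = [img for img in images
                  if img["release"] == release
                  and img["provider"] == provider
                  and img["region"] == region]
    # ISO 8601 timestamps sort correctly as strings
    return max(candidates, key=lambda img: img["published"])["image_id"]

if __name__ == "__main__":
    print(latest_image("stretch", "aws", "eu-west-1"))

Publishing something like this alongside the human-facing finder page would also sidestep the wiki-tampering worry, as the manifest could be served over HTTPS from Debian infrastructure and signed like our other release artefacts.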
Getting updated packages into Debian - stable updates, -updates, backports
--------------------------------------------------------------------------

How does clamav get on? In the days of volatile, updates were pushed through without many problems. There are many possible paths for packages: jessie, jessie-proposed-updates, something else? For fast-moving things, we should be pushing to stable-updates. That will hit the proposed-updates queue for the Stable Release Managers to work through and approve if things look OK.

Should the cloud team salvage some of the relevant packages, like python-boto and boto3? Clearly, if we care about things and they need attention, then this makes sense. python-libcloud is in the archive, and it seems to be quite recent. There are also Azure CLI packages in NEW. We should contact the maintainer about joining the Debian cloud team, and maybe encourage them to set the Maintainer field to the team, including for the related python-azure and python-azure-storage packages.

*** ACTION: Tomasz to contact the *boto* maintainers asking about an update before the freeze, backporting, and maybe salvaging the packages.

Website changes to better promote Debian Cloud images
-----------------------------------------------------

This clearly needs to happen! At the moment it's difficult to find any information.

*** ACTION: James will try to do something with the website after getting access. Manoj offers to help too.

AOB
===

We went through the list of cloud-related bugs in the BTS, marking lots closed and following up on others.

We agreed that we should add locales-all to the list of packages we install by default.

*** ACTION: Sledge to write up policy for official images and post it on the website (#706052)

What should we do about things like non-free GPU drivers and other drivers that won't go upstream? We could add non-free unofficial cloud images, like we already have for installer and live images. We definitely also need to make it easier for people to build their own images with this kind of change. We agreed that we should also build the extra non-free images, and make sure people can find them. Like other non-free Debian images, we need to accompany them with appropriate warnings that non-free is bad. They should also *NOT* be listed directly in the same image search area, etc., but maybe in a second area which is linked.

Wrap-up
=======

We were generally amazed at how productive we'd been in three days. We worked together very well and made far more progress than anybody could possibly have expected/hoped. There's still work to do, of course, but we left happy and prepared to work together more in future. Steve Z has already offered to host another similar meeting next year at Microsoft. :-)

ACTIONS SUMMARY
###############

Yes, Sledge is silly and accepted too many action items here... :-)

*** ACTION: Sledge to push unattended-upgrades on debian-devel.
*** ACTION: Sledge to start organising an HSM for the CD build machine.
*** ACTION: Sledge to switch the version of the openstack image to include a timestamp, and to look at adding a changelog; CVEs fixed should be included in the changelog of the image.
*** ACTION: Sledge to talk with the Release Team at the Cambridge Mini-DebConf to see how we can work together in the best way.
*** ACTION: Sam should be able to provide deeper feedback on using FAI to build images by the end of next week.
*** ACTION: Bastian, Marcin and Jimmy volunteer to be a new upstream team for cloud-init in case it comes to that.
*** ACTION: Various cloud platform folks will talk to Scott (Ubuntu upstream) about moving to this new upstream to work around the CLA that's blocking people.
*** ACTION: James to ask on the mailing list(s) if anybody wants more languages for the cloud image(s).
*** ACTION: Martin to ask Colin Watson if the code for the Ubuntu image locator (cloud-images) is available, or Martin will write a similar thing for Debian.
*** ACTION: Tomasz to contact the *boto* maintainers asking about an update before the freeze, backporting, and maybe salvaging the packages.
*** ACTION: James will try to do something with the website after getting access. Manoj offers to help too.
*** ACTION: Sledge to write up policy for official images and post it on the website (#706052)

We should follow up on progress on these items ASAP...

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
"Since phone messaging became popular, the young generation has lost the
 ability to read or write anything that is longer than one hundred and
 sixty characters."  -- Ignatios Souvatzis