Re: Publishing raw generic{,cloud} images without tar, and without compression, plus versionning of point releases

To: debian-cloud@lists.debian.org
Subject: Re: Publishing raw generic{,cloud} images without tar, and without compression, plus versionning of point releases
From: Ross Vandegrift <rvandegrift@debian.org>
Date: Mon, 25 May 2020 08:43:44 -0700
Message-id: <20200525154344.dercwputtko4agb6@vanvanmojo.kallisti.us>
In-reply-to: <c0c42777-5867-7b00-27dd-3b217f50378a@debian.org>
References: <16022b71-c5e8-e8ef-9b9e-076d71072b70@debian.org> <20200524213925.vjacbyeeaqh5kc6p@shell.thinkmo.de> <c0c42777-5867-7b00-27dd-3b217f50378a@debian.org>

On Mon, May 25, 2020 at 02:21:48AM +0200, Thomas Goirand wrote:
> On 5/24/20 11:39 PM, Bastian Blank wrote:
> > On Sun, May 24, 2020 at 11:26:40PM +0200, Thomas Goirand wrote:
> >> So I was wondering if we could:
> >> 1/ Make the resulting extracted disk smaller. That'd be done in FAI, and
> >> I have no idea how that would be done. Thomas, can you help, at least
> >> giving some pointers on how we could fix this?
> > 
> > Fix what?
> 
> The fact that the raw image is 2GB once extracted, when it could be
> 1/4th of that.

I don't think it's obvious how to do better.  The only ways I know to
make a raw image smaller than its fs are:
  1) sparse files
  2) compression

FAI is using #1, and you want to avoid #2.  Do you know another way?
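
To make option 1 concrete, here is a minimal sketch of a sparse file,
assuming Linux and a filesystem that supports holes (ext4, xfs, btrfs and
tmpfs do).  The helper names and the 64 MiB size are made up for
illustration; this is not how FAI builds its images.

```python
import os
import tempfile


def sizes(path):
    """Return (apparent_bytes, allocated_bytes) for path."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512


def make_sparse_image(path, size):
    """Create a file of `size` bytes with real data only at the start."""
    with open(path, "wb") as f:
        f.write(b"boot")   # a few real bytes
        f.truncate(size)   # extend without writing: the rest is a hole


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.raw")
        # 64 MiB stand-in for a 2 GB raw image.
        make_sparse_image(image, 64 * 1024**2)
        apparent, allocated = sizes(image)
        print(f"apparent:  {apparent} bytes (what ls -l reports)")
        print(f"allocated: {allocated} bytes (what du reports)")
```

On a filesystem with hole support, the allocated figure should come out
far below the apparent one.  That gap is the only saving a raw image gets
without compression.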

> >> 2/ Publish the raw disk directly without compression (together with
> >> its compressed form), so one can just point to it with Glance for
> >> downloading. BTW, I don't see the point of having a tarball around the
> >> compressed form, raw.xz is really enough, and would be nicer because
> >> then one can pipe the output of xz directly to the OpenStack client (I
> >> haven't checked, but I think that's maybe possible).
> > 
> > No. Nothing in the download chain supports sparse files, so unwrapped
> > raw images are somewhat out of the question.
> 
> I've done this for 3 Debian releases [2], I don't see why we would lose
> the feature because of a "sparse files" thing which you somehow find
> important.

I think Bastian's point is that tar is required to enable downloading
the sparse files, since a plain HTTP download can't represent the holes.
Without it, you need to transfer the full size of the fs.
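
A rough illustration of why the holes don't survive a plain download,
assuming Linux: anything that reads the file sequentially gets the holes
back as ordinary zero bytes, so the full apparent size goes out on the
wire.  The function name and the 64 MiB size are made up for the example,
and a real web server is more sophisticated than this loop, but the byte
count is the same.

```python
import os
import tempfile


def streamed_bytes(path, chunk_size=1024 * 1024):
    """Count the bytes a simple read-and-send loop would transmit."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Holes are returned as plain zero bytes, indistinguishable
            # from zeros that were actually written.
            total += len(chunk)
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.raw")
        with open(image, "wb") as f:
            f.write(b"data")
            f.truncate(64 * 1024**2)   # 64 MiB apparent size, mostly a hole
        on_disk = os.stat(image).st_blocks * 512
        print(f"allocated on disk: {on_disk} bytes")
        print(f"sent to a client:  {streamed_bytes(image)} bytes")
```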

I checked one of the older OpenStack images you linked to.  It behaves
just like the FAI raw images, as far as I can tell:
ross@vanvanmojo:~/tmp$ curl -L -o disk.raw https://cdimage.debian.org/cdimage/openstack/archive/8.0.0/debian-8.0.0-openstack-amd64.raw
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   361  100   361    0     0    512      0 --:--:-- --:--:-- --:--:--   512
100 2048M  100 2048M    0     0  11.0M      0  0:03:06  0:03:06 --:--:-- 10.4M
ross@vanvanmojo:~/tmp$ ls -lh disk.raw
-rw-r--r-- 1 ross ross 2.0G May 25 07:57 disk.raw
ross@vanvanmojo:~/tmp$ du -h disk.raw
2.1G    disk.raw

Did I miss something?

> So what you're talking about is just having a sparse *temporary* file,
> before the upload to Glance. Do we care, when what I'm proposing is to
> get rid of this extra step of downloading, before uploading to Glance?

Is avoiding the extra download step more important than reducing the
image size?  Your first mail raised both issues, and FWIW, I thought you
were mostly concerned about the size.

To avoid the extra download for Glance, maybe it makes sense to use the
upload stage of the pipeline.  We could treat the generation of the
preferred format for OpenStack like we treat the EC2 registration step,
for example.

> >> Another thing which bothers me, is that in our current publication,
> >> there's no way to tell what image is from which point release.
> > 
> > What is the significance of that?  We use stuff from security primarily,
> > so the point release don't show what might be in the image.
> 
> Of course the point releases show what will be in the image. For
> example, if a cloud user spawns a new instance using an image which is
> from the latest point release, he knows a bunch of (non-security fixed)
> packages won't need upgrades (for example, at least base-files, but often
> many others as well, like for example tzdata).

As a cloud user, I never want to care about point releases.

There's usually a way to identify the latest image of a given release.
For example, on AWS and GCP, the API can search for the latest Debian 10
image.  Many deployment tools integrate this functionality, so I can
always deploy the latest Debian 10 image.
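
The selection logic behind "latest image of a release" is small.  Below is
a sketch under stated assumptions: the records are invented for
illustration and only loosely modeled on the kind of fields an image
listing API returns (a name, an ID, a creation timestamp).  No real cloud
API is called, and the names and IDs are hypothetical.

```python
from datetime import datetime

# Hypothetical image records, invented for this example.
IMAGES = [
    {"ImageId": "img-aaa", "Name": "debian-10-example-a",
     "CreationDate": "2020-02-10T12:00:00.000Z"},
    {"ImageId": "img-bbb", "Name": "debian-9-example-b",
     "CreationDate": "2020-05-20T12:00:00.000Z"},
    {"ImageId": "img-ccc", "Name": "debian-10-example-c",
     "CreationDate": "2020-05-11T12:00:00.000Z"},
]


def created(image):
    """Parse the record's creation timestamp."""
    return datetime.strptime(image["CreationDate"], "%Y-%m-%dT%H:%M:%S.%fZ")


def latest_image(images, name_prefix):
    """Return the newest image whose name starts with name_prefix, or None."""
    candidates = [img for img in images if img["Name"].startswith(name_prefix)]
    if not candidates:
        return None
    return max(candidates, key=created)


if __name__ == "__main__":
    newest = latest_image(IMAGES, "debian-10")
    print(newest["ImageId"], newest["CreationDate"])
```

The caller only names the release ("debian-10") and never has to know
which point release is current.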

I've never used OpenStack though, so I don't know if it has similar
features. 

> Someone may also want to run the image matching a given point release,
> together with snapshot.debian.org (for example, just to test upgrades,
> and many other possible scenarios).

This is a valid use-case, but I don't think we should optimize for it.

By integrating the point release into the version component, a user
would need to know which point release they want.  Currently, using a
Debian 10 image gets you the latest point release.  Instead, you'd need
to know that e.g. 10.2 was out, and was the latest.

I think that's a bad user experience - most users that I work with know
nothing about Debian's release processes.  They'd be confused and
frustrated if they needed to know the point release.  Heck, I don't
know what point release of buster we're on.

> So yes, point release numbers do have significance. Images with a date
> that at first appears random, and reveals its meaning only if carefully
> matched to the point release dates, aren't user friendly at all.
> 
> If I say: Bastian, can you please give me the image from Buster 10.2, it
> will for sure take you a lot of time to find it out. However, look at
> this archive, which has security updates since 8.6.3:

At the last sprint, we discussed building images more frequently to
integrate security updates.  Most in the group thought the complexity of
lots of images outweighed the small benefit of avoiding the security
downloads.

> By the way, why are we keeping a history of 233 daily Bullseye images?
> [1] Is this of any use to anyone?  The CD team builds images weekly,
> why do we need daily images published at the cloud team? And keep them
> forever, when the CD team does not?

At the last sprint we discussed the stable daily builds, and agreed that
it's not worth keeping them (since they mostly end up being identical).
Probably no one has had time to do anything about it.

Testing & unstable aren't so clear - the notes in [1] indicate that we
had more questions than answers.  Have we hit a point where the cost in
disk space is greater than the cost in effort to answer these questions
and fix this?

Ross

[1] - https://gobby.debian.org/export/Sprints/CloudSprint2019/2-%20Building%20images
