
Caching review



I'm in the middle of building a patch for 718225 and am having to think carefully about caching. I think the area could do with a review (or at least a little input from Daniel on my proposals/plans, or agreement with them).

Obviously it's too late to get much improvement in this area into jessie, but knowing the intended direction for a future version would be very helpful in building this patch (as well as in shaping the next version).

(tl;dr summary at end.)

Brief caching background:
-------------------------------------
The set of cacheable downloaded files includes packages (deb & udeb), installer files (vmlinuz & initrd), and distribution information files (Release, Packages.gz and Contents-[arch].gz). Additionally, certain completed build stages can be cached, but this is mainly useful only for development of LB, with the exception of the basic bootstrap chroot, whose caching is required by the current design of the build process.

Caching scenarios include:

  • A single clean build (don't download the same thing multiple times).
  • Re-running a build in a directory where a build has previously taken place, without cleaning it out, so many files can potentially be retrieved from the cache.
  • An offline build, where absolutely everything will and must come from the cache (a variation of building in a previously used directory, and not to be confused with using your own local mirror, a completely different thing!).

One more situation complicates things. The description of the --cache-packages parameter says that disabling it is not recommended, but that in rare setups it is actually faster to re-download (from a local mirror) than to hit the disk.

Offline building:
-------------------------------------
The only reference to offline building I've seen in LB documentation is within the description of the --cache-indices parameter ("would allow to rebuild an image completely offline, however, you would not get updates anymore"), which was introduced in the v1.0 live-helper days. If enabled, a few apt data files are retrieved from a cached copy if they exist, rather than re-installing a few local keys and key packages (if you have any in your config), and an install of aptitude is possibly avoided. To me this param and bit of code are a little puzzling: they add complexity to avoid very little work, and it's not even entirely clear to me how they help "offline" capabilities, unless only in relation to the possible install of aptitude, which surely could be handled much more cleanly. I can find nothing through Google that suggests this parameter is actually needed by anyone, just someone happening to notice it break in an ancient bug report, and a mention of it in an old article.

Furthermore, offline building has actually been broken for at least two and a half years now, ever since "support for including firmware packages automatically" was added in v3.0-a47, unless you disable inclusion of firmware packages (set --firmware-binary and --firmware-chroot to false). That code has always lacked caching support, which prevents offline building. There seems to be no serious use for offline building.

My partially complete 718225 work does actually fix the lack of caching in the 'firmware' related code, so offline building could possibly become workable once again, but should I bother giving it any thought? I'm still trying to figure out how best to use caching in implementing this patch; if we agreed that offline support is unnecessary and should be ditched, it might make things a little less complicated (and --cache-indices could perhaps be ripped out later in line with that).

Update: I just noticed that downloading of mirror 'trace' files (placed in the image as .disk/archive_trace) does not use caching, and this has been in place since v1.0.5-2, so offline building can't have worked since then! (Unless perhaps that download just silently fails.)

No Caching (of package files)
-------------------------------------
(That is, the case where downloading is faster than retrieving from disk.) This is apparently a rarely needed capability, which has existed since around v1.0-a22. I found nothing on Google about it. I'm sure it does work, so maybe there are people using it, but is anyone? Obviously there's going to be a fair bit of disk activity during the build process anyway; this just reduces it a little. Can the existence of this functionality really be justified? It would be nice to remove unnecessary things like this to keep the code cleaner, so perhaps it could be removed in the next version. Does anyone seriously require this?

Freshness
-------------------------------------
For the installer stage of the build process, everything is retrieved from the cache if a copy is already there, except (currently) the Contents-[arch].gz files used to get a list of firmware packages to download. If you check out the installer images (e.g. http://ftp.debian.org/debian/dists/sid/main/installer-amd64/), i.e. the vmlinuz and initrd files used by the install process, you'll see that they are updated rarely, but occasionally. Obviously if you use the daily build, updates are more frequent. There's actually a security risk in using the daily build (inadequate info files to download securely, which my 718225 patch highlights to the user), so perhaps that's to be avoided, but there's still a reasonable need for users to be able to replace their cached installer files.

As for the distribution information files used in the secure-wget-download verification process in my 718225 patch, I intend to use the cached copies, but if verification fails, to download fresh copies of any cached files that were used, as a second chance, and only really fail if the verification check fails again after that. This should work fine in all scenarios.
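
The "second chance" logic described above could be sketched roughly as follows. This is only an illustration, not code from the patch: cached_verified_fetch, fetch_fresh, verify_file and EXPECTED_SHA256 are all hypothetical names, the download is a stand-in that copies a local file (a real build would call wget), and verification is reduced to a single SHA256 comparison.

```shell
#!/bin/sh
# Hypothetical sketch; none of these names exist in live-build.

# Stand-in for the real download step (wget in a real build).
# Here the "URL" is just a local path that gets copied.
fetch_fresh ()
{
	cp "${1}" "${2}"
}

# Stand-in for verification: compare the file's SHA256 with EXPECTED_SHA256.
verify_file ()
{
	[ "$(sha256sum "${1}" | cut -d' ' -f1)" = "${EXPECTED_SHA256}" ]
}

# Usage: cached_verified_fetch SOURCE CACHE_FILE
cached_verified_fetch ()
{
	# First chance: use the cached copy if one exists and it verifies.
	if [ -e "${2}" ] && verify_file "${2}"
	then
		return 0
	fi

	# Second chance: discard the cached copy and fetch a fresh one.
	rm -f "${2}"
	fetch_fresh "${1}" "${2}" || return 1

	# Only really fail if the fresh copy does not verify either.
	verify_file "${2}"
}
```

With this shape, a stale cached file costs one extra download rather than a failed build, while a file that fails verification twice still stops the build.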

Contents-[arch].gz files, used to get a list of firmware packages, and Packages.gz, used to get a list of udebs, may need to be updated, as may trace files (if caching were implemented for them).

So certain files discussed here should be used from the cache during a build, to avoid unnecessary repeat downloads (e.g. Contents-[arch].gz is downloaded twice, once during chroot_firmware, and once during installer_debian-installer), but the user may need a choice of whether to refresh them at the start of a new build, or to allow the cached copies to be used.
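
To illustrate avoiding the repeat download within a single build, here is a minimal sketch of a helper that two stages could share. The names cache_get, download_to, CACHED_FILE and DOWNLOAD_COUNT are hypothetical, and the download is again a stand-in that copies a local file instead of calling wget.

```shell
#!/bin/sh
# Hypothetical sketch; none of these names exist in live-build.

# Counts how many real "downloads" happened, for illustration only.
DOWNLOAD_COUNT=0

# Stand-in for the real download step (wget in a real build).
download_to ()
{
	cp "${1}" "${2}" && DOWNLOAD_COUNT=$((DOWNLOAD_COUNT + 1))
}

# Usage: cache_get SOURCE CACHE_DIR
# Sets CACHED_FILE to the cached copy's path, downloading only if it is missing.
cache_get ()
{
	CACHED_FILE="${2}/$(basename "${1}")"

	if [ ! -e "${CACHED_FILE}" ]
	then
		mkdir -p "${2}"
		download_to "${1}" "${CACHED_FILE}" || return 1
	fi
}
```

If both stages asked for the same Contents file through such a helper, the second request would be served from the cache directory and DOWNLOAD_COUNT would stay at 1.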

How should this best be approached? There are a few options:
  1. We could default to using the cached copies if available, but provide new flags to force a build to get new ones. (--[refresh/flush]-cached-dist-info and --[refresh/flush]-cached-[di/installer] ?).
  2. We could default to not using them, flushing them by default at the start of the build process, but provide a new flag (--use-cached-dist-info and --use-cached-[di/installer] ?) to avoid the flush and use them.
  3. We could just always flush them at the start of the build, and rely on caching during the build only.
  4. We could add a new option to the clean script, requiring users to run that to flush this info if they want it flushed.

Obviously the third is the simplest for users, but least efficient, and the fourth perhaps requires the most forethought for each build. I'm not certain which I'd consider best, but I think I'd lean towards one of the first two.

As for implementing it, rather than having the first block of code that downloads a dist-info file do a flush if required, I think it would be cleaner to have a simple 'init' stage in the build process which does that. Or possibly an init script executed at the start of the bootstrap script.
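
As a rough sketch of what such an init step might look like under the first option (use cached copies by default, flush on request): the variables LB_REFRESH_DIST_INFO and LB_REFRESH_INSTALLER, the function name, and the cache subdirectory names below are all assumptions made up for illustration, not existing live-build settings or paths.

```shell
#!/bin/sh
# Hypothetical sketch; the variable, function and directory names are assumptions.

# Usage: init_flush_cache [CACHE_DIR]
# Flushes cached dist-info and/or installer files if the user asked for it.
init_flush_cache ()
{
	cache_dir="${1:-cache}"

	# Would be set by a flag such as --refresh-cached-dist-info.
	if [ "${LB_REFRESH_DIST_INFO:-false}" = "true" ]
	then
		rm -rf "${cache_dir}/dist-info"
	fi

	# Would be set by a flag such as --refresh-cached-installer.
	if [ "${LB_REFRESH_INSTALLER:-false}" = "true" ]
	then
		rm -rf "${cache_dir}/installer"
	fi
}
```

The second option would be the same code with the defaults inverted, so the choice between the two is purely about which behaviour users get when they pass no flag.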

Summary (tl;dr)
-------------------------------------
Proposal summary:

  • Ditch unnecessary and broken offline build support (disregard support in new code now, possibly remove --cache-indices in next version).
  • Ditch --cache-packages=false support in next version, if no justification for keeping it?
  • Implement one of the first two options under the 'Freshness' heading above, giving the user control over the freshness of dist-info and installer (vmlinuz and initrd) files used in a build, in relation to any copies available in the cache from a previous build.
