
Re: [PATCH v4] Add by-hash support



Sorry for the late reply.  I hope this is a helpful enough review to
make up for it.

On Mon, May 16, 2016 at 05:54:03PM +0200, Julien Cristau wrote:
> +def clean_byhash(now_date, session):
> +    Logger.log(["Cleaning out unused by-hash files..."])
> +
> +    q = session.execute("""
> +        DELETE FROM hashfile h
> +        USING suite s, archive a
> +        WHERE s.id = h.suite_id
> +          AND a.id = s.archive_id
> +          AND h.unreferenced + a.stayofexecution < CURRENT_TIMESTAMP
> +        RETURNING a.path, s.suite_name, h.path""")
> +    count = q.rowcount
> +
> +    if not Options["No-Action"]:
> +        for base, suite, path in q:
> +            filename = os.path.join(base, 'dists', suite, path)
> +            try:
> +                os.unlink(filename)
> +            except OSError as exc:
> +                if exc.errno != errno.ENOENT:
> +                    raise
> +                Logger.log(['database referred to non-existing file', filename])
> +            else:
> +                Logger.log(['delete hashfile', suite, path])
> +        session.commit()

I initially thought this meant that a no-action run would delete rows
from the database without ever removing the corresponding files from
disk, which seemed like a bad idea.  Then I read the earlier thread and
found that this isn't the case, but it still seems to rely on a bit
more state than I'm comfortable with.

My implementation in Launchpad instead works by first deleting rows from
the database, then walking over all the relevant database rows and
adding their hashes to an in-memory data structure, then walking over
all the by-hash files on disk in the relevant suite and removing any
that no longer correspond to the database.  That way, if you miss
anything, you only have to make the database look right and the next run
will fix it up.  One wrinkle is that you need to know which directories
to walk over: we solve that by having the cleaning be done as part of
writing out by-hash files, so we can keep track of which directories
were relevant before any database rows were deleted.
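
Roughly, as a sketch only (the function and variable names here are
mine rather than Launchpad's, and it assumes hashfile.path is relative
to the suite directory, as it appears to be in your patch):

    import os

    def prune_byhash(session, suite_id, suite_dir):
        # Everything the database still references for this suite, read
        # after expired rows have already been deleted.
        rows = session.execute(
            "SELECT path FROM hashfile WHERE suite_id = :id",
            {'id': suite_id})
        known = set(path for (path,) in rows)

        # Remove any on-disk by-hash file the database no longer knows
        # about; a missed deletion gets repaired on the next run.
        for dirpath, dirnames, filenames in os.walk(suite_dir):
            if os.path.basename(os.path.dirname(dirpath)) != 'by-hash':
                continue  # only prune inside by-hash/<ALGO>/ directories
            for name in filenames:
                path = os.path.relpath(os.path.join(dirpath, name), suite_dir)
                if path not in known:
                    os.unlink(os.path.join(dirpath, name))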

Basically, the database should be the source of truth and every run
should make sure that the filesystem is in sync with it, rather than
just with changes made in the current transaction.  This is generally a
more reliable approach, and it's only negligibly more work at run-time
here.

What about removing directories that are emptied as a result of removing
all their by-hash files?
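
Something along these lines after the unlink, reusing the os and errno
imports the file already has, would do it (sketch only):

    try:
        # Remove by-hash/<ALGO> if the unlink just emptied it; repeat
        # for its by-hash parent if you want that pruned as well.
        os.rmdir(os.path.dirname(filename))
    except OSError as exc:
        if exc.errno not in (errno.ENOTEMPTY, errno.ENOENT):
            raise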

> +        c.execute("ALTER TABLE suite ADD COLUMN byhash BOOLEAN DEFAULT false")

I recommend having two booleans here: one to publish the by-hash files,
and another to add the flag to Release.  That way it's really easy to
enable just the former for testing, or to disable the latter later if
something goes wrong.
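
i.e. something like this, where the second column name is only a
suggestion:

    # byhash: publish by-hash files for this suite.
    c.execute("ALTER TABLE suite ADD COLUMN byhash BOOLEAN DEFAULT false")
    # byhash_advertise: additionally announce "Acquire-By-Hash: yes" in
    # Release, so publication can be tested on its own first and the
    # announcement withdrawn separately if something breaks.
    c.execute(
        "ALTER TABLE suite ADD COLUMN byhash_advertise BOOLEAN DEFAULT false")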

> +        c.execute("""
> +            CREATE TABLE hashfile (
> +                suite_id INTEGER NOT NULL REFERENCES suite(id) ON DELETE CASCADE,
> +                path TEXT NOT NULL,
> +                unreferenced TIMESTAMP,
> +                PRIMARY KEY (suite_id, path)
> +            )
> +             """)

You probably also want an index on hashfile(suite_id) if it's not to be
the sole primary key.

> +                for h in hashfuncs:

I concluded after a while that there was no point including MD5Sum and
SHA1 by-hash entries: there are no versions of apt that support by-hash
but not SHA256, so it just uses up inodes for no benefit, and it uses up
a not entirely trivial amount of space on mirrors that mirror all the
links as regular files.  You may as well only do this for SHA256 and
onward.

> +                    hashfile = os.path.join(os.path.dirname(filename), 'by-hash', h, fileinfo[filename][h])

I know this isn't your fault, but the excessively clever, illegible
initialisation of hashfuncs further up in this file needs to die and be
replaced with a lookup in an explicit mapping or something.  I mean,
seriously.  I used to think that kind of thing was really neat, but then
I realised that code has a communication function (including to my later
self) rather than just being as dense as possible.

Anyway, I *think* this is right, but lacking a test suite or legible
initialisation of hashfuncs it took me a while to satisfy myself that
this actually creates by-hash directories with the correct case.  And of
course it's a timebomb in case anyone has the temerity to popularise a
hash function containing the substring "UM" in future.

Given that each hash algorithm has several associated properties (hash
method, various output spellings, maybe database column names, other
things like whether to write by-hash entries for it), a nice replacement
would be a little class for each hash algorithm that encapsulates all
these and a registry of all the current algorithms.  That's a couple of
dozen lines of readable code in one place and then everything else can
use it.
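
As a rough sketch (the attribute and column names here are guesses; the
point is the shape rather than the exact spellings):

    import hashlib

    class HashAlgorithm(object):
        def __init__(self, release_field, db_column, func, write_byhash):
            self.release_field = release_field  # spelling in Release, e.g. "SHA256"
            self.db_column = db_column          # database column, e.g. "sha256sum"
            self.func = func                    # hashlib constructor
            self.write_byhash = write_byhash    # create by-hash entries?

    # The one place that knows which algorithms currently exist.
    HASH_ALGORITHMS = [
        HashAlgorithm('MD5Sum', 'md5sum', hashlib.md5, False),
        HashAlgorithm('SHA1', 'sha1sum', hashlib.sha1, False),
        HashAlgorithm('SHA256', 'sha256sum', hashlib.sha256, True),
    ]

The by-hash loop then becomes "for algo in HASH_ALGORITHMS: if
algo.write_byhash: ..." and nobody has to reverse-engineer a string
transformation to work out the directory case.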

> +                    query = "SELECT 1 FROM hashfile WHERE path = :p AND suite_id = :id"
> +                    q = session.execute(
> +                            query,
> +                            {'p': hashfile, 'id': suite.suite_id})
> +                    if q.rowcount:
> +                        session.execute('''
> +                            UPDATE hashfile SET unreferenced = NULL
> +                            WHERE path = :p and suite_id = :id''',
> +                            {'p': hashfile, 'id': suite.suite_id})
> +                    else:
> +                        session.execute('''
> +                            INSERT INTO hashfile (path, suite_id)
> +                            VALUES (:p, :id)''',
> +                            {'p': hashfile, 'id': suite.suite_id})

This is shaped quite a bit differently from the LP implementation (which
has "path" referring to the by-name path, and potentially >1 row for
each (suite, path) although only one is ever referenced), so I had to
think about it for a while.

It's a bit hard to reason about the correctness of this in the case
where multiple files hash to the same value.  They'd have to be in the
same directory, but that's not entirely improbable in the case of (say)
multiple empty translation files.  A test suite would be really helpful;
it was invaluable to me when doing similar work and chasing down corner
cases.  How are you testing this at the moment?

I think my implementation involves quite a lot fewer queries: yours is
always O(len(fileinfo) * len(hashfuncs)), whereas mine is O(number of
new by-hash entries) plus a small constant.  You may or may not care
about this; it matters for us because we publish a large number of
archives, and it may matter for Debian in future if
PPAs/bikesheds/whatever happen.

Even if you don't do this, you may find it worth optimising by reading
"SELECT path FROM hashfile WHERE suite_id = :id" or similar into an
in-memory set outside both loops and then checking against that rather
than doing a "SELECT 1"; the round trips for all those little queries
will probably cost more than reading a bunch of paths you don't end up
using.
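
i.e. something like this once per suite, outside both loops (a sketch,
reusing the names from the quoted hunk):

    rows = session.execute(
        "SELECT path FROM hashfile WHERE suite_id = :id",
        {'id': suite.suite_id})
    existing = set(path for (path,) in rows)

and then "if hashfile in existing:" in the inner loop replaces a round
trip per file/hash pair.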

> +                    try:
> +                        os.link(filename, hashfile)
> +                    except OSError as exc:
> +                        if exc.errno != errno.EEXIST:
> +                            raise

You do indeed want a hardlink here, as otherwise the repository format
is racy: there's a window where an old by-hash symlink either doesn't
exist or refers to something with a different hash.  It also makes
things generally a lot easier to manage.

However, you can optimise this a bit in the case where you're writing
out multiple hashes: it's perfectly safe for by-hash/SHA1/$(sha1sum foo)
to be a symlink to by-hash/SHA256/$(sha256sum foo).  That's worth a
look, especially if you don't follow my advice above about dropping
MD5Sum and SHA1, but also in general since we might introduce stronger
hashes in future.
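
Sketch, using the same filename/fileinfo names as the quoted hunk and
assuming the hashfuncs spellings are the Release field names (the
relative target keeps the link valid if the tree gets moved around):

    for h in ('MD5Sum', 'SHA1'):
        weak = os.path.join(os.path.dirname(filename), 'by-hash', h,
                            fileinfo[filename][h])
        try:
            # e.g. by-hash/MD5Sum/<md5> -> ../SHA256/<sha256>
            os.symlink(os.path.join('..', 'SHA256',
                                    fileinfo[filename]['SHA256']),
                       weak)
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise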

Cheers,

-- 
Colin Watson                                       [cjwatson@debian.org]

