Re: [PATCH v4] Add by-hash support
On Mon, May 16, 2016 at 18:28:21 +0100, Colin Watson wrote:
> Sorry for the late reply. I hope this is a helpful enough review to
> make up for it.
>
Definitely! Much appreciated, thanks.
> On Mon, May 16, 2016 at 05:54:03PM +0200, Julien Cristau wrote:
> > +def clean_byhash(now_date, session):
> > + Logger.log(["Cleaning out unused by-hash files..."])
> > +
> > + q = session.execute("""
> > + DELETE FROM hashfile h
> > + USING suite s, archive a
> > + WHERE s.id = h.suite_id
> > + AND a.id = s.archive_id
> > + AND h.unreferenced + a.stayofexecution < CURRENT_TIMESTAMP
> > + RETURNING a.path, s.suite_name, h.path""")
> > + count = q.rowcount
> > +
> > + if not Options["No-Action"]:
> > + for base, suite, path in q:
> > + filename = os.path.join(base, 'dists', suite, path)
> > + try:
> > + os.unlink(filename)
> > + except OSError as exc:
> > + if exc.errno != errno.ENOENT:
> > + raise
> > + Logger.log(['database referred to non-existing file', filename])
> > + else:
> > + Logger.log(['delete hashfile', suite, path])
> > + session.commit()
>
> I initially thought that this means that a no-action run will delete
> rows from the database but never remove them from disk, which seems like
> a bad idea. Then I read the earlier thread and found that this isn't
> the case, but this still seems to rely on a bit more state than I'm
> comfortable with.
>
> My implementation in Launchpad instead works by first deleting rows from
> the database, then walking over all the relevant database rows and
> adding their hashes to an in-memory data structure, then walking over
> all the by-hash files on disk in the relevant suite and removing any
> that no longer correspond to the database. That way, if you miss
> anything, you only have to make the database look right and the next run
> will fix it up. One wrinkle is that you need to know which directories
> to walk over: we solve that by having the cleaning be done as part of
> writing out by-hash files, so we can keep track of which directories
> were relevant before any database rows were deleted.
>
> Basically, the database should be the source of truth and every run
> should make sure that the filesystem is in sync with it, rather than
> just with changes made in the current transaction. This is generally a
> more reliable approach, and it's only negligibly more work at run-time
> here.
>
I'll have to think about this.
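If I've understood the reconcile-with-the-database approach correctly, it would look something like the rough sketch below (the function name and the exact walk are my invention; it assumes expired rows were already deleted, so whatever is left in hashfile is live):

```python
import os


def reconcile_byhash(suite_dir, session, suite_id):
    """Make the on-disk by-hash tree match the database (sketch).

    Assumes rows past their stay of execution have already been
    deleted, so every remaining hashfile row is considered live.
    """
    # 1. Collect every by-hash path the database still references,
    #    relative to the suite directory.
    rows = session.execute(
        "SELECT path FROM hashfile WHERE suite_id = :id",
        {'id': suite_id})
    live = set(path for (path,) in rows)

    # 2. Walk the on-disk by-hash directories and remove anything
    #    the database no longer knows about.
    for dirpath, dirnames, filenames in os.walk(suite_dir):
        # Only look inside by-hash/<ALGORITHM>/ directories.
        if os.path.basename(os.path.dirname(dirpath)) != 'by-hash':
            continue
        for name in filenames:
            filename = os.path.join(dirpath, name)
            relpath = os.path.relpath(filename, suite_dir)
            if relpath not in live:
                os.unlink(filename)
```

That way a missed unlink heals itself on the next run, as you say.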
> What about removing directories that are emptied as a result of removing
> all their by-hash files?
>
AFAICT the existing clean_empty_directories() will take care of that.
> > + c.execute("ALTER TABLE suite ADD COLUMN byhash BOOLEAN DEFAULT false")
>
> I recommend having two booleans here: one to publish the by-hash files,
> and another to add the flag to Release. That way it's really easy to
> enable just the former for testing, or to disable the latter later if
> something goes wrong.
>
When I asked, Joerg said just one boolean was fine.
[...]
> > + hashfile = os.path.join(os.path.dirname(filename), 'by-hash', h, fileinfo[filename][h])
>
> I know this isn't your fault, but the excessively-clever illegible
> initialisation of hashfuncs further up in this file needs to die and be
> replaced with a lookup in an explicit mapping or something. I mean,
> seriously. I used to think that kind of thing was really neat, but then
> I realised that code has a communication function (including to my later
> self) rather than just being as dense as possible.
>
> Anyway, I *think* this is right, but lacking a test suite or legible
> initialisation of hashfuncs it took me a while to satisfy myself that
> this actually creates by-hash directories with the correct case. And of
> course it's a timebomb in case anyone has the temerity to popularise a
> hash function containing the substring "UM" in future.
>
> Given that each hash algorithm has several associated properties (hash
> method, various output spellings, maybe database column names, other
> things like whether to write by-hash entries for it), a nice replacement
> would be a little class for each hash algorithm that encapsulates all
> these and a registry of all the current algorithms. That's a couple of
> dozen lines of readable code in one place and then everything else can
> use it.
>
I did have to think about this for a bit, so it's hard to disagree.
I'll look into doing what you suggest.
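As a first stab, I imagine something along these lines (the field names and the "official" flag are made up for illustration; real dak would need whatever properties the rest of the code actually uses):

```python
import hashlib


class HashAlgorithm(object):
    """All the per-algorithm properties in one place (sketch)."""

    def __init__(self, release_field, dirname, db_name, func, byhash):
        self.release_field = release_field  # field name in Release
        self.dirname = dirname              # by-hash directory name
        self.db_name = db_name              # database column name
        self.func = func                    # hashlib constructor
        self.byhash = byhash                # write by-hash entries?

    def hexdigest(self, data):
        return self.func(data).hexdigest()


# Registry, strongest last; no clever string munging required.
HASH_ALGORITHMS = [
    HashAlgorithm('MD5Sum', 'MD5Sum', 'md5sum', hashlib.md5, False),
    HashAlgorithm('SHA1', 'SHA1', 'sha1sum', hashlib.sha1, False),
    HashAlgorithm('SHA256', 'SHA256', 'sha256sum', hashlib.sha256, True),
]
```

No case tricks, and a hypothetical future "UM"-bearing hash just gets its spellings written out explicitly.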
> > + query = "SELECT 1 FROM hashfile WHERE path = :p AND suite_id = :id"
> > + q = session.execute(
> > + query,
> > + {'p': hashfile, 'id': suite.suite_id})
> > + if q.rowcount:
> > + session.execute('''
> > + UPDATE hashfile SET unreferenced = NULL
> > + WHERE path = :p and suite_id = :id''',
> > + {'p': hashfile, 'id': suite.suite_id})
> > + else:
> > + session.execute('''
> > + INSERT INTO hashfile (path, suite_id)
> > + VALUES (:p, :id)''',
> > + {'p': hashfile, 'id': suite.suite_id})
>
> This is shaped quite a bit differently from the LP implementation (which
> has "path" referring to the by-name path, and potentially >1 row for
> each (suite, path) although only one is ever referenced), so I had to
> think about it for a while.
>
> It's a bit hard to reason about the correctness of this in the case
> where multiple files hash to the same value. They'd have to be in the
> same directory, but that's not entirely improbable in the case of (say)
> multiple empty translation files. A test suite would be really helpful;
> it was invaluable to me when doing similar work and chasing down corner
> cases. How are you testing this at the moment?
>
Yeah, that's not pretty. I'm only testing manually, so it's likely I'll
have missed things.
> I think my implementation involves quite a lot fewer queries: yours is
> always O(len(fileinfo) * len(hashfuncs)), whereas mine is O(number of
> new by-hash entries) plus a small constant. You may or may not care
> about this; it matters for us because we publish a large number of
> archives, and it may matter for Debian in future if
> PPAs/bikesheds/whatever happen.
>
> Even if you don't do this, you may find it worth optimising by reading
> "SELECT path FROM hashfile WHERE suite_id = :id" or similar into an
> in-memory set outside both loops and then checking against that rather
> than doing a "SELECT 1"; all those little queries probably dominate the
> cost of reading a bunch of paths you aren't using.
>
I'll fix that up to reduce the number of queries, thanks.
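Roughly, I'm thinking of hoisting the SELECT out of the loops like this (a sketch, not the final patch; `hashfiles` here stands in for the paths computed in the existing loops):

```python
def update_hashfiles(session, suite, hashfiles):
    """Record by-hash paths with one SELECT instead of one per file.

    hashfiles is the collection of by-hash paths that should be
    referenced for this suite after the run.
    """
    # One query to load everything we already know about this suite.
    rows = session.execute(
        "SELECT path FROM hashfile WHERE suite_id = :id",
        {'id': suite.suite_id})
    existing = set(path for (path,) in rows)

    for hashfile in hashfiles:
        if hashfile in existing:
            # Re-referenced: clear any pending expiry.
            session.execute(
                "UPDATE hashfile SET unreferenced = NULL"
                " WHERE path = :p AND suite_id = :id",
                {'p': hashfile, 'id': suite.suite_id})
        else:
            session.execute(
                "INSERT INTO hashfile (path, suite_id)"
                " VALUES (:p, :id)",
                {'p': hashfile, 'id': suite.suite_id})
```

That drops the per-file "SELECT 1" entirely; writes still only happen for paths that actually changed state.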
> > + try:
> > + os.link(filename, hashfile)
> > + except OSError as exc:
> > + if exc.errno != errno.EEXIST:
> > + raise
>
> You do indeed want a hardlink here, as otherwise the repository format
> is racy: there's a window where an old by-hash symlink either doesn't
> exist or refers to something with a different hash. It also makes
> things generally a lot easier to manage.
>
> However, you can optimise this a bit in the case where you're writing
> out multiple hashes: it's perfectly safe for by-hash/SHA1/$(sha1sum foo)
> to be a symlink to by-hash/SHA256/$(sha256sum foo). That's worth a
> look, especially if you don't follow my advice above about dropping
> MD5Sum and SHA1, but also in general since we might introduce stronger
> hashes in future.
>
I think for now we'll go with the hardlinks and see how it goes. We can
always change some of them to symlinks (or drop the MD5/SHA1 ones) later
if they become a pain.
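For the record, the symlink variant would look something like this sketch (function name and digest-list shape are my own; strongest hash first, hardlinked, with the weaker ones as relative symlinks to it):

```python
import errno
import os


def install_byhash(filename, digests):
    """Link filename into by-hash (sketch).

    digests is an ordered list of (dirname, hexdigest) pairs,
    strongest first, e.g. [('SHA256', ...), ('SHA1', ...)].
    """
    base = os.path.join(os.path.dirname(filename), 'by-hash')
    canonical = None
    for dirname, digest in digests:
        os.makedirs(os.path.join(base, dirname), exist_ok=True)
        target = os.path.join(base, dirname, digest)
        try:
            if canonical is None:
                # Hard link the strongest hash: no race window.
                os.link(filename, target)
                canonical = target
            else:
                # Weaker hashes point at the strongest hash's entry.
                os.symlink(
                    os.path.relpath(canonical, os.path.dirname(target)),
                    target)
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise
```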
Cheers,
Julien