Re: [PATCH v4] Add by-hash support
On Mon, May 16, 2016 at 18:28:21 +0100, Colin Watson wrote:
> Sorry for the late reply. I hope this is a helpful enough review to
> make up for it.
>
Definitely! Much appreciated, thanks.
> On Mon, May 16, 2016 at 05:54:03PM +0200, Julien Cristau wrote:
> > +def clean_byhash(now_date, session):
> > + Logger.log(["Cleaning out unused by-hash files..."])
> > +
> > + q = session.execute("""
> > + DELETE FROM hashfile h
> > + USING suite s, archive a
> > + WHERE s.id = h.suite_id
> > + AND a.id = s.archive_id
> > + AND h.unreferenced + a.stayofexecution < CURRENT_TIMESTAMP
> > + RETURNING a.path, s.suite_name, h.path""")
> > + count = q.rowcount
> > +
> > + if not Options["No-Action"]:
> > + for base, suite, path in q:
> > + filename = os.path.join(base, 'dists', suite, path)
> > + try:
> > + os.unlink(filename)
> > + except OSError as exc:
> > + if exc.errno != errno.ENOENT:
> > + raise
> > + Logger.log(['database referred to non-existing file', filename])
> > + else:
> > + Logger.log(['delete hashfile', suite, path])
> > + session.commit()
>
> I initially thought that this means that a no-action run will delete
> rows from the database but never remove them from disk, which seems like
> a bad idea. Then I read the earlier thread and found that this isn't
> the case, but this still seems to rely on a bit more state than I'm
> comfortable with.
>
> My implementation in Launchpad instead works by first deleting rows from
> the database, then walking over all the relevant database rows and
> adding their hashes to an in-memory data structure, then walking over
> all the by-hash files on disk in the relevant suite and removing any
> that no longer correspond to the database. That way, if you miss
> anything, you only have to make the database look right and the next run
> will fix it up. One wrinkle is that you need to know which directories
> to walk over: we solve that by having the cleaning be done as part of
> writing out by-hash files, so we can keep track of which directories
> were relevant before any database rows were deleted.
>
> Basically, the database should be the source of truth and every run
> should make sure that the filesystem is in sync with it, rather than
> just with changes made in the current transaction. This is generally a
> more reliable approach, and it's only negligibly more work at run-time
> here.
>
I'll have to think about this.
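If I've understood the reconcile-with-the-database approach correctly, it would look something like the rough sketch below (the function name and the exact walk are my invention; it assumes expired rows were already deleted, so whatever is left in hashfile is live):

```python
import os


def reconcile_byhash(suite_dir, session, suite_id):
    """Make the on-disk by-hash tree match the database (sketch).

    Assumes rows past their stay of execution have already been
    deleted, so every remaining hashfile row is considered live.
    """
    # 1. Collect every by-hash path the database still references,
    #    relative to the suite directory.
    rows = session.execute(
        "SELECT path FROM hashfile WHERE suite_id = :id",
        {'id': suite_id})
    live = set(path for (path,) in rows)

    # 2. Walk the on-disk by-hash directories and remove anything
    #    the database no longer knows about.
    for dirpath, dirnames, filenames in os.walk(suite_dir):
        # Only look inside by-hash/<ALGORITHM>/ directories.
        if os.path.basename(os.path.dirname(dirpath)) != 'by-hash':
            continue
        for name in filenames:
            filename = os.path.join(dirpath, name)
            relpath = os.path.relpath(filename, suite_dir)
            if relpath not in live:
                os.unlink(filename)
```

That way a missed unlink heals itself on the next run, as you say.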
> What about removing directories that are emptied as a result of removing
> all their by-hash files?
>
AFAICT the existing clean_empty_directories() will take care of that.
> > + c.execute("ALTER TABLE suite ADD COLUMN byhash BOOLEAN DEFAULT false")
>
> I recommend having two booleans here: one to publish the by-hash files,
> and another to add the flag to Release. That way it's really easy to
> enable just the former for testing, or to disable the latter later if
> something goes wrong.
>
When I asked, Joerg said just one boolean was fine.
[...]
> > + hashfile = os.path.join(os.path.dirname(filename), 'by-hash', h, fileinfo[filename][h])
>
> I know this isn't your fault, but the excessively-clever illegible
> initialisation of hashfuncs further up in this file needs to die and be
> replaced with a lookup in an explicit mapping or something. I mean,
> seriously. I used to think that kind of thing was really neat, but then
> I realised that code has a communication function (including to my later
> self) rather than just being as dense as possible.
>
> Anyway, I *think* this is right, but lacking a test suite or legible
> initialisation of hashfuncs it took me a while to satisfy myself that
> this actually creates by-hash directories with the correct case. And of
> course it's a timebomb in case anyone has the temerity to popularise a
> hash function containing the substring "UM" in future.
>
> Given that each hash algorithm has several associated properties (hash
> method, various output spellings, maybe database column names, other
> things like whether to write by-hash entries for it), a nice replacement
> would be a little class for each hash algorithm that encapsulates all
> these and a registry of all the current algorithms. That's a couple of
> dozen lines of readable code in one place and then everything else can
> use it.
>
I did have to think about this for a bit, so it's hard to disagree.
I'll look into doing what you suggest.
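As a first stab, I imagine something along these lines (the field names and the "official" flag are made up for illustration; real dak would need whatever properties the rest of the code actually uses):

```python
import hashlib


class HashAlgorithm(object):
    """All the per-algorithm properties in one place (sketch)."""

    def __init__(self, release_field, dirname, db_name, func, byhash):
        self.release_field = release_field  # field name in Release
        self.dirname = dirname              # by-hash directory name
        self.db_name = db_name              # database column name
        self.func = func                    # hashlib constructor
        self.byhash = byhash                # write by-hash entries?

    def hexdigest(self, data):
        return self.func(data).hexdigest()


# Registry, strongest last; no clever string munging required.
HASH_ALGORITHMS = [
    HashAlgorithm('MD5Sum', 'MD5Sum', 'md5sum', hashlib.md5, False),
    HashAlgorithm('SHA1', 'SHA1', 'sha1sum', hashlib.sha1, False),
    HashAlgorithm('SHA256', 'SHA256', 'sha256sum', hashlib.sha256, True),
]
```

No case tricks, and a hypothetical future "UM"-bearing hash just gets its spellings written out explicitly.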
> > + query = "SELECT 1 FROM hashfile WHERE path = :p AND suite_id = :id"
> > + q = session.execute(
> > + query,
> > + {'p': hashfile, 'id': suite.suite_id})
> > + if q.rowcount:
> > + session.execute('''
> > + UPDATE hashfile SET unreferenced = NULL
> > + WHERE path = :p and suite_id = :id''',
> > + {'p': hashfile, 'id': suite.suite_id})
> > + else:
> > + session.execute('''
> > + INSERT INTO hashfile (path, suite_id)
> > + VALUES (:p, :id)''',
> > + {'p': hashfile, 'id': suite.suite_id})
>
> This is shaped quite a bit differently from the LP implementation (which
> has "path" referring to the by-name path, and potentially >1 row for
> each (suite, path) although only one is ever referenced), so I had to
> think about it for a while.
>
> It's a bit hard to reason about the correctness of this in the case
> where multiple files hash to the same value. They'd have to be in the
> same directory, but that's not entirely improbable in the case of (say)
> multiple empty translation files. A test suite would be really helpful;
> it was invaluable to me when doing similar work and chasing down corner
> cases. How are you testing this at the moment?
>
Yeah, that's not pretty. I'm only testing manually, so it's likely I'll
have missed things.
> I think my implementation involves quite a lot fewer queries: yours is
> always O(len(fileinfo) * len(hashfuncs)), whereas mine is O(number of
> new by-hash entries) plus a small constant. You may or may not care
> about this; it matters for us because we publish a large number of
> archives, and it may matter for Debian in future if
> PPAs/bikesheds/whatever happen.
>
> Even if you don't do this, you may find it worth optimising by reading
> "SELECT path FROM hashfile WHERE suite_id = :id" or similar into an
> in-memory set outside both loops and then checking against that rather
> than doing a "SELECT 1"; all those little queries probably dominate the
> cost of reading a bunch of paths you aren't using.
>
I'll fix that up to reduce the number of queries, thanks.
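Roughly, I'm thinking of hoisting the SELECT out of the loops like this (a sketch, not the final patch; `hashfiles` here stands in for the paths computed in the existing loops):

```python
def update_hashfiles(session, suite, hashfiles):
    """Record by-hash paths with one SELECT instead of one per file.

    hashfiles is the collection of by-hash paths that should be
    referenced for this suite after the run.
    """
    # One query to load everything we already know about this suite.
    rows = session.execute(
        "SELECT path FROM hashfile WHERE suite_id = :id",
        {'id': suite.suite_id})
    existing = set(path for (path,) in rows)

    for hashfile in hashfiles:
        if hashfile in existing:
            # Re-referenced: clear any pending expiry.
            session.execute(
                "UPDATE hashfile SET unreferenced = NULL"
                " WHERE path = :p AND suite_id = :id",
                {'p': hashfile, 'id': suite.suite_id})
        else:
            session.execute(
                "INSERT INTO hashfile (path, suite_id)"
                " VALUES (:p, :id)",
                {'p': hashfile, 'id': suite.suite_id})
```

That drops the per-file "SELECT 1" entirely; writes still only happen for paths that actually changed state.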
> > + try:
> > + os.link(filename, hashfile)
> > + except OSError as exc:
> > + if exc.errno != errno.EEXIST:
> > + raise
>
> You do indeed want a hardlink here, as otherwise the repository format
> is racy: there's a window where an old by-hash symlink either doesn't
> exist or refers to something with a different hash. It also makes
> things generally a lot easier to manage.
>
> However, you can optimise this a bit in the case where you're writing
> out multiple hashes: it's perfectly safe for by-hash/SHA1/$(sha1sum foo)
> to be a symlink to by-hash/SHA256/$(sha256sum foo). That's worth a
> look, especially if you don't follow my advice above about dropping
> MD5Sum and SHA1, but also in general since we might introduce stronger
> hashes in future.
>
I think for now we'll go with the hardlinks and see how it goes. We can
always change some of them to symlinks (or drop the MD5/SHA1 ones) later
if they become a pain.
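For the record, the symlink variant would look something like this sketch (function name and digest-list shape are my own; strongest hash first, hardlinked, with the weaker ones as relative symlinks to it):

```python
import errno
import os


def install_byhash(filename, digests):
    """Link filename into by-hash (sketch).

    digests is an ordered list of (dirname, hexdigest) pairs,
    strongest first, e.g. [('SHA256', ...), ('SHA1', ...)].
    """
    base = os.path.join(os.path.dirname(filename), 'by-hash')
    canonical = None
    for dirname, digest in digests:
        os.makedirs(os.path.join(base, dirname), exist_ok=True)
        target = os.path.join(base, dirname, digest)
        try:
            if canonical is None:
                # Hard link the strongest hash: no race window.
                os.link(filename, target)
                canonical = target
            else:
                # Weaker hashes point at the strongest hash's entry.
                os.symlink(
                    os.path.relpath(canonical, os.path.dirname(target)),
                    target)
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise
```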
Cheers,
Julien