Re: [PATCH v3] Add by-hash support
Hi Ansgar,
thanks for the review.
On Mon, May 16, 2016 at 12:21:06 +0200, Ansgar Burchardt wrote:
> Julien Cristau writes:
> > --- a/dak/clean_suites.py
> > +++ b/dak/clean_suites.py
> > @@ -384,6 +384,33 @@ SELECT f.id, f.fingerprint FROM fingerprint f
> > +def clean_byhash(now_date, session):
> > + Logger.log(["Cleaning out unused by-hash files..."])
> > +
> > + q = session.execute("""
> > + DELETE FROM hashfile h
> > + USING suite s, archive a
> > + WHERE s.id = h.suite_id
> > + AND a.id = s.archive_id
> > + AND h.unreferenced + a.stayofexecution < CURRENT_TIMESTAMP
> > + RETURNING a.path, s.suite_name, h.path""")
> > + count = q.rowcount
> > +
> > + if not Options["No-Action"]:
> > + for base, suite, path in q.fetchall():
>
> The fetchall() is not necessary: `for row in q` will iterate over all
> rows returned.
>
Will fix.
> > + filename = os.path.join(base, 'dists', suite, path)
> > + if os.path.isfile(filename):
> > + Logger.log(['delete hashfile', suite, path])
> > + os.unlink(filename)
> > + else:
> > + Logger.log(['database referred to non-existing file', filename])
>
> I would prefer just trying to unlink the file and catching the
> does-not-exist error. Besides `os.path.isfile(filename)` returns
> false if `filename` is a dangling symlink which we still want to remove.
>
OK.
> In case of `No-Action`, the transaction should be rolled back.
>
I was following what the other clean* functions in that file seem to be
doing, AFAICT they rely on the final session.rollback() in main(). I
can add it here, though.
> > --- a/dak/generate_releases.py
> > +++ b/dak/generate_releases.py
> > + if suite.byhash:
> > + query = """
> > + UPDATE hashfile SET unreferenced = CURRENT_TIMESTAMP
> > + WHERE suite_id = :id AND unreferenced IS NULL"""
> > + session.execute(query, {'id': suite.suite_id})
> > +
> > + hashes = filter(lambda h: h in hashfuncs, ('MD5Sum', 'SHA1', 'SHA256'))
> > + strong_hash = hashes[-1]
>
> Hmm, I don't like this assumption, but then I have nothing better for
> now.
>
You mean the assumption that the suite has at least one of md5, sha1,
sha256?
> > + query = "SELECT 1 FROM hashfile WHERE path = :p AND suite_id = :id"
> > + q = session.execute(
> > + query,
> > + {'p': hashfile, 'id': suite.suite_id})
> > + if q.rowcount:
> > + session.execute('''
> > + UPDATE hashfile SET unreferenced = NULL
> > + WHERE path = :p and suite_id = :id''',
> > + {'p': hashfile, 'id': suite.suite_id})
> > + else:
> > + session.execute('''
> > + INSERT INTO hashfile (path, suite_id, unreferenced)
> > + VALUES (:p, :id, NULL)''',
> > + {'p': hashfile, 'id': suite.suite_id})
>
> I look forward to `INSERT ... ON CONFLICT UPDATE ...` in Postgres 9.5 :)
>
I thought the same thing when writing this ;)
> The default value for `unreferenced` should not be specified: it's fine
> to take the one from the table definition.
>
Will fix.
> > + if h == strong_hash:
> > + with open(filename, 'rb') as src:
> > + contents = src.read()
> > + with open(hashfile + '.new', 'wb') as dst:
> > + dst.write(contents)
> > + os.rename(hashfile + '.new', hashfile)
> > + os.unlink(filename)
>
> This is just `os.rename(filename, hashfile)`? Or `shutil.move` for
> the theoretic case of `filename` and `hashfile` being on different
> filesystems.
>
Not quite. If you're calling generate-releases twice in a row, e.g.
after an error, filename is already a symlink, so moving would break
things. I could check for that explicitly, if you prefer?
> > + os.symlink(os.path.join('by-hash', h, fileinfo[filename][h]),
> > + filename)
> > + elif not os.path.exists(hashfile):
> > + os.symlink(os.path.join('..', strong_hash, fileinfo[filename][strong_hash]),
> > + hashfile)
>
> I wonder if hardlinks are the better choice (no risk of dangling
> symlinks), but symlinks are probably better for mirrors that don't
> mirror hardlinks as such, but duplicate the data?
>
No preference here.
Cheers,
Julien
Reply to: