Re: [PATCH v3] Add by-hash support

To: debian-dak@lists.debian.org, cjwatson@debian.org
Subject: Re: [PATCH v3] Add by-hash support
From: Julien Cristau <jcristau@debian.org>
Date: Mon, 16 May 2016 15:30:39 +0200
Message-id: <[🔎] 20160516133039.GG2718@betterave.cristau.org>
In-reply-to: <[🔎] 87zirq2til.fsf@deep-thought.43-1.org>
References: <[🔎] 1462175800-21306-1-git-send-email-jcristau@debian.org> <[🔎] 87zirq2til.fsf@deep-thought.43-1.org>

Hi Ansgar,

thanks for the review.

On Mon, May 16, 2016 at 12:21:06 +0200, Ansgar Burchardt wrote:

> Julien Cristau writes:
> > --- a/dak/clean_suites.py
> > +++ b/dak/clean_suites.py
> > @@ -384,6 +384,33 @@ SELECT f.id, f.fingerprint FROM fingerprint f
> > +def clean_byhash(now_date, session):
> > +    Logger.log(["Cleaning out unused by-hash files..."])
> > +
> > +    q = session.execute("""
> > +        DELETE FROM hashfile h
> > +        USING suite s, archive a
> > +        WHERE s.id = h.suite_id
> > +          AND a.id = s.archive_id
> > +          AND h.unreferenced + a.stayofexecution < CURRENT_TIMESTAMP
> > +        RETURNING a.path, s.suite_name, h.path""")
> > +    count = q.rowcount
> > +
> > +    if not Options["No-Action"]:
> > +        for base, suite, path in q.fetchall():
> 
> The fetchall() is not necessary: `for row in q` will iterate over all
> rows returned.
> 
Will fix.

> > +            filename = os.path.join(base, 'dists', suite, path)
> > +            if os.path.isfile(filename):
> > +                Logger.log(['delete hashfile', suite, path])
> > +                os.unlink(filename)
> > +            else:
> > +                Logger.log(['database referred to non-existing file', filename])
> 
> I would prefer just trying to unlink the file and catching the
> does-not-exist error.  Besides `os.path.isfile(filename)` returns
> false if `filename` is a dangling symlink which we still want to remove.
> 
OK.

> In case of `No-Action`, the transaction should be rolled back.
> 
I was following what the other clean* functions in that file seem to be
doing, AFAICT they rely on the final session.rollback() in main().  I
can add it here, though.

> > --- a/dak/generate_releases.py
> > +++ b/dak/generate_releases.py
> > +        if suite.byhash:
> > +            query = """
> > +                UPDATE hashfile SET unreferenced = CURRENT_TIMESTAMP
> > +                WHERE suite_id = :id AND unreferenced IS NULL"""
> > +            session.execute(query, {'id': suite.suite_id})
> > +
> > +            hashes = filter(lambda h: h in hashfuncs, ('MD5Sum', 'SHA1', 'SHA256'))
> > +            strong_hash = hashes[-1]
> 
> Hmm, I don't like this assumption, but then I have nothing better for
> now.
> 
You mean the assumption that the suite has at least one of md5, sha1,
sha256?

> > +                    query = "SELECT 1 FROM hashfile WHERE path = :p AND suite_id = :id"
> > +                    q = session.execute(
> > +                            query,
> > +                            {'p': hashfile, 'id': suite.suite_id})
> > +                    if q.rowcount:
> > +                        session.execute('''
> > +                            UPDATE hashfile SET unreferenced = NULL
> > +                            WHERE path = :p and suite_id = :id''',
> > +                            {'p': hashfile, 'id': suite.suite_id})
> > +                    else:
> > +                        session.execute('''
> > +                            INSERT INTO hashfile (path, suite_id, unreferenced)
> > +                            VALUES (:p, :id, NULL)''',
> > +                            {'p': hashfile, 'id': suite.suite_id})
> 
> I look forward to `INSERT ... ON CONFLICT UPDATE ...` in Postgres 9.5 :)
> 
I thought the same thing when writing this ;)

> The default value for `unreferenced` should not be specified: it's fine
> to take the one from the table definition.
> 
Will fix.

> > +                    if h == strong_hash:
> > +                        with open(filename, 'rb') as src:
> > +                            contents = src.read()
> > +                        with open(hashfile + '.new', 'wb') as dst:
> > +                            dst.write(contents)
> > +                        os.rename(hashfile + '.new', hashfile)
> > +                        os.unlink(filename)
> 
> This is just `os.rename(filename, hashfile)`?  Or `shutil.move` for
> the theoretic case of `filename` and `hashfile` being on different
> filesystems.
> 
Not quite.  If you're calling generate-releases twice in a row, e.g.
after an error, filename is already a symlink, so moving would break
things.  I could check for that explicitly, if you prefer?

> > +                        os.symlink(os.path.join('by-hash', h, fileinfo[filename][h]),
> > +                                   filename)
> > +                    elif not os.path.exists(hashfile):
> > +                        os.symlink(os.path.join('..', strong_hash, fileinfo[filename][strong_hash]),
> > +                                   hashfile)
> 
> I wonder if hardlinks are the better choice (no risk of dangling
> symlinks), but symlinks are probably better for mirrors that don't
> mirror hardlinks as such, but duplicate the data?
> 
No preference here.

Cheers,
Julien

Reply to:

Follow-Ups:
- Re: [PATCH v3] Add by-hash support
  - From: Ansgar Burchardt <ansgar@debian.org>

References:
- [PATCH v3] Add by-hash support
  - From: Julien Cristau <jcristau@debian.org>
- Re: [PATCH v3] Add by-hash support
  - From: Ansgar Burchardt <ansgar@debian.org>

Prev by Date: Re: [PATCH v3] Add by-hash support
Next by Date: Re: [PATCH v3] Add by-hash support
Previous by thread: Re: [PATCH v3] Add by-hash support
Next by thread: Re: [PATCH v3] Add by-hash support
Index(es):
- Date
- Thread