Re: [snapshot/master] Use SHA1.file().hexdigest, because ruby calculates the SHA1 wrong on large input otherwise (#668294)

To: Yaroslav Halchenko <yoh@dartmouth.edu>
Cc: debian-snapshot@lists.debian.org
Subject: Re: [snapshot/master] Use SHA1.file().hexdigest, because ruby calculates the SHA1 wrong on large input otherwise (#668294)
From: Peter Palfrader <peter@palfrader.org>
Date: Wed, 11 Apr 2012 20:08:56 +0200
Message-id: <20120411180856.GF16930@anguilla.noreply.org>
Mail-followup-to: Peter Palfrader <peter@palfrader.org>, Yaroslav Halchenko <yoh@dartmouth.edu>, debian-snapshot@lists.debian.org
In-reply-to: <20120411165910.GA29616@onerussian.com>
References: <E1SHfpd-0004s4-3Z@sibelius.debian.org> <20120410233650.GS29616@onerussian.com> <20120411061635.GG17373@anguilla.noreply.org> <20120411165910.GA29616@onerussian.com>

On Wed, 11 Apr 2012, Yaroslav Halchenko wrote:

> Hi Peter,

(Cc'ed to the list, since this might be of general interest)
> Please consider accepting the attached patch for that check

Yes, that check also has a couple more pending changes accumulated on
sibelius.  I'll update it when I'm done with moving stuff around on
that host in a couple of days.

> And now that I have got a list of mismatched sums -- what would be the
> reasonable course of action?  here they are FWIW
> 
> 1caa9a6a88c34626afbeb4bbb5db8832b1417d49: Hash mismatch (d1f4a593b1196e66b1b53d769d5e73fa115498c0)
> 1c3d3c7d1dfec8a5d1d6bf953f6b8eb09788cad4: Hash mismatch (c73e3b7247b14bcf9432f539f9f651597a728b29)
> 30cd285f222bd8fadcf3742205cebd085809de1c: Hash mismatch (0f0a4469e5efa0d107f2a3e750d6fef8b51ff282)
> 38a77ce4bf31ffd69e5e7073a4f0740cedbfd7d6: Hash mismatch (82f1ead3d54c2cf6c6add5127ee11e90cbc02955)
> 3c0d79752cce99995f3a1e6923107f9cf80b33d4: Hash mismatch (7aadc2d7cac00304b282ce105360b65f65f0bcc4)
> 483eea0bd65f9a1ecb4d4fa27e158848c937f9f8: Hash mismatch (1c50e0494a53714a8d415a0e53d0e1b9aa9f1360)
> 6d2065bc0263ed2c0b1a8384fa7a35033ad2f294: Hash mismatch (c497409bdacaa07930d596a175cfb26cbb051e53)
> 76faaa2584edeefb2579d7d45cd57a7c22eecd47: Hash mismatch (dc592a06e8eece238cb0938774b769525a4bbde3)
> 986f3cf33f8de3267532c04c696592ba1209c003: Hash mismatch (0d6e6610ad52063560b777d08c1a1800145e40b0)
> 9bc3b26606e2fa33a44704d9dc5d208b3c2dcfa9: Hash mismatch (11038d3f281efb21b4d21d4e63d4de259897cd78)
> 9fb2c2d3940139a42d09f66844013e3c30b36089: Hash mismatch (e9cf05c6ee09d49e5ef8496433785d7de4dd0d57)
> a12ea622a1ef676bf93ce0498f54583e825bc368: Hash mismatch (5e7105a8d989f2eea462a385b4bc56a90d8d4485)
> a9258240b860351f8af29cdf0648fc8523fa27c6: Hash mismatch (96d4b1f29c9672fa130bd109d2c0c59d561e925d)
> b0742bd352bec323c2b8d6b2168bdf01a6ab2352: Hash mismatch (b148d4ba7f1e07a59df3538b30a05a8df41788b9)
> ca3325cf34f362af3c17350497300ec05da8659f: Hash mismatch (e4d9adad7c115b5e0e6fbe4962953a73b6540302)
> d12afdbcbe213d07ce921d9a96be8f18f6407172: Hash mismatch (c8524c42e8200b6745dcf027bbfbbaf1cc07e3a8)
> cf7640a2dade0db3edea3350779be8bd0cc6f571: Hash mismatch (2ad2419e034ff1c208aa5ef6201f19648116f419)
> df9ca2daa7a2d6e9722cf1b7ec56936996aad01d: Hash mismatch (5a712cb7e2b1be37a0d5606426e24c69197077c0)
> eb9b738c1821a3cfa3ac7589b79eb3e7f1d34b1d: Hash mismatch (2693b9be2ca06bf8ac9cafb0cc42b0b78596cd13)
> f051f318f46e8adb66c067ad48e0b2899c0509f5: Hash mismatch (5d4b4d340be14ed062a5e96bc43af33d2fd9ec40)

Well, it depends.  This particular ruby bug only affects files that are
larger than 2^29 bytes (512 MiB, about 537 MB).  If that explains all of
the above, then great.  If not, then bits might have flipped, or the
filesystem might have eaten stuff, or ...
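
As a rough illustration of that first check, here is a minimal sketch in
Python.  It is not part of the snapshot tooling; the constant and function
names are made up for this example.  It only tests whether a size, as
found in the size column of the file table, is above 2^29 bytes.

```python
# Hypothetical helper, not part of the snapshot code base.
RUBY_SHA1_BUG_THRESHOLD = 2 ** 29  # 536870912 bytes = 512 MiB


def may_be_affected(size_in_bytes: int) -> bool:
    """Return True if a file of this size is larger than 2^29 bytes."""
    return size_in_bytes > RUBY_SHA1_BUG_THRESHOLD


# 877635913 is the size of the texlive-extra tarball in the query output
# shown later in this mail.
print(may_be_affected(877635913))
```

A mismatch on a file below the threshold is not explained by this bug and
deserves a closer look.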

Here's what I usually do:
} 7f5cafdad2e67cc8c039aa19f867bc6426da5223: Hash mismatch (791defea01ea731291bb8a3d57dc6e73df4a94e7)

Find out what the associated filename is:
| snapshot=> select * from file join node using (node_id) where hash='7f5cafdad2e67cc8c039aa19f867bc6426da5223';
|  node_id  | file_id  |                  name                   |   size    |                   hash                   | parent | first | last  
| ----------+----------+-----------------------------------------+-----------+------------------------------------------+--------+-------+-------
|  11273925 | 11062863 | texlive-extra_2011.20120314.orig.tar.gz | 877635913 | 7f5cafdad2e67cc8c039aa19f867bc6426da5223 |  14809 | 22253 | 22549
| (1 row)
| 
| snapshot=> select * from file join node using (node_id) where hash='791defea01ea731291bb8a3d57dc6e73df4a94e7';
|  node_id | file_id | name | size | hash | parent | first | last 
| ---------+---------+------+------+------+--------+-------+------
| (0 rows)

Maybe also find out when and where it first/last appeared:
| snapshot=> select * from directory where directory_id = 14809;
|  directory_id |            path            | node_id 
| --------------+----------------------------+---------
|         14809 | /pool/main/t/texlive-extra |  909346
| (1 row)
| 
| snapshot=> select * from mirrorrun where mirrorrun_id = 22253;
|  mirrorrun_id | archive_id |         run         |            mirrorrun_uuid            |   importing_host    
| --------------+------------+---------------------+--------------------------------------+---------------------
|         22253 |          1 | 2012-03-14 16:09:48 | 17cc0f44-5923-44c0-918b-bf6df9362b88 | sibelius.debian.org
| (1 row)

(archive_id 1 is debian.  select * from archive; for a list)

Now that I know that, I try to find out what the hashsum is supposed to
be from other sources, like a .dsc file, or a Packages/Sources file,
Release file etc.
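
For the .dsc case, the expected digest sits in the Checksums-Sha1 field,
one "<sha1> <size> <filename>" entry per indented line.  A minimal sketch
of pulling it out in Python follows.  The function name is hypothetical,
the sample text is made up with placeholder digests, and the parser
assumes the field name is spelled exactly as shown.

```python
from typing import Optional


def sha1_from_dsc(dsc_text: str, filename: str) -> Optional[str]:
    """Return the SHA1 recorded for `filename` in Checksums-Sha1, or None."""
    in_field = False
    for line in dsc_text.splitlines():
        if line.startswith("Checksums-Sha1:"):
            in_field = True
            continue
        if in_field:
            if not line.startswith(" "):
                break  # a new field begins, so the checksum list is over
            parts = line.split()
            # Each entry is "<sha1> <size> <filename>".
            if len(parts) == 3 and parts[2] == filename:
                return parts[0]
    return None


# Made-up .dsc excerpt with placeholder digests, for illustration only.
SAMPLE_DSC = "\n".join([
    "Format: 3.0 (quilt)",
    "Source: example",
    "Checksums-Sha1:",
    " " + "a" * 40 + " 1234 example_1.0.orig.tar.gz",
    " " + "b" * 40 + " 567 example_1.0-1.debian.tar.gz",
    "Checksums-Sha256:",
    " " + "c" * 64 + " 1234 example_1.0.orig.tar.gz",
])

print(sha1_from_dsc(SAMPLE_DSC, "example_1.0.orig.tar.gz"))
```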

If the original checksum was indeed correct, and the file got corrupted
on disk (yes, that also happens), I try to find the file elsewhere, or
reconstruct it (if it's say a Packages.gz file and we have a
Packages.bz2 file).  This is sometimes tricky, sometimes luck is
involved and sometimes it might not be possible at all.

If the checksum that the file actually has turns out to be the correct
one, and the one we handle it with is wrong, it's time for some surgery.
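
Finding out which checksum a file actually has needs a tool that is not
hit by the bug.  One option, sketched here in Python under the assumption
that farm files are named after their own SHA1 (as in the mv example in
this mail), is to hash the file in chunks and compare the result with the
file name.  Both function names are hypothetical.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so memory use stays constant."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def farm_entry_is_consistent(path):
    """True if the file's content hash equals its file name."""
    return os.path.basename(path) == sha1_of_file(path)
```

Reading in chunks keeps memory use flat regardless of file size, which
matters for the multi-hundred-megabyte tarballs in question.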

Fixing the farm is easy: move the file to the correct place.  If we
already have it under its correct name too, then just rm the wrong name.
| mv 7f/5c/7f5cafdad2e67cc8c039aa19f867bc6426da5223 79/1d/791defea01ea731291bb8a3d57dc6e73df4a94e7 
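
The path layout visible in that mv command (first two hex digits, next
two hex digits, then the full digest) can be written down as a small
Python sketch.  The function name is made up for illustration, and the
layout is inferred only from the command above.

```python
def farm_path(hexdigest: str) -> str:
    """Relative farm path for a digest: aa/bb/aabb... as in the mv example."""
    return f"{hexdigest[0:2]}/{hexdigest[2:4]}/{hexdigest}"


old = farm_path("7f5cafdad2e67cc8c039aa19f867bc6426da5223")
new = farm_path("791defea01ea731291bb8a3d57dc6e73df4a94e7")
print(f"mv {old} {new}")
```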

Correcting the database is more involved.  If there is no existing file
with the correct name, update the file table with the new digest:
| snapshot=> begin ;
| BEGIN
| snapshot=> update file set hash='791defea01ea731291bb8a3d57dc6e73df4a94e7' where hash='7f5cafdad2e67cc8c039aa19f867bc6426da5223';
| UPDATE 1

Also, if the file is referenced from either file_binpkg_mapping or
file_srcpkg_mapping, update that as well:

| snapshot=> update file_srcpkg_mapping set hash='791defea01ea731291bb8a3d57dc6e73df4a94e7' where hash='7f5cafdad2e67cc8c039aa19f867bc6426da5223';
| UPDATE 1

and commit.

| snapshot=> commit;
| COMMIT
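
The same sequence can be sketched as one scripted transaction.  The
Python below uses an in-memory SQLite database as a stand-in for the real
PostgreSQL one, with a heavily simplified two-table schema; the column
list, the srcpkg_id value and the function name are assumptions made for
the example, not the actual snapshot schema.

```python
import sqlite3

OLD = "7f5cafdad2e67cc8c039aa19f867bc6426da5223"
NEW = "791defea01ea731291bb8a3d57dc6e73df4a94e7"


def rename_hash(conn, old, new):
    """Update both tables in one transaction; return rows changed per table."""
    with conn:  # commits on success, rolls back if an exception is raised
        if conn.execute("SELECT 1 FROM file WHERE hash = ?", (new,)).fetchone():
            raise ValueError("correct hash already present; entries need merging")
        files = conn.execute(
            "UPDATE file SET hash = ? WHERE hash = ?", (new, old)).rowcount
        srcs = conn.execute(
            "UPDATE file_srcpkg_mapping SET hash = ? WHERE hash = ?",
            (new, old)).rowcount
    return files, srcs


# Simplified stand-in schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (file_id INTEGER, hash TEXT)")
conn.execute("CREATE TABLE file_srcpkg_mapping (srcpkg_id INTEGER, hash TEXT)")
conn.execute("INSERT INTO file VALUES (11062863, ?)", (OLD,))
conn.execute("INSERT INTO file_srcpkg_mapping VALUES (1, ?)", (OLD,))
conn.commit()

print(rename_hash(conn, OLD, NEW))
```

Refusing to proceed when the correct hash already exists keeps this
sketch limited to the easy case; the merge case needs manual judgement.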

That's the easy case.  If the file is already there with the correct
digest, you might have to or want to merge entries if they appear in
consecutive mirrorruns (see the mirrorrun table to find out if one entry's
last is the run before the other's first).  In that case, update either
first or last of the correct node (node_id), and drop the other node
together with its file entry.
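
The adjacency test can be sketched in Python as follows.  This is an
illustration only: the function name is hypothetical, the run ids in the
usage lines are made up, and it assumes the caller supplies the
mirrorrun_ids of a single archive already sorted by run time.

```python
from typing import List, Optional, Tuple


def merged_range(runs_in_order: List[int],
                 a: Tuple[int, int],
                 b: Tuple[int, int]) -> Optional[Tuple[int, int]]:
    """Return the combined (first, last) if a and b are adjacent, else None.

    runs_in_order: mirrorrun_ids of one archive, sorted by run time.
    a, b: (first, last) mirrorrun_ids of the two node entries.
    """
    position = {run_id: i for i, run_id in enumerate(runs_in_order)}
    # Order the two entries so that `earlier` starts first.
    earlier, later = sorted((a, b), key=lambda node: position[node[0]])
    # Adjacent means: the earlier entry's last run is directly followed
    # by the later entry's first run.
    if position[earlier[1]] + 1 == position[later[0]]:
        return (earlier[0], later[1])
    return None


runs = [10, 11, 12, 15]  # made-up run ids, in run-time order
print(merged_range(runs, (10, 11), (12, 15)))
print(merged_range(runs, (10, 10), (12, 15)))
```

Comparing positions in the ordered run list rather than raw ids matters
because ids of consecutive runs of one archive need not be consecutive
numbers.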

Same thing applies to entries in the file_{bin,src}pkg_mapping.  If the
correct digest is already associated with the package/version in
question, drop the incorrect one, else update its hash to fix it.

HTH.
-- 
                           |  .''`.       ** Debian **
      Peter Palfrader      | : :' :      The  universal
 http://www.palfrader.org/ | `. `'      Operating System
                           |   `-    http://www.debian.org/
