Re: Request to get Permission for Data extraction
- To: Simon Josefsson <simon@josefsson.org>
- Cc: debian-snapshot@lists.debian.org
- Subject: Re: Request to get Permission for Data extraction
- From: Linus Nordberg <linus@glasklarteknik.se>
- Date: Sat, 12 Apr 2025 18:04:54 +0200
- Message-id: <87wmbpqsbd.fsf@nordberg.se>
- In-reply-to: <87o6xph33i.fsf@josefsson.org> (Simon Josefsson's message of "Tue, 25 Mar 2025 10:27:45 +0100")
- References: <34242d40f667ec25aa0caee0e70d8648@stud.hs-merseburg.de> <87ikofyxzi.fsf@nordberg.se> <87ikoek477.fsf@josefsson.org> <877c4k7znq.fsf@nordberg.se> <87o6xph33i.fsf@josefsson.org>
Simon Josefsson <simon@josefsson.org> wrote
Tue, 25 Mar 2025 10:27:45 +0100:
> Linus Nordberg <linus@glasklarteknik.se> writes:
>
>> Simon Josefsson <simon@josefsson.org> wrote
>> Wed, 12 Mar 2025 10:07:08 +0100:
>>
>>>> As pointed out in another response to your request, it might make sense
>>>> for you to ask for (a copy of) the metadata kept in the database.
>>>
>>> Could the snapshot team make those public?
>>>
>>> It is harder than it should be to mirror snapshot locally. You have to
>>> screenscrape the web interface to get full data. This creates
>>> unnecessary load, so it would be nice if at least the list of filenames
>>> (essentially SHA1 hashes) could be published. Right now this
>>> information is hidden. As far as I understood earlier discussions on
>>> this, that hiding is intentional (for reasons I couldn't understand).
>>
>> Hi Simon,
>>
>> Do you want to operate a full Snapshot mirror, contributing to the
>> operations of the Snapshot service? Snapshot has a method for mirroring
>> the farm described in [mirror/README][]. In addition to that you would
>> set up postgresql for replication, to keep your db up to date with the
>> primary.
>>
>> If not, have you tried accessing the Snapshot database using the
>> 'snapshot-guest' user? The pgsql client would have to make its
>> connection from a Debian machine allowed to connect to the db (on the
>> primary or any of the replicas). I don't know how to compile the list of
>> these machines but DSA surely do.
>
> Hi! My idea has been to announce my personal mirror of snapshot (in use
> for a year or so already, hosted at Hetzner), and assuming it has been
> operational for another year or so with good public availability, it
> could be discussed if it makes sense to include it as another official
> Snapshot mirror. So not a clear answer to your question, but at least
> sharing my thinking.

Thanks. Let me share some of my current thinking in return.

The Snapshot service still doesn't have a proper owner. This makes every
small thing take forever (or just slightly less time) to get done,
including seemingly simple decisions.

Snapshot has excellent support from DSA (thanks to pkern AFAICT) but is
still limping at the service level. I don't know exactly how to solve
this but I'm interested in finding a solution. Suggestions welcome (on
or off list).

>> I don't remember the discussion about hiding information on which files
>> exist in the farm. What arguments were posed for doing that?
>
> I never understood the arguments. The replies I got about a year ago on
> IRC made me believe that the snapshot team do not want to make the
> database of filename to SHA1 hashes public, and that you did not want to
> see non-official mirrors.
>
> Perhaps this was a misunderstanding, or things have changed?
>
> Could you make (say, a daily) export of the database publicly available?
>
> If not, what is the reason for not making this information public?

I think this is a question of how to distribute the database and not
about keeping any information hidden from the public. With a PostgreSQL
on-disk size of ~100G, how should a daily export be made available?

Also, when you say "publicly available", do you mean available to the
internet at large or to DDs? The former would probably need some kind
of rate limiting, while the latter is supposedly already the case (did
you try 'snapshot-guest'?).
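
To put the distribution question in rough numbers, here is a
back-of-the-envelope sketch in Python. Only the ~100G figure comes from
this thread; the link speeds and download counts are made-up
assumptions, "G" is treated as 10^9 bytes, and an actual dump may well
differ in size from the on-disk footprint.

```python
# Rough estimate only. Everything except the ~100G size is an assumption.

DUMP_SIZE_BYTES = 100 * 10**9  # ~100G, taken as 10^9-byte gigabytes


def transfer_hours(size_bytes: int, link_mbit_per_s: float) -> float:
    """Hours needed to move size_bytes over a link of the given speed."""
    bytes_per_second = link_mbit_per_s * 10**6 / 8
    return size_bytes / bytes_per_second / 3600


def daily_egress_tb(size_bytes: int, downloads_per_day: int) -> float:
    """Terabytes served per day if every client fetches the full export."""
    return size_bytes * downloads_per_day / 10**12


if __name__ == "__main__":
    # Hypothetical client link speeds in Mbit/s.
    for mbit in (100, 1000):
        hours = transfer_hours(DUMP_SIZE_BYTES, mbit)
        print(f"{mbit} Mbit/s: {hours:.1f} h per download")
    # Hypothetical number of full downloads per day.
    for clients in (1, 10, 100):
        tb = daily_egress_tb(DUMP_SIZE_BYTES, clients)
        print(f"{clients} downloads/day: {tb:.1f} TB/day")
```

Under these assumptions the egress grows linearly with the number of
full downloads, which is where some form of rate limiting or
incremental export would come into play.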