On Sun, Mar 29, 2015 at 06:42:20PM +0200, Lucas Nussbaum wrote:
> Currently three dumps are generated every day:
> - usql.sql.gz, with all the data except the ldap, really_active_dds, and
> pts relations, which are considered "private" data and not suitable for
> wide exposure.
Fine, although (as tille noted) some bugs data sometimes vanishes. Or at least I
fail to import it (I never tracked down the issue; somehow it resolves by
itself... annoying anyway).
> - udd-bugs.sql.xz, with only the bugs data (both archived and
> unarchived -- Andreas Tille was the main user of that -- is this still
> needed?)
> - udd-popcon.sql.xz, with only the popcon data (codesearch.d.n needed
> that -- is this still needed?)
Since I'm interested in the public UDD mirror, I'm interested in the whole
data set.
> 1) what is the rationale for the public UDD mirror. Is there a way this
> could be provided from Debian infrastructure, for example by
> whitelisting specific hosts that need UDD access? Is there something
> here that could be acceptable for DSA (Cced)?
DSA clearly states that they won't open UDD to non-d.o hosts:
From #debian-admin@OFTC on 2015-02-10:
[09:16:54] <h01ger> which could be fixed easily if jenkins.d.n could access 5432
on udd.d.o directly (read only...)
[09:17:57] <h01ger> could this be done rather short term? i'm happy to file a rt
ticket but i would like to know if this can be done quickly
or if should disable all these jobs for nw
[09:18:07] <h01ger> failing jobs are a pita.
[09:18:24] <h01ger> the src ip is 46.16.73.183
[09:22:53] <weasel> we would prefer not to open postgres to non-debian.org
systems.
[09:23:14] <luca> i think weasel is being polite
[09:23:20] <h01ger> prefer or not doing it?
[09:23:24] <luca> we don't open postgres to non-debian.org systems
I think the DSA position is reasonable, yet UDD data is really valuable not only
for Debian-related projects (jenkins in the above quote) but also for people who
want to do random stuff with Debian's data for whatever reason. IMHO providing a
means for those people to access the data is really nice (not just the db dump;
not everybody is able to set up postgres).
> 2) what is the rationale for the more frequent dumps. It's currently
> being dumped once a day. It's never going to be "in sync" with the
> live instance, unfortunately.
For many uses a lag of 24 hours is acceptable, though ~6 hours would of course
be far better. Also, UDD's own data collection is not real time. I think nobody
expects UDD to have real-time data, but when the data you provide can be
two days old, it's bad.
FYI the udd-mirror importer runs hourly (it actually downloads and imports the
dump only if it has changed).
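For illustration only, the "only if it has changed" check could be sketched
as below. This is not the actual udd-mirror importer code: it assumes the
server hosting the dump sends ETag or Last-Modified headers, and the function
names are hypothetical.

```python
import urllib.request


def fetch_validators(url):
    """Return the (ETag, Last-Modified) headers for url via a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("ETag"), resp.headers.get("Last-Modified")


def dump_changed(previous, current):
    """Return True if the dump should be downloaded and imported again.

    previous: validators saved after the last successful import, or None.
    current:  validators just returned by fetch_validators().
    """
    # No saved state, or the server sent no validators: import to be safe.
    if previous is None or current == (None, None):
        return True
    return previous != current
```

An hourly job would call fetch_validators() on the dump URL, compare the
result with the pair saved after the last import, and skip the run when
dump_changed() returns False.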
> 3) Would dumps in "custom format" (pg_dump -Fc) work for you? they allow
> parallel restore with pg_restore.
Aye, that would be good (though I think you should keep providing the plain
dump too, at least to have the data in standard SQL).
Also, a change of the compression format would be fine.
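As a rough sketch of what a parallel restore of a custom-format dump could
look like on the mirror side: the dump file name, database name and job count
below are assumptions, while --jobs and --dbname are standard pg_restore
options.

```python
import subprocess


def restore_command(dump_path, dbname, jobs=4):
    """Build a pg_restore command line for a custom-format (pg_dump -Fc) dump."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    # --jobs restores with several parallel workers; it works with
    # custom-format and directory-format archives, not with plain SQL dumps.
    return ["pg_restore", "--jobs", str(jobs), "--dbname", dbname, dump_path]


def restore(dump_path, dbname, jobs=4):
    """Run pg_restore and raise CalledProcessError if it fails."""
    subprocess.run(restore_command(dump_path, dbname, jobs), check=True)
```

For example, restore("udd.dump", "udd", jobs=4) would run
"pg_restore --jobs 4 --dbname udd udd.dump", with both names being
placeholders.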
> 4) Could some tables be excluded from the dumps?
It won't work for me.
> 5) Couldn't you trigger the dumps from your side, by calling pg_dump
> inside an SSH connection to ullmann.d.o?
Since I'm not a DD I can't, though paulproteus may be able to set it up.
Still, I'd prefer to download standard dumps from udd.d.o.
--
regards,
Mattia Rizzolo
GPG Key: 66AE 2B4A FCCF 3F52 DA18 4D18 4B04 3FCD B944 4540        .''`.
more about me: http://mapreri.org                                : :'  :
Launchpad user: https://launchpad.net/~mapreri                   `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia   `-