
Bug#781459: udd: please provide dumps more often



On Sun, Mar 29, 2015 at 06:42:20PM +0200, Lucas Nussbaum wrote:
> Currently three dumps are generated every day:
> - usql.sql.gz, with all the data except the ldap, really_active_dds, and
>   pts relations, which are considered "private" data and not suitable for
>   wide exposure.

Fine, even if (as tille noted) some bugs data sometimes vanishes. Or at least, I
fail to import it (I never tracked down the issue; it somehow resolves
itself... annoying anyway).

> - udd-bugs.sql.xz, with only the bugs data (both archived and
>   unarchived -- Andreas Tille was the main user of that -- is this still
>   needed?)
> - udd-popcon.sql.xz, with only the popcon data (codesearch.d.n needed
>   that -- is this still needed?)

Given that I'm interested in the public UDD mirror, I'm interested in the whole
data set.

> 1) what is the rationale for the public UDD mirror. Is there a way this
>    could be provided from Debian infrastructure, for example by
>    whitelisting specific hosts that need UDD access? Is there something
>    here that could be acceptable for DSA (Cced)?

DSA clearly states that they won't open UDD to non-d.o hosts:

From #debian-admin@OFTC on 2015-02-10:
[09:16:54] <h01ger> which could be fixed easily if jenkins.d.n could access 5432
                    on udd.d.o directly (read only...)
[09:17:57] <h01ger> could this be done rather short term? i'm happy to file a rt
                    ticket but i would like to know if this can be done quickly
                    or if should disable all these jobs for nw
[09:18:07] <h01ger> failing jobs are a pita.
[09:18:24] <h01ger> the src ip is 46.16.73.183
[09:22:53] <weasel> we would prefer not to open postgres to non-debian.org
                    systems.
[09:23:14] <luca>   i think weasel is being polite
[09:23:20] <h01ger> prefer or not doing it?
[09:23:24] <luca>    we don't open postgres to non-debian.org systems


I think DSA's position is reasonable, yet UDD data is really valuable not only
for Debian-related projects (jenkins in the above quote) but also for people who
want to do random stuff with Debian's data for whatever reason. IMHO providing a
means for those people to access the data is really nice (not just the db dump;
not everybody is able to set up postgres).

> 2) what is the rationale for the more frequent dumps. It's currently
>    being dumped once a day. It's never going to be "in sync" with the
>    live instance, unfortunately.

For many uses a lag of 24 hours is acceptable, yet ~6 hours would of course be
far better. Also, the data collection in UDD itself is not real time, so the two
lags add up. I think nobody expects UDD to have real-time data, but when the
data you provide can be 2 days old, it's bad.

FYI the udd-mirror importer runs hourly (it only actually downloads and imports
the dump if it has changed).
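
To illustrate the "only if it has changed" part: this is not the actual
udd-mirror importer code, just a minimal sketch of one way to do it, assuming
the server honours If-Modified-Since. The URL and the function name are made up
for illustration.

```python
import urllib.error
import urllib.request

# Hypothetical URL, for illustration only; not the real dump location.
DUMP_URL = "https://udd.example.org/dumps/udd.dump"


def fetch_if_changed(url, last_modified=None, opener=urllib.request.urlopen):
    """Fetch the dump only if it changed since the last run.

    Returns (body, new_last_modified) when the server sends a fresh copy,
    or (None, last_modified) when it answers 304 Not Modified.
    """
    request = urllib.request.Request(url)
    if last_modified:
        # Ask the server to skip the body if nothing changed.
        request.add_header("If-Modified-Since", last_modified)
    try:
        with opener(request) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Unchanged: nothing to download, nothing to import.
            return None, last_modified
        raise
```

An hourly cron job would keep the returned Last-Modified value between runs and
skip the import whenever the body comes back as None.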

> 3) Would dumps in "custom format" (pg_dump -Fc) work for you? they allow
>    parallel restore with pg_restore.

Aye, that would be good (though I think you should keep providing the plain
dump as well, at least to have the data in standard SQL).
Also, a change of compression format would be fine.
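
For reference, a rough sketch of what the custom-format round trip could look
like from a script. The database name, file name and job count below are
assumptions for illustration, not what UDD or the mirror actually uses; only
the pg_dump/pg_restore options themselves are real.

```python
import shlex


def pg_dump_command(dbname, outfile):
    """Build a pg_dump call that writes a custom-format (-Fc) archive."""
    return ["pg_dump", "--format=custom", f"--file={outfile}", dbname]


def pg_restore_command(dumpfile, dbname, jobs=4):
    """Build a pg_restore call that restores with several parallel jobs.

    Parallel restore needs a custom- or directory-format archive;
    it does not work with a plain SQL dump.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    return [
        "pg_restore",
        f"--jobs={jobs}",
        f"--dbname={dbname}",
        "--no-owner",  # do not try to restore the original object owners
        dumpfile,
    ]


if __name__ == "__main__":
    # Hypothetical names; print the commands instead of running them.
    print(shlex.join(pg_dump_command("udd", "udd.dump")))
    print(shlex.join(pg_restore_command("udd.dump", "udd", jobs=8)))
```

The lists can be passed to subprocess.run() as they are once the names match
the local setup.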

> 4) Could some tables be excluded from the dumps?

That wouldn't work for me: as said above, I need the whole data set.

> 5) Couldn't you trigger the dumps from your side, by calling pg_dump
>    inside an SSH connection to ullmann.d.o?

Given that I'm not a DD, I can't, though maybe paulproteus could set it up.
Still, I'd prefer to download standard dumps from udd.d.o.


-- 
regards,
                        Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18  4D18 4B04 3FCD B944 4540         .''`.
more about me:  http://mapreri.org                                 : :'  :
Launchpad user: https://launchpad.net/~mapreri                     `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia     `-
