On Sun, Mar 29, 2015 at 06:42:20PM +0200, Lucas Nussbaum wrote:
> Currently three dumps are generated every day:
> - usql.sql.gz, with all the data except the ldap, really_active_dds, and
>   pts relations, which are considered "private" data and not suitable for
>   wide exposure.

Fine, even if (as tille noted) some bug data sometimes vanishes, or at
least I fail to import it. I never tracked down the issue; it somehow
resolves by itself, which is annoying anyway.

> - udd-bugs.sql.xz, with only the bugs data (both archived and
>   unarchived -- Andreas Tille was the main user of that -- is this still
>   needed?)
> - udd-popcon.sql.xz, with only the popcon data (codesearch.d.n needed
>   that -- is this still needed?)

Given that I'm interested in the public UDD mirror, I'm interested in the
whole data set.

> 1) what is the rationale for the public UDD mirror. Is there a way this
> could be provided from Debian infrastructure, for example by
> whitelisting specific hosts that need UDD access? Is there something
> here that could be acceptable for DSA (Cced)?

DSA clearly states that they won't open UDD to non-d.o hosts:

From #debian-admin@OFTC on 2015-02-10:
[09:16:54] <h01ger> which could be fixed easily if jenkins.d.n could access 5432 on udd.d.o directly (read only...)
[09:17:57] <h01ger> could this be done rather short term? i'm happy to file a rt ticket but i would like to know if this can be done quickly or if should disable all these jobs for nw
[09:18:07] <h01ger> failing jobs are a pita.
[09:18:24] <h01ger> the src ip is 46.16.73.183
[09:22:53] <weasel> we would prefer not to open postgres to non-debian.org systems.
[09:23:14] <luca> i think weasel is being polite
[09:23:20] <h01ger> prefer or not doing it?
[09:23:24] <luca> we don't open postgres to non-debian.org systems

I think DSA's position is reasonable, yet UDD data is really valuable, not
only for Debian-related projects (jenkins in the quote above) but also for
people who want to do random stuff with Debian's data for whatever reason.
IMHO providing a means for those people to access the data is really nice
(and not just the db dump; not everybody is able to set up postgres).

> 2) what is the rationale for the more frequent dumps. It's currently
> being dumped once a day. It's never going to be "in sync" with the
> live instance, unfortunately.

For many uses a lag of 24 hours is acceptable, yet ~6 hours would of course
be far better. Also, the data collection done by UDD is itself not real
time. I think nobody expects UDD to have real-time data, but when the data
you provide can be 2 days old, it's bad.
FYI the udd-mirror importer runs hourly (it actually downloads and imports
the dump only if it has changed).

> 3) Would dumps in "custom format" (pg_dump -Fc) work for you? they allow
> parallel restore with pg_restore.

Aye, it would be good (even if I think you should keep providing the plain
dump, at least to have the data in standard SQL).
Also, a change of the compression format would be fine.

> 4) Could some tables be excluded from the dumps?

That wouldn't work for me.

> 5) Couldn't you trigger the dumps from your side, by calling pg_dump
> inside an SSH connection to ullmann.d.o?

Given that I'm not a DD I can't, though paulproteus may be able to set it
up. Still, I'd prefer to download standard dumps from udd.d.o.

-- 
regards,
                        Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18  4D18 4B04 3FCD B944 4540      .''`.
more about me:  http://mapreri.org                              : :'  :
Launchpad user: https://launchpad.net/~mapreri                  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-