Some thoughts about fields in UDD
Hi,
I did some checking investigation in UDD and stumbled about some
data I do not see any sense:
** The column 'origin' **
udd=> SELECT origin, COUNT(*) FROM packages GROUP BY origin;
origin | count
--------+--------
| 817717
debian | 188
Debian | 11
(3 rows)
udd=> SELECT DISTINCT package, source, origin, section, component FROM packages WHERE origin ilike 'debian';
package | source | origin | section | component
----------------+----------------+--------+------------------+-----------------------
libdpkg-dev | dpkg | debian | libdevel | main
modutils | modutils | debian | admin | main
emifreq-applet | emifreq-applet | Debian | gnome | main
modutils-full | modutils | debian | debian-installer | main/debian-installer
libdpkg-perl | dpkg | debian | perl | main
dpkg | dpkg | debian | admin | main
dpkg-dev | dpkg | debian | utils | main
dselect | dpkg | debian | admin | main
modutils-basic | modutils | debian | debian-installer | main/debian-installer
(9 rows)
However
udd=> SELECT origin, COUNT(*) FROM ubuntu_packages GROUP BY origin;
origin | count
--------+--------
Ubuntu | 295722
(1 row)
So in ubuntu_packages the column is consistently filled - but what is
the sense of this column? Just telling that a package is part of a
certain table? Shouldn't this be simply possible in a view. For me
this is a total waste of disk space (even if I would assume the column
would be properly used in packages).
** The column 'bugs' **
udd=> SELECT bugs, COUNT(*) FROM packages GROUP BY bugs;
bugs | count
--------------------------------------+--------
| 817703
debbugs://bugs.debian.org | 155
http://sourceforge.net/projects/xfe/ | 30
mailto:formorer@debian.org | 2
mailto:rra@debian.org | 15
debbugs://bugs.debian.org/ | 11
(6 rows)
udd=> SELECT DISTINCT package, source, section, component, bugs FROM packages WHERE bugs != '';
package | source | section | component | bugs
--------------------------+--------------------------+----------+-----------+--------------------------------------
emifreq-applet | emifreq-applet | gnome | main | debbugs://bugs.debian.org/
dpkg | dpkg | admin | main | debbugs://bugs.debian.org
libdpkg-dev | dpkg | libdevel | main | debbugs://bugs.debian.org
dselect | dpkg | admin | main | debbugs://bugs.debian.org
xfe-i18n | xfe | x11 | main | http://sourceforge.net/projects/xfe/
libafs-perl | libafs-perl | perl | main | mailto:rra@debian.org
xfe | xfe | x11 | main | http://sourceforge.net/projects/xfe/
libdpkg-perl | dpkg | perl | main | debbugs://bugs.debian.org
dpkg-dev | dpkg | utils | main | debbugs://bugs.debian.org
debian-backports-keyring | debian-backports-keyring | misc | main | mailto:formorer@debian.org
xfe-themes | xfe | x11 | main | http://sourceforge.net/projects/xfe/
(11 rows)
So the packages table is featuring two columns which are obviosely not needed
because they contain redundant information (if at all, with some questionable
exceptions in the case of the bugs field). The situation with the bugs field
is a bit different in ubuntu_packages:
udd=> SELECT bugs, release, COUNT(*) FROM ubuntu_packages GROUP BY bugs, release;
bugs | release | count
--------------------------------------------+----------+-------
https://bugs.launchpad.net/ubuntu/+filebug | jaunty | 42275
https://bugs.launchpad.net/ubuntu/+filebug | maverick | 50509
https://bugs.launchpad.net/ubuntu/+filebug | karmic | 45346
https://bugs.launchpad.net/ubuntu/+filebug | lucid | 47377
mailto:ubuntu-users@lists.ubuntu.com | dapper | 29933
mailto:ubuntu-users@lists.ubuntu.com | hardy | 39104
mailto:ubuntu-users@lists.ubuntu.com | intrepid | 41178
(7 rows)
However I would call the information redundant as well because it can be easily
calculated from the release field.
Finally I'm not convinced that te column 'tag' is efficiently kept in the
packages table. We have the debtags table which contains all needed
information in a structured form. What is the sense of having a copy of
the very same information in an aggregated form in as many instances as
a package has version*architectures in the packages table. I did no
measures but intuitively that's simply a waste of space while I can
not imagine a use case where the performance gain is high enough to
justify this waste (but I might be wrong here).
Kind regards
Andreas.
--
http://fam-tille.de
Reply to: