
Re: split descriptions Re: PROPOSAL to sarge+1 - Split main in sub-repositories



hi

first of all, I redid the computation, using gzip this time:

 cp /var/lib/apt/lists/*sarge_main_binary-i386_Packages /tmp/main_all
 cd /tmp
 # keep the package names, the Description fields and their continuation lines
 egrep '^ |^Description|^Packa' main_all > main_descr
 # keep everything except the descriptions
 egrep -v '^ |^Description' main_all > main_data
 ls -s main_*
 gzip -v main_*
 ls -s main_*

result (compressed sizes, in kB):  3044 main_all.gz  1220 main_data.gz  1672 main_descr.gz

as you can see, we save 1824 kB each time we skip downloading the
descriptions, or 152 kB when we download both parts
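
To spell out the arithmetic from the sizes above:

 # saving when the descriptions are not downloaded at all
 echo $(( 3044 - 1220 ))           # 1824 kB
 # saving when both the stripped file and the descriptions are downloaded
 echo $(( 3044 - 1220 - 1672 ))    # 152 kB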

 ---

second of all, I want to stress that we may split the descriptions out without
using the macro system that I suggested: I acknowledge that the macro system is
too ambitious, and it would break some programs.

 ---

then, some answers

Matt Zimmerman wrote:
On Sun, Sep 05, 2004 at 10:36:22AM +0200, Andrea Mennucc wrote:
Matt Zimmerman wrote:

If you skipped downloading descriptions, then you would not have
descriptions for any new packages, and the ones that you have could be wrong
(some of them do contain version-specific data).

you would at least have the short descriptions

it would also be quite easy to add a web interface:
APT would download the package dependencies and then download the
descriptions of the packages that are missing (using the interface above)

And the mirrors would run this web interface as well?  Or all descriptions
would then be served from a single server?  Either way, it doesn't scale.

Are you serious? Did you do the math?

Suppose that I add such an interface to debian.org.
How many packages are added _daily_ to Debian/unstable?
From the Debian Weekly News, I would say ~3: the web interface would then serve
the _descriptions_ of those packages (~500 bytes in total) to all the people tracking
Debian/unstable or Debian/testing. Suppose that this is ~5000 people:
that adds ~2500 kB of traffic on the Debian main site _daily_.

But suppose that just 2 (two!) of those users download the slimmed-down (description-less) Packages.gz file from debian.org: that alone saves ~3600 kB of traffic on debian.org. And it saves traffic on each and every mirror as well.

So we would be trading ~2500 kB of extra traffic on the central server for a bigger saving on the central server and on all mirrors.
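
A quick back-of-the-envelope check of both sides of this trade, with the
numbers above (the 5000 users and the ~500 bytes of descriptions are, of
course, rough estimates):

 # extra daily traffic from serving the new descriptions over the web interface
 echo $(( 5000 * 500 / 1000 ))     # 2500 kB
 # daily traffic saved if just 2 users fetch the description-less Packages.gz
 echo $(( 2 * (3044 - 1220) ))     # 3648 kB, i.e. ~3600 kB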

Do you see it?

Then, once a week, those people would download the complete Descriptions.gz file, to get updated descriptions of the older packages as well. Even in that case, they would still save 152 kB.

And did you think of i18n?
Why should people who do not speak English (as a first language)
download the Descriptions in English and store them on their hard disk?
By splitting the descriptions out, they could download them in their favourite
language instead.
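
For instance, with a hypothetical per-language layout (the Descriptions-<lang>.gz
file names below are made up for illustration, nothing like that exists in the
archive today), an Italian user would fetch something like:

 # hypothetical layout: Packages.gz stripped of descriptions, plus one
 # Descriptions-<lang>.gz file per language
 MIRROR=http://ftp.debian.org/debian/dists/sarge/main/binary-i386
 wget -q $MIRROR/Packages.gz
 wget -q $MIRROR/Descriptions-it.gz   # Italian descriptions only (hypothetical)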

This is what general-purpose data compression does; there is no need to
invent a macro language.

There is a large literature on the shortcomings of "general-purpose data compression" when it is applied to "non-general data", that is, data with some known structure.
For example, try to
-) use bzip2 or gzip on an XML file, and compare that with
  binary XML formats
  http://www-106.ibm.com/developerworks/xml/library/x-tipcomp.html
-) use gzip on a dictionary (a sorted list of words) without
   first preprocessing it to factor out the common prefixes
   (see the small sketch below)
-) use bzip2 or gzip on a raw image, compared to using
   lossless JPEG2000
-) the same for audio and FLAC

(if "general-purpose data compression" was so efficient, why would people invent other special-purpose systems?)

Quite the opposite: I study and teach compression, and I can tell you
that macros could provide a benefit.

You say this, but you do not show it.

When I have the time, I will test it all and show you the results.
Anyway, please forget the macro idea, for the sake of the
"split descriptions" idea.

 If the Packages file is suitably
ordered, gzip should come quite close to the efficiency of your macro
scheme, without the incredible increase in complexity that you propose,
which would then require implementation in the hundreds of tools which parse
the Packages file.

I do not agree.
If "the Packages file is suitably ordered", this may break
some programs as well.

But, again, forget the macro system for a while. Let's concentrate on the idea of splitting the descriptions out.


it depends on the point of view....

when APT memory-maps all those files, that is a LOT of memory (for older systems): a macro language would decrease that and ease the load on older systems (and on newer ones, too)


APT doesn't mmap the Packages files.


either way, it reads them from the hard disk: the fatter they are, the slower APT will be


The fact is, Debian unstable only gets
new versions of packages once a day.

if you call that "only"


I do.

well, I don't. I don't see the need to waste time and bandwidth daily

A long time ago, I set up a cron job
which runs:    ......

your solution assumes that
1) you keep your PC on 24h/day
2) you do not pay for the connection


It assumes neither.  It only assumes that you are capable of scheduling a
job to run at a time when it is convenient for you.  You can buy a device
for a few USD which switches current on/off based on the time of day, and
modern PCs can switch themselves on based on a BIOS setting.


BTW, this whole thread is based on the needs of people who do not have modern PCs (such as me)

Whether you pay for bandwidth is also irrelevant.  You would pay for that
bandwidth regardless of whether you download attended or unattended.

but the bandwidth cost is relevant for the idea of splitting the descriptions out of Packages

  My
unstable system seems to download on the order of 10-20M of debs per day for
an upgrade, and about 2M of Packages files, so even if you could reduce that
by 50%, that is only a 5-10% savings on the total download.


that is money nonetheless. How many times are you offered an easy and simple way to save 5-10% of your money?
(if you are, please tell me! :-)

As is so unfortunately common when someone offers a "solution" to a problem
like this in Debian, the "facts" are unverified,

I did a test of the savings, with bzip2 (which I showed last time) and
with gzip (which I show at the top of this message).

the claims misleading, and
the solutions are worse than the original problem.

I showed you numbers, not "misleading claims".

It seems that you have had some very bad nights of sleep lately, Mr Zimmerman.


If you would prefer not to download package descriptions, I suggest that you
provide Packages files with descriptions filtered as a service for yourself
and anyone who is interested.  Perhaps you can convince your local mirror to
host a symlink farm which lets you implement this easily.  I do not think
that this approach is appropriate for the official Debian archive.

(if I find the time) that is exactly what I will do.

(*) And, BTW, if I am lazy, I do not even need to create a special web interface for downloading package descriptions: there already is one, it is called packages.debian.org.
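
For example (a rough sketch only: the output format of apt-get -s and the
URL layout of packages.debian.org are used loosely here, this is not a real,
robust tool):

 # list the packages that a simulated upgrade would install, and fetch the
 # description page of each one that is not in the local Packages lists
 for pkg in $(apt-get -s dist-upgrade | awk '/^Inst/ {print $2}'); do
   grep -q "^Package: $pkg\$" /var/lib/apt/lists/*_Packages || \
     wget -q -O - "http://packages.debian.org/$pkg" >> missing_descriptions.html
 done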


 ----------------------

BTW, the "gettext" implementation of i18n in Debian does not scale at all: what if Debian and Linux really become popular, and most packages get translated into ~50 languages?

a.


