
Bug#1011343: WISHLIST: Offical ALL-IN-ONE images?



Hi,

Zhang Boyang wrote:
> These experiments all succeeded. Thank you very much! Good Job! :)

Thank you for testing and for the challenge.


I wrote:
> > For now i decided to take the 50 seconds with dash.

> if you really want to reduce runtime I would suggest using
> `sort -s -u -k 2 merged_md5sum.txt' instead of processing each line
> by hand.

The task is to identify those files which need a newly computed MD5
because they might have changed. Mostly I know which directories are
suspects, because they are on hard disk and get mapped back into the
emerging ISO. Their MD5s get recomputed from the files on hard disk.
Some other paths in md5sum.txt may appear multiple times. In this case it
is clear that the data of the file in the emerging ISO stem from iso1,
but it is not clear which of the multiple lines in md5sum.txt stems from
iso1. So the MD5 has to be recomputed from the file in the mounted iso1.
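
For illustration, such duplicate paths could be spotted like this (a
sketch, not the actual script code; the usual two-field "MD5  ./path"
layout of md5sum.txt is assumed):

  # List each path (field 2) which occurs on more than one line and
  # therefore needs its MD5 recomputed from the mounted iso1.
  awk '{count[$2]++} END {for (p in count) if (count[p] > 1) print p}' \
      merged_md5sum.txt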


> I saw there is some other logic to process md5 records from different
> groups of files, so we can use `grep' and `grep -v' to split them,
> process them separately, then merge them at the end.

That's a great idea.
The majority of files are in ./pool and surely need no recomputing, even
if they are listed multiple times (due to overlapping ISO pools).

This here

  ( # pool lines: any duplicates are identical, so uniq suffices
    fgrep ' ./pool/' <merged_md5sum.txt | uniq
    # all other lines go through the full polishing
    fgrep -v ' ./pool/' <merged_md5sum.txt | polish_md5sum_txt ) \
  | sort -k 2 >temp_file

needs 1.9 seconds instead of 7.2 seconds with the old

  polish_md5sum_txt >temp_file

Times were measured by date '+%s.%N' around the polishing commands.
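That measurement looked roughly like this (a sketch; GNU date's %N and
bc for the floating-point subtraction are assumed):

  t0=$(date '+%s.%N')
  polish_md5sum_txt >temp_file    # or the new fgrep/uniq pipeline
  t1=$(date '+%s.%N')
  echo "polishing took $(echo "$t1 - $t0" | bc) seconds"
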
polish_md5sum_txt and its subordinate were slightly modified for the new
method, so that they read from stdin and do not expect any ./pool files.
The latter change alone saved 0.9 seconds.

The number of lines in md5sum.txt is then the same as with the old method.
My test loop with md5sum -c on the mounted result ISO reports no
mismatches. (It is annoying that gzip inserts a time stamp, so that the
Packages.gz files differ although they bear the same uncompressed
content. So the md5sum.txt file shows differences, too, from run to run.)
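
A hedged aside: gzip's -n (--no-name) option keeps the time stamp out of
the gzip header, so identical input would yield byte-identical output.
Whether the tool which produces Packages.gz can be told to use it is
another question.

  # -n omits the original name and time stamp from the gzip header,
  # which makes the output reproducible for identical input.
  gzip -n -9 <Packages >Packages.gz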


> Unfortunately the option `-s' of `sort' is not standard

I understand that -s is needed to keep sort -k 2 from additionally
comparing the parts of the lines outside of the key, so that sort -u can
throw out the surplus lines with duplicate paths.
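
For illustration (with GNU sort; the tie-breaking of these nonstandard
options may differ elsewhere):

  printf '%s\n' 'md5a ./pool/x.deb' 'md5b ./pool/x.deb' \
  | sort -s -u -k 2

This prints only the first line: -u compares only the key from field 2
onward, and -s keeps lines with equal keys in input order.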

But with the above code, sort -u is not needed.
If duplicate paths appear at all, the pool lines have to be identical.
(I only know of one old Debian package which existed with different
content but the same name, long ago.) So uniq can do its job.
The other lines are made unique by the shell function polish_md5sum_txt.

Complexity-wise this replaces a slow O(n) algorithm by a faster O(n) pass
plus an additional O(n * log(n)) sorting run. At some size of Debian the
speed advantage over the slow linear loop will be eaten up by the sorting
complexity. But there is still room: a sort of 11,000 lines lasts about
0.03 seconds.
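
(A rough extrapolation from that figure, assuming pure n * log(n)
scaling: even 1,000,000 lines would sort in about
0.03 s * (1000000 * log(1000000)) / (11000 * log(11000)), i.e. roughly
4 seconds.)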

I will probably commit this change tomorrow. First it needs cleanup and
handling of the new dependency on uniq.


Have a nice day :)

Thomas

