Re: Load on master.

To: Guy Maor <maor@debian.org>
Cc: debian-devel@lists.debian.org
Subject: Re: Load on master.
From: Jason Gunthorpe <jgg@gpu.srv.ualberta.ca>
Date: Thu, 5 Nov 1998 23:55:21 -0700 (MST)
Message-id: <[🔎] Pine.LNX.3.96.981105233635.26835Y-200000@Wakko.ualberta.ca>
In-reply-to: <[🔎] 87btml73rs.fsf@femto.dyn.ml.org>

On 5 Nov 1998, Guy Maor wrote:

> > Perhaps could you extract the md5, control file and contents list of each
> > package once and stick it someplace then have the other scripts simply use
> > that information wherever possible?
> 
> Yes, I was planning on doing just with a DBM, relying on stat info to
> check if it's still accurate.  That will save the disks from melting.
> If the Contents sort is still too expensive, I'll use a btree.

I personally would rely on the filename. We are supposed to assure that
the files never change once the enter the archive.. Consider when the disk
went bad, we found it was returning corrupted data all over the place, if
your md5 script hand run again during that period we would have no way to
tell if a file had been damaged :<

Anyhow, I wrote a quick little program that could replace mkcontent's back
end. It uses about as much ram as the raw output from -fsys-tarfile and
is -really- fast. It might need a bit of tweaking, give it a wirl..

~/List in my home dir (will) contain the output from all the debs in hamm
main to test this. The input file looks like:

::admin/foo_1.2.deb
./
usr/bin/
usr/bin/foo
::admin/bar_1.2.deb
./
usr/bin/

The output is probably slightly different than what your script generates,
I drop duplicate dir entries (so there is no bin/) and maybe there are
others..

It takes 9 mins to generate the raw data for hamm. It totals about 4Meg.
Running the program on it:

debian{jgg}~#time ./contents < List > /dev/null
Total of 141528 nodes = 3396672
Total of 1535947 string bytes

real    0m5.300s
user    0m4.570s
sys     0m0.360s

So.. If you store all the information in a dbm we should be able to bring
the total processing down to below 10 mins!

Jason

// -*- mode: cpp; mode: fold -*-
// Description								/*{{{*/
// $Id: apt-get.cc,v 1.72 1998/06/09 05:32:27 jgg Exp $
/* ######################################################################
   
   contents - Archive contents generator
   
   This program is a back end for an archive contents generator. It takes
   the output from dpkg-deb -fsys-tarfile and merges it into a memory
   database of all previos output. This database is stored as a set
   of btrees linked across directories to form a tree of all files+dirs
   given to it. The tree will also be sorted as it is built up thus 
   removing the massive sort time overhead.
   
   By breaking all the pathnames into components and storing them 
   seperately a space savings is realized by not duplicating the string
   over and over again. Ultimately this saving is sacrificed to storage of
   the tree structure itself but the tree structure yeilds a speed gain
   in the sorting and processing. Ultimately it takes about 5 seconds to
   do 141000 nodes and about 5 meg of ram.

   The tree looks something like:
   
     usr/
      / \             / libslang
   bin/ lib/ --> libc6
        /   \         \ libfoo
   games/  sbin/
   
   The ---> is the DirDown link
   
   This program is covered under the GNU GPL Version 2 or later.
   Originally written by Jason Gunthorpe <jgg@debian.org>
   
   ##################################################################### */
									/*}}}*/
// Include Files							/*{{{*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <iostream>
#include <malloc.h>
									/*}}}*/

struct Node
{
   Node *BTreeLeft;
   Node *BTreeRight;
   Node *DirDown;
   Node *Dups;
   const char *Path;
   const char *Package;

   static unsigned int Counter;
   static unsigned int StrCounter;
   
   void *operator new(size_t Amount);
   void operator delete(void *) {};
   
   Node() : BTreeLeft(0), BTreeRight(0), DirDown(0), Dups(0), Path(0), Package(0) {Counter++;};
};
unsigned int Node::Counter = 0;
unsigned int Node::StrCounter = 0;

// Mystrdup - Custom strdup						/*{{{*/
// ---------------------------------------------------------------------
/* This strdup also uses a large block allocator to eliminate glibc
   overhead */
char *Mystrdup(const char *From)
{
   static char *Pool = 0;
   static unsigned long Left = 0;

   unsigned int Len = strlen(From) + 1;
   if (Left <= Len)
   {
      Left = 4096*10;
      Pool = (char *)malloc(Left);
   }
   
   memcpy(Pool,From,Len);
   Left -= Len;
   Node::StrCounter += Len;
   
   char *Res = Pool;
   Pool += Len;
   return Res;
}
									/*}}}*/
// Node::operator new - Big block allocator				/*{{{*/
// ---------------------------------------------------------------------
/* This eliminates glib'cs malloc overhead by allocating large blocks and
   having a continuos set of Nodes. This takes about 8 bytes off each nodes
   space needs. Freeing is not supported. */
void *Node::operator new(size_t Amount)
{
   static Node *Pool = 0;
   static unsigned long Left = 0;
   
   if (Left == 0)
   {
      Left = 10000;
      Pool = (Node *)malloc(Amount*Left);
   }
   
   Left--;
   return Pool++;
}
									/*}}}*/

// Grab - Grab a new node representing Name under Top			/*{{{*/
// ---------------------------------------------------------------------
/* This grabs a new node representing the pathname component Name under
   the node Top. The node is given the name Package. It is assumed that Name
   is inside of top. If a duplicate already entered name is found then 
   a note is made on the Dup list and the previos in-tree node is returned. */
Node *Grab(Node *Top,const char *Name,const char *Package)
{
   /* We drop down to the next dir level each call. This simplifies
      the calling routine */
   if (Top->DirDown == 0)
   {
      Node *Item = new Node;
      Item->Path = Mystrdup(Name);
      Item->Package = Package;
      Top->DirDown = Item;
      return Item;
   }
   Top = Top->DirDown;
   
   int Res;
   while (1)
   {
      Res = strcmp(Name,Top->Path);
      
      // Collision!
      if (Res == 0)
      {
	 // Look for an already existing Dup
	 for (Node *I = Top->Dups; I != 0; I = I->Dups)
	    if (I->Package == Package)
	       return Top;

	 // Add the dup in
	 Node *Item = new Node;
	 Item->Path = Top->Path;
	 Item->Package = Package;
	 Item->Dups = Top->Dups;
	 Top->Dups = Item;
	 return Top;
      }
      
      // Continue to traverse the tree
      if (Res < 0)
      {
	 if (Top->BTreeLeft == 0)
	    break;
	 Top = Top->BTreeLeft;
      }      
      else
      {
	 if (Top->BTreeRight == 0)
	    break;
	 Top = Top->BTreeRight;
      }      
   }

   // The item was not found in the tree
   Node *Item = new Node;
   Item->Path = Mystrdup(Name);
   Item->Package = Package;
   
   // Link it into the tree
   if (Res < 0)
   {
      Item->BTreeLeft = Top->BTreeLeft;
      Top->BTreeLeft = Item;
   }
   else
   {
      Item->BTreeRight = Top->BTreeRight;
      Top->BTreeRight = Item;
   }
   
   return Item;
}
									/*}}}*/
// Add - Add a path to the tree						/*{{{*/
// ---------------------------------------------------------------------
/* This takes a full pathname and adds it into the tree. We split the
   pathname into directory fragments adding each one as we go. Technically
   in output from tar this should result in hitting previous items. */
void Add(Node *Root,const char *Dir,const char *Package)
{
   // Drop leading slashes
   while (*Dir == '/' && *Dir != 0)
      Dir++;
   
   // Run over the string and grab out each bit up to and including a /
   const char *Start = Dir;
   const char *I = Dir;
   while (*I != 0)
   {
      if (*I != '/' || I - Start <= 1)
      {
	 I++;
	 continue;
      }      
      I++;
      
      // Copy the path fragment over
      char Tmp[1024];
      strncpy(Tmp,Start,I - Start);
      Tmp[I - Start] = 0;
      
      // Grab a node for it
      Root = Grab(Root,Tmp,Package);
      
      Start = I;
   }
   
   // The final component if it does not have a trailing /
   if (I - Start >= 1)
      Root = Grab(Root,Start,Package);
}
									/*}}}*/
// WriteSpace - Write a given number of white space chars		/*{{{*/
// ---------------------------------------------------------------------
/* We mod 8 it and write tabs where possible. */
void WriteSpace(signed int Number)
{
   if (Number <= 0)
      Number = 1;
   for (; (Number % 8) != 0; Number--)
      cout << ' ';
   for (; Number != 0; Number -= 8)
      cout << '\t';
}
									/*}}}*/
// Print - Display the tree						/*{{{*/
// ---------------------------------------------------------------------
/* This is the final result function. It takes the tree and recursively
   calls itself and runs over each section of the tree printing out
   the pathname and the hit packages. We use Buf to build the pathname
   summed over all the directory parents of this node. */
void Print(Node *Top, char *Buf)
{
   if (Top == 0)
      return;

   // Go left
   Print(Top->BTreeLeft,Buf);

   // Print the current dir location and then descend to lower dirs
   char *OldEnd = Buf + strlen(Buf);
   if (Top->Path != 0)
   {
      strcat(Buf,Top->Path);
      
      // Do not show the item if it is a directory with dups
      if (Top->Path[strlen(Top->Path)-1] != '/' || Top->Dups == 0)
      {
	 cout << Buf;
	 WriteSpace(60 - strlen(Buf));
	 for (Node *I = Top; I != 0; I = I->Dups)
	 {
	    if (I != Top)
	       cout << ',';
	    cout << I->Package;
	 }
	 cout << endl;
      }      
   }   
   
   // Go along the directory link
   Print(Top->DirDown,Buf);
   *OldEnd = 0;
   
   // Go right
   Print(Top->BTreeRight,Buf);  
}
									/*}}}*/

int main()
{
   Node Root;
   char *CurPackage = 0;
   
   // Iterate over each line in stdin
   while (!feof(stdin))
   {
      char Line[1024];
      if (fgets(Line,sizeof(Line),stdin) == 0)
	 break;
      
      // Strip trailing trash
      for (char *I = Line + strlen(Line); I >= Line && (*I == 0 || 
	     *I == '\n' || *I == '\r'); I--)
	 *I = 0;
      
      // If the line starts with :: then it is a package name, store it
      if (Line[0] == ':' && Line[1] == ':')
      {
	 // Isolate the section tag..
	 char *I = Line + strlen(Line);
	 for (; I > Line + 2 && *I != '/'; I--);
	 if (*I == '/')
	    I--;
	 for (; I > Line + 2 && *I != '/'; I--);
	 if (*I == '/')
	    I++;
	 
	 // Isolate the package name
	 char *I2;
	 for (I2 = I; *I2 != 0 && *I2 != '_'; I2++);
	 *I2 = 0;
	 
	 CurPackage = Mystrdup(I);
	 continue;
      }

      // Add the line to the tree
      Add(&Root,Line,CurPackage);
   }
   
   // Print the tree
   char Buffer[1024];
   Buffer[0] = 0;
   Print(&Root,Buffer);
   
   cerr << "Total of " << Node::Counter << " nodes = " << Node::Counter*sizeof(Node) << endl;
   cerr << "Total of " << Node::StrCounter << " string bytes" << endl;
   return 0;
}

Reply to:

Follow-Ups:
- Re: Load on master.
  - From: Richard Braakman <dark@xs4all.nl>

References:
- Re: Load on master.
  - From: Guy Maor <maor@debian.org>

Prev by Date: Re: Freeze Arches
Next by Date: exim the standard MTA for slink?
Previous by thread: Re: Load on master.
Next by thread: Re: Load on master.
Index(es):
- Date
- Thread