
Re: RFH: Debian derivatives census



On Thu, 2020-09-03 at 14:12 -0400, Jeremiah C. Foster wrote: 

> I would like to add that I've recently learned that the Derivatives
> Census can help determine programmatically the delta between Debian and
> a Derivative (if things are correctly configured.) For a distribution
> such as ours which aims for binary compatibility and wants to stay as
> close to Debian as possible, this is extremely valuable. 

I think you are referring to the patch generation?

https://wiki.debian.org/Derivatives/Integration#Patches

The size of the metadata about the patches is what is causing the
memory issues.

The patch generation itself can currently only be run on the Debian
servers at LeaseWeb because it relies on access to the snapshot.d.o
database and hash-based filesystem. There is a TODO item about porting
it to the snapshot.d.o API instead, so that derivatives that have
private apt repositories can also run it locally.

> I feel that it is our responsibility to contribute back to Debian
> (which we try to do) everything we can and I think that contributing
> time and effort is the least we can do.

Excellent. Please take a look at the census codebase and the wiki pages
I have linked to, and run the codebase locally to see how it works.

> The Debian package tracker will be of particular interest to me because
> of the ability to understand the delta from Debian to a derivative. I'm
> more than happy to contribute in any way I can and will review those
> URLs to find some low-hanging fruit to get me started.

The main work needed on the package tracker is to replace the Ubuntu
panel with a patches panel that links to available patches in various
places including from the derivatives census.

https://bugs.debian.org/779400

> Is there a preferred channel for communication?
> Is the mailing list preferred over IRC?

This thread, the debian-derivatives mailing list and the IRC channel
are all good places to discuss the census, and I'll respond in any of
them.

> Regarding RAM and CPUs, I have a VM running Bullseye at Linode which we
> can use for Gitlab runners or the like. Perhaps this will be of use?

The RAM issue is mainly caused by part of the service not being written
in a scalable way: it just loads giant YAML files into memory. Throwing
more RAM at the problem or making the in-memory storage more efficient
would be the wrong approach, since eventually the patch metadata in the
YAML files will exceed the available RAM. A database would be a better
way to do it. So we need changes to the codebase to store the data in a
database instead, plus a script to stream the YAML into the database
without loading it all into RAM. Here are a few links I gathered on the
problem:

https://habr.com/en/post/458518/
https://news.ycombinator.com/item?id=20401055
https://stackoverflow.com/questions/429162/how-to-process-a-yaml-stream-in-python
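
To make the streaming idea concrete, here is a rough sketch rather than
the census code itself. It assumes the patch metadata can be written as
a multi-document YAML stream (documents separated by "---"), uses
PyYAML's safe_load_all to parse one document at a time, and stores each
record as JSON text in SQLite. The table name "patches", the function
names and the batch size are all made up for illustration; the real
schema would depend on how the census queries the data.

```python
import json
import sqlite3
from itertools import islice


def iter_yaml_documents(path):
    """Yield one parsed document at a time from a multi-document YAML file."""
    import yaml  # PyYAML (third-party); only needed for this function

    with open(path, encoding="utf-8") as stream:
        # safe_load_all returns a generator, so documents are parsed
        # lazily instead of the whole file being loaded at once.
        yield from yaml.safe_load_all(stream)


def store_records(conn, records, batch_size=1000):
    """Insert records into SQLite in batches; return the number of rows."""
    # Hypothetical schema: one JSON blob per patch metadata record.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patches ("
        "id INTEGER PRIMARY KEY, data TEXT NOT NULL)"
    )
    records = iter(records)
    total = 0
    while True:
        # Pull at most batch_size records, so memory use is bounded
        # by the batch size rather than by the size of the input.
        batch = [
            (json.dumps(record, sort_keys=True, default=str),)
            for record in islice(records, batch_size)
        ]
        if not batch:
            break
        with conn:  # commit one transaction per batch
            conn.executemany("INSERT INTO patches (data) VALUES (?)", batch)
        total += len(batch)
    return total
```

With a hypothetical input file it would be used roughly like this:
store_records(sqlite3.connect("census.db"),
iter_yaml_documents("patches.yaml")). If the existing files are one
giant YAML document rather than a stream of documents, they would need
to be split first, or parsed with PyYAML's event-based interface
instead.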

-- 
bye,
pabs

https://wiki.debian.org/PaulWise
