
Re: unzip.h and unzip.c files in source packages.

On Tue, Dec 15, 2009 at 04:05:38PM +0000, Neil Williams wrote:
> Just because a filename is reused doesn't mean that the packages
> put the same code into that filename.

On Tue, Dec 15, 2009 at 06:41:51PM +0000, Stephen Gran wrote:
> Whether or not these are actually the same files, I would have thought
> it was clear that option two is the better option in general?

On Tue, Dec 15, 2009 at 02:07:45PM -0500, Michael Gilbert wrote:
> The security-tracker's known embedded code copies list [0] would be
> a good resource of reference source code that should be searched in
> these lintian checks.
> Anyway, implementing this could involve some significant work, and I
> personally do not have the time for it, but it would be incredibly
> useful; especially from a security standpoint since dealing with
> embedded code is very tedious and time-consuming.

On Tue, Dec 15, 2009 at 04:29:04PM -0600, Raphael Geissert wrote:
> http://source.debian.net/source/search?q=%22Decryption+code+comes+from+crypt.c+by+Info-ZIP%22&defs=&refs=&path=&hist=
> It yields better results.

Thank you all for your answers.

I think the fact that copies of minizip and other commonly duplicated source
files are missing from some debian/copyright declarations clearly demonstrates
that we do not manage to conform to our policy of centralising and reproducing
all copyright statements in a single location. Luckily, this does not
systematically result in a license violation, in particular if we agree that
the source and binary packages form a single unit (and at least for the GPL
they must be considered as one single entity). But there are at least a couple
of more serious cases, for instance when the missed file is an MD5
implementation with an advertising clause…

So while I am all in favor of a relaxation of our policy of copyright
documentation, I also agree that there would be a use for an automatic
detection of the most commonly duplicated source files. As Neil and Raphael
suggested, there are better ways than relying on the file's name. I was hoping
that for some of them that are not mutating too fast, maybe an MD5-based
approach could be useful. Inspection of some unzip.c and md5.h files makes me
more pessimistic as it uncovered additional difficulties. In some projects, a
boilerplate claiming copyright and indicating an additional license is added to
all files, apparently regardless of whether they have been modified. In other
cases, the whitespace around the original license statement has been changed.

Nevertheless, if a good combination of MD5 sums and heuristics can be found for
the biggest cases, this would have applications beyond just making Lintian
checks. First, the list of packages sharing source code could be used to expand
some entries of the security team's list of embedded code copies. Second,
pre-made copyright declarations could be prepared in DEP-5 format, to save
everybody time. I will report if I make progress on this part.
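
To illustrate what the simplest form of such a heuristic might look like, here
is a minimal Python sketch. It is only an assumption of one possible approach,
not an existing Lintian check: it collapses whitespace before computing the MD5
sum, then groups files that share a digest. The function names and the example
paths are hypothetical.

```python
import hashlib
import re


def normalized_md5(text: str) -> str:
    """Return the MD5 hex digest of text with whitespace runs collapsed."""
    # Collapse every run of spaces, tabs and newlines into one space so that
    # re-indented or re-wrapped copies still produce the same digest.
    collapsed = re.sub(r"\s+", " ", text).strip()
    return hashlib.md5(collapsed.encode("utf-8")).hexdigest()


def group_by_digest(files: dict[str, str]) -> dict[str, list[str]]:
    """Map each digest shared by two or more files to the sorted paths."""
    groups: dict[str, list[str]] = {}
    for path, text in files.items():
        groups.setdefault(normalized_md5(text), []).append(path)
    # Keep only digests seen more than once: these are candidate copies.
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}
```

This only addresses whitespace changes. A copy with an extra boilerplate header
prepended would still get a different digest, so that case would need an
additional heuristic.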

Have a nice day,

Charles Plessy
Tsurumi, Kanagawa, Japan
