
Re: RFC: Clonewise - Detecting code reuse and embedded code copies



This is interesting. Is this related to FOSSology (http://www.fossology.org/projects/fossology) in any way?
mike

On Tue, Apr 17, 2012 at 6:35 AM, Silvio Cesare <silvio.cesare@gmail.com> wrote:
The Debian Package clonewise-core (currently in the mentors archive)
http://mentors.debian.net/package/clonewise-core
http://www.foocodechu.com/downloads/clonewise
--

Clonewise is a tool for detecting code reuse in Debian packages. This is also
known as detecting embedded code copies. Debian maintains a database of
packages that embed code in the security tracker. Clonewise is a tool to
automate and supplement the manual tracking of packages.

The primary use of it is for the security team who may identify a vulnerability
in a library and want to know if that library is reused and embedded in any
other Debian packages.

-- QUICK GUIDE

You might want to install the Clonewise database instead of generating it
(which can take several days when you first run Clonewise).

Download it from http://www.foocodechu.com/downloads/clonewise/

Example usage to discover if the source package libpng is reused in other
Debian packages is as follows:

$ Clonewise -vv libpng
libpng CLONED_IN_SOURCE afterstep (18.457640)
               MATCH png.c (5.605583) (33.000000)
               MATCH pngtrans.c (6.409078) (57.000000)
               MATCH pngwtran.c (6.442979) (80.000000)
       libpng CLONED_IN_PACKAGE libafterimage-dev
       libpng CLONED_IN_PACKAGE afterstep
       libpng CLONED_IN_PACKAGE afterstep-data
       libpng CLONED_IN_PACKAGE libafterimage0
       libpng CLONED_IN_PACKAGE afterstep-dbg
       libpng CLONED_IN_PACKAGE libafterstep1
libpng CLONED_IN_SOURCE fltk1.1 (44.336105)
               MATCH png.c (5.605583) (58.000000)
               MATCH pngerror.c (6.442979) (57.000000)
               MATCH pngmem.c (6.442979) (85.000000)
               MATCH pngpread.c (6.514438) (52.000000)
               MATCH pngrio.c (6.478071) (77.000000)
               MATCH pngtrans.c (6.409078) (63.000000)
               MATCH pngwtran.c (6.442979) (80.000000)
       libpng CLONED_IN_PACKAGE fltk1.1-doc
       libpng CLONED_IN_PACKAGE fltk1.1-games
       libpng CLONED_IN_PACKAGE libfltk1.1
       libpng CLONED_IN_PACKAGE libfltk1.1-dbg
       libpng CLONED_IN_PACKAGE libfltk1.1-dev
[ snip ]

So libpng is embedded in the source packages afterstep and fltk1.1.
Looking at my version of the embedded-code-copies file on the security
tracker, I can see that fltk1.1 is actually referenced as libfltk1.1 and was
fixed a while ago. The security tracker is meant to report the source
package name, so this should probably be corrected. Clonewise otherwise
ignores embedded code copies that have been fixed (according to the
security tracker). I can't see afterstep in the tracker, so again, we might
need to make an update. We don't know if afterstep has been patched
to use a system library, so we need to investigate further, for example by
checking whether libpng is a dependency of the afterstep package. In real
usage, if libpng is buggy, it's probably important to do this and check the
afterstep package to see if it is vulnerable to the libpng bug.

Each matching file has a weight and a score. The weight represents the
significance of the file in the repository, and the score represents the
similarity of the file between the two packages.

CLONED_IN_SOURCE lines name the source packages.
CLONED_IN_PACKAGE lines name the binary packages built from the source package.
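
If you want to post-process the report, the line shapes shown above are
regular enough to parse. What follows is only a sketch based on the sample
output in this mail, not part of Clonewise itself: the Clone class, the
regular expressions and the parse_report function are hypothetical names,
and the format is assumed to be exactly what the example shows.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Clone:
    library: str                                   # package given on the command line
    source: str                                    # source package embedding it
    score: float                                   # accumulated weight
    matches: list = field(default_factory=list)    # (filename, weight, similarity)
    binaries: list = field(default_factory=list)   # binary package names

# Patterns inferred from the sample output only (an assumption).
SOURCE_RE = re.compile(r"^(\S+) CLONED_IN_SOURCE (\S+) \(([\d.]+)\)$")
MATCH_RE = re.compile(r"^MATCH (\S+) \(([\d.]+)\) \(([\d.]+)\)$")
BINARY_RE = re.compile(r"^(\S+) CLONED_IN_PACKAGE (\S+)$")

def parse_report(text):
    """Group MATCH and CLONED_IN_PACKAGE lines under their CLONED_IN_SOURCE line."""
    clones = []
    for raw in text.splitlines():
        line = raw.strip()
        m = SOURCE_RE.match(line)
        if m:
            clones.append(Clone(m.group(1), m.group(2), float(m.group(3))))
            continue
        if not clones:
            continue  # skip anything before the first source line
        m = MATCH_RE.match(line)
        if m:
            clones[-1].matches.append(
                (m.group(1), float(m.group(2)), float(m.group(3))))
            continue
        m = BINARY_RE.match(line)
        if m:
            clones[-1].binaries.append(m.group(2))
    return clones

SAMPLE = """\
libpng CLONED_IN_SOURCE afterstep (18.457640)
               MATCH png.c (5.605583) (33.000000)
               MATCH pngtrans.c (6.409078) (57.000000)
               MATCH pngwtran.c (6.442979) (80.000000)
       libpng CLONED_IN_PACKAGE libafterimage-dev
       libpng CLONED_IN_PACKAGE afterstep
"""

if __name__ == "__main__":
    for clone in parse_report(SAMPLE):
        print(clone.source, clone.score, len(clone.matches), clone.binaries)
```

Lines that match none of the three patterns (such as "[ snip ]") are
silently skipped.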

-- BUILDING THE DATABASE

If you don't install clonewise-database, then the database of the package
repository will probably need to be built the first time you run Clonewise.
You will need to be the superuser to do this and in all likelihood it will
take several days to complete.

Clonewise will run Clonewise-BuildDatabase when the database has not been
built. It will download the entire Debian source repository, unpack the
packages and generate signatures for each package.

-- CONFIGURATION FILES

There are a number of configuration files in Clonewise.

/var/lib/Clonewise/extensions - contains a list of filename extensions that
are used to identify source code. Clonewise ignores reuse of anything in the
package contents that is not program code, and this list is how it tells the
two apart.

/var/lib/Clonewise/threshold - is the default threshold of the amount of code
reuse that needs to occur before Clonewise reports it. If you get too many
false positives, then increase this number. You can also override this
threshold on the command line with Clonewise -C <threshold>.

/var/lib/Clonewise/ignore-these-fixed - is a list of package pairs from
the embedded-code-copies file maintained in the Debian security tracker where
it has been reported that the packages in question have been modified so
system wide libraries are being used and there is no embedded code in the
build.

/var/lib/Clonewise/ignore-these-false-positives - is a list of package pairs
that should not be reported as having code reuse. This file is intended to
contain known false positives.
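
To illustrate the idea behind the extensions file, here is a small sketch of
extension-based filtering. It is not Clonewise's code. The on-disk format of
/var/lib/Clonewise/extensions is not described in this mail, so the sketch
assumes one extension per line, with or without a leading dot, and the
function names load_extensions and is_source_file are made up.

```python
import os

def load_extensions(lines):
    """Build a set of extensions.

    Assumed format (not confirmed by the mail): one extension per line,
    with or without a leading dot; blank lines are ignored.
    """
    extensions = set()
    for line in lines:
        ext = line.strip().lower().lstrip(".")
        if ext:
            extensions.add(ext)
    return extensions

def is_source_file(filename, extensions):
    """Return True if the filename's extension is in the configured set."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    return ext in extensions

if __name__ == "__main__":
    exts = load_extensions(["c", ".h", "", "cpp"])
    files = ["png.c", "README", "pngconf.h", "logo.png"]
    print([f for f in files if is_source_file(f, exts)])
```

With a filter like this, a file such as README never contributes to a match,
which is the behaviour the -t option switches off.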

-- HELPER UTILITIES

Clonewise-ParseDatabase is a program to parse Debian's embedded-code-copies
file maintained in the security tracker. Probably the main use of it is to
generate the content for the ignore-these-fixed configuration file.

To list the package pairs of embedded code that are reported to have been
"fixed", run this command:

$ Clonewise-ParseDatabase -f <embedded-code-copies-file>

The output of that command can go directly into the ignore-these-fixed
configuration file. For example:

# Clonewise-ParseDatabase -f <embedded-code-copies> \
    > /var/lib/Clonewise/ignore-these-fixed

You might want to run that command whenever the upstream version of the
embedded-code-copies file is changed to reflect that a package has been fixed
to avoid an embedded code copy.

The -u option is for identifying unfixed embedded code copies. The command
run without any options prints all embedded code copies in the Clonewise
native format.

Another utility which is probably only useful for developers is:

$ Clonewise-RunTests

This is useful for comparing Clonewise's results against Debian's manually
created embedded-code-copies file maintained in the security tracker.

-- COMMAND LINE OPTIONS

The command line options for Clonewise are:

-e              Report all internal errors.

-o xml          Output in XML.

-C <threshold>  Override threshold configuration on how much code reuse needs
               to occur before reporting.

-v              Verbose - show more information.

-vv             Really verbose - show why packages are reported as reusing
               code. This is the option most people want.

-vvv            Show scores for all packages. Not really useful for non
               developers.

-a              Run analysis over entire database and show all embedded code
               copies. When using this option, no package name argument is
               required on the command line.

-s              Don't use ssdeep to do a fuzzy check of similar content. This
               will increase the false positive rate, but can also increase
               the true positive rate. Probably not useful for non developers.

-t              Don't use filename extensions when comparing packages. This is
               useful if you are looking for reuse of a package's contents
               that is not based on program code.

-- EXTENDED DESCRIPTION OF THE NUMBERS IN THE OUTPUT

What are the numbers in the output of Clonewise? They represent weights and
scores.

$ Clonewise -vv libpng
libpng CLONED_IN_SOURCE afterstep (18.457640)
               MATCH png.c (5.605583) (33.000000)
               MATCH pngtrans.c (6.409078) (57.000000)
               MATCH pngwtran.c (6.442979) (80.000000)
[ snip ]

png.c has a weight of 5.605583. The more frequently png.c occurs across
packages in the Debian source repository, the lower the weight. For example,
if extensions were not used and README was matched, then the weight would be
very low because the filename README occurs in almost every package.

png.c has a similarity of 33.000000. This means that ssdeep identified a
similarity of 33% between png.c in the afterstep and libpng packages. Because
it is greater than 0, the two files probably derive from the same source in
some earlier version of libpng.

The score of 18.457640 is an accumulation of the weights of the matching files.
This score is what the Clonewise threshold is compared against. If this score
is greater than the threshold, Clonewise reports code reuse to have occurred.
The higher this number, the more believable it is that code reuse has
occurred.
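
As a worked check of the afterstep numbers above, the three file weights do
add up to the reported score. The snippet below is just that arithmetic, not
Clonewise code. The threshold value in it is made up for illustration (the
real default lives in /var/lib/Clonewise/threshold), and package_score and
is_reported are hypothetical names.

```python
# (filename, weight) pairs copied from the afterstep example above.
AFTERSTEP_MATCHES = [
    ("png.c", 5.605583),
    ("pngtrans.c", 6.409078),
    ("pngwtran.c", 6.442979),
]

# Hypothetical value for illustration only; not Clonewise's real default.
EXAMPLE_THRESHOLD = 10.0

def package_score(matches):
    """Accumulate the weights of all matching files."""
    return sum(weight for _, weight in matches)

def is_reported(score, threshold):
    """Reuse is reported when the score is greater than the threshold."""
    return score > threshold

if __name__ == "__main__":
    score = package_score(AFTERSTEP_MATCHES)
    print(f"{score:.6f}")                         # 18.457640
    print(is_reported(score, EXAMPLE_THRESHOLD))
```

Note that the similarity percentages are not part of this sum; only the
weights are accumulated.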

-- HOW DOES IT WORK?

It's a simple idea really. If two packages' source trees share the same
filenames, and the content looks similar according to a fuzzy hash, then they
share code.

Each filename has a weight based on the inverse document frequency. This
is a fancy way of saying if the same filename is common to lots of packages
then it has a lower weight.

Each matching file is counted and the weights all add up. If the summed
weight exceeds a threshold, Clonewise will report it.
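
The weighting idea can be sketched in a few lines. This is a toy model under
stated assumptions, not the Clonewise implementation: it uses the textbook
form log(N / df) for inverse document frequency (the exact formula Clonewise
uses is not given here), it leaves out the ssdeep fuzzy-hash comparison
entirely, and the package contents and function names are invented.

```python
import math
from collections import Counter

def idf_weights(packages):
    """Weight each filename by log(N / df).

    packages maps a package name to the set of filenames in its source tree.
    N is the number of packages, df the number of packages containing the
    filename. This is the textbook IDF form, assumed for illustration.
    """
    total = len(packages)
    doc_freq = Counter()
    for files in packages.values():
        doc_freq.update(set(files))
    return {name: math.log(total / count) for name, count in doc_freq.items()}

def shared_score(files_a, files_b, weights):
    """Sum the weights of filenames present in both source trees."""
    return sum(weights[name] for name in set(files_a) & set(files_b))

# Invented toy repository of four packages.
TOY_REPO = {
    "libpng": {"png.c", "pngtrans.c", "README"},
    "afterstep": {"png.c", "pngtrans.c", "main.c", "README"},
    "hello": {"hello.c", "README"},
    "zlib": {"inflate.c", "README"},
}

if __name__ == "__main__":
    weights = idf_weights(TOY_REPO)
    for name in ("README", "png.c", "main.c"):
        print(name, round(weights[name], 3))
    print(shared_score(TOY_REPO["libpng"], TOY_REPO["afterstep"], weights))
```

In this toy repository README appears in every package, so it gets a weight
of zero and contributes nothing, while png.c and pngtrans.c each appear in
only two of the four packages and carry the whole score.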

 -- Silvio Cesare <silvio.cesare@gmail.com>






--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
