
Re: Estadistical proyect.


Sorry for my late answer; I read this list only rarely.

* angel [2010-04-21 10:58 +0200]:
> In case Debian is interested in having someone work on a statistical
> project of any nature (the project would be supervised by professors
> of the Statistics and Operations Research department of the University
> of Granada and Debian would not be charged with any costs
> whatsoever), please feel free to contact me.

Since your request has been partly misunderstood, let me summarize it
first:

You want to do a small to medium-sized statistical project and you don't
want to program in a non-statistical language during this project.
Debian needs to provide you access to the data required to complete your
project and most importantly needs to tell you what information it is
interested in.

Access to the data is easy:  Stefano already mentioned Ultimate Debian
Database (UDD) as possible data source.  It should contain everything
you would need in such a project.  If you don't want to create a local
copy of the database using the provided dump you almost certainly should
get a guest-account on alioth.debian.org to access the database without
any problems.

As you might know, Debian classifies bugs by severity.  The most
important severities are "critical", "grave" and "serious"; these are
release critical.  Additionally there are "important", "normal", "minor"
and "wishlist".  Besides severities there are things like "fixed" and
"found" and so on.  All this is documented on http://bugs.debian.org/
below the heading "Bug tracking system documentation".

We want to release the next Debian release Squeeze soon and to do this
we need to get the number of release critical bugs in testing, currently
named Squeeze, to 0 (except some ignored bugs).  To be able to improve
this process and thus improve Debian in general it would be nice if we
had some statistical data about how bugs have been fixed in the past.
Understanding where we still have problems is the first step to
resolving them. :)

Some questions to be answered could be:

 * How long does it take on average until a bug with a particular
   severity gets fixed, and what is the variance?  Does this change if
   Debian is frozen (frozen describes the time before the actual release
   with a more restricted package migration to testing)?

 * Is there a correlation between the number of installations according
   to popcon and the time until a bug gets fixed?

 * Is there a correlation between the size of a package or the
   package's priority [1] and the average bug fixing time?

 * How does the number of people maintaining a package relate to bug
   fixing time and how does it relate to the number of unfixed bugs per
   package?  Rationale for this is that we encourage single maintainers
   of important packages (with priority important and priority required)
   to switch to maintaining these packages in a team but we lack data to
   support our guess that team maintenance is really more efficient.

 * How does the number of bugs reported per time unit change after a new
   upstream release is packaged in comparison to releasing a new Debian
   revision?  Does it matter which part of the upstream version number
   has been incremented?  I expect for example a 2.0 release to contain
   more bugs on average than a 2.2 release.

 * We currently migrate packages that don't introduce new release
   critical bugs to testing after being in unstable for 10 days (unless
   the maintainer or the release team override this or other packages
   prevent it).  Is this a good choice according to the time after an
   upload when most release critical bugs are reported?

 * Is there a relation between bug fixing and the time the last
   maintainer upload happened?  A similar question is whether there is
   a relation between bug fixing and the number of non-maintainer
   uploads (NMUs) in the last n years.

 * How does the probability of a release critical bug being fixed in
   an NMU instead of a regular maintainer upload rise over time?

 * Is there a significant difference between bug fixing in officially
   maintained packages and in packages maintained by the QA team
   or in orphaned packages?

 * Are certain programming languages or packages in certain sections
   more prone to FTBFS (fails to build from source) bugs, security
   related bugs or bugs in general?

 * A combination of the probability of a release critical bug being
   filed over time, in conjunction with the two former questions and the
   number of installations according to popcon, could possibly be used
   to help people decide whether a package should be removed from Debian
   or orphaned if the former maintainer lost interest.  Currently I have
   no idea how this could be done in a sane way, nor whether it can be
   done in a sane way at all.
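
To make the first question in the list above concrete, here is a minimal
sketch in Python of computing the mean and variance of the bug fixing
time per severity.  The sample records and the function name
fix_time_stats are made up for illustration; real data would have to be
extracted from UDD first.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, variance

# Hypothetical sample records: (severity, date reported, date fixed).
# These values are invented; they are not taken from the Debian BTS.
BUGS = [
    ("serious", date(2010, 1, 4), date(2010, 1, 14)),
    ("serious", date(2010, 2, 1), date(2010, 2, 21)),
    ("normal", date(2010, 1, 10), date(2010, 3, 11)),
    ("normal", date(2010, 2, 5), date(2010, 2, 15)),
    ("normal", date(2010, 3, 1), date(2010, 4, 20)),
]


def fix_time_stats(bugs):
    """Return {severity: (mean_days, sample_variance_days)}."""
    days = defaultdict(list)
    for severity, reported, fixed in bugs:
        days[severity].append((fixed - reported).days)
    # The sample variance needs at least two observations per group.
    return {
        sev: (mean(d), variance(d) if len(d) > 1 else None)
        for sev, d in days.items()
    }


if __name__ == "__main__":
    for sev, (avg, var) in sorted(fix_time_stats(BUGS).items()):
        print(f"{sev}: mean={avg} days, variance={var}")
```

Note that this only looks at bugs that have already been fixed.  Bugs
that are still open have no fixing time yet, so leaving them out makes
the average look better than it is; a real analysis would need to take
this into account.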

I think you got the idea.  If you deal with the data and these questions
for some time you should get a feeling for which ones might be relevant
and interesting and which ones can be ignored.  If you find something
one would not expect, this should be the point where you start to dig in
that direction.  In general Debian wants "some useful and/or interesting
data about how bugs are handled"; everything I wrote above should be
considered a suggestion that can be used as a guideline until you get
the intuition to decide what looks promising.  In no case would I expect
every mentioned point to be addressed by such a project; just choose
what you like and what seems to be valuable.  I would also consider
"selected questions answered, everything is as one would expect, here's
the data, nothing to see here, move along" a very helpful result.

Due to the distributed nature of Debian it's difficult to provide a
Debian Developer as exclusive contact person.  If your university
requires this, especially if she or he would need to partly evaluate
your work, this could be a problem, unless of course someone steps in
and agrees to be your mentor for this project.

If you are still interested in doing this project the obvious steps seem
to be:

 1. Install a local copy of UDD or get access to it using an account on
    alioth.debian.org.  Drop me a mail if you have problems with
    registering on [2] or if I should ask an alioth admin for approval.

 2. Make yourself familiar with the database schema [3] to be able to
    extract the required information.  If necessary you would need to
    gain basic SQL knowledge to get the data, but I guess a statistician
    knows some SQL.

 3. Talk to your professor.

 4. Do the project.  If reasonable you could provide intermediate
    results to this list, but this is not necessary.

 5. Send the results of your work to this list and to press@debian.org.
    The press team will ensure that it will be mentioned in our regular
    Debian newsletter and possibly other places.  It would be good,
    though not required, if the thesis could be placed somewhere on our
    website.  Debian tries to ensure that the information on its website
    is free; the people maintaining it know exactly whether a free
    license is required or just recommended for this.  A minimal way to
    put it under a free license is to write the following text in the
    mail when you send your results (this would not prevent others from
    using parts of your paper without attribution, but choosing another
    license with an attribution clause is easy):

    | License and Copyright of the attached document:
    | Copyright (c) 2010 Name <E-mail address>
    | Permission to use, copy, modify, and/or distribute this software
    | for any purpose with or without fee is hereby granted.
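
Regarding step 2, the following is a rough sketch of the kind of SQL
query such a project would need, wrapped in Python so that it can be
tried without database access.  The table example_bugs and its columns
are hypothetical and much simpler than the real UDD schema [3], and
SQLite is used only as a stand-in: UDD is a PostgreSQL database, where
the julianday() date arithmetic used here would have to be written
differently.

```python
import sqlite3

# Hypothetical, simplified table.  The real UDD schema differs; check the
# schema documentation for the actual table and column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_bugs "
    "(id INTEGER, severity TEXT, reported TEXT, fixed TEXT)"
)
conn.executemany(
    "INSERT INTO example_bugs VALUES (?, ?, ?, ?)",
    [
        (1, "serious", "2010-01-04", "2010-01-14"),
        (2, "serious", "2010-02-01", "2010-02-21"),
        (3, "normal", "2010-01-10", "2010-03-11"),
        (4, "normal", "2010-02-05", None),  # still open, excluded below
    ],
)

# Average number of days between report and fix, per severity.
# julianday() is SQLite-specific date arithmetic.
QUERY = """
SELECT severity,
       COUNT(*) AS fixed_bugs,
       AVG(julianday(fixed) - julianday(reported)) AS avg_days
FROM example_bugs
WHERE fixed IS NOT NULL
GROUP BY severity
ORDER BY severity
"""

rows = conn.execute(QUERY).fetchall()
for severity, fixed_bugs, avg_days in rows:
    print(severity, fixed_bugs, avg_days)
```

Doing the grouping and averaging in SQL keeps the amount of data that
has to be transferred small; for anything beyond simple aggregates it
is probably easier to export the raw rows and continue in a statistics
package.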


P.S.: Please CC me on answers.

 [1] http://www.debian.org/doc/debian-policy/ch-archive.html#s-priorities
 [2] https://alioth.debian.org/account/register.php
 [3] http://udd.debian.org/schema/
