
Bug#557199: [Debian-med-packaging] r-cran-epir_0.9-19-1_i386.changes REJECTED



On Thu, Jan 07, 2010 at 09:51:05PM +0100, Joerg Jaspert wrote:
> 
> >> more than ASCII format, what we need is the preferred form for making
> >> modifications. Binary format by itself is not a problem since there is no loss
> >> of information between both formats. I am not against including a text dump of
> >> the R object, but I would like to make clear that if this becomes a requirement
> >> for R packages to enter in Debian, then many packages from the gnu-r section
> >> are probably RC-buggy…
> 
> > I would like to know your conclusion on *Rdata files. They are example data
> > files for the documentation and the regression tests. Many r-cran-* packages
> > contain them. My personal opinion is that since they can be read, written,
> > modified, and exported with R, they are a ‘preferential form’ for modification.
> 
> > I am currently holding my work on the r-cran-* packages I co-maintain until I
> > get your answer.
> 
> How are they usually modified? The format in which that happens is what
> we need (together with the ability to do that within Debian).

Hi Joerg,

While each of them is different, I think I can say that they are usually not
modified. Their value lies in staying the same for years, so that examples
derived from them are reproducible. Here are a few examples from the core R
package:

     The data give the speed of cars and the distances taken to stop.
     Note that the data were recorded in the 1920s.

     This data set provides information on the fate of passengers on
     the fatal maiden voyage of the ocean liner ‘Titanic’, summarized
     according to economic status (class), sex, age and survival.

     The ‘Indometh’ data frame has 66 rows and 3 columns of data on the
     pharmacokinetics of indomethicin.

     The (approximately) quarterly approval rating for the President of
     the United states from the first quarter of 1945 to the last
     quarter of 1974.

Interestingly, the datasets shipped in the core R source package are not in
binary format, but in R code format, for instance:

cars <- data.frame(
speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13,
  13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18,
  18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25),
dist =  c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26,
  34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56,
  76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85))

"presidents" <-
structure(c(NA, 87, 82, 75, 63, 50, 43, 32, 35, 60, 54, 55, 36, 39, NA, 
NA, 69, 57, 57, 51, 45, 37, 46, 39, 36, 24, 32, 23, 25, 32, NA, 32, 59, 
74, 75, 60, 71, 61, 71, 57, 71, 68, 79, 73, 76, 71, 67, 75, 79, 62, 63, 
57, 60, 49, 48, 52, 57, 62, 61, 66, 71, 62, 61, 57, 72, 83, 71, 78, 79, 
71, 62, 74, 76, 64, 62, 57, 80, 73, 69, 69, 71, 64, 69, 62, 63, 46, 56, 
44, 44, 52, 38, 46, 36, 49, 35, 44, 59, 65, 65, 56, 66, 53, 61, 52, 51, 
48, 54, 49, 49, 61, NA, NA, 68, 44, 40, 27, 28, 25, 24, 24),
.Tsp = c(1945, 1974.75, 4), class = "ts")

The example above is interesting because it contains missing values (NA).
Dealing with missing values is a delicate issue in statistics, and correcting
the above table to fill in the missing values would make it lose its interest
as an example of a time series with missing values. The Rdata files are
examples of real data, not scientific references meant to be corrected or
extended.
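
To illustrate why the missing values are part of the example's interest, here
is a minimal sketch, assuming an R session where the standard datasets package
is available. The variable names are mine and purely illustrative.

```r
# Sketch: how the missing values in 'presidents' affect a simple summary.
# Assumes R's standard 'datasets' package; variable names are illustrative.
data(presidents)

# Count the quarters for which no approval rating is recorded.
n_missing <- sum(is.na(presidents))

# A plain mean propagates the missing values and yields NA.
plain_mean <- mean(presidents)

# The analyst has to decide explicitly what to do, for instance drop them.
mean_ignoring_na <- mean(presidents, na.rm = TRUE)
```

Filling the gaps in the data itself would remove the need for that explicit
decision, which is exactly what the example is there to show.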

My opinion is therefore that the binary format offers the same freedoms as the
R code format, or as a CSV table, an Excel table, an OpenOffice table, etc.
What the author used to produce the R objects is of little relevance, as it is
more a disposable intermediate than a source that should stay available to
help people make modifications. Note that there is no evidence that all Rdata
files come from R code as above. My wild guess is that many were imported from
a CSV table at some point. To put it as a caricature, I would say that the
Rdata format is no more obscure than the .csv.gz format: both need a command
line to be transformed to CSV format.
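
As a minimal sketch of that point, assuming an R session with the standard
datasets package, the following round-trips the 'cars' data set through the
binary format, the R code format and CSV. The temporary file names and the
variable names are illustrative only, not taken from any package.

```r
# Sketch: one data set, three interchangeable representations.
# File and variable names are illustrative assumptions.
data(cars)                                  # ships with R's datasets package

# 1. Binary format: write an .RData file and read it back.
rdata_file <- tempfile(fileext = ".RData")
save(cars, file = rdata_file)
env <- new.env()
load(rdata_file, envir = env)
from_binary <- get("cars", envir = env)

# 2. R code format: dump the loaded object as R source and re-evaluate it.
code_file <- tempfile(fileext = ".R")
dump("cars", file = code_file, envir = env)
code_env <- new.env()
source(code_file, local = code_env)
from_code <- get("cars", envir = code_env)

# 3. CSV: a single command turns the loaded object into a text table.
csv_file <- tempfile(fileext = ".csv")
write.csv(from_binary, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)
```

Whichever file one starts from, the others can be regenerated with R itself,
which is available in Debian.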

I hope I have not been confusing. If you would like an external opinion, I
suggest contacting our Debian expert Dirk Eddelbuettel (edd@debian.org). His
work on and with R is recognised internationally.

Have a nice day and thanks for the fast answer, I really appreciate it.

-- 
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan


