
Decision on R datasets



Hi,

it came to our attention[0] that most R packages ship data files (*.Rda,
*.Rdata), which can contain many different kinds of data, from
command line instructions, to huge data tables, or even extra modules
loaded by means of the install.packages() function.

It is common practice for R packages to fully document the content of
the data files in .Rd files shipped in the source tarball[1], so it
becomes easier to determine which kind of information those data files
provide.

Data files can contain modules loaded at runtime, for which we usually
do not have the corresponding source code shipped in the package (or
anywhere, if it was modified and saved without keeping the source file),
and they can contain malicious code as well. This is an extreme corner
case, but you cannot rule it out in advance.

This is an example of an R library without source code:

> install.packages("sig")
[snip]
> library("sig")
> save(sig, file="mydata")
>

When users load the data file, they have a sourceless library in their
environment:

> before <- loadedNamespaces()
> load("mydata")
> setdiff(loadedNamespaces(), before)
[1] "sig"
>


This is an example of malicious code:

> old_print <- print
> print <- function(...)
+ {
+ unlink('the_most_important_file.txt')
+ old_print('Say goodbye to your file!')
+ }
> save.image("mydata")
>

When users load the data file, and try to execute a simple print
statement, they can have their files removed:

> load("mydata")
> list.files()
[1] "mydata"                      "the_most_important_file.txt"
> print('Hello world!')
[1] "Say goodbye to your file!"
> list.files()
[1] "mydata"
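
One way to look inside such a file without letting it mask anything in
the global workspace is to load it into a separate environment and list
the functions it carries. The following is only a minimal sketch: the
demonstration file and the names demo, sandbox and suspicious are
assumptions made for illustration. It does not make loading an
untrusted file safe, and it does not stop load() from pulling in
namespaces as in the "sig" example.

```r
# Build a small demonstration file (hypothetical contents): one data
# table and one function that would mask base::print if loaded globally.
path <- tempfile(fileext = ".Rda")
demo <- new.env()
assign("measurements", data.frame(x = 1:3, y = c(2.5, 3.1, 4.8)), envir = demo)
assign("print", function(...) message("not the real print"), envir = demo)
save(list = ls(demo), file = path, envir = demo)

# Load into a fresh environment instead of the global workspace.
sandbox <- new.env(parent = emptyenv())
loaded <- load(path, envir = sandbox)  # load() returns the object names

# Report the class of every object found in the file.
for (name in loaded) {
  cat(name, ":", class(get(name, envir = sandbox))[1], "\n")
}

# Collect the names of objects that are functions; these deserve review.
is_fun <- vapply(loaded,
                 function(n) is.function(get(n, envir = sandbox)),
                 logical(1))
suspicious <- unname(loaded[is_fun])
```

Here suspicious holds the names of the functions stored in the file,
while print in the global workspace is left untouched.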

These examples show that there are cases where .Rda files are *not* the
preferred form of modification, such as when code (or even whole
libraries) is placed into *.Rda files to be loaded.

Therefore, we shall consider these data files the preferred form of
modification if the data was captured in this format from a scientific
instrument, created manually and painstakingly (this is not the common
case), or otherwise not generated. If the data was generated or
converted by a script or series of scripts, the .Rda file is likely not
the preferred form and needs to be rebuilt at build time from source (as
we do with any binary in the archive).
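
As a rough illustration of what rebuilding at build time could look
like, the sketch below regenerates an .Rda file from a plain-text
table. The directory layout, the file name measurements.csv and the
object name measurements are hypothetical; a real package would run its
own conversion scripts from its build rules.

```r
# Hypothetical layout: a source directory with the plain-text table and
# an output directory for the generated data file.
src_dir <- tempfile("data-raw")
out_dir <- tempfile("data")
dir.create(src_dir)
dir.create(out_dir)

# Stand-in for the source table that would be shipped in the tarball.
csv_path <- file.path(src_dir, "measurements.csv")
writeLines(c("sample,value", "a,1.5", "b,2.25", "c,4"), csv_path)

# The build step proper: read the source and regenerate the .Rda file.
measurements <- read.csv(csv_path, stringsAsFactors = FALSE)
rda_path <- file.path(out_dir, "measurements.Rda")
save(measurements, file = rda_path)
```

The generated file then contains nothing that cannot be traced back to
the shipped source table.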

[0] http://lists.debian.org/<20130805005735.GE22595@falafel.plessy.net>
[1] http://cran.r-project.org/doc/manuals/R-exts.html#Documenting-data-sets

-- 
bye, Joerg
on behalf of the FTP Team
