Replication is the ultimate standard for strengthening scientific evidence: findings are confirmed by independent investigators using independent data, analytical methods, laboratories, instruments, etc.
Replication is particularly important for studies that can impact broad policy or regulatory decisions.
What's wrong with replication?
Some studies cannot be replicated because of
- a lack of time
- a lack of money
- the uniqueness of the study
Reproducible research makes the analytic data and code available so that others may reproduce the findings. Why do we need it?
- New technologies are increasing data collection throughput; data are more complex and extremely high dimensional.
- Existing databases can be merged into new 'megadatabases.'
- Computing power is greatly increased, allowing for more sophisticated analyses.
- For every field "X," there is a field "Computational X."
An example where reproducibility matters is research on air pollution and health:
- we are estimating small (but important) health effects in the presence of much stronger signals
- results inform substantial policy decisions and affect many stakeholders
- EPA regulations can cost billions of dollars
- the complex statistical methods required are subjected to intense scrutiny
What do we need for reproducible research?
- analytic data are available
- analytic code is available
- documentation of code and data
- standard means of distribution
Who are the players in reproducibility?
Authors
- want to make their research reproducible
- want tools to make their lives easier
Readers
- want to reproduce and/or expand upon interesting findings
- want tools to make their lives easier
Challenges
- Authors must undertake considerable effort to put data/results on the web and may not have resources like a web server.
- Readers must download data/results individually and piece together which data go with which code sections, etc.
- Readers may not have the same resources as the authors.
- There are few tools to help readers/authors, although the toolbox is growing.
In reality ...
Authors
- just put stuff on the web
- rely on journal supplementary materials
- use the central databases that exist for some fields
Readers
- just download the data and try to figure it out
- piece together the software and run it
Literate (Statistical) Programming
- An article is a stream of text and code
- Analysis is divided into text and code "chunks"
- Each code chunk loads data and computes results.
- Presentation code formats results (tables, figures, etc.)
- Article text explains what's going on
- Literate programs can be woven to produce human-readable documents and tangled to produce machine-readable code (sketched below)
Literate programming is a general concept that requires
- a documentation language (human readable)
- a programming language (machine readable)
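To make the weave/tangle distinction concrete, here is a minimal sketch of a literate statistical program. It uses noweb-style chunk delimiters (the notation Sweave adopted); the data set and variable names are hypothetical.

```
Our analysis begins by loading the analytic data and summarizing the
outcome. This surrounding prose is the documentation language (for the
human reader); the delimited chunk below is the programming language
(for the machine).

<<load-and-summarize>>=
## hypothetical analytic data set distributed with the article
pollution <- read.csv("pollution.csv")
summary(pollution$pm25)
@
```

Weaving this file runs the chunk and renders the prose together with the computed results; tangling it discards the prose and extracts only the code as a runnable R script.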
Sweave
- uses LaTeX and R as the documentation and programming languages.
- was developed by Friedrich Leisch (a member of the R Core team) and is maintained by R Core
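As a sketch of the Sweave workflow, base R's utils package supplies the weave and tangle steps for a .Rnw source file (LaTeX with embedded R chunks); the file name here is hypothetical.

```r
## weave: run the R chunks and write a LaTeX file containing text plus results
Sweave("analysis.Rnw")          # produces analysis.tex

## tangle: strip the text and extract only the R code chunks
Stangle("analysis.Rnw")         # produces analysis.R

## compile the woven LaTeX file to PDF (requires a LaTeX installation)
tools::texi2pdf("analysis.tex")
```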
Sweave has many limitations, though; knitr is a more recent alternative package.
knitr
- brings together many features added on to Sweave to address limitations
- uses R as the programming language (although others are allowed) and a variety of documentation languages (LaTeX, Markdown, HTML)
- was developed by Yihui Xie (while a graduate student in statistics at Iowa State)
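For comparison, a minimal sketch of a knitr document with Markdown as the documentation language; the data set and variable names are hypothetical, as before.

````
# Air pollution and health

We load the analytic data and summarize daily PM2.5.

```{r pm25-summary}
pollution <- read.csv("pollution.csv")   # hypothetical analytic data
summary(pollution$pm25)
hist(pollution$pm25, main = "Daily PM2.5")
```
````

The weave/tangle pair here is knitr::knit("analysis.Rmd"), which runs the chunks and writes a Markdown file with the results inlined, and knitr::purl("analysis.Rmd"), which extracts just the R code (the .Rmd file name is again hypothetical).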
Summary
- Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate.
- Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available.
- There is a growing number of tools for creating reproducible documents.