Replication is the ultimate standard for strengthening scientific evidence: findings are confirmed by independent investigators using independent data, analytical methods, laboratories, instruments, etc.
Replication is particularly important for studies that can impact broad policy or regulatory decisions.
What's wrong with replication?
Some studies cannot be replicated because of
- a lack of time
- a lack of money
- the uniqueness of the study
Reproducible research makes the analytic data and code available so that others may reproduce the findings. Why do we need it?
- New technologies are increasing data collection throughput; data are more complex and extremely high dimensional.
- Existing databases can be merged into new 'megadatabases.'
- Computing power is greatly increased, allowing for more sophisticated analyses.
- For every field "X," there is a field "Computational X."
An example where reproducibility matters is research on air pollution and health:
- we are estimating small (but important) health effects in the presence of much stronger signals
- results inform substantial policy decisions and affect many stakeholders
- EPA regulations can cost billions of dollars
- the complex statistical methods required are subjected to intense scrutiny
What do we need for reproducible research?
- analytic data are available
- analytic code is available
- documentation of code and data
- standard means of distribution
Who are the players in reproducibility?
Authors
- want to make their research reproducible
- want tools to make their lives easier
Readers
- want to reproduce and/or expand upon interesting findings
- want tools to make their lives easier
Challenges
- Authors must undertake considerable effort to put data/results on the web and may not have resources like a web server.
- Readers must download data/results individually and piece together which data go with which code sections, etc.
- Readers may not have the same resources as the authors.
- There are few tools to help readers/authors, although the toolbox is growing.
In reality ...
Authors
- just put stuff on the web
- rely on journal supplementary materials
- use the central databases that exist for some fields
Readers
- just download the data and try to figure it out
- piece together the software and run it
Literate (Statistical) Programming
- An article is a stream of text and code
- Analysis is divided into text and code "chunks"
- Each code chunk loads data and computes results.
- Presentation code formats results (tables, figures, etc.)
- Article text explains what's going on
- Literate programs can be woven to produce human-readable documents and tangled to produce machine-readable code (sketched below)
Literate programming is a general concept that requires
- a documentation language (human readable)
- a programming language (machine readable)
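To make the weave/tangle distinction concrete, here is a minimal sketch of a literate statistical program. It uses noweb-style chunk delimiters (the notation Sweave adopted); the data set and variable names are hypothetical.

```
Our analysis begins by loading the analytic data and summarizing the
outcome. This surrounding prose is the documentation language (for the
human reader); the delimited chunk below is the programming language
(for the machine).

<<load-and-summarize>>=
## hypothetical analytic data set distributed with the article
pollution <- read.csv("pollution.csv")
summary(pollution$pm25)
@
```

Weaving this file runs the chunk and renders the prose together with the computed results; tangling it discards the prose and extracts only the code as a runnable R script.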
Sweave
- uses LaTeX and R as the documentation and programming languages.
- was developed by Friedrich Leisch (a member of the R Core team) and is maintained by R Core
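As a sketch of the Sweave workflow, base R's utils package supplies the weave and tangle steps for a .Rnw source file (LaTeX with embedded R chunks); the file name here is hypothetical.

```r
## weave: run the R chunks and write a LaTeX file containing text plus results
Sweave("analysis.Rnw")          # produces analysis.tex

## tangle: strip the text and extract only the R code chunks
Stangle("analysis.Rnw")         # produces analysis.R

## compile the woven LaTeX file to PDF (requires a LaTeX installation)
tools::texi2pdf("analysis.tex")
```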
Sweave has many limitations, though; knitr is a more recent alternative package.
knitr
- brings together many features added on to Sweave to address limitations
- uses R as the programming language (although others are allowed) and a variety of documentation languages (LaTeX, Markdown, HTML)
- was developed by Yihui Xie (while a graduate student in statistics at Iowa State)
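For comparison, a minimal sketch of a knitr document with Markdown as the documentation language; the data set and variable names are hypothetical, as before.

````
# Air pollution and health

We load the analytic data and summarize daily PM2.5.

```{r pm25-summary}
pollution <- read.csv("pollution.csv")   # hypothetical analytic data
summary(pollution$pm25)
hist(pollution$pm25, main = "Daily PM2.5")
```
````

The weave/tangle pair here is knitr::knit("analysis.Rmd"), which runs the chunks and writes a Markdown file with the results inlined, and knitr::purl("analysis.Rmd"), which extracts just the R code (the .Rmd file name is again hypothetical).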
Summary
- Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate.
- Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available.
- There is a growing number of tools for creating reproducible documents.