Neil Thawani - Blog - Coursera - Getting and Cleaning Data - Week 1

Four things you should have after going from a raw data set to a tidy data set:

1. the raw data
2. a tidy data set
3. a code book describing each variable and its values in the tidy data set
4. an explicit and exact recipe you used to go from 1 to 2 and 3

Raw data is in the right format if you did not:

run software on the data
manipulate any of the numbers in the data
remove any data from the data set
summarize the data in any way

The following standards are available in the guide How to share data with a statistician.

Tidy data has the following properties:

each variable you measure should be in one column
each different observation of that variable should be in a different row
there should be one table for each "kind" of variable
if you have multiple tables, they should include a column in the table that allows them to be linked
each table should be in its own file
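As a rough illustration of the first two properties, here is a minimal Python sketch that reshapes a "wide" table into one row per observation. The table contents, the column names (`subject_id`, `weight_visit1`, `weight_kg`) and the helper `to_tidy` are all made up for this example and are not from the course or the guide.

```python
# Hypothetical wide table: one row per subject, one column per visit.
wide_rows = [
    {"subject_id": "s1", "weight_visit1": 70.2, "weight_visit2": 69.8},
    {"subject_id": "s2", "weight_visit1": 82.5, "weight_visit2": 83.1},
]


def to_tidy(rows, id_column, value_name):
    """Return one row per observation, with each variable in its own column."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            # "weight_visit1" -> visit number 1 (this naming scheme is assumed).
            visit = int(column.rsplit("visit", 1)[1])
            tidy.append({id_column: row[id_column], "visit": visit, value_name: value})
    return tidy


tidy_rows = to_tidy(wide_rows, "subject_id", "weight_kg")
for tidy_row in tidy_rows:
    print(tidy_row)
```

In this sketch `subject_id` would also serve as the linking column if a second table (say, one row per subject with age and sex) were kept in its own file.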

A common format for the code book is a Word/text file (or Markdown). There should be a section called "Study Design" that has a thorough description of how you collected the data. There must be a section called "Code Book" that describes each variable and its units.

The Code Book should contain information about:

variables (including units) in the data set not contained in the tidy data
the summary choices you made
the experimental study design you used
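One possible way to keep a code book in step with the data is to generate it from a small description of the variables. The sketch below is only an assumption about how that could look: the function `render_code_book`, the "Summary Choices" heading, and the example variables are hypothetical, while the "Study Design" and "Code Book" section names come from the notes above.

```python
def render_code_book(study_design, variables, summary_choices):
    """Build a plain-text code book with the sections described above."""
    lines = ["Study Design", "", study_design, "", "Code Book", ""]
    for variable in variables:
        # Each entry records the variable name, its units, and a description.
        lines.append(
            f"{variable['name']} ({variable['units']}): {variable['description']}"
        )
    lines += ["", "Summary Choices", ""]
    lines.extend(summary_choices)
    return "\n".join(lines) + "\n"


# Hypothetical contents, for illustration only.
code_book = render_code_book(
    study_design="Subjects were weighed at two clinic visits.",
    variables=[
        {"name": "subject_id", "units": "none", "description": "subject identifier"},
        {"name": "weight_kg", "units": "kg", "description": "body weight at the visit"},
    ],
    summary_choices=["No summarizing was applied; every visit is kept as a row."],
)
print(code_book)
```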

The Instruction List

is ideally a computer script in R or Python
the input for the script is the raw data
the output is the processed, tidy data
there are no parameters to the script
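A minimal sketch of such a script in Python might look like the following. The file names, the raw column headers (`Sample ID`, `Temp (C)`) and the cleaning steps are assumptions made up for the example; the point is only that the raw file goes in, the tidy file comes out, and nothing is passed in as a parameter.

```python
import csv

# Fixed, hypothetical paths: the script takes no parameters.
RAW_PATH = "raw_data.csv"
TIDY_PATH = "tidy_data.csv"


def tidy_records(raw_records):
    """Convert raw records (dicts of strings) into tidy, typed records."""
    tidy = []
    for record in raw_records:
        tidy.append(
            {
                "sample_id": record["Sample ID"].strip(),
                "temperature_c": float(record["Temp (C)"]),
            }
        )
    return tidy


def main():
    # Input: the untouched raw data.
    with open(RAW_PATH, newline="") as raw_file:
        raw_records = list(csv.DictReader(raw_file))

    tidy = tidy_records(raw_records)

    # Output: the processed, tidy data.
    with open(TIDY_PATH, "w", newline="") as tidy_file:
        writer = csv.DictWriter(tidy_file, fieldnames=["sample_id", "temperature_c"])
        writer.writeheader()
        writer.writerows(tidy)
```

Calling `main()` with no arguments at the bottom of the file would run the whole recipe, so anyone with the raw data could regenerate the tidy data the same way.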

In some cases, it will not be possible to script every step. In that case, you should provide instructions like the following steps:

Take the raw file, run version 3.1.2 of the summarize software with parameters a=1, b=2, c=3
Run the software separately for each sample.
Take column three of outputfile.txt for each sample; that column is the corresponding row in the output data set.

Why is the instruction list important?

Coursera - Getting and Cleaning Data - Week 1 - Components of Tidy Data