Four things you should have after going from a raw data set to a tidy data set:
- the raw data
- a tidy data set
- a code book describing each variable and its values in the tidy data set
- an explicit and exact recipe you used to go from item 1 to items 2 and 3
Raw data is in the right format if you did not:
- run software on the data
- manipulate any of the numbers in the data
- remove any data from the data set
- summarize the data in any way
The following standards come from the guide "How to share data with a statistician".
Tidy data has the following properties:
- each variable you measure should be in one column
- each different observation of that variable should be in a different row
- there should be one table for each "kind" of variable
- if you have multiple tables, each should include a column that allows them to be linked
- each table should be in its own file
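As an illustration of these properties, here is a minimal sketch in plain Python. The data, the column names (`subject_id`, `weight_kg_visit1`, `age_years`) and the helper functions are all hypothetical, invented only for this example. It reshapes a wide table into one row per observation and links it to a second table through a shared column.

```python
# Hypothetical wide-format records: one row per subject, one column per visit.
wide_rows = [
    {"subject_id": "s1", "weight_kg_visit1": 70.2, "weight_kg_visit2": 69.8},
    {"subject_id": "s2", "weight_kg_visit1": 82.5, "weight_kg_visit2": 83.1},
]

# A second table for a different "kind" of variable, linked by subject_id.
subjects = [
    {"subject_id": "s1", "age_years": 34},
    {"subject_id": "s2", "age_years": 51},
]


def to_tidy(rows):
    """Reshape wide rows into one row per (subject, visit) observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == "subject_id":
                continue
            # "weight_kg_visit1" -> visit number 1
            visit = int(column.rsplit("visit", 1)[1])
            tidy.append(
                {"subject_id": row["subject_id"], "visit": visit, "weight_kg": value}
            )
    return tidy


def join_on_subject(measurements, subject_table):
    """Attach subject-level columns to each measurement via subject_id."""
    by_id = {s["subject_id"]: s for s in subject_table}
    return [{**m, **by_id[m["subject_id"]]} for m in measurements]


tidy_weights = to_tidy(wide_rows)
linked = join_on_subject(tidy_weights, subjects)
```

In the tidy form, `weight_kg` is a single column and each visit is its own row. The two tables would still be saved as separate files and only joined when an analysis needs both.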
A common format for the code book is a Word or text file (or Markdown). There should be a section called "Study Design" that has a thorough description of how you collected the data. There must be a section called "Code Book" that describes each variable and its units.
The Code Book should contain information about:
- the variables (including units), covering details that are not contained in the tidy data itself
- the summary choices you made
- the experimental study design you used
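One possible way to keep the code book next to the data is to generate it as a text file from a small description of the variables. The sketch below is only an illustration: the variable entries, the placeholder texts and the `render_code_book` helper are assumptions made up for this example, not a required layout.

```python
# Hypothetical variable descriptions for a tidy table of body weights.
VARIABLES = [
    {"name": "subject_id", "units": "none", "description": "Anonymized subject identifier"},
    {"name": "visit", "units": "none", "description": "Visit number, counted from 1"},
    {"name": "weight_kg", "units": "kilograms", "description": "Body weight measured at the visit"},
]


def render_code_book(study_design, summary_choices, variables):
    """Build the code book text with "Study Design" and "Code Book" sections."""
    lines = ["Study Design", "", study_design, "", "Code Book", ""]
    for var in variables:
        lines.append(f"- {var['name']} ({var['units']}): {var['description']}")
    lines += ["", "Summary choices: " + summary_choices]
    return "\n".join(lines)


code_book_text = render_code_book(
    study_design="(describe here how the data were collected)",
    summary_choices="(describe here any averaging or aggregation you applied)",
    variables=VARIABLES,
)
```

The resulting `code_book_text` can be written to a text or Markdown file that ships alongside the tidy data set.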
The instruction list:
- is ideally a computer script in R or Python
- takes the raw data as its input
- produces the processed, tidy data as its output
- has no parameters
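A script with these properties might look like the following sketch. The file paths, the raw column layout (`subject_id`, `visit1`, `visit2`, ...) and the tidy column names are assumptions chosen for illustration. The point is only that the input, the output and every processing choice are fixed in the code, so running the script again reproduces the tidy data exactly.

```python
import csv
from pathlib import Path

# Hypothetical fixed locations: the script takes no parameters, so every
# choice lives in the code and is recorded with it.
RAW_PATH = Path("raw/measurements.csv")
TIDY_PATH = Path("tidy/measurements_tidy.csv")
TIDY_COLUMNS = ["subject_id", "visit", "weight_kg"]


def tidy_rows(raw_rows):
    """Turn one wide raw row per subject into one tidy row per visit."""
    for row in raw_rows:
        for column, value in row.items():
            if column == "subject_id":
                continue
            yield {
                "subject_id": row["subject_id"],
                # Assumed raw column names: "visit1", "visit2", ...
                "visit": int(column[len("visit"):]),
                # Missing measurements stay missing; they are not dropped.
                "weight_kg": float(value) if value else None,
            }


def main():
    """Read the raw file, write the tidy file; the raw file is never modified."""
    TIDY_PATH.parent.mkdir(parents=True, exist_ok=True)
    with RAW_PATH.open(newline="") as raw_file, TIDY_PATH.open("w", newline="") as tidy_file:
        writer = csv.DictWriter(tidy_file, fieldnames=TIDY_COLUMNS)
        writer.writeheader()
        writer.writerows(tidy_rows(csv.DictReader(raw_file)))


if __name__ == "__main__":
    if RAW_PATH.exists():
        main()
    else:
        print(f"raw file not found: {RAW_PATH}")
```

Keeping the reshaping logic in `tidy_rows`, separate from the file handling, makes it easy to check the recipe on a few hand-written rows before running it on the full raw data.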
In some cases, it will not be possible to script every step. In that case, you should provide step-by-step instructions such as:
- Take the raw file, run version 3.1.2 of the summarize software with parameters a=1, b=2, c=3
- Run the software separately for each sample.
- Take column three of outputfile.txt for each sample; that is the corresponding row in the output data set.
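Even when the software itself has to be run by hand, the final bookkeeping step can often still be scripted. The sketch below covers only that last step and rests on assumptions not stated above: that outputfile.txt is whitespace-separated with no header line, and that each sample's lines have already been read into a list. The function names are hypothetical.

```python
def third_column(lines):
    """Return the third whitespace-separated field of every non-empty line."""
    return [line.split()[2] for line in lines if line.strip()]


def build_output(per_sample_lines):
    """Map each sample name to one output row built from column three."""
    return {sample: third_column(lines) for sample, lines in per_sample_lines.items()}


# Hypothetical contents of outputfile.txt for two samples.
example = {
    "sample_a": ["g1 0.1 5.0", "g2 0.2 6.5"],
    "sample_b": ["g1 0.3 4.2", "g2 0.4 7.1"],
}
output_rows = build_output(example)
```

Here each sample contributes exactly one row to the output data set, matching the written instructions.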