Neil Thawani - Blog - Coursera - Reproducible Research - Week 1

Steps in a data analysis

Define the question
Define the ideal data set
Determine what data you can access
Obtain the data
Clean the data
Exploratory data analysis
Statistical prediction/modeling
Interpret results
Challenge results
Synthesize/write up results
Create reproducible code

You will usually have either too much or too little information to solve your problem. Defining the question as narrowly as possible helps reduce the noise.

For example, start with a general question: Can I automatically detect emails that are spam? Then make it concrete: Can I use quantitative characteristics of the emails to classify them as spam or not spam?
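To make "quantitative characteristics" concrete, here is a minimal Python sketch that turns the text of an email into a few numbers. The function name email_features and the particular features (dollar signs, exclamation marks, fraction of capital letters) are illustrative assumptions, not the features used in the course.

```python
def email_features(text: str) -> dict:
    """Turn raw email text into a few numeric characteristics (hypothetical choice)."""
    letters = [ch for ch in text if ch.isalpha()]
    capitals = [ch for ch in letters if ch.isupper()]
    return {
        "n_chars": len(text),
        "n_dollar": text.count("$"),
        "n_exclaim": text.count("!"),
        # Fraction of letters that are upper case; 0.0 when there are no letters.
        "frac_capital": len(capitals) / len(letters) if letters else 0.0,
    }


# Made-up example message, for illustration only.
example = email_features("WIN $$$ NOW! Click here")
```

Each email becomes one row of numbers, which is the kind of input a classifier can work with.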

Define the ideal data set

The ideal data set depends on the type of analysis:

descriptive - a whole population
exploratory - a random sample with many variables measured
inferential - the right population, randomly sampled
predictive - a training and test data set from the same population
causal - data from a randomized study
mechanistic - data about all components of the system

Determine what data you can access

Sometimes you can find free data on the web
Other times you may need to buy the data
Be sure to respect Terms of Use
If the data doesn't exist, you may need to generate it yourself

Obtain the data

Try to obtain the raw data
Be sure to reference the source
Polite emails go a long way
If you load the data from an internet source, record the URL and time accessed
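One way to record the URL and access time is to save a small provenance file next to the raw data. The sketch below is an assumption about how that could look in Python; the function names, the JSON layout, and the example.org URL are all hypothetical.

```python
import json
from datetime import datetime, timezone


def provenance_record(url, accessed_at=None):
    """Build a record of where the data came from and when it was accessed."""
    if accessed_at is None:
        # Default to the current time in UTC so the timestamp is unambiguous.
        accessed_at = datetime.now(timezone.utc)
    return {"source_url": url, "accessed_utc": accessed_at.isoformat()}


def save_provenance(record, path):
    """Write the record as JSON, e.g. alongside the downloaded raw file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)


# Hypothetical source URL, for illustration only.
record = provenance_record("https://example.org/data/raw.csv")
```

Calling save_provenance(record, "raw.csv.provenance.json") right after the download keeps the reference with the data.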

Clean the data

Raw data often needs to be processed
If it is pre-processed, make sure you understand how
Understand the source of the data (census sample, convenience sample, etc.)
The data may need reformatting or subsampling; record those steps
Determine if the data are good enough - if not, quit or change data

Subsampling our data set

For a prediction question, we need to split the data into a training set and a test set
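A minimal sketch of a reproducible random split, assuming the data is a list of rows held in memory. The function name train_test_split, the default fraction, and the seed are assumptions for illustration; fixing the seed is what makes the split repeatable.

```python
import random


def train_test_split(rows, test_fraction=0.5, seed=0):
    """Randomly split rows into (train, test) using a fixed seed."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    # Preserve the original row order within each subset.
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(10))  # stand-in for real records
train, test = train_test_split(rows, test_fraction=0.3, seed=42)
```

Record the seed and the fraction along with the other processing steps so that the same split can be regenerated later.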

Exploratory data analysis

Look at summaries of the data
Check for missing data
Create exploratory plots
Perform exploratory analyses (e.g., clustering)
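The first two checks, summaries and missing data, can be sketched in a few lines of Python. This assumes a column is a list of numbers in which a missing value is stored as None; the function name summarize_column is hypothetical.

```python
import statistics


def summarize_column(values):
    """Count missing entries and summarize the values that are present."""
    present = [v for v in values if v is not None]
    summary = {"n": len(values), "n_missing": len(values) - len(present)}
    if present:
        summary.update(
            min=min(present),
            median=statistics.median(present),
            mean=statistics.mean(present),
            max=max(present),
        )
    return summary


summary = summarize_column([1, None, 3, 5])
```

A large n_missing, or a minimum or maximum that looks implausible, is a signal to go back to the cleaning step.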

Statistical prediction/modeling

Modeling should be informed by the results of your exploratory analysis
The exact methods depend on the question of interest
Transformations/processing should be accounted for when necessary
Measures of uncertainty should be reported
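As one illustration of reporting uncertainty for a prediction, the sketch below computes test-set accuracy together with an approximate 95% interval. It assumes the simple normal-approximation (Wald) interval for a proportion, which is rough for small test sets or accuracies near 0 or 1; the function name and the choice of interval are assumptions, not something prescribed in the lecture.

```python
import math


def accuracy_with_interval(y_true, y_pred, z=1.96):
    """Return (accuracy, lower, upper) using a normal-approximation interval."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    acc = correct / n
    se = math.sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    # Clip to [0, 1], since the approximation can overshoot the valid range.
    return acc, max(0.0, acc - z * se), min(1.0, acc + z * se)


# Made-up labels: 1 = spam, 0 = not spam.
acc, lower, upper = accuracy_with_interval([1, 0, 1, 0], [1, 0, 0, 0])
```

Reporting the interval alongside the accuracy shows how much the estimate could move with a different test set.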

Interpret results

Use the appropriate language (describes, correlates with/associated with, leads to/causes, predicts)
Give an explanation
Interpret coefficients
Interpret measures of uncertainty

Challenge results

Challenge all steps: question, data source, processing, analysis, conclusions
Challenge measures of uncertainty
Challenge choices of terms to include in models
Think of potential alternative analyses

Synthesize/write up results

Lead with the question
Summarize the analyses into the story
Don't include every analysis; include one only if it is needed for the story or to address a challenge
Order analyses according to the story, rather than chronologically
Include "pretty" figures that contribute to the story

Lastly, create reproducible code using tools such as Markdown, knitr, and RStudio. Reproducible code makes the evidence for your conclusions much more powerful.

Coursera - Reproducible Research - Week 1 - Structure of a Data Analysis