Steps in a data analysis
- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code
Define the question
You will usually have either too much or too little information to solve your problem. Defining the question as narrowly as possible reduces the noise you have to work through.
For example:
Start with a general question: Can I automatically detect emails that are spam?
Make it concrete: Can I use quantitative characteristics of the emails to classify them as spam or not spam?
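To make "quantitative characteristics" concrete, here is a minimal Python sketch that turns one email body into a few numbers. The function name `email_features`, the keyword "free", and the three features are illustrative assumptions, not features prescribed by these notes.

```python
import re

def email_features(text: str, keyword: str = "free") -> dict:
    """Turn one raw email body into a few numeric features.

    The feature choices here are hypothetical examples of
    'quantitative characteristics'; a real analysis would pick its own.
    """
    words = re.findall(r"[A-Za-z']+", text)
    letters = [c for c in text if c.isalpha()]
    keyword_count = sum(1 for w in words if w.lower() == keyword)
    return {
        # Percentage of words that equal the keyword (case-insensitive).
        "keyword_pct": 100.0 * keyword_count / len(words) if words else 0.0,
        # Fraction of letters that are upper-case.
        "capital_frac": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        # Number of exclamation marks in the text.
        "exclamations": text.count("!"),
    }

example = email_features("FREE money! Claim your free prize now!")
print(example)
```

Each email becomes one row of numbers, which is the form a classifier can work with.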
Define the ideal data set
The ideal data set depends on the type of analysis:
- descriptive - a whole population
- exploratory - a random sample with many variables measured
- inferential - the right population, randomly sampled
- predictive - a training and test data set from the same population
- causal - data from a randomized study
- mechanistic - data about all components of the system
Determine what data you can access
- Sometimes you can find free data on the web
- Other times you may need to buy the data
- Be sure to respect Terms of Use
- If the data doesn't exist, you may need to generate it yourself
Obtain the data
- Try to obtain the raw data
- Be sure to reference the source
- Polite emails go a long way
- If you load the data from an internet source, record the URL and time accessed
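Recording the URL and access time can be as simple as writing a small provenance record next to the downloaded file. The sketch below is one possible way to do it in Python; the function name `provenance_record`, the field names, and the example URL and path are all placeholders, not a required format.

```python
import json
from datetime import datetime, timezone

def provenance_record(url: str, local_path: str) -> dict:
    """Build a small record of where a data file came from and when."""
    return {
        "source_url": url,
        "local_path": local_path,
        # Time of access in UTC, so the record does not depend on the local time zone.
        "accessed_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }

# Placeholder URL and path, for illustration only.
record = provenance_record("https://example.com/data.csv", "data/raw/data.csv")
print(json.dumps(record, indent=2))
```

Saving this record alongside the raw file keeps the source reference and the data together.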
Clean the data
- Raw data often needs to be processed
- If it is pre-processed, make sure you understand how
- Understand the source of the data (census sample, convenience sample, etc.)
- The data may need reformatting or subsampling; record those steps
- Determine if the data are good enough - if not, quit or change data
Subsampling the data set
- For a prediction question (such as the spam example), split the data into a training set and a test set
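A random split into training and test sets might look like the following Python sketch. The function name `train_test_split`, the 50/50 split, and the seed value are assumptions for illustration; the fixed seed is there so the same split can be regenerated later.

```python
import random

def train_test_split(rows, test_fraction=0.5, seed=3435):
    """Randomly assign rows to a training set and a test set.

    A fixed seed makes the split reproducible; the default values
    here are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    # Preserve the original row order within each subset.
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test

rows = list(range(10))
train, test = train_test_split(rows, test_fraction=0.5)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, and rerunning with the same seed gives the same split.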
Exploratory data analysis
- Look at summaries of the data
- Check for missing data
- Create exploratory plots
- Perform exploratory analyses (e.g., clustering)
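The first two checks above (summaries and missing data) can be sketched for a single numeric column as follows. This is a minimal Python example that assumes missing values are stored as `None`; the function name `summarize` and the sample values are made up for illustration.

```python
import statistics

def summarize(column):
    """Summarize one numeric column, counting None as missing."""
    present = [x for x in column if x is not None]
    summary = {"n": len(column), "missing": len(column) - len(present)}
    if present:
        summary.update({
            "min": min(present),
            "median": statistics.median(present),
            "mean": statistics.fmean(present),
            "max": max(present),
        })
    return summary

# Made-up values for illustration, with two missing entries.
values = [1.0, 3.5, None, 2.0, 40.0, None, 1.5]
summary = summarize(values)
print(summary)
```

A mean far above the median, as in this toy column, is the kind of skew that is worth following up with a plot.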
Statistical prediction/modeling
- Should be informed by the results of your exploratory analysis
- Exact methods depend on the question of interest
- Transformations/processing should be accounted for when necessary
- Measures of uncertainty should be reported
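As one example of reporting a measure of uncertainty, the Python sketch below computes a test-set error rate together with an approximate 95% interval. It uses the normal-approximation (Wald) interval for a proportion, which is a choice made here for simplicity and not a method named in these notes; the function name and the toy labels are also assumptions.

```python
import math

def error_rate_with_interval(predicted, actual, z=1.96):
    """Return (error rate, lower bound, upper bound).

    Uses the normal-approximation (Wald) interval, which can be
    inaccurate for small samples or rates near 0 or 1.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and the same length")
    n = len(actual)
    errors = sum(p != a for p, a in zip(predicted, actual))
    rate = errors / n
    se = math.sqrt(rate * (1 - rate) / n)
    return rate, max(0.0, rate - z * se), min(1.0, rate + z * se)

# Toy labels for illustration: 20 test emails, 4 of them misclassified.
actual = ["spam"] * 10 + ["ham"] * 10
predicted = ["spam"] * 8 + ["ham"] * 2 + ["ham"] * 8 + ["spam"] * 2
rate, low, high = error_rate_with_interval(predicted, actual)
print(f"error rate {rate:.2f}, approx. 95% interval ({low:.2f}, {high:.2f})")
```

Reporting the interval next to the point estimate shows how much the error rate could move with a different test sample.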
Interpret results
- Use the appropriate language (describes, correlates with/associated with, leads to/causes, predicts)
- Give an explanation
- Interpret coefficients
- Interpret measures of uncertainty
Challenge results
- Challenge all steps: question, data source, processing, analysis, conclusions
- Challenge measures of uncertainty
- Challenge choices of terms to include in models
- Think of potential alternative analyses
Synthesize/write up results
- Lead with the question
- Summarize the analyses into the story
- Don't include every analysis; include one only if it is needed for the story or to address a challenge
- Order analyses according to the story, rather than chronologically
- Include "pretty" figures that contribute to the story
Lastly, create reproducible code using Markdown, knitr, and RStudio. Code that others can rerun makes the evidence for your conclusions much more convincing.