Neil Thawani - Blog - Coursera - The Data Scientist’s Toolbox

New dream? I guess I'll invest in myself.

Today I enrolled in the Coursera Data Science Specialization track. Here are some notes from the course I took today and a few accompanying thoughts.

The key challenge in data science is that you're in one of two situations. Either you:

don't have enough information to solve your problem and need to seek it out, or
are overwhelmed with a surplus of information and need to filter out the erroneous parts.

According to McKinsey, "Big Data is the next frontier for innovation, competition, and productivity."

statistics - the science of learning from data

To me, Big Data is a buzzword. It's really the application of statistics to solving complex problems in a complex world by turning information into actionable insights. Something interesting that I found, though, is that there are different kinds of data science depending on the field you're working in. While this seems intuitive, the labels were helpful. Biostatistics, data science, machine learning, natural language processing, signal processing, business analytics, econometrics, and statistical process control are all branches of the same tree.

They also showed this Venn diagram:

[Image: data scientist Venn diagram]

and defined what it is exactly that data scientists do:

define the question of interest
define the ideal data set
get and clean the data
explore the data (plots, clustering, find patterns)
perform statistical prediction/modeling
interpret, challenge, and finally synthesize the results in a reproducible way
share the results
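
To make those steps concrete for myself, here's a minimal sketch in R. It isn't from the lecture: the question, the choice of the built-in mtcars data set, and the variable names (cars, fit, slope, predicted) are all my own assumptions for illustration.

```r
# 1. Question (assumed for illustration): do heavier cars get fewer miles per gallon?

# 2-3. Get and clean the data: mtcars ships with R; keep complete rows and two columns.
data(mtcars)
cars <- mtcars[complete.cases(mtcars), c("mpg", "wt")]

# 4. Explore: numeric summaries and the correlation between weight and mileage.
#    In an interactive session, plot(cars$wt, cars$mpg) would draw the scatter plot.
print(summary(cars))
print(cor(cars$wt, cars$mpg))

# 5. Model: a simple linear regression of mpg on weight (wt is in 1000s of lbs).
fit <- lm(mpg ~ wt, data = cars)
slope <- coef(fit)[["wt"]]

# 6. Interpret: a negative slope means mileage drops as weight goes up.
predicted <- predict(fit, newdata = data.frame(wt = 3))

# 7. Share: print the result so anyone re-running the script gets the same numbers.
cat(sprintf("Slope: %.2f mpg per 1000 lbs; predicted mpg at 3000 lbs: %.1f\n",
            slope, predicted))
```

Keeping everything in one script is the "reproducible" part: anyone can re-run it from top to bottom and land on the same result.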

The primary statistical analysis tool in this track will be R. The lecturer, Dr. Jeff Leek, went over some important R functions for getting help, using the function rnorm as an example:

access the help file

> ?rnorm

search help files

> help.search("rnorm")

get arguments

> args("rnorm")

see the function body

> rnorm
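
The help commands above only describe rnorm, so here's a small sketch of actually calling it. The seed, sample size, mean, and standard deviation are values I picked for illustration, not ones from the lecture.

```r
# Fix the seed so the random draws are the same on every run.
set.seed(42)

# Draw 1000 values from a normal distribution with mean 5 and standard deviation 2.
x <- rnorm(1000, mean = 5, sd = 2)

# The sample mean and sd should land close to the values requested above.
print(length(x))
print(mean(x))
print(sd(x))

# Script-friendly way to list the argument names that args() shows at the console.
arg_names <- names(formals(rnorm))
print(arg_names)  # "n" "mean" "sd"
```

Calling rnorm(1000) with no other arguments falls back to the defaults, a mean of 0 and a standard deviation of 1.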

During this time, I installed R and RStudio before delving into how to ask questions and find answers the smart way, based on a document by two software engineers. I e-mailed this document to my work account to review in the morning.

After that, Dr. Leek gave an overview of the courses in the Data Science Specialization track:

Getting Data
Exploratory Data Analysis (graphing and plotting data in R)
Reproducible Research
Statistical Inference
Regression Models
Practical Machine Learning
Building Data Products

Then it was time for the Weekly Quiz, which was pretty straightforward until the last question: What are R packages that provide machine learning functionality? After some quick Googling, I Ctrl+F'd my way to:

gbm - Various forms of gradient boosting are implemented in package gbm (tree-based functional gradient descent boosting)

pamr - The shrunken centroids classifier and utilities for gene expression analyses are implemented in package pamr
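
If you want to check whether these two packages are already on your machine, something like the sketch below should work. The variable names (candidates, installed) are mine, and it only checks for the packages; it doesn't install or load anything.

```r
# Packages named in the quiz answer above.
candidates <- c("gbm", "pamr")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise,
# without attaching it to the search path.
installed <- vapply(candidates, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Anything missing could then be installed from CRAN with install.packages().
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```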

Well, that was fun. Looking forward to starting the next course tomorrow.

Coursera - The Data Scientist’s Toolbox - Week 1