Neil Thawani - Blog - Coursera - Getting and Cleaning Data - Week 1

Reading Local Files

Loading flat files - read.table()

Some more important parameters

quote - you can tell R whether there are any quoted values; quote="" means no quotes

na.strings - set the character string (or vector of strings) that represents a missing value

nrows - how many rows to read from the file

skip - number of lines to skip before starting to read
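As a rough sketch of how these parameters fit together, the example below writes a small made-up file to a temporary path and reads it back. The file contents, the column names, and the use of -999 as a missing-value code are all assumptions for illustration.

```r
# Hypothetical sample data: one junk line, a header, then four records.
sampleLines <- c(
  "exported by a hypothetical tool",
  "id,name,score",
  "1,O'Brien,90",
  "2,Lee,NA",
  "3,Kim,-999",
  "4,Park,75"
)
samplePath <- tempfile(fileext = ".csv")
writeLines(sampleLines, samplePath)

scores <- read.table(
  samplePath,
  sep = ",",
  header = TRUE,
  quote = "",                     # no quoting, so the apostrophe in O'Brien is plain text
  na.strings = c("NA", "-999"),   # treat both codes as missing
  skip = 1,                       # skip the junk line before the header
  nrows = 3                       # read only the first three data rows
)
print(scores)
```

Setting quote="" matters here because read.table() otherwise treats the apostrophe as the start of a quoted string.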

Reading Excel Files

read.xlsx, read.xlsx2 are in library(xlsx)

write.xlsx() will write out an Excel file with similar arguments
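A minimal sketch of a write-then-read round trip, assuming the xlsx package (and the Java runtime it depends on) is installed. The data frame and sheet name are made up for illustration.

```r
library(xlsx)

# Hypothetical data frame used only for this example.
grades <- data.frame(
  student = c("Ann", "Bob", "Cy"),
  score = c(90, 85, 70),
  stringsAsFactors = FALSE
)
xlsxPath <- tempfile(fileext = ".xlsx")

# Write the data frame to the first sheet, without a row-name column.
write.xlsx(grades, xlsxPath, sheetName = "grades", row.names = FALSE)

# Read the whole sheet back.
allRows <- read.xlsx(xlsxPath, sheetIndex = 1, header = TRUE)

# Read only part of the sheet: row 1 is the header, rows 2-3 are data.
firstTwo <- read.xlsx(xlsxPath, sheetIndex = 1, rowIndex = 1:3, colIndex = 1:2)
```

The rowIndex and colIndex arguments are handy when a spreadsheet has notes or totals around the table you actually want.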

The XLConnect package (see the XLConnect vignette) has more options for writing and manipulating Excel files.

In general, it is advised that you store your data in a database or in a .txt or .csv file, since those formats are easier to distribute than Excel files.

Reading XML

library(XML) - the XML library

    doc <- xmlTreeParse(fileUrl, useInternalNodes = TRUE)
    rootNode <- xmlRoot(doc)

xmlName(rootNode) - gets the name of the root node

Get the names of the nested elements under the root node: names(rootNode)

You can access parts of the XML document similar to how you access a list.

rootNode[[1]] - access the first element and its children

rootNode[[1]][[1]] - access the first child of the first element

xmlSApply(rootNode, xmlValue) - programmatically extract parts of the file; applies xmlValue to each child of the root node and returns the text inside each one

You can access information directly using XPath.

extract specific nodes using XPath: xpathSApply(rootNode, "//nodeName", xmlValue)
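The XML steps above can be sketched end to end as follows. To avoid depending on a URL, this example parses a small inline document with asText = TRUE; the menu contents and element names are invented for illustration.

```r
library(XML)

# Hypothetical document, inlined so no network access is needed.
xmlText <- paste0(
  "<menu>",
  "<food><name>Waffles</name><price>5.95</price></food>",
  "<food><name>Toast</name><price>4.50</price></food>",
  "</menu>"
)

doc <- xmlTreeParse(xmlText, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)

rootName <- xmlName(rootNode)             # name of the root element
childNames <- names(rootNode)             # names of the elements under the root
firstName <- xmlValue(rootNode[[1]][[1]]) # text of the first child of the first element

# XPath: pull the text out of every <name> and <price> node in the document.
foodNames <- xpathSApply(rootNode, "//name", xmlValue)
prices <- as.numeric(xpathSApply(rootNode, "//price", xmlValue))
```

List-style indexing is fine for exploring, but XPath is usually the shorter route once you know which nodes you want.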

Reading JSON

    library(jsonlite)
    jsonData <- fromJSON(url)
    names(jsonData)

Accessing nested objects in JSON

    > names(jsonData$objectName)
    > jsonData$objectName$subObject
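Here is a small sketch of nested access, assuming jsonlite is installed. It parses an inline JSON string rather than a URL; the repository and owner values are hypothetical.

```r
library(jsonlite)

# Hypothetical JSON: an array of records, each with a nested "owner" object.
jsonText <- '[
  {"name": "repoA", "owner": {"login": "alice", "id": 1}},
  {"name": "repoB", "owner": {"login": "bob", "id": 2}}
]'

jsonData <- fromJSON(jsonText)

topNames <- names(jsonData)          # top-level fields
ownerNames <- names(jsonData$owner)  # fields of the nested object
logins <- jsonData$owner$login       # one nested field across all records
```

With its default settings, fromJSON() turns an array of records into a data frame, and the nested object becomes a data frame stored in a single column, which is why the $ chain works.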

Writing Data Frames to JSON

    > myJson <- toJSON(dataset, pretty=TRUE)
    > cat(myJson) # output to console

Convert back to Data Frame

    dataset2 <- fromJSON(myJson)
    head(dataset2)
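To make the round trip concrete, here is a sketch with a small made-up data frame standing in for dataset, again assuming jsonlite is installed.

```r
library(jsonlite)

# Hypothetical data frame standing in for `dataset`.
dataset <- data.frame(
  id = 1:3,
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

myJson <- toJSON(dataset, pretty = TRUE)  # data frame -> JSON text
cat(myJson)                               # print the JSON to the console

dataset2 <- fromJSON(myJson)              # JSON text -> data frame
head(dataset2)
```

The round trip keeps column names and values, but details such as factor levels or row names may not survive, so it is worth checking the result with str().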
