Reading Tabular Data
read.table, read.csv - used for reading tabular data
readLines - used for reading lines of a text file
source - used for reading in R code files (inverse of dump)
dget - used for reading in R code files (inverse of dput)
load - used for reading in saved workspaces
unserialize - used for reading single R objects in binary form
Writing Data
- write.table
- writeLines
- dump
- dput
- save
- serialize
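For example, a minimal sketch of a save/load round trip (the file name mydata.RData is just an illustration): save writes R objects to a binary file, and load restores them into the workspace under their original names.
> x <- 1:5
> y <- data.frame(a = 1, b = "a")
> save(x, y, file = "mydata.RData")   # write both objects to a binary file
> rm(x, y)
> load("mydata.RData")                # x and y are back in the workspace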
Reading Data Files with read.table
Arguments:
- file, the name of a file, or a connection
- header, logical indicating if the file has a header line
- sep, a string indicating how the columns are separated
- colClasses, a character vector indicating the class of each column in the dataset
- nrows, the number of rows in the dataset
- comment.char, a character string indicating the comment character
- skip, the number of lines to skip from the beginning
- stringsAsFactors, should character variables be coded as factors?
For small to moderately sized datasets, you can call read.table without any other arguments.
> data <- read.table("foo.txt")
R will automatically
- skip lines that begin with #
- figure out how many rows there are and how much memory needs to be allocated
- figure out what type of variable is in each column - explicitly stating this makes R run faster
read.csv is identical to read.table, except that the default separator is a comma and header = TRUE by default
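For instance, assuming foo.csv is a comma-separated file with a header row, these two calls read the same data:
> data <- read.csv("foo.csv")
> data <- read.table("foo.csv", sep = ",", header = TRUE)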
Reading Large Tables
The help page for read.table contains many hints; it is worth knowing well.
Make a rough calculation of the memory required to store your dataset (call it N). If N is greater than the amount of RAM on your computer, reading the data in won't be possible.
Set comment.char = "" if there are no commented lines in your file.
Reading in Larger Datasets with read.table
Specifying the colClasses argument can make read.table run significantly faster. To figure out the classes of each column:
> initial <- read.table("datatable.txt", nrows = 100)  # or 1000
> classes <- sapply(initial, class)
> tabAll <- read.table("datatable.txt", colClasses = classes)
Set nrows. This doesn't make R run faster, but it helps with memory usage, and a mild overestimate is fine. Use the Unix tool wc -l to count the number of lines in a file.
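Putting these hints together (the file name and row count here are illustrative; classes comes from the colClasses example above):
> # suppose wc -l reported roughly 1,500,000 lines in datatable.txt
> data <- read.table("datatable.txt", colClasses = classes,
+                    nrows = 1500000, comment.char = "")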
When using R with larger data sets, it helps to know:
- how much memory is available?
- what other applications are in use?
- are there other users logged on to the same system?
- what OS?
- is the OS 32- or 64-bit?
e.g., Calculating Memory Requirements
A data frame has 1,500,000 rows and 120 columns, all of which are numeric data. Roughly how much memory is required to store this data?
1,500,000 rows * 120 columns * 8 bytes/numeric = 1,440,000,000 bytes
1,440,000,000 bytes / 2^20 bytes/MB = 1,373.29 MB
1,373.29 MB / 1,024 MB/GB = 1.34 GB = N
Note: There are 2^20 bytes/mb, since 2^10 = 1,024.
A rule of thumb: You'll need 2N RAM to read in a dataset that requires N memory.
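To check the arithmetic in R (this just reproduces the calculation above; it is not a measurement of R's actual memory use):
> rows <- 1500000; cols <- 120
> bytes <- rows * cols * 8   # 8 bytes per numeric value
> bytes / 2^30               # convert bytes to gigabytes
[1] 1.341105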
Textual Data Formats: dput() and dump()
dumping and dputing are useful because the resulting textual format is editable (and recoverable in case of corruption)
dump and dput preserve the metadata (unlike write.table or writeLines) so that the user doesn't have to specify it again
textual data formats can work better with version control programs (like subversion or git), which can track changes meaningfully only in text files
dput takes an arbitrary R object and will create some R code that will reconstruct the object in R.
In the example below, y is a data frame with two columns, a and b. dput writes out R code describing a list with those two elements, plus the metadata attached as attributes (the row names and the class). The metadata itself is not particularly interesting, but writing it all to a file means the object can be reconstructed later exactly as it was.
> y <- data.frame(a = 1, b = "a")
> dput(y)
structure(list(a = 1,
               b = structure(1L, .Label = "a",
                             class = "factor")),
          .Names = c("a", "b"), row.names = c(NA, -1L),
          class = "data.frame")
> dput(y, file = "y.R")
> new.y <- dget("y.R")
> new.y
  a b
1 1 a
dget can only be used on a single R object, whereas dump can be used on multiple R objects.
> x <- "foo" > y <- data.frame(a = 1, b = "a") > dump(c("x", "y"), file = "data.R") > rm(x, y) > source("data.R") > y \ta\tb 1\t1\ta > a [1] "foo"
R Connections - Interfaces to the Outside World
Data are read in using connection interfaces. Connections can be made to files or to other, more exotic things.
file - opens a connection to a file
gzfile - opens a connection to a file compressed with gzip
bzfile - opens a connection to a file compressed with bzip2
url - opens a connection to a webpage
File connections
> str(file)
function (description = "", open = "", blocking = TRUE,
          encoding = getOption("encoding"))
description is the name of the file
open is a code indicating:
- r - read only
- w - writing (and initializing a new file)
- a - appending
- rb, wb, ab - reading, writing or appending in binary mode (Windows)
In general, we often don't need to deal with the connection interface directly.
> con <- file("foo.txt", "r") > data <- read.csv(con) > close(con)
is the same as
> data <- read.csv("foo.txt")
Reading Lines of a Text File
> con <- gzfile("words.gz")
> x <- readLines(con, 10)
> x
 [1] "a" "b" "c" "d"
 [5] "e" "f" "g" "h"
 [9] "i" "j"
writeLines takes a character vector and writes each element one line at a time to a text file.
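A minimal sketch (the file name output.txt is illustrative): writeLines writes each element of the character vector as one line of the file.
> con <- file("output.txt", "w")
> writeLines(c("line one", "line two"), con)
> close(con)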
readLines can be useful for reading in lines of webpages.
> con <- url("http://www.jhsph.edu", "r")
> x <- readLines(con)
> head(x)  # prints out the HTML of the webpage, line by line