Handling large data sets in R


Background:


Recently, along with my co-author, I gave a presentation on options to handle large data sets using R at NYC DataScience Academy.

You can watch the presentation here.

This blog presents an overview of the presentation covering the available options to process large data sets in R efficiently.


The Problem with large data sets in R:


R loads data sets entirely into memory, and all computation happens in RAM. As file sizes approach the memory available on the machine, loading slows dramatically and processing can fail outright.



How big is a large data set:


We can categorize large data sets in R across three broad categories:

  1. Medium sized data sets (< 2 GB) that fit in memory but are slow to load and process

  2. Large data sets (2 - 10 GB) that are too big for in-memory processing but too small for distributed computing

  3. Very large data sets (> 10 GB) that call for distributed frameworks such as Hadoop

We will go through the solution approach for each of these situations in the following sections.


Medium sized datasets (< 2 GB)


Try to reduce the size of the file before loading it into R
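
For instance, if you only need a subset of the columns, you can skip the rest at read time. A minimal sketch, assuming a hypothetical file with four columns of which we keep two:

# Setting a column's entry in colClasses to "NULL" skips that column entirely
keep <- c("character", "NULL", "NULL", "numeric")
smaller <- read.csv("data/big_file.csv", colClasses = keep,
                    stringsAsFactors = FALSE)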



Pre-allocate number of rows and pre-define column classes


Read optimization example:

  1. Read in a few records of the input file, identify the classes of its columns, and assign those column classes when reading the entire data set.

  2. Calculate the approximate row count of the data set based on the file size and the number of fields per row (or using wc -l on the command line) and set the nrows= parameter.

  3. Set the comment.char parameter (e.g. comment.char = "") to turn off the processing of comment lines.

library(dplyr)   # for tbl_df()

# Read a small sample of the file to infer column classes
bigfile.sample <- read.csv("data/SAT_Results2014.csv",
                           stringsAsFactors = FALSE, header = TRUE, nrows = 20)

bigfile.colclass <- sapply(bigfile.sample, class)

# Read the full file with pre-defined column classes, a row-count hint,
# and comment processing turned off
bigfile.raw <- tbl_df(read.csv("data/SAT_Results2014.csv",
                    stringsAsFactors = FALSE, header = TRUE, nrows = 10000,
                    colClasses = bigfile.colclass, comment.char = ""))


These simple changes will significantly improve the loading operation in R.


Alternatively, use fread() from the data.table package.
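
fread() auto-detects separators and column classes, and its select= argument reads only the columns you need. A minimal sketch (the column names here are hypothetical):

library(data.table)

# Read just two columns of the file, with classes detected automatically
dt <- fread("data/SAT_Results2014.csv", select = c("DBN", "SCHOOL NAME"))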


The following benchmark shows the optimization steps while reading the file and the relative performance improvement achieved.

url <- "./311_Service_2014.csv"
#File size (MB) : 844
#1,844,515 rows 52 columns


#Standard Read.csv ####
#==========================================================================
system.time(DF1 <- read.csv(url,stringsAsFactors=FALSE))
#user  system elapsed 
#243.38    5.49  249.73


#Optimized Read.csv ####
#==========================================================================
system.time(length(readLines(url)))
#Number of lines : 1844516
#user  system elapsed 
#106.56    2.47  109.63 

classes <- c("numeric",rep("character",48),rep("numeric",2), "character")

system.time(DF2 <- read.csv(url, header = TRUE, sep = ",", stringsAsFactors = FALSE, nrows = 1844516, colClasses = classes))
#user  system elapsed 
#173.73    3.43  182.73 

#fread ####
#==========================================================================
library(data.table)

system.time(DT1 <- fread(url))
#user  system elapsed 
#80.10    1.09   81.30 


#Summary ####
#==========================================================================
##    user  system elapsed  Method
##   243.38   5.49   249.73  read.csv (first time)
##   173.73   3.43   182.73  Optimized read.csv
##    80.10    1.09   81.30  fread


If it suits your processing requirements, use pipe operators to chain the steps of your workflow, overwriting intermediate results and minimizing duplication of the data set across process steps.
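
A minimal sketch of the idea with dplyr's %>% operator (the column names are hypothetical): each step feeds directly into the next, so no named intermediate copies of the data set accumulate in memory.

library(dplyr)

# One pipeline instead of several intermediate data frames
result <- read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE) %>%
    filter(Num.of.SAT.Test.Takers > 50) %>%
    group_by(Borough) %>%
    summarise(mean.math = mean(SAT.Math.Avg.Score, na.rm = TRUE))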


Parallel Processing


The parallelism approach runs several computations at the same time, taking advantage of multiple cores or CPUs on a single system or across systems. The following R packages are used for parallel processing in R.

Explicit parallelism (user controlled)

examples:
- Rmpi (Message Passing Interface)
- snow (Simple Network of Workstations)


Implicit parallelism (system abstraction)

examples:
- doMC/foreach

Given below is an example of multi-core registration using doMC:


# enable parallel processing for computationally intensive operations.

library(doMC)
registerDoMC(cores = 4)
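
Once the cores are registered, foreach with %dopar% distributes loop iterations across them. A minimal sketch, where the body of the loop is just a placeholder computation:

library(doMC)
library(foreach)

registerDoMC(cores = 4)

# %dopar% runs the iterations in parallel on the registered cores;
# .combine = c collects the results into a single vector
results <- foreach(i = 1:100, .combine = c) %dopar% {
    mean(rnorm(1e5, mean = i))
}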


Large datasets (2 - 10 GB)


For large data sets which are too big for in-memory processing but too small for distributed computing, the following R packages come in handy.

bigmemory

bigmemory is part of the “big” family, which consists of several packages that perform analysis on large data sets. bigmemory provides several matrix object types, but we will only focus on big.matrix.

big.matrix is an R object that uses a pointer to a C++ data structure. The location of the pointer to the C++ matrix can be saved to disk or RAM and shared with other users across different sessions.

By loading the pointer object, users can access the data set without reading the entire set into R.

The following sample code will give a better understanding of how to use bigmemory:

example

# User / Session 1

library(bigmemory)
library(biganalytics)
library(bigtabulate)

#Create big.matrix 

setwd("/Users/sundar/dev")

school.matrix <- read.big.matrix(
    "./numeric_matrix_SAT__College_Board__2010_School_Level_Results.csv",
    type = "integer", header = TRUE, backingfile = "school.bin",
    descriptorfile = "school.desc", extraCols = NULL)

# Get the location of the pointer to school.matrix. 
desc <- describe(school.matrix)

str(school.matrix)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
##   ..@ address:<externalptr>
# process big matrix in active session. 

colsums.session1 <- sum(as.numeric(school.matrix[,3])) 
colsums.session1
## [1] 67147
# Save the location to disk to share the object.
dput(desc, file = "/tmp/A.desc")
# Session 2
setwd("/Users/sundar/dev")

library(bigmemory)
library(biganalytics)

# Read the pointer from disk.
shared.desc <- dget("/tmp/A.desc")

# Attach to the pointer in RAM.
shared.bigobject <- attach.big.matrix(shared.desc)

# Check our results.
colsums.session2 <- sum(shared.bigobject[,3]) 
colsums.session2
## [1] 67147

As one can see, bigmemory is a powerful option for reading and processing big files: the pointer to the matrix object can be shared across sessions and then treated like a normal R data object.

However, there is a limitation with bigmemory: C++ matrices allow only one type of data, so the entire data set must consist of a single class of data.

That leads us to the next package to handle large data sets in R.


ff

ff is another package for dealing with large data sets, similar to bigmemory. It likewise uses a pointer, but one to a flat binary file stored on disk, and this pointer can be shared across different sessions.
One advantage ff has over bigmemory is that it supports multiple data class types in the same data set.

example

library(ff)
                                 
# creating the file
school.ff <- read.csv.ffdf(file="/Users/sundar/dev/mixed_matrix_SAT__College_Board__2010_School_Level_Results.csv")

#creates a ffdf object 
class(school.ff)
## [1] "ffdf"
# ffdf is a virtual dataframe
str(school.ff)
## List of 3
##  $ virtual: 'data.frame':    5 obs. of  7 variables:
##  .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...
##  .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE
##  .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE
##  .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE
##  .. $ PhysicalElementNo: int  1 2 3 4 5
##  .. $ PhysicalFirstCol : int  1 1 1 1 1
##  .. $ PhysicalLastCol  : int  1 1 1 1 1
##  .. - attr(*, "Dim")= int  157 5
##  .. - attr(*, "Dimorder")= int  1 2
##  $ physical: List of 5
##  .. $ characters           : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd045531d5b.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  ..  .. ..- attr(*, "Levels")= chr "aabc"
##  ..  .. ..- attr(*, "ramclass")= chr "factor"
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ Number.of.Test.Takers: list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd053ac64eb.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ Critical.Reading.Mean: list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd05b15ab37.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ Mathematics.Mean     : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd06b9bd698.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ Writing.Mean         : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd04425cc59.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  $ row.names:  NULL
## - attributes: List of 2
##  .. $ names: chr [1:3] "virtual" "physical" "row.names"
##  .. $ class: chr "ffdf"
# ffdf object can be treated as any other R object
sum(school.ff[,3])
## [1] 66029


Very Large datasets


There are two options to process very large data sets (> 10 GB) in R.

  1. Use integrated environment packages like Rhipe to leverage the Hadoop MapReduce framework.

  2. Use RHadoop directly on a Hadoop distributed system (see the sketch after this list).
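
As a flavor of the RHadoop route, below is a minimal sketch using the rmr2 package, assuming a working Hadoop cluster with RHadoop installed and configured; the computation itself is a toy placeholder.

library(rmr2)

# Push a small vector to HDFS
small.ints <- to.dfs(1:1000)

# The map step runs on the cluster; here it emits each value and its square
squares <- mapreduce(input = small.ints,
                     map = function(k, v) keyval(v, v^2))

# Pull the results back into the R session
str(from.dfs(squares))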

Storing large files in databases and connecting through DBI/ODBC calls from R is also an option worth considering.
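
For the database route, here is a minimal sketch using DBI with SQLite; the file, table, and column names are hypothetical. The idea is to push filtering and aggregation into the database and pull only the small summarized result into R.

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "data/service_requests.sqlite")

# Aggregate inside the database; only the summary comes back to R
agency.counts <- dbGetQuery(con,
    "SELECT agency, COUNT(*) AS n FROM requests GROUP BY agency")

dbDisconnect(con)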


Conclusion:


As you would have realized by now, R provides many options for handling data files, whatever size they come in: small, medium, or large.

Go ahead and analyze in full that data set you have been holding off on until now due to system memory limitations.