R cookbook


Installing and using packages

Problem

You want to install and use a package.

Solution

If you are using a GUI for R, there is likely a menu-driven way of installing packages. This is how to do it from the command line:

install.packages('reshape2')

In each new R session where you use the package, you will have to load it:

library(reshape2)

If you use the package in a script, put this line in the script.

To update all your installed packages to the latest available versions:

update.packages()

If you are using R on Linux, some of the R packages may be installed at the system level by the root user, and can't be updated this way, since you won't have permission to overwrite them.

Indexing into a data structure

Problem

You want to get part of a data structure.

Solution

Elements from a vector, matrix, or data frame can be extracted using numeric indexing, or by using a boolean vector of the appropriate length. In many of the examples below, there are multiple ways of doing the same thing.

Indexing with numbers and names

With a vector:

# A sample vector
v <- c(1,4,4,3,2,2,3)

v[c(2,3,4)]
#> [1] 4 4 3
v[2:4]
#> [1] 4 4 3
v[c(2,4,3)]
#> [1] 4 3 4

With a data frame:

# Create a sample data frame
data <- read.table(header=T, text='
 subject sex size
 1   M    7
 2   F    6
 3   F    9
 4   M   11
 ')

# Get the element at row 1, column 3
data[1,3]
#> [1] 7
data[1,"size"]
#> [1] 7

# Get rows 1 and 2, and all columns
data[1:2, ]
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6
data[c(1,2), ]
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6

# Get rows 1 and 2, and only column 2
data[1:2, 2]
#> [1] M F
#> Levels: F M
data[c(1,2), 2]
#> [1] M F
#> Levels: F M

# Get rows 1 and 2, and only the columns named "sex" and "size"
data[1:2, c("sex","size")]
#>   sex size
#> 1   M    7
#> 2   F    6
data[c(1,2), c(2,3)]
#>   sex size
#> 1   M    7
#> 2   F    6

Indexing with a boolean vector

With the vector v from above:

v > 2
#> [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE

v[v>2]
#> [1] 4 4 3 3

v[ c(F,T,T,T,F,F,T) ]
#> [1] 4 4 3 3

With the data frame from above:

# A boolean vector
data$subject < 3
#> [1]  TRUE  TRUE FALSE FALSE

data[data$subject < 3, ]
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6

data[c(TRUE,TRUE,FALSE,FALSE), ]
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6

# It is also possible to get the numeric indices of the TRUEs
which(data$subject < 3)
#> [1] 1 2

Negative indexing

Unlike in some other programming languages, when you use negative numbers for indexing in R, it doesn't mean to index backward from the end. Instead, it means to drop the element at that index, counting the usual way, from the beginning.

# Here's the vector again.
v
#> [1] 1 4 4 3 2 2 3

# Drop the first element
v[-1]
#> [1] 4 4 3 2 2 3

# Drop first three
v[-1:-3]
#> [1] 3 2 2 3

# Drop just the last element
v[-length(v)]
#> [1] 1 4 4 3 2 2

Getting a subset of a data structure

Problem

You want to get a subset of the elements of a vector, matrix, or data frame.

Solution

To get a subset based on some conditional criterion, the subset() function or indexing using square brackets can be used. In the examples here, both ways are shown.

# A sample vector
v <- c(1,4,4,3,2,2,3)

subset(v, v<3)
#> [1] 1 2 2
v[v<3]
#> [1] 1 2 2

# Another vector
t <- c("small", "small", "large", "medium")

# Remove "small" entries
subset(t, t!="small")
#> [1] "large"  "medium"
t[t!="small"]
#> [1] "large"  "medium"

One important difference between the two methods is that you can assign values to elements with square bracket indexing, but you cannot with subset():

v[v<3] <- 9

subset(v, v<3) <- 9
#> Error in subset(v, v < 3) <- 9: could not find function "subset<-"

With data frames:

# A sample data frame
data <- read.table(header=T, text='
 subject sex size
 1   M    7
 2   F    6
 3   F    9
 4   M   11
 ')

subset(data, subject < 3)
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6
data[data$subject < 3, ]
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6

# Subset of particular rows and columns
subset(data, subject < 3, select = -subject)
#>   sex size
#> 1   M    7
#> 2   F    6
subset(data, subject < 3, select = c(sex,size))
#>   sex size
#> 1   M    7
#> 2   F    6
subset(data, subject < 3, select = sex:size)
#>   sex size
#> 1   M    7
#> 2   F    6
data[data$subject < 3, c("sex","size")]
#>   sex size
#> 1   M    7
#> 2   F    6

# Logical AND of two conditions
subset(data, subject < 3 & sex=="M")
#>   subject sex size
#> 1       1   M    7
data[data$subject < 3 & data$sex=="M", ]
#>   subject sex size
#> 1       1   M    7

# Logical OR of two conditions
subset(data, subject < 3 | sex=="M")
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6
#> 4       4   M   11
data[data$subject < 3 | data$sex=="M", ]
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6
#> 4       4   M   11

# Condition based on transformed data
subset(data, log2(size) > 3)
#>   subject sex size
#> 3       3   F    9
#> 4       4   M   11
data[log2(data$size) > 3, ]
#>   subject sex size
#> 3       3   F    9
#> 4       4   M   11

# Subset if elements are in another vector
subset(data, subject %in% c(1,3))
#>   subject sex size
#> 1       1   M    7
#> 3       3   F    9
data[data$subject %in% c(1,3), ]
#>   subject sex size
#> 1       1   M    7
#> 3       3   F    9

Making a vector filled with values

Problem

You want to create a vector with values already filled in.

Solution

rep(1, 50)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [39] 1 1 1 1 1 1 1 1 1 1 1 1

rep(F, 20)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

rep(1:5, 4)
# 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

rep(1:5, each=4)
# 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5

# Use it on a factor
rep(factor(LETTERS[1:3]), 5)
# A B C A B C A B C A B C A B C
# Levels: A B C

Information about variables

Problem

You want to find information about variables.

Solution

Here are some sample variables to work with in the examples below:

x <- 6
n <- 1:4
let <- LETTERS[1:4]
df <- data.frame(n, let)

Information about existence

# List currently defined variables
ls()
#> [1] "df"       "filename" "let"      "n"        "old_dir"  "x"

# Check if a variable named "x" exists
exists("x")
#> [1] TRUE

# Check if "y" exists
exists("y")
#> [1] FALSE

# Delete variable x
rm(x)
x
#> Error in eval(expr, envir, enclos): object 'x' not found

Information about size/structure

# Get information about structure
str(n)
#>  int [1:4] 1 2 3 4

str(df)
#> 'data.frame':    4 obs. of  2 variables:
#>  $ n  : int  1 2 3 4
#>  $ let: Factor w/ 4 levels "A","B","C","D": 1 2 3 4

# Get the length of a vector
length(n)
#> [1] 4

# Length probably doesn't give us what we want here:
length(df)
#> [1] 2

# Number of rows
nrow(df)
#> [1] 4

# Number of columns
ncol(df)
#> [1] 2

# Get rows and columns
dim(df)
#> [1] 4 2

Working with NULL, NA, and NaN

Problem

You want to properly handle NULL, NA, or NaN values.

Solution

Sometimes your data will include NULL, NA, or NaN. These work somewhat differently from "normal" values, and may require explicit testing.

Here are some examples of comparisons with these values:

x <- NULL
x > 5
# logical(0)

y <- NA
y > 5
# NA

z <- NaN
z > 5
# NA

Here's how to test whether a variable has one of these values:

is.null(x)
# TRUE

is.na(y)
# TRUE

is.nan(z)
# TRUE

Note that NULL is different from the other two. NULL means that there is no value, while NA and NaN mean that there is some value, although one that is perhaps not usable. Here's an illustration of the difference:

# Is y null?
is.null(y)
# FALSE

# Is x NA?
is.na(x)
# logical(0)
# Warning message:
# In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

In the first case, it checks if y is NULL, and the answer is no. In the second case, it tries to check if x is NA, but there is no value to be checked.

Ignoring “bad” values in vector summary functions

If you run functions like mean() or sum() on a vector containing NA or NaN, they will return NA or NaN, which is generally unhelpful, though this will alert you to the presence of the bad value. Many of these functions take the flag na.rm, which tells them to ignore these values.

vy <- c(1, 2, 3, NA, 5)   # 1 2 3 NA 5

mean(vy)
# NA
mean(vy, na.rm=TRUE)
# 2.75

vz <- c(1, 2, 3, NaN, 5)  # 1 2 3 NaN 5

sum(vz)
# NaN
sum(vz, na.rm=TRUE)
# 11

# NULL isn't a problem, because it doesn't exist
vx <- c(1, 2, 3, NULL, 5) # 1 2 3 5
sum(vx)
# 11

Removing bad values from a vector

These values can be removed from a vector by filtering using is.na() or is.nan():

vy                  # 1 2 3 NA 5
vy[ !is.na(vy) ]    # 1 2 3 5

vz                  # 1 2 3 NaN 5
vz[ !is.nan(vz) ]   # 1 2 3 5

Notes

There are also the infinite numerical values Inf and -Inf, and the associated functions is.finite() and is.infinite().
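As a quick illustration (not part of the original note), NA and NaN count as neither finite nor infinite:

x <- c(1, Inf, -Inf, NA, NaN)
is.finite(x)
#> [1]  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)
#> [1] FALSE  TRUE  TRUE FALSE FALSE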

Generating random numbers

Problem

You want to generate random numbers.

Solution

For uniformly distributed (flat) random numbers, use runif(). By default, its range is from 0 to 1.

runif(1)
#> [1] 0.09006613

# Get a vector of 4 numbers
runif(4)
#> [1] 0.6972299 0.9505426 0.8297167 0.9779939

# Get a vector of 3 numbers from 0 to 100
runif(3, min=0, max=100)
#> [1] 83.702278  3.062253  5.388360

# Get 3 integers from 0 to 100
# Use max=101 because it will never actually equal 101
floor(runif(3, min=0, max=101))
#> [1] 11 67  1

# This will do the same thing
sample(1:100, 3, replace=TRUE)
#> [1]  8 63 64

# To generate integers WITHOUT replacement:
sample(1:100, 3, replace=FALSE)
#> [1] 76 25 52

To generate numbers from a normal distribution, use rnorm(). By default the mean is 0 and the standard deviation is 1.

rnorm(4)
#> [1] -2.3308287 -0.9073857 -0.7638332 -0.2193786

# Use a different mean and standard deviation
rnorm(4, mean=50, sd=10)
#> [1] 59.20927 40.12440 44.58840 41.97056

# To check that the distribution looks right, make a histogram of the numbers
x <- rnorm(400, mean=50, sd=10)
hist(x)

Generating repeatable sequences of random numbers

Problem

You want to generate a sequence of random numbers, and then generate that same sequence again later.

Solution

Use set.seed(), and pass in a number as the seed.

set.seed(423)
runif(3)
#> [1] 0.1089715 0.5973455 0.9726307

set.seed(423)
runif(3)
#> [1] 0.1089715 0.5973455 0.9726307

Saving the state of the random number generator

Problem

You want to save and restore the state of the random number generator.

Solution

Save .Random.seed to another variable and then later restore that value to .Random.seed.

# For this example, set the random seed
set.seed(423)
runif(3)
#> [1] 0.1089715 0.5973455 0.9726307

# Save the seed
oldseed <- .Random.seed

runif(3)
#> [1] 0.7973768 0.2278427 0.5189830

# Do some other stuff with RNG here, such as:
# runif(30)
# ...

# Restore the seed
.Random.seed <- oldseed

# Get the same random numbers as before, after saving the seed
runif(3)
#> [1] 0.7973768 0.2278427 0.5189830

If no random number generator has been used in your R session, the variable .Random.seed will not exist. If you cannot be certain that an RNG has been used before attempting to save the seed, you should check for it before saving and restoring:

oldseed <- NULL
if (exists(".Random.seed"))
    oldseed <- .Random.seed

# Do some other stuff with RNG here, such as:
# runif(30)
# ...

if (!is.null(oldseed))
    .Random.seed <- oldseed

Saving and restoring the state of the RNG in functions

If you attempt to restore the state of the random number generator within a function by using .Random.seed <- x, it will not work, because this operation changes a local variable named .Random.seed, instead of the variable in the global environment.

Here are two examples. What these functions are supposed to do is generate some random numbers, while leaving the state of the RNG unchanged.

# This is the bad version
bad_rand_restore <- function() {
    if (exists(".Random.seed"))
        oldseed <- .Random.seed
    else
        oldseed <- NULL

    print(runif(3))

    if (!is.null(oldseed))
        .Random.seed <- oldseed
    else
        rm(".Random.seed")
}

# This is the good version
rand_restore <- function() {
    if (exists(".Random.seed", .GlobalEnv))
        oldseed <- .GlobalEnv$.Random.seed
    else
        oldseed <- NULL

    print(runif(3))

    if (!is.null(oldseed))
        .GlobalEnv$.Random.seed <- oldseed
    else
        rm(".Random.seed", envir = .GlobalEnv)
}

# The bad version doesn't properly reset the RNG state, so random numbers keep
# changing.
set.seed(423)
bad_rand_restore()
#> [1] 0.1089715 0.5973455 0.9726307
bad_rand_restore()
#> [1] 0.7973768 0.2278427 0.5189830
bad_rand_restore()
#> [1] 0.6929255 0.8104453 0.1019465

# The good version correctly resets the RNG state each time, so random numbers
# stay the same.
set.seed(423)
rand_restore()
#> [1] 0.1089715 0.5973455 0.9726307
rand_restore()
#> [1] 0.1089715 0.5973455 0.9726307
rand_restore()
#> [1] 0.1089715 0.5973455 0.9726307

Notes

.Random.seed should not be modified by the user.

Rounding numbers

Problem

You want to round numbers.

Solution

There are many ways of rounding: to the nearest integer, up, down, or toward zero.

x <- seq(-2.5, 2.5, by=.5)

# Round to nearest, with .5 values rounded to even number.
round(x)
#> [1] -2 -2 -2 -1  0  0  0  1  2  2  2

# Round up
ceiling(x)
#> [1] -2 -2 -1 -1  0  0  1  1  2  2  3

# Round down
floor(x)
#> [1] -3 -2 -2 -1 -1  0  0  1  1  2  2

# Round toward zero
trunc(x)
#> [1] -2 -2 -1 -1  0  0  0  1  1  2  2

It is also possible to round to other decimal places:

x <- c(.001, .07, 1.2, 44.02, 738, 9927)

# Round to one decimal place
round(x, digits=1)
#> [1]    0.0    0.1    1.2   44.0  738.0 9927.0

# Round to tens place
round(x, digits=-1)
#> [1]    0    0    0   40  740 9930

# Round to nearest 5
round(x/5)*5
#> [1]    0    0    0   45  740 9925

# Round to nearest .02
round(x/.02)*.02
#> [1]    0.00    0.08    1.20   44.02  738.00 9927.00

Comparing floating point numbers

Problem

Comparing floating point numbers does not always work as you expect. For example:

0.3 == 3*.1
#> [1] FALSE

(0.1 + 0.1 + 0.1) - 0.3
#> [1] 5.551115e-17

x <- seq(0, 1, by=.1)
x
#> [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

10*x - round(10*x)
#> [1] 0.000000e+00 0.000000e+00 0.000000e+00 4.440892e-16 0.000000e+00 0.000000e+00
#> [7] 8.881784e-16 8.881784e-16 0.000000e+00 0.000000e+00 0.000000e+00

Solution

There is no universal solution, because this issue is inherent to the storage format for non-integer (floating point) numbers, in R and computers in general. See this page for more information and workarounds: http://www.mathworks.com/support/tech-notes/1100/1108.html. This is written for Matlab, but much of it will be relevant for R as well.
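One common workaround (not given in the original text) is to compare with a tolerance rather than with ==; base R's all.equal() does this:

0.3 == 3*.1
#> [1] FALSE

# all.equal() compares with a small tolerance; wrap it in isTRUE() because it
# returns a description string (not FALSE) when the values differ
isTRUE(all.equal(0.3, 3*.1))
#> [1] TRUE

# Or compare against an explicit tolerance
abs(0.3 - 3*.1) < 1e-9
#> [1] TRUE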

Creating strings from variables

Problem

You want to create a string from variables.

Solution

The two common ways of creating strings from variables are the paste function and the sprintf function. paste is more useful for vectors, and sprintf is more useful for precise control of the output.

Using paste()

a <- "apple"
b <- "banana"

# Put a and b together, with a space in between:
paste(a, b)
#> [1] "apple banana"

# With no space, use sep="", or use paste0():
paste(a, b, sep="")
#> [1] "applebanana"
paste0(a, b)
#> [1] "applebanana"

# With a comma and space:
paste(a, b, sep=", ")
#> [1] "apple, banana"

# With a vector
d <- c("fig", "grapefruit", "honeydew")

# If the input is a vector, use collapse to put the elements together:
paste(d, collapse=", ")
#> [1] "fig, grapefruit, honeydew"

# If the input is a scalar and a vector, it puts the scalar with each
# element of the vector, and returns a vector:
paste(a, d)
#> [1] "apple fig"        "apple grapefruit" "apple honeydew"

# Use sep and collapse:
paste(a, d, sep="-", collapse=", ")
#> [1] "apple-fig, apple-grapefruit, apple-honeydew"

Using sprintf()

Another way is to use the sprintf function. This is derived from the function of the same name in the C programming language.

To substitute in a string or string variable, use %s:

a <- "string"
sprintf("This is where a %s goes.", a)
#> [1] "This is where a string goes."

For integers, use %d or a variant:

x <- 8
sprintf("Regular:%d", x)
#> [1] "Regular:8"

# Can print to take some number of characters, leading with spaces.
sprintf("Leading spaces:%4d", x)
#> [1] "Leading spaces:   8"

# Can also lead with zeros instead.
sprintf("Leading zeros:%04d", x)
#> [1] "Leading zeros:0008"

For floating-point numbers, use %f for standard notation, and %e or %E for exponential notation. You can also use %g or %G for a "smart" formatter that automatically switches between the two formats, depending on where the significant digits are. The following examples are taken from the R help page for sprintf:

sprintf("%f", pi)         # "3.141593"
sprintf("%.3f", pi)       # "3.142"
sprintf("%1.0f", pi)      # "3"
sprintf("%5.1f", pi)      # "  3.1"
sprintf("%05.1f", pi)     # "003.1"
sprintf("%+f", pi)        # "+3.141593"
sprintf("% f", pi)        # " 3.141593"
sprintf("%-10f", pi)      # "3.141593  "   (left justified)
sprintf("%e", pi)         # "3.141593e+00"
sprintf("%E", pi)         # "3.141593E+00"
sprintf("%g", pi)         # "3.14159"
sprintf("%g",   1e6 * pi) # "3.14159e+06"  (exponential)
sprintf("%.9g", 1e6 * pi) # "3141592.65"   ("fixed")
sprintf("%G", 1e-6 * pi)  # "3.14159E-06"

In the %m.nf format specification: the m represents the field width, which is the minimum number of characters in the output string, and can be padded with leading spaces, or zeros if there is a zero in front of m. The n represents the precision, which is the number of digits after the decimal.

Other miscellaneous things:

x <- "string"
sprintf("Substitute in multiple strings: %s %s", x, "string2")
#> [1] "Substitute in multiple strings: string string2"

# To print a percent sign, use "%%"
sprintf("A single percent sign here %%")
#> [1] "A single percent sign here %"

Creating a formula from a string

Problem

You want to create a formula from a string.

Solution

It can be useful to create a formula from a string. This often occurs in functions where the formula arguments are passed in as strings.

In the most basic case, use as.formula():

# This returns a string:
"y ~ x1 + x2"
#> [1] "y ~ x1 + x2"

# This returns a formula:
as.formula("y ~ x1 + x2")
#> y ~ x1 + x2
#> <environment: 0x3361710>

Here is an example of how it might be used:

# These are the variable names:
measurevar <- "y"
groupvars  <- c("x1","x2","x3")

# This creates the appropriate string:
paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ ")
#> [1] "y ~ x1 + x2 + x3"

# This returns the formula:
as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
#> y ~ x1 + x2 + x3
#> <environment: 0x3361710>

Extracting components from a formula

Problem

You want to extract parts of a formula for further use.

Solution

You can index into the formula object as though it were a list, using the [[ operator.

f <- y ~ x1 + x2

# Take a look at f
str(f)
#> Class 'formula' language y ~ x1 + x2
#>   ..- attr(*, ".Environment")=<environment: 0x1e46710>

# Get each part
f[[1]]
#> `~`
f[[2]]
#> y
f[[3]]
#> x1 + x2

# Or view the whole thing as a list
as.list(f)
#> [[1]]
#> `~`
#>
#> [[2]]
#> y
#>
#> [[3]]
#> x1 + x2
#>
#> <environment: 0x1e46710>

For formulas that have nothing on the left side, there are only two elements:

f2 <- ~ x1 + x2
as.list(f2)
#> [[1]]
#> `~`
#>
#> [[2]]
#> x1 + x2
#>
#> <environment: 0x1e46710>

Each of the elements of the formula is a symbol or a language object (which itself consists of multiple symbols):

str(f[[1]])
#>  symbol ~
str(f[[2]])
#>  symbol y
str(f[[3]])
#>  language x1 + x2

# Look at parts of the language object
str(f[[3]][[1]])
#>  symbol +
str(f[[3]][[2]])
#>  symbol x1
str(f[[3]][[3]])
#>  symbol x2

You can use as.character() or deparse() to convert any of these to strings. deparse() can give a more natural-looking result:

as.character(f[[1]])
#> [1] "~"
as.character(f[[2]])
#> [1] "y"

# The language object gets coerced into a string that represents the parse tree:
as.character(f[[3]])
#> [1] "+"  "x1" "x2"

# You can use deparse() to get a more natural looking string
deparse(f[[3]])
#> [1] "x1 + x2"
deparse(f)
#> [1] "y ~ x1 + x2"

The formula object also captures the environment in which it was called, as we saw earlier when we ran str(f). To extract it, use environment():

environment(f)
#> <environment: 0x1e46710>

Loading data from a file

Problem

You want to load data from a file.

Solution

Delimited text files

The simplest way to import data is to save it as a text file with delimiters such as tabs or commas (CSV).

data <- read.csv("datafile.csv")

# Load a CSV file that doesn't have headers
data <- read.csv("datafile-noheader.csv", header=FALSE)

The function read.table() is a more general function which allows you to set the delimiter, whether or not there are headers, whether strings are set off with quotes, and more. See ?read.table for more information on the details.

data <- read.table("datafile-noheader.csv",
                   header=FALSE,
                   sep=","        # use "\t" for tab-delimited files
)

Loading a file with a file chooser

On some platforms, using file.choose() will open a file chooser dialog window. On others, it will simply prompt the user to type in a filename.

data <- read.csv(file.choose())

Treating strings as factors or characters

By default, strings in the data are converted to factors. If you load the data below with read.csv, then all the text columns will be treated as factors, even though it might make more sense to treat some of them as strings. To do this, use stringsAsFactors=FALSE:

data <- read.csv("datafile.csv", stringsAsFactors=FALSE)

# You might have to convert some columns to factors
data$Sex <- factor(data$Sex)

Another alternative is to load them as factors and convert some columns to characters:

data <- read.csv("datafile.csv")

data$First <- as.character(data$First)
data$Last  <- as.character(data$Last)

# Another method: convert columns named "First" and "Last"
stringcols <- c("First","Last")
data[stringcols] <- lapply(data[stringcols], as.character)

Loading a file from the Internet

Data can also be loaded from a URL. These (very long) URLs will load the files linked to below.

data <- read.csv("http://www.cookbook-r.com/Data_input_and_output/Loading_data_from_a_file/datafile.csv")

# Read in a CSV file without headers
data <- read.csv("http://www.cookbook-r.com/Data_input_and_output/Loading_data_from_a_file/datafile-noheader.csv", header=FALSE)

# Manually assign the header names
names(data) <- c("First","Last","Sex","Number")

The data files used above:

datafile.csv:

"First","Last","Sex","Number"
"Currer","Bell","F",2
"Dr.","Seuss","M",49
"","Student",NA,21

datafile-noheader.csv:

"Currer","Bell","F",2
"Dr.","Seuss","M",49
"","Student",NA,21

Fixed-width text files

Suppose your data has fixed-width columns, like this:

First     Last  Sex Number
Currer    Bell    F      2
Dr.       Seuss   M     49
""        Student NA     21

One way to read it in is to simply use read.table() with strip.white=TRUE, which will remove extra spaces.

read.table("clipboard", header=TRUE, strip.white=TRUE)

However, your data file may have columns containing spaces, or columns with no spaces separating them, like this, where the scores column represents six different measurements, each from 0 to 3.

subject  sex  scores
   N 1     M  113311
  NE 2     F  112231
   S 3     F  111221
   W 4     M  011002

In this case, you may need to use the read.fwf() function. If you read the column names from the file, it requires that they be separated with a delimiter like a single tab, space, or comma. If they are separated with multiple spaces, as in this example, you will have to assign the column names directly.

# Assign the column names manually
read.fwf("myfile.txt",
         c(7,5,-2,1,1,1,1,1,1),  # Width of the columns. -2 means drop those columns
         skip=1,                 # Skip the first line (contains header here)
         col.names=c("subject","sex","s1","s2","s3","s4","s5","s6"),
         strip.white=TRUE)       # Strip out leading and trailing whitespace when reading each
#>   subject sex s1 s2 s3 s4 s5 s6
#> 1     N 1   M  1  1  3  3  1  1
#> 2    NE 2   F  1  1  2  2  3  1
#> 3     S 3   F  1  1  1  2  2  1
#> 4     W 4   M  0  1  1  0  0  2

# If the first row looked like this:
# subject,sex,scores
# Then we could use header=TRUE:
read.fwf("myfile.txt", c(7,5,-2,1,1,1,1,1,1), header=TRUE, strip.white=TRUE)
#> Error in read.table(file = FILE, header = header, sep = sep, row.names = row.names, :
#>   more columns than column names

Excel files

The read.xls function in the gdata package can read in Excel files.

library(gdata)
data <- read.xls("data.xls")

See http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets.

SPSS data files

The read.spss function in the foreign package can read in SPSS files.

library(foreign)
data <- read.spss("data.sav", to.data.frame=TRUE)

Loading and storing data with the keyboard and clipboard

Problem

You want to enter data using input from the keyboard (not a file).

Solution

Data input

Suppose this is your data:

size weight cost
small 5 6
medium 8 10
large 11 9

Loading data from keyboard input or clipboard

One way to enter data from the keyboard is to read from standard input (stdin()).

# Cutting and pasting using read.table and stdin()
data <- read.table(stdin(), header=TRUE)
# You will be prompted for input; copy and paste text here

# Or:
# data <- read.csv(stdin())

You can also load directly from the clipboard:

# First copy the data to the clipboard
data <- read.table('clipboard', header=TRUE)

# Or:
# data <- read.csv('clipboard')

Loading data in a script

The previous method can't be used to load data in a script file because the input must be typed (or pasted) after running the command.

# Using read.table()
data <- read.table(header=TRUE, text='
   size weight cost
  small      5    6
 medium      8   10
  large     11    9
 ')

For different data formats (e.g., comma-delimited, no headers, etc.), options to read.table() can be set. See ../Loading data from a file for more information.
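For instance, here is a sketch of loading comma-delimited, headerless data in a script (the column names assigned afterward are just for illustration):

# Comma-separated values, no header row
data <- read.table(header=FALSE, sep=",", text='
small,5,6
medium,8,10
large,11,9
')
names(data) <- c("size", "weight", "cost")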

Data output

By default, R prints row names. If you want to print the table in a format that can be copied and pasted, it may be useful to suppress them.

print(data, row.names=FALSE)
#>    size weight cost
#>   small      5    6
#>  medium      8   10
#>   large     11    9

Writing data for copying and pasting, or to the clipboard

It is possible to write delimited data to terminal (stdout()), so that it can be copied and pasted elsewhere. Or it can be written directly to the clipboard.

write.csv(data, stdout(), row.names=FALSE)
# "size","weight","cost"
# "small",5,6
# "medium",8,10
# "large",11,9

# Write to the clipboard (does not work on Mac or Unix)
write.csv(data, 'clipboard', row.names=FALSE)

Output for loading in R

If the data has already been loaded into R, the data structure can be saved using dput(). The output from dput() is a command which will recreate the data structure. The advantage of this method is that it will keep any modifications to data types; for example, if one column consists of numbers and you have converted it to a factor, this method will preserve that type, whereas simply loading the text table (as shown above) will treat it as numeric.

# Suppose you have already loaded data

dput(data)
#> structure(list(size = structure(c(3L, 2L, 1L), .Label = c("large",
#> "medium", "small"), class = "factor"), weight = c(5L, 8L, 11L
#> ), cost = c(6L, 10L, 9L)), .Names = c("size", "weight", "cost"
#> ), class = "data.frame", row.names = c(NA, -3L))

# Later, we can use the output from dput to recreate the data structure
newdata <- structure(list(size = structure(c(3L, 2L, 1L), .Label = c("large",
            "medium", "small"), class = "factor"), weight = c(5L, 8L, 11L
            ), cost = c(6L, 10L, 9L)), .Names = c("size", "weight", "cost"
            ), class = "data.frame", row.names = c(NA, -3L))

Running a script

Problem

You want to run R code from a text file.

Solution

Use the source() function.

# First, go to the proper directory
setwd('/home/username/desktop/rcode')

source('analyze.r')

Note that if you want your script to produce text output, you must use the print() or cat() function.

x <- 1:10

# In a script, this will do nothing
x

# Use the print function:
print(x)
#> [1]  1  2  3  4  5  6  7  8  9 10

# Simpler output: no row/column numbers, no text wrapping
cat(x)
#> 1 2 3 4 5 6 7 8 9 10

Another alternative is to run source() with the print.eval=TRUE option.
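For instance (a one-line sketch, using the analyze.r script from above):

# Print the value of each top-level expression, as at the interactive prompt
source('analyze.r', print.eval=TRUE)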

Writing data to a file

Problem

You want to write data to a file.

Solution

Writing to a delimited text file

The easiest way to do this is to use write.csv(). By default, write.csv() includes row names, but these are usually unnecessary and may cause confusion.

# A sample data frame
data <- read.table(header=TRUE, text='
 subject sex size
 1   M    7
 2   F   NA
 3   F    9
 4   M   11
 ')

# Write to a file, suppress row names
write.csv(data, "data.csv", row.names=FALSE)

# Same, except that instead of "NA", output blank cells
write.csv(data, "data.csv", row.names=FALSE, na="")

# Use tabs, suppress row names and column names
write.table(data, "data.csv", sep="\t", row.names=FALSE, col.names=FALSE)

Saving in R data format

write.csv() and write.table() are best for interoperability with other data analysis programs. They will not, however, preserve special attributes of the data structures, such as whether a column is a character type or factor, or the order of levels in factors. In order to do that, it should be written out in a special format for R. Below are three primary ways of doing this:

The first method is to output R source code which, when run, will re-create the object. This should work for most data objects, but it may not be able to faithfully re-create some more complicated data objects.

# Save in a text format that can be easily loaded in R
dump("data", "data.Rdmpd")

# Can save multiple objects:
dump(c("data", "data1"), "data.Rdmpd")

# To load the data again:
source("data.Rdmpd")
# When loaded, the original data names will automatically be used.

The next method is to write out individual data objects in RDS format. This format can be binary or ASCII. Binary is more compact, while ASCII will be more efficient with version control systems like Git.

# Save a single object in binary RDS format
saveRDS(data, "data.rds")

# Or, using ASCII format
saveRDS(data, "data.rds", ascii=TRUE)

# To load the data again:
data <- readRDS("data.rds")

It's also possible to save multiple objects into a single file, using the RData format.

# Saving multiple objects in binary RData format
save(data, file="data.RData")

# Or, using ASCII format
save(data, file="data.RData", ascii=TRUE)

# Can save multiple objects
save(data, data1, file="data.RData")

# To load the data again:
load("data.RData")

An important difference between saveRDS() and save() is that, with the former, when you readRDS() the data, you specify the name of the object, and with the latter, when you load() the data, the original object names are automatically used. Automatically using the original object names can sometimes simplify a workflow, but it can also be a drawback if the data object is meant to be distributed to others for use in a different environment.
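A small sketch of that difference, using the files saved above:

# readRDS() returns the object; you choose the name to assign it to
mydata <- readRDS("data.rds")

# load() re-creates the saved objects under their original names;
# here it creates (or overwrites) an object named `data`
load("data.RData")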

Writing text and output from analyses to a file

Problem

You want to write output to a file.

Solution

The sink() function will redirect output to a file instead of to the R terminal. Note that if you use sink() in a script and it crashes before output is returned to the terminal, then you will not see any response to your commands. Call sink() without any arguments to return output to the terminal.

# Start writing to an output file
sink('analysis-output.txt')

set.seed(12345)
x <- rnorm(10,10,1)
y <- rnorm(10,11,1)
# Do some stuff here
cat(sprintf("x has %d elements:\n", length(x)))
print(x)
cat("y =", y, "\n")

cat("=============================\n")
cat("T-test between x and y\n")
cat("=============================\n")
t.test(x,y)

# Stop writing to the file
sink()

# Append to the file
sink('analysis-output.txt', append=TRUE)
cat("Some more stuff here...\n")
sink()

The contents of the output file:

x has 10 elements:
 [1] 10.585529 10.709466  9.890697  9.546503 10.605887  8.182044 10.630099  9.723816
 [9]  9.715840  9.080678
y = 10.88375 12.81731 11.37063 11.52022 10.24947 11.8169 10.11364 10.66842 12.12071 11.29872
=============================
T-test between x and y
=============================

	Welch Two Sample t-test

data:  x and y
t = -3.8326, df = 17.979, p-value = 0.001222
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.196802 -0.641042
sample estimates:
mean of x mean of y
 9.867056 11.285978

Some more stuff here...

Sorting

Problem

You want to sort a vector, matrix, or data frame.

Solution

Vectors

# Make up a randomly ordered vector
v <- sample(101:110)

# Sort the vector
sort(v)
#>  [1] 101 102 103 104 105 106 107 108 109 110

# Reverse sort
sort(v, decreasing=TRUE)
#>  [1] 110 109 108 107 106 105 104 103 102 101

Data frames

To sort a data frame on one or more columns, you can use the arrange function from the plyr package, or use R's built-in functions. The arrange function is much easier to use, but does require the external package to be installed.

# Make a data frame
df <- data.frame(id     = 1:4,
                 weight = c(20,27,24,22),
                 size   = c("small", "large", "medium", "large"))
df

library(plyr)

# Sort by weight column. These have the same result.
arrange(df, weight)       # Use arrange from plyr package
df[ order(df$weight), ]   # Use built-in R functions
#>   id weight   size
#> 1  1     20  small
#> 2  4     22  large
#> 3  3     24 medium
#> 4  2     27  large

# Sort by size, then by weight
arrange(df, size, weight)         # Use arrange from plyr package
df[ order(df$size, df$weight), ]  # Use built-in R functions
#>   id weight   size
#> 4  4     22  large
#> 2  2     27  large
#> 3  3     24 medium
#> 1  1     20  small

# Sort by all columns in the data frame, from left to right
df[ do.call(order, as.list(df)), ]
# In this particular example, the order will be unchanged

Note that the size column is a factor and is sorted by the order of the factor levels. In this case, the levels were automatically assigned alphabetically (when creating the data frame), so large is first and small is last.

Reverse sort

The overall order of the sort can be reversed with the argument decreasing=TRUE.

To reverse the direction of a particular column, the method depends on the data type:

# Reverse sort by weight column. These all have the same effect:
arrange(df, -weight)                      # Use arrange from plyr package
df[ order(df$weight, decreasing=TRUE), ]  # Use built-in R functions
df[ order(-df$weight), ]                  # Use built-in R functions
#>   id weight   size
#> 2  2     27  large
#> 3  3     24 medium
#> 4  4     22  large
#> 1  1     20  small

# Sort by size (increasing), then by weight (decreasing)
arrange(df, size, -weight)         # Use arrange from plyr package
df[ order(df$size, -df$weight), ]  # Use built-in R functions
#>   id weight   size
#> 2  2     27  large
#> 4  4     22  large
#> 3  3     24 medium
#> 1  1     20  small

# Sort by size (decreasing), then by weight (increasing)
# The call to xtfrm() is needed for factors
arrange(df, -xtfrm(size), weight)         # Use arrange from plyr package
df[ order(-xtfrm(df$size), df$weight), ]  # Use built-in R functions
#>   id weight   size
#> 1  1     20  small
#> 3  3     24 medium
#> 4  4     22  large
#> 2  2     27  large

Randomizing order

Problem

You want to randomize the order of a data structure.

Solution

# Create a vector
v <- 11:20

# Randomize the order of the vector
v <- sample(v)

# Create a data frame
data <- data.frame(label=letters[1:5], number=11:15)
data
#>   label number
#> 1     a     11
#> 2     b     12
#> 3     c     13
#> 4     d     14
#> 5     e     15

# Randomize the order of the data frame
data <- data[sample(1:nrow(data)), ]
data
#>   label number
#> 5     e     15
#> 2     b     12
#> 4     d     14
#> 3     c     13
#> 1     a     11

Notes

To make a randomization repeatable, you should set the seed for the random number generator.
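For example (a minimal sketch; the seed value is arbitrary):

# The same seed always produces the same shuffle
set.seed(42)
x1 <- sample(11:20)

set.seed(42)
x2 <- sample(11:20)

identical(x1, x2)
#> [1] TRUE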

Converting between vector types

Problem

You want to convert between numeric vectors, character vectors, and factors.

Solution

Suppose you start with this numeric vector n:

n <- 10:14
n
#> [1] 10 11 12 13 14

To convert the numeric vector to the other two types (we'll also save these results in c and f):

# Numeric to Character
c <- as.character(n)

# Numeric to Factor
f <- factor(n)
# 10 11 12 13 14

To convert the character vector to the other two:

# Character to Numeric
as.numeric(c)
#> [1] 10 11 12 13 14

# Character to Factor
factor(c)
#> [1] 10 11 12 13 14
#> Levels: 10 11 12 13 14

Converting a factor to a character vector is straightforward:

# Factor to Character
as.character(f)
#> [1] "10" "11" "12" "13" "14"

However, converting a factor to a numeric vector is a little trickier. If you just convert it with as.numeric, it will give you the numeric coding of the factor, which probably isn't what you want.

as.numeric(f)
#> [1] 1 2 3 4 5

# Another way to get the numeric coding, if that's what you want:
unclass(f)
#> [1] 1 2 3 4 5
#> attr(,"levels")
#> [1] "10" "11" "12" "13" "14"

The way to get the text values converted to numbers is to first convert it to a character, then a numeric vector.

# Factor to Numeric
as.numeric(as.character(f))
#> [1] 10 11 12 13 14

Finding and removing duplicate records

Problem

You want to find and/or remove duplicate entries from a vector or data frame.

Solution

With vectors:

# Generate a vector
set.seed(158)
x <- round(rnorm(20, 10, 5))
x
#>  [1] 14 11  8  4 12  5 10 10  3  3 11  6  0 16  8 10  8  5  6  6

# For each element: is this one a duplicate (first instance of a particular value
# not counted)
duplicated(x)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE
#> [15]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

# The values of the duplicated entries
# Note that '6' appears in the original vector three times, and so it has two
# entries here.
x[duplicated(x)]
#> [1] 10  3 11  8 10  8  5  6  6

# Duplicated entries, without repeats
unique(x[duplicated(x)])
#> [1] 10  3 11  8  5  6

# The original vector with all duplicates removed. These do the same:
unique(x)
#>  [1] 14 11  8  4 12  5 10  3  6  0 16
x[!duplicated(x)]
#>  [1] 14 11  8  4 12  5 10  3  6  0 16

With data frames:

# A sample data frame:
df <- read.table(header=TRUE, text='
 label value
 A     4
 B     3
 C     6
 B     3
 B     1
 A     2
 A     4
 A     4
')

# Is each row a repeat?
duplicated(df)
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE

# Show the repeat entries
df[duplicated(df),]
#>   label value
#> 4     B     3
#> 7     A     4
#> 8     A     4

# Show unique repeat entries (row names may differ, but values are the same)
unique(df[duplicated(df),])
#>   label value
#> 4     B     3
#> 7     A     4

# Original data with repeats removed. These do the same:
unique(df)
#>   label value
#> 1     A     4
#> 2     B     3
#> 3     C     6
#> 5     B     1
#> 6     A     2
df[!duplicated(df),]
#>   label value
#> 1     A     4
#> 2     B     3
#> 3     C     6
#> 5     B     1
#> 6     A     2

Comparing vectors or factors with NA

Problem

You want to compare two vectors or factors but want comparisons with NA’s to be reported as TRUE or FALSE (instead of NA).

Solution

Suppose you have this data frame with two columns which consist of boolean vectors:

df <- data.frame(a = c(TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,NA,NA,NA),
                 b = c(TRUE,FALSE,NA,TRUE,FALSE,NA,TRUE,FALSE,NA))
df
#>       a     b
#> 1  TRUE  TRUE
#> 2  TRUE FALSE
#> 3  TRUE    NA
#> 4 FALSE  TRUE
#> 5 FALSE FALSE
#> 6 FALSE    NA
#> 7    NA  TRUE
#> 8    NA FALSE
#> 9    NA    NA

Normally, when you compare two vectors or factors containing NA values, the vector of results will have NAs where either of the original items was NA. Depending on your purposes, this may or may not be desirable.

df$a == df$b
#> [1]  TRUE FALSE    NA FALSE  TRUE    NA    NA    NA    NA

# The same comparison, but presented as another column in the data frame:
data.frame(df, isSame = (df$a==df$b))
#>       a     b isSame
#> 1  TRUE  TRUE   TRUE
#> 2  TRUE FALSE  FALSE
#> 3  TRUE    NA     NA
#> 4 FALSE  TRUE  FALSE
#> 5 FALSE FALSE   TRUE
#> 6 FALSE    NA     NA
#> 7    NA  TRUE     NA
#> 8    NA FALSE     NA
#> 9    NA    NA     NA

A function for comparing with NA’s

This comparison function will essentially treat NA's as just another value. If an item in both vectors is NA, then it reports TRUE for that item; if the item is NA in just one vector, it reports FALSE; all other comparisons (between non-NA items) behave the same as usual.

# This function returns TRUE wherever elements are the same, including NA's,
# and FALSE everywhere else.
compareNA <- function(v1, v2) {
    same <- (v1 == v2) | (is.na(v1) & is.na(v2))
    same[is.na(same)] <- FALSE
    return(same)
}

Examples of the function in use

Comparing boolean vectors:

compareNA(df$a, df$b)
#> [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

# Same comparison, presented as another column
data.frame(df, isSame = compareNA(df$a,df$b))
#>       a     b isSame
#> 1  TRUE  TRUE   TRUE
#> 2  TRUE FALSE  FALSE
#> 3  TRUE    NA  FALSE
#> 4 FALSE  TRUE  FALSE
#> 5 FALSE FALSE   TRUE
#> 6 FALSE    NA  FALSE
#> 7    NA  TRUE  FALSE
#> 8    NA FALSE  FALSE
#> 9    NA    NA   TRUE

It also works with factors, even if the levels of the factors are in different orders:

# Create sample data frame with factors.
df1 <- data.frame(a = factor(c('x','x','x','y','y','y', NA, NA, NA)),
                  b = factor(c('x','y', NA,'x','y', NA,'x','y', NA)))

# Do the comparison
data.frame(df1, isSame = compareNA(df1$a, df1$b))
#>      a    b isSame
#> 1    x    x   TRUE
#> 2    x    y  FALSE
#> 3    x <NA>  FALSE
#> 4    y    x  FALSE
#> 5    y    y   TRUE
#> 6    y <NA>  FALSE
#> 7 <NA>    x  FALSE
#> 8 <NA>    y  FALSE
#> 9 <NA> <NA>   TRUE

# It still works if the factor levels are arranged in a different order
df1$b <- factor(df1$b, levels=c('y','x'))
data.frame(df1, isSame = compareNA(df1$a, df1$b))
#>      a    b isSame
#> 1    x    x   TRUE
#> 2    x    y  FALSE
#> 3    x <NA>  FALSE
#> 4    y    x  FALSE
#> 5    y    y   TRUE
#> 6    y <NA>  FALSE
#> 7 <NA>    x  FALSE
#> 8 <NA>    y  FALSE
#> 9 <NA> <NA>   TRUE

Recoding data

Problem

You want to recode data or calculate new data columns from existing ones.

Solution

The examples below will use this data:

data <- read.table(header=T, text='
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
       3   F     9.5  13.1  13.8
       4   M    11.5  13.4  12.9
')

Recoding a categorical variable

The easiest way is to use revalue() or mapvalues() from the plyr package. This will code M as 1 and F as 2, and put it in a new column. Note that these functions preserve the type: if the input is a factor, the output will be a factor; and if the input is a character vector, the output will be a character vector.

library(plyr)

# The following two lines are equivalent:
data$scode <- revalue(data$sex, c("M"="1", "F"="2"))
data$scode <- mapvalues(data$sex, from = c("M", "F"), to = c("1", "2"))
data
#>   subject sex control cond1 cond2 scode
#> 1       1   M     7.9  12.3  10.7     1
#> 2       2   F     6.3  10.6  11.1     2
#> 3       3   F     9.5  13.1  13.8     2
#> 4       4   M    11.5  13.4  12.9     1

# data$sex is a factor, so data$scode is also a factor

See ../Mapping vector values and ../Renaming levels of a factor for more information about these functions.

If you don't want to rely on plyr, you can do the following with R's built-in functions:

data$scode[data$sex=="M"] <- "1"
data$scode[data$sex=="F"] <- "2"

# Convert the column to a factor
data$scode <- factor(data$scode)
data
#>   subject sex control cond1 cond2 scode
#> 1       1   M     7.9  12.3  10.7     1
#> 2       2   F     6.3  10.6  11.1     2
#> 3       3   F     9.5  13.1  13.8     2
#> 4       4   M    11.5  13.4  12.9     1

Another way to do it is to use the match function:

oldvalues <- c("M", "F")
newvalues <- factor(c("g1","g2"))  # Make this a factor

data$scode <- newvalues[ match(data$sex, oldvalues) ]
data
#>   subject sex control cond1 cond2 scode
#> 1       1   M     7.9  12.3  10.7    g1
#> 2       2   F     6.3  10.6  11.1    g2
#> 3       3   F     9.5  13.1  13.8    g2
#> 4       4   M    11.5  13.4  12.9    g1

Recoding a continuous variable into a categorical variable

Mark those whose control measurement is <7 as "low", and those with >=7 as "high":

data$category[data$control< 7] <- "low"
data$category[data$control>=7] <- "high"

# Convert the column to a factor
data$category <- factor(data$category)
data
#>   subject sex control cond1 cond2 scode category
#> 1       1   M     7.9  12.3  10.7    g1     high
#> 2       2   F     6.3  10.6  11.1    g2      low
#> 3       3   F     9.5  13.1  13.8    g2     high
#> 4       4   M    11.5  13.4  12.9    g1     high

With the cut function, you specify boundaries and the resulting values:

data$category <- cut(data$control,
                     breaks=c(-Inf, 7, 9, Inf),
                     labels=c("low","medium","high"))
data
#>   subject sex control cond1 cond2 scode category
#> 1       1   M     7.9  12.3  10.7    g1   medium
#> 2       2   F     6.3  10.6  11.1    g2      low
#> 3       3   F     9.5  13.1  13.8    g2     high
#> 4       4   M    11.5  13.4  12.9    g1     high

By default, the ranges are open on the left, and closed on the right, as in (7,9]. To set it so that ranges are closed on the left and open on the right, like [7,9), use right=FALSE.
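For example, a minimal sketch of the right=FALSE variant, with the same breaks and labels as above:

data$category <- cut(data$control,
                     breaks=c(-Inf, 7, 9, Inf),
                     labels=c("low","medium","high"),
                     right=FALSE)   # intervals are now closed on the left: [7,9)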

Calculating a new continuous variable

Suppose you want to add a new column with the sum of the three measurements.

data$total <- data$control + data$cond1 + data$cond2
data
#>   subject sex control cond1 cond2 scode category total
#> 1       1   M     7.9  12.3  10.7    g1   medium  30.9
#> 2       2   F     6.3  10.6  11.1    g2      low  28.0
#> 3       3   F     9.5  13.1  13.8    g2     high  36.4
#> 4       4   M    11.5  13.4  12.9    g1     high  37.8

Mapping vector values

Problem

You want to change all instances of value x to value y in a vector.

Solution

# Create some example data
str <- c("alpha", "beta", "gamma")
num <- c(1, 2, 3)

The easiest way is to use revalue() or mapvalues() from the plyr package:

library(plyr)
revalue(str, c("beta"="two", "gamma"="three"))
#> [1] "alpha" "two"   "three"

mapvalues(str, from = c("beta", "gamma"), to = c("two", "three"))
#> [1] "alpha" "two"   "three"

# For numeric vectors, revalue() won't work, since it uses a named vector, and
# the names are always strings, not numbers. mapvalues() will work, though:
mapvalues(num, from = c(2, 3), to = c(5, 6))
#> [1] 1 5 6

If you don't want to rely on plyr, you can do the following with R's built-in functions. Note that these methods will modify the vectors directly; that is, you don't have to save the result back into the variable.

# Rename by name: change "beta" to "two"
str[str=="beta"] <- "two"
str
#> [1] "alpha" "two"   "gamma"

num[num==2] <- 5
num
#> [1] 1 5 3

It's also possible to use R's string search-and-replace functions to remap values in character vectors. Note that the ^ and $ surrounding alpha are there to ensure that the entire string matches. Without them, if there were a value named alphabet, it would also match, and the replacement would be onebet.

str <- c("alpha", "beta", "gamma")

sub("^alpha$", "one", str)
#> [1] "one"   "beta"  "gamma"

# In every element, replace all instances of "a" with "X"
gsub("a", "X", str)
#> [1] "XlphX" "betX"  "gXmmX"

# gsub() replaces all instances of the pattern in each element
# sub() replaces only the first instance in each element

See also: Changing the name of factor levels works much the same. See ../Renaming levels of a factor for more information.

Renaming levels of a factor

Problem

You want to rename the levels in a factor.

Solution

# A sample factor to work with.
x <- factor(c("alpha","beta","gamma","alpha","beta"))
x
#> [1] alpha beta  gamma alpha beta
#> Levels: alpha beta gamma

levels(x)
#> [1] "alpha" "beta"  "gamma"

The easiest way is to use revalue() or mapvalues() from the plyr package:

library(plyr)
revalue(x, c("beta"="two", "gamma"="three"))
#> [1] alpha two   three alpha two
#> Levels: alpha two three

mapvalues(x, from = c("beta", "gamma"), to = c("two", "three"))
#> [1] alpha two   three alpha two
#> Levels: alpha two three

If you don't want to rely on plyr, you can do the following with R's built-in functions. Note that these methods will modify x directly; that is, you don't have to save the result back into x.

# Rename by name: change "beta" to "two"
levels(x)[levels(x)=="beta"] <- "two"

# You can also rename by position, but this is a bit dangerous if your data
# can change in the future. If there is a change in the number or positions of
# factor levels, then this can result in wrong data.

# Rename by index in levels list: change third item, "gamma", to "three".
levels(x)[3] <- "three"
x
#> [1] alpha two   three alpha two
#> Levels: alpha two three

# Rename all levels
levels(x) <- c("one","two","three")
x
#> [1] one   two   three one   two
#> Levels: one two three

It's possible to rename factor levels by name (without plyr), but keep in mind that this works only if ALL levels are present in the list; if any are not in the list, they will be replaced with NA.

# Rename all levels, by name
x <- factor(c("alpha","beta","gamma","alpha","beta"))
levels(x) <- list(A="alpha", B="beta", C="gamma")
x
#> [1] A B C A B
#> Levels: A B C

It's also possible to use R's string search-and-replace functions to rename factor levels. Note that the ^ and $ surrounding alpha are there to ensure that the entire string matches. Without them, if there were a level named alphabet, it would also match, and the replacement would be onebet.

# A sample factor to work with.
x <- factor(c("alpha","beta","gamma","alpha","beta"))
x
#> [1] alpha beta  gamma alpha beta
#> Levels: alpha beta gamma

levels(x) <- sub("^alpha$", "one", levels(x))
x
#> [1] one   beta  gamma one   beta
#> Levels: one beta gamma

# Across all levels, replace all instances of "a" with "X"
levels(x) <- gsub("a", "X", levels(x))
x
#> [1] one   betX  gXmmX one   betX
#> Levels: one betX gXmmX

# gsub() replaces all instances of the pattern in each factor level.
# sub() replaces only the first instance in each factor level.

See also: Mapping values in a vector to new values works much the same. See ../Mapping vector values for more information.

Re-computing the levels of a factor

Problem

You want to re-compute the levels of a factor. This is useful when a factor contains levels that aren't actually present in the data. This can happen during data import, or when you remove some rows.

Solution

For a single factor object:

# Create a factor with an extra level (gamma)
x <- factor(c("alpha","beta","alpha"), levels=c("alpha","beta","gamma"))
x
#> [1] alpha beta  alpha
#> Levels: alpha beta gamma

# Remove the extra level
x <- factor(x)
x
#> [1] alpha beta  alpha
#> Levels: alpha beta

After importing data, you may have a data frame with a mix of factors and other kinds of vectors, and want to re-compute the levels of all the factors. You can use the droplevels() function to do this.

# Create a data frame with some factors (with extra levels)
df <- data.frame(
    x = factor(c("alpha","beta","alpha"), levels=c("alpha","beta","gamma")),
    y = c(5,8,2),
    z = factor(c("red","green","green"), levels=c("red","green","blue"))
)

# Display the factors (with extra levels)
df$x
#> [1] alpha beta  alpha
#> Levels: alpha beta gamma
df$z
#> [1] red   green green
#> Levels: red green blue

# Drop the extra levels
df <- droplevels(df)

# Show the factors again, now without extra levels
df$x
#> [1] alpha beta  alpha
#> Levels: alpha beta
df$z
#> [1] red   green green
#> Levels: red green

See also: To re-compute the levels of all the factor columns in a data frame, see ../Re-computing_the_levels_of_all_factor_columns_in_a_data_frame.

Changing the order of levels of a factor

Problem

You want to change the order in which the levels of a factor appear.

Solution

Factors in R come in two varieties: ordered and unordered, e.g., {small, medium, large} and {pen, brush, pencil}. For most analyses, it will not matter whether a factor is ordered or unordered. If the factor is ordered, then the specific order of the levels matters (small < medium < large). If the factor is unordered, then the levels will still appear in some order, but the specific order of the levels matters only for convenience (pen, pencil, brush): it will determine, for example, how output will be printed, or the arrangement of items on a graph.

One way to change the level order is to use factor() on the factor and specify the order directly. In this example, the function ordered() could be used instead of factor(). Here's the sample data:

# Create a factor with the wrong order of levels
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small  large  large  small  medium
#> Levels: large medium small

The levels can be specified explicitly:

sizes <- factor(sizes, levels = c("small", "medium", "large"))
sizes
#> [1] small  large  large  small  medium
#> Levels: small medium large

We can do the same with an ordered factor:

sizes <- ordered(c("small", "large", "large", "small", "medium"))
sizes <- ordered(sizes, levels = c("small", "medium", "large"))
sizes
#> [1] small  large  large  small  medium
#> Levels: small < medium < large

Another way to change the order is to use relevel() to make a particular level first in the list. (This will not work for ordered factors.)

# Create a factor with the wrong order of levels
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small  large  large  small  medium
#> Levels: large medium small

# Make medium first
sizes <- relevel(sizes, "medium")
sizes
#> [1] small  large  large  small  medium
#> Levels: medium large small

# Make small first
sizes <- relevel(sizes, "small")
sizes
#> [1] small  large  large  small  medium
#> Levels: small medium large

You can also specify the proper order when the factor is created.

sizes <- factor(c("small", "large", "large", "small", "medium"),
                levels = c("small", "medium", "large"))
sizes
#> [1] small  large  large  small  medium
#> Levels: small medium large

To reverse the order of levels in a factor:

# Create a factor with the wrong order of levels
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small  large  large  small  medium
#> Levels: large medium small

sizes <- factor(sizes, levels=rev(levels(sizes)))
sizes
#> [1] small  large  large  small  medium
#> Levels: small medium large

Renaming columns in a data frame

Problem

You want to rename the columns in a data frame.

Solution

Start with a sample data frame with three columns:

d <- data.frame(alpha=1:3, beta=4:6, gamma=7:9)
d
#>   alpha beta gamma
#> 1     1    4     7
#> 2     2    5     8
#> 3     3    6     9

names(d)
#> [1] "alpha" "beta"  "gamma"

The simplest way is to use rename() from the plyr package:

library(plyr)
rename(d, c("beta"="two", "gamma"="three"))
#>   alpha two three
#> 1     1   4     7
#> 2     2   5     8
#> 3     3   6     9

If you don't want to rely on plyr, you can do the following with R's built-in functions. Note that these modify d directly; that is, you don't have to save the result back into d.

# Rename column by name: change "beta" to "two"
names(d)[names(d)=="beta"] <- "two"
d
#>   alpha two gamma
#> 1     1   4     7
#> 2     2   5     8
#> 3     3   6     9

# You can also rename by position, but this is a bit dangerous if your data
# can change in the future. If there is a change in the number or positions of
# columns, then this can result in wrong data.

# Rename by index in names vector: change third item, "gamma", to "three"
names(d)[3] <- "three"
d
#>   alpha two three
#> 1     1   4     7
#> 2     2   5     8
#> 3     3   6     9

It's also possible to use R's string search-and-replace functions to rename columns. Note that the ^ and $ surrounding alpha are there to ensure that the entire string matches. Without them, if there were a column named alphabet, it would also match, and the replacement would be onebet.

names(d) <- sub("^alpha$", "one", names(d))
d
#>   one two three
#> 1   1   4     7
#> 2   2   5     8
#> 3   3   6     9

# Across all columns, replace all instances of "t" with "X"
names(d) <- gsub("t", "X", names(d))
d
#>   one Xwo Xhree
#> 1   1   4     7
#> 2   2   5     8
#> 3   3   6     9

# gsub() replaces all instances of the pattern in each column name.
# sub() replaces only the first instance in each column name.

Adding and removing columns from a data frame

Problem

You want to add or remove columns from a data frame.

Solution

There are many different ways of adding and removing columns from a data frame.

data <- read.table(header=TRUE, text='
 id weight
  1     20
  2     27
  3     24
')

# Ways to add a column
data$size      <- c("small", "large", "medium")
data[["size"]] <- c("small", "large", "medium")
data[,"size"]  <- c("small", "large", "medium")
data$size      <- 0   # Use the same value (0) for all rows

# Ways to remove the column
data$size      <- NULL
data[["size"]] <- NULL
data[,"size"]  <- NULL
data[[3]]      <- NULL
data[,3]       <- NULL
data           <- subset(data, select=-size)

Reordering the columns in a data frame

Problem

You want to reorder the columns in a data frame.

Solution

# A sample data frame
data <- read.table(header=TRUE, text='
 id weight size
  1     20  small
  2     27  large
  3     24 medium
')

# Reorder by column number
data[c(1,3,2)]
#>   id   size weight
#> 1  1  small     20
#> 2  2  large     27
#> 3  3 medium     24

# To actually change `data`, you need to save it back into `data`:
# data <- data[c(1,3,2)]

# Reorder by column name
data[c("size", "id", "weight")]
#>     size id weight
#> 1  small  1     20
#> 2  large  2     27
#> 3 medium  3     24

The above examples index into the data frame by treating it as a list (a data frame is essentially a list of vectors). You can also use matrix-style indexing, as in data[row, col], where row is left blank.

data[, c(1,3,2)]
#>   id   size weight
#> 1  1  small     20
#> 2  2  large     27
#> 3  3 medium     24

The drawback to matrix indexing is that it gives different results when you specify just one column. In these cases, the returned object is a vector, not a data frame. Because the returned data type isn't always consistent with matrix indexing, it's generally safer to use list-style indexing, or the drop=FALSE option:

# List-style indexing of one column
data[2]
#>   weight
#> 1     20
#> 2     27
#> 3     24

# Matrix-style indexing of one column - drops dimension to become a vector
data[,2]
#> [1] 20 27 24

# Matrix-style indexing with drop=FALSE - preserves dimension to remain data frame
data[, 2, drop=FALSE]
#>   weight
#> 1     20
#> 2     27
#> 3     24

Merging data frames

Problem

You want to merge two data frames on a given column from each (like a join in SQL).

Solution

# Make a data frame mapping story numbers to titles
stories <- read.table(header=TRUE, text='
 storyid  title
       1  lions
       2 tigers
       3  bears
')

# Make another data frame with the data and story numbers (no titles)
data <- read.table(header=TRUE, text='
 subject storyid rating
       1       1    6.7
       1       2    4.5
       1       3    3.7
       2       2    3.3
       2       3    4.1
       2       1    5.2
')

# Merge the two data frames
merge(stories, data, "storyid")
#>   storyid  title subject rating
#> 1       1  lions       1    6.7
#> 2       1  lions       2    5.2
#> 3       2 tigers       1    4.5
#> 4       2 tigers       2    3.3
#> 5       3  bears       1    3.7
#> 6       3  bears       2    4.1

If the two data frames have different names for the columns you want to match on, the names can be specified:

# In this case, the column is named 'id' instead of storyid
stories2 <- read.table(header=TRUE, text='
 id  title
  1  lions
  2 tigers
  3  bears
')

# Merge on stories2$id and data$storyid.
merge(x=stories2, y=data, by.x="id", by.y="storyid")
#>   id  title subject rating
#> 1  1  lions       1    6.7
#> 2  1  lions       2    5.2
#> 3  2 tigers       1    4.5
#> 4  2 tigers       2    3.3
#> 5  3  bears       1    3.7
#> 6  3  bears       2    4.1

# Note that the column name is inherited from the first data frame (x=stories2).

It is possible to merge on multiple columns:

# Make up more data
animals <- read.table(header=T, text='
  size type         name
 small  cat         lynx
   big  cat        tiger
 small  dog    chihuahua
   big  dog "great dane"
')

observations <- read.table(header=T, text='
 number size type
      1   big  cat
      2 small  dog
      3 small  dog
      4   big  dog
')

merge(observations, animals, c("size","type"))
#>    size type number       name
#> 1   big  cat      1      tiger
#> 2   big  dog      4 great dane
#> 3 small  dog      2  chihuahua
#> 4 small  dog      3  chihuahua

Notes

After merging, it may be useful to change the order of the columns.
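
For example, a minimal sketch, reusing the merged result from above and the list-style indexing shown in the recipe on reordering the columns in a data frame:

merged <- merge(stories, data, "storyid")
# Move the title column to the end, keeping the other columns in place
merged <- merged[c("storyid", "subject", "rating", "title")]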

Comparing data frames

Problem

You want to compare two or more data frames and find rows that appear in more than one data frame, or rows that appear only in one data frame.

Solution

An example

Suppose you have the following three data frames, and you want to know whether each row from each data frame appears in at least one of the other data frames. dfA <- data.frame(Subject=c(1,1,2,2), Response=c("X","X","X","X")) dfA #> Subject Response #> 1 1 X #> 2 1 X #> 3 2 X #> 4 2 X dfB <- data.frame(Subject=c(1,2,3), Response=c("X","Y","X")) dfB #> Subject Response #> 1 1 X #> 2 2 Y #> 3 3 X dfC <- data.frame(Subject=c(1,2,3), Response=c("Z","Y","Z")) dfC #> Subject Response #> 1 1 Z #> 2 2 Y #> 3 3 Z In dfA, the rows containing (1,X) also appear in dfB, but the rows containing (2,X) do not appear in any of the other data frames. Similarly, dfB contains (1,X) which appears in dfA, and (2,Y), which appears in dfC, but (3,X) does not appear in any other data frame. You might wish to mark the rows which are duplicated in another data frame, or which are unique to each data frame.

Joining the data frames

To proceed, first join the three data frames with a column identifying which source each row came from. It’s called Coder here because this could be data coded by three different people. In this case, you might wish to find where the coders agreed, or where they disagreed. dfA$Coder <- "A" dfB$Coder <- "B" dfC$Coder <- "C" df <- rbind(dfA, dfB, dfC) # Stick them together df <- df[,c("Coder", "Subject", "Response")] # Reorder the columns to look nice df #> Coder Subject Response #> 1 A 1 X #> 2 A 1 X #> 3 A 2 X #> 4 A 2 X #> 5 B 1 X #> 6 B 2 Y #> 7 B 3 X #> 8 C 1 Z #> 9 C 2 Y #> 10 C 3 Z If your data starts out in this format, then there’s obviously no need to join it together.

Finding duplicated rows

Using the function dupsBetweenGroups (defined below), we can find which rows are duplicated between different groups: # Find the rows which have duplicates in a different group. dupRows <- dupsBetweenGroups(df, "Coder") # Print it alongside the data frame cbind(df, dup=dupRows) #> Coder Subject Response dup #> 1 A 1 X TRUE #> 2 A 1 X TRUE #> 3 A 2 X FALSE #> 4 A 2 X FALSE #> 5 B 1 X TRUE #> 6 B 2 Y TRUE #> 7 B 3 X FALSE #> 8 C 1 Z FALSE #> 9 C 2 Y TRUE #> 10 C 3 Z FALSE Note that this does not mark duplicated rows within a group. With Coder=A, there are two rows with Subject=1 and Response=X, but these are not marked as duplicates.

Finding unique rows

It’s also possible to find the rows that are unique within each group: cbind(df, unique=!dupRows) #> Coder Subject Response unique #> 1 A 1 X FALSE #> 2 A 1 X FALSE #> 3 A 2 X TRUE #> 4 A 2 X TRUE #> 5 B 1 X FALSE #> 6 B 2 Y FALSE #> 7 B 3 X TRUE #> 8 C 1 Z TRUE #> 9 C 2 Y FALSE #> 10 C 3 Z TRUE

Splitting apart the data frame

If you wish to split the joined data frame back into the three original data frames: # Store the results in dfDup dfDup <- cbind(df, dup=dupRows) dfA <- subset(dfDup, Coder=="A", select=-Coder) dfA #> Subject Response dup #> 1 1 X TRUE #> 2 1 X TRUE #> 3 2 X FALSE #> 4 2 X FALSE dfB <- subset(dfDup, Coder=="B", select=-Coder) dfB #> Subject Response dup #> 5 1 X TRUE #> 6 2 Y TRUE #> 7 3 X FALSE dfC <- subset(dfDup, Coder=="C", select=-Coder) dfC #> Subject Response dup #> 8 1 Z FALSE #> 9 2 Y TRUE #> 10 3 Z FALSE

Ignoring columns

It is also possible to ignore one or more columns, by removing that column from the data frame that is passed to the function. The results can be joined to the original complete data frame if desired. # Ignore the Subject column -- only use Response dfNoSub <- subset(df, select=-Subject) dfNoSub #> Coder Response #> 1 A X #> 2 A X #> 3 A X #> 4 A X #> 5 B X #> 6 B Y #> 7 B X #> 8 C Z #> 9 C Y #> 10 C Z # Check for duplicates dupRows <- dupsBetweenGroups(dfNoSub, "Coder") # Join the result to the original data frame cbind(df, dup=dupRows) #> Coder Subject Response dup #> 1 A 1 X TRUE #> 2 A 1 X TRUE #> 3 A 2 X TRUE #> 4 A 2 X TRUE #> 5 B 1 X TRUE #> 6 B 2 Y TRUE #> 7 B 3 X TRUE #> 8 C 1 Z FALSE #> 9 C 2 Y TRUE #> 10 C 3 Z FALSE

dupsBetweenGroups function

This is the function that does all the work: dupsBetweenGroups <- function (df, idcol) { # df: the data frame # idcol: the column which identifies the group each row belongs to # Get the data columns to use for finding matches datacols <- setdiff(names(df), idcol) # Sort by idcol, then datacols. Save order so we can undo the sorting later. sortorder <- do.call(order, df) df <- df[sortorder,] # Find duplicates within each id group (first copy not marked) dupWithin <- duplicated(df) # With duplicates within each group filtered out, find duplicates between groups. # Need to scan up and down with duplicated() because first copy is not marked. dupBetween <- rep(NA, nrow(df)) dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols]) dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols], fromLast=TRUE) | dupBetween[!dupWithin] # ============= Replace NA's with previous non-NA value ============== # This is why we sorted earlier - it was necessary to do this part efficiently # Get indexes of non-NA's goodIdx <- !is.na(dupBetween) # These are the non-NA values from dupBetween only # Add a leading NA for later use when we index into this vector goodVals <- c(NA, dupBetween[goodIdx]) # Fill the indices of the output vector with the indices pulled from # these offsets of goodVals. Add 1 to avoid indexing to zero. fillIdx <- cumsum(goodIdx)+1 # The original vector, now with gaps filled dupBetween <- goodVals[fillIdx] # Undo the original sort dupBetween[sortorder] <- dupBetween # Return the vector of which entries are duplicated across groups return(dupBetween) }

Re-computing the levels of all factor columns in a data frame

Problem

You want to re-compute factor levels of all factor columns in a data frame.

Solution

Sometimes after reading in data and cleaning it, you will end up with factor columns that have levels that should no longer be there. For example, d below has one blank row. When it’s read in, the factor columns have a level "", which shouldn’t be part of the data. d <- read.csv(header = TRUE, text=' x,y,value a,one,1 ,,5 b,two,4 c,three,10 ') d #> x y value #> 1 a one 1 #> 2 5 #> 3 b two 4 #> 4 c three 10 str(d) #> 'data.frame': 4 obs. of 3 variables: #> $ x : Factor w/ 4 levels "","a","b","c": 2 1 3 4 #> $ y : Factor w/ 4 levels "","one","three",..: 2 1 4 3 #> $ value: int 1 5 4 10 Even after removing the empty row, the factors still have the blank string "" as a level: # Remove second row d <- d[-2,] d #> x y value #> 1 a one 1 #> 3 b two 4 #> 4 c three 10 str(d) #> 'data.frame': 3 obs. of 3 variables: #> $ x : Factor w/ 4 levels "","a","b","c": 2 3 4 #> $ y : Factor w/ 4 levels "","one","three",..: 2 4 3 #> $ value: int 1 4 10

With droplevels

The simplest way is to use the droplevels() function: d1 <- droplevels(d) str(d1) #> 'data.frame': 3 obs. of 3 variables: #> $ x : Factor w/ 3 levels "a","b","c": 1 2 3 #> $ y : Factor w/ 3 levels "one","three",..: 1 3 2 #> $ value: int 1 4 10

With vapply and lapply

To re-compute the levels for all factor columns, we can use vapply() with is.factor() to find out which columns are factors, and then use that information with lapply to apply the factor() function to those columns. # Find which columns are factors factor_cols <- vapply(d, is.factor, logical(1)) # Apply the factor() function to those columns, and assign them back into d d[factor_cols] <- lapply(d[factor_cols], factor) str(d) #> 'data.frame': 3 obs. of 3 variables: #> $ x : Factor w/ 3 levels "a","b","c": 1 2 3 #> $ y : Factor w/ 3 levels "one","three",..: 1 3 2 #> $ value: int 1 4 10 See also: For information about re-computing the levels of a factor, see ../Re-computing_the_levels_of_factor.

Converting data between wide and long format

Problem

You want to convert data between a wide format and a long format. Many functions in R expect data to be in a long format rather than a wide format. Programs like SPSS, however, often use wide-formatted data.

Solution

Two sets of methods are explained below: the gather() and spread() functions from the tidyr package, and the melt() and dcast() functions from the reshape2 package. There are a number of other methods, such as the base R reshape() function, which aren't covered here, since they are not as easy to use.

Sample data

These data frames hold the same data, but in wide and long formats. They will each be converted to the other format below. olddata_wide <- read.table(header=TRUE, text=' subject sex control cond1 cond2 1 M 7.9 12.3 10.7 2 F 6.3 10.6 11.1 3 F 9.5 13.1 13.8 4 M 11.5 13.4 12.9 ') # Make sure the subject column is a factor olddata_wide$subject <- factor(olddata_wide$subject) olddata_long <- read.table(header=TRUE, text=' subject sex condition measurement 1 M control 7.9 1 M cond1 12.3 1 M cond2 10.7 2 F control 6.3 2 F cond1 10.6 2 F cond2 11.1 3 F control 9.5 3 F cond1 13.1 3 F cond2 13.8 4 M control 11.5 4 M cond1 13.4 4 M cond2 12.9 ') # Make sure the subject column is a factor olddata_long$subject <- factor(olddata_long$subject)

tidyr

From wide to long

Use gather: olddata_wide #> subject sex control cond1 cond2 #> 1 1 M 7.9 12.3 10.7 #> 2 2 F 6.3 10.6 11.1 #> 3 3 F 9.5 13.1 13.8 #> 4 4 M 11.5 13.4 12.9 library(tidyr) # The arguments to gather(): # - data: Data object # - key: Name of new key column (made from names of data columns) # - value: Name of new value column # - ...: Names of source columns that contain values # - factor_key: Treat the new key column as a factor (instead of character vector) data_long <- gather(olddata_wide, condition, measurement, control:cond2, factor_key=TRUE) data_long #> subject sex condition measurement #> 1 1 M control 7.9 #> 2 2 F control 6.3 #> 3 3 F control 9.5 #> 4 4 M control 11.5 #> 5 1 M cond1 12.3 #> 6 2 F cond1 10.6 #> 7 3 F cond1 13.1 #> 8 4 M cond1 13.4 #> 9 1 M cond2 10.7 #> 10 2 F cond2 11.1 #> 11 3 F cond2 13.8 #> 12 4 M cond2 12.9 In this example, the source columns that are gathered are specified with control:cond2. This means to use all the columns, positionally, between control and cond2. Another way of doing it is to name the columns individually, as in: gather(olddata_wide, condition, measurement, control, cond1, cond2) If you need to use gather() programmatically, you may need to use variables containing column names. To do this, you should use the gather_() function instead, which takes strings instead of bare (unquoted) column names. keycol <- "condition" valuecol <- "measurement" gathercols <- c("control", "cond1", "cond2") gather_(olddata_wide, keycol, valuecol, gathercols) Optional: Rename the factor levels of the variable column, and sort. # Rename factor names from "cond1" and "cond2" to "first" and "second" levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first" levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second" # Sort by subject first, then by condition data_long <- data_long[order(data_long$subject, data_long$condition), ] data_long #> subject sex condition measurement #> 1 1 M control 7.9 #> 5 1 M first 12.3 #> 9 1 M second 10.7 #> 2 2 F control 6.3 #> 6 2 F first 10.6 #> 10 2 F second 11.1 #> 3 3 F control 9.5 #> 7 3 F first 13.1 #> 11 3 F second 13.8 #> 4 4 M control 11.5 #> 8 4 M first 13.4 #> 12 4 M second 12.9

From long to wide

Use spread: olddata_long #> subject sex condition measurement #> 1 1 M control 7.9 #> 2 1 M cond1 12.3 #> 3 1 M cond2 10.7 #> 4 2 F control 6.3 #> 5 2 F cond1 10.6 #> 6 2 F cond2 11.1 #> 7 3 F control 9.5 #> 8 3 F cond1 13.1 #> 9 3 F cond2 13.8 #> 10 4 M control 11.5 #> 11 4 M cond1 13.4 #> 12 4 M cond2 12.9 library(tidyr) # The arguments to spread(): # - data: Data object # - key: Name of column containing the new column names # - value: Name of column containing values data_wide <- spread(olddata_long, condition, measurement) data_wide #> subject sex cond1 cond2 control #> 1 1 M 12.3 10.7 7.9 #> 2 2 F 10.6 11.1 6.3 #> 3 3 F 13.1 13.8 9.5 #> 4 4 M 13.4 12.9 11.5 Optional: A few things to make the data look nicer. # Rename cond1 to first, and cond2 to second names(data_wide)[names(data_wide)=="cond1"] <- "first" names(data_wide)[names(data_wide)=="cond2"] <- "second" # Reorder the columns data_wide <- data_wide[, c(1,2,5,3,4)] data_wide #> subject sex control first second #> 1 1 M 7.9 12.3 10.7 #> 2 2 F 6.3 10.6 11.1 #> 3 3 F 9.5 13.1 13.8 #> 4 4 M 11.5 13.4 12.9 The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.
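
For example, here is a minimal sketch of setting the level order before reshaping. It assumes condition is a factor (read.table creates factors from string columns by default); the copy dl is used so that olddata_long is left untouched for the examples below:

# Put the levels in the desired order, then reshape
dl <- olddata_long
dl$condition <- factor(dl$condition, levels=c("control", "cond1", "cond2"))
spread(dl, condition, measurement)
# The new value columns should now appear in the order control, cond1, cond2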

reshape2

From wide to long

Use melt: olddata_wide #> subject sex control cond1 cond2 #> 1 1 M 7.9 12.3 10.7 #> 2 2 F 6.3 10.6 11.1 #> 3 3 F 9.5 13.1 13.8 #> 4 4 M 11.5 13.4 12.9 library(reshape2) # Specify id.vars: the variables to keep but not split apart on melt(olddata_wide, id.vars=c("subject", "sex")) #> subject sex variable value #> 1 1 M control 7.9 #> 2 2 F control 6.3 #> 3 3 F control 9.5 #> 4 4 M control 11.5 #> 5 1 M cond1 12.3 #> 6 2 F cond1 10.6 #> 7 3 F cond1 13.1 #> 8 4 M cond1 13.4 #> 9 1 M cond2 10.7 #> 10 2 F cond2 11.1 #> 11 3 F cond2 13.8 #> 12 4 M cond2 12.9 There are options for melt that can make the output a little easier to work with: data_long <- melt(olddata_wide, # ID variables - all the variables to keep but not split apart on id.vars=c("subject", "sex"), # The source columns measure.vars=c("control", "cond1", "cond2" ), # Name of the destination column that will identify the original # column that the measurement came from variable.name="condition", value.name="measurement" ) data_long #> subject sex condition measurement #> 1 1 M control 7.9 #> 2 2 F control 6.3 #> 3 3 F control 9.5 #> 4 4 M control 11.5 #> 5 1 M cond1 12.3 #> 6 2 F cond1 10.6 #> 7 3 F cond1 13.1 #> 8 4 M cond1 13.4 #> 9 1 M cond2 10.7 #> 10 2 F cond2 11.1 #> 11 3 F cond2 13.8 #> 12 4 M cond2 12.9 If you leave out the measure.vars, melt will automatically use all the other variables as the id.vars. The reverse is true if you leave out id.vars. If you don't specify variable.name, it will name that column "variable", and if you leave out value.name, it will name that column "value". Optional: Rename the factor levels of the variable column. # Rename factor names from "cond1" and "cond2" to "first" and "second" levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first" levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second" # Sort by subject first, then by condition data_long <- data_long[ order(data_long$subject, data_long$condition), ] data_long #> subject sex condition measurement #> 1 1 M control 7.9 #> 5 1 M first 12.3 #> 9 1 M second 10.7 #> 2 2 F control 6.3 #> 6 2 F first 10.6 #> 10 2 F second 11.1 #> 3 3 F control 9.5 #> 7 3 F first 13.1 #> 11 3 F second 13.8 #> 4 4 M control 11.5 #> 8 4 M first 13.4 #> 12 4 M second 12.9

From long to wide

The following code uses dcast to reshape the data. This function is meant for data frames; if you are working with arrays or matrices, use acast instead. olddata_long #> subject sex condition measurement #> 1 1 M control 7.9 #> 2 1 M cond1 12.3 #> 3 1 M cond2 10.7 #> 4 2 F control 6.3 #> 5 2 F cond1 10.6 #> 6 2 F cond2 11.1 #> 7 3 F control 9.5 #> 8 3 F cond1 13.1 #> 9 3 F cond2 13.8 #> 10 4 M control 11.5 #> 11 4 M cond1 13.4 #> 12 4 M cond2 12.9 # From the source: # "subject" and "sex" are columns we want to keep the same # "condition" is the column that contains the names of the new column to put things in # "measurement" holds the measurements library(reshape2) data_wide <- dcast(olddata_long, subject + sex ~ condition, value.var="measurement") data_wide #> subject sex cond1 cond2 control #> 1 1 M 12.3 10.7 7.9 #> 2 2 F 10.6 11.1 6.3 #> 3 3 F 13.1 13.8 9.5 #> 4 4 M 13.4 12.9 11.5 Optional: A few things to make the data look nicer. # Rename cond1 to first, and cond2 to second names(data_wide)[names(data_wide)=="cond1"] <- "first" names(data_wide)[names(data_wide)=="cond2"] <- "second" # Reorder the columns data_wide <- data_wide[, c(1,2,5,3,4)] data_wide #> subject sex control first second #> 1 1 M 7.9 12.3 10.7 #> 2 2 F 6.3 10.6 11.1 #> 3 3 F 9.5 13.1 13.8 #> 4 4 M 11.5 13.4 12.9 The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.

Summarizing data

Problem

You want to summarize your data (with mean, standard deviation, etc.), broken down by group.

Solution

There are three ways described here to group data based on some specified variables, and apply a summary function (like mean, standard deviation, etc.) to each group. Suppose you have this data and want to find the N, mean of change, standard deviation, and standard error of the mean for each group, where the groups are specified by each combination of sex and condition: F-placebo, F-aspirin, M-placebo, and M-aspirin. data <- read.table(header=TRUE, text=' subject sex condition before after change 1 F placebo 10.1 6.9 -3.2 2 F placebo 6.3 4.2 -2.1 3 M aspirin 12.4 6.3 -6.1 4 F placebo 8.1 6.1 -2.0 5 M aspirin 15.2 9.9 -5.3 6 F aspirin 10.9 7.0 -3.9 7 F aspirin 11.6 8.5 -3.1 8 M aspirin 9.5 3.0 -6.5 9 F placebo 11.5 9.0 -2.5 10 M placebo 11.9 11.0 -0.9 11 F aspirin 11.4 8.0 -3.4 12 M aspirin 10.0 4.4 -5.6 13 M aspirin 12.5 5.4 -7.1 14 M placebo 10.6 10.6 0.0 15 M aspirin 9.1 4.3 -4.8 16 F placebo 12.1 10.2 -1.9 17 F placebo 11.0 8.8 -2.2 18 F placebo 11.9 10.2 -1.7 19 M aspirin 9.1 3.6 -5.5 20 M placebo 13.5 12.4 -1.1 21 M aspirin 12.0 7.5 -4.5 22 F placebo 9.1 7.6 -1.5 23 M placebo 9.9 8.0 -1.9 24 F placebo 7.6 5.2 -2.4 25 F placebo 11.8 9.7 -2.1 26 F placebo 11.8 10.7 -1.1 27 F aspirin 10.1 7.9 -2.2 28 M aspirin 11.6 8.3 -3.3 29 F aspirin 11.3 6.8 -4.5 30 F placebo 10.3 8.3 -2.0 ')

Using ddply

library(plyr) # Run the functions length, mean, and sd on the value of "change" for each group, # broken down by sex + condition cdata <- ddply(data, c("sex", "condition"), summarise, N = length(change), mean = mean(change), sd = sd(change), se = sd / sqrt(N) ) cdata #> sex condition N mean sd se #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 #> 4 M placebo 4 -0.975000 0.7804913 0.3902456

Handling missing data

If there are NA’s in the data, you need to pass the flag na.rm=TRUE to each of the functions. length() doesn’t take na.rm as an option, so one way to work around it is to use sum(!is.na(...)) to count how many non-NA’s there are. # Put some NA's in the data dataNA <- data dataNA$change[11:14] <- NA cdata <- ddply(dataNA, c("sex", "condition"), summarise, N = sum(!is.na(change)), mean = mean(change, na.rm=TRUE), sd = sd(change, na.rm=TRUE), se = sd / sqrt(N) ) cdata #> sex condition N mean sd se #> 1 F aspirin 4 -3.425000 0.9979145 0.4989572 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 #> 3 M aspirin 7 -5.142857 1.0674848 0.4034713 #> 4 M placebo 3 -1.300000 0.5291503 0.3055050

A function for mean, count, standard deviation, standard error of the mean, and confidence interval

Instead of manually specifying all the values you want and then calculating the standard error, as shown above, this function will handle all of those details: it computes the count, mean, standard deviation, standard error of the mean, and a confidence interval for each group. To use it, put this function in your code and call it as demonstrated below. ## Summarizes data. ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%). ## data: a data frame. ## measurevar: the name of a column that contains the variable to be summarized ## groupvars: a vector containing names of columns that contain grouping variables ## na.rm: a boolean that indicates whether to ignore NA's ## conf.interval: the percent range of the confidence interval (default is 95%) summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95, .drop=TRUE) { library(plyr) # New version of length which can handle NA's: if na.rm==T, don't count them length2 <- function (x, na.rm=FALSE) { if (na.rm) sum(!is.na(x)) else length(x) } # This does the summary. For each group's data frame, return a vector with # N, mean, and sd datac <- ddply(data, groupvars, .drop=.drop, .fun = function(xx, col) { c(N = length2(xx[[col]], na.rm=na.rm), mean = mean (xx[[col]], na.rm=na.rm), sd = sd (xx[[col]], na.rm=na.rm) ) }, measurevar ) # Rename the "mean" column datac <- rename(datac, c("mean" = measurevar)) datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean # Confidence interval multiplier for standard error # Calculate t-statistic for confidence interval: # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1 ciMult <- qt(conf.interval/2 + .5, datac$N-1) datac$ci <- datac$se * ciMult return(datac) } Example usage (with 95% confidence interval). Instead of doing all the steps manually, as done previously, the summarySE function does it all in one step: summarySE(data, measurevar="change", groupvars=c("sex", "condition")) #> sex condition N change sd se ci #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201 #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767 #> 4 M placebo 4 -0.975000 0.7804913 0.3902456 1.2419358 # With a data set with NA's, use na.rm=TRUE summarySE(dataNA, measurevar="change", groupvars=c("sex", "condition"), na.rm=TRUE) #> sex condition N change sd se ci #> 1 F aspirin 4 -3.425000 0.9979145 0.4989572 1.5879046 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201 #> 3 M aspirin 7 -5.142857 1.0674848 0.4034713 0.9872588 #> 4 M placebo 3 -1.300000 0.5291503 0.3055050 1.3144821

Filling empty combinations with zeros

Sometimes there will be empty combinations of factors in the summary data frame – that is, combinations of factors that are possible, but don't actually occur in the original data frame. It is often useful to automatically fill in those combinations in the summary data frame with NA's. To do this, set .drop=FALSE in the call to ddply or summarySE. Example usage: # First remove all Male+Placebo entries from the data dataSub <- subset(data, !(sex=="M" & condition=="placebo")) # If we summarize the data, there will be a missing row for Male+Placebo, # since there were no cases with this combination. summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition")) #> sex condition N change sd se ci #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201 #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767 # Set .drop=FALSE to NOT drop those combinations summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"), .drop=FALSE) #> Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced #> sex condition N change sd se ci #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201 #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767 #> 4 M placebo 0 NaN NA NA NA

Using summaryBy

To collapse the data using the summaryBy() function: library(doBy) # Run the functions length, mean, and sd on the value of "change" for each group, # broken down by sex + condition cdata <- summaryBy(change ~ sex + condition, data=data, FUN=c(length,mean,sd)) cdata #> sex condition change.length change.mean change.sd #> 1 F aspirin 5 -3.420000 0.8642916 #> 2 F placebo 12 -2.058333 0.5247655 #> 3 M aspirin 9 -5.411111 1.1307569 #> 4 M placebo 4 -0.975000 0.7804913 # Rename column change.length to just N names(cdata)[names(cdata)=="change.length"] <- "N" # Calculate standard error of the mean cdata$change.se <- cdata$change.sd / sqrt(cdata$N) cdata #> sex condition N change.mean change.sd change.se #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 #> 4 M placebo 4 -0.975000 0.7804913 0.3902456 Note that if you have any within-subjects variables, these standard error values may not be useful for comparing groups.

Handling missing data

If there are NA’s in the data, you need to pass the flag na.rm=TRUE to the functions. Normally you could pass it to summaryBy() and it would get passed to each of the functions called, but length() does not recognize it and so it won’t work. One way around it is to define a new length function that handles the NA’s. # New version of length which can handle NA's: if na.rm==T, don't count them length2 <- function (x, na.rm=FALSE) { if (na.rm) sum(!is.na(x)) else length(x) } # Put some NA's in the data dataNA <- data dataNA$change[11:14] <- NA cdataNA <- summaryBy(change ~ sex + condition, data=dataNA, FUN=c(length2, mean, sd), na.rm=TRUE) cdataNA #> sex condition change.length2 change.mean change.sd #> 1 F aspirin 4 -3.425000 0.9979145 #> 2 F placebo 12 -2.058333 0.5247655 #> 3 M aspirin 7 -5.142857 1.0674848 #> 4 M placebo 3 -1.300000 0.5291503 # Now, do the same as before

A function for mean, count, standard deviation, standard error of the mean, and confidence interval

Instead of manually specifying all the values you want and then calculating the standard error, as shown above, this function will handle all of those details: it computes the count, mean, standard deviation, standard error of the mean, and a confidence interval for each group. To use it, put this function in your code and call it as demonstrated below. ## Summarizes data. ## Gives count, mean, standard deviation, standard error of the mean, and confidence ## interval (default 95%). ## data: a data frame. ## measurevar: the name of a column that contains the variable to be summarized ## groupvars: a vector containing names of columns that contain grouping variables ## na.rm: a boolean that indicates whether to ignore NA's ## conf.interval: the percent range of the confidence interval (default is 95%) summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) { library(doBy) # New version of length which can handle NA's: if na.rm==T, don't count them length2 <- function (x, na.rm=FALSE) { if (na.rm) sum(!is.na(x)) else length(x) } # Collapse the data formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ ")) datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm) # Rename columns names(datac)[ names(datac) == paste(measurevar, ".mean", sep="") ] <- measurevar names(datac)[ names(datac) == paste(measurevar, ".sd", sep="") ] <- "sd" names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N" datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean # Confidence interval multiplier for standard error # Calculate t-statistic for confidence interval: # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1 ciMult <- qt(conf.interval/2 + .5, datac$N-1) datac$ci <- datac$se * ciMult return(datac) } Example usage (with 95% confidence interval). Instead of doing all the steps manually, as done previously, the summarySE function does it all in one step: summarySE(data, measurevar="change", groupvars=c("sex","condition")) #> sex condition N change sd se ci #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201 #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767 #> 4 M placebo 4 -0.975000 0.7804913 0.3902456 1.2419358 # With a data set with NA's, use na.rm=TRUE summarySE(dataNA, measurevar="change", groupvars=c("sex","condition"), na.rm=TRUE) #> sex condition N change sd se ci #> 1 F aspirin 4 -3.425000 0.9979145 0.4989572 1.5879046 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201 #> 3 M aspirin 7 -5.142857 1.0674848 0.4034713 0.9872588 #> 4 M placebo 3 -1.300000 0.5291503 0.3055050 1.3144821

Filling empty combinations with zeros

Sometimes there will be empty combinations of factors in the summary data frame – that is, combinations of factors that are possible, but don't actually occur in the original data frame. It is often useful to automatically fill in those combinations in the summary data frame with zeros. This function will fill in those missing combinations with zeros: fillMissingCombs <- function(df, factors, measures) { # Make a list of the combinations of factor levels levelList <- list() for (f in factors) { levelList[[f]] <- levels(df[,f]) } fullFactors <- expand.grid(levelList) dfFull <- merge(fullFactors, df, all.x=TRUE) # Wherever there is an NA in the measure vars, replace with 0 for (m in measures) { dfFull[is.na(dfFull[,m]), m] <- 0 } return(dfFull) } Example usage: # First remove all Male+Placebo entries from the data dataSub <- subset(data, !(sex=="M" & condition=="placebo")) # If we summarize the data, there will be a missing row for Male+Placebo, # since there were no cases with this combination. cdataSub <- summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition")) cdataSub #> sex condition N change sd se ci #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201 #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767 # This will fill in the missing combinations with zeros fillMissingCombs(cdataSub, factors=c("sex","condition"), measures=c("N","change","sd","se","ci")) #> sex condition N change sd se ci #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598 #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201 #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767 #> 4 M placebo 0 0.000000 0.0000000 0.0000000 0.0000000

Using aggregate

The aggregate function is more difficult to use, but it is included in the base R installation and does not require the installation of another package. # Get a count of number of subjects in each category (sex*condition) cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length) cdata #> sex condition subject #> 1 F aspirin 5 #> 2 M aspirin 9 #> 3 F placebo 12 #> 4 M placebo 4 # Rename "subject" column to "N" names(cdata)[names(cdata)=="subject"] <- "N" cdata #> sex condition N #> 1 F aspirin 5 #> 2 M aspirin 9 #> 3 F placebo 12 #> 4 M placebo 4 # Sort by sex first cdata <- cdata[order(cdata$sex),] cdata #> sex condition N #> 1 F aspirin 5 #> 3 F placebo 12 #> 2 M aspirin 9 #> 4 M placebo 4 # We also keep the "before" and "after" columns: # Get the average effect size by sex and condition cdata.means <- aggregate(data[c("before","after","change")], by = data[c("sex","condition")], FUN=mean) cdata.means #> sex condition before after change #> 1 F aspirin 11.06000 7.640000 -3.420000 #> 2 M aspirin 11.26667 5.855556 -5.411111 #> 3 F placebo 10.13333 8.075000 -2.058333 #> 4 M placebo 11.47500 10.500000 -0.975000 # Merge the data frames cdata <- merge(cdata, cdata.means) cdata #> sex condition N before after change #> 1 F aspirin 5 11.06000 7.640000 -3.420000 #> 2 F placebo 12 10.13333 8.075000 -2.058333 #> 3 M aspirin 9 11.26667 5.855556 -5.411111 #> 4 M placebo 4 11.47500 10.500000 -0.975000 # Get the sample (n-1) standard deviation for "change" cdata.sd <- aggregate(data["change"], by = data[c("sex","condition")], FUN=sd) # Rename the column to change.sd names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd" cdata.sd #> sex condition change.sd #> 1 F aspirin 0.8642916 #> 2 M aspirin 1.1307569 #> 3 F placebo 0.5247655 #> 4 M placebo 0.7804913 # Merge cdata <- merge(cdata, cdata.sd) cdata #> sex condition N before after change change.sd #> 1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916 #> 2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655 #> 3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569 #> 4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913 # Calculate standard error of the mean cdata$change.se <- cdata$change.sd / sqrt(cdata$N) cdata #> sex condition N before after change change.sd change.se #> 1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916 0.3865230 #> 2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655 0.1514867 #> 3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569 0.3769190 #> 4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913 0.3902456 If you have NA's in your data and wish to skip them, use na.rm=TRUE: cdata.means <- aggregate(data[c("before","after","change")], by = data[c("sex","condition")], FUN=mean, na.rm=TRUE) cdata.means #> sex condition before after change #> 1 F aspirin 11.06000 7.640000 -3.420000 #> 2 M aspirin 11.26667 5.855556 -5.411111 #> 3 F placebo 10.13333 8.075000 -2.058333 #> 4 M placebo 11.47500 10.500000 -0.975000
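
aggregate() also has a formula interface, which can be more compact. A minimal sketch, equivalent to the cdata.means calculation above (note that the formula interface silently drops rows with NA's by default, since its na.action defaults to na.omit):

# Mean of before, after, and change for each sex x condition combination
aggregate(cbind(before, after, change) ~ sex + condition, data=data, FUN=mean)
# Produces the same summary as cdata.means above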

Converting between data frames and contingency tables

Problem

You want to convert between a data frame of cases, a data frame of counts of each type of case, and a contingency table.

Solution

These three data structures represent the same information, but in different formats: # Each row represents one case cases <- data.frame( Sex=c("M", "M", "F", "F", "F"), Color=c("brown", "blue", "brown", "brown", "brown") ) cases #> Sex Color #> 1 M brown #> 2 M blue #> 3 F brown #> 4 F brown #> 5 F brown # A contingency table ctable <- table(cases) ctable #> Color #> Sex blue brown #> F 0 3 #> M 1 1 # A table with counts of each combination counts <- data.frame( Sex=c("F", "M", "F", "M"), Color=c("blue", "blue", "brown", "brown"), Freq=c(0, 1, 3, 1) ) counts #> Sex Color Freq #> 1 F blue 0 #> 2 M blue 1 #> 3 F brown 3 #> 4 M brown 1

Cases to contingency table

To convert from cases to contingency table (this is already shown above): # Cases to Table ctable <- table(cases) ctable #> Color #> Sex blue brown #> F 0 3 #> M 1 1 # If you call table using two vectors, it will not add names (Sex and Color) to # the dimensions. table(cases$Sex, cases$Color) #> #> blue brown #> F 0 3 #> M 1 1 # The dimension names can be specified manually with `dnn`, or by using a subset # of the data frame that contains only the desired columns. table(cases$Sex, cases$Color, dnn=c("Sex","Color")) #> Color #> Sex blue brown #> F 0 3 #> M 1 1 table(cases[,c("Sex","Color")]) #> Color #> Sex blue brown #> F 0 3 #> M 1 1

Cases to counts

It can also be represented as a data frame of counts of each combination. Note that it’s converted here and stored in countdf: # Cases to Counts countdf <- as.data.frame(table(cases)) countdf #> Sex Color Freq #> 1 F blue 0 #> 2 M blue 1 #> 3 F brown 3 #> 4 M brown 1

Contingency table to cases

countsToCases(as.data.frame(ctable)) #> Sex Color #> 2 M blue #> 3 F brown #> 3.1 F brown #> 3.2 F brown #> 4 M brown Note that the countsToCases function is defined below.

Contingency table to counts

as.data.frame(ctable) #> Sex Color Freq #> 1 F blue 0 #> 2 M blue 1 #> 3 F brown 3 #> 4 M brown 1

Counts to cases

countsToCases(countdf) #> Sex Color #> 2 M blue #> 3 F brown #> 3.1 F brown #> 3.2 F brown #> 4 M brown Note that the countsToCases function is defined below.

Counts to contingency table

xtabs(Freq ~ Sex+Color, data=countdf) #> Color #> Sex blue brown #> F 0 3 #> M 1 1

countsToCases() function

This function is used in the examples above: # Convert from data frame of counts to data frame of cases. # `countcol` is the name of the column containing the counts countsToCases <- function(x, countcol = "Freq") { # Get the row indices to pull from x idx <- rep.int(seq_len(nrow(x)), x[[countcol]]) # Drop count column x[[countcol]] <- NULL # Get the rows from x x[idx, ] }

Calculating a moving average

Problem

You want to calculate a moving average.

Solution

Suppose your data is a noisy sine wave with some missing values: set.seed(993) x <- 1:300 y <- sin(x/20) + rnorm(300,sd=.1) y[251:255] <- NA The filter() function can be used to calculate a moving average. # Plot the unsmoothed data (gray) plot(x, y, type="l", col=grey(.5)) # Draw gridlines grid() # Smoothed with lag: # average of current sample and 19 previous samples (red) f20 <- rep(1/20, 20) f20 #> [1] 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 #> [18] 0.05 0.05 0.05 y_lag <- filter(y, f20, sides=1) lines(x, y_lag, col="red") # Smoothed symmetrically: # average of current sample, 10 future samples, and 10 past samples (blue) f21 <- rep(1/21,21) f21 #> [1] 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 #> [8] 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 #> [15] 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 y_sym <- filter(y, f21, sides=2) lines(x, y_sym, col="blue") filter() will leave holes wherever it encounters missing values, as shown in the graph above. A different way to handle missing data is to simply ignore it, and not include it in the average. The function defined here will do that. # x: the vector # n: the number of samples # centered: if FALSE, then average current sample and previous (n-1) samples # if TRUE, then average symmetrically in past and future. (If n is even, use one more sample from future.) movingAverage <- function(x, n=1, centered=FALSE) { if (centered) { before <- floor ((n-1)/2) after <- ceiling((n-1)/2) } else { before <- n-1 after <- 0 } # Track the sum and count of number of non-NA items s <- rep(0, length(x)) count <- rep(0, length(x)) # Add the centered data new <- x # Add to the count wherever there is a non-NA value count <- count + !is.na(new) # Now replace NA's with 0's and add to total new[is.na(new)] <- 0 s <- s + new # Add the data from before i <- 1 while (i <= before) { # This is the vector with offset values to add new <- c(rep(NA, i), x[1:(length(x)-i)]) count <- count + !is.na(new) new[is.na(new)] <- 0 s <- s + new i <- i+1 } # Add the data from after i <- 1 while (i <= after) { # This is the vector with offset values to add new <- c(x[(i+1):length(x)], rep(NA, i)) count <- count + !is.na(new) new[is.na(new)] <- 0 s <- s + new i <- i+1 } # return sum divided by count s/count } # Make same plots from before, with thicker lines plot(x, y, type="l", col=grey(.5)) grid() y_lag <- filter(y, rep(1/20, 20), sides=1) lines(x, y_lag, col="red", lwd=4) # Lagged average in red y_sym <- filter(y, rep(1/21,21), sides=2) lines(x, y_sym, col="blue", lwd=4) # Symmetrical average in blue # Calculate lagged moving average with new method and overplot with green y_lag_na.rm <- movingAverage(y, 20) lines(x, y_lag_na.rm, col="green", lwd=2) # Calculate symmetrical moving average with new method and overplot with green y_sym_na.rm <- movingAverage(y, 21, TRUE) lines(x, y_sym_na.rm, col="green", lwd=2)

Averaging a sequence in blocks

Problem

You want to chop a sequence of data into blocks of a given length, and find the average of each block. This is one way of smoothing data.

Solution

Suppose you have a numeric vector and want to find the average of the first four numbers, the second four, and so on. # Generate a vector with 22 random numbers from 0-99 set.seed(123) x <- floor(runif(22)*100) x #> [1] 28 78 40 88 94 4 52 89 55 45 95 45 67 57 10 89 24 4 32 95 88 69 # Round the length of the vector up to the nearest multiple of 4 newlength <- ceiling(length(x)/4)*4 newlength #> [1] 24 # Pad x with NA's up to the new length x[newlength] <- NA x #> [1] 28 78 40 88 94 4 52 89 55 45 95 45 67 57 10 89 24 4 32 95 88 69 NA NA # Convert to a matrix with 4 rows x <- matrix(x, nrow=4) x #> [,1] [,2] [,3] [,4] [,5] [,6] #> [1,] 28 94 55 67 24 88 #> [2,] 78 4 45 57 4 69 #> [3,] 40 52 95 10 32 NA #> [4,] 88 89 45 89 95 NA # Take the means of the columns, and ignore any NA's colMeans(x, na.rm=TRUE) #> [1] 58.50 59.75 60.00 55.75 38.75 78.50

Finding sequences of identical values

Problem

You want to find sequences of identical values in a vector or factor.

Solution

It is possible to search for sequences of identical values by simply iterating over a vector, but this is very slow in R. A much faster way to find sequences is to use the rle() function. # Example data v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C") v #> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C" vr <- rle(v) vr #> Run Length Encoding #> lengths: int [1:7] 3 4 1 1 2 1 3 #> values : chr [1:7] "A" "B" NA NA "C" "B" "C" The RLE coded data can be converted back to a vector with inverse.rle(). inverse.rle(vr) #> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C" One issue that might be problematic is that each NA is treated as a run of length 1, even if the NA’s are next to each other. It is possible to work around this by replacing the NA’s with some special designated value. For numeric vectors, Inf or some other number can be used; for character vectors, any string may be used. Of course, the special value must not appear otherwise in the vector. w <- v w[is.na(w)] <- "ZZZ" w #> [1] "A" "A" "A" "B" "B" "B" "B" "ZZZ" "ZZZ" "C" "C" "B" "C" "C" #> [15] "C" wr <- rle(w) wr #> Run Length Encoding #> lengths: int [1:6] 3 4 2 2 1 3 #> values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C" # Replace the ZZZ's with NA in the RLE-coded data wr$values[ wr$values=="ZZZ" ] <- NA wr #> Run Length Encoding #> lengths: int [1:6] 3 4 2 2 1 3 #> values : chr [1:6] "A" "B" NA "C" "B" "C" w2 <- inverse.rle(wr) w2 #> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"

Working with factors

Even though factors are basically just integer vectors with some information about levels attached, the rle() function doesn't work with factors. The solution is to manually convert the factor to an integer vector or a character vector. Using an integer vector is fast and memory-efficient, which may matter for large data sets, but it is difficult to interpret. Using a character vector is slower and requires more memory, but can be much easier to interpret. # Suppose this is the factor we have to work with f <- factor(v) f #> [1] A A A B B B B <NA> <NA> C C B C C C #> Levels: A B C # Store the levels of the factor. # This isn't strictly necessary, but it is useful for preserving order of levels f_levels <- levels(f) f_levels #> [1] "A" "B" "C" fc <- as.character(f) fc[ is.na(fc) ] <- "ZZZ" fc #> [1] "A" "A" "A" "B" "B" "B" "B" "ZZZ" "ZZZ" "C" "C" "B" "C" "C" #> [15] "C" fr <- rle(fc) fr #> Run Length Encoding #> lengths: int [1:6] 3 4 2 2 1 3 #> values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C" # Replace the ZZZ's with NA in the RLE-coded data fr$values[ fr$values=="ZZZ" ] <- NA fr #> Run Length Encoding #> lengths: int [1:6] 3 4 2 2 1 3 #> values : chr [1:6] "A" "B" NA "C" "B" "C" # Invert RLE coding and convert back to a factor f2 <- inverse.rle(fr) f2 <- factor(f2, levels=f_levels) f2 #> [1] A A A B B B B <NA> <NA> C C B C C C #> Levels: A B C

Filling in NAs with last non-NA value

Problem

You want to replace NA’s in a vector or factor with the last non-NA value.

Solution

This code shows how to fill gaps in a vector. If you need to do this repeatedly, see the function below. The function also can fill in leading NA’s with the first good value and handle factors properly. # Sample data x <- c(NA,NA, "A","A", "B","B","B", NA,NA, "C", NA,NA,NA, "A","A","B", NA,NA) goodIdx <- !is.na(x) goodIdx #> [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE #> [15] TRUE TRUE FALSE FALSE # These are the non-NA values from x only # Add a leading NA for later use when we index into this vector goodVals <- c(NA, x[goodIdx]) goodVals #> [1] NA "A" "A" "B" "B" "B" "C" "A" "A" "B" # Fill the indices of the output vector with the indices pulled from # these offsets of goodVals. Add 1 to avoid indexing to zero. fillIdx <- cumsum(goodIdx)+1 fillIdx #> [1] 1 1 2 3 4 5 6 6 6 7 7 7 7 8 9 10 10 10 # The original vector with gaps filled goodVals[fillIdx] #> [1] NA NA "A" "A" "B" "B" "B" "B" "B" "C" "C" "C" "C" "A" "A" "B" "B" "B"

A function for filling gaps

This function does the same as the code above. It can also fill leading NA’s with the first good value, and handle factors properly. fillNAgaps <- function(x, firstBack=FALSE) { ## NA's in a vector or factor are replaced with last non-NA values ## If firstBack is TRUE, it will fill in leading NA's with the first ## non-NA value. If FALSE, it will not change leading NA's. # If it's a factor, store the level labels and convert to integer lvls <- NULL if (is.factor(x)) { lvls <- levels(x) x <- as.integer(x) } goodIdx <- !is.na(x) # These are the non-NA values from x only # Add a leading NA or take the first good value, depending on firstBack if (firstBack) goodVals <- c(x[goodIdx][1], x[goodIdx]) else goodVals <- c(NA, x[goodIdx]) # Fill the indices of the output vector with the indices pulled from # these offsets of goodVals. Add 1 to avoid indexing to zero. fillIdx <- cumsum(goodIdx)+1 x <- goodVals[fillIdx] # If it was originally a factor, convert it back if (!is.null(lvls)) { x <- factor(x, levels=seq_along(lvls), labels=lvls) } x } # Sample data x <- c(NA,NA, "A","A", "B","B","B", NA,NA, "C", NA,NA,NA, "A","A","B", NA,NA) x #> [1] NA NA "A" "A" "B" "B" "B" NA NA "C" NA NA NA "A" "A" "B" NA NA fillNAgaps(x) #> [1] NA NA "A" "A" "B" "B" "B" "B" "B" "C" "C" "C" "C" "A" "A" "B" "B" "B" # Fill the leading NA's with the first good value fillNAgaps(x, firstBack=TRUE) #> [1] "A" "A" "A" "A" "B" "B" "B" "B" "B" "C" "C" "C" "C" "A" "A" "B" "B" "B" # It also works on factors y <- factor(x) y #> [1] <NA> <NA> A A B B B <NA> <NA> C <NA> <NA> <NA> A A B <NA> #> [18] <NA> #> Levels: A B C fillNAgaps(y) #> [1] <NA> <NA> A A B B B B B C C C C A A B B #> [18] B #> Levels: A B C

Notes

This is adapted from na.locf() in the zoo library.
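
If the zoo package is installed, its na.locf() ("last observation carried forward") function can be used directly; a minimal sketch, where na.rm=FALSE keeps leading NA's in place rather than dropping them:

library(zoo)
# Same sample data as above
x <- c(NA,NA, "A","A", "B","B","B", NA,NA, "C", NA,NA,NA, "A","A","B", NA,NA)
na.locf(x, na.rm=FALSE)
#> [1] NA NA "A" "A" "B" "B" "B" "B" "B" "C" "C" "C" "C" "A" "A" "B" "B" "B"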

Regression and correlation

Problem

You want to perform linear regressions and/or correlations.

Solution

Some sample data to work with: # Make some data # X increases (noisily) # Z increases slowly # Y is constructed so it is inversely related to xvar and positively related to xvar*zvar set.seed(955) xvar <- 1:20 + rnorm(20,sd=3) zvar <- 1:20/4 + rnorm(20,sd=2) yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20,sd=4) # Make a data frame with the variables dat <- data.frame(x=xvar, y=yvar, z=zvar) # Show first few rows head(dat) #> x y z #> 1 -4.252354 4.5857688 1.89877152 #> 2 1.702318 -4.9027824 -0.82937359 #> 3 4.323054 -4.3076433 -1.31283495 #> 4 1.780628 0.2050367 -0.28479448 #> 5 11.537348 -29.7670502 -1.27303976 #> 6 6.672130 -10.1458220 -0.09459239

Correlation

# Correlation coefficient cor(dat$x, dat$y) #> [1] -0.7695378
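
If you also want a significance test and confidence interval for a correlation, base R's cor.test() can be used. For two variables, its t statistic and p-value match those of the slope in the simple linear regression shown below:

cor.test(dat$x, dat$y)   # tests the null hypothesis that the true correlation is 0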

Correlation matrices (for multiple variables)

It is also possible to run correlations between many pairs of variables, using a matrix or data frame. # A correlation matrix of the variables cor(dat) #> x y z #> x 1.0000000 -0.769537849 0.491698938 #> y -0.7695378 1.000000000 0.004172295 #> z 0.4916989 0.004172295 1.000000000 # Print with only two decimal places round(cor(dat), 2) #> x y z #> x 1.00 -0.77 0.49 #> y -0.77 1.00 0.00 #> z 0.49 0.00 1.00

Linear regression

Linear regressions, where dat$x is the predictor, and dat$y is the outcome. This can be done using two columns from a data frame, or with numeric vectors directly. # These two commands will have the same outcome: fit <- lm(y ~ x, data=dat) # Using the columns x and y from the data frame fit <- lm(dat$y ~ dat$x) # Using the vectors dat$x and dat$y fit #> #> Call: #> lm(formula = dat$y ~ dat$x) #> #> Coefficients: #> (Intercept) dat$x #> -0.2278 -1.1829 # This means that the predicted y = -0.2278 - 1.1829*x # Get more detailed information: summary(fit) #> #> Call: #> lm(formula = dat$y ~ dat$x) #> #> Residuals: #> Min 1Q Median 3Q Max #> -15.8922 -2.5114 0.2866 4.4646 9.3285 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -0.2278 2.6323 -0.087 0.932 #> dat$x -1.1829 0.2314 -5.113 7.28e-05 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 6.506 on 18 degrees of freedom #> Multiple R-squared: 0.5922, Adjusted R-squared: 0.5695 #> F-statistic: 26.14 on 1 and 18 DF, p-value: 7.282e-05

Linear regression with multiple predictors

Linear regression with y as the outcome, and x and z as predictors. Note that the formula specified below does not test for interactions between x and z. # These have the same result fit2 <- lm(y ~ x + z, data=dat) # Using the columns x, y, and z from the data frame fit2 <- lm(dat$y ~ dat$x + dat$z) # Using the vectors x, y, z fit2 #> #> Call: #> lm(formula = dat$y ~ dat$x + dat$z) #> #> Coefficients: #> (Intercept) dat$x dat$z #> -1.382 -1.564 1.858 summary(fit2) #> #> Call: #> lm(formula = dat$y ~ dat$x + dat$z) #> #> Residuals: #> Min 1Q Median 3Q Max #> -7.974 -3.187 -1.205 3.847 7.524 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -1.3816 1.9878 -0.695 0.49644 #> dat$x -1.5642 0.1984 -7.883 4.46e-07 *** #> dat$z 1.8578 0.4753 3.908 0.00113 ** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 4.859 on 17 degrees of freedom #> Multiple R-squared: 0.7852, Adjusted R-squared: 0.7599 #> F-statistic: 31.07 on 2 and 17 DF, p-value: 2.1e-06

Interactions

The topic of how to properly do multiple regression and test for interactions can be quite complex and is not covered here. Here we just fit a model with x, z, and the interaction between the two. To model interactions between x and z, an x:z term must be added. Alternatively, the formula x*z expands to x+z+x:z. # These are equivalent; the x*z expands to x + z + x:z fit3 <- lm(y ~ x * z, data=dat) fit3 <- lm(y ~ x + z + x:z, data=dat) fit3 #> #> Call: #> lm(formula = y ~ x + z + x:z, data = dat) #> #> Coefficients: #> (Intercept) x z x:z #> 2.2820 -2.1311 -0.1068 0.2081 summary(fit3) #> #> Call: #> lm(formula = y ~ x + z + x:z, data = dat) #> #> Residuals: #> Min 1Q Median 3Q Max #> -5.3045 -3.5998 0.3926 2.1376 8.3957 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 2.28204 2.20064 1.037 0.3152 #> x -2.13110 0.27406 -7.776 8e-07 *** #> z -0.10682 0.84820 -0.126 0.9013 #> x:z 0.20814 0.07874 2.643 0.0177 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 4.178 on 16 degrees of freedom #> Multiple R-squared: 0.8505, Adjusted R-squared: 0.8225 #> F-statistic: 30.34 on 3 and 16 DF, p-value: 7.759e-07

t-test

Problem

You want to test whether two samples are drawn from populations with different means, or test whether one sample is drawn from a population with a mean different from some theoretical mean.

Solution

Sample data

We will use the built-in sleep data set. sleep #> extra group ID #> 1 0.7 1 1 #> 2 -1.6 1 2 #> 3 -0.2 1 3 #> 4 -1.2 1 4 #> 5 -0.1 1 5 #> 6 3.4 1 6 #> 7 3.7 1 7 #> 8 0.8 1 8 #> 9 0.0 1 9 #> 10 2.0 1 10 #> 11 1.9 2 1 #> 12 0.8 2 2 #> 13 1.1 2 3 #> 14 0.1 2 4 #> 15 -0.1 2 5 #> 16 4.4 2 6 #> 17 5.5 2 7 #> 18 1.6 2 8 #> 19 4.6 2 9 #> 20 3.4 2 10 We’ll also make a wide version of the sleep data; below we’ll see how to work with data in both long and wide formats. sleep_wide <- data.frame( ID=1:10, group1=sleep$extra[1:10], group2=sleep$extra[11:20] ) sleep_wide #> ID group1 group2 #> 1 1 0.7 1.9 #> 2 2 -1.6 0.8 #> 3 3 -0.2 1.1 #> 4 4 -1.2 0.1 #> 5 5 -0.1 -0.1 #> 6 6 3.4 4.4 #> 7 7 3.7 5.5 #> 8 8 0.8 1.6 #> 9 9 0.0 4.6 #> 10 10 2.0 3.4

Comparing two groups: independent two-sample t-test

Suppose the two groups are independently sampled; we'll ignore the ID variable for the purposes here. The t.test function can operate on long-format data like sleep, where one column (extra) records the measurement, and the other column (group) specifies the grouping; or it can operate on two separate vectors. # Welch t-test t.test(extra ~ group, sleep) #> #> Welch Two Sample t-test #> #> data: extra by group #> t = -1.8608, df = 17.776, p-value = 0.07939 #> alternative hypothesis: true difference in means is not equal to 0 #> 95 percent confidence interval: #> -3.3654832 0.2054832 #> sample estimates: #> mean in group 1 mean in group 2 #> 0.75 2.33 # Same for wide data (two separate vectors) # t.test(sleep_wide$group1, sleep_wide$group2) By default, t.test does not assume equal variances: it uses the Welch t-test rather than Student's t-test. Note that in the Welch t-test, df=17.776, because of the adjustment for unequal variances. To use Student's t-test, set var.equal=TRUE. # Student t-test t.test(extra ~ group, sleep, var.equal=TRUE) #> #> Two Sample t-test #> #> data: extra by group #> t = -1.8608, df = 18, p-value = 0.07919 #> alternative hypothesis: true difference in means is not equal to 0 #> 95 percent confidence interval: #> -3.363874 0.203874 #> sample estimates: #> mean in group 1 mean in group 2 #> 0.75 2.33 # Same for wide data (two separate vectors) # t.test(sleep_wide$group1, sleep_wide$group2, var.equal=TRUE)

Paired-sample t-test

You can also compare paired data, using a paired-sample t-test. You might have observations before and after a treatment, or of two matched subjects with different treatments. Again, the t.test function can be used on a data frame with a grouping variable, or on two vectors. It relies on relative position to determine the pairing. If you are using long-format data with a grouping variable, the first row with group=1 is paired with the first row with group=2. It is important to make sure that the data is sorted and there are no missing observations; otherwise the pairing can be thrown off. In this case, we can sort by the group and ID variables to ensure that the order is the same. For more on sorting see Sorting. # Sort by group then ID sleep <- sleep[order(sleep$group, sleep$ID), ] # Paired t-test t.test(extra ~ group, sleep, paired=TRUE) #> #> Paired t-test #> #> data: extra by group #> t = -4.0621, df = 9, p-value = 0.002833 #> alternative hypothesis: true difference in means is not equal to 0 #> 95 percent confidence interval: #> -2.4598858 -0.7001142 #> sample estimates: #> mean of the differences #> -1.58 # Same for wide data (two separate vectors) # t.test(sleep_wide$group1, sleep_wide$group2, paired=TRUE) The paired t-test is equivalent to testing whether the difference between each pair of observations has a population mean of 0; note that the result below matches the paired test above. (See below for comparing a single group to a population mean.) t.test(sleep_wide$group1 - sleep_wide$group2, mu=0) #> #> One Sample t-test #> #> data: sleep_wide$group1 - sleep_wide$group2 #> t = -4.0621, df = 9, p-value = 0.002833 #> alternative hypothesis: true mean is not equal to 0 #> 95 percent confidence interval: #> -2.4598858 -0.7001142 #> sample estimates: #> mean of x #> -1.58

Comparing a group against an expected population mean: one-sample t-test

Suppose that you want to test whether the data in column extra is drawn from a population whose true mean is 0. In this case, the group and ID columns are ignored. t.test(sleep$extra, mu=0) #> #> One Sample t-test #> #> data: sleep$extra #> t = 3.413, df = 19, p-value = 0.002918 #> alternative hypothesis: true mean is not equal to 0 #> 95 percent confidence interval: #> 0.5955845 2.4844155 #> sample estimates: #> mean of x #> 1.54

Frequency tests

Problem

You have categorical data and want to test whether the frequency distribution of values differs from an expected frequency, or whether the distribution differs between groups.

Solution

There are two general questions that frequency tests address:
  1. Does the frequency distribution differ from an expected, or theoretical, distribution (e.g., expect 50% yes, 50% no)? (Goodness-of-fit test)
  2. Does the frequency distribution differ between two or more groups? (Independence test)
Of the statistical tests commonly used to address these questions, there are exact tests and approximate tests.
                 Expected distribution             Comparing groups
  Exact          Exact binomial test               Fisher's exact test
  Approximate    Chi-square goodness-of-fit test   Chi-square test of independence
Note: The exact binomial test can only be used on one variable that has two levels. Fisher's exact test can only be used with two-dimensional contingency tables (for example, it can be used when there is one independent variable and one dependent variable, but not when there are 2 IVs and 1 DV). To test for paired or within-subjects effects, McNemar's test can be used. To use it, there must be one IV with two levels, and one DV with two levels (a minimal example is sketched at the end of this recipe). To test for the independence of two variables with repeated measurements, the Cochran-Mantel-Haenszel test can be used. Suppose this is your data, where each row represents one case: data <- read.table(header=TRUE, text=' condition result control 0 control 0 control 0 control 0 treatment 1 control 0 control 0 treatment 0 treatment 1 control 1 treatment 1 treatment 1 treatment 1 treatment 1 treatment 0 control 0 control 1 control 0 control 1 treatment 0 treatment 1 treatment 0 treatment 0 control 0 treatment 1 control 0 control 0 treatment 1 treatment 0 treatment 1 ') Instead of a data frame of cases, your data might be stored as a data frame of counts, or as a contingency table. For the analyses here, your data must be in a contingency table. See the recipe on converting between data frames and contingency tables, above, for more information.

Tests of goodness-of-fit (expected frequency)

Chi-square test

To test the hypothesis that the two values in the result column (ignoring condition) are equally likely (50%-50%) in the population: # Create contingency table for result, which contains values 0 and 1 # The column names are "0" and "1" (they're not actually values in the table) ct <- table(data$result) ct #> #> 0 1 #> 17 13 # An alternative is to manually create the table #ct <- matrix(c(17,13), ncol=2) #colnames(ct) <- c("0", "1") # Perform Chi-square test chisq.test(ct) #> #> Chi-squared test for given probabilities #> #> data: ct #> X-squared = 0.53333, df = 1, p-value = 0.4652 To test the sample with different expected frequencies (in this case 0.75 and 0.25): # Probability table - must sum to 1 pt <- c(.75, .25) chisq.test(ct, p=pt) #> #> Chi-squared test for given probabilities #> #> data: ct #> X-squared = 5.3778, df = 1, p-value = 0.02039 If you want to extract information out of the test result, you can save the result into a variable, examine the variable with str(), and pull out the parts you want. For example: chi_res <- chisq.test(ct, p=pt) # View all the parts that can be extracted str(chi_res) #> List of 9 #> $ statistic: Named num 5.38 #> ..- attr(*, "names")= chr "X-squared" #> $ parameter: Named num 1 #> ..- attr(*, "names")= chr "df" #> $ p.value : num 0.0204 #> $ method : chr "Chi-squared test for given probabilities" #> $ data.name: chr "ct" #> $ observed : 'table' int [1:2(1d)] 17 13 #> ..- attr(*, "dimnames")=List of 1 #> .. ..$ : chr [1:2] "0" "1" #> $ expected : Named num [1:2] 22.5 7.5 #> ..- attr(*, "names")= chr [1:2] "0" "1" #> $ residuals: table [1:2(1d)] -1.16 2.01 #> ..- attr(*, "dimnames")=List of 1 #> .. ..$ : chr [1:2] "0" "1" #> $ stdres : table [1:2(1d)] -2.32 2.32 #> ..- attr(*, "dimnames")=List of 1 #> .. ..$ : chr [1:2] "0" "1" #> - attr(*, "class")= chr "htest" # Get the Chi-squared value chi_res$statistic #> X-squared #> 5.377778 # Get the p-value chi_res$p.value #> [1] 0.02039484

Exact binomial test

The exact binomial test is used only with data where there is one variable with two possible values. ct <- table(data$result) ct #> #> 0 1 #> 17 13 binom.test(ct, p=0.5) #> #> Exact binomial test #> #> data: ct #> number of successes = 17, number of trials = 30, p-value = 0.5847 #> alternative hypothesis: true probability of success is not equal to 0.5 #> 95 percent confidence interval: #> 0.3742735 0.7453925 #> sample estimates: #> probability of success #> 0.5666667 # Using expected probability of 75% -- notice that 1 is in the second column of # the table so need to use p=.25 binom.test(ct, p=0.25) #> #> Exact binomial test #> #> data: ct #> number of successes = 17, number of trials = 30, p-value = 0.0002157 #> alternative hypothesis: true probability of success is not equal to 0.25 #> 95 percent confidence interval: #> 0.3742735 0.7453925 #> sample estimates: #> probability of success #> 0.5666667 If you want to extract information out of the test result, you can save the result into a variable, examine the variable with str(), and pull out the parts you want. For example: bin_res <- binom.test(ct, p=0.25) # View all the parts that can be extracted str(bin_res) #> List of 9 #> $ statistic : Named num 17 #> ..- attr(*, "names")= chr "number of successes" #> $ parameter : Named num 30 #> ..- attr(*, "names")= chr "number of trials" #> $ p.value : Named num 0.000216 #> ..- attr(*, "names")= chr "0" #> $ conf.int : atomic [1:2] 0.374 0.745 #> ..- attr(*, "conf.level")= num 0.95 #> $ estimate : Named num 0.567 #> ..- attr(*, "names")= chr "probability of success" #> $ null.value : Named num 0.25 #> ..- attr(*, "names")= chr "probability of success" #> $ alternative: chr "two.sided" #> $ method : chr "Exact binomial test" #> $ data.name : chr "ct" #> - attr(*, "class")= chr "htest" # Get the p-value bin_res$p.value #> 0 #> 0.0002156938 # Get the 95% confidence interval bin_res$conf.int #> [1] 0.3742735 0.7453925 #> attr(,"conf.level") #> [1] 0.95

Tests of independence (comparing groups)

Chi-square test

To test whether the control and treatment conditions result in different frequencies, use a 2D contingency table. ct <- table(data$condition, data$result) ct #> #> 0 1 #> control 11 3 #> treatment 6 10 chisq.test(ct) #> #> Pearson's Chi-squared test with Yates' continuity correction #> #> data: ct #> X-squared = 3.593, df = 1, p-value = 0.05802 chisq.test(ct, correct=FALSE) #> #> Pearson's Chi-squared test #> #> data: ct #> X-squared = 5.1293, df = 1, p-value = 0.02353 For 2x2 tables, it uses Yates’s continuity correction by default. The corrected test is more conservative for small sample sizes. The flag correct=FALSE tells it to use Pearson’s chi-square test without the correction.
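The chi-square approximation is only trustworthy when the expected cell counts are large enough (a common rule of thumb is at least 5 per cell). These can be extracted from the saved test result, just as in the goodness-of-fit section above. A minimal sketch:

chi_res <- chisq.test(ct)
# Expected counts under the null hypothesis of independence
chi_res$expected
# Standardized residuals show which cells drive a significant result
chi_res$stdres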

Fisher’s exact test

For small sample sizes, Fisher’s exact test may be more appropriate. It is typically used for 2x2 tables with relatively small sample sizes because it is computationally intensive for more complicated (e.g., 2x3) tables, and those with larger sample sizes. However, the implementation in R seems to handle larger cases without trouble. ct <- table(data$condition, data$result) ct #> #> 0 1 #> control 11 3 #> treatment 6 10 fisher.test(ct) #> #> Fisher's Exact Test for Count Data #> #> data: ct #> p-value = 0.03293 #> alternative hypothesis: true odds ratio is not equal to 1 #> 95 percent confidence interval: #> 0.966861 45.555016 #> sample estimates: #> odds ratio #> 5.714369
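As with the other tests on this page, the result of fisher.test() is an htest object, so individual pieces can be pulled out after saving it to a variable. A short sketch:

fish_res <- fisher.test(ct)
fish_res$p.value    # the p-value
fish_res$estimate   # conditional MLE of the odds ratio
fish_res$conf.int   # confidence interval for the odds ratio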

Cochran-Mantel-Haenszel test

The Cochran-Mantel-Haenszel test (or Mantel-Haenszel test) is used for testing the independence of two dichotomous variables with repeated measurements. It is most commonly used with 2x2xK tables, where K is the number of measurement conditions. For example, you may want to know whether a treatment (C vs. D) affects the likelihood of recovery (yes or no). Suppose, though, that the treatments were administered at three different times of day, morning, afternoon, and night – and that you want your test to control for this. The CMH test would then operate on a 2x2x3 contingency table, where the third variable is the one you wish to control for. The implementation of the CMH test in R can handle dimensions greater than 2x2xK. For example, you could use it for a 3x3xK contingency table. In the following example (borrowed from McDonald’s Handbook of Biological Statistics), there are three variables: Location, Allele, and Habitat. The question is whether Allele (94 or non-94) and Habitat (marine or estuarine) are independent, when location is controlled for. fish <- read.table(header=TRUE, text=' Location Allele Habitat Count tillamook 94 marine 56 tillamook 94 estuarine 69 tillamook non-94 marine 40 tillamook non-94 estuarine 77 yaquina 94 marine 61 yaquina 94 estuarine 257 yaquina non-94 marine 57 yaquina non-94 estuarine 301 alsea 94 marine 73 alsea 94 estuarine 65 alsea non-94 marine 71 alsea non-94 estuarine 79 umpqua 94 marine 71 umpqua 94 estuarine 48 umpqua non-94 marine 55 umpqua non-94 estuarine 48 ') Note that the data above is entered as a data frame of counts, instead of a data frame of cases as in previous examples. Instead of using table() to convert it to a contingency table, use xtabs() instead. For more information, see this page. # Make a 3D contingency table, where the last variable, Location, is the one to # control for. If you use table() for case data, the last variable is also the # one to control for. ct <- xtabs(Count ~ Allele + Habitat + Location, data=fish) ct #> , , Location = alsea #> #> Habitat #> Allele estuarine marine #> 94 65 73 #> non-94 79 71 #> #> , , Location = tillamook #> #> Habitat #> Allele estuarine marine #> 94 69 56 #> non-94 77 40 #> #> , , Location = umpqua #> #> Habitat #> Allele estuarine marine #> 94 48 71 #> non-94 48 55 #> #> , , Location = yaquina #> #> Habitat #> Allele estuarine marine #> 94 257 61 #> non-94 301 57 # This prints ct in a "flat" format ftable(ct) #> Location alsea tillamook umpqua yaquina #> Allele Habitat #> 94 estuarine 65 69 48 257 #> marine 73 56 71 61 #> non-94 estuarine 79 77 48 301 #> marine 71 40 55 57 # Print with different arrangement of variables ftable(ct, row.vars=c("Location","Allele"), col.vars="Habitat") #> Habitat estuarine marine #> Location Allele #> alsea 94 65 73 #> non-94 79 71 #> tillamook 94 69 56 #> non-94 77 40 #> umpqua 94 48 71 #> non-94 48 55 #> yaquina 94 257 61 #> non-94 301 57 mantelhaen.test(ct) #> #> Mantel-Haenszel chi-squared test with continuity correction #> #> data: ct #> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463 #> alternative hypothesis: true common odds ratio is not equal to 1 #> 95 percent confidence interval: #> 0.6005522 0.9593077 #> sample estimates: #> common odds ratio #> 0.759022 According to this test, there is a relationship between Allele and Habitat, controlling for Location, p=.025. 
Note that the first two dimensions of the contingency table are treated the same (so their order can be swapped without affecting the test result), but the highest-order dimension in the contingency table is treated differently. This is illustrated below. # The following two create different contingency tables, but have the same result # with the CMH test ct.1 <- xtabs(Count ~ Habitat + Allele + Location, data=fish) ct.2 <- xtabs(Count ~ Allele + Habitat + Location, data=fish) mantelhaen.test(ct.1) #> #> Mantel-Haenszel chi-squared test with continuity correction #> #> data: ct.1 #> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463 #> alternative hypothesis: true common odds ratio is not equal to 1 #> 95 percent confidence interval: #> 0.6005522 0.9593077 #> sample estimates: #> common odds ratio #> 0.759022 mantelhaen.test(ct.2) #> #> Mantel-Haenszel chi-squared test with continuity correction #> #> data: ct.2 #> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463 #> alternative hypothesis: true common odds ratio is not equal to 1 #> 95 percent confidence interval: #> 0.6005522 0.9593077 #> sample estimates: #> common odds ratio #> 0.759022 # With Allele last, we get a different result ct.3 <- xtabs(Count ~ Location + Habitat + Allele, data=fish) ct.4 <- xtabs(Count ~ Habitat + Location + Allele, data=fish) mantelhaen.test(ct.3) #> #> Cochran-Mantel-Haenszel test #> #> data: ct.3 #> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16 mantelhaen.test(ct.4) #> #> Cochran-Mantel-Haenszel test #> #> data: ct.4 #> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16 # With Habitat last, we get a different result ct.5 <- xtabs(Count ~ Allele + Location + Habitat, data=fish) ct.6 <- xtabs(Count ~ Location + Allele + Habitat, data=fish) mantelhaen.test(ct.5) #> #> Cochran-Mantel-Haenszel test #> #> data: ct.5 #> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689 mantelhaen.test(ct.6) #> #> Cochran-Mantel-Haenszel test #> #> data: ct.6 #> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689

McNemar’s test

McNemar’s test is conceptually like a within-subjects test for frequency data. For example, suppose you want to test whether a treatment increases the probability that a person will respond “yes” to a question, and that you get just one pre-treatment and one post-treatment response per person. A standard chi-square test would be inappropriate, because it assumes that the groups are independent. Instead, McNemar’s test can be used. This test can only be used when there are two measurements of a dichotomous variable. The 2x2 contingency table used for McNemar’s test bears a superficial resemblance to those used for “normal” chi-square tests, but it is different in structure. Suppose this is your data. Each subject has a pre-treatment and post-treatment response. data <- read.table(header=TRUE, text=' subject time result 1 pre 0 1 post 1 2 pre 1 2 post 1 3 pre 0 3 post 1 4 pre 1 4 post 0 5 pre 1 5 post 1 6 pre 0 6 post 1 7 pre 0 7 post 1 8 pre 0 8 post 1 9 pre 0 9 post 1 10 pre 1 10 post 1 11 pre 0 11 post 0 12 pre 1 12 post 1 13 pre 0 13 post 1 14 pre 0 14 post 0 15 pre 0 15 post 1 ') If your data is not already in wide format, it must be converted (see this page for more information): library(tidyr) data_wide <- spread(data, time, result) data_wide #> subject post pre #> 1 1 1 0 #> 2 2 1 1 #> 3 3 1 0 #> 4 4 0 1 #> 5 5 1 1 #> 6 6 1 0 #> 7 7 1 0 #> 8 8 1 0 #> 9 9 1 0 #> 10 10 1 1 #> 11 11 0 0 #> 12 12 1 1 #> 13 13 1 0 #> 14 14 0 0 #> 15 15 1 0 Next, generate the contingency table from just the pre and post columns from the data frame: ct <- table( data_wide[,c("pre","post")] ) ct #> post #> pre 0 1 #> 0 2 8 #> 1 1 4 # The contingency table above puts each subject into one of four cells, depending # on their pre- and post-treatment response. Note that it is different from # the contingency table used for a "normal" chi-square test, shown below. The table # below does not account for repeated measurements, and so it is not useful for # the purposes here. # table(data[,c("time","result")]) # result # time 0 1 # post 3 12 # pre 10 5 After creating the appropriate contingency table, run the test: mcnemar.test(ct) #> #> McNemar's Chi-squared test with continuity correction #> #> data: ct #> McNemar's chi-squared = 4, df = 1, p-value = 0.0455 For small sample sizes, it uses a continuity correction. Instead of using this correction, you can use an exact version of McNemar’s test, which is more accurate. It is available in the package exact2x2. library(exact2x2) #> Loading required package: exactci #> Loading required package: ssanv mcnemar.exact(ct) #> #> Exact McNemar test (with central confidence intervals) #> #> data: ct #> b = 8, c = 1, p-value = 0.03906 #> alternative hypothesis: true odds ratio is not equal to 1 #> 95 percent confidence interval: #> 1.072554 354.981246 #> sample estimates: #> odds ratio #> 8
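If you already have the four cell counts rather than raw cases, the 2x2 table for McNemar’s test can also be entered directly. A sketch using the counts from above; the dimnames are just labels for readability:

ct <- matrix(c(2, 8,
               1, 4),
             nrow=2, byrow=TRUE,
             dimnames=list(pre=c("0","1"), post=c("0","1")))
mcnemar.test(ct)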

ANOVA

Problem

You want to compare multiple groups using an ANOVA.

Solution

Suppose this is your data: data <- read.table(header=TRUE, text=' subject sex age before after 1 F old 9.5 7.1 2 M old 10.3 11.0 3 M old 7.5 5.8 4 F old 12.4 8.8 5 M old 10.2 8.6 6 M old 11.0 8.0 7 M young 9.1 3.0 8 F young 7.9 5.2 9 F old 6.6 3.4 10 M young 7.7 4.0 11 M young 9.4 5.3 12 M old 11.6 11.3 13 M young 9.9 4.6 14 F young 8.6 6.4 15 F young 14.3 13.5 16 F old 9.2 4.7 17 M young 9.8 5.1 18 F old 9.9 7.3 19 F young 13.0 9.5 20 M young 10.2 5.4 21 M young 9.0 3.7 22 F young 7.9 6.2 23 M old 10.1 10.0 24 M young 9.0 1.7 25 M young 8.6 2.9 26 M young 9.4 3.2 27 M young 9.7 4.7 28 M young 9.3 4.9 29 F young 10.7 9.8 30 M old 9.3 9.4 ') # Make sure subject column is a factor, so that it's not treated as a continuous # variable. data$subject <- factor(data$subject)
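Since the design will matter later (some of the output below behaves differently for unbalanced data), it can be worth checking the cell counts before fitting anything. A quick check:

# Counts of subjects in each sex-by-age cell; this design is unbalanced
table(data$sex, data$age)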

One way between ANOVA

# One way between: # IV: sex # DV: before aov1 <- aov(before ~ sex, data=data) summary(aov1) #> Df Sum Sq Mean Sq F value Pr(>F) #> sex 1 1.53 1.529 0.573 0.455 #> Residuals 28 74.70 2.668 # Show the means model.tables(aov1, "means") #> Tables of means #> Grand mean #> #> 9.703333 #> #> sex #> F M #> 10 9.532 #> rep 11 19.000
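If you need individual numbers from the ANOVA (for example, to report the p-value programmatically), they can be extracted from the summary object. A minimal sketch; the [[1]] indexing assumes a model without an Error() term, as here:

s <- summary(aov1)[[1]]
s[["F value"]][1]   # F statistic for the sex effect
s[["Pr(>F)"]][1]    # corresponding p-value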

Two way between ANOVA

# 2x2 between: # IV: sex # IV: age # DV: after # These two calls are equivalent aov2 <- aov(after ~ sex*age, data=data) aov2 <- aov(after ~ sex + age + sex:age, data=data) summary(aov2) #> Df Sum Sq Mean Sq F value Pr(>F) #> sex 1 16.08 16.08 4.038 0.0550 . #> age 1 38.96 38.96 9.786 0.0043 ** #> sex:age 1 89.61 89.61 22.509 6.6e-05 *** #> Residuals 26 103.51 3.98 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # Show the means model.tables(aov2, "means") #> Tables of means #> Grand mean #> #> 6.483333 #> #> sex #> F M #> 7.445 5.926 #> rep 11.000 19.000 #> #> age #> old young #> 7.874 5.556 #> rep 12.000 18.000 #> #> sex:age #> age #> sex old young #> F 6.260 8.433 #> rep 5.000 6.000 #> M 9.157 4.042 #> rep 7.000 12.000
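One caveat worth knowing: aov() uses sequential (Type I) sums of squares, so with unbalanced data like this (11 F vs. 19 M), the order in which terms enter the formula can change the sums of squares for the main effects. A quick way to see this:

# Same terms in the opposite order; compare the sex and age rows
aov2b <- aov(after ~ age * sex, data=data)
summary(aov2b)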

Tukey HSD post-hoc test

TukeyHSD(aov2) #> Tukey multiple comparisons of means #> 95% family-wise confidence level #> #> Fit: aov(formula = after ~ sex + age + sex:age, data = data) #> #> $sex #> diff lwr upr p adj #> M-F -1.519139 -3.073025 0.03474709 0.0549625 #> #> $age #> diff lwr upr p adj #> young-old -2.31785 -3.846349 -0.7893498 0.0044215 #> #> $`sex:age` #> diff lwr upr p adj #> M:old-F:old 2.8971429 -0.3079526 6.1022384 0.0869856 #> F:young-F:old 2.1733333 -1.1411824 5.4878491 0.2966111 #> M:young-F:old -2.2183333 -5.1319553 0.6952887 0.1832890 #> F:young-M:old -0.7238095 -3.7691188 2.3214997 0.9138789 #> M:young-M:old -5.1154762 -7.7187601 -2.5121923 0.0000676 #> M:young-F:young -4.3916667 -7.1285380 -1.6547953 0.0008841

ANOVAs with within-subjects variables

For ANOVAs with within-subjects variables, the data must be in long format. The data supplied above is in wide format, so we have to convert it first. (See ../../Manipulating data/Converting data between wide and long format for more information.) Also, for ANOVAs with a within-subjects variable, there must be an identifier column. In this case, it is subject. This identifier variable must be a factor. If it is a numeric type, the function will interpret it incorrectly and it won’t work properly. library(tidyr) # This is what the original (wide) data looks like # subject sex age before after # 1 F old 9.5 7.1 # 2 M old 10.3 11.0 # 3 M old 7.5 5.8 # Convert it to long format data_long <- gather(data, time, value, before:after) # Look at first few rows head(data_long) #> subject sex age time value #> 1 1 F old before 9.5 #> 2 2 M old before 10.3 #> 3 3 M old before 7.5 #> 4 4 F old before 12.4 #> 5 5 M old before 10.2 #> 6 6 M old before 11.0 # Make sure subject column is a factor data_long$subject <- factor(data_long$subject)

One-way within ANOVA

First, convert the data to long format and make sure subject is a factor, as shown above. If subject is a numeric column, and not a factor, your results will be wrong! # IV (within): time # DV: value aov_time <- aov(value ~ time + Error(subject/time), data=data_long) summary(aov_time) #> #> Error: subject #> Df Sum Sq Mean Sq F value Pr(>F) #> Residuals 29 261.2 9.009 #> #> Error: subject:time #> Df Sum Sq Mean Sq F value Pr(>F) #> time 1 155.53 155.53 71.43 2.59e-09 *** #> Residuals 29 63.14 2.18 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # This won't work here for some reason (?) model.tables(aov_time, "means") #> Tables of means #> Grand mean #> #> 8.093333 #> #> time #> time #> after before #> 6.483 9.703

Mixed design ANOVA

First, convert the data to long format and make sure subject is a factor, as shown above. # 2x2 mixed: # IV between: age # IV within: time # DV: value aov_age_time <- aov(value ~ age*time + Error(subject/time), data=data_long) summary(aov_age_time) #> #> Error: subject #> Df Sum Sq Mean Sq F value Pr(>F) #> age 1 24.44 24.440 2.89 0.1 #> Residuals 28 236.81 8.457 #> #> Error: subject:time #> Df Sum Sq Mean Sq F value Pr(>F) #> time 1 155.53 155.53 98.14 1.18e-10 *** #> age:time 1 18.77 18.77 11.84 0.00184 ** #> Residuals 28 44.37 1.58 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # This won't work here because the data is unbalanced model.tables(aov_age_time, "means") #> Tables of means #> Grand mean #> #> 8.093333 #> #> age #> old young #> 8.875 7.572 #> rep 24.000 36.000 #> #> time #> after before #> 6.483 9.703 #> rep 30.000 30.000 #> #> age:time #> time #> age after before #> old 7.950 9.800 #> rep 12.000 12.000 #> young 5.506 9.639 #> rep 18.000 18.000

More ANOVAs with within-subjects variables

These examples don’t operate on the data above, but they should illustrate how to do things. First, convert the data to long format and make sure subject is a factor, as shown above. # Two within variables aov.ww <- aov(y ~ w1*w2 + Error(subject/(w1*w2)), data=data_long) # One between variable and two within variables aov.bww <- aov(y ~ b1*w1*w2 + Error(subject/(w1*w2)) + b1, data=data_long) # Two between variables and one within variable aov.bww <- aov(y ~ b1*b2*w1 + Error(subject/(w1)) + b1*b2, data=data_long)

Logistic regression

Problem

You want to perform a logistic regression.

Solution

A logistic regression is typically used when there is one dichotomous outcome variable (such as winning or losing), and a continuous predictor variable which is related to the probability or odds of the outcome variable. It can also be used with categorical predictors, and with multiple predictors. Suppose we start with part of the built-in mtcars dataset. In the examples below, we’ll use vs as the outcome variable, mpg as a continuous predictor, and am as a categorical (dichotomous) predictor. data(mtcars) dat <- subset(mtcars, select=c(mpg, am, vs)) dat #> mpg am vs #> Mazda RX4 21.0 1 0 #> Mazda RX4 Wag 21.0 1 0 #> Datsun 710 22.8 1 1 #> Hornet 4 Drive 21.4 0 1 #> Hornet Sportabout 18.7 0 0 #> Valiant 18.1 0 1 #> Duster 360 14.3 0 0 #> Merc 240D 24.4 0 1 #> Merc 230 22.8 0 1 #> Merc 280 19.2 0 1 #> Merc 280C 17.8 0 1 #> Merc 450SE 16.4 0 0 #> Merc 450SL 17.3 0 0 #> Merc 450SLC 15.2 0 0 #> Cadillac Fleetwood 10.4 0 0 #> Lincoln Continental 10.4 0 0 #> Chrysler Imperial 14.7 0 0 #> Fiat 128 32.4 1 1 #> Honda Civic 30.4 1 1 #> Toyota Corolla 33.9 1 1 #> Toyota Corona 21.5 0 1 #> Dodge Challenger 15.5 0 0 #> AMC Javelin 15.2 0 0 #> Camaro Z28 13.3 0 0 #> Pontiac Firebird 19.2 0 0 #> Fiat X1-9 27.3 1 1 #> Porsche 914-2 26.0 1 0 #> Lotus Europa 30.4 1 1 #> Ford Pantera L 15.8 1 0 #> Ferrari Dino 19.7 1 0 #> Maserati Bora 15.0 1 0 #> Volvo 142E 21.4 1 1

Continuous predictor, dichotomous outcome

If the data set has one dichotomous and one continuous variable, and the continuous variable is a predictor of the probability of the dichotomous variable, then a logistic regression might be appropriate. In this example, mpg is the continuous predictor variable, and vs is the dichotomous outcome variable. # Do the logistic regression - both of these have the same effect. # ("logit" is the default model when family is binomial.) logr_vm <- glm(vs ~ mpg, data=dat, family=binomial) logr_vm <- glm(vs ~ mpg, data=dat, family=binomial(link="logit")) To view the model and information about it: # Print information about the model logr_vm #> #> Call: glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat) #> #> Coefficients: #> (Intercept) mpg #> -8.8331 0.4304 #> #> Degrees of Freedom: 31 Total (i.e. Null); 30 Residual #> Null Deviance: 43.86 #> Residual Deviance: 25.53 AIC: 29.53 # More information about the model summary(logr_vm) #> #> Call: #> glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat) #> #> Deviance Residuals: #> Min 1Q Median 3Q Max #> -2.2127 -0.5121 -0.2276 0.6402 1.6980 #> #> Coefficients: #> Estimate Std. Error z value Pr(>|z|) #> (Intercept) -8.8331 3.1623 -2.793 0.00522 ** #> mpg 0.4304 0.1584 2.717 0.00659 ** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> (Dispersion parameter for binomial family taken to be 1) #> #> Null deviance: 43.860 on 31 degrees of freedom #> Residual deviance: 25.533 on 30 degrees of freedom #> AIC: 29.533 #> #> Number of Fisher Scoring iterations: 6
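Once the model is fitted, predict() with type="response" returns fitted probabilities rather than log-odds, which is usually what you want for interpretation. A small sketch with made-up mpg values:

newdat <- data.frame(mpg=c(15, 20, 25, 30))
predict(logr_vm, newdata=newdat, type="response")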

Plotting

The data and logistic regression model can be plotted with ggplot2 or base graphics: library(ggplot2) ggplot(dat, aes(x=mpg, y=vs)) + geom_point() + stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE) par(mar = c(4, 4, 1, 1)) # Reduce some of the margins so that the plot fits better plot(dat$mpg, dat$vs) curve(predict(logr_vm, data.frame(mpg=x), type="response"), add=TRUE)

Dichotomous predictor, dichotomous outcome

This proceeds in much the same way as above. In this example, am is the dichotomous predictor variable, and vs is the dichotomous outcome variable. # Do the logistic regression logr_va <- glm(vs ~ am, data=dat, family=binomial) # Print information about the model logr_va #> #> Call: glm(formula = vs ~ am, family = binomial, data = dat) #> #> Coefficients: #> (Intercept) am #> -0.5390 0.6931 #> #> Degrees of Freedom: 31 Total (i.e. Null); 30 Residual #> Null Deviance: 43.86 #> Residual Deviance: 42.95 AIC: 46.95 # More information about the model summary(logr_va) #> #> Call: #> glm(formula = vs ~ am, family = binomial, data = dat) #> #> Deviance Residuals: #> Min 1Q Median 3Q Max #> -1.2435 -0.9587 -0.9587 1.1127 1.4132 #> #> Coefficients: #> Estimate Std. Error z value Pr(>|z|) #> (Intercept) -0.5390 0.4756 -1.133 0.257 #> am 0.6931 0.7319 0.947 0.344 #> #> (Dispersion parameter for binomial family taken to be 1) #> #> Null deviance: 43.860 on 31 degrees of freedom #> Residual deviance: 42.953 on 30 degrees of freedom #> AIC: 46.953 #> #> Number of Fisher Scoring iterations: 4

Plotting

The data and logistic regression model can be plotted with ggplot2 or base graphics, although the plots are probably less informative than those with a continuous variable. Because there are only 4 locations for the points to go, it will help to jitter the points so they do not all get overplotted. library(ggplot2) ggplot(dat, aes(x=am, y=vs)) + geom_point(shape=1, position=position_jitter(width=.05,height=.05)) + stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE) par(mar = c(4, 4, 1, 1)) # Reduce some of the margins so that the plot fits better plot(jitter(dat$am, .2), jitter(dat$vs, .2)) curve(predict(logr_va, data.frame(am=x), type="response"), add=TRUE)

Continuous and dichotomous predictors, dichotomous outcome

This is similar to the previous examples. In this example, mpg is the continuous predictor, am is the dichotomous predictor variable, and vs is the dichotomous outcome variable. logr_vma <- glm(vs ~ mpg + am, data=dat, family=binomial) logr_vma #> #> Call: glm(formula = vs ~ mpg + am, family = binomial, data = dat) #> #> Coefficients: #> (Intercept) mpg am #> -12.7051 0.6809 -3.0073 #> #> Degrees of Freedom: 31 Total (i.e. Null); 29 Residual #> Null Deviance: 43.86 #> Residual Deviance: 20.65 AIC: 26.65 summary(logr_vma) #> #> Call: #> glm(formula = vs ~ mpg + am, family = binomial, data = dat) #> #> Deviance Residuals: #> Min 1Q Median 3Q Max #> -2.05888 -0.44544 -0.08765 0.33335 1.68405 #> #> Coefficients: #> Estimate Std. Error z value Pr(>|z|) #> (Intercept) -12.7051 4.6252 -2.747 0.00602 ** #> mpg 0.6809 0.2524 2.698 0.00697 ** #> am -3.0073 1.5995 -1.880 0.06009 . #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> (Dispersion parameter for binomial family taken to be 1) #> #> Null deviance: 43.860 on 31 degrees of freedom #> Residual deviance: 20.646 on 29 degrees of freedom #> AIC: 26.646 #> #> Number of Fisher Scoring iterations: 6

Multiple predictors with interactions

It is possible to test for interactions when there are multiple predictors. The interactions can be specified individually, as with a + b + c + a:b + b:c + a:b:c, or they can be expanded automatically, with a * b * c. It is possible to specify only a subset of the possible interactions, such as a + b + c + a:c. This case proceeds as above, but with a slight change: instead of the formula being vs ~ mpg + am, it is vs ~ mpg * am, which is equivalent to vs ~ mpg + am + mpg:am. # Do the logistic regression - both of these have the same effect. logr_vmai <- glm(vs ~ mpg * am, data=dat, family=binomial) logr_vmai <- glm(vs ~ mpg + am + mpg:am, data=dat, family=binomial) logr_vmai #> #> Call: glm(formula = vs ~ mpg + am + mpg:am, family = binomial, data = dat) #> #> Coefficients: #> (Intercept) mpg am mpg:am #> -20.4784 1.1084 10.1055 -0.6637 #> #> Degrees of Freedom: 31 Total (i.e. Null); 28 Residual #> Null Deviance: 43.86 #> Residual Deviance: 19.12 AIC: 27.12 summary(logr_vmai) #> #> Call: #> glm(formula = vs ~ mpg + am + mpg:am, family = binomial, data = dat) #> #> Deviance Residuals: #> Min 1Q Median 3Q Max #> -1.70566 -0.31124 -0.04817 0.28038 1.55603 #> #> Coefficients: #> Estimate Std. Error z value Pr(>|z|) #> (Intercept) -20.4784 10.5525 -1.941 0.0523 . #> mpg 1.1084 0.5770 1.921 0.0547 . #> am 10.1055 11.9104 0.848 0.3962 #> mpg:am -0.6637 0.6242 -1.063 0.2877 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> (Dispersion parameter for binomial family taken to be 1) #> #> Null deviance: 43.860 on 31 degrees of freedom #> Residual deviance: 19.125 on 28 degrees of freedom #> AIC: 27.125 #> #> Number of Fisher Scoring iterations: 7 TODO: Add comparison between interaction and non-interaction models.
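One common way to make that comparison is a likelihood-ratio test between the nested models, which asks whether the interaction term improves the fit. A sketch using the two models fitted above:

# Compare the no-interaction model (logr_vma) to the interaction model (logr_vmai)
anova(logr_vma, logr_vmai, test="Chisq")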

Homogeneity of variance

Problem

You want to test samples for homogeneity of variance (homoscedasticity), or more accurately, to test whether the populations they come from have equal variances. Many statistical tests assume that the populations are homoscedastic.

Solution

There are many ways of testing data for homogeneity of variance. Three methods are shown here. For all these tests, the null hypothesis is that all population variances are equal; the alternative hypothesis is that at least two of them differ.

Sample data

The examples here will use the InsectSprays and ToothGrowth data sets. The InsectSprays data set has one independent variable, while the ToothGrowth data set has two independent variables. head(InsectSprays) #> count spray #> 1 10 A #> 2 7 A #> 3 20 A #> 4 14 A #> 5 14 A #> 6 12 A tg <- ToothGrowth tg$dose <- factor(tg$dose) # Treat this column as a factor, not numeric head(tg) #> len supp dose #> 1 4.2 VC 0.5 #> 2 11.5 VC 0.5 #> 3 7.3 VC 0.5 #> 4 5.8 VC 0.5 #> 5 6.4 VC 0.5 #> 6 10.0 VC 0.5 Quick boxplots of these data sets: plot(count ~ spray, data = InsectSprays) plot(len ~ interaction(dose,supp), data=ToothGrowth) At first glance, it appears that both data sets are heteroscedastic, but this needs to be properly tested, which we’ll do below.
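Before running any formal test, the group variances themselves are easy to inspect directly, which can help when interpreting the test results below. A quick sketch using tapply():

# Variance of count within each spray group
tapply(InsectSprays$count, InsectSprays$spray, var)
# Variance of len within each supp-by-dose cell
tapply(ToothGrowth$len, interaction(ToothGrowth$supp, ToothGrowth$dose), var)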

Bartlett’s test

With one independent variable: bartlett.test(count ~ spray, data=InsectSprays) #> #> Bartlett test of homogeneity of variances #> #> data: count by spray #> Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05 # Same effect, but with two vectors, instead of two columns from a data frame # bartlett.test(InsectSprays$count ~ InsectSprays$spray) With multiple independent variables, the interaction() function must be used to collapse the IV’s into a single variable with all combinations of the factors. If it is not used, then there will be the wrong degrees of freedom, and the p-value will be wrong. bartlett.test(len ~ interaction(supp,dose), data=ToothGrowth) #> #> Bartlett test of homogeneity of variances #> #> data: len by interaction(supp, dose) #> Bartlett's K-squared = 6.9273, df = 5, p-value = 0.2261 # For comparison, testing len vs. dose alone, ignoring supp, gives a different result bartlett.test(len ~ dose, data=ToothGrowth) #> #> Bartlett test of homogeneity of variances #> #> data: len by dose #> Bartlett's K-squared = 0.66547, df = 2, p-value = 0.717

Levene’s test

The leveneTest function is part of the car package. With one independent variable: library(car) leveneTest(count ~ spray, data=InsectSprays) #> Levene's Test for Homogeneity of Variance (center = median) #> Df F value Pr(>F) #> group 5 3.8214 0.004223 ** #> 66 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 With two independent variables, the interaction() function is not needed, as it is for the other two tests: leveneTest(len ~ supp*dose, data=tg) #> Levene's Test for Homogeneity of Variance (center = median) #> Df F value Pr(>F) #> group 5 1.7086 0.1484 #> 54

Fligner-Killeen test

With one independent variable: fligner.test(count ~ spray, data=InsectSprays) #> #> Fligner-Killeen test of homogeneity of variances #> #> data: count by spray #> Fligner-Killeen:med chi-squared = 14.483, df = 5, p-value = 0.01282 # Same effect, but with two vectors, instead of two columns from a data frame # fligner.test(InsectSprays$count ~ InsectSprays$spray) The fligner.test function has the same quirks as bartlett.test when working with multiple IV’s. With multiple independent variables, the interaction() function must be used. fligner.test(len ~ interaction(supp,dose), data=ToothGrowth) #> #> Fligner-Killeen test of homogeneity of variances #> #> data: len by interaction(supp, dose) #> Fligner-Killeen:med chi-squared = 7.7488, df = 5, p-value = 0.1706 # For comparison, testing len vs. dose alone, ignoring supp, gives a different result fligner.test(len ~ dose, data=ToothGrowth) #> #> Fligner-Killeen test of homogeneity of variances #> #> data: len by dose #> Fligner-Killeen:med chi-squared = 1.3879, df = 2, p-value = 0.4996

Inter-rater reliability

Problem

You want to calculate inter-rater reliability.

Solution

The method for calculating inter-rater reliability will depend on the type of data (categorical, ordinal, or continuous) and the number of coders.

Categorical data

Suppose this is your data set. It consists of 30 cases, rated by three coders. It is a subset of the diagnoses data set in the irr package. library(irr) #> Loading required package: lpSolve data(diagnoses) dat <- diagnoses[,1:3] # rater1 rater2 rater3 # 4. Neurosis 4. Neurosis 4. Neurosis # 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder # 2. Personality Disorder 3. Schizophrenia 3. Schizophrenia # 5. Other 5. Other 5. Other # 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder # 1. Depression 1. Depression 3. Schizophrenia # 3. Schizophrenia 3. Schizophrenia 3. Schizophrenia # 1. Depression 1. Depression 3. Schizophrenia # 1. Depression 1. Depression 4. Neurosis # 5. Other 5. Other 5. Other # 1. Depression 4. Neurosis 4. Neurosis # 1. Depression 2. Personality Disorder 4. Neurosis # 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder # 1. Depression 4. Neurosis 4. Neurosis # 2. Personality Disorder 2. Personality Disorder 4. Neurosis # 3. Schizophrenia 3. Schizophrenia 3. Schizophrenia # 1. Depression 1. Depression 1. Depression # 1. Depression 1. Depression 1. Depression # 2. Personality Disorder 2. Personality Disorder 4. Neurosis # 1. Depression 3. Schizophrenia 3. Schizophrenia # 5. Other 5. Other 5. Other # 2. Personality Disorder 4. Neurosis 4. Neurosis # 2. Personality Disorder 2. Personality Disorder 4. Neurosis # 1. Depression 1. Depression 4. Neurosis # 1. Depression 4. Neurosis 4. Neurosis # 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder # 1. Depression 1. Depression 1. Depression # 2. Personality Disorder 2. Personality Disorder 4. Neurosis # 1. Depression 3. Schizophrenia 3. Schizophrenia # 5. Other 5. Other 5. Other

Two raters: Cohen’s Kappa

This will calculate Cohen’s Kappa for two coders – In this case, raters 1 and 2. kappa2(dat[,c(1,2)], "unweighted") #> Cohen's Kappa for 2 Raters (Weights: unweighted) #> #> Subjects = 30 #> Raters = 2 #> Kappa = 0.651 #> #> z = 7 #> p-value = 2.63e-12

N raters: Fleiss’s Kappa, Conger’s Kappa

If there are more than two raters, use Fleiss’s Kappa. kappam.fleiss(dat) #> Fleiss' Kappa for m Raters #> #> Subjects = 30 #> Raters = 3 #> Kappa = 0.534 #> #> z = 9.89 #> p-value = 0 It is also possible to use Conger’s (1980) exact Kappa. (Note that it is not clear to me when it is better or worse to use the exact method.) kappam.fleiss(dat, exact=TRUE) #> Fleiss' Kappa for m Raters (exact value) #> #> Subjects = 30 #> Raters = 3 #> Kappa = 0.55

Ordinal data: weighted Kappa

If the data is ordinal, then it may be appropriate to use a weighted Kappa. For example, if the possible values are low, medium, and high, then if a case were rated medium and high by the two coders, they would be in better agreement than if the ratings were low and high. We will use a subset of the anxiety data set from the irr package. library(irr) data(anxiety) dfa <- anxiety[,c(1,2)] dfa #> rater1 rater2 #> 1 3 3 #> 2 3 6 #> 3 3 4 #> 4 4 6 #> 5 5 2 #> 6 5 4 #> 7 2 2 #> 8 3 4 #> 9 5 3 #> 10 2 3 #> 11 2 2 #> 12 6 3 #> 13 1 3 #> 14 5 3 #> 15 2 2 #> 16 2 2 #> 17 1 1 #> 18 2 3 #> 19 4 3 #> 20 3 4 The weighted Kappa calculation must be made with 2 raters, and can use either linear or squared weights of the differences. # Compare raters 1 and 2 with squared weights kappa2(dfa, "squared") #> Cohen's Kappa for 2 Raters (Weights: squared) #> #> Subjects = 20 #> Raters = 2 #> Kappa = 0.297 #> #> z = 1.34 #> p-value = 0.18 # Use linear weights kappa2(dfa, "equal") #> Cohen's Kappa for 2 Raters (Weights: equal) #> #> Subjects = 20 #> Raters = 2 #> Kappa = 0.189 #> #> z = 1.42 #> p-value = 0.157 Compare the results above to the unweighted calculation (used in the tests for non-ordinal data above), which treats all differences as the same: kappa2(dfa, "unweighted") #> Cohen's Kappa for 2 Raters (Weights: unweighted) #> #> Subjects = 20 #> Raters = 2 #> Kappa = 0.119 #> #> z = 1.16 #> p-value = 0.245

Weighted Kappa with factors

The data above is numeric, but a weighted Kappa can also be calculated for factors. Note that the factor levels must be in the correct order, or results will be wrong. # Make a factor-ized version of the data dfa2 <- dfa dfa2$rater1 <- factor(dfa2$rater1, levels=1:6, labels=LETTERS[1:6]) dfa2$rater2 <- factor(dfa2$rater2, levels=1:6, labels=LETTERS[1:6]) dfa2 #> rater1 rater2 #> 1 C C #> 2 C F #> 3 C D #> 4 D F #> 5 E B #> 6 E D #> 7 B B #> 8 C D #> 9 E C #> 10 B C #> 11 B B #> 12 F C #> 13 A C #> 14 E C #> 15 B B #> 16 B B #> 17 A A #> 18 B C #> 19 D C #> 20 C D # The factor levels must be in the correct order: levels(dfa2$rater1) #> [1] "A" "B" "C" "D" "E" "F" levels(dfa2$rater2) #> [1] "A" "B" "C" "D" "E" "F" # The results are the same as with the numeric data, above kappa2(dfa2, "squared") #> Cohen's Kappa for 2 Raters (Weights: squared) #> #> Subjects = 20 #> Raters = 2 #> Kappa = 0.297 #> #> z = 1.34 #> p-value = 0.18 # Use linear weights kappa2(dfa2, "equal") #> Cohen's Kappa for 2 Raters (Weights: equal) #> #> Subjects = 20 #> Raters = 2 #> Kappa = 0.189 #> #> z = 1.42 #> p-value = 0.157

Continuous data: Intraclass correlation coefficient

When the variable is continuous, the intraclass correlation coefficient should be computed. From the documentation for icc: When considering which form of ICC is appropriate for an actual set of data, one has to take several decisions (Shrout & Fleiss, 1979):
  1. Should only the subjects be considered as random effects ("oneway" model, default) or are subjects and raters randomly chosen from a bigger pool of persons ("twoway" model).
  2. If differences in judges’ mean ratings are of interest, interrater "agreement" instead of "consistency" (default) should be computed.
  3. If the unit of analysis is a mean of several ratings, unit should be changed to "average". In most cases, however, single values (unit="single", default) are regarded.
We will use the anxiety data set from the irr package. library(irr) data(anxiety) anxiety #> rater1 rater2 rater3 #> 1 3 3 2 #> 2 3 6 1 #> 3 3 4 4 #> 4 4 6 4 #> 5 5 2 3 #> 6 5 4 2 #> 7 2 2 1 #> 8 3 4 6 #> 9 5 3 1 #> 10 2 3 1 #> 11 2 2 1 #> 12 6 3 2 #> 13 1 3 3 #> 14 5 3 3 #> 15 2 2 1 #> 16 2 2 1 #> 17 1 1 3 #> 18 2 3 3 #> 19 4 3 2 #> 20 3 4 2 # Just one of the many possible ICC coefficients icc(anxiety, model="twoway", type="agreement") #> Single Score Intraclass Correlation #> #> Model: twoway #> Type : agreement #> #> Subjects = 20 #> Raters = 3 #> ICC(A,1) = 0.198 #> #> F-Test, H0: r0 = 0 ; H1: r0 > 0 #> F(19,39.7) = 1.83 , p = 0.0543 #> #> 95%-Confidence Interval for ICC Population Values: #> -0.039 < ICC < 0.494

Bar and line graphs (ggplot2)

Problem

You want to make basic bar or line graphs.

Solution

To make graphs with ggplot2, the data must be in a data frame, and in “long” (as opposed to wide) format. If your data needs to be restructured, see this page for more information.

Basic graphs with discrete x-axis

With bar graphs, there are two different things that the heights of bars commonly represent:
x axis       Height of bar represents   Common name
Continuous   Count                      Histogram
Discrete     Count                      Bar graph
Continuous   Value                      Bar graph
Discrete     Value                      Bar graph
In ggplot2, the default for geom_bar is to use stat_count, so that the bar height represents the count of cases.
Bar graphs of values
Here is some sample data (derived from the tips dataset in the reshape2 package): dat <- data.frame( time = factor(c("Lunch","Dinner"), levels=c("Lunch","Dinner")), total_bill = c(14.89, 17.23) ) dat #> time total_bill #> 1 Lunch 14.89 #> 2 Dinner 17.23 # Load the ggplot2 package library(ggplot2) In these examples, the height of the bar will represent the value in a column of the data frame. This is done by using stat="identity" instead of the default, stat="count". These are the variable mappings used here: time is mapped to the x-axis and total_bill to the y-axis (the bar height). # Very basic bar graph ggplot(data=dat, aes(x=time, y=total_bill)) + geom_bar(stat="identity") # Map the time of day to different fill colors ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) + geom_bar(stat="identity") ## This would have the same result as above # ggplot(data=dat, aes(x=time, y=total_bill)) + # geom_bar(aes(fill=time), stat="identity") # Add a black outline ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) + geom_bar(colour="black", stat="identity") # No legend, since the information is redundant ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) + geom_bar(colour="black", stat="identity") + guides(fill=FALSE) The desired bar graph might look something like this: # Add title, narrower bars, fill color, and change axis labels ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) + geom_bar(colour="black", fill="#DD8888", width=.8, stat="identity") + guides(fill=FALSE) + xlab("Time of day") + ylab("Total bill") + ggtitle("Average bill for 2 people") See ../Colors (ggplot2) for more information on colors.
Bar graphs of counts
In these examples, the height of the bar will represent the count of cases. This is done by using stat="count" (which is the default). We’ll start with the tips data from the reshape2 package: library(reshape2) # Look at first several rows head(tips) #> total_bill tip sex smoker day time size #> 1 16.99 1.01 Female No Sun Dinner 2 #> 2 10.34 1.66 Male No Sun Dinner 3 #> 3 21.01 3.50 Male No Sun Dinner 3 #> 4 23.68 3.31 Male No Sun Dinner 2 #> 5 24.59 3.61 Female No Sun Dinner 4 #> 6 25.29 4.71 Male No Sun Dinner 4 To get a bar graph of counts, don’t map a variable to y, and use stat="count" (which is the default) instead of stat="identity": # Bar graph of counts ggplot(data=tips, aes(x=day)) + geom_bar(stat="count") ## Equivalent to this, since stat="count" is the default: # ggplot(data=tips, aes(x=day)) + # geom_bar()

Line graphs

For line graphs, the data points must be grouped so that ggplot2 knows which points to connect. In this case, it is simple – all points should be connected, so group=1. When more variables are used and multiple lines are drawn, the grouping for lines is usually done by variable (this is seen in later examples). These are the variable mappings used here: time is mapped to the x-axis and total_bill to the y-axis. # Basic line graph ggplot(data=dat, aes(x=time, y=total_bill, group=1)) + geom_line() ## This would have the same result as above # ggplot(data=dat, aes(x=time, y=total_bill)) + # geom_line(aes(group=1)) # Add points ggplot(data=dat, aes(x=time, y=total_bill, group=1)) + geom_line() + geom_point() # Change color of both line and points # Change line type and point type, and use thicker line and larger points # Change points to circles with white fill ggplot(data=dat, aes(x=time, y=total_bill, group=1)) + geom_line(colour="red", linetype="dashed", size=1.5) + geom_point(colour="red", size=4, shape=21, fill="white") The desired line graph might look something like this: # Change the y-range to go from 0 to the maximum value in the total_bill column, # and change axis labels ggplot(data=dat, aes(x=time, y=total_bill, group=1)) + geom_line() + geom_point() + expand_limits(y=0) + xlab("Time of day") + ylab("Total bill") + ggtitle("Average bill for 2 people") See ../Colors (ggplot2) for more information on colors, and ../Shapes and line types for information on shapes and line types.

Graphs with more variables

This data will be used for the examples below: dat1 <- data.frame( sex = factor(c("Female","Female","Male","Male")), time = factor(c("Lunch","Dinner","Lunch","Dinner"), levels=c("Lunch","Dinner")), total_bill = c(13.53, 16.81, 16.24, 17.42) ) dat1 #> sex time total_bill #> 1 Female Lunch 13.53 #> 2 Female Dinner 16.81 #> 3 Male Lunch 16.24 #> 4 Male Dinner 17.42 This is derived from the tips dataset in the reshape2 package.

Bar graphs

These are the variable mappings used here: time is mapped to the x-axis, total_bill to the y-axis, and sex to the fill color. # Stacked bar graph -- this is probably not what you want ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(stat="identity") # Bar graph, time on x-axis, color fill grouped by sex -- use position_dodge() ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(stat="identity", position=position_dodge()) ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(stat="identity", position=position_dodge(), colour="black") # Change colors ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(stat="identity", position=position_dodge(), colour="black") + scale_fill_manual(values=c("#999999", "#E69F00")) It’s easy to change which variable is mapped to the x-axis and which is mapped to the fill. # Bar graph, sex on x-axis, color fill grouped by time -- use position_dodge() ggplot(data=dat1, aes(x=sex, y=total_bill, fill=time)) + geom_bar(stat="identity", position=position_dodge(), colour="black") See ../Colors (ggplot2) for more information on colors.

Line graphs

These are the variable mappings used here: time is mapped to the x-axis, total_bill to the y-axis, and sex to the colour and shape of the lines and points. To draw multiple lines, the points must be grouped by a variable; otherwise all points will be connected by a single line. In this case, we want them to be grouped by sex. # Basic line graph with points ggplot(data=dat1, aes(x=time, y=total_bill, group=sex)) + geom_line() + geom_point() # Map sex to color ggplot(data=dat1, aes(x=time, y=total_bill, group=sex, colour=sex)) + geom_line() + geom_point() # Map sex to different point shape, and use larger points ggplot(data=dat1, aes(x=time, y=total_bill, group=sex, shape=sex)) + geom_line() + geom_point() # Use thicker lines and larger points, and hollow white-filled points ggplot(data=dat1, aes(x=time, y=total_bill, group=sex, shape=sex)) + geom_line(size=1.5) + geom_point(size=3, fill="white") + scale_shape_manual(values=c(22,21)) It’s easy to change which variable is mapped to the x-axis and which is mapped to the color or shape. ggplot(data=dat1, aes(x=sex, y=total_bill, group=time, shape=time, color=time)) + geom_line() + geom_point() See ../Colors (ggplot2) for more information on colors, and ../Shapes and line types for information on shapes and line types.

Finished examples

The finished graphs might look like these: # A bar graph ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(colour="black", stat="identity", position=position_dodge(), size=.3) + # Thinner lines scale_fill_hue(name="Sex of payer") + # Set legend title xlab("Time of day") + ylab("Total bill") + # Set axis labels ggtitle("Average bill for 2 people") + # Set title theme_bw() # A line graph ggplot(data=dat1, aes(x=time, y=total_bill, group=sex, shape=sex, colour=sex)) + geom_line(aes(linetype=sex), size=1) + # Set linetype by sex geom_point(size=3, fill="white") + # Use larger points, fill with white expand_limits(y=0) + # Set y range to include 0 scale_colour_hue(name="Sex of payer", # Set legend title l=30) + # Use darker colors (lightness=30) scale_shape_manual(name="Sex of payer", values=c(22,21)) + # Use points with a fill color scale_linetype_discrete(name="Sex of payer") + xlab("Time of day") + ylab("Total bill") + # Set axis labels ggtitle("Average bill for 2 people") + # Set title theme_bw() + theme(legend.position=c(.7, .4)) # Position legend inside # This must go after theme_bw In the line graph, the reason that the legend title, “Sex of payer”, must be specified three times is so that there is only one legend. The issue is explained here.

With a numeric x-axis

When the variable on the x-axis is numeric, it is sometimes useful to treat it as continuous, and sometimes useful to treat it as categorical. In this data set, the dose is a numeric variable with values 0.5, 1.0, and 2.0. It might be useful to treat these values as equal categories when making a graph. datn <- read.table(header=TRUE, text=' supp dose length OJ 0.5 13.23 OJ 1.0 22.70 OJ 2.0 26.06 VC 0.5 7.98 VC 1.0 16.77 VC 2.0 26.14 ') This is derived from the ToothGrowth dataset included with R.

With x-axis treated as continuous

A simple graph might put dose on the x-axis as a numeric value. It is possible to make a line graph this way, but not a bar graph. ggplot(data=datn, aes(x=dose, y=length, group=supp, colour=supp)) + geom_line() + geom_point()

With x-axis treated as categorical

If you wish to treat it as a categorical variable instead of a numeric one, it must be converted to a factor. This can be done by modifying the data frame, or by changing the specification of the graph. # Copy the data frame and convert dose to a factor datn2 <- datn datn2$dose <- factor(datn2$dose) ggplot(data=datn2, aes(x=dose, y=length, group=supp, colour=supp)) + geom_line() + geom_point() # Use the original data frame, but put factor() directly in the plot specification ggplot(data=datn, aes(x=factor(dose), y=length, group=supp, colour=supp)) + geom_line() + geom_point() It is also possible to make a bar graph when the variable is treated as categorical rather than numeric. # Use datn2 from above ggplot(data=datn2, aes(x=dose, y=length, fill=supp)) + geom_bar(stat="identity", position=position_dodge()) # Use the original data frame, but put factor() directly in the plot specification ggplot(data=datn, aes(x=factor(dose), y=length, fill=supp)) + geom_bar(stat="identity", position=position_dodge())

Plotting means and error bars (ggplot2)

Problem

You want to plot means and error bars for a dataset.

Solution

To make graphs with ggplot2, the data must be in a data frame, and in “long” (as opposed to wide) format. If your data needs to be restructured, see this page for more information.

Sample data

The examples below will use the ToothGrowth dataset. Note that dose is a numeric column here; in some situations it may be useful to convert it to a factor. tg <- ToothGrowth head(tg) #> len supp dose #> 1 4.2 VC 0.5 #> 2 11.5 VC 0.5 #> 3 7.3 VC 0.5 #> 4 5.8 VC 0.5 #> 5 6.4 VC 0.5 #> 6 10.0 VC 0.5 library(ggplot2) First, it is necessary to summarize the data. This can be done in a number of ways, as described on this page. In this case, we’ll use the summarySE() function defined on that page, and also at the bottom of this page. (The code for the summarySE function must be entered before it is called here). # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval tgc <- summarySE(tg, measurevar="len", groupvars=c("supp","dose")) tgc #> supp dose N len sd se ci #> 1 OJ 0.5 10 13.23 4.459709 1.4102837 3.190283 #> 2 OJ 1.0 10 22.70 3.910953 1.2367520 2.797727 #> 3 OJ 2.0 10 26.06 2.655058 0.8396031 1.899314 #> 4 VC 0.5 10 7.98 2.746634 0.8685620 1.964824 #> 5 VC 1.0 10 16.77 2.515309 0.7954104 1.799343 #> 6 VC 2.0 10 26.14 4.797731 1.5171757 3.432090

Line graphs

After the data is summarized, we can make the graph. These are basic line and point graphs with error bars representing either the standard error of the mean or the 95% confidence interval. # Standard error of the mean ggplot(tgc, aes(x=dose, y=len, colour=supp)) + geom_errorbar(aes(ymin=len-se, ymax=len+se), width=.1) + geom_line() + geom_point() # The errorbars overlapped, so use position_dodge to move them horizontally pd <- position_dodge(0.1) # move them .05 to the left and right ggplot(tgc, aes(x=dose, y=len, colour=supp)) + geom_errorbar(aes(ymin=len-se, ymax=len+se), width=.1, position=pd) + geom_line(position=pd) + geom_point(position=pd) # Use 95% confidence interval instead of SEM ggplot(tgc, aes(x=dose, y=len, colour=supp)) + geom_errorbar(aes(ymin=len-ci, ymax=len+ci), width=.1, position=pd) + geom_line(position=pd) + geom_point(position=pd) # Black error bars - notice the mapping of 'group=supp' -- without it, the error # bars won't be dodged! ggplot(tgc, aes(x=dose, y=len, colour=supp, group=supp)) + geom_errorbar(aes(ymin=len-ci, ymax=len+ci), colour="black", width=.1, position=pd) + geom_line(position=pd) + geom_point(position=pd, size=3) A finished graph with error bars representing the standard error of the mean might look like this. The points are drawn last so that the white fill goes on top of the lines and error bars. ggplot(tgc, aes(x=dose, y=len, colour=supp, group=supp)) + geom_errorbar(aes(ymin=len-se, ymax=len+se), colour="black", width=.1, position=pd) + geom_line(position=pd) + geom_point(position=pd, size=3, shape=21, fill="white") + # 21 is filled circle xlab("Dose (mg)") + ylab("Tooth length") + scale_colour_hue(name="Supplement type", # Legend label, use darker colors breaks=c("OJ", "VC"), labels=c("Orange juice", "Ascorbic acid"), l=40) + # Use darker colors, lightness=40 ggtitle("The Effect of Vitamin C on\nTooth Growth in Guinea Pigs") + expand_limits(y=0) + # Expand y range scale_y_continuous(breaks=0:20*4) + # Set tick every 4 theme_bw() + theme(legend.justification=c(1,0), legend.position=c(1,0)) # Position legend in bottom right

Bar graphs

The procedure is similar for bar graphs. Note that tgc2$dose must be a factor. If it is a numeric vector, then it will not work. # Use dose as a factor rather than numeric tgc2 <- tgc tgc2$dose <- factor(tgc2$dose) # Error bars represent standard error of the mean ggplot(tgc2, aes(x=dose, y=len, fill=supp)) + geom_bar(position=position_dodge(), stat="identity") + geom_errorbar(aes(ymin=len-se, ymax=len+se), width=.2, # Width of the error bars position=position_dodge(.9)) # Use 95% confidence intervals instead of SEM ggplot(tgc2, aes(x=dose, y=len, fill=supp)) + geom_bar(position=position_dodge(), stat="identity") + geom_errorbar(aes(ymin=len-ci, ymax=len+ci), width=.2, # Width of the error bars position=position_dodge(.9)) A finished graph might look like this. ggplot(tgc2, aes(x=dose, y=len, fill=supp)) + geom_bar(position=position_dodge(), stat="identity", colour="black", # Use black outlines, size=.3) + # Thinner lines geom_errorbar(aes(ymin=len-se, ymax=len+se), size=.3, # Thinner lines width=.2, position=position_dodge(.9)) + xlab("Dose (mg)") + ylab("Tooth length") + scale_fill_hue(name="Supplement type", # Legend label, use darker colors breaks=c("OJ", "VC"), labels=c("Orange juice", "Ascorbic acid")) + ggtitle("The Effect of Vitamin C on\nTooth Growth in Guinea Pigs") + scale_y_continuous(breaks=0:20*4) + theme_bw()

Error bars for within-subjects variables

When all variables are between-subjects, it is straightforward to plot standard error or confidence intervals. However, when there are within-subjects variables (repeated measures), plotting the standard error or regular confidence intervals may be misleading for making inferences about differences between conditions. The method below is from Morey (2008), which is a correction to Cousineau (2005), which in turn is meant to be a simpler version of the method in Loftus and Masson (1994). See these papers for a more detailed treatment of the issues involved in error bars with within-subjects variables.

One within-subjects variable

Here is a data set (from Morey 2008) with one within-subjects variable: pre/post-test. dfw <- read.table(header=TRUE, text=' subject pretest posttest 1 59.4 64.5 2 46.4 52.4 3 46.0 49.7 4 49.0 48.7 5 32.5 37.4 6 45.2 49.5 7 60.3 59.9 8 54.3 54.1 9 45.4 49.6 10 38.9 48.5 ') # Treat subject ID as a factor dfw$subject <- factor(dfw$subject) The first step is to convert it to long format. See this page for more information about the conversion. # Convert to long format library(reshape2) dfw_long <- melt(dfw, id.vars = "subject", measure.vars = c("pretest","posttest"), variable.name = "condition") dfw_long #> subject condition value #> 1 1 pretest 59.4 #> 2 2 pretest 46.4 #> 3 3 pretest 46.0 #> 4 4 pretest 49.0 #> 5 5 pretest 32.5 #> 6 6 pretest 45.2 #> 7 7 pretest 60.3 #> 8 8 pretest 54.3 #> 9 9 pretest 45.4 #> 10 10 pretest 38.9 #> 11 1 posttest 64.5 #> 12 2 posttest 52.4 #> 13 3 posttest 49.7 #> 14 4 posttest 48.7 #> 15 5 posttest 37.4 #> 16 6 posttest 49.5 #> 17 7 posttest 59.9 #> 18 8 posttest 54.1 #> 19 9 posttest 49.6 #> 20 10 posttest 48.5 Collapse the data using summarySEwithin (defined at the bottom of this page; both of the helper functions below must be entered before the function is called here). dfwc <- summarySEwithin(dfw_long, measurevar="value", withinvars="condition", idvar="subject", na.rm=FALSE, conf.interval=.95) dfwc #> condition N value value_norm sd se ci #> 1 posttest 10 51.43 51.43 2.262361 0.7154214 1.618396 #> 2 pretest 10 47.74 47.74 2.262361 0.7154214 1.618396 library(ggplot2) # Make the graph with the 95% confidence interval ggplot(dfwc, aes(x=condition, y=value, group=1)) + geom_line() + geom_errorbar(width=.1, aes(ymin=value-ci, ymax=value+ci)) + geom_point(shape=21, size=3, fill="white") + ylim(40,60) The value and value_norm columns represent the un-normed and normed means. See the section below on normed means for more information.

Understanding within-subjects error bars

This section explains how the within-subjects error bar values are calculated. The steps here are for explanation purposes only; they are not necessary for making the error bars. The graph of individual data shows that there is a consistent trend for the within-subjects variable condition, but this would not necessarily be revealed by taking the regular standard errors (or confidence intervals) for each group. The method in Morey (2008) and Cousineau (2005) essentially normalizes the data to remove the between-subject variability and calculates the variance from this normalized data. # Use a consistent y range ymax <- max(dfw_long$value) ymin <- min(dfw_long$value) # Plot the individuals ggplot(dfw_long, aes(x=condition, y=value, colour=subject, group=subject)) + geom_line() + geom_point(shape=21, fill="white") + ylim(ymin,ymax) # Create the normed version of the data dfwNorm.long <- normDataWithin(data=dfw_long, idvar="subject", measurevar="value") # Plot the normed individuals ggplot(dfwNorm.long, aes(x=condition, y=value_norm, colour=subject, group=subject)) + geom_line() + geom_point(shape=21, fill="white") + ylim(ymin,ymax) The differences in the error bars for the regular (between-subject) method and the within-subject method are shown here. The regular error bars are in red, and the within-subject error bars are in black. # Instead of summarySEwithin, use summarySE, which treats condition as though it were a between-subjects variable dfwc_between <- summarySE(data=dfw_long, measurevar="value", groupvars="condition", na.rm=FALSE, conf.interval=.95) dfwc_between #> condition N value sd se ci #> 1 pretest 10 47.74 8.598992 2.719240 6.151348 #> 2 posttest 10 51.43 7.253972 2.293907 5.189179 # Show the between-S CI's in red, and the within-S CI's in black ggplot(dfwc_between, aes(x=condition, y=value, group=1)) + geom_line() + geom_errorbar(width=.1, aes(ymin=value-ci, ymax=value+ci), colour="red") + geom_errorbar(width=.1, aes(ymin=value-ci, ymax=value+ci), data=dfwc) + geom_point(shape=21, size=3, fill="white") + ylim(ymin,ymax)

Two within-subjects variables

If there is more than one within-subjects variable, the same function, summarySEwithin, can be used. This data set is taken from Hays (1994), and used for making this type of within-subject error bar in Rouder and Morey (2005). data <- read.table(header=TRUE, text=' Subject RoundMono SquareMono RoundColor SquareColor 1 41 40 41 37 2 57 56 56 53 3 52 53 53 50 4 49 47 47 47 5 47 48 48 47 6 37 34 35 36 7 47 50 47 46 8 41 40 38 40 9 48 47 49 45 10 37 35 36 35 11 32 31 31 33 12 47 42 42 42 ') The data must first be converted to long format. In this case, the column names indicate two variables, shape (round/square) and color scheme (monochromatic/colored). # Convert it to long format library(reshape2) data_long <- melt(data=data, id.var="Subject", measure.vars=c("RoundMono", "SquareMono", "RoundColor", "SquareColor"), variable.name="Condition") names(data_long)[names(data_long)=="value"] <- "Time" # Split Condition column into Shape and ColorScheme data_long$Shape <- NA data_long$Shape[grepl("^Round", data_long$Condition)] <- "Round" data_long$Shape[grepl("^Square", data_long$Condition)] <- "Square" data_long$Shape <- factor(data_long$Shape) data_long$ColorScheme <- NA data_long$ColorScheme[grepl("Mono$", data_long$Condition)] <- "Monochromatic" data_long$ColorScheme[grepl("Color$", data_long$Condition)] <- "Colored" data_long$ColorScheme <- factor(data_long$ColorScheme, levels=c("Monochromatic","Colored")) # Remove the Condition column now data_long$Condition <- NULL # Look at first few rows head(data_long) #> Subject Time Shape ColorScheme #> 1 1 41 Round Monochromatic #> 2 2 57 Round Monochromatic #> 3 3 52 Round Monochromatic #> 4 4 49 Round Monochromatic #> 5 5 47 Round Monochromatic #> 6 6 37 Round Monochromatic Now it can be summarized and graphed. datac <- summarySEwithin(data_long, measurevar="Time", withinvars=c("Shape","ColorScheme"), idvar="Subject") datac #> Shape ColorScheme N Time Time_norm sd se ci #> 1 Round Colored 12 43.58333 43.58333 1.212311 0.3499639 0.7702654 #> 2 Round Monochromatic 12 44.58333 44.58333 1.331438 0.3843531 0.8459554 #> 3 Square Colored 12 42.58333 42.58333 1.461630 0.4219364 0.9286757 #> 4 Square Monochromatic 12 43.58333 43.58333 1.261312 0.3641095 0.8013997 library(ggplot2) ggplot(datac, aes(x=Shape, y=Time, fill=ColorScheme)) + geom_bar(position=position_dodge(.9), colour="black", stat="identity") + geom_errorbar(position=position_dodge(.9), width=.25, aes(ymin=Time-ci, ymax=Time+ci)) + coord_cartesian(ylim=c(40,46)) + scale_fill_manual(values=c("#CCCCCC","#FFFFFF")) + scale_y_continuous(breaks=seq(1, 100, 1)) + theme_bw() + geom_hline(yintercept=38)

Note about normed means

The summarySEwithin function returns both normed and un-normed means. The un-normed means are simply the mean of each group. The normed means are calculated so that the means of each between-subject group are the same. These values can diverge when there are between-subject variables. For example: dat <- read.table(header=TRUE, text=' id trial gender dv A 0 male 2 A 1 male 4 B 0 male 6 B 1 male 8 C 0 female 22 C 1 female 24 D 0 female 26 D 1 female 28 ') # normed and un-normed means are different summarySEwithin(dat, measurevar="dv", withinvars="trial", betweenvars="gender", idvar="id") #> Automatically converting the following non-factors to factors: trial #> gender trial N dv dv_norm sd se ci #> 1 female 0 2 24 14 0 0 0 #> 2 female 1 2 26 16 0 0 0 #> 3 male 0 2 4 14 0 0 0 #> 4 male 1 2 6 16 0 0 0
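To see where the normed values above come from, note that the grand mean of dv is 15 and each subject's scores are shifted so that the subject's mean lands on it. A quick arithmetic check for subject A (scores 2 and 4, subject mean 3):
mean(dat$dv)
#> [1] 15
# normed score = score - subject mean + grand mean
2 - 3 + 15   # 14, the dv_norm for A's trial 0
4 - 3 + 15   # 16, the dv_norm for A's trial 1
The same shift applied to subjects B, C, and D also yields 14 and 16, which is why the normed means in the output are identical across genders.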

Helper functions

The summarySE function is also defined on this page. If you are only working with between-subjects variables, that is the only function you will need in your code. If you have within-subjects variables and want to adjust the error bars so that inter-subject variability is removed as in Loftus and Masson (1994), then the other two functions, normDataWithin and summarySEwithin, must also be added to your code; summarySEwithin will then be the function that you call. ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%). ## data: a data frame. ## measurevar: the name of a column that contains the variable to be summarized ## groupvars: a vector containing names of columns that contain grouping variables ## na.rm: a boolean that indicates whether to ignore NA's ## conf.interval: the percent range of the confidence interval (default is 95%) summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95, .drop=TRUE) { library(plyr) # New version of length which can handle NA's: if na.rm==T, don't count them length2 <- function (x, na.rm=FALSE) { if (na.rm) sum(!is.na(x)) else length(x) } # This does the summary. For each group's data frame, return a vector with # N, mean, and sd datac <- ddply(data, groupvars, .drop=.drop, .fun = function(xx, col) { c(N = length2(xx[[col]], na.rm=na.rm), mean = mean (xx[[col]], na.rm=na.rm), sd = sd (xx[[col]], na.rm=na.rm) ) }, measurevar ) # Rename the "mean" column datac <- rename(datac, c("mean" = measurevar)) datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean # Confidence interval multiplier for standard error # Calculate t-statistic for confidence interval: # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1 ciMult <- qt(conf.interval/2 + .5, datac$N-1) datac$ci <- datac$se * ciMult return(datac) } ## Norms the data within specified groups in a data frame; it normalizes each ## subject (identified by idvar) so that they have the same mean, within each group ## specified by betweenvars. ## data: a data frame. ## idvar: the name of a column that identifies each subject (or matched subjects) ## measurevar: the name of a column that contains the variable to be summarized ## betweenvars: a vector containing names of columns that are between-subjects variables ## na.rm: a boolean that indicates whether to ignore NA's normDataWithin <- function(data=NULL, idvar, measurevar, betweenvars=NULL, na.rm=FALSE, .drop=TRUE) { library(plyr) # Measure var on left, idvar + between vars on right of formula. data.subjMean <- ddply(data, c(idvar, betweenvars), .drop=.drop, .fun = function(xx, col, na.rm) { c(subjMean = mean(xx[,col], na.rm=na.rm)) }, measurevar, na.rm ) # Put the subject means with original data data <- merge(data, data.subjMean) # Get the normalized data in a new column measureNormedVar <- paste(measurevar, "_norm", sep="") data[,measureNormedVar] <- data[,measurevar] - data[,"subjMean"] + mean(data[,measurevar], na.rm=na.rm) # Remove this subject mean column data$subjMean <- NULL return(data) } ## Summarizes data, handling within-subjects variables by removing inter-subject variability. ## It will still work if there are no within-S variables. ## Gives count, un-normed mean, normed mean (with same between-group mean), ## standard deviation, standard error of the mean, and confidence interval. ## If there are within-subject variables, calculate adjusted values using method from Morey (2008). ## data: a data frame.
## measurevar: the name of a column that contains the variable to be summarized ## betweenvars: a vector containing names of columns that are between-subjects variables ## withinvars: a vector containing names of columns that are within-subjects variables ## idvar: the name of a column that identifies each subject (or matched subjects) ## na.rm: a boolean that indicates whether to ignore NA's ## conf.interval: the percent range of the confidence interval (default is 95%) summarySEwithin <- function(data=NULL, measurevar, betweenvars=NULL, withinvars=NULL, idvar=NULL, na.rm=FALSE, conf.interval=.95, .drop=TRUE) { # Ensure that the betweenvars and withinvars are factors factorvars <- vapply(data[, c(betweenvars, withinvars), drop=FALSE], FUN=is.factor, FUN.VALUE=logical(1)) if (!all(factorvars)) { nonfactorvars <- names(factorvars)[!factorvars] message("Automatically converting the following non-factors to factors: ", paste(nonfactorvars, collapse = ", ")) data[nonfactorvars] <- lapply(data[nonfactorvars], factor) } # Get the means from the un-normed data datac <- summarySE(data, measurevar, groupvars=c(betweenvars, withinvars), na.rm=na.rm, conf.interval=conf.interval, .drop=.drop) # Drop all the unused columns (these will be calculated with normed data) datac$sd <- NULL datac$se <- NULL datac$ci <- NULL # Norm each subject's data ndata <- normDataWithin(data, idvar, measurevar, betweenvars, na.rm, .drop=.drop) # This is the name of the new column measurevar_n <- paste(measurevar, "_norm", sep="") # Collapse the normed data - now we can treat between and within vars the same ndatac <- summarySE(ndata, measurevar_n, groupvars=c(betweenvars, withinvars), na.rm=na.rm, conf.interval=conf.interval, .drop=.drop) # Apply correction from Morey (2008) to the standard error and confidence interval # Get the product of the number of conditions of within-S variables nWithinGroups <- prod(vapply(ndatac[,withinvars, drop=FALSE], FUN=nlevels, FUN.VALUE=numeric(1))) correctionFactor <- sqrt( nWithinGroups / (nWithinGroups-1) ) # Apply the correction factor ndatac$sd <- ndatac$sd * correctionFactor ndatac$se <- ndatac$se * correctionFactor ndatac$ci <- ndatac$ci * correctionFactor # Combine the un-normed means with the normed results merge(datac, ndatac) }
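As a quick arithmetic check of the Morey (2008) correction used above: with a single two-level within-subjects factor (such as the pretest/posttest example), nWithinGroups is 2, so the sd, se, and ci of the normed data are all inflated by the same factor:
# Correction factor for 2 within-subject conditions
sqrt(2 / (2 - 1))
#> [1] 1.414214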

Plotting distributions (ggplot2)

Problem

You want to plot a distribution of data.

Solution

This sample data will be used for the examples below: set.seed(1234) dat <- data.frame(cond = factor(rep(c("A","B"), each=200)), rating = c(rnorm(200),rnorm(200, mean=.8))) # View first few rows head(dat) #> cond rating #> 1 A -1.2070657 #> 2 A 0.2774292 #> 3 A 1.0844412 #> 4 A -2.3456977 #> 5 A 0.4291247 #> 6 A 0.5060559 library(ggplot2)

Histogram and density plots

The qplot function is supposed to make the same graphs as ggplot, but with a simpler syntax. However, in practice, it’s often easier to just use ggplot because the options for qplot can be more confusing to use. ## Basic histogram from the vector "rating". Each bin is .5 wide. ## These both result in the same output: ggplot(dat, aes(x=rating)) + geom_histogram(binwidth=.5) # qplot(dat$rating, binwidth=.5) # Draw with black outline, white fill ggplot(dat, aes(x=rating)) + geom_histogram(binwidth=.5, colour="black", fill="white") # Density curve ggplot(dat, aes(x=rating)) + geom_density() # Histogram overlaid with kernel density curve ggplot(dat, aes(x=rating)) + geom_histogram(aes(y=..density..), # Histogram with density instead of count on y-axis binwidth=.5, colour="black", fill="white") + geom_density(alpha=.2, fill="#FF6666") # Overlay with transparent density plot Add a line for the mean: ggplot(dat, aes(x=rating)) + geom_histogram(binwidth=.5, colour="black", fill="white") + geom_vline(aes(xintercept=mean(rating, na.rm=T)), # Ignore NA values for mean color="red", linetype="dashed", size=1)

Histogram and density plots with multiple groups

# Overlaid histograms ggplot(dat, aes(x=rating, fill=cond)) + geom_histogram(binwidth=.5, alpha=.5, position="identity") # Interleaved histograms ggplot(dat, aes(x=rating, fill=cond)) + geom_histogram(binwidth=.5, position="dodge") # Density plots ggplot(dat, aes(x=rating, colour=cond)) + geom_density() # Density plots with semi-transparent fill ggplot(dat, aes(x=rating, fill=cond)) + geom_density(alpha=.3) Adding lines for each mean requires first creating a separate data frame with the means: # Find the mean of each group library(plyr) cdat <- ddply(dat, "cond", summarise, rating.mean=mean(rating)) cdat #> cond rating.mean #> 1 A -0.05775928 #> 2 B 0.87324927 # Overlaid histograms with means ggplot(dat, aes(x=rating, fill=cond)) + geom_histogram(binwidth=.5, alpha=.5, position="identity") + geom_vline(data=cdat, aes(xintercept=rating.mean, colour=cond), linetype="dashed", size=1) # Density plots with means ggplot(dat, aes(x=rating, colour=cond)) + geom_density() + geom_vline(data=cdat, aes(xintercept=rating.mean, colour=cond), linetype="dashed", size=1) Using facets: ggplot(dat, aes(x=rating)) + geom_histogram(binwidth=.5, colour="black", fill="white") + facet_grid(cond ~ .) # With mean lines, using cdat from above ggplot(dat, aes(x=rating)) + geom_histogram(binwidth=.5, colour="black", fill="white") + facet_grid(cond ~ .) + geom_vline(data=cdat, aes(xintercept=rating.mean), linetype="dashed", size=1, colour="red") See Facets (ggplot2) for more details.

Box plots

# A basic box plot ggplot(dat, aes(x=cond, y=rating)) + geom_boxplot() # A basic box with the conditions colored ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot() # The above adds a redundant legend. With the legend removed: ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot() + guides(fill=FALSE) # With flipped axes ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot() + guides(fill=FALSE) + coord_flip() It’s also possible to add the mean by using stat_summary. # Add a diamond at the mean, and make it larger ggplot(dat, aes(x=cond, y=rating)) + geom_boxplot() + stat_summary(fun.y=mean, geom="point", shape=5, size=4)

Scatterplots (ggplot2)

Problem

You want to make a scatterplot.

Solution

Suppose this is your data: set.seed(955) # Make some noisily increasing data dat <- data.frame(cond = rep(c("A", "B"), each=10), xvar = 1:20 + rnorm(20,sd=3), yvar = 1:20 + rnorm(20,sd=3)) head(dat) #> cond xvar yvar #> 1 A -4.252354 3.473157275 #> 2 A 1.702318 0.005939612 #> 3 A 4.323054 -0.094252427 #> 4 A 1.780628 2.072808278 #> 5 A 11.537348 1.215440358 #> 6 A 6.672130 3.608111411 library(ggplot2)

Basic scatterplots with regression lines

ggplot(dat, aes(x=xvar, y=yvar)) + geom_point(shape=1) # Use hollow circles ggplot(dat, aes(x=xvar, y=yvar)) + geom_point(shape=1) + # Use hollow circles geom_smooth(method=lm) # Add linear regression line # (by default includes 95% confidence region) ggplot(dat, aes(x=xvar, y=yvar)) + geom_point(shape=1) + # Use hollow circles geom_smooth(method=lm, # Add linear regression line se=FALSE) # Don't add shaded confidence region ggplot(dat, aes(x=xvar, y=yvar)) + geom_point(shape=1) + # Use hollow circles geom_smooth() # Add a loess smoothed fit curve with confidence region #> `geom_smooth()` using method = 'loess'

Set color/shape by another variable

# Set color by cond ggplot(dat, aes(x=xvar, y=yvar, color=cond)) + geom_point(shape=1) # Same, but with different colors and add regression lines ggplot(dat, aes(x=xvar, y=yvar, color=cond)) + geom_point(shape=1) + scale_colour_hue(l=50) + # Use a slightly darker palette than normal geom_smooth(method=lm, # Add linear regression lines se=FALSE) # Don't add shaded confidence region # Extend the regression lines beyond the domain of the data ggplot(dat, aes(x=xvar, y=yvar, color=cond)) + geom_point(shape=1) + scale_colour_hue(l=50) + # Use a slightly darker palette than normal geom_smooth(method=lm, # Add linear regression lines se=FALSE, # Don't add shaded confidence region fullrange=TRUE) # Extend regression lines # Set shape by cond ggplot(dat, aes(x=xvar, y=yvar, shape=cond)) + geom_point() # Same, but with different shapes ggplot(dat, aes(x=xvar, y=yvar, shape=cond)) + geom_point() + scale_shape_manual(values=c(1,2)) # Use a hollow circle and triangle See Colors (ggplot2) and Shapes and line types for more information about colors and shapes.

Handling overplotting

If you have many data points, or if your data scales are discrete, then the data points might overlap and it will be impossible to see if there are many points at the same location. # Round xvar and yvar to the nearest 5 dat$xrnd <- round(dat$xvar/5)*5 dat$yrnd <- round(dat$yvar/5)*5 # Make each dot partially transparent, with 1/4 opacity # For heavy overplotting, try using smaller values ggplot(dat, aes(x=xrnd, y=yrnd)) + geom_point(shape=19, # Use solid circles alpha=1/4) # 1/4 opacity # Jitter the points # Jitter range is 1 on the x-axis, .5 on the y-axis ggplot(dat, aes(x=xrnd, y=yrnd)) + geom_point(shape=1, # Use hollow circles position=position_jitter(width=1,height=.5))

Titles (ggplot2)

Problem

You want to set the title of your graph.

Solution

An example graph without a title: library(ggplot2) bp <- ggplot(PlantGrowth, aes(x=group, y=weight)) + geom_boxplot() bp With a title: bp + ggtitle("Plant growth") ## Equivalent to # bp + labs(title="Plant growth") # If the title is long, it can be split into multiple lines with \n bp + ggtitle("Plant growth with\ndifferent treatments") # Reduce line spacing and use bold text bp + ggtitle("Plant growth with\ndifferent treatments") + theme(plot.title = element_text(lineheight=.8, face="bold"))

Axes (ggplot2)

Problem

You want to change the order or direction of the axes.

Solution

Note: In the examples below, where it says something like scale_y_continuous, scale_x_continuous, or ylim, the y can be replaced with x if you want to operate on the other axis. This is the basic boxplot that we will work with, using the built-in PlantGrowth data set. library(ggplot2) bp <- ggplot(PlantGrowth, aes(x=group, y=weight)) + geom_boxplot() bp

Swapping X and Y axes

Swap x and y axes (make x vertical, y horizontal): bp + coord_flip()

Discrete axis

Changing the order of items

# Manually set the order of a discrete-valued axis bp + scale_x_discrete(limits=c("trt1","trt2","ctrl")) # Reverse the order of a discrete-valued axis # Get the levels of the factor flevels <- levels(PlantGrowth$group) flevels #> [1] "ctrl" "trt1" "trt2" # Reverse the order flevels <- rev(flevels) flevels #> [1] "trt2" "trt1" "ctrl" bp + scale_x_discrete(limits=flevels) # Or it can be done in one line: bp + scale_x_discrete(limits = rev(levels(PlantGrowth$group)))

Setting tick mark labels

For discrete variables, the tick mark labels are taken directly from levels of the factor. However, sometimes the factor levels have short names that aren’t suitable for presentation. bp + scale_x_discrete(breaks=c("ctrl", "trt1", "trt2"), labels=c("Control", "Treat 1", "Treat 2")) # Hide x tick marks, labels, and grid lines bp + scale_x_discrete(breaks=NULL) # Hide all tick marks and labels (on X axis), but keep the gridlines bp + theme(axis.ticks = element_blank(), axis.text.x = element_blank())

Continuous axis

Setting range and reversing direction of an axis

If you simply want to make sure that an axis includes a particular value in the range, use expand_limits(). This can only expand the range of an axis; it can’t shrink the range. # Make sure to include 0 in the y axis bp + expand_limits(y=0) # Make sure to include 0 and 8 in the y axis bp + expand_limits(y=c(0,8)) You can also explicitly set the y limits. Note that if any scale_y_continuous command is used, it overrides any ylim command, and the ylim will be ignored. # Set the range of a continuous-valued axis # These are equivalent bp + ylim(0, 8) # bp + scale_y_continuous(limits=c(0, 8)) If the y range is reduced using the method above, the data outside the range is ignored. This might be OK for a scatterplot, but it can be problematic for the box plots used here. For bar graphs, if the range does not include 0, the bars will not show at all! To avoid this problem, you can use coord_cartesian instead. Instead of setting the limits of the data, it sets the viewing area of the data. # These two do the same thing; all data points outside the graphing range are # dropped, resulting in a misleading box plot bp + ylim(5, 7.5) #> Warning: Removed 13 rows containing non-finite values (stat_boxplot). # bp + scale_y_continuous(limits=c(5, 7.5)) # Using coord_cartesian "zooms" into the area bp + coord_cartesian(ylim=c(5, 7.5)) # Specify tick marks directly bp + coord_cartesian(ylim=c(5, 7.5)) + scale_y_continuous(breaks=seq(0, 10, 0.25)) # Ticks from 0-10, every .25
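To make the bar graph warning concrete, here is a minimal sketch with a small hypothetical data frame (bardat and b are names invented for this example):
# Hypothetical bars at 6 and 7
bardat <- data.frame(cond=c("A","B"), yval=c(6, 7))
b <- ggplot(bardat, aes(x=cond, y=yval)) + geom_bar(stat="identity")
b + ylim(5, 7.5)                  # Bars vanish: each bar spans 0 to yval, and 0 is outside the limits
b + coord_cartesian(ylim=c(5, 7.5)) # Bars are merely clipped to the viewing area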

Reversing the direction of an axis

# Reverse order of a continuous-valued axis bp + scale_y_reverse()

Setting and hiding tick markers

# Setting the tick marks on an axis # This will show tick marks on every 0.25 from 1 to 10 # The scale will show only the ones that are within range (3.50-6.25 in this case) bp + scale_y_continuous(breaks=seq(1,10,1/4)) # The breaks can be spaced unevenly bp + scale_y_continuous(breaks=c(4, 4.25, 4.5, 5, 6,8)) # Suppress ticks and gridlines bp + scale_y_continuous(breaks=NULL) # Hide tick marks and labels (on Y axis), but keep the gridlines bp + theme(axis.ticks = element_blank(), axis.text.y = element_blank())

Axis transformations: log, sqrt, etc.

By default, the axes are linearly scaled. It is possible to transform the axes with log, power, roots, and so on. There are two ways of transforming an axis. One is to use a scale transform, and the other is to use a coordinate transform. With a scale transform, the data is transformed before properties such as breaks (the tick locations) and range of the axis are decided. With a coordinate transform, the transformation happens after the breaks and scale range are decided. This results in different appearances, as shown below. # Create some noisy exponentially-distributed data set.seed(201) n <- 100 dat <- data.frame( xval = (1:n+rnorm(n,sd=5))/20, yval = 2*2^((1:n+rnorm(n,sd=5))/20) ) # A scatterplot with regular (linear) axis scaling sp <- ggplot(dat, aes(xval, yval)) + geom_point() sp # log2 scaling of the y axis (with visually-equal spacing) library(scales) # Need the scales package sp + scale_y_continuous(trans=log2_trans()) # log2 coordinate transformation (with visually-diminishing spacing) sp + coord_trans(y="log2") With a scale transformation, you can also set the axis tick marks to show exponents. sp + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x))) Many transformations are available. See ?trans_new for a full list. If the transformation you need isn’t on the list, it is possible to write your own transformation function (see the sketch below). A couple of scale transformations have convenience functions: scale_y_log10 and scale_y_sqrt (with corresponding versions for x). set.seed(205) n <- 100 dat10 <- data.frame( xval = (1:n+rnorm(n,sd=5))/20, yval = 10*10^((1:n+rnorm(n,sd=5))/20) ) sp10 <- ggplot(dat10, aes(xval, yval)) + geom_point() # log10 sp10 + scale_y_log10() # log10 with exponents on tick labels sp10 + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
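Here is a minimal sketch of writing your own transformation with trans_new() from the scales package; the cube-root transform is just an illustration, and cuberoot_trans is a name invented here:
library(scales)
# trans_new() needs a name, the forward transformation, and its inverse
cuberoot_trans <- trans_new("cuberoot",
                            transform = function(x) x^(1/3),
                            inverse = function(x) x^3)
# Use it like any other scale transform (sp is the scatterplot from above)
sp + scale_y_continuous(trans = cuberoot_trans)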

Fixed ratio between x and y axes

It is possible to set the scaling of the axes to an equal ratio, with one visual unit representing the same numeric unit on both axes. It is also possible to set them to ratios other than 1:1. # Data where x ranges from 0-10, y ranges from 0-30 set.seed(202) dat <- data.frame( xval = runif(40,0,10), yval = runif(40,0,30) ) sp <- ggplot(dat, aes(xval, yval)) + geom_point() # Force equal scaling sp + coord_fixed() # Each 1 on the x axis is the same length as 3 on the y axis sp + coord_fixed(ratio=1/3)

Axis labels and text formatting

To set and hide the axis labels: bp + theme(axis.title.x = element_blank()) + # Remove x-axis label ylab("Weight (Kg)") # Set y-axis label # Also possible to set the axis label with the scale # Note that vertical space is still reserved for x's label bp + scale_x_discrete(name="") + scale_y_continuous(name="Weight (Kg)") To change the fonts, and rotate tick mark labels: # Change font options: # X-axis label: bold, red, and 20 points # X-axis tick marks: rotate 90 degrees CCW, move to the left a bit (using vjust, # since the labels are rotated), and 16 points bp + theme(axis.title.x = element_text(face="bold", colour="#990000", size=20), axis.text.x = element_text(angle=90, vjust=0.5, size=16))

Tick mark label text formatters

You may want to display your values as percents, or dollars, or in scientific notation. To do this you can use a formatter, which is a function that changes the text: # Label formatters library(scales) # Need the scales package bp + scale_y_continuous(labels=percent) + scale_x_discrete(labels=abbreviate) # In this particular case, it has no effect Other useful formatters for continuous scales include comma, percent, dollar, and scientific. For discrete scales, abbreviate will remove vowels and spaces and shorten to four characters. For dates, use date_format. Sometimes you may need to create your own formatting function. This one will display numeric minutes in HH:MM:SS format. # Self-defined formatting function for times. timeHMS_formatter <- function(x) { h <- floor(x/60) m <- floor(x %% 60) s <- round(60*(x %% 1)) # Round to nearest second lab <- sprintf('%02d:%02d:%02d', h, m, s) # Format the strings as HH:MM:SS lab <- gsub('^00:', '', lab) # Remove leading 00: if present lab <- gsub('^0', '', lab) # Remove leading 0 if present return(lab) # Return the formatted labels } bp + scale_y_continuous(label=timeHMS_formatter)
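To check the formatter directly (assuming the return(lab) line added above, so the result prints at the console), call it on a few values; the input is in minutes, so 70.25 means 70 minutes and 15 seconds:
timeHMS_formatter(c(5, 30.5, 70.25))
#> [1] "5:00"    "30:30"   "1:10:15"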

Hiding gridlines

To hide all gridlines, both vertical and horizontal: # Hide all the gridlines bp + theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) # Hide just the minor gridlines bp + theme(panel.grid.minor=element_blank()) It’s also possible to hide just the vertical or horizontal gridlines: # Hide all the vertical gridlines bp + theme(panel.grid.minor.x=element_blank(), panel.grid.major.x=element_blank()) # Hide all the horizontal gridlines bp + theme(panel.grid.minor.y=element_blank(), panel.grid.major.y=element_blank())

Legends (ggplot2)

Problem

You want to modify the legend of a graph made with ggplot2.

Solution

Start with an example graph with the default options: library(ggplot2) bp <- ggplot(data=PlantGrowth, aes(x=group, y=weight, fill=group)) + geom_boxplot() bp

Removing the legend

Use guides(fill=FALSE), replacing fill with the desired aesthetic. You can also remove all the legends in a graph, using theme. # Remove legend for a particular aesthetic (fill) bp + guides(fill=FALSE) # It can also be done when specifying the scale bp + scale_fill_discrete(guide=FALSE) # This removes all legends bp + theme(legend.position="none")

Changing the order of items in the legend

This changes the order of items to trt1, ctrl, trt2: bp + scale_fill_discrete(breaks=c("trt1","ctrl","trt2")) Depending on how the colors are specified, you may have to use a different scale, such as scale_fill_manual, scale_colour_hue, scale_colour_manual, scale_shape_discrete, scale_linetype_discrete, and so on.

Reversing the order of items in the legend

To reverse the legend order: # These two methods are equivalent: bp + guides(fill = guide_legend(reverse=TRUE)) bp + scale_fill_discrete(guide = guide_legend(reverse=TRUE)) # You can also modify the scale directly: bp + scale_fill_discrete(breaks = rev(levels(PlantGrowth$group))) Instead of scale_fill_discrete, you may need to use a different scale, such as scale_fill_manual, scale_colour_hue, scale_colour_manual, scale_shape_discrete, scale_linetype_discrete, and so on.

Hiding the legend title

This will hide the legend title: # Remove title for fill legend bp + guides(fill=guide_legend(title=NULL)) # Remove title for all legends bp + theme(legend.title=element_blank())

Modifying the text of legend titles and labels

There are two ways of changing the legend title and labels. The first way is to tell the scale to use a different title and labels. The second way is to change the data frame so that the factor has the desired form.

Using scales

The legend can be a guide for fill, colour, linetype, shape, or other aesthetics.
With fill and color
Because group, the variable in the legend, is mapped to the color fill, it is necessary to use scale_fill_xxx, where xxx is a method of mapping each factor level of group to different colors. The default is to use a different hue on the color wheel for each factor level, but it is also possible to manually specify the colors for each level. bp + scale_fill_discrete(name="Experimental\nCondition") bp + scale_fill_discrete(name="Experimental\nCondition", breaks=c("ctrl", "trt1", "trt2"), labels=c("Control", "Treatment 1", "Treatment 2")) # Using a manual scale instead of hue bp + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"), name="Experimental\nCondition", breaks=c("ctrl", "trt1", "trt2"), labels=c("Control", "Treatment 1", "Treatment 2")) Note that this didn’t change the x axis labels. See Axes (ggplot2) for information on how to modify the axis labels. If you use a line graph, you will probably need to use scale_colour_xxx and/or scale_shape_xxx instead of scale_fill_xxx. colour maps to the colors of lines and points, while fill maps to the color of area fills. shape maps to the shapes of points. We’ll use a different data set for the line graphs here because the PlantGrowth data set does not work well with a line graph. # A different data set df1 <- data.frame( sex = factor(c("Female","Female","Male","Male")), time = factor(c("Lunch","Dinner","Lunch","Dinner"), levels=c("Lunch","Dinner")), total_bill = c(13.53, 16.81, 16.24, 17.42) ) # A basic graph lp <- ggplot(data=df1, aes(x=time, y=total_bill, group=sex, shape=sex)) + geom_line() + geom_point() lp # Change the legend lp + scale_shape_discrete(name ="Payer", breaks=c("Female", "Male"), labels=c("Woman", "Man")) If you use both colour and shape, they both need to be given scale specifications. Otherwise there will be two separate legends. # Specify colour and shape lp1 <- ggplot(data=df1, aes(x=time, y=total_bill, group=sex, shape=sex, colour=sex)) + geom_line() + geom_point() lp1 # Here's what happens if you just specify colour lp1 + scale_colour_discrete(name ="Payer", breaks=c("Female", "Male"), labels=c("Woman", "Man")) # Specify both colour and shape lp1 + scale_colour_discrete(name ="Payer", breaks=c("Female", "Male"), labels=c("Woman", "Man")) + scale_shape_discrete(name ="Payer", breaks=c("Female", "Male"), labels=c("Woman", "Man"))
Kinds of scales
There are many kinds of scales. They take the form scale_xxx_yyy. Here are some commonly-used values of xxx and yyy (a combined example follows the tables):
xxx Description
colour Color of lines and points
fill Color of area fills (e.g. bar graph)
linetype Solid/dashed/dotted lines
shape Shape of points
size Size of points
alpha Opacity/transparency
yyy Description
hue Equally-spaced colors from the color wheel
manual Manually-specified values (e.g., colors, point shapes, line types)
gradient Color gradient
grey Shades of grey
discrete Discrete values (e.g., colors, point shapes, line types, point sizes)
continuous Continuous values (e.g., alpha, colors, point sizes)
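As a quick illustration of combining xxx and yyy, here is a sketch applied to the lp1 line graph from above; the specific colors and shapes are arbitrary choices:
# Manual colors for lines/points (colour) and manual point shapes (shape);
# both scales keep the default title, so ggplot2 can merge the legends
lp1 + scale_colour_manual(values=c("#E69F00", "#56B4E9")) +
      scale_shape_manual(values=c(16, 17))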

Changing the factor in the data frame

Another way to change the legend title and labels is to directly modify the data frame. pg <- PlantGrowth # Copy data into new data frame # Rename the column and the values in the factor levels(pg$group)[levels(pg$group)=="ctrl"] <- "Control" levels(pg$group)[levels(pg$group)=="trt1"] <- "Treatment 1" levels(pg$group)[levels(pg$group)=="trt2"] <- "Treatment 2" names(pg)[names(pg)=="group"] <- "Experimental Condition" # View a few rows from the end product head(pg) #> weight Experimental Condition #> 1 4.17 Control #> 2 5.58 Control #> 3 5.18 Control #> 4 6.11 Control #> 5 4.50 Control #> 6 4.61 Control # Make the plot ggplot(data=pg, aes(x=`Experimental Condition`, y=weight, fill=`Experimental Condition`)) + geom_boxplot() The legend title “Experimental Condition” is long and it might look better if it were broken into two lines, but this doesn’t work very well with this method, since you would have to put a newline character in the name of the column. The other method, with scales, is generally a better way to do this. Also note the use of backticks instead of quotes. They are necessary because of the spaces in the variable name.

Modifying the appearance of the legend title and labels

# Title appearance bp + theme(legend.title = element_text(colour="blue", size=16, face="bold")) # Label appearance bp + theme(legend.text = element_text(colour="blue", size = 16, face = "bold"))

Modifying the legend box

By default, the legend will not have a box around it. To add a box and modify its properties: bp + theme(legend.background = element_rect()) bp + theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted"))

Changing the position of the legend

Position legend outside the plotting area (left/right/top/bottom): bp + theme(legend.position="top") It is also possible to position the legend inside the plotting area. Note that the numeric position below is relative to the entire area, including titles and labels, not just the plotting area. # Position legend in graph, where x,y is 0,0 (bottom left) to 1,1 (top right) bp + theme(legend.position=c(.5, .5)) # Set the "anchoring point" of the legend (bottom-left is 0,0; top-right is 1,1) # Put bottom-left corner of legend box in bottom-left corner of graph bp + theme(legend.justification=c(0,0), legend.position=c(0,0)) # Put bottom-right corner of legend box in bottom-right corner of graph bp + theme(legend.justification=c(1,0), legend.position=c(1,0))

Hiding slashes in the legend

If you make bar graphs with an outline (by setting colour="black"), it will draw a slash through the colors in the legend. There is not a built-in way to remove the slashes, but it is possible to cover them up. # No outline ggplot(data=PlantGrowth, aes(x=group, fill=group)) + geom_bar() # Add outline, but slashes appear in legend ggplot(data=PlantGrowth, aes(x=group, fill=group)) + geom_bar(colour="black") # A hack to hide the slashes: first graph the bars with no outline and add the legend, # then graph the bars again with outline, but with a blank legend. ggplot(data=PlantGrowth, aes(x=group, fill=group)) + geom_bar() + geom_bar(colour="black", show.legend=FALSE)

Lines (ggplot2)

Problem

You want to add lines to a plot.

Solution

With one continuous and one categorical axis

# Some sample data dat <- read.table(header=TRUE, text=' cond result control 10 treatment 11.5 ') library(ggplot2)

Lines that go all the way across

These use geom_hline because the y-axis is the continuous one, but it is also possible to use geom_vline (with xintercept) if the x-axis is continuous. # Basic bar plot bp <- ggplot(dat, aes(x=cond, y=result)) + geom_bar(position=position_dodge(), stat="identity") bp # Add a horizontal line bp + geom_hline(aes(yintercept=12)) # Make the line red and dashed bp + geom_hline(aes(yintercept=12), colour="#990000", linetype="dashed")

Separate lines for each categorical value

To make separate lines for each bar, use geom_errorbar. The error bars have no height (ymin=ymax). It also seems necessary to specify y for some reason, even though it doesn’t do anything. # Draw separate hlines for each bar. First add another column to dat dat$hline <- c(9,12) dat #> cond result hline #> 1 control 10.0 9 #> 2 treatment 11.5 12 # Need to re-specify bp, because the data has changed bp <- ggplot(dat, aes(x=cond, y=result)) + geom_bar(position=position_dodge(), stat="identity") # Draw with separate lines for each bar bp + geom_errorbar(aes(ymax=hline, ymin=hline), colour="#AA0000") # Make the lines narrower bp + geom_errorbar(width=0.5, aes(ymax=hline, ymin=hline), colour="#AA0000") # Can get the same result, even if we get the hline values from a second data frame # Define data frame with hline dat_hlines <- data.frame(cond=c("control","treatment"), hline=c(9,12)) dat_hlines #> cond hline #> 1 control 9 #> 2 treatment 12 # The bars are from dat, but the lines are from dat_hlines bp + geom_errorbar(data=dat_hlines, aes(y=NULL, ymax=hline, ymin=hline), colour="#AA0000") #> Warning: Ignoring unknown aesthetics: y

Lines over grouped bars

It is possible to add lines over grouped bars. In this example, there are actually four lines (one for each entry for hline), but it looks like two, because they are drawn on top of each other. I don’t think it’s possible to avoid this, but it doesn’t cause any problems. dat <- read.table(header=TRUE, text=' cond group result hline control A 10 9 treatment A 11.5 12 control B 12 9 treatment B 14 12 ') dat #> cond group result hline #> 1 control A 10.0 9 #> 2 treatment A 11.5 12 #> 3 control B 12.0 9 #> 4 treatment B 14.0 12 # Define basic bar plot bp <- ggplot(dat, aes(x=cond, y=result, fill=group)) + geom_bar(position=position_dodge(), stat="identity") bp # The error bars get plotted over one another -- there are four but it looks # like two bp + geom_errorbar(aes(ymax=hline, ymin=hline), linetype="dashed")

Lines over individual grouped bars

It is also possible to have lines over each individual bar, even when grouped. dat <- read.table(header=TRUE, text=' cond group result hline control A 10 11 treatment A 11.5 12 control B 12 12.5 treatment B 14 15 ') # Define basic bar plot bp <- ggplot(dat, aes(x=cond, y=result, fill=group)) + geom_bar(position=position_dodge(), stat="identity") bp bp + geom_errorbar(aes(ymax=hline, ymin=hline), linetype="dashed", position=position_dodge())

With two continuous axes

Sample data used here: dat <- read.table(header=TRUE, text=' cond xval yval control 11.5 10.8 control 9.3 12.9 control 8.0 9.9 control 11.5 10.1 control 8.6 8.3 control 9.9 9.5 control 8.8 8.7 control 11.7 10.1 control 9.7 9.3 control 9.8 12.0 treatment 10.4 10.6 treatment 12.1 8.6 treatment 11.2 11.0 treatment 10.0 8.8 treatment 12.9 9.5 treatment 9.1 10.0 treatment 13.4 9.6 treatment 11.6 9.8 treatment 11.5 9.8 treatment 12.0 10.6 ') library(ggplot2)

Basic lines

# The basic scatterplot sp <- ggplot(dat, aes(x=xval, y=yval, colour=cond)) + geom_point() # Add a horizontal line sp + geom_hline(aes(yintercept=10)) # Add a red dashed vertical line sp + geom_hline(aes(yintercept=10)) + geom_vline(aes(xintercept=11.5), colour="#BB0000", linetype="dashed")

Drawing lines for the mean

It is also possible to compute a mean value for each subset of data, grouped by some variable. The group means would have to be computed and stored in a separate data frame, and the easiest way to do this is to use the dplyr package. Note that the y range of the line is determined by the data. library(dplyr) lines <- dat %>% group_by(cond) %>% summarise( x = mean(xval), ymin = min(yval), ymax = max(yval) ) # Add colored lines for the mean xval of each group sp + geom_hline(aes(yintercept=10)) + geom_linerange(aes(x=x, y=NULL, ymin=ymin, ymax=ymax), data=lines) #> Warning: Ignoring unknown aesthetics: y

Using lines with facets

Normally, if you add a line, it will appear in all facets. # Facet, based on cond spf <- sp + facet_grid(. ~ cond) spf # Draw a horizontal line in all of the facets at the same value spf + geom_hline(aes(yintercept=10)) If you want the different lines to appear in the different facets, there are two options. One is to create a new data frame with the desired values for the lines. Another option (with more limited control) is to use stat and xintercept in geom_line(). dat_vlines <- data.frame(cond=levels(dat$cond), xval=c(10,11.5)) dat_vlines #> cond xval #> 1 control 10.0 #> 2 treatment 11.5 spf + geom_hline(aes(yintercept=10)) + geom_vline(aes(xintercept=xval), data=dat_vlines, colour="#990000", linetype="dashed") spf + geom_hline(aes(yintercept=10)) + geom_linerange(aes(x=x, y=NULL, ymin=ymin, ymax=ymax), data=lines) #> Warning: Ignoring unknown aesthetics: y

Facets (ggplot2)

Problem

You want to split up your data by one or more variables and plot the subsets of data together.

Solution

Sample data

We will use the tips dataset from the reshape2 package. library(reshape2) # Look at first few rows head(tips) #> total_bill tip sex smoker day time size #> 1 16.99 1.01 Female No Sun Dinner 2 #> 2 10.34 1.66 Male No Sun Dinner 3 #> 3 21.01 3.50 Male No Sun Dinner 3 #> 4 23.68 3.31 Male No Sun Dinner 2 #> 5 24.59 3.61 Female No Sun Dinner 4 #> 6 25.29 4.71 Male No Sun Dinner 4 This is a scatterplot of the tip percentage by total bill size. library(ggplot2) sp <- ggplot(tips, aes(x=total_bill, y=tip/total_bill)) + geom_point(shape=1) sp

facet_grid

The data can be split up by one or two variables that vary on the horizontal and/or vertical direction. This is done by giving a formula to facet_grid(), of the form vertical ~ horizontal. # Divide by levels of "sex", in the vertical direction sp + facet_grid(sex ~ .) # Divide by levels of "sex", in the horizontal direction sp + facet_grid(. ~ sex) # Divide with "sex" vertical, "day" horizontal sp + facet_grid(sex ~ day)

facet_wrap

Instead of faceting with a variable in the horizontal or vertical direction, facets can be placed next to each other, wrapping with a certain number of columns or rows. The label for each plot will be at the top of the plot. # Divide by day, going horizontally and wrapping with 2 columns sp + facet_wrap( ~ day, ncol=2)

Modifying facet label appearance

sp + facet_grid(sex ~ day) + theme(strip.text.x = element_text(size=8, angle=75), strip.text.y = element_text(size=12, face="bold"), strip.background = element_rect(colour="red", fill="#CCCCFF"))

Modifying facet label text

There are a few different ways of modifying facet labels. The simplest way is to provide a named vector that maps original names to new names. To map the levels of sex from Female==>Women, and Male==>Men: labels <- c(Female = "Women", Male = "Men") sp + facet_grid(. ~ sex, labeller=labeller(sex = labels)) Another way is to modify the data frame so that the data contains the desired labels: tips2 <- tips levels(tips2$sex)[levels(tips2$sex)=="Female"] <- "Women" levels(tips2$sex)[levels(tips2$sex)=="Male"] <- "Men" head(tips2, 3) #> total_bill tip sex smoker day time size #> 1 16.99 1.01 Women No Sun Dinner 2 #> 2 10.34 1.66 Men No Sun Dinner 3 #> 3 21.01 3.50 Men No Sun Dinner 3 # Both of these will give the same output: sp2 <- ggplot(tips2, aes(x=total_bill, y=tip/total_bill)) + geom_point(shape=1) sp2 + facet_grid(. ~ sex) labeller() can use any function that takes a character vector as input and returns a character vector as output. For example, the capitalize function from the Hmisc package will capitalize the first letters of strings. We can also define our own custom functions, like this one, which reverses strings: # Reverse each string in a character vector reverse <- function(strings) { strings <- strsplit(strings, "") vapply(strings, function(x) { paste(rev(x), collapse = "") }, FUN.VALUE = character(1)) } sp + facet_grid(. ~ sex, labeller=labeller(sex = reverse))

Free scales

Normally, the axis scales on each graph are fixed, which means that they have the same size and range. They can be made independent, by setting scales to free, free_x, or free_y. # A histogram of bill sizes hp <- ggplot(tips, aes(x=total_bill)) + geom_histogram(binwidth=2,colour="white") # Histogram of total_bill, divided by sex and smoker hp + facet_grid(sex ~ smoker) # Same as above, with scales="free_y" hp + facet_grid(sex ~ smoker, scales="free_y") # With panels that have the same scaling, but different range (and therefore different physical sizes) hp + facet_grid(sex ~ smoker, scales="free", space="free")

Multiple graphs on one page (ggplot2)

Problem

You want to put multiple graphs on one page.

Solution

The easy way is to use the multiplot function, defined at the bottom of this page. If it isn’t suitable for your needs, you can copy and modify it. First, set up the plots and store them, but don’t render them yet. The details of these plots aren’t important; all you need to do is store the plot objects in variables. library(ggplot2) # This example uses the ChickWeight dataset, which comes with ggplot2 # First plot p1 <- ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet, group=Chick)) + geom_line() + ggtitle("Growth curve for individual chicks") # Second plot p2 <- ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet)) + geom_point(alpha=.3) + geom_smooth(alpha=.2, size=1) + ggtitle("Fitted growth curve per diet") # Third plot p3 <- ggplot(subset(ChickWeight, Time==21), aes(x=weight, colour=Diet)) + geom_density() + ggtitle("Final weight, by diet") # Fourth plot p4 <- ggplot(subset(ChickWeight, Time==21), aes(x=weight, fill=Diet)) + geom_histogram(colour="black", binwidth=50) + facet_grid(Diet ~ .) + ggtitle("Final weight, by diet") + theme(legend.position="none") # No legend (redundant in this graph) Once the plot objects are set up, we can render them with multiplot. This will make two columns of graphs: multiplot(p1, p2, p3, p4, cols=2) #> `geom_smooth()` using method = 'loess'

multiplot function

This is the definition of multiplot. It can take any number of plot objects as arguments, or it can take a list of plot objects passed to plotlist. # Multiple plot function # # ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects) # - cols: Number of columns in layout # - layout: A matrix specifying the layout. If present, 'cols' is ignored. # # If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE), # then plot 1 will go in the upper left, 2 will go in the upper right, and # 3 will go all the way across the bottom. # multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) { library(grid) # Make a list from the ... arguments and plotlist plots <- c(list(...), plotlist) numPlots = length(plots) # If layout is NULL, then use 'cols' to determine layout if (is.null(layout)) { # Make the panel # ncol: Number of columns of plots # nrow: Number of rows needed, calculated from # of cols layout <- matrix(seq(1, cols * ceiling(numPlots/cols)), ncol = cols, nrow = ceiling(numPlots/cols)) } if (numPlots==1) { print(plots[[1]]) } else { # Set up the page grid.newpage() pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout)))) # Make each plot, in the correct location for (i in 1:numPlots) { # Get the i,j matrix positions of the regions that contain this subplot matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE)) print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row, layout.pos.col = matchidx$col)) } } }
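For example, the layout matrix described in the comments above can be used directly (reusing p1, p2, and p3 from the previous section):
# Plot 1 in the upper left, plot 2 in the upper right, plot 3 across the bottom
multiplot(p1, p2, p3, layout=matrix(c(1,2,3,3), nrow=2, byrow=TRUE))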

Colors (ggplot2)

Problem

You want to use colors in a graph with ggplot2.

Solution

The default colors in ggplot2 can be difficult to distinguish from one another because they have equal luminance. They are also not friendly for colorblind viewers. A good general-purpose solution is to just use the colorblind-friendly palette below.

Sample data

These two data sets will be used to generate the graphs below. # Two variables df <- read.table(header=TRUE, text=' cond yval A 2 B 2.5 C 1.6 ') # Three variables df2 <- read.table(header=TRUE, text=' cond1 cond2 yval A I 2 A J 2.5 A K 1.6 B I 2.2 B J 2.4 B K 1.2 C I 1.7 C J 2.3 C K 1.9 ')

Simple color assignment

The colors of lines and points can be set directly using colour="red", replacing “red” with a color name. The colors of filled objects, like bars, can be set using fill="red". If you want to use anything other than very basic colors, it may be easier to use hexadecimal codes for colors, like "#FF6699". (See the hexadecimal color chart below.) library(ggplot2) # Default: dark bars ggplot(df, aes(x=cond, y=yval)) + geom_bar(stat="identity") # Bars with red outlines ggplot(df, aes(x=cond, y=yval)) + geom_bar(stat="identity", colour="#FF9999") # Red fill, black outlines ggplot(df, aes(x=cond, y=yval)) + geom_bar(stat="identity", fill="#FF9999", colour="black") # Standard black lines and points ggplot(df, aes(x=cond, y=yval)) + geom_line(aes(group=1)) + # Group all points; otherwise no line will show geom_point(size=3) # Dark blue lines, red dots ggplot(df, aes(x=cond, y=yval)) + geom_line(aes(group=1), colour="#000099") + # Blue lines geom_point(size=3, colour="#CC0000") # Red dots

Mapping variable values to colors

Instead of changing colors globally, you can map variables to colors: in other words, make the color conditional on a variable, by putting it inside an aes() statement. # Bars: x and fill both depend on cond ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") # Bars with other dataset; fill depends on cond2 ggplot(df2, aes(x=cond1, y=yval)) + geom_bar(aes(fill=cond2), # fill depends on cond2 stat="identity", colour="black", # Black outline for all position=position_dodge()) # Put bars side-by-side instead of stacked # Lines and points; colour depends on cond2 ggplot(df2, aes(x=cond1, y=yval)) + geom_line(aes(colour=cond2, group=cond2)) + # colour, group both depend on cond2 geom_point(aes(colour=cond2), # colour depends on cond2 size=3) # larger points ## Equivalent to above; but move "colour=cond2" into the global aes() mapping # ggplot(df2, aes(x=cond1, y=yval, colour=cond2)) + # geom_line(aes(group=cond2)) + # geom_point(size=3)

A colorblind-friendly palette

These are color-blind-friendly palettes, one with gray, and one with black. To use with ggplot2, it is possible to store the palette in a variable, then use it later. # The palette with grey: cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") # The palette with black: cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") # To use for fills, add scale_fill_manual(values=cbPalette) # To use for line and point colors, add scale_colour_manual(values=cbPalette) This palette is from http://jfly.iam.u-tokyo.ac.jp/color/.

Color selection

By default, the colors for discrete scales are evenly spaced around an HSL color circle. For example, if there are two colors, then they will be selected from opposite points on the circle; if there are three colors, they will be 120° apart on the color circle; and so on. The colors used for different numbers of levels can be reproduced directly (see the sketch below). The default color selection uses scale_fill_hue() and scale_colour_hue(). For example, adding those commands is redundant in these cases: # These two are equivalent; by default scale_fill_hue() is used ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") # ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue() # These two are equivalent; by default scale_colour_hue() is used ggplot(df, aes(x=cond, y=yval, colour=cond)) + geom_point(size=2) # ggplot(df, aes(x=cond, y=yval, colour=cond)) + geom_point(size=2) + scale_colour_hue()
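A minimal sketch of reproducing the default hues for n levels (gg_color_hue is a name invented here; the equally-spaced-hue scheme follows the description above, with fixed chroma and luminance):
# Emulate the default discrete colors: equally-spaced hues,
# chroma 100, luminance 65
gg_color_hue <- function(n) {
  hues <- seq(15, 375, length.out = n + 1)
  hcl(h = hues, c = 100, l = 65)[1:n]
}
gg_color_hue(3)
#> [1] "#F8766D" "#00BA38" "#619CFF"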

Setting luminance and saturation (chromaticity)

Although scale_fill_hue() and scale_colour_hue() were redundant above, they can be used when you want to make changes from the default, like changing the luminance or chromaticity. # Use luminance=45, instead of default 65 ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(l=45) # Reduce saturation (chromaticity) from 100 to 45, and increase luminance ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=45, l=80) # Note: use scale_colour_hue() for lines and points (Chart: colors with luminance=45.)

Palettes: Color Brewer

You can also use other color scales, such as ones taken from the RColorBrewer package. See the chart of RColorBrewer palettes below. See the scale section here for more information. ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_brewer() ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_brewer(palette="Set1") ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_brewer(palette="Spectral") # Note: use scale_colour_brewer() for lines and points

Palettes: manually-defined

Finally, you can define your own set of colors with scale_fill_manual(). See the hexadecimal code chart below for help choosing specific colors. ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_manual(values=c("red", "blue", "green")) ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_manual(values=c("#CC6666", "#9999CC", "#66CC99")) # Note: use scale_colour_manual() for lines and points

Continuous colors

[Not complete] See the scale section here for more information. # Generate some data set.seed(133) df <- data.frame(xval=rnorm(50), yval=rnorm(50)) # Make color depend on yval ggplot(df, aes(x=xval, y=yval, colour=yval)) + geom_point() # Use a different gradient ggplot(df, aes(x=xval, y=yval, colour=yval)) + geom_point() + scale_colour_gradientn(colours=rainbow(4))
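A two-endpoint gradient can also be set explicitly with scale_colour_gradient(); the low/high colors below are arbitrary choices for illustration:
# Gradient from light grey to dark blue as yval increases
ggplot(df, aes(x=xval, y=yval, colour=yval)) + geom_point() +
  scale_colour_gradient(low="grey80", high="darkblue")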

Color charts

Hexadecimal color code chart

Colors can be specified as a hexadecimal RGB triplet, such as "#0066CC". The first two digits are the level of red, the next two green, and the last two blue. The value for each ranges from 00 to FF in hexadecimal (base-16) notation, which is equivalent to 0 to 255 in base-10. For example, “#FFFFFF” is white and “#990000” is a deep red. (Color chart is from http://www.visibone.com)
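The mapping between hex digits and 0-255 values can be checked from R itself; col2rgb() and rgb() are in the base grDevices package:
# "#990000": red = 0x99 = 153, green = 0, blue = 0
col2rgb("#990000")
#>       [,1]
#> red    153
#> green    0
#> blue     0
# And back again:
rgb(153, 0, 0, maxColorValue=255)
#> [1] "#990000"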

RColorBrewer palette chart

(Figure: chart of the RColorBrewer palettes.)

Output to a file

Problem

You want to save your graph(s) to a file.

Solution

There are several commands which will direct output to a file instead of the screen. You must use the dev.off() command to tell R that you are finished plotting; otherwise your graph will not show up.
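The general pattern is the same for all of the file devices below; a minimal sketch:

pdf("myplot.pdf")      # Or png("myplot.png"), svg("myplot.svg"), etc.
plot(1:10, (1:10)^2)   # Any plotting commands
dev.off()              # Close the device and write the file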

PDF

PDF is a vector file format. Vector files are generally preferred for print output because the resulting output can be scaled to any size without pixelation. The size of a vector file is usually smaller than the corresponding bitmap file, except in cases where there are many objects. (For example, a scatter plot with thousands of points may result in a very large vector file, but a smaller bitmap file.) pdf("plots.pdf") plot(...) plot(...) dev.off() PDFs are 7x7 inches by default, and each new plot is on a new page. The size can be changed: # 6x3 inches pdf("plots.pdf", width=6, height=3) # 10x6 cm pdf("plots.pdf", width=10/2.54, height=6/2.54) If you want to edit your file in a vector editor like Inkscape or Illustrator, some of the plotting point objects might look like letters instead of circles, squares, etc. To avoid this problem: pdf("plots.pdf", useDingbats=FALSE)

SVG

SVG is another vector format. The default settings for svg() don't allow for multiple pages in a single file, since most SVG viewers can't handle multi-page SVG files. See the PNG section below for outputting to multiple files. svg("plots.svg") plot(...) dev.off() SVG files may work better with vector-editing programs than PDF files.

PNG/TIFF

PNG and TIFF are bitmap (or raster) formats. If they are magnified, the pixels may be visible. png("plot.png") # or tiff("plot.tiff") plot(...) dev.off() By default, the graphs are 480x480 pixels in size, at a resolution of 72 dpi (6.66x6.66 inches). Increasing the resolution will increase the size (in pixels) of the text and graph elements. This occurs because the size of these elements is relative to the physical dimension of the graph (e.g., 4x4 inches), not the pixel dimension of the graph. For example, a 12 point font is 12/72 = 1/6 inch tall; at 72 dpi, this is 12 pixels, but at 120 dpi, it is 20 pixels. The following would create a graph that is 480x240 pixels at 120 dpi, equivalent to 4x2 inches: png("plot.png", width=480, height=240, res=120) plot(...) dev.off() If you want to make more than one graph, you must either execute a new png() command for each one, or put %d in the filename: png("plot-%d.png") plot(...) dev.off() This will generate plot-1.png, plot-2.png, and so on.
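The arithmetic from the paragraph above, worked out in R:

12 / 72           # A 12-point font is 1/6 inch tall
(12 / 72) * 72    # ...which is 12 pixels at 72 dpi
(12 / 72) * 120   # ...but 20 pixels at 120 dpi
480 / 120         # A 480-pixel-wide image at 120 dpi is 4 inches wide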

For import into PDF-incapable programs (MS Office)

Some programs, such as Microsoft Office, cannot import PDF files; they may work with high-resolution PNG or TIFF files instead. For print publications, you may be required to use 300dpi images. # Make a 6x6 inch image at 300dpi ppi <- 300 png("plot.png", width=6*ppi, height=6*ppi, res=ppi) plot(...) dev.off()

ggplot2

If you make plots with ggplot2 in a script or function, you must use the print() command to make the graphs actually get rendered. # This will do nothing pdf("plots.pdf") qplot(...) dev.off() # This will do the right thing pdf("plots.pdf") print(qplot(...)) dev.off() To save a ggplot2 graph from the screen to a file, you can use ggsave(). ggsave("plot.pdf") ggsave("plot.pdf", width=4, height=4) # This will save a 400x400 file at 100 ppi ggsave("plot.png", width=4, height=4, dpi=100)

Saving a graph from the screen

If you have a graph on the screen, you can save it to a bitmap file. savePlot() saves an exact pixel-for-pixel copy of what’s on screen, but it will probably only work in Linux and on Macs that use X11 for R graphing: # Make a plot on screen plot(...) savePlot("myplot.png") dev.copy() will also save the current graph from the screen, but it re-renders it for the new device, which may have different dimensions, so it won’t look exactly the same unless you specify the exact same size in pixels. # Make a plot on screen plot(...) dev.copy(pdf,"myplot.pdf", width=4, height=4) dev.off() # Same as doing: # pdf("myplot.pdf", width=4, height=4) # plot(...) # dev.off() dev.copy(png,"myplot.png", width=400, height=400) dev.off()

Shapes and line types

Problem

You want to use different shapes and line types in your graph.

Solution

(Figures: charts of the point shapes and line types.) Note that with bitmap output, the filled symbols 15-18 may render without proper anti-aliasing; they can appear jagged, pixelated, and not properly centered, though this varies among platforms. Symbols 19 and 21-25 have an outline around the filled area, and will render with smoothed edges on most platforms. For symbols 21-25 to appear solid, you will also need to specify a fill (bg) color that is the same as the outline color (col); otherwise they will be hollow.
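The shape chart can be regenerated with a few lines of base graphics. This is a sketch, not the exact code that produced the original figure:

# Draw the 26 standard point shapes (pch 0-25) with their numbers
par(mar=c(1,1,2,1))
plot(NA, xlim=c(0,13), ylim=c(0,2.5), xaxt="n", yaxt="n",
     xlab=NA, ylab=NA, main="pch symbols 0-25")
for (i in 0:25) {
  x <- i %% 13
  y <- 2 - (i %/% 13)
  points(x, y, pch=i, cex=2, col="#000099", bg="#FF6666")  # bg affects 21-25 only
  text(x, y-0.4, labels=i)
}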

Standard graphics

Use the pch option to set the shape, and use lty and lwd to set the line type and width. The line type can be specified by name or by number. set.seed(331) # Plot some points with lines # Set up the plotting area par(mar=c(3,3,2,2)) plot(NA, xlim=c(1,4), ylim=c(0,1)) # Plot solid circles with solid lines points(1:4, runif(4), type="b", pch=19) # Add open squares with dashed line, with heavier line width points(1:4, runif(4), type="b", pch=0, lty=2, lwd=3) points(1:4, runif(4), type="b", pch=23, # Diamond shape lty="dotted", cex=2, # Dotted line, double-size shapes col="#000099", bg="#FF6666") # blue line, red fill

ggplot2

With ggplot2, shapes and line types can be assigned overall (e.g., if you want all points to be squares, or all lines to be dashed), or they can be conditioned on a variable. # Sample data df <- read.table(header=T, text=' cond xval yval A 1 2.0 A 2 2.5 B 1 3.0 B 2 2.0 ') library(ggplot2) # Plot with standard lines and points # group = cond tells it which points to connect with lines ggplot(df, aes(x=xval, y=yval, group = cond)) + geom_line() + geom_point() # Set overall shapes and line type ggplot(df, aes(x=xval, y=yval, group = cond)) + geom_line(linetype="dashed", # Dashed line size = 1.5) + # Thicker line geom_point(shape = 0, # Hollow squares size = 4) # Large points # Condition shapes and line type on variable cond ggplot(df, aes(x=xval, y=yval, group = cond)) + geom_line(aes(linetype=cond), # Line type depends on cond size = 1.5) + # Thicker line geom_point(aes(shape=cond), # Shape depends on cond size = 4) # Large points # Same as previous, but also change the specific linetypes and # shapes that are used ggplot(df, aes(x=xval, y=yval, group = cond)) + geom_line(aes(linetype=cond), # Line type depends on cond size = 1.5) + # Thicker line geom_point(aes(shape=cond), # Shape depends on cond size = 4) + # Large points scale_shape_manual(values=c(6,5)) + # Change shapes scale_linetype_manual(values=c("dotdash", "dotted")) # Change linetypes By default, ggplot2 uses solid shapes. If you want to use hollow shapes, without manually declaring each shape, you can use scale_shape(solid=FALSE). Note, however, that the lines will be visible inside the shape. To avoid this, you can use shapes 21-25 and specify a white fill. # Hollow shapes ggplot(df, aes(x=xval, y=yval, group = cond)) + geom_line(aes(linetype=cond), # Line type depends on cond size = 1.5) + # Thicker line geom_point(aes(shape=cond), # Shape depends on cond size = 4) + # Large points scale_shape(solid=FALSE) # Shapes with white fill ggplot(df, aes(x=xval, y=yval, group = cond)) + geom_line(aes(linetype=cond), # Line type depends on cond size = 1.5) + # Thicker line geom_point(aes(shape=cond), # Shape depends on cond fill = "white", # White fill size = 4) + # Large points scale_shape_manual(values=c(21,24)) # Shapes: Filled circle, triangle

Note

This code will generate the chart of line types seen at the top. par(mar=c(0,0,0,0)) # Set up the plotting area plot(NA, xlim=c(0,1), ylim=c(6.5, -0.5), xaxt="n", yaxt="n", xlab=NA, ylab=NA ) # Draw the lines for (i in 0:6) { points(c(0.25,1), c(i,i), lty=i, lwd=2, type="l") } # Add labels text(0, 0, "0. 'blank'" , adj=c(0,.5)) text(0, 1, "1. 'solid'" , adj=c(0,.5)) text(0, 2, "2. 'dashed'" , adj=c(0,.5)) text(0, 3, "3. 'dotted'" , adj=c(0,.5)) text(0, 4, "4. 'dotdash'" , adj=c(0,.5)) text(0, 5, "5. 'longdash'", adj=c(0,.5)) text(0, 6, "6. 'twodash'" , adj=c(0,.5))

Fonts

Problem

You want to use different fonts in your graphs.

Solution

Update: Also see the extrafont package for much better support of fonts for PDF and Windows bitmap output. Font support in R is generally not very good. It varies between systems, and between output formats.
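A sketch of the basic extrafont workflow (the package must be installed, and the font family used here, "Arial", must exist on your system):

library(extrafont)
font_import()   # Scan the system for fonts; slow, but only needed once
loadfonts()     # Register the fonts with the PDF device
# loadfonts(device="win")  # On Windows, also register for bitmap output
pdf("plot.pdf", family="Arial")   # Use an imported family
plot(1:10)
dev.off()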

geom_text

With geom_text or annotate in ggplot2, you can set a number of properties of the text. geom_text is used to add text from the data frame, and annotate is used to add a single text element.
Name Default value
size 5
family "" (sans)
fontface plain
lineheight 1.2
angle 0
hjust 0.5
vjust 0.5
Note that size is in mm, not points. dat <- data.frame( y = 1:3, text = c("This is text", "Text with\nmultiple lines", "Some more text") ) library(ggplot2) p <- ggplot(dat, aes(x=1, y=y)) + scale_y_continuous(limits=c(0.5, 3.5), breaks=NULL) + scale_x_continuous(breaks=NULL) p + geom_text(aes(label=text)) p + geom_text(aes(label=text), family="Times", fontface="italic", lineheight=.8) + annotate(geom="text", x=1, y=1.5, label="Annotation text", colour="red", size=7, family="Courier", fontface="bold", angle=30)

themes and element_text

When controlling elements such as the title, legend, axis labels, and so on, you use element_text, which has the same parameters, except that size is in points (not mm), and instead of fontface, it uses face. The default value of size depends on the element; for example, titles are larger than tick labels. p + geom_point() + ggtitle("This is a Title") + theme(plot.title=element_text(family="Times", face="bold", size=20))

Table of fonts

You can use this code to generate a graphical table of fonts. Fonts have short names and canonical family names. You can use either one when specifying the family. fonttable <- read.table(header=TRUE, sep=",", stringsAsFactors=FALSE, text=' Short,Canonical mono,Courier sans,Helvetica serif,Times ,AvantGarde ,Bookman ,Helvetica-Narrow ,NewCenturySchoolbook ,Palatino ,URWGothic ,URWBookman ,NimbusMon URWHelvetica,NimbusSan ,NimbusSanCond ,CenturySch ,URWPalladio URWTimes,NimbusRom ') fonttable$pos <- 1:nrow(fonttable) library(reshape2) fonttable <- melt(fonttable, id.vars="pos", measure.vars=c("Short","Canonical"), variable.name="NameType", value.name="Font") # Make a table of faces. Make sure factors are ordered correctly facetable <- data.frame(Face = factor(c("plain","bold","italic","bold.italic"), levels = c("plain","bold","italic","bold.italic"))) fullfonts <- merge(fonttable, facetable) library(ggplot2) pf <- ggplot(fullfonts, aes(x=NameType, y=pos)) + geom_text(aes(label=Font, family=Font, fontface=Face)) + facet_wrap(~ Face, ncol=2) To view it on screen: pf What you see on the screen isn’t necessarily the same as what you get when you output to PNG or PDF. To see PNG output: png('fonttable.png', width=720, height=720, res=72) print(pf) dev.off() Notice that on the system that generated this picture, most of the fonts (at the top) don’t really work. Only the basic fonts (at the bottom) work.
And PDF output (the image below has been converted from PDF to PNG): pdf('fonttable.pdf', width=10, height=10) print(pf) dev.off() # Convert to PNG using GraphicsMagick: # system("gm convert -resize 720x720 -background white fonttable.pdf fonttable-pdf.png") The PDF device has better font support than the PNG device. All the fonts work (though they’re not necessarily pretty). (Figure: the PDF font table, converted to PNG.)

Histogram and density plot

Problem

You want to make a histogram or density plot.

Solution

Some sample data: these two vectors contain 200 data points each: set.seed(1234) rating <- rnorm(200) head(rating) #> [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559 rating2 <- rnorm(200, mean=.8) head(rating2) #> [1] 1.2852268 1.4967688 0.9855139 1.5007335 1.1116810 1.5604624 When plotting multiple groups of data, some graphing routines require a data frame with one column for the grouping variable and one for the measure variable. # Make a column to indicate which group each value is in cond <- factor( rep(c("A","B"), each=200) ) data <- data.frame(cond, rating = c(rating,rating2)) head(data) #> cond rating #> 1 A -1.2070657 #> 2 A 0.2774292 #> 3 A 1.0844412 #> 4 A -2.3456977 #> 5 A 0.4291247 #> 6 A 0.5060559 # Histogram hist(rating) # Use 8 bins (this is only approximate - it places boundaries on nice round numbers) # Make it light blue #CCCCFF # Instead of showing count, make the area sum to 1 (freq=FALSE) hist(rating, breaks=8, col="#CCCCFF", freq=FALSE) # Put breaks at every 0.6 boundaries <- seq(-3, 3.6, by=.6) boundaries #> [1] -3.0 -2.4 -1.8 -1.2 -0.6 0.0 0.6 1.2 1.8 2.4 3.0 3.6 hist(rating, breaks=boundaries) # Kernel density plot plot(density(rating)) Multiple groups with kernel density plots. This code is from: http://onertipaday.blogspot.com/2007/09/plotting-two-or-more-overlapping.html plot.multi.dens <- function(s) { junk.x = NULL junk.y = NULL for(i in 1:length(s)) { junk.x = c(junk.x, density(s[[i]])$x) junk.y = c(junk.y, density(s[[i]])$y) } xr <- range(junk.x) yr <- range(junk.y) plot(density(s[[1]]), xlim = xr, ylim = yr, main = "") for(i in 1:length(s)) { lines(density(s[[i]]), xlim = xr, ylim = yr, col = i) } } # the input of the following function MUST be a numeric list plot.multi.dens( list(rating, rating2)) The sm package also includes a way of doing multiple density plots. The data must be in a data frame. library(sm) sm.density.compare(data$rating, data$cond) # Add a legend (the color numbers start from 2 and go up) legend("topright", levels(data$cond), fill=2+(0:nlevels(data$cond)))
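For comparison, a brief sketch of the same plots done with ggplot2, using the data frame defined above:

library(ggplot2)
# Overlaid kernel density curves, one per group
ggplot(data, aes(x=rating, colour=cond)) + geom_density()
# Overlaid semi-transparent histograms
ggplot(data, aes(x=rating, fill=cond)) +
    geom_histogram(binwidth=0.5, alpha=0.5, position="identity")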

Scatterplot

Problem

You want to make a scatterplot.

Solution

Suppose this is your data: set.seed(955) # Make some noisily increasing data dat <- data.frame(xvar = 1:20 + rnorm(20,sd=3), yvar = 1:20 + rnorm(20,sd=3), zvar = 1:20 + rnorm(20,sd=3)) head(dat) #> xvar yvar zvar #> 1 -4.252354 3.473157275 -2.97806724 #> 2 1.702318 0.005939612 -1.16183118 #> 3 4.323054 -0.094252427 4.85516658 #> 4 1.780628 2.072808278 4.65078709 #> 5 11.537348 1.215440358 -0.06613962 #> 6 6.672130 3.608111411 6.24349897

Basic scatterplots

# Plot the points using the vectors xvar and yvar plot(dat$xvar, dat$yvar) # Same as previous, but with formula interface plot(yvar ~ xvar, dat) # Add a regression line fitline <- lm(dat$yvar ~ dat$xvar) abline(fitline)

Scatterplot matrices

It is also possible to make a matrix of scatterplots if you would like to compare several variables. See this for a way to make a scatterplot matrix with r values. # A scatterplot matrix plot(dat[,1:3]) # Another way of making a scatterplot matrix, with regression lines # and histogram/boxplot/density/qqplot/none along the diagonal library(car) scatterplotMatrix(dat[,1:3], diagonal="histogram", smooth=FALSE) To calculate the corresponding correlation matrix, see ../../Statistical analysis/Regression and correlation. To visualize the correlation matrix, see ../Correlation matrix.

Box plot

Problem

You want to make a box plot.

Solution

This page shows how to make quick, simple box plots with base graphics. For more sophisticated ones, see Plotting distributions (ggplot2).

Sample data

The examples here will use the ToothGrowth data set, which has two independent variables, and one dependent variable. head(ToothGrowth) #> len supp dose #> 1 4.2 VC 0.5 #> 2 11.5 VC 0.5 #> 3 7.3 VC 0.5 #> 4 5.8 VC 0.5 #> 5 6.4 VC 0.5 #> 6 10.0 VC 0.5 Simple box plots of len against supp and dose: boxplot(len ~ supp, data=ToothGrowth) # Even though `dose` is a numeric variable, `boxplot` will convert it to a factor boxplot(len ~ dose, data=ToothGrowth) A boxplot of len against supp and dose together. boxplot(len ~ interaction(dose,supp), data=ToothGrowth) Note that boxplot and plot have much the same output, except that plot puts in axis labels, and doesn’t automatically convert numeric variables to factors, as was done with dose above. plot(len ~ interaction(dose,supp), data=ToothGrowth)

Q-Q plot

Problem

You want to compare the distribution of your data to another distribution. This is often used to check whether a sample follows a normal distribution, or whether two samples are drawn from the same distribution.

Solution

Suppose this is your data: set.seed(183) # Normally distributed numbers x <- rnorm(80, mean=50, sd=5) # Uniformly distributed numbers z <- runif(80) # Compare the numbers sampled with rnorm() against normal distribution qqnorm(x) qqline(x) # The same numbers to the 4th power, compared to the normal distribution qqnorm(x^4) qqline(x^4) # Numbers sampled from the flat distribution, compared to normal qqnorm(z) qqline(z)
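To compare two samples directly, rather than each against a theoretical distribution, qqplot() plots their quantiles against each other; if the samples come from the same distribution, the points should fall near a straight line:

# Normal sample vs. uniform sample: the points bend away from a line
qqplot(x, z)
# Two samples from the same distribution: roughly linear
qqplot(x, rnorm(80, mean=50, sd=5))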

Correlation matrix

Problem

You want to visualize the strength of correlations among many variables.

Solution

Suppose this is your data: set.seed(955) vvar <- 1:20 + rnorm(20,sd=3) wvar <- 1:20 + rnorm(20,sd=5) xvar <- 20:1 + rnorm(20,sd=3) yvar <- (1:20)/2 + rnorm(20, sd=10) zvar <- rnorm(20, sd=6) # A data frame with multiple variables data <- data.frame(vvar, wvar, xvar, yvar, zvar) head(data) #> vvar wvar xvar yvar zvar #> 1 -4.252354 5.1219288 16.02193 -15.156368 -4.086904 #> 2 1.702318 -1.3234340 15.83817 -24.063902 3.468423 #> 3 4.323054 -2.1570874 19.85517 2.306770 -3.044931 #> 4 1.780628 0.7880138 17.65079 2.564663 1.449081 #> 5 11.537348 -1.3075994 10.93386 9.600835 2.761963 #> 6 6.672130 2.0135190 15.24350 -3.465695 5.749642 To make the graph: library(ellipse) # Make the correlation table ctab <- cor(data) round(ctab, 2) #> vvar wvar xvar yvar zvar #> vvar 1.00 0.61 -0.85 0.75 -0.21 #> wvar 0.61 1.00 -0.81 0.54 -0.31 #> xvar -0.85 -0.81 1.00 -0.63 0.24 #> yvar 0.75 0.54 -0.63 1.00 -0.30 #> zvar -0.21 -0.31 0.24 -0.30 1.00 # Make the graph, with reduced margins plotcorr(ctab, mar = c(0.1, 0.1, 0.1, 0.1)) # Do the same, but with colors corresponding to value colorfun <- colorRamp(c("#CC0000","white","#3366CC"), space="Lab") plotcorr(ctab, col=rgb(colorfun((ctab+1)/2), maxColorValue=255), mar = c(0.1, 0.1, 0.1, 0.1))

Debugging a script or function

Problem

You want to debug a script or function.

Solution

Insert this into your code at the place where you want to start debugging: browser() When the R interpreter reaches that line, it will pause your code and you will be able to look at and change variables. In the browser, typing these letters will do things:
c Continue
n (or Return) Next step
Q Quit
Ctrl-C Go to top level
When in the browser, you can see what variables are in the current scope. ls() To pause and start a browser for every line in your function: debug(myfunction) myfunction(x)
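A toy example (the function name and body are hypothetical, just to show the mechanics):

running_total <- function(v) {
  total <- 0
  for (x in v) {
    browser()       # Pauses here on every pass through the loop
    total <- total + x
  }
  total
}
running_total(1:3)  # At the Browse prompt, try ls(), x, total, n, c, or Q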

Useful options

By default, every time you press Enter at the browser prompt, it runs the next step. This is equivalent to pressing n and then Enter. This can be annoying. To disable it use: options(browserNLdisabled=TRUE) To start debugging whenever an error is thrown, run this before your function which throws an error: options(error=recover) If you want these options to be set every time you start R, you can put them in your ~/.Rprofile file.

Measuring elapsed time

Problem

You want to measure how much time it takes to run a particular block of code.

Solution

The system.time() function will measure how long it takes to run something in R. system.time({ # Do something that takes time x <- 1:100000 for (i in seq_along(x)) x[i] <- x[i]+1 }) #> user system elapsed #> 0.144 0.002 0.153 The output means it took 0.153 seconds to run the block of code.
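system.time() is built on proc.time(), which you can also call directly to time a span of code that doesn't fit neatly into one block:

start <- proc.time()
x <- 1:100000
for (i in seq_along(x)) x[i] <- x[i] + 1
proc.time() - start   # Same user/system/elapsed columns as system.time()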

Getting a list of functions and objects in a package

Problem

You want to find out what’s in a package.

Solution

This code snippet will list the functions and objects in a package. # Using search() in a new R session says that these packages are # loaded by default: # "package:stats" "package:graphics" # "package:grDevices" "package:utils" "package:datasets" # "package:methods" "package:base" # Others that are useful: # gplots # ggplot2, reshape, plyr showPackageContents <- function (packageName) { # Get a list of things contained in a particular package funlist <- objects(packageName) # Remove things that don't start with a letter idx <- grep('^[a-zA-Z][a-zA-Z0-9._]*', funlist) funlist <- funlist[idx] # Remove things that contain an arrow idx <- grep('<-', funlist) if (length(idx)!=0) funlist <- funlist[-idx] # Make a data frame to keep track of status objectlist <- data.frame(name=funlist, primitive=FALSE, func=FALSE, object=FALSE, constant=FALSE, stringsAsFactors=F) for (i in 1:nrow(objectlist)) { fname <- objectlist$name[i] if (exists(fname)) { obj <- get(fname) if (is.primitive(obj)) { objectlist$primitive[i] <- TRUE } if (is.function(obj)) { objectlist$func[i] <- TRUE } if (is.object(obj)) { objectlist$object[i] <- TRUE } # I think these are generally constants if (is.vector(obj)) { objectlist$constant[i] <- TRUE } } } cat(packageName) cat("\n================================================\n") cat("Primitive functions: \n") cat(objectlist$name[objectlist$primitive]) cat("\n") cat("\n================================================\n") cat("Non-primitive functions: \n") cat(objectlist$name[objectlist$func & !objectlist$primitive]) cat("\n") cat("\n================================================\n") cat("Constants: \n") cat(objectlist$name[objectlist$constant]) cat("\n") cat("\n================================================\n") cat("Objects: \n") cat(objectlist$name[objectlist$object]) cat("\n") } # Run the function using base package showPackageContents("package:base")

Generating counterbalanced orders

Problem

You want to generate counterbalanced sequences for an experiment.

Solution

The function latinsquare() (defined below) can be used to generate Latin squares. latinsquare(4) #> [,1] [,2] [,3] [,4] #> [1,] 1 2 4 3 #> [2,] 2 1 3 4 #> [3,] 3 4 1 2 #> [4,] 4 3 2 1 # To generate 2 Latin squares of size 4 (in sequence) latinsquare(4, reps=2) #> [,1] [,2] [,3] [,4] #> [1,] 3 4 1 2 #> [2,] 4 3 2 1 #> [3,] 1 2 4 3 #> [4,] 2 1 3 4 #> [5,] 4 2 1 3 #> [6,] 2 3 4 1 #> [7,] 1 4 3 2 #> [8,] 3 1 2 4 # It is better to put the random seed in the function call, to make it repeatable # This will return the same sequence of two Latin squares every time latinsquare(4, reps=2, seed=5873) #> [,1] [,2] [,3] [,4] #> [1,] 1 4 2 3 #> [2,] 4 1 3 2 #> [3,] 2 3 4 1 #> [4,] 3 2 1 4 #> [5,] 3 2 4 1 #> [6,] 1 4 2 3 #> [7,] 4 3 1 2 #> [8,] 2 1 3 4 There are 576 Latin squares of size 4. The latinsquare function will, in effect, randomly select n of these squares and return them in sequence. This is known as a replicated Latin square design. Once you generate your Latin squares, it is a good idea to inspect them to make sure that there are not many duplicated sequences. This is not uncommon with smaller squares (3x3 or 4x4).

A function for generating Latin squares

This function generates Latin squares. It uses a somewhat brute-force algorithm to generate each square, which can sometimes fail because it runs out of available numbers to put in a given location. In such cases, it just tries again. There may be a more elegant way to do it, but I am not aware of it. ## - len is the size of the latin square ## - reps is the number of repetitions - how many Latin squares to generate ## - seed is a random seed that can be used to generate repeatable sequences ## - returnstrings tells it to return a vector of char strings for each square, ## instead of a big matrix. This option is only really used for checking the ## randomness of the squares. latinsquare <- function(len, reps=1, seed=NA, returnstrings=FALSE) { # Save the old random seed and use the new one, if present if (!is.na(seed)) { if (exists(".Random.seed")) { saved.seed <- .Random.seed } else { saved.seed <- NA } set.seed(seed) } # This matrix will contain all the individual squares allsq <- matrix(nrow=reps*len, ncol=len) # Store a string id of each square if requested if (returnstrings) { squareid <- vector(mode = "character", length = reps) } # Get a random element from a vector (the built-in sample function annoyingly # has different behavior if there's only one element in x) sample1 <- function(x) { if (length(x)==1) { return(x) } else { return(sample(x,1)) } } # Generate each of n individual squares for (n in 1:reps) { # Generate an empty square sq <- matrix(nrow=len, ncol=len) # If we fill the square sequentially from top left, some latin squares # are more probable than others. So we have to do it in random order, # all over the square. # The rough procedure is: # - randomly select a cell that is currently NA (call it the target cell) # - find all the NA cells sharing the same row or column as the target # - fill the target cell # - fill the other cells sharing the row/col # - If it is ever impossible to fill a cell because all the numbers # are already used, then quit and start over with a new square. # In short, it picks a random empty cell, fills it, then fills in the # other empty cells in the "cross" in random order. If we went totally randomly # (without the cross), the failure rate is much higher.
while (any(is.na(sq))) { # Pick a random cell which is currently NA k <- sample1(which(is.na(sq))) i <- (k-1) %% len +1 # Get the row num j <- floor((k-1) / len) +1 # Get the col num # Find the other NA cells in the "cross" centered at i,j sqrow <- sq[i,] sqcol <- sq[,j] # A matrix of coordinates of all the NA cells in the cross openCell <-rbind( cbind(which(is.na(sqcol)), j), cbind(i, which(is.na(sqrow)))) # Randomize fill order openCell <- openCell[sample(nrow(openCell)),] # Put center cell at top of list, so that it gets filled first openCell <- rbind(c(i,j), openCell) # There will now be three entries for the center cell, so remove duplicated entries # Need to make sure it's a matrix -- otherwise, if there's just # one row, it turns into a vector, which causes problems openCell <- matrix(openCell[!duplicated(openCell),], ncol=2) # Fill in the center of the cross, then the other open spaces in the cross for (c in 1:nrow(openCell)) { # The current cell to fill ci <- openCell[c,1] cj <- openCell[c,2] # Get the numbers that are unused in the "cross" centered on i,j freeNum <- which(!(1:len %in% c(sq[ci,], sq[,cj]))) # Fill in this location on the square if (length(freeNum)>0) { sq[ci,cj] <- sample1(freeNum) } else { # Failed attempt - no available numbers # Re-generate empty square sq <- matrix(nrow=len, ncol=len) # Break out of loop break; } } } # Store the individual square into the matrix containing all squares allsqrows <- ((n-1)*len) + 1:len allsq[allsqrows,] <- sq # Store a string representation of the square if requested. Each unique # square has a unique string. if (returnstrings) { squareid[n] <- paste(sq, collapse="") } } # Restore the old random seed, if present if (!is.na(seed) && !is.na(saved.seed)) { .Random.seed <- saved.seed } if (returnstrings) { return(squareid) } else { return(allsq) } }

Testing the function for randomness

Some algorithms for generating Latin squares may not create them very randomly. There are 576 unique 4x4 squares, and each one should have an equal probability of being created, but some algorithms do not properly do this. There is probably no need to test the randomness of the function above, but here is some code that will do it. Previous algorithms that I used did not have a good random distribution, which was discovered by running the code below. This code creates 10,000 4x4 Latin squares, then counts how often each of the 576 unique squares appears. The counts should form a not-too-wide normal distribution; otherwise the distribution is not very random. I believe the expected standard deviation of the distribution (assuming randomly generated squares) is sqrt(10000/576). # Set up the size and number of squares to generate squaresize <- 4 numsquares <- 10000 # Get number of unique squares of a given size. # There is not a general solution to finding the number of unique nxn squares # so we just hard-code the values here. (From http://oeis.org/A002860) uniquesquares <- c(1, 2, 12, 576, 161280, 812851200)[squaresize] # Generate the squares s <- latinsquare(squaresize, numsquares, seed=122, returnstrings=TRUE) # Get the list of all squares and counts for each slist <- rle(sort(s)) scounts <- slist[[1]] hist(scounts, breaks=(min(scounts):(max(scounts)+1)-.5)) cat(sprintf("Expected and actual standard deviation: %.4f, %.4f\n", sqrt(numsquares/uniquesquares), sd(scounts) )) #> Expected and actual standard deviation: 4.1667, 4.0883
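A chi-squared goodness-of-fit test is one way to formalize this check. This is a sketch: it assumes the scounts and uniquesquares objects from above, and pads the counts with zeros for any squares that were never generated.

# Counts for all 576 possible squares, including those that never appeared
allcounts <- c(scounts, rep(0, uniquesquares - length(scounts)))
# Test against the null hypothesis that every square is equally likely
chisq.test(allcounts)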