R is a programming language and software environment for statistical analysis, graphics representation and reporting.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions.
R allows integration with procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft, and an official part of the GNU project called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand.
R made its first appearance in 1993.
A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.
Features of R
As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting.
The following are the important features of R -
R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly at the computer or on hardcopy.
In conclusion, R is the world’s most widely used statistics programming language.
It is the #1 choice of data scientists and is supported by a vibrant and talented community of contributors.
R is taught in universities and deployed in mission critical business applications.
This tutorial will teach you R programming along with suitable examples in simple and easy steps.
Local Environment Setup
If you want to set up your environment for R, you can follow the steps given below.
Windows Installation
You can download the Windows installer version of R from R-3.2.2 for Windows (32/64 bit) and save it in a local directory.
As it is a Windows installer (.exe) with a name like "R-version-win.exe", you can just double-click and run the installer, accepting the default settings.
If your Windows is a 32-bit version, it installs the 32-bit version.
But if your Windows is 64-bit, then it installs both the 32-bit and 64-bit versions.
After installation, you can locate the icon to run the program in a directory structure "R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files.
Clicking this icon brings up the R-GUI which is the R console to do R Programming.
Linux Installation
R is available as a binary for many versions of Linux at the location R Binaries.
The instructions to install R vary from flavor to flavor of Linux.
These steps are mentioned under each type of Linux version at the mentioned link.
However, if you are in a hurry, then you can use the yum command to install R as follows -
$ yum install R
The above command will install the core functionality of R programming along with standard packages. If you need an additional package, you can launch the R prompt as follows -
$ R
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
Now you can use the install.packages() command at the R prompt to install a required package.
For example, the following command will install the plotrix package which is required for 3D charts.
> install.packages("plotrix")
R - Basic Syntax
As a convention, we will start learning R programming by writing a "Hello, World!" program.
Depending on the needs, you can program either at R command prompt or you can use an R script file to write your program.
Let's check both one by one.
R Command Prompt
Once you have the R environment set up, it’s easy to start the R command prompt by just typing the following command at your command prompt -
$ R
This will launch the R interpreter and you will get a prompt > where you can start typing your program as follows -
> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"
Here the first statement defines a string variable myString, to which we assign the string "Hello, World!"; the next statement uses print() to print the value stored in the variable myString.
R Script File
Usually, you will do your programming by writing your programs in script files and then executing those scripts at your command prompt with the help of the R interpreter called Rscript.
So let's start with writing the following code in a text file called test.R -
# My first program in R Programming
myString <- "Hello, World!"
print(myString)
Save the above code in a file test.R and execute it at Linux command prompt as given below.
Even if you are using Windows or another system, the syntax will remain the same.
$ Rscript test.R
When we run the above program, it produces the following result.
[1] "Hello, World!"
Comments
Comments are like helping text in your R program and they are ignored by the interpreter while executing your actual program.
A single comment is written using # at the beginning of the statement as follows -
# My first program in R Programming
R does not support multi-line comments, but you can use a trick which is something as follows -
if(FALSE) {
   "This is a demo for multi-line comments and it should be put inside either a
   single OR double quote"
}
myString <- "Hello, World!"
print(myString)
[1] "Hello, World!"
Though the above comment will be evaluated by the R interpreter, it will not interfere with your actual program.
You should put such comments inside either single or double quotes.
R - Data Types
Generally, while programming in any language, you need to use various variables to store various information.
Variables are nothing but reserved memory locations to store values.
This means that when you create a variable, you reserve some space in memory.
You may like to store information of various data types like character, wide character, integer, floating point, double floating point, Boolean etc.
Based on the data type of a variable, the operating system allocates memory and decides what can be stored in the reserved memory.
In contrast to other programming languages like C and Java, in R the variables are not declared as some data type.
The variables are assigned with R-Objects and the data type of the R-object becomes the data type of the variable.
There are many types of R-objects.
The frequently used ones are -
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object, and there are six data types of these atomic vectors, also termed as the six classes of vectors.
The other R-Objects are built upon the atomic vectors.
Data Type
Example
Verify
Logical
TRUE, FALSE
v <- TRUE
print(class(v))
it produces the following result -
[1] "logical"
Numeric
12.3, 5, 999
v <- 23.5
print(class(v))
it produces the following result -
[1] "numeric"
Integer
2L, 34L, 0L
v <- 2L
print(class(v))
it produces the following result -
[1] "integer"
Complex
3 + 2i
v <- 2+5i
print(class(v))
it produces the following result -
[1] "complex"
Character
'a', "good", "TRUE", '23.4'
v <- "TRUE"
print(class(v))
it produces the following result -
[1] "character"
Raw
"Hello" is stored as 48 65 6c 6c 6f
v <- charToRaw("Hello")
print(class(v))
it produces the following result -
[1] "raw"
In R programming, the very basic data types are the R-objects called vectors, each of which holds elements of one of the classes shown above.
Please note that in R the number of classes is not confined to only the above six types.
For example, we can use many atomic vectors and create an array whose class will become array.
Vectors
When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
When we execute the above code, it produces the following result -
[1] "red" "green" "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements inside it like vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
When we execute the above code, it produces the following result -
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set.
It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result -
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions.
The array function takes a dim attribute which creates the required number of dimensions.
In the below example we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result -
, , 1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 2
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
Factors
Factors are R-objects which are created using a vector.
A factor stores the vector along with the distinct values of the elements in the vector as labels.
The labels are always character, irrespective of whether the input vector is numeric, character or Boolean.
Factors are useful in statistical modeling.
Factors are created using the factor() function.
The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple <- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result -
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
Data Frames
Data frames are tabular data objects.
Unlike a matrix, in a data frame each column can contain a different mode of data.
The first column can be numeric while the second column can be character and the third column can be logical.
It is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   Age = c(42, 38, 26)
)
print(BMI)
When we execute the above code, it produces the following result -
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
R - Variables
A variable provides us with named storage that our programs can manipulate.
A variable in R can store an atomic vector, a group of atomic vectors or a combination of many R-objects.
A valid variable name consists of letters, numbers and the dot or underline characters.
The variable name starts with a letter or a dot not followed by a number.
Variable Name         Validity   Reason
var_name2.            valid      Has letters, numbers, dot and underscore.
var_name%             invalid    Has the character '%'. Only dot (.) and underscore are allowed.
2var_name             invalid    Starts with a number.
.var_name, var.name   valid      Can start with a dot (.), but the dot should not be followed by a number.
.2var_name            invalid    The starting dot is followed by a number, making it invalid.
_var_name             invalid    Starts with _, which is not valid.
Variable Assignment
The variables can be assigned values using the leftward, rightward and equal-to operators.
The values of the variables can be printed using print() or cat() function.
The cat() function combines multiple items into a continuous print output.
# Assignment using equal operator.
var.1 = c(0,1,2,3)
# Assignment using leftward operator.
var.2 <- c("learn","R")
# Assignment using rightward operator.
c(TRUE,1) -> var.3
print(var.1)
cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")
When we execute the above code, it produces the following result -
[1] 0 1 2 3
var.1 is 0 1 2 3
var.2 is learn R
var.3 is 1 1
Note - The vector c(TRUE,1) has a mix of the logical and numeric classes.
So the logical class is coerced to the numeric class, turning TRUE into 1.
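The coercion described in the note above can be checked directly with class(); a minimal sketch (the variable name is illustrative):

```r
# Mixing logical and numeric values in c() coerces the
# logical values to numeric, so TRUE becomes 1.
v <- c(TRUE, 1)
print(class(v))   # [1] "numeric"
print(v)          # [1] 1 1
```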
Data Type of a Variable
In R, a variable itself is not declared of any data type, rather it gets the data type of the R - object assigned to it.
So R is called a dynamically typed language, which means that we can change the data type of the same variable again and again when using it in a program.
var_x <- "Hello"
cat("The class of var_x is ",class(var_x),"\n")
var_x <- 34.5
cat(" Now the class of var_x is ",class(var_x),"\n")
var_x <- 27L
cat(" Next the class of var_x becomes ",class(var_x),"\n")
When we execute the above code, it produces the following result -
The class of var_x is character
Now the class of var_x is numeric
Next the class of var_x becomes integer
Finding Variables
To know all the variables currently available in the workspace, we use the ls() function.
print(ls())
When we execute the above code, it produces the following result -
[1] "my var" "my_new_var" "my_var" "var.1"
[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"
Note - It is a sample output depending on what variables are declared in your environment.
The ls() function can use patterns to match the variable names.
# List the variables starting with the pattern "var".
print(ls(pattern = "var"))
When we execute the above code, it produces the following result -
[1] "my var" "my_new_var" "my_var" "var.1"
[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"
The variables starting with a dot (.) are hidden; they can be listed using the "all.names = TRUE" argument to the ls() function.
print(ls(all.names = TRUE))
When we execute the above code, it produces the following result -
[1] ".cars"        ".Random.seed" ".var_name"    ".varname"     ".varname2"
[6] "my var"       "my_new_var"   "my_var"       "var.1"        "var.2"
[11] "var.3"       "var.name"     "var_name2."   "var_x"
Deleting Variables
Variables can be deleted by using the rm() function.
Below we delete the variable var.3.
On printing the value of the deleted variable, an error is thrown.
rm(var.3)
print(var.3)
When we execute the above code, it produces the following result -
Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm() and ls() function together.
rm(list = ls())
print(ls())
When we execute the above code, it produces the following result -
character(0)
R - Operators
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical manipulation.
The R language is rich in built-in operators and provides the following types of operators.
Types of Operators
We have the following types of operators in R programming -
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
Following table shows the arithmetic operators supported by R language.
The operators act on each element of the vector.
Operator
Description
Example
+
Adds two vectors
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v+t)
it produces the following result -
[1] 10.0 8.5 10.0
-
Subtracts second vector from the first
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v-t)
it produces the following result -
[1] -6.0 2.5 2.0
*
Multiplies both vectors
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v*t)
it produces the following result -
[1] 16.0 16.5 24.0
/
Divide the first vector with the second
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v/t)
When we execute the above code, it produces the following result -
[1] 0.250000 1.833333 1.500000
%%
Gives the remainder of the first vector divided by the second
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%%t)
it produces the following result -
[1] 2.0 2.5 2.0
%/%
The result of division of the first vector by the second (quotient)
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%/%t)
it produces the following result -
[1] 0 1 1
^
The first vector raised to the exponent of second vector
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v^t)
it produces the following result -
[1] 256.000 166.375 1296.000
Relational Operators
Following table shows the relational operators supported by R language.
Each element of the first vector is compared with the corresponding element of the second vector.
The result of comparison is a Boolean value.
Operator
Description
Example
>
Checks if each element of the first vector is greater than the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>t)
it produces the following result -
[1] FALSE TRUE FALSE FALSE
<
Checks if each element of the first vector is less than the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v < t)
it produces the following result -
[1] TRUE FALSE TRUE FALSE
==
Checks if each element of the first vector is equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
it produces the following result -
[1] FALSE FALSE FALSE TRUE
<=
Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v<=t)
it produces the following result -
[1] TRUE FALSE TRUE TRUE
>=
Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>=t)
it produces the following result -
[1] FALSE TRUE FALSE TRUE
!=
Checks if each element of the first vector is unequal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
it produces the following result -
[1] TRUE TRUE TRUE FALSE
Logical Operators
Following table shows the logical operators supported by R language.
They are applicable only to vectors of type logical, numeric or complex.
All non-zero numbers are treated as the logical value TRUE, and zero as FALSE.
Each element of the first vector is compared with the corresponding element of the second vector.
The result of comparison is a Boolean value.
Operator
Description
Example
&
It is called the Element-wise Logical AND operator.
It combines each element of the first vector with the corresponding element of the second vector and gives an output of TRUE if both elements are TRUE.
v <- c(3,1,TRUE,2+3i)
t <- c(4,1,FALSE,2+3i)
print(v&t)
it produces the following result -
[1] TRUE TRUE FALSE TRUE
|
It is called the Element-wise Logical OR operator.
It combines each element of the first vector with the corresponding element of the second vector and gives an output of TRUE if at least one of the elements is TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
print(v|t)
it produces the following result -
[1] TRUE FALSE TRUE TRUE
!
It is called the Logical NOT operator.
It takes each element of the vector and gives the opposite logical value.
v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result -
[1] FALSE TRUE FALSE FALSE
The logical operators && and || consider only the first element of the vectors and give a vector of a single element as output.
Operator
Description
Example
&&
Called the Logical AND operator.
Takes the first element of both vectors and gives TRUE only if both are TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v&&t)
it produces the following result -
[1] TRUE
||
Called the Logical OR operator.
Takes the first element of both vectors and gives TRUE if either one is TRUE.
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v||t)
it produces the following result -
[1] FALSE
Assignment Operators
These operators are used to assign values to vectors.
Operator
Description
Example
<-
or
=
or
<<-
Called Left Assignment
v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
it produces the following result -
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
->
or
->>
Called Right Assignment
c(3,1,TRUE,2+3i) -> v1
c(3,1,TRUE,2+3i) ->> v2
print(v1)
print(v2)
it produces the following result -
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical computation.
Operator
Description
Example
:
Colon operator.
It creates a series of numbers in sequence for a vector.
v <- 2:8
print(v)
it produces the following result -
[1] 2 3 4 5 6 7 8
%in%
This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result -
[1] TRUE
[1] FALSE
%*%
This operator is used for matrix multiplication. In the example below, a matrix is multiplied by its transpose.
M = matrix( c(2,6,5,1,10,4), nrow = 2,ncol = 3,byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result -
[,1] [,2]
[1,] 65 82
[2,] 82 117
R - Decision making
Decision making structures require the programmer to specify one or more conditions to be evaluated or tested by the program, along with a statement or statements to be executed if the condition is determined to be true, and optionally, other statements to be executed if the condition is determined to be false.
R provides the following types of decision making statements.
Click the following links to check their detail.
Sr.No.
Statement & Description
1
if statement
An if statement consists of a Boolean expression followed by one or more statements.
2
if...else statement
An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.
3
switch statement
A switch statement allows a variable to be tested for equality against a list of values.
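As a quick sketch of the statements listed above (the variable names and values here are illustrative, not from the linked pages):

```r
x <- 15

# if...else: executes one branch based on a condition.
if(x > 10) {
   print("x is greater than 10")
} else {
   print("x is 10 or less")
}

# switch: tests a value against a list of alternatives
# and returns the matching one.
grade <- switch("B",
   "A" = "Excellent",
   "B" = "Good",
   "C" = "Average"
)
print(grade)   # [1] "Good"
```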
R - Loops
There may be a situation when you need to execute a block of code several times.
In general, statements are executed sequentially.
The first statement in a function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple times.
R programming language provides the following kinds of loop to handle looping requirements.
Click the following links to check their detail.
Sr.No.
Loop Type & Description
1
repeat loop
Executes a sequence of statements repeatedly until a break statement is encountered; the body is executed at least once.
2
while loop
Repeats a statement or group of statements while a given condition is true.
It tests the condition before executing the loop body.
3
for loop
Executes a group of statements once for each element of a vector or list.
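The three loop types above can be sketched as follows (the vectors and counters are illustrative):

```r
# for loop: runs once for each element of a vector.
for(v in c("Hello", "loop")) {
   print(v)
}

# while loop: tests the condition before each pass.
cnt <- 1
while(cnt <= 3) {
   print(cnt)
   cnt <- cnt + 1
}

# repeat loop: runs at least once; stops only via break.
i <- 1
repeat {
   i <- i * 2
   if(i > 8) break
}
print(i)   # [1] 16
```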
Loop Control Statements
Loop control statements change execution from its normal sequence.
R supports the following control statements.
Click the following links to check their detail.
Sr.No.
Control Statement & Description
1
break statement
Terminates the loop statement and transfers execution to the statement immediately following the loop.
2
next statement
The next statement skips the remainder of the current iteration and causes the loop to continue with the next one.
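A minimal sketch showing break and next together (the loop range is illustrative):

```r
for(i in 1:6) {
   if(i == 3) next    # skip the rest of this iteration
   if(i == 5) break   # terminate the loop entirely
   print(i)
}
# Prints 1, 2 and 4: the value 3 is skipped by next,
# and break stops the loop before 5 and 6 are printed.
```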
R - Functions
A function is a set of statements organized together to perform a specific task.
R has a large number of in-built functions and the user can create their own functions.
In R, a function is an object so the R interpreter is able to pass control to the function, along with arguments that may be necessary for the function to accomplish the actions.
The function in turn performs its task and returns control to the interpreter as well as any result which may be stored in other objects.
Function Definition
An R function is created by using the keyword function.
The basic syntax of an R function definition is as follows -
function_name <- function(arg_1, arg_2, ...) {
   # Function body
}
Function Components
The different parts of a function are -
Function Name - This is the actual name of the function.
It is stored in R environment as an object with this name.
Arguments - An argument is a placeholder.
When a function is invoked, you pass a value to the argument.
Arguments are optional; that is, a function may contain no arguments.
Also arguments can have default values.
Function Body - The function body contains a collection of statements that defines what the function does.
Return Value - The return value of a function is the last expression in the function body to be evaluated.
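The return-value rule above means an explicit return() is optional; a minimal sketch (the function name is illustrative):

```r
# The last evaluated expression in the body is the return value.
square <- function(x) {
   x^2   # returned implicitly; return(x^2) would be equivalent
}
print(square(4))   # [1] 16
```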
R has many in-built functions which can be directly called in the program without defining them first.
We can also create and use our own functions, referred to as user-defined functions.
Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum() and paste(), etc.
They are directly called by user written programs.
You can refer to the most widely used R functions.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
# Find mean of numbers from 25 to 82.
print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))
When we execute the above code, it produces the following result -
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
User-defined Function
We can create user-defined functions in R.
They are specific to what a user wants and once created they can be used like the built-in functions.
Below is an example of how a function is created and used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
# Call the function new.function supplying 6 as an argument.
new.function(6)
When we execute the above code, it produces the following result -
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
Calling a Function without an Argument
# Create a function without an argument.
new.function <- function() {
   for(i in 1:5) {
      print(i^2)
   }
}
# Call the function without supplying an argument.
new.function()
When we execute the above code, it produces the following result -
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
Calling a Function with Argument Values (by position and by name)
The arguments to a function call can be supplied in the same sequence as defined in the function or they can be supplied in a different sequence but assigned to the names of the arguments.
# Create a function with arguments.
new.function <- function(a, b, c) {
   result <- a * b + c
   print(result)
}
# Call the function by position of arguments.
new.function(5,3,11)
# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)
When we execute the above code, it produces the following result -
[1] 26
[1] 58
Calling a Function with Default Argument
We can define the value of the arguments in the function definition and call the function without supplying any argument to get the default result.
But we can also call such functions by supplying new values of the argument and get non default result.
# Create a function with arguments.
new.function <- function(a = 3, b = 6) {
   result <- a * b
   print(result)
}
# Call the function without giving any argument.
new.function()
# Call the function supplying new values for the arguments.
new.function(9,5)
When we execute the above code, it produces the following result -
[1] 18
[1] 45
Lazy Evaluation of Function
Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the function body.
# Create a function with arguments.
new.function <- function(a, b) {
   print(a^2)
   print(a)
   print(b)
}
# Evaluate the function without supplying one of the arguments.
new.function(6)
When we execute the above code, it produces the following result -
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
R - Strings
Any value written within a pair of single quotes or double quotes in R is treated as a string.
Internally R stores every string within double quotes, even when you create it with single quotes.
Rules Applied in String Construction
The quotes at the beginning and end of a string should be either both double quotes or both single quotes.
They cannot be mixed.
Double quotes can be inserted into a string starting and ending with single quotes.
Single quotes can be inserted into a string starting and ending with double quotes.
Double quotes cannot be inserted into a string starting and ending with double quotes.
Single quotes cannot be inserted into a string starting and ending with single quotes.
Examples of Valid Strings
Following examples clarify the rules about creating a string in R.
a <- 'Start and end with single quote'
print(a)
b <- "Start and end with double quotes"
print(b)
c <- "single quote ' in between double quotes"
print(c)
d <- 'Double quotes " in between single quote'
print(d)
When the above code is run we get the following output -
[1] "Start and end with single quote"
[1] "Start and end with double quotes"
[1] "single quote ' in between double quotes"
[1] "Double quotes \" in between single quote"
Examples of Invalid Strings
e <- 'Mixed quotes"
print(e)
f <- 'Single quote ' inside single quote'
print(f)
g <- "Double quotes " inside double quotes"
print(g)
When we run the script it fails giving below results.
Error: unexpected symbol in:
"print(e)
f <- 'Single"
Execution halted
String Manipulation
Concatenating Strings - paste() function
Many strings in R are combined using the paste() function.
It can take any number of arguments to be combined together.
Syntax
The basic syntax for paste function is -
paste(..., sep = " ", collapse = NULL)
Following is the description of the parameters used -
... represents any number of arguments to be combined.
sep represents any separator between the arguments.
It is optional.
collapse is used to eliminate the space between two strings, but not the space within the two words of one string.
Example
a <- "Hello"
b <- 'How'
c <- "are you? "
print(paste(a,b,c))
print(paste(a,b,c, sep = "-"))
print(paste(a,b,c, sep = "", collapse = ""))
When we execute the above code, it produces the following result -
[1] "Hello How are you? "
[1] "Hello-How-are you? "
[1] "HelloHoware you? "
Formatting numbers & strings - format() function
Numbers and strings can be formatted to a specific style using the format() function.
Syntax
The basic syntax for format function is -
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
Following is the description of the parameters used -
x is the vector input.
digits is the total number of digits displayed.
nsmall is the minimum number of digits to the right of the decimal point.
scientific is set to TRUE to display scientific notation.
width indicates the minimum width to be displayed by padding blanks in the beginning.
justify is the display of the string to left, right or center.
Example
# Total number of digits displayed.
# The last digit is rounded off.
result <- format(23.123456789, digits = 9)
print(result)
# Display numbers in scientific notation.
result <- format(c(6, 13.14521), scientific = TRUE)
print(result)
# The minimum number of digits to the right of the decimal point.
result <- format(23.47, nsmall = 5)
print(result)
# Format treats everything as a string.
result <- format(6)
print(result)
# Numbers are padded with blank in the beginning for width.
result <- format(13.7, width = 6)
print(result)
# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)
# Justify the string with center.
result <- format("Hello", width = 8, justify = "c")
print(result)
When we execute the above code, it produces the following result -
[1] "23.1234568"
[1] "6.000000e+00" "1.314521e+01"
[1] "23.47000"
[1] "6"
[1] " 13.7"
[1] "Hello "
[1] " Hello "
Counting number of characters in a string - nchar() function
This function counts the number of characters including spaces in a string.
Syntax
The basic syntax for nchar() function is -
nchar(x)
Following is the description of the parameters used -
x is the vector input.
Example
result <- nchar("Count the number of characters")
print(result)
When we execute the above code, it produces the following result -
[1] 30
Changing the case - toupper() & tolower() functions
These functions change the case of characters of a string.
Syntax
The basic syntax for the toupper() & tolower() functions is -
toupper(x)
tolower(x)
Following is the description of the parameters used -
x is the vector input.
Example
# Changing to Upper case.
result <- toupper("Changing To Upper")
print(result)
# Changing to lower case.
result <- tolower("Changing To Lower")
print(result)
When we execute the above code, it produces the following result -
[1] "CHANGING TO UPPER"
[1] "changing to lower"
Extracting parts of a string - substring() function
This function extracts parts of a string.
Syntax
The basic syntax for substring() function is -
substring(x,first,last)
Following is the description of the parameters used -
x is the character vector input.
first is the position of the first character to be extracted.
last is the position of the last character to be extracted.
Example
# Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)
When we execute the above code, it produces the following result -
[1] "act"
R - Vectors
Vectors are the most basic R data objects and there are six types of atomic vectors.
They are logical, integer, double, complex, character and raw.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of the above vector types.
# Atomic vector of type character.
print("abc");
# Atomic vector of type double.
print(12.5)
# Atomic vector of type integer.
print(63L)
# Atomic vector of type logical.
print(TRUE)
# Atomic vector of type complex.
print(2+3i)
# Atomic vector of type raw.
print(charToRaw('hello'))
When we execute the above code, it produces the following result -
[1] "abc"
[1] 12.5
[1] 63
[1] TRUE
[1] 2+3i
[1] 68 65 6c 6c 6f
Multiple Elements Vector
Using the colon operator with numeric data
# Creating a sequence from 5 to 13.
v <- 5:13
print(v)
# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
When we execute the above code, it produces the following result -
[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
Using the seq() function
# Create vector with elements from 5 to 9 incrementing by 0.4.
print(seq(5, 9, by = 0.4))
When we execute the above code, it produces the following result -
[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
Using the c() function
The non-character values are coerced to character type if one of the elements is a character.
# The logical and numeric values are converted to characters.
s <- c('apple','red',5,TRUE)
print(s)
When we execute the above code, it produces the following result -
[1] "apple" "red" "5" "TRUE"
Accessing Vector Elements
Elements of a Vector are accessed using indexing.
The [ ] brackets are used for indexing.
Indexing starts with position 1.
Giving a negative value in the index drops that element from the result. TRUE, FALSE or 0 and 1 can also be used for indexing.
# Accessing vector elements using position.
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)
# Accessing vector elements using logical indexing.
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
# Accessing vector elements using negative indexing.
x <- t[c(-2,-5)]
print(x)
# Accessing vector elements using 0/1 indexing.
y <- t[c(0,0,0,0,0,0,1)]
print(y)
When we execute the above code, it produces the following result -
[1] "Mon" "Tue" "Fri"
[1] "Sun" "Fri"
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
[1] "Sun"
Vector Manipulation
Vector arithmetic
Two vectors of the same length can be added, subtracted, multiplied or divided, giving a vector as the result.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
When we execute the above code, it produces the following result -
[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
Vector Element Recycling
If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
add.result <- v1+v2
print(add.result)
sub.result <- v1-v2
print(sub.result)
When we execute the above code, it produces the following result -
[1] 7 19 8 16 4 22
[1] -1 -3 0 -6 -4 0
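When the longer vector's length is not a multiple of the shorter one's, R still recycles but issues a warning. A sketch using a length-4 vector (not from the example above) against the length-6 v1:

```r
v1 <- c(3,8,4,5,0,11)
v3 <- c(4,11,2,8)
# v3 is recycled to c(4,11,2,8,4,11); a warning is raised because
# 6 is not a multiple of 4.
print(v1 + v3)   # 7 19 6 13 4 22
```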
Vector Element Sorting
Elements in a vector can be sorted using the sort() function.
v <- c(3,8,4,5,0,11, -9, 304)
# Sort the elements of the vector.
sort.result <- sort(v)
print(sort.result)
# Sort the elements in the reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)
# Sorting character vectors in reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
When we execute the above code, it produces the following result -
[1] -9 0 3 4 5 8 11 304
[1] 304 11 8 5 4 3 0 -9
[1] "Blue" "Red" "violet" "yellow"
[1] "yellow" "violet" "Red" "Blue"
R - Lists
Lists are the R objects which contain elements of different types, such as numbers, strings, vectors and even another list inside them.
A list can also contain a matrix or a function as its elements.
List is created using list() function.
Creating a List
Following is an example to create a list containing strings, numbers, vectors and logical values.
# Create a list containing strings, numbers, vectors and a logical
# values.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
When we execute the above code, it produces the following result -
[[1]]
[1] "Red"
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
Naming List Elements
The list elements can be given names and they can be accessed using these names.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Show the list.
print(list_data)
When we execute the above code, it produces the following result -
$`1st Quarter`
[1] "Jan" "Feb" "Mar"
$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
$`A Inner list`
$`A Inner list`[[1]]
[1] "green"
$`A Inner list`[[2]]
[1] 12.3
Accessing List Elements
Elements of the list can be accessed by the index of the element in the list.
In case of named lists it can also be accessed using the names.
We continue to use the list in the above example -
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Access the first element of the list.
print(list_data[1])
# Access the third element.
# As it is also a list, all its elements will be printed.
print(list_data[3])
# Access the list element using the name of the element.
print(list_data$A_Matrix)
When we execute the above code, it produces the following result -
$`1st Quarter`
[1] "Jan" "Feb" "Mar"
$`A Inner list`
$`A Inner list`[[1]]
[1] "green"
$`A Inner list`[[2]]
[1] 12.3
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
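Worth noting here is the difference between single-bracket and double-bracket indexing: [ returns a sub-list, while [[ returns the element itself. A minimal sketch with a made-up list:

```r
# list_data[1] wraps the element in a list of length 1;
# list_data[[1]] extracts the element directly.
list_data <- list(c("Jan","Feb","Mar"), 12.3)
print(class(list_data[1]))     # "list"
print(class(list_data[[1]]))   # "character"
```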
Manipulating List Elements
We can add, delete and update list elements as shown below.
Here we add an element at the end of the list, remove it again by setting it to NULL, and then update an existing element.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Add element at the end of the list.
list_data[4] <- "New element"
print(list_data[4])
# Remove the last element.
list_data[4] <- NULL
# Print the 4th Element.
print(list_data[4])
# Update the 3rd Element.
list_data[3] <- "updated element"
print(list_data[3])
When we execute the above code, it produces the following result -
[[1]]
[1] "New element"
$<NA>
NULL
$`A Inner list`
[1] "updated element"
Merging Lists
You can merge many lists into one list by placing all the lists inside one list() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
# Merge the two lists.
merged.list <- c(list1,list2)
# Print the merged list.
print(merged.list)
When we execute the above code, it produces the following result -
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
Converting List to Vector
A list can be converted to a vector so that the elements of the vector can be used for further manipulation.
All the arithmetic operations on vectors can be applied after the list is converted into vectors.
To do this conversion, we use the unlist() function.
It takes the list as input and produces a vector.
# Create lists.
list1 <- list(1:5)
print(list1)
list2 <-list(10:14)
print(list2)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
# Now add the vectors
result <- v1+v2
print(result)
When we execute the above code, it produces the following result -
[[1]]
[1] 1 2 3 4 5
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
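When the same operation is needed on every element, the list does not have to be unlisted first; lapply() and sapply(), which are not covered above, apply a function to each element directly:

```r
# lapply() returns a list; sapply() simplifies the result
# to a vector where possible.
list1 <- list(1:5, 10:14)
print(sapply(list1, sum))   # 15 60
```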
R - Matrices
Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout.
They contain elements of the same atomic types.
Though we can create a matrix containing only characters or only logical values, they are not of much use.
We use matrices containing numeric elements to be used in mathematical calculations.
A Matrix is created using the matrix() function.
Syntax
The basic syntax for creating a matrix in R is -
matrix(data, nrow, ncol, byrow, dimnames)
Following is the description of the parameters used -
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical value.
If TRUE then the input vector elements are arranged by row.
dimnames is a list of the names assigned to the rows and columns.
Example
Create a matrix taking a vector of numbers as input.
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)
When we execute the above code, it produces the following result -
[,1] [,2] [,3]
[1,] 3 4 5
[2,] 6 7 8
[3,] 9 10 11
[4,] 12 13 14
[,1] [,2] [,3]
[1,] 3 7 11
[2,] 4 8 12
[3,] 5 9 13
[4,] 6 10 14
col1 col2 col3
row1 3 4 5
row2 6 7 8
row3 9 10 11
row4 12 13 14
Accessing Elements of a Matrix
Elements of a matrix can be accessed by using the column and row index of the element.
We consider the matrix P above to find the specific elements below.
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
# Create the matrix.
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
# Access the element at 3rd column and 1st row.
print(P[1,3])
# Access the element at 2nd column and 4th row.
print(P[4,2])
# Access only the 2nd row.
print(P[2,])
# Access only the 3rd column.
print(P[,3])
When we execute the above code, it produces the following result -
[1] 5
[1] 13
col1 col2 col3
6 7 8
row1 row2 row3 row4
5 8 11 14
Matrix Computations
Various mathematical operations are performed on the matrices using the R operators.
The result of the operation is also a matrix.
The dimensions (number of rows and columns) should be the same for the matrices involved in the operation.
Matrix Addition & Subtraction
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
# Add the matrices.
result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result)
# Subtract the matrices
result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result)
When we execute the above code, it produces the following result -
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 9 4 6
[,1] [,2] [,3]
[1,] 5 0 3
[2,] 2 9 4
Result of addition
[,1] [,2] [,3]
[1,] 8 -1 5
[2,] 11 13 10
Result of subtraction
[,1] [,2] [,3]
[1,] -2 -1 -1
[2,] 7 -5 2
Matrix Multiplication & Division
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
# Multiply the matrices (element-wise).
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)
# Divide the matrices (element-wise).
result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)
When we execute the above code, it produces the following result -
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 9 4 6
[,1] [,2] [,3]
[1,] 5 0 3
[2,] 2 9 4
Result of multiplication
[,1] [,2] [,3]
[1,] 15 0 6
[2,] 18 36 24
Result of division
[,1] [,2] [,3]
[1,] 0.6 -Inf 0.6666667
[2,] 4.5 0.4444444 1.5000000
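Note that * and / above operate element by element. True matrix multiplication uses the %*% operator, where the column count of the first matrix must equal the row count of the second; a sketch using the transpose of matrix2 to make the dimensions agree:

```r
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
# A 2x3 matrix times a 3x2 matrix gives a 2x2 matrix:
# rows 21 5 and 63 78.
result <- matrix1 %*% t(matrix2)
print(result)
```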
R - Arrays
Arrays are the R data objects which can store data in more than two dimensions.
For example - If we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns.
Arrays can store only one data type.
An array is created using the array() function.
It takes vectors as input and uses the values in the dim parameter to create an array.
Example
The following example creates an array made up of two 3x3 matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2))
print(result)
When we execute the above code, it produces the following result -
, , 1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
, , 2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
Naming Columns and Rows
We can give names to the rows, columns and matrices in the array by using the dimnames parameter.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names,column.names,
matrix.names))
print(result)
When we execute the above code, it produces the following result -
, , Matrix1
COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15
, , Matrix2
COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15
Accessing Array Elements
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names,
column.names, matrix.names))
# Print the third row of the second matrix of the array.
print(result[3,,2])
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])
# Print the 2nd Matrix.
print(result[,,2])
When we execute the above code, it produces the following result -
COL1 COL2 COL3
3 12 15
[1] 13
COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15
Manipulating Array Elements
As an array is made up of matrices in multiple dimensions, operations on the elements of an array are carried out by accessing elements of those matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
array1 <- array(c(vector1,vector2),dim = c(3,3,2))
# Create two more vectors of different lengths.
vector3 <- c(9,1,0)
vector4 <- c(6,0,11,3,14,1,2,6,9)
array2 <- array(c(vector3,vector4),dim = c(3,3,2))
# Create matrices from these arrays.
matrix1 <- array1[,,2]
matrix2 <- array2[,,2]
# Add the matrices.
result <- matrix1+matrix2
print(result)
When we execute the above code, it produces the following result -
     [,1] [,2] [,3]
[1,]    7   19   19
[2,]   15   12   14
[3,]   12   12   26
Calculations Across Array Elements
We can do calculations across the elements in an array using the apply() function.
Syntax
apply(x, margin, fun)
Following is the description of the parameters used -
x is an array.
margin specifies the dimension over which the function is applied - 1 for rows, 2 for columns.
fun is the function to be applied across the elements of the array.
Example
We use the apply() function below to calculate the sum of the elements in the rows of an array across all the matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
new.array <- array(c(vector1,vector2),dim = c(3,3,2))
print(new.array)
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
When we execute the above code, it produces the following result -
, , 1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
, , 2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
[1] 56 68 60
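Changing the margin changes the direction of the calculation; with margin 2, apply() sums down the columns across all the matrices instead. A short sketch on the same array:

```r
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
new.array <- array(c(vector1,vector2), dim = c(3,3,2))
# Column sums across both matrices: 17+17, 33+33, 42+42.
result <- apply(new.array, c(2), sum)
print(result)   # 34 66 84
```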
R - Factors
Factors are the data objects which are used to categorize the data and store it as levels.
They can store both strings and integers.
They are useful in the columns which have a limited number of unique values.
Like "Male, "Female" and True, False etc.
They are useful in data analysis for statistical modeling.
Factors are created using the factor() function by taking a vector as input.
Example
# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
print(is.factor(data))
# Apply the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
When we execute the above code, it produces the following result -
[1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East" "North"
[1] FALSE
[1] East West East North North East West West West East North
Levels: East North West
[1] TRUE
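Once a factor is created, two common follow-up steps (not shown above) are listing its levels with levels() and counting how often each level occurs with table():

```r
data <- c("East","West","East","North","North","East","West","West","West","East","North")
factor_data <- factor(data)
print(levels(factor_data))   # "East" "North" "West"
print(table(factor_data))    # East: 4, North: 3, West: 4
```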
Factors in Data Frame
On creating a data frame with a column of text data, R treats the text column as categorical data and creates factors on it, provided stringsAsFactors is TRUE (the default in R versions before 4.0; from R 4.0 onwards the default is FALSE, so the factor must be requested explicitly).
# Create the vectors for data frame.
height <- c(132,151,162,139,166,147,122)
weight <- c(48,49,66,53,67,52,40)
gender <- c("male","male","female","female","male","female","male")
# Create the data frame, asking for the text column to be a factor.
input_data <- data.frame(height,weight,gender,stringsAsFactors = TRUE)
print(input_data)
# Test if the gender column is a factor.
print(is.factor(input_data$gender))
# Print the gender column to see the levels.
print(input_data$gender)
When we execute the above code, it produces the following result -
height weight gender
1 132 48 male
2 151 49 male
3 162 66 female
4 139 53 female
5 166 67 male
6 147 52 female
7 122 40 male
[1] TRUE
[1] male male female female male female male
Levels: female male
Changing the Order of Levels
The order of the levels in a factor can be changed by applying the factor function again with new order of the levels.
data <- c("East","West","East","North","North","East","West",
"West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)
# Apply the factor function with required order of the level.
new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
When we execute the above code, it produces the following result -
[1] East West East North North East West West West East North
Levels: East North West
[1] East West East North North East West West West East North
Levels: East West North
Generating Factor Levels
We can generate factor levels by using the gl() function.
It takes two integers as input which indicate how many levels there are and how many times each level is repeated.
Syntax
gl(n, k, labels)
Following is the description of the parameters used -
n is an integer giving the number of levels.
k is an integer giving the number of replications.
labels is a vector of labels for the resulting factor levels.
Example
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
When we execute the above code, it produces the following result -
[1] Tampa Tampa Tampa Tampa Seattle Seattle Seattle Seattle Boston
[10] Boston Boston Boston
Levels: Tampa Seattle Boston
R - Data Frames
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.
The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain the same number of data items.
Create Data Frame
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
When we execute the above code, it produces the following result -
emp_id emp_name salary start_date
1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27
Get the Structure of the Data Frame
The structure of the data frame can be seen by using the str() function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)
When we execute the above code, it produces the following result -
'data.frame': 5 obs. of 4 variables:
$ emp_id : int 1 2 3 4 5
$ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
$ salary : num 623 515 611 729 843
$ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying summary() function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))
When we execute the above code, it produces the following result -
     emp_id  emp_name            salary          start_date
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27
Extract Data from Data Frame
Extract a specific column from a data frame using the column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
When we execute the above code, it produces the following result -
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract the first two rows and then all columns
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract first two rows.
result <- emp.data[1:2,]
print(result)
When we execute the above code, it produces the following result -
emp_id emp_name salary start_date
1 1 Rick 623.3 2012-01-01
2 2 Dan 515.2 2013-09-23
Extract 3rd and 5th row with 2nd and 4th column
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
When we execute the above code, it produces the following result -
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
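Rows can also be selected with a logical condition on a column, a common pattern not shown above; a sketch on the same data frame:

```r
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   stringsAsFactors = FALSE
)
# Keep only the rows where salary exceeds 620.
result <- emp.data[emp.data$salary > 620, ]
print(result)
```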
Expand Data Frame
A data frame can be expanded by adding columns and rows.
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Add the "dept" coulmn.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
When we execute the above code, it produces the following result -
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame to create the final data frame.
# Create the first data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)
# Create the second data frame
emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Fianance"),
stringsAsFactors = FALSE
)
# Bind the two data frames.
emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)
When we execute the above code, it produces the following result -
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Rasmi 578.00 2013-05-21 IT
7 7 Pranab 722.50 2013-07-30 Operations
8 8 Tusar 632.80 2014-06-17 Fianance
R - Packages
R packages are a collection of R functions, compiled code and sample data.
They are stored under a directory called "library" in the R environment.
By default, R installs a set of packages during installation.
More packages are added later, when they are needed for some specific purpose.
When we start the R console, only the default packages are available.
Other packages which are already installed have to be loaded explicitly before they can be used.
All the packages available in R language are listed at R Packages.
Below is a list of commands to be used to check, verify and use the R packages.
Check Available R Packages
Get library locations containing R packages
.libPaths()
When we execute the above code, it produces the following result.
It may vary depending on the local settings of your PC.
[2] "C:/Program Files/R/R-3.2.2/library"
Get the list of all the packages installed
library()
When we execute the above code, it produces the following result.
It may vary depending on the local settings of your PC.
Packages in library ‘C:/Program Files/R/R-3.2.2/library’:
base The R Base Package
boot Bootstrap Functions (Originally by Angelo Canty
for S)
class Functions for Classification
cluster "Finding Groups in Data": Cluster Analysis
Extended Rousseeuw et al.
codetools Code Analysis Tools for R
compiler The R Compiler Package
datasets The R Datasets Package
foreign Read Data Stored by 'Minitab', 'S', 'SAS',
'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
graphics The R Graphics Package
grDevices The R Graphics Devices and Support for Colours
and Fonts
grid The Grid Graphics Package
KernSmooth Functions for Kernel Smoothing Supporting Wand
& Jones (1995)
lattice Trellis Graphics for R
MASS Support Functions and Datasets for Venables and
Ripley's MASS
Matrix Sparse and Dense Matrix Classes and Methods
methods Formal Methods and Classes
mgcv Mixed GAM Computation Vehicle with GCV/AIC/REML
Smoothness Estimation
nlme Linear and Nonlinear Mixed Effects Models
nnet Feed-Forward Neural Networks and Multinomial
Log-Linear Models
parallel Support for Parallel computation in R
rpart Recursive Partitioning and Regression Trees
spatial Functions for Kriging and Point Pattern
Analysis
splines Regression Spline Functions and Classes
stats The R Stats Package
stats4 Statistical Functions using S4 Classes
survival Survival Analysis
tcltk Tcl/Tk Interface
tools Tools for Package Development
utils The R Utils Package
Get all packages currently loaded in the R environment
search()
When we execute the above code, it produces the following result.
It may vary depending on the local settings of your PC.
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
Install a New Package
There are two ways to add new R packages.
One is installing directly from the CRAN repository; the other is downloading the package to your local system and installing it manually.
Install directly from CRAN
The following command fetches the package directly from the CRAN repository and installs it in the R environment.
You may be prompted to choose a mirror.
Choose the one appropriate to your location.
install.packages("Package Name")
# Install the package named "XML".
install.packages("XML")
Install package manually
Go to the link R Packages to download the package needed.
Save the package as a .zip file in a suitable location in the local system.
Now you can run the following command to install this package in the R environment.
install.packages(file_name_with_path, repos = NULL, type = "source")
# Install the package named "XML"
install.packages("E:/XML_3.98-1.3.zip", repos = NULL, type = "source")
Load Package to Library
Before a package can be used in the code, it must be loaded to the current R environment.
You also need to load a package that was installed previously but is not available in the current environment.
A package is loaded using the following command -
library("package Name", lib.loc = "path to library")
# Load the package named "XML"
library("XML")
R - Data Reshaping
Data Reshaping in R is about changing the way data is organized into rows and columns.
Most of the time data processing in R is done by taking the input data as a data frame.
It is easy to extract data from the rows and columns of a data frame but there are situations when we need the data frame in a format that is different from format in which we received it.
R has many functions to split, merge and change the rows to columns and vice-versa in a data frame.
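A minimal illustration of changing rows to columns is base R's t() function, which transposes a matrix:

```r
# The simplest reshape: t() swaps the rows and columns of a matrix.
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
print(dim(m))                # 2 3
print(dim(t(m)))             # 3 2
```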
Joining Columns and Rows in a Data Frame
We can join multiple vectors to create a data frame using the cbind() function.
We can also append the rows of one data frame to another using the rbind() function.
# Create vector objects.
city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
zipcode <- c(33602,98104,06161,80294)   # note: numeric literals drop the leading zero, so 06161 becomes 6161
# Combine above three vectors into one data frame.
addresses <- cbind(city,state,zipcode)
# Print a header.
cat("# # # # The First data frame\n")
# Print the data frame.
print(addresses)
# Create another data frame with similar columns
new.address <- data.frame(
city = c("Lowry","Charlotte"),
state = c("CO","FL"),
zipcode = c("80230","33949"),
stringsAsFactors = FALSE
)
# Print a header.
cat("# # # The Second data frame\n")
# Print the data frame.
print(new.address)
# Combine rows from both the data frames.
all.addresses <- rbind(addresses,new.address)
# Print a header.
cat("# # # The combined data frame\n")
# Print the result.
print(all.addresses)
When we execute the above code, it produces the following result -
# # # # The First data frame
city state zipcode
[1,] "Tampa" "FL" "33602"
[2,] "Seattle" "WA" "98104"
[3,] "Hartford" "CT" "6161"
[4,] "Denver" "CO" "80294"
# # # The Second data frame
city state zipcode
1 Lowry CO 80230
2 Charlotte FL 33949
# # # The combined data frame
city state zipcode
1 Tampa FL 33602
2 Seattle WA 98104
3 Hartford CT 6161
4 Denver CO 80294
5 Lowry CO 80230
6 Charlotte FL 33949
Merging Data Frames
We can merge two data frames by using the merge() function.
The data frames must have the same column names on which the merging happens.
In the example below, we consider the data sets about diabetes in Pima Indian women available in the library named "MASS".
We merge the two data sets based on the values of blood pressure ("bp") and body mass index ("bmi").
On choosing these two columns for merging, the records where the values of these two variables match in both data sets are combined together into a single data frame.
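Before turning to the Pima data, here is a minimal, self-contained sketch of merge() using two made-up data frames (df1 and df2 are hypothetical, not part of the MASS data):

```r
# Two made-up data frames sharing an "id" column.
df1 <- data.frame(id = c(1, 2, 3), score = c(90, 85, 70))
df2 <- data.frame(id = c(2, 3, 4), grade = c("B", "C", "F"))
# Only rows whose "id" value appears in both data frames are kept.
merged <- merge(x = df1, y = df2, by = "id")
print(merged)
print(nrow(merged))  # 2 rows remain
```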
library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
by.x = c("bp", "bmi"),
by.y = c("bp", "bmi")
)
print(merged.Pima)
nrow(merged.Pima)
When we execute the above code, it produces the following result -
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
age.y type.y
1 31 No
2 21 No
3 24 No
4 21 No
5 21 No
6 43 Yes
7 36 Yes
8 40 No
9 29 Yes
10 28 No
11 55 No
12 39 No
13 39 No
14 49 Yes
15 49 Yes
16 38 No
17 28 No
[1] 17
Melting and Casting
One of the most interesting aspects of R programming is changing the shape of the data in multiple steps to get a desired form.
The functions used to do this, melt() and cast(), come from the reshape package.
We "melt" the data so that each row is a unique id-variable combination.
Then we "cast" the melted data into any shape we would like.
mydata
id time x1 x2
1 1 5 6
1 2 3 5
2 1 6 1
2 2 2 4
library(reshape)
melteddata <- melt(mydata, id=c("id","time"))
melteddata
id time variable value
1 1 x1 5
1 2 x1 3
2 1 x1 6
2 2 x1 2
1 1 x2 6
1 2 x2 5
2 1 x2 1
2 2 x2 4
# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(melteddata, id~variable, mean)
timemeans <- cast(melteddata, time~variable, mean)
subjmeans
id x1 x2
1 4 5.5
2 4 2.5
timemeans
time x1 x2
1 5.5 3.5
2 2.5 4.5
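The same subject and time means can be reproduced with base R's aggregate() function, assuming mydata is first built as a data frame with the values shown above:

```r
# Rebuild the toy data set used above.
mydata <- data.frame(id = c(1, 1, 2, 2), time = c(1, 2, 1, 2),
                     x1 = c(5, 3, 6, 2), x2 = c(6, 5, 1, 4))
# Subject means (one row per id), matching the subjmeans table.
subjmeans <- aggregate(cbind(x1, x2) ~ id, data = mydata, FUN = mean)
print(subjmeans)
# Time means (one row per time), matching the timemeans table.
timemeans <- aggregate(cbind(x1, x2) ~ time, data = mydata, FUN = mean)
print(timemeans)
```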
Another example:
We consider the dataset called ships present in the library called "MASS".
library(MASS)
print(ships)
When we execute the above code, it produces the following result -
type year period service incidents
1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
.............
Melt the Data
Now we melt the data to organize it, converting all columns other than type and year into multiple rows.
library(reshape)
molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)
When we execute the above code, it produces the following result with a different structure -
type year variable value
1 A 60 period 60
2 A 60 period 75
............
41 A 60 service 127
...........
101 C 70 incidents 6
102 C 70 incidents 2
...........
Cast the Molten Data
We can cast the molten data into a new form where the aggregate of each type of ship for each year is created.
It is done using the cast() function.
recasted.ship <- cast(molten.ships, type+year~variable,sum)
print(recasted.ship)
When we execute the above code, it produces the following result -
type year period service incidents
1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68
6 B 65 135 48979 111
7 B 70 135 20163 56
8 B 75 135 7117 18
9 C 60 135 1731 2
10 C 65 135 1457 1
11 C 70 135 2731 8
12 C 75 135 274 1
13 D 60 135 356 0
14 D 65 135 480 0
15 D 70 135 1557 13
16 D 75 135 2051 4
17 E 60 135 45 0
18 E 65 135 1226 14
19 E 70 135 3318 17
20 E 75 135 542 1
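As a cross-check that needs no reshaping package, base R's aggregate() computes the same per-type, per-year sums directly from the ships data:

```r
# Base-R check of the cast: sum period, service and incidents per type and year.
library(MASS)   # provides the 'ships' data set
totals <- aggregate(cbind(period, service, incidents) ~ type + year,
                    data = ships, FUN = sum)
# e.g. type "A", year 60: period 135, service 190, incidents 0,
# matching the first row of the recast table above.
print(subset(totals, type == "A" & year == 60))
```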
R - CSV Files
In R, we can read data from files stored outside the R environment.
We can also write data into files which will be stored and accessed by the operating system.
R can read and write into various file formats like CSV, Excel, XML etc.
In this chapter we will learn to read data from a CSV file and then write data into a CSV file.
The file should be present in the current working directory so that R can read it.
Of course we can also set our own directory and read files from there.
Getting and Setting the Working Directory
You can check which directory the R workspace is pointing to using the getwd() function.
You can also set a new working directory using the setwd() function.
# Get and print current working directory.
print(getwd())
# Set current working directory.
setwd("/web/com")
# Get and print current working directory.
print(getwd())
When we execute the above code, it produces the following result -
[1] "/web/com/1441086124_2016"
[1] "/web/com"
This result depends on your OS and your current directory where you are working.
Input as CSV File
A CSV file is a text file in which the values in the columns are separated by commas.
Let's consider the following data present in the file named input.csv.
You can create this file using Windows Notepad by copying and pasting this data.
Save the file as input.csv using the Save As option with the "All Files (*.*)" file type in Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
Reading a CSV File
Following is a simple example of the read.csv() function, used to read a CSV file available in your current working directory -
data <- read.csv("input.csv")
print(data)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
Analyzing the CSV File
By default the read.csv() function gives the output as a data frame.
This can be easily checked as follows.
Also we can check the number of columns and rows.
data <- read.csv("input.csv")
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
When we execute the above code, it produces the following result -
[1] TRUE
[1] 5
[1] 8
Once we read data into a data frame, we can apply all the functions applicable to data frames, as explained in the subsequent sections.
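For a quick overview of any data frame, base R also offers str() and summary(); the sketch below uses a small made-up frame rather than the full input.csv contents:

```r
# A small made-up data frame standing in for the CSV contents.
emp <- data.frame(id = 1:3, salary = c(623.3, 515.2, 611))
str(emp)                     # structure: column names, types and first values
print(summary(emp$salary))   # min, quartiles, mean, max
```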
Get the maximum salary
# Create a data frame.
data <- read.csv("input.csv")
# Get the max salary from data frame.
sal <- max(data$salary)
print(sal)
When we execute the above code, it produces the following result -
[1] 843.25
Get the details of the person with max salary
We can fetch rows meeting specific filter criteria similar to a SQL where clause.
# Create a data frame.
data <- read.csv("input.csv")
# Get the max salary from data frame.
sal <- max(data$salary)
# Get the person detail having max salary.
retval <- subset(data, salary == max(salary))
print(retval)
When we execute the above code, it produces the following result -
id name salary start_date dept
5 5 Gary 843.25 2015-03-27 Finance
Get all the people working in IT department
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset( data, dept == "IT")
print(retval)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
6 6 Nina 578.0 2013-05-21 IT
Get the persons in IT department whose salary is greater than 600
# Create a data frame.
data <- read.csv("input.csv")
info <- subset(data, salary > 600 & dept == "IT")
print(info)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
Get the people who joined after the start of 2014
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
print(retval)
When we execute the above code, it produces the following result -
id name salary start_date dept
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
8 8 Guru 722.50 2014-06-17 Finance
Writing into a CSV File
R can create a CSV file from an existing data frame.
The write.csv() function is used to create the csv file.
This file gets created in the working directory.
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
# Write filtered data into a new file.
write.csv(retval,"output.csv")
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result -
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 5 Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
Here the column X contains the row names that write.csv() adds by default.
This column can be dropped by passing an additional parameter while writing the file.
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
# Write filtered data into a new file.
write.csv(retval,"output.csv", row.names = FALSE)
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 3 Michelle 611.00 2014-11-15 IT
2 4 Ryan 729.00 2014-05-11 HR
3 5 Gary 843.25 2015-03-27 Finance
4 8 Guru 722.50 2014-06-17 Finance
R - Excel File
Microsoft Excel is the most widely used spreadsheet program, which stores data in the .xls or .xlsx format.
R can read directly from these files using some Excel-specific packages.
A few such packages are XLConnect, xlsx, gdata etc.
We will be using the xlsx package.
R can also write into an Excel file using this package.
Install xlsx Package
You can use the following command in the R console to install the "xlsx" package.
It may ask you to install some additional packages on which this package depends.
Run the same command with the required package names to install the additional packages.
install.packages("xlsx")
Verify and Load the "xlsx" Package
Use the following command to verify and load the "xlsx" package.
# Verify the package is installed.
any(grepl("xlsx",installed.packages()))
# Load the library into R workspace.
library("xlsx")
When the script is run we get the following output.
[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars
Input as xlsx File
Open Microsoft excel.
Copy and paste the following data in the work sheet named as sheet1.
id name salary start_date dept
1 Rick 623.3 1/1/2012 IT
2 Dan 515.2 9/23/2013 Operations
3 Michelle 611 11/15/2014 IT
4 Ryan 729 5/11/2014 HR
5 Gary 843.25 3/27/2015 Finance
6 Nina 578 5/21/2013 IT
7 Simon 632.8 7/30/2013 Operations
8 Guru 722.5 6/17/2014 Finance
Also copy and paste the following data to another worksheet and rename this worksheet to "city".
name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas
Save the Excel file as "input.xlsx".
You should save it in the current working directory of the R workspace.
Reading the Excel File
The input.xlsx is read by using the read.xlsx() function as shown below.
The result is stored as a data frame in the R environment.
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
R - Binary Files
A binary file is a file that contains information stored only in the form of bits and bytes (0's and 1's).
Binary files are not human readable, because the bytes in them translate to characters and symbols that include many non-printable characters.
Attempting to open a binary file in a text editor will show characters like Ø and ð.
A binary file has to be read by specific programs to be usable.
For example, the binary file of a Microsoft Word document can be read into human-readable form only by the Word program.
This indicates that, besides the human-readable text, a lot more information, such as character formatting and page numbers, is stored along with the alphanumeric characters.
Finally, a binary file is a continuous sequence of bytes; the line break we see in a text file is just a character joining the first line to the next.
Sometimes, the data generated by other programs needs to be processed by R as a binary file.
R may also be required to create binary files which can be shared with other programs.
R has two functions, writeBin() and readBin(), to create and read binary files.
Syntax
writeBin(object, con)
readBin(con, what, n)
Following is the description of the parameters used -
con is the connection object used to read or write the binary file.
object is the R object to be written to the file.
what is the mode, like character, integer etc., representing the type of data to be read.
n is the (maximal) number of elements to read from the binary file.
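A minimal round trip through a temporary file illustrates these parameters using only base R:

```r
# Write a small integer vector to a temporary binary file and read it back.
vals <- c(6L, 6L, 4L, 6L, 8L)
tmp <- tempfile(fileext = ".dat")
# Open the connection in binary write mode ("wb") and write the integers.
con <- file(tmp, "wb")
writeBin(vals, con)
close(con)
# Re-open in binary read mode ("rb"); n is the number of elements to read back.
con <- file(tmp, "rb")
back <- readBin(con, what = integer(), n = length(vals))
close(con)
print(identical(back, vals))  # TRUE
```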
Example
We consider the R inbuilt data "mtcars".
First we create a CSV file from it, convert it to a binary file and store it as an OS file.
Next we read this binary file back into R.
Writing the Binary File
We write the data frame "mtcars" to a CSV file, read part of it back, and then write it as a binary file to the OS.
# Read the "mtcars" data frame as a csv file and store only the columns
"cyl", "am" and "gear".
write.table(mtcars, file = "mtcars.csv",row.names = FALSE, na = "",
col.names = TRUE, sep = ",")
# Store 5 records from the csv file as a new data frame.
new.mtcars <- read.table("mtcars.csv",sep = ",",header = TRUE,nrows = 5)
# Create a connection object to write the binary file using mode "wb".
write.filename = file("/web/com/binmtcars.dat", "wb")
# Write the column names of the data frame to the connection object.
writeBin(colnames(new.mtcars), write.filename)
# Write the records in each of the column to the file.
writeBin(c(new.mtcars$cyl,new.mtcars$am,new.mtcars$gear), write.filename)
# Close the file for writing so that it can be read by other program.
close(write.filename)
Reading the Binary File
The binary file created above stores all the data as continuous bytes.
So we will read it back by choosing appropriate values of n for the column names as well as the column values.
# Create a connection object to read the file in binary mode using "rb".
read.filename <- file("/web/com/binmtcars.dat", "rb")
# First read the column names. n = 3 as we have 3 columns.
column.names <- readBin(read.filename, character(), n = 3)
# Next read the column values. n = 18 as the 3 column-name strings occupy
# the first 3 integer-sized blocks, followed by the 15 values.
read.filename <- file("/web/com/binmtcars.dat", "rb")
bindata <- readBin(read.filename, integer(), n = 18)
# Print the data.
print(bindata)
# Extract the values at positions 4 to 8, which represent "cyl".
cyldata = bindata[4:8]
print(cyldata)
# Extract the values at positions 9 to 13, which represent "am".
amdata = bindata[9:13]
print(amdata)
# Extract the values at positions 14 to 18, which represent "gear".
geardata = bindata[14:18]
print(geardata)
# Combine all the read values into a data frame.
finaldata = cbind(cyldata, amdata, geardata)
colnames(finaldata) = column.names
print(finaldata)
When we execute the above code, it produces the following result -
[1] 7108963 1728081249 7496037 6 6 4
[7] 6 8 1 1 1 0
[13] 0 4 4 4 3 3
[1] 6 6 4 6 8
[1] 1 1 1 0 0
[1] 4 4 4 3 3
cyl am gear
[1,] 6 1 4
[2,] 6 1 4
[3,] 4 1 4
[4,] 6 0 3
[5,] 8 0 3
As we can see, we got the original data back by reading the binary file in R.
R - XML Files
XML stands for Extensible Markup Language.
It is a file format which shares both the file format and the data on the World Wide Web, intranets, and elsewhere using standard ASCII text.
Similar to HTML, it contains markup tags.
But unlike HTML, where the markup tags describe the structure of the page, in XML the markup tags describe the meaning of the data contained in the file.
You can read an XML file in R using the "XML" package.
This package can be installed using the following command.
install.packages("XML")
Input Data
Create an XML file by copying the below data into a text editor like Notepad.
Save the file with a .xml extension, choosing the file type as All Files (*.*).
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
Reading XML File
The XML file is read by R using the function xmlParse().
The parsed result is stored as an XML document object in R.
# Load the package required to read XML files.
library("XML")
# Also load the other required package.
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Print the result.
print(result)
When we execute the above code, it produces the following result -
<?xml version="1.0"?>
<RECORDS>
  <EMPLOYEE>
    <ID>1</ID>
    <NAME>Rick</NAME>
    <SALARY>623.3</SALARY>
    <STARTDATE>1/1/2012</STARTDATE>
    <DEPT>IT</DEPT>
  </EMPLOYEE>
  ... (the remaining seven EMPLOYEE records follow, mirroring the contents of input.xml shown above) ...
</RECORDS>
Get Number of Nodes Present in XML File
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Find number of nodes in the root.
rootsize <- xmlSize(rootnode)
# Print the result.
print(rootsize)
When we execute the above code, it produces the following result -
[1] 8
Details of the First Node
Let's look at the first record of the parsed file.
It will give us an idea of the various elements present in the top level node.
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Print the result.
print(rootnode[1])
When we execute the above code, it produces the following result -
$EMPLOYEE
<EMPLOYEE>
  <ID>1</ID>
  <NAME>Rick</NAME>
  <SALARY>623.3</SALARY>
  <STARTDATE>1/1/2012</STARTDATE>
  <DEPT>IT</DEPT>
</EMPLOYEE>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
Get Different Elements of a Node
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Get the first element of the first node.
print(rootnode[[1]][[1]])
# Get the fifth element of the first node.
print(rootnode[[1]][[5]])
# Get the second element of the third node.
print(rootnode[[3]][[2]])
When we execute the above code, it produces the following result -
<ID>1</ID>
<DEPT>IT</DEPT>
<NAME>Michelle</NAME>
XML to Data Frame
To handle the data effectively in large files, we read the data in the XML file as a data frame.
Then we process the data frame for data analysis.
# Load the packages required to read XML files.
library("XML")
library("methods")
# Convert the input xml file to a data frame.
xmldataframe <- xmlToDataFrame("input.xml")
print(xmldataframe)
When we execute the above code, it produces the following result -
ID NAME SALARY STARTDATE DEPT
1 1 Rick 623.3 1/1/2012 IT
2 2 Dan 515.2 9/23/2013 Operations
3 3 Michelle 611 11/15/2014 IT
4 4 Ryan 729 5/11/2014 HR
5 5 Gary 843.25 3/27/2015 Finance
6 6 Nina 578 5/21/2013 IT
7 7 Simon 632.8 7/30/2013 Operations
8 8 Guru 722.5 6/17/2014 Finance
As the data is now available as a data frame, we can use the data frame related functions to read and manipulate it.
R - JSON Files
A JSON file stores data as text in human-readable format.
JSON stands for JavaScript Object Notation.
R can read JSON files using the rjson package.
Install rjson Package
In the R console, you can issue the following command to install the rjson package.
install.packages("rjson")
Input Data
Create a JSON file by copying the below data into a text editor like Notepad.
Save the file with a .json extension, choosing the file type as All Files (*.*).
{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
Read the JSON File
The JSON file is read by R using the function fromJSON().
It is stored as a list in R.
# Load the package required to read JSON files.
library("rjson")
# Give the input file name to the function.
result <- fromJSON(file = "input.json")
# Print the result.
print(result)
When we execute the above code, it produces the following result -
$ID
[1] "1" "2" "3" "4" "5" "6" "7" "8"
$Name
[1] "Rick" "Dan" "Michelle" "Ryan" "Gary" "Nina" "Simon" "Guru"
$Salary
[1] "623.3" "515.2" "611" "729" "843.25" "578" "632.8" "722.5"
$StartDate
[1] "1/1/2012" "9/23/2013" "11/15/2014" "5/11/2014" "3/27/2015" "5/21/2013"
"7/30/2013" "6/17/2014"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance" "IT"
"Operations" "Finance"
Convert JSON to a Data Frame
We can convert the extracted data above to an R data frame for further analysis using the as.data.frame() function.
# Load the package required to read JSON files.
library("rjson")
# Give the input file name to the function.
result <- fromJSON(file = "input.json")
# Convert JSON file to a data frame.
json_data_frame <- as.data.frame(result)
print(json_data_frame)
When we execute the above code, it produces the following result -
ID Name Salary StartDate Dept
1 1 Rick 623.3 1/1/2012 IT
2 2 Dan 515.2 9/23/2013 Operations
3 3 Michelle 611 11/15/2014 IT
4 4 Ryan 729 5/11/2014 HR
5 5 Gary 843.25 3/27/2015 Finance
6 6 Nina 578 5/21/2013 IT
7 7 Simon 632.8 7/30/2013 Operations
8 8 Guru 722.5 6/17/2014 Finance
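The conversion step itself needs only base R; as a sketch, as.data.frame() turns any named list of equal-length vectors into a data frame (the list below is a made-up subset mimicking fromJSON()'s return value):

```r
# A made-up list with the same shape as fromJSON()'s output.
result <- list(ID = c("1", "2"), Name = c("Rick", "Dan"))
# Each list element becomes a column of the data frame.
df <- as.data.frame(result)
print(df)
print(ncol(df))  # 2
```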
R - Web Data
Many websites provide data for consumption by their users.
For example, the World Health Organization (WHO) provides reports on health and medical information in the form of CSV, TXT and XML files.
Using R programs, we can programmatically extract specific data from such websites.
Some packages in R which are used to scrape data from the web are "RCurl", "XML", and "stringr".
They are used to connect to the URLs, identify the required links for the files and download them to the local environment.
Install R Packages
The following packages are required for processing the URLs and the links to the files.
If they are not available in your R environment, you can install them using the following commands.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
Input Data
We will visit the URL weather data and use R to download the CSV files for the year 2015.
Example
We will use the function getHTMLLinks() to gather the URLs of the files.
Then we will use the function download.file() to save the files to the local system.
As we will be applying the same code again and again for multiple files, we will create a function to be called multiple times.
The filenames are passed as parameters in the form of an R list object to this function.
# Load the required packages.
library(RCurl)
library(XML)
library(stringr)
library(plyr)
# Read the URL.
url <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"
# Gather the html links present in the webpage.
links <- getHTMLLinks(url)
# Identify only the links which point to the JCMB 2015 files.
filenames <- links[str_detect(links, "JCMB_2015")]
# Store the file names as a list.
filenames_list <- as.list(filenames)
# Create a function to download the files by passing the URL and filename list.
downloadcsv <- function (mainurl,filename) {
filedetails <- str_c(mainurl,filename)
download.file(filedetails,filename)
}
# Now apply the l_ply function and save the files into the current R working directory.
l_ply(filenames,downloadcsv,mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
Verify the File Download
After running the above code, you can locate the following files in the current R working directory.
"JCMB_2015.csv" "JCMB_2015_Apr.csv" "JCMB_2015_Feb.csv" "JCMB_2015_Jan.csv"
"JCMB_2015_Mar.csv"
R - Databases
The data in relational database systems is stored in a normalized format.
So, to carry out statistical computing, we would need very advanced and complex SQL queries.
But R can connect easily to many relational databases like MySQL, Oracle, SQL Server etc. and fetch records from them as a data frame.
Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using all the powerful packages and functions.
In this tutorial we will be using MySQL as our reference database for connecting to R.
RMySQL Package
R has a package named "RMySQL" which provides native connectivity between R and MySQL databases.
You can install this package in the R environment using the following command.
install.packages("RMySQL")
Connecting R to MySQL
Once the package is installed we create a connection object in R to connect to the database.
It takes the username, password, database name and host name as input.
# Load the RMySQL package.
library(RMySQL)
# Create a connection object to the MySQL database.
# We will connect to the sample database named "sakila" that comes with the MySQL installation.
mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila',
host = 'localhost')
# List the tables available in this database.
dbListTables(mysqlconnection)
When we execute the above code, it produces the following result -
[1] "actor" "actor_info"
[3] "address" "category"
[5] "city" "country"
[7] "customer" "customer_list"
[9] "film" "film_actor"
[11] "film_category" "film_list"
[13] "film_text" "inventory"
[15] "language" "nicer_but_slower_film_list"
[17] "payment" "rental"
[19] "sales_by_film_category" "sales_by_store"
[21] "staff" "staff_list"
[23] "store"
Querying the Tables
We can query the database tables in MySQL using the function dbSendQuery().
The query gets executed in MySQL and the result set is returned using the R fetch() function.
Finally it is stored as a data frame in R.
# Query the "actor" tables to get all the rows.
result = dbSendQuery(mysqlconnection, "select * from actor")
# Store the result in an R data frame object. n = 5 is used to fetch the first 5 rows.
data.frame = fetch(result, n = 5)
print(data.frame)
When we execute the above code, it produces the following result -
actor_id first_name last_name last_update
1 1 PENELOPE GUINESS 2006-02-15 04:34:33
2 2 NICK WAHLBERG 2006-02-15 04:34:33
3 3 ED CHASE 2006-02-15 04:34:33
4 4 JENNIFER DAVIS 2006-02-15 04:34:33
5 5 JOHNNY LOLLOBRIGIDA 2006-02-15 04:34:33
Query with Filter Clause
We can pass any valid select query to get the result.
result = dbSendQuery(mysqlconnection, "select * from actor where last_name = 'TORN'")
# Fetch all the records (with n = -1) and store them as a data frame.
data.frame = fetch(result, n = -1)
print(data.frame)
When we execute the above code, it produces the following result -
actor_id first_name last_name last_update
1 18 DAN TORN 2006-02-15 04:34:33
2 94 KENNETH TORN 2006-02-15 04:34:33
3 102 WALTER TORN 2006-02-15 04:34:33
Updating Rows in the Tables
We can update the rows in a MySQL table by passing the update query to the dbSendQuery() function.
dbSendQuery(mysqlconnection, "update mtcars set disp = 168.5 where hp = 110")
After executing the above code we can see the table updated in the MySQL environment.
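To confirm the change from within R, we can read the affected rows back with dbGetQuery(), which sends a query and fetches the complete result in one step (a running MySQL server with this table is assumed):

```r
# Read back the rows affected by the update.
updated <- dbGetQuery(mysqlconnection, "select disp, hp from mtcars where hp = 110")
print(updated)
# Every disp value in the result should now be 168.5.
```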
Inserting Data into the Tables
We can insert new rows into a MySQL table by passing an insert query to the dbSendQuery() function in the same way.
dbSendQuery(mysqlconnection,
"insert into mtcars(row_names, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb)
values('New Mazda RX4 Wag', 21, 6, 168.5, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4)"
)
After executing the above code we can see the row inserted into the table in the MySQL environment.
Creating Tables in MySQL
We can create tables in MySQL using the function dbWriteTable().
It takes a data frame as input and overwrites the table if it already exists.
# Create the connection object to the database where we want to create the table.
mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila',
host = 'localhost')
# Use the R data frame "mtcars" to create the table in MySQL.
# All the rows of mtcars are taken into MySQL.
dbWriteTable(mysqlconnection, "mtcars", mtcars[, ], overwrite = TRUE)
After executing the above code we can see the table created in the MySQL environment.
Dropping Tables in MySQL
We can drop tables in the MySQL database by passing a drop table statement to dbSendQuery(), in the same way we used it for querying data from tables.
dbSendQuery(mysqlconnection, 'drop table if exists mtcars')
After executing the above code we can see that the table is dropped in the MySQL environment.
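When we are done working with the database, it is good practice to release any pending result sets and close the connection. A minimal sketch using the standard DBI functions:

```r
# Free the result set held by an earlier dbSendQuery() call.
dbClearResult(result)
# Close the connection to the MySQL server.
dbDisconnect(mysqlconnection)
```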
R - Pie Charts
R Programming language has numerous libraries to create charts and graphs.
A pie-chart is a representation of values as slices of a circle with different colors.
The slices are labeled, and the number corresponding to each slice is also represented in the chart.
In R a pie chart is created using the pie() function, which takes positive numbers as a vector input.
The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using R is -
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used -
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart (a value between -1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anti-clockwise.
Example
A very simple pie-chart is created using just the input vector and labels.
The below script will create and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city.png")
# Plot the chart.
pie(x,labels)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Pie Chart Title and Colors
We can expand the features of the chart by adding more parameters to the function.
We will use the parameter main to add a title to the chart, and the parameter col to make use of a rainbow color palette while drawing the chart.
The length of the palette should be the same as the number of values we have for the chart, hence we use length(x).
Example
The below script will create and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city_title_colours.jpg")
# Plot the chart with a title and a rainbow color palette.
pie(x, labels, main = "City pie chart", col = rainbow(length(x)))
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Slice Percentages and Chart Legend
We can add slice percentage and a chart legend by creating additional chart variables.
# Create data for the graph.
x <- c(21, 62, 10,53)
labels <- c("London","New York","Singapore","Mumbai")
piepercent<- round(100*x/sum(x), 1)
# Give the chart file a name.
png(file = "city_percentage_legends.jpg")
# Plot the chart.
pie(x, labels = piepercent, main = "City pie chart",col = rainbow(length(x)))
legend("topright", c("London","New York","Singapore","Mumbai"), cex = 0.8,
fill = rainbow(length(x)))
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages.
The package plotrix has a function called pie3D() that is used for this.
# Get the library.
library(plotrix)
# Create data for the graph.
x <- c(21, 62, 10,53)
lbl <- c("London","New York","Singapore","Mumbai")
# Give the chart file a name.
png(file = "3d_pie_chart.jpg")
# Plot the chart.
pie3D(x, labels = lbl, explode = 0.1, main = "Pie Chart of Cities")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Bar Charts
A bar chart represents data in rectangular bars, with the length of each bar proportional to the value of the variable.
R uses the function barplot() to create bar charts.
R can draw both vertical and horizontal bars in the bar chart.
In a bar chart each of the bars can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is -
barplot(H,xlab,ylab,main, names.arg,col)
Following is the description of the parameters used -
H is a vector or matrix containing numeric values used in bar chart.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the bar chart.
names.arg is a vector of names appearing under each bar.
col is used to give colors to the bars in the graph.
Example
A simple bar chart is created using just the input vector and the name of each bar.
The below script will create and save the bar chart in the current R working directory.
# Create the data for the chart
H <- c(7,12,28,3,41)
# Give the chart file a name
png(file = "barchart.png")
# Plot the bar chart
barplot(H)
# Save the file
dev.off()
When we execute the above code, it produces the following result -
Bar Chart Labels, Title and Colors
The features of the bar chart can be expanded by adding more parameters.
The main parameter is used to add a title.
The col parameter is used to add colors to the bars.
The names.arg parameter is a vector with the same number of values as the input vector, giving the name of each bar.
Example
The below script will create and save the bar chart in the current R working directory.
# Create the data for the chart
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
# Give the chart file a name
png(file = "barchart_months_revenue.png")
# Plot the bar chart
barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
main="Revenue chart",border="red")
# Save the file
dev.off()
When we execute the above code, it produces the following result -
Group Bar Chart and Stacked Bar Chart
We can create a bar chart with groups of bars, and with stacks in each bar, by using a matrix as the input values.
More than two variables are represented as a matrix, which is used to create the grouped bar chart and the stacked bar chart.
# Create the input vectors.
colors = c("green","orange","brown")
months <- c("Mar","Apr","May","Jun","Jul")
regions <- c("East","West","North")
# Create the matrix of the values.
Values <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11), nrow = 3, ncol = 5, byrow = TRUE)
# Give the chart file a name
png(file = "barchart_stacked.png")
# Create the bar chart
barplot(Values, main = "total revenue", names.arg = months, xlab = "month", ylab = "revenue", col = colors)
# Add the legend to the chart
legend("topleft", regions, cex = 1.3, fill = colors)
# Save the file
dev.off()
R - Boxplots
Boxplots are a measure of how well the data in a data set is distributed.
A boxplot divides the data set into quartiles.
This graph represents the minimum, maximum, median, first quartile and third quartile of the data set.
It is also useful in comparing the distribution of data across data sets by drawing boxplots for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is -
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used -
x is a vector or a formula.
data is the data frame.
notch is a logical value.
Set it to TRUE to draw a notch.
varwidth is a logical value.
Set it to TRUE to draw the width of each box proportional to the sample size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
Example
We use the data set "mtcars" available in the R environment to create a basic boxplot.
Let's look at the columns "mpg" and "cyl" in mtcars.
input <- mtcars[,c('mpg','cyl')]
print(head(input))
When we execute the above code, it produces the following result -
mpg cyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Datsun 710 22.8 4
Hornet 4 Drive 21.4 6
Hornet Sportabout 18.7 8
Valiant 18.1 6
Creating the Boxplot
The below script will create a boxplot graph for the relation between mpg (miles per gallon) and cyl (number of cylinders).
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Boxplot with Notch
We can draw a boxplot with a notch to find out how the medians of different data groups match with each other.
The below script will create a boxplot graph with a notch for each data group.
# Give the chart file a name.
png(file = "boxplot_with_notch.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars,
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon",
main = "Mileage Data",
notch = TRUE,
varwidth = TRUE,
col = c("green","yellow","purple"),
names = c("High","Medium","Low")
)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Histograms
A histogram represents the frequencies of values of a variable bucketed into ranges.
A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges.
The height of each bar in a histogram represents the number of values present in that range.
R creates histograms using the hist() function.
This function takes a vector as input and uses some more parameters to plot the histogram.
Syntax
The basic syntax for creating a histogram using R is -
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used -
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
col is used to set color of the bars.
border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to control the number of bars, either as a suggested number of cells or as a vector of break points.
Example
A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
# Give the chart file a name.
png(file = "histogram.png")
# Create the histogram.
hist(v,xlab = "Weight",col = "yellow",border = "blue")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Range of X and Y values
To specify the range of values allowed on the X axis and the Y axis, we can use the xlim and ylim parameters.
The width of the bars can be controlled by using breaks, which suggests the number of cells for the histogram.
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
# Give the chart file a name.
png(file = "histogram_lim_breaks.png")
# Create the histogram.
hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim = c(0,5),
breaks = 5)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Line Graphs
A line chart is a graph that connects a series of points by drawing line segments between them.
These points are ordered by the value of one of their coordinates (usually the x-coordinate).
Line charts are usually used for identifying trends in data.
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is -
plot(v,type,col,xlab,ylab)
Following is the description of the parameters used -
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the chart.
col is used to give colors to both the points and lines.
Example
A simple line chart is created using the input vector and the type parameter as "o".
The below script will create and save a line chart in the current R working directory.
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "line_chart.jpg")
# Plot the line chart.
plot(v,type = "o")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Line Chart Title, Color and Labels
The features of the line chart can be expanded by using additional parameters.
We add color to the points and lines, give a title to the chart and add labels to the axes.
Example
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "line_chart_label_colored.jpg")
# Plot the line chart.
plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Multiple Lines in a Line Chart
More than one line can be drawn on the same chart by using the lines() function.
After the first line is plotted, the lines() function can use an additional vector as input to draw the second line on the chart.
# Create the data for the chart.
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
# Give the chart file a name.
png(file = "line_chart_2_lines.jpg")
# Plot the line chart.
plot(v,type = "o",col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
lines(t, type = "o", col = "blue")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
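A legend makes it clear which line is which. The script below redraws the two-series chart and labels the lines with the legend() function; the series names used here are placeholders for illustration.

```r
# Create the data for the chart.
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
# Give the chart file a name.
png(file = "line_chart_legend.jpg")
# Plot the two lines.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
   main = "Rain fall chart")
lines(t, type = "o", col = "blue")
# Label the lines (placeholder series names).
legend("topright", legend = c("Series 1", "Series 2"),
   col = c("red", "blue"), lty = 1, pch = 1, cex = 0.8)
# Save the file.
dev.off()
```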
R - Scatterplots
Scatterplots show many points plotted in the Cartesian plane.
Each point represents the values of two variables.
One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is -
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used -
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot.
Let's use the columns "wt" and "mpg" in mtcars.
input <- mtcars[,c('wt','mpg')]
print(head(input))
When we execute the above code, it produces the following result -
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1
Creating the Scatterplot
The below script will create a scatterplot graph for the relation between wt(weight) and mpg(miles per gallon).
# Get the input values.
input <- mtcars[,c('wt','mpg')]
# Give the chart file a name.
png(file = "scatterplot.png")
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = input$wt, y = input$mpg,
   xlab = "Weight",
   ylab = "Mileage",
   xlim = c(2.5,5),
   ylim = c(15,30),
   main = "Weight vs Mileage"
)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Scatterplot Matrices
When we have more than two variables and we want to find the correlation between one variable and the remaining ones, we use a scatterplot matrix.
We use pairs() function to create matrices of scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is -
pairs(formula, data)
Following is the description of the parameters used -
formula represents the series of variables used in pairs.
data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables.
A scatterplot is plotted for each pair.
# Give the chart file a name.
png(file = "scatterplot_matrices.png")
# Plot the matrices between 4 variables, giving 12 scatterplots.
# Each of the 4 variables is plotted against the 3 others.
pairs(~wt+mpg+disp+cyl,data = mtcars,
main = "Scatterplot Matrix")
# Save the file.
dev.off()
When the above code is executed we get the following output.
R - Mean, Median and Mode
Statistical analysis in R is performed by using many in-built functions.
Most of these functions are part of the R base package.
These functions take an R vector as input, along with further arguments, and give the result.
The functions we are discussing in this chapter are mean, median and mode.
Mean
The mean is calculated by taking the sum of the values and dividing by the number of values in a data series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is -
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used -
x is the input vector.
trim is used to drop some observations from both ends of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
When we execute the above code, it produces the following result -
[1] 8.22
Applying Trim Option
When the trim parameter is supplied, the values in the vector are sorted and then the required number of observations is dropped from each end while calculating the mean.
When trim = 0.3, 3 values from each end will be dropped from the calculation to find the mean (0.3 times 10 values).
In this case the sorted vector is (-21, -5, 2, 3, 4.2, 7, 8, 12, 18, 54), and the values removed from the vector for calculating the mean are (-21, -5, 2) from the left and (12, 18, 54) from the right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)
When we execute the above code, it produces the following result -
[1] 5.55
Applying NA Option
If there are missing values, then the mean() function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA values.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean.
result.mean <- mean(x)
print(result.mean)
# Find mean dropping NA values.
result.mean <- mean(x,na.rm = TRUE)
print(result.mean)
When we execute the above code, it produces the following result -
[1] NA
[1] 8.22
Median
The middle-most value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is -
median(x, na.rm = FALSE)
Following is the description of the parameters used -
x is the input vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
median.result <- median(x)
print(median.result)
When we execute the above code, it produces the following result -
[1] 5.6
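When the vector contains an even number of values, there is no single middle value; median() then returns the average of the two middle values of the sorted vector:

```r
# Create a vector with an even number of values.
x <- c(12,7,3,4.2,18,2,54,-21)
# Sorted: -21, 2, 3, 4.2, 7, 12, 18, 54; the middle pair is 4.2 and 7.
median.result <- median(x)
print(median.result)   # (4.2 + 7) / 2 = 5.6
```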
Mode
The mode is the value that has the highest number of occurrences in a set of data.
Unlike mean and median, the mode can be found for both numeric and character data.
R does not have a standard in-built function to calculate the mode.
So we create a user function to calculate the mode of a data set in R.
This function takes the vector as input and gives the mode value as output.
Example
# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Create the vector with numbers.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
# Calculate the mode using the user function.
result <- getmode(v)
print(result)
# Create the vector with characters.
charv <- c("o","it","the","it","it")
# Calculate the mode using the user function.
result <- getmode(charv)
print(result)
When we execute the above code, it produces the following result -
[1] 2
[1] "it"
R - Linear Regression
Regression analysis is a very widely used statistical tool to establish a relationship model between two variables.
One of these variables is called the predictor variable, whose value is gathered through experiments.
The other variable is called the response variable, whose value is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent (power) of both variables is 1.
Mathematically, a linear relationship represents a straight line when plotted as a graph.
A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is -
y = ax + b
Following is the description of the parameters used -
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when their height is known.
To do this we need the relationship between the height and weight of a person.
The steps to create the relationship are -
Carry out the experiment of gathering a sample of observed values of height and corresponding weight.
Create a relationship model using the lm() functions in R.
Find the coefficients from the model created and create the mathematical equation using these.
Get a summary of the relationship model to know the average error in prediction (the prediction errors are also called residuals).
To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations -
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is -
lm(formula,data)
Following is the description of the parameters used -
formula is a symbol presenting the relation between x and y.
data is the data frame on which the formula will be applied.
Create Relationship Model & get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
When we execute the above code, it produces the following result -
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
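For simple linear regression, the same coefficients can be derived by hand: the slope is cov(x, y)/var(x), and the intercept is mean(y) minus slope times mean(x). A quick check against the lm() output above:

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Least-squares slope and intercept computed directly.
b <- cov(x, y) / var(x)      # slope, about 0.6746
a <- mean(y) - b * mean(x)   # intercept, about -38.4551
print(c(intercept = a, slope = b))
```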
Get the Summary of the Relationship
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(summary(relation))
When we execute the above code, it produces the following result -
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
predict() Function
Syntax
The basic syntax for predict() in linear regression is -
predict(object, newdata)
Following is the description of the parameters used -
object is the model which has already been created using the lm() function.
newdata is the data frame containing the new value for the predictor variable.
Predict the weight of new persons
# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)
When we execute the above code, it produces the following result -
1
76.22869
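Since the fitted model is just the line y = a + b*x, the same prediction can be reproduced from the coefficients returned by coef():

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
# Evaluate the line equation at the new height value 170.
manual <- coef(relation)[1] + coef(relation)[2] * 170
print(unname(manual))   # about 76.22869
```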
Visualize the Regression Graphically
# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
# Give the chart file a name.
png(file = "linearregression.png")
# Plot the chart.
plot(y,x,col = "blue",main = "Height & Weight Regression",
abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Multiple Regression
Multiple regression is an extension of linear regression to the relationship between more than two variables.
In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
The general mathematical equation for multiple regression is -
y = a + b1x1 + b2x2 +...bnxn
Following is the description of the parameters used -
y is the response variable.
a, b1, b2...bn are the coefficients.
x1, x2, ...xn are the predictor variables.
We create the regression model using the lm() function in R.
The model determines the value of the coefficients using the input data.
Next we can predict the value of the response variable for a given set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is -
lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used -
formula is a symbol presenting the relation between the response variable and the predictor variables.
data is the data frame on which the formula will be applied.
Example
Input Data
Consider the data set "mtcars" available in the R environment.
It gives a comparison between different car models in terms of mileage per gallon ("mpg"), cylinder displacement ("disp"), horse power ("hp"), weight of the car ("wt") and some more parameters.
The goal of the model is to establish the relationship between "mpg" as a response variable with "disp","hp" and "wt" as predictor variables.
We create a subset of these variables from the mtcars data set for this purpose.
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
When we execute the above code, it produces the following result -
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
Create Relationship Model & get the Coefficients
input <- mtcars[,c("mpg","disp","hp","wt")]
# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
# Show the model.
print(model)
# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)
When we execute the above code, it produces the following result -
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
# # # # The Coefficient Values # # #
(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891
Create Equation for Regression Model
Based on the above intercept and coefficient values, we create the mathematical equation.
Y = a + Xdisp.x1 + Xhp.x2 + Xwt.x3
or
Y = 37.1055 + (-0.000937)*x1 + (-0.0312)*x2 + (-3.8009)*x3
Apply Equation for predicting New Values
We can use the regression equation created above to predict the mileage when a new set of values for displacement, horse power and weight is provided.
For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is -
Y = 37.1055 + (-0.000937)*221 + (-0.0312)*102 + (-3.8009)*2.91 = 22.66 (approximately)
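Instead of substituting the rounded coefficients by hand, the predict() function gives the prediction from the fitted model at full precision:

```r
input <- mtcars[,c("mpg","disp","hp","wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)
# Predict the mileage for disp = 221, hp = 102 and wt = 2.91.
new.car <- data.frame(disp = 221, hp = 102, wt = 2.91)
print(predict(model, new.car))   # about 22.66
```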
R - Logistic Regression
Logistic regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1.
It actually measures the probability of a binary response, as the value of the response variable, based on the mathematical equation relating it to the predictor variables.
The general mathematical equation for logistic regression is -
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
Following is the description of the parameters used -
y is the response variable.
x is the predictor variable.
a and b are the coefficients which are numeric constants.
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for glm() function in logistic regression is -
glm(formula,data,family)
Following is the description of the parameters used -
formula is the symbol presenting the relationship between the variables.
data is the data set giving the values of these variables.
family is an R object used to specify the details of the model.
Its value is binomial for logistic regression.
Example
The in-built data set "mtcars" describes different models of a car with their various engine specifications.
In "mtcars" data set, the transmission mode (automatic or manual) is described by the column am which is a binary value (0 or 1).
We can create a logistic regression model between the columns "am" and 3 other columns - hp, wt and cyl.
# Select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))
When we execute the above code, it produces the following result -
am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460
Create Regression Model
We use the glm() function to create the regression model and get its summary for analysis.
input <- mtcars[,c("am","cyl","hp","wt")]
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
When we execute the above code, it produces the following result -
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
Conclusion
In the summary, as the p-value in the last column is more than 0.05 for the variables "cyl" and "hp", we consider them to be insignificant in contributing to the value of the variable "am".
Only weight (wt) impacts the "am" value in this regression model.
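Once fitted, the model can also be used for prediction: calling predict() with type = "response" returns the estimated probability on the 0-1 scale rather than the log-odds. A brief sketch; the car specification below is hypothetical and not a row from the data set.

```r
# Refit the logistic regression model from above.
input <- mtcars[,c("am","cyl","hp","wt")]
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

# Hypothetical car: 4 cylinders, 100 hp, weight 2.5 (1000 lbs).
new.car <- data.frame(cyl = 4, hp = 100, wt = 2.5)

# Estimated probability of a manual transmission (am = 1).
print(predict(am.data, newdata = new.car, type = "response"))
```

A light car with modest power gets a probability well above 0.5, consistent with the conclusion that weight drives the transmission type in this model.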
R - Normal Distribution
In a random collection of data from independent sources, it is generally observed that the distribution of data is normal.
This means that on plotting a graph with the value of the variable on the horizontal axis and the count of the values on the vertical axis, we get a bell-shaped curve.
The center of the curve represents the mean of the data set.
In the graph, fifty percent of the values lie to the left of the mean and the other fifty percent lie to the right of it.
This is referred to as the normal distribution in statistics.
R has four in-built functions to generate the normal distribution.
They are described below.
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)
Following is the description of the parameters used in above functions -
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations (sample size).
mean is the mean value of the sample data.
Its default value is zero.
sd is the standard deviation.
Its default value is 1.
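These four functions are related: pnorm() is the cumulative integral of dnorm(), qnorm() is the inverse of pnorm(), and rnorm() draws random samples. A quick sanity check for the standard normal (mean 0, sd 1):

```r
# Height of the standard normal density at its mean.
print(dnorm(0))                 # 1/sqrt(2*pi), about 0.3989

# Half of the values lie below the mean.
print(pnorm(0))                 # 0.5

# qnorm() undoes pnorm().
print(qnorm(pnorm(1.5)))        # 1.5
```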
dnorm()
This function gives the height of the probability density at each point for a given mean and standard deviation.
# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
# Give the chart file a name.
png(file = "dnorm.png")
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
pnorm()
This function gives the probability of a normally distributed random number being less than the value of a given number.
It is also called "Cumulative Distribution Function".
# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10,10,by = .2)
# Choose the mean as 2.5 and standard deviation as 2.
y <- pnorm(x, mean = 2.5, sd = 2)
# Give the chart file a name.
png(file = "pnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
qnorm()
This function takes the probability value and gives a number whose cumulative value matches the probability value.
# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
# Choose the mean as 2 and standard deviation as 1.
y <- qnorm(x, mean = 2, sd = 1)
# Give the chart file a name.
png(file = "qnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
rnorm()
This function is used to generate random numbers whose distribution is normal.
It takes the sample size as input and generates that many random numbers.
We draw a histogram to show the distribution of the generated numbers.
# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)
# Give the chart file a name.
png(file = "rnorm.png")
# Plot the histogram for this sample.
hist(y, main = "Normal Distribution")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Binomial Distribution
The binomial distribution model deals with finding the probability of success of an event which has only two possible outcomes in a series of experiments.
For example, tossing of a coin always gives a head or a tail.
The probability of finding exactly 3 heads in 10 repeated tosses of a coin can be computed using the binomial distribution.
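For instance, that probability can be computed directly with the dbinom() function described below (a quick sketch using base R):

```r
# P(exactly 3 heads in 10 tosses of a fair coin)
p <- dbinom(3, size = 10, prob = 0.5)
print(p)   # choose(10, 3) / 2^10 = 0.1171875
```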
R has four in-built functions to generate binomial distribution.
They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
Following is the description of the parameters used -
x is a vector of numbers.
p is a vector of probabilities.
n is number of observations.
size is the number of trials.
prob is the probability of success of each trial.
dbinom()
This function gives the probability mass at each point, i.e. the probability of each possible number of successes.
# Create a sequence of numbers from 0 to 50 incremented by 1.
x <- seq(0,50,by = 1)
# Create the binomial distribution.
y <- dbinom(x,50,0.5)
# Give the chart file a name.
png(file = "dbinom.png")
# Plot the graph for this sample.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
pbinom()
This function gives the cumulative probability of an event.
It is a single value representing the probability.
# Probability of getting 26 or fewer heads in 51 tosses of a coin.
x <- pbinom(26,51,0.5)
print(x)
When we execute the above code, it produces the following result -
[1] 0.610116
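Since the cumulative probability is just the sum of the individual probabilities, the same number can be recovered from dbinom() (a quick cross-check, not part of the original example):

```r
# P(X <= 26) is the sum of P(X = 0) .. P(X = 26).
p.cumulative <- pbinom(26, size = 51, prob = 0.5)
p.summed <- sum(dbinom(0:26, size = 51, prob = 0.5))
print(all.equal(p.cumulative, p.summed))   # TRUE
```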
qbinom()
This function takes the probability value and gives a number whose cumulative value matches the probability value.
# Find the number of heads that has a cumulative probability of 0.25
# when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)
print(x)
When we execute the above code, it produces the following result -
[1] 23
rbinom()
This function generates the required number of random values from a binomial distribution with a given size and probability.
# Find 8 random values from a sample of 150 with probability of 0.4.
x <- rbinom(8,150,.4)
print(x)
When we execute the above code, it produces the following result -
[1] 58 61 59 66 55 60 61 67
R - Poisson Regression
Poisson Regression involves regression models in which the response variable is in the form of counts and not fractional numbers.
For example, the count of births or the number of wins in a football match series.
Also the values of the response variables follow a Poisson distribution.
The general mathematical equation for Poisson regression is -
log(y) = a + b1x1 + b2x2 + ... + bnxn
Following is the description of the parameters used -
y is the response variable.
a and b are the numeric coefficients.
x is the predictor variable.
The function used to create the Poisson regression model is the glm() function.
Syntax
The basic syntax for glm() function in Poisson regression is -
glm(formula,data,family)
Following is the description of the parameters used in above functions -
formula is the symbol representing the relationship between the variables.
data is the data set giving the values of these variables.
family is the R object used to specify the details of the model.
Its value is poisson for Poisson regression.
Example
We have the in-built data set "warpbreaks" which describes the effect of wool type (A or B) and tension (low, medium or high) on the number of warp breaks per loom.
Let's consider "breaks" as the response variable which is a count of number of breaks.
The wool "type" and "tension" are taken as predictor variables.
Input Data
input <- warpbreaks
print(head(input))
When we execute the above code, it produces the following result -
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
Create Regression Model
output <-glm(formula = breaks ~ wool+tension, data = warpbreaks,
family = poisson)
print(summary(output))
When we execute the above code, it produces the following result -
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 297.37 on 53 degrees of freedom
Residual deviance: 210.39 on 50 degrees of freedom
AIC: 493.06
Number of Fisher Scoring iterations: 4
In the summary we look for the p-value in the last column to be less than 0.05 to conclude that a predictor variable has an impact on the response variable.
As seen above, wool type B and the tension levels M and H each have a significant impact on the count of breaks.
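Because of the log link in the equation above, predictions on the response scale are the exponent of the linear predictor. The following sketch verifies this on the fitted model:

```r
# Refit the Poisson regression model from above.
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Predictions on the link (log) scale and on the response (count) scale.
eta <- predict(output, type = "link")
mu  <- predict(output, type = "response")

# exp() of the linear predictor gives the expected counts.
print(all.equal(exp(eta), mu))   # TRUE
```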
R - Analysis of Covariance
We use Regression analysis to create models which describe the effect of variation in predictor variables on the response variable.
Sometimes we also have a categorical variable with values like Yes/No or Male/Female.
A simple regression analysis then gives multiple results, one for each value of the categorical variable.
In such a scenario, we can study the effect of the categorical variable by using it along with the predictor variable and comparing the regression lines for each level of the categorical variable.
Such an analysis is termed Analysis of Covariance, also called ANCOVA.
Example
Consider the R built in data set mtcars.
In it we observe that the field "am" represents the type of transmission (auto or manual).
It is a categorical variable with values 0 and 1.
The miles per gallon value (mpg) of a car can also depend on it besides the value of horse power ("hp").
We study the effect of the value of "am" on the regression between "mpg" and "hp".
It is done by using the aov() function followed by the anova() function to compare the multiple regressions.
Input Data
Create a data frame containing the fields "mpg", "hp" and "am" from the data set mtcars.
Here we take "mpg" as the response variable, "hp" as the predictor variable and "am" as the categorical variable.
input <- mtcars[,c("am","mpg","hp")]
print(head(input))
When we execute the above code, it produces the following result -
am mpg hp
Mazda RX4 1 21.0 110
Mazda RX4 Wag 1 21.0 110
Datsun 710 1 22.8 93
Hornet 4 Drive 0 21.4 110
Hornet Sportabout 0 18.7 175
Valiant 0 18.1 105
ANCOVA Analysis
We create a regression model taking "hp" as the predictor variable and "mpg" as the response variable, taking into account the interaction between "am" and "hp".
Model with interaction between categorical variable and predictor variable
# Get the dataset.
input <- mtcars
# Create the regression model.
result <- aov(mpg~hp*am,data = input)
print(summary(result))
When we execute the above code, it produces the following result -
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 678.4 678.4 77.391 1.50e-09 ***
am 1 202.2 202.2 23.072 4.75e-05 ***
hp:am 1 0.0 0.0 0.001 0.981
Residuals 28 245.4 8.8
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than 0.05.
But the interaction between these two variables is not significant, as the p-value is more than 0.05.
Model without interaction between categorical variable and predictor variable
# Get the dataset.
input <- mtcars
# Create the regression model.
result <- aov(mpg~hp+am,data = input)
print(summary(result))
When we execute the above code, it produces the following result -
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 678.4 678.4 80.15 7.63e-10 ***
am 1 202.2 202.2 23.89 3.46e-05 ***
Residuals 29 245.4 8.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than 0.05.
Comparing Two Models
Now we can compare the two models to conclude whether the interaction of the variables is truly insignificant.
For this we use the anova() function.
# Get the dataset.
input <- mtcars
# Create the regression models.
result1 <- aov(mpg~hp*am,data = input)
result2 <- aov(mpg~hp+am,data = input)
# Compare the two models.
print(anova(result1,result2))
When we execute the above code, it produces the following result -
Model 1: mpg ~ hp * am
Model 2: mpg ~ hp + am
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 245.43
2 29 245.44 -1 -0.0052515 6e-04 0.9806
As the p-value is greater than 0.05 we conclude that the interaction between horse power and transmission type is not significant.
So the mileage per gallon will depend in a similar manner on the horse power of the car in both auto and manual transmission mode.
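The same additive model can also be fitted with lm() to read off the two parallel regression lines directly: the hp slope is shared by both transmission modes, while the am coefficient shifts the intercept for manual cars. A sketch on the same data:

```r
# Additive model: common hp slope, separate intercepts per am level.
model <- lm(mpg ~ hp + am, data = mtcars)
print(coef(model))

# For automatic cars (am = 0): mpg = intercept + hp.slope * hp
# For manual cars (am = 1):    mpg = (intercept + am.coef) + hp.slope * hp
```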
R - Time Series Analysis
Time series is a series of data points in which each data point is associated with a timestamp.
A simple example is the price of a stock in the stock market at different points of time on a given day.
Another example is the amount of rainfall in a region at different months of the year.
R language uses many functions to create, manipulate and plot the time series data.
The data for the time series is stored in an R object called time-series object.
It is also an R data object like a vector or data frame.
The time series object is created by using the ts() function.
Syntax
The basic syntax for ts() function in time series analysis is -
timeseries.object.name <- ts(data, start, end, frequency)
Following is the description of the parameters used -
data is a vector or matrix containing the values used in the time series.
start specifies the start time for the first observation in time series.
end specifies the end time for the last observation in time series.
frequency specifies the number of observations per unit time.
Except the parameter "data" all other parameters are optional.
Example
Consider the annual rainfall details at a place starting from January 2012.
We create an R time series object for a period of 12 months and plot it.
# Get the data points in form of a R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
# Convert it to a time series object.
rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall.png")
# Plot a graph of the time series.
plot(rainfall.timeseries)
# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart -
Jan Feb Mar Apr May Jun Jul Aug Sep
2012 799.0 1174.8 865.1 1334.6 635.4 918.5 685.5 998.6 784.2
Oct Nov Dec
2012 985.0 882.8 1071.0
The Time series chart -
Different Time Intervals
The value of the frequency parameter in the ts() function decides the time intervals at which the data points are measured.
A value of 12 indicates that the time series is for 12 months.
Other values and their meanings are as below -
frequency = 12 pegs the data points for every month of a year.
frequency = 4 pegs the data points for every quarter of a year.
frequency = 6 pegs the data points for every 10 minutes of an hour.
frequency = 24*6 pegs the data points for every 10 minutes of a day.
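As a brief sketch of a non-monthly frequency, the following creates a quarterly series covering two years; the sales figures are made up purely for illustration:

```r
# Eight quarterly observations starting in Q1 2012.
sales <- c(120, 135, 150, 142, 160, 171, 180, 176)
sales.timeseries <- ts(sales, start = c(2012, 1), frequency = 4)

# Printed with Qtr1..Qtr4 columns instead of month names.
print(sales.timeseries)
```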
Multiple Time Series
We can plot multiple time series in one chart by combining both the series into a matrix.
# Get the data points in form of a R vector.
rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <-
c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)
# Convert them to a matrix.
combined.rainfall <- matrix(c(rainfall1,rainfall2),nrow = 12)
# Convert it to a time series object.
rainfall.timeseries <- ts(combined.rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall_combined.png")
# Plot a graph of the time series.
plot(rainfall.timeseries, main = "Multiple Time Series")
# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart -
Series 1 Series 2
Jan 2012 799.0 655.0
Feb 2012 1174.8 1306.9
Mar 2012 865.1 1323.4
Apr 2012 1334.6 1172.2
May 2012 635.4 562.2
Jun 2012 918.5 824.0
Jul 2012 685.5 822.4
Aug 2012 998.6 1265.5
Sep 2012 784.2 799.6
Oct 2012 985.0 1105.6
Nov 2012 882.8 1106.7
Dec 2012 1071.0 1337.8
The Multiple Time series chart -
R - Nonlinear Least Square
When modeling real world data for regression analysis, we observe that it is rarely the case that the equation of the model is a linear equation giving a linear graph.
Most of the time, the equation of the model of real world data involves mathematical functions of a higher degree like an exponent of 3 or a sine function.
In such a scenario, the plot of the model gives a curve rather than a line.
The goal of both linear and non-linear regression is to adjust the values of the model's parameters to find the line or curve that comes closest to your data.
On finding these values we will be able to estimate the response variable with good accuracy.
In Least Square regression, we establish a regression model in which the sum of the squares of the vertical distances of different points from the regression curve is minimized.
We generally start with a defined model and assume some values for the coefficients.
We then apply the nls() function of R to get more accurate values along with the confidence intervals.
Syntax
The basic syntax for creating a nonlinear least square test in R is -
nls(formula, data, start)
Following is the description of the parameters used -
formula is a nonlinear model formula including variables and parameters.
data is a data frame used to evaluate the variables in the formula.
start is a named list or named numeric vector of starting estimates.
Example
We will consider a nonlinear model with assumption of initial values of its coefficients.
Next we will see what the confidence intervals of these assumed values are, so that we can judge how well these values fit into the model.
So let's consider the below equation for this purpose -
a = b1*x^2+b2
Let's assume the initial coefficients to be 1 and 3 and fit these values into nls() function.
xvalues <- c(1.6,2.1,2,2.23,3.71,3.25,3.4,3.86,1.19,2.21)
yvalues <- c(5.19,7.43,6.94,8.11,18.75,14.88,16.06,19.12,3.21,7.58)
# Give the chart file a name.
png(file = "nls.png")
# Plot these values.
plot(xvalues,yvalues)
# Take the assumed values and fit into the model.
model <- nls(yvalues ~ b1*xvalues^2+b2,start = list(b1 = 1,b2 = 3))
# Plot the chart with new data by fitting it to a prediction from 100 data points.
new.data <- data.frame(xvalues = seq(min(xvalues),max(xvalues),len = 100))
lines(new.data$xvalues,predict(model,newdata = new.data))
# Save the file.
dev.off()
# Get the sum of the squared residuals.
print(sum(resid(model)^2))
# Get the confidence intervals on the chosen values of the coefficients.
print(confint(model))
When we execute the above code, it produces the following result -
[1] 1.081935
Waiting for profiling to be done...
2.5% 97.5%
b1 1.137708 1.253135
b2 1.497364 2.496484
We can conclude that the value of b1 is closer to 1 while the value of b2 is closer to 2 and not 3.
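The point estimates themselves can be extracted with coef(); comparing them with the confidence intervals above shows that the fitted b1 and b2 lie well inside their bands. A quick sketch using the same data:

```r
xvalues <- c(1.6,2.1,2,2.23,3.71,3.25,3.4,3.86,1.19,2.21)
yvalues <- c(5.19,7.43,6.94,8.11,18.75,14.88,16.06,19.12,3.21,7.58)

# Fit the same model as above.
model <- nls(yvalues ~ b1*xvalues^2 + b2, start = list(b1 = 1, b2 = 3))

# Fitted values of the coefficients.
print(coef(model))
```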
R - Decision Tree
Decision tree is a graph to represent choices and their results in form of a tree.
The nodes in the graph represent an event or choice and the edges of the graph represent the decision rules or conditions.
It is mostly used in Machine Learning and Data Mining applications using R.
Examples of uses of decision trees are - predicting whether an email is spam or not, predicting whether a tumor is cancerous, or predicting whether a loan is a good or bad credit risk based on the factors in each of these.
Generally, a model is created with observed data also called training data.
Then a set of validation data is used to verify and improve the model.
R has packages which are used to create and visualize decision trees.
For a new set of predictor variables, we use this model to arrive at a decision on the category (yes/no, spam/not spam) of the data.
The R package "party" is used to create decision trees.
Install R Package
Use the below command in R console to install the package.
You also have to install the dependent packages if any.
install.packages("party")
The package "party" has the function ctree() which is used to create and analyze decision trees.
Syntax
The basic syntax for creating a decision tree in R is -
ctree(formula, data)
Following is the description of the parameters used -
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
Input Data
We will use the R in-built data set named readingSkills to create a decision tree.
It gives a person's reading skills score along with the variables "age", "shoeSize" and whether the person is a native speaker or not.
Here is the sample data.
# Load the party package. It will automatically load other
# dependent packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))
When we execute the above code, it produces the following result and chart -
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
We will use the ctree() function to create the decision tree and see its graph.
# Load the party package. It will automatically load other
# dependent packages.
library(party)
# Create the input data frame.
input.dat <- readingSkills[c(1:105),]
# Give the chart file a name.
png(file = "decision_tree.png")
# Create the tree.
output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)
# Plot the tree.
plot(output.tree)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
Loading required package: sandwich
Conclusion
From the decision tree shown above we can conclude that anyone whose readingSkills score is less than 38.3 and whose age is more than 6 is not a native speaker.
R - Random Forest
In the random forest approach, a large number of decision trees are created.
Every observation is fed into every decision tree.
The most common outcome for each observation is used as the final output.
A new observation is fed into all the trees and a majority vote is taken over the individual classification results.
An error estimate is made for the cases which were not used while building the tree.
That is called an OOB (Out-of-bag) error estimate which is mentioned as a percentage.
The R package "randomForest" is used to create random forests.
Install R Package
Use the below command in R console to install the package.
You also have to install the dependent packages if any.
install.packages("randomForest")
The package "randomForest" has the function randomForest() which is used to create and analyze random forests.
Syntax
The basic syntax for creating a random forest in R is -
randomForest(formula, data)
Following is the description of the parameters used -
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
Input Data
We will use the R in-built data set named readingSkills to create a random forest.
It gives a person's reading skills score along with the variables "age", "shoeSize" and whether the person is a native speaker.
Here is the sample data.
# Load the party package. It will automatically load other
# required packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))
When we execute the above code, it produces the following result and chart -
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
We will use the randomForest() function to create the random forest and view its results.
# Load the party package. It will automatically load other
# required packages.
library(party)
library(randomForest)
# Create the forest.
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
data = readingSkills)
# View the forest results.
print(output.forest)
# Importance of each predictor.
print(importance(output.forest, type = 2))
When we execute the above code, it produces the following result -
Call:
randomForest(formula = nativeSpeaker ~ age + shoeSize + score,
data = readingSkills)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 1%
Confusion matrix:
no yes class.error
no 99 1 0.01
yes 1 99 0.01
MeanDecreaseGini
age 13.95406
shoeSize 18.91006
score 56.73051
Conclusion
From the random forest shown above we can conclude that shoe size and score are the important factors deciding if someone is a native speaker or not.
Also the model has only 1% error which means we can predict with 99% accuracy.
R - Survival Analysis
Survival analysis deals with predicting the time when a specific event is going to occur.
It is also known as failure time analysis or analysis of time to death.
For example, predicting the number of days a person with cancer will survive or predicting the time when a mechanical system is going to fail.
The R package named survival is used to carry out survival analysis.
This package contains the function Surv() which takes the input data as an R formula and creates a survival object among the chosen variables for analysis.
Then we use the function survfit() to create a plot for the analysis.
Install Package
install.packages("survival")
Syntax
The basic syntax for creating survival analysis in R is -
Surv(time,event)
survfit(formula)
Following is the description of the parameters used -
time is the follow up time until the event occurs.
event indicates the status of occurrence of the expected event.
formula is the relationship between the predictor variables.
Example
We will consider the data set named "pbc" present in the survival packages installed above.
It describes the survival data points about people affected with primary biliary cirrhosis (PBC) of the liver.
Among the many columns present in the data set we are primarily concerned with the fields "time" and "status".
Time represents the number of days between registration of the patient and the earlier of the patient receiving a liver transplant or the death of the patient.
# Load the library.
library("survival")
# Print first few rows.
print(head(pbc))
When we execute the above code, it produces the following result and chart -
id time status trt age sex ascites hepato spiders edema bili chol
1 1 400 2 1 58.76523 f 1 1 1 1.0 14.5 261
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302
3 3 1012 2 1 70.07255 m 0 0 0 0.5 1.4 176
4 4 1925 2 1 54.74059 f 0 1 1 0.5 1.8 244
5 5 1504 1 2 38.10541 f 0 1 1 0.0 3.4 279
6 6 2503 2 2 66.25873 f 0 1 0 0.0 0.8 248
albumin copper alk.phos ast trig platelet protime stage
1 2.60 156 1718.0 137.95 172 190 12.2 4
2 4.14 54 7394.8 113.52 88 221 10.6 3
3 3.48 210 516.0 96.10 55 151 12.0 4
4 2.54 64 6121.8 60.63 92 183 10.3 4
5 3.53 143 671.0 113.15 72 136 10.9 3
6 3.98 50 944.0 93.00 63 NA 11.0 3
From the above data we are considering time and status for our analysis.
Applying Surv() and survfit() Function
Now we proceed to apply the Surv() function to the above data set and create a plot that will show the trend.
# Load the library.
library("survival")
# Create the survival object.
survfit(Surv(pbc$time,pbc$status == 2)~1)
# Give the chart file a name.
png(file = "survival.png")
# Plot the graph.
plot(survfit(Surv(pbc$time,pbc$status == 2)~1))
# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart -
Call: survfit(formula = Surv(pbc$time, pbc$status == 2) ~ 1)
n events median 0.95LCL 0.95UCL
418 161 3395 3090 3853
The trend in the above graph helps us predict the probability of survival at the end of a certain number of days.
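The numbers printed by survfit() can also be read programmatically, for example the number of subjects and the median survival time (a brief sketch, assuming the survival package installed above):

```r
# Load the library.
library("survival")

# Fit the survival curve as above.
fit <- survfit(Surv(pbc$time, pbc$status == 2) ~ 1)

# Number of subjects and estimated median survival time in days.
print(fit$n)
print(summary(fit)$table["median"])
```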
R - Chi Square Test
Chi-Square test is a statistical method to determine if two categorical variables have a significant correlation between them.
Both those variables should be from the same population and they should be categorical like - Yes/No, Male/Female, Red/Green etc.
For example, we can build a data set with observations on people's ice-cream buying pattern and try to correlate the gender of a person with the flavor of the ice-cream they prefer.
If a correlation is found we can plan for an appropriate stock of flavors by knowing the gender distribution of the people visiting.
Syntax
The function used for performing chi-Square test is chisq.test().
The basic syntax for creating a chi-square test in R is -
chisq.test(data)
Following is the description of the parameters used -
data is the data in form of a table containing the count value of the variables in the observation.
Example
We will take the Cars93 data in the "MASS" library which represents the sales of different models of cars in the year 1993.
library("MASS")
print(str(Cars93))
When we execute the above code, it produces the following result -
'data.frame': 93 obs. of 27 variables:
$ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
$ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
$ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Width : int 68 71 67 70 69 69 74 78 73 73 ...
$ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
$ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
$ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
$ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
$ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
The above result shows the dataset has many Factor variables which can be considered as categorical variables.
For our model we will consider the variables "AirBags" and "Type".
Here we aim to find out any significant correlation between the types of car sold and the type of Air bags it has.
If correlation is observed we can estimate which types of cars can sell better with what types of air bags.
# Load the library.
library("MASS")
# Create a data frame from the main data set.
car.data <- data.frame(Cars93$AirBags, Cars93$Type)
# Create a table with the needed variables.
car.data = table(Cars93$AirBags, Cars93$Type)
print(car.data)
# Perform the Chi-Square test.
print(chisq.test(car.data))
When we execute the above code, it produces the following result -
Compact Large Midsize Small Sporty Van
Driver & Passenger 2 4 7 0 3 0
Driver only 9 7 11 5 8 3
None 5 0 4 16 3 6
Pearson's Chi-squared test
data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723
Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
Conclusion
The result shows a p-value of less than 0.05, which indicates a strong correlation between the two variables.
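The p-value can also be checked programmatically rather than read off the printout. Below is a minimal sketch using a small made-up 2x2 table; the counts are purely illustrative and are not taken from Cars93.

```r
# A contrived 2x2 table - counts are made up for illustration
m <- matrix(c(20, 5, 10, 25), nrow = 2)
test <- chisq.test(m)
# chisq.test() returns a list; the p-value is a named component
print(test$p.value)
if (test$p.value < 0.05) {
  print("Significant association at the 5% level")
}
```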
R - Interview Questions
Dear readers, these R Interview Questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview for the subject of R programming.
In my experience, good interviewers hardly plan to ask any particular question during your interview; normally questions start with some basic concept of the subject and later continue based on further discussion and what you answer -
R is a programming language meant for statistical analysis and creating graphs for this purpose. Instead of data types, it has data objects which are used for calculations.
It is used in the fields of data mining, Regression analysis, Probability estimation etc., using many packages available in it.
There are 6 data objects in R.
They are vectors, lists, arrays, matrices, data frames and tables.
A valid variable name consists of letters, numbers and the dot or underline characters.
The variable name must start with a letter, or with a dot that is not followed by a number.
A matrix is always two dimensional as it has only rows and columns.
But an array can be of any number of dimensions and each dimension is a matrix.
For example a 3x3x2 array represents 2 matrices each of dimension 3x3.
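The 3x3x2 example can be reproduced with the array() function; a quick sketch:

```r
# two 3x3 matrices stacked along a third dimension
a <- array(1:18, dim = c(3, 3, 2))
print(dim(a))    # 3 3 2
print(a[, , 1])  # the first 3x3 matrix
print(a[, , 2])  # the second 3x3 matrix
```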
The factor data objects in R are used to store and process categorical data.
A csv file can be loaded using the read.csv function.
R creates a data frame on reading the csv files using this function.
The command getwd() gives the current working directory in the R environment.
This is the package which is loaded by default when the R environment is set up.
It provides the basic functionalities like input/output, arithmetic calculations etc. in the R environment.
Logistic regression deals with measuring the probability of a binary response variable.
In R the function glm() is used to create the logistic regression.
The expression M[4,2] gives the element at 4th row and 2nd column.
When two vectors of different lengths are involved in an operation, the elements of the shorter vector are reused to complete the operation.
This is called element recycling.
Example - v1 <- c(4,1,0,6) and v2 <- c(2,4); then v1*v2 gives (8,4,0,24).
The elements 2 and 4 are repeated.
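The recycling example above runs as-is:

```r
v1 <- c(4, 1, 0, 6)
v2 <- c(2, 4)
# v2 is recycled to c(2, 4, 2, 4) to match the length of v1
print(v1 * v2)  # 8 4 0 24
```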
We can call a function in R in 3 ways.
First method is to call by using position of the arguments.
The second method is to call by using the names of the arguments, and the third method is to call with default arguments.
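A short sketch of the three calling styles, using a made-up function f for illustration:

```r
# f is a hypothetical function used only to demonstrate call styles
f <- function(a, b = 10) a - b
print(f(12, 5))          # by position of the arguments: 7
print(f(b = 5, a = 12))  # by name of the arguments: 7
print(f(12))             # by default argument (b falls back to 10): 2
```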
The lazy evaluation of a function means, the argument is evaluated only if it is used inside the body of the function.
If there is no reference to the argument in the body of the function then it is simply ignored.
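Lazy evaluation can be demonstrated with a small sketch: an argument that is never touched in the body causes no error even when it is missing.

```r
# y is declared but never used in the body, so it is never evaluated
f <- function(x, y) x * 2
print(f(5))  # works even though y was not supplied - returns 10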
To install a package in R we use the below command.
install.packages("package Name")
The package named "XML" is used to read and process the XML files.
We can update any of the elements in a list, but we can delete only the element at the end of the list.
The general expression to create a matrix in R is - matrix(data, nrow, ncol, byrow, dimnames)
The boxplot() function is used to create boxplots in R.
It takes a formula and a data frame as inputs to create the boxplots.
Frequency 6 indicates the time interval for the time series data is every 10 minutes of an hour.
In R the data objects can be converted from one form to another.
For example we can create a data frame by merging many lists.
This involves a series of R commands to bring the data into the new format.
This is called data reshaping.
It generates 4 random numbers between 0 and 1.
Use the command
installed.packages()
It splits the strings in vector x into substrings at the position of letter e.
x <- "The quick brown fox jumps over the lazy dog"
split.string <- strsplit(x, " ")
extract.words <- split.string[[1]]
result <- unique(tolower(extract.words))
print(result)
Error in v * x[1] : non-numeric argument to binary operator
[1] 5 12 21 32
It converts a list to a vector.
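The answer above presumably refers to the unlist() function; a quick sketch:

```r
x <- list(1, 2, 3)
v <- unlist(x)  # flattens the list into a numeric vector
print(v)        # 1 2 3
print(is.vector(v))
```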
x <- pbinom(26,51,0.5)
print(x)
NA
Using the function as.data.frame()
function(x) { x[is.na(x)] <- sum(x, na.rm = TRUE); x }
It is used to apply the same function to each of the elements in an array.
For example, finding the mean of every row.
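This presumably describes the apply() function; a minimal sketch of taking the mean of every row:

```r
m <- matrix(1:6, nrow = 2)  # rows are (1, 3, 5) and (2, 4, 6)
# MARGIN = 1 applies the function over rows
print(apply(m, 1, mean))    # 3 4
```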
Every matrix can be called an array but not the reverse.
Matrix is always two dimensional but array can be of any dimension.
?NA
sd(x, na.rm=TRUE)
setwd("Path")
"%%" gives remainder of the division of first vector with second while "%/%" gives the quotient of the division of first vector with second.
It finds the column that has the maximum value for each row.
hist()
rm(x)
data(package = "MASS")
data(package = .packages(all.available = TRUE))
It is used to install an R package from a local directory by browsing and selecting the file.
15 %in% x
pairs(formula, data)
Where formula represents the series of variables used in pairs and data represents the data set from which the variables will be taken.
The subset() function is used to select variables and observations.
The sample() function is used to choose a random sample of size n from a dataset.
is.matrix(m) should return TRUE.
[1] NA
The function t() is used for transposing a matrix.
Example - t(m) , where m is a matrix.
The "next" statement in R programming language is useful when we want to skip the current iteration of a loop without terminating it.
What is Next?
Further, you can go through the past assignments you have done with the subject and make sure you are able to speak confidently about them.
If you are a fresher, the interviewer does not expect you to answer very complex questions; rather, you have to make your basic concepts very strong.
Second, it really doesn't matter much if you could not answer a few questions, but it matters that whatever you answered, you answered with confidence.
So just feel confident during your interview.
We at tutorialspoint wish you the best of luck for a good interviewer and all the very best for your future endeavor.
Cheers :-)
The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions).
At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started.
Graphic User Interfaces
Aside from the built in R console, RStudio is the most popular R code editor, and it interfaces with R for Windows, MacOS, and Linux platforms.
R's binary and logical operators will look very familiar to programmers.
Note that binary operators work on vectors and matrices as well as scalars.
Arithmetic Operators include:
Use the assignment operator <- to create new variables.
# An example of computing a sum and a mean with variables
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2
Almost everything in R is done through functions.
A function is a piece of code written to carry out a specified task; it may accept arguments or parameters (or not) and it may return one or more values (or not!).
In R, a function is defined with the construct:
function ( arglist ) {body}
The code in between the curly braces is the body of the function.
Note that by using built-in functions, the only thing you need to worry about is how to effectively communicate the correct input arguments (arglist) and manage the return value/s (if any).
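As a sketch of the construct above, here is a tiny user-defined function with one argument in arglist and one return value:

```r
# square takes one argument and returns its square
square <- function(x) {
  x * x
}
print(square(4))  # 16
```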
Importing data into R is fairly simple.
R offers options to import many file types, from CSVs to databases.
For example, this is how to import a CSV into R.
# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on mswindows systems
mydata <- read.table("c:/mydata.csv", header=TRUE,
sep=",", row.names="id")
R provides a wide range of functions for obtaining summary statistics.
One way to get descriptive statistics is to use the sapply( ) function with a specified summary statistic.
Below is how to get the mean with the sapply( ) function:
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.
In R, graphs are typically created interactively.
Here is an example:
# Creating a Graph
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
The plot( ) function opens a graph window and plots weight vs. miles per gallon.
The next line of code adds a regression line to this graph.
The final line adds a title.
Packages
Packages are collections of R functions, data, and compiled code in a well-defined format.
The directory where packages are stored is called the library.
R comes with a standard set of packages.
Others are available for download and installation.
Once installed, they have to be loaded into the session to be used.
.libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded
Once R is installed, there is a comprehensive built-in help system.
At the program's command prompt you can use any of the following:
help.start() # general help
help(foo) # help about function foo
?foo # same thing
apropos("foo") # list all functions containing string foo
example(foo) # show an example of function foo
Going Further
If you prefer an online interactive environment to learn R, this free R tutorial by DataCamp is a great way to get started.
R is a dialect of the S language.
It is a case-sensitive, interpreted language.
You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file.
There is a wide variety of data types, including vectors (numerical, character, logical), matrices, data frames, and lists.
Most functionality is provided through built-in and user-created functions, and all data objects are kept in memory during an interactive session.
Basic functions are available by default.
Other functions are contained in packages that can be attached to a current session as needed.
R is a case-sensitive language.
FOO, Foo, and foo are three different objects!
This section describes working with the R interface.
A key skill to using R effectively is learning how to use the built-in help system.
Other sections describe the working environment, inputting programs and outputting results, installing new functionality through packages, GUIs that have been developed for R, customizing the environment, producing high quality output, and running programs in batch.
A fundamental design feature of R is that the output from most functions can be used as input to other functions.
This is described in reusing results.
Unlike SAS, which has DATA and PROC steps, R has data structures (vectors, matrices, arrays, data frames) that you can operate on through functions that perform statistical analyses and create graphs.
In this way, R is similar to PROC IML.
This section describes how to enter or import data into R, and how to prepare it for use in statistical analyses.
Topics include R data structures, importing data
(from Excel, SPSS, SAS, Stata, and ASCII Text Files), entering data from the keyboard, creating an
interface with a database management system, exporting data
(to Excel, SPSS, SAS, Stata, and Tab Delimited Text Files), annotating data (with variable
labels and value labels), and listing data.
In addition,
methods for handling missing values and date values are presented.
To Practice
Loading data into R is covered in the free first chapter of this interactive course: Introduction to Data.
Once you have access to your data, you will want to massage it into useful form.
This includes creating new variables (including recoding and renaming existing variables), sorting and merging datasets, aggregating data, reshaping data, and subsetting datasets (including selecting observations that meet criteria, randomly sampling observations, and dropping or keeping variables).
Each of these activities usually involves the use of R's built-in operators (arithmetic and logical) and functions (numeric, character, and statistical).
Additionally, you may need to use control structures (if-then, for, while, switch) in your programs and/or create your own functions.
Finally you may need to convert variables or datasets from one type to another (e.g.
numeric to character or matrix to data frame).
This section describes each task from an R perspective.
To Practice
To practice managing data in R, try the first chapter of this interactive course.
This section describes basic (and not so basic) statistics.
It includes code for obtaining descriptive statistics, frequency counts and crosstabulations (including tests of independence), correlations (pearson, spearman, kendall, polychoric), t-tests (with equal and unequal variances), nonparametric tests of group differences (Mann-Whitney U, Wilcoxon Signed Rank, Kruskal-Wallis Test, Friedman Test), multiple linear regression (including diagnostics, cross-validation and variable selection), analysis of variance (including ANCOVA and MANOVA), and statistics based on resampling.
Since modern data analyses almost always involve graphical assessments of relationships and assumptions, links to appropriate graphical methods are provided throughout.
It is always important to check model assumptions before making statistical inferences.
Although it is somewhat artificial to separate regression modeling and an ANOVA framework in this regard, many people learn these topics separately, so I've followed the same convention here.
Regression diagnostics cover outliers, influential observations, non-normality, non-constant error variance, multicollinearity, nonlinearity, and non-independence of errors.
Classical test assumptions for ANOVA/ANCOVA/MANCOVA include the assessment of normality and homogeneity of variances in the univariate case, and multivariate normality and homogeneity of covariance matrices in the multivariate case.
The identification of multivariate outliers is also considered.
Power analysis provides methods of statistical power analysis and sample size estimation for a variety of designs.
Finally, two functions that aid in efficient processing (with and by) are described.
More advanced statistical modeling can be found in the Advanced Statistics section.
Going Further
To practice statistics in R interactively, try this course on the introduction to statistics.
This section describes more advanced statistical methods.
This includes the discovery and exploration of complex multivariate relationships among variables.
Links to appropriate graphical methods are also provided throughout.
Basic statistics are described in the previous section.
It is difficult to order these topics in a straightforward way.
I have chosen the following (admittedly arbitrary) headings.
Cluster Analysis includes partitioning (k-means), hierarchical agglomerative, and model based approaches.
Tree-Based methods (which could easily have gone under predictive models!) include classification and regression trees, random forests, and other partitioning methodologies.
Other Tools
This section includes tools that are broadly useful including bootstrapping in R and matrix algebra programming (think MATRIX in SPSS or PROC IML in SAS).
Going Further
Try the Kaggle R Tutorial on Machine Learning which includes an exercise with Random Forests.
One of the main reasons data analysts turn to R is for its strong graphic capabilities.
Creating a Graph provides an overview of creating and saving graphs in R.
The remainder of the section describes how to create basic graph types.
These include density plots (histograms and kernel density plots), dot plots, bar charts (simple, stacked, grouped), line charts, pie charts (simple, annotated, 3D), boxplots (simple, notched, violin plots, bagplots) and Scatterplots (simple, with fit lines, scatterplot matrices, high density plots, and 3D plots).
The Advanced Graphs section describes how to customize and annotate graphs, and covers more statistically complex types of graphs.
To Practice
To practice the basics of plotting in R interactively, try this course from DataCamp.
This section describes how to customize your graphs.
It also covers more statistically sophisticated graphs.
This is one of the many places that R really shines.
Customization
Graphical parameters describes how to change a graph's symbols, fonts, colors, and lines.
Axes and text describe how to customize a graph's axes, add reference lines, text annotations and a legend.
Combining plots describes how to organize multiple plots into a single graph.
Advanced Graph Types
The lattice package provides a comprehensive system for visualizing multivariate data, including the ability to create plots conditioned on one or more variables.
The ggplot2 package offers an elegant system for generating univariate and multivariate graphs based on a grammar of graphics.
Other graph types include probability plots, mosaic plots, and correlograms.
Finally, methods of interacting with graphs (e.g. linking multiple graphs with color brushing, or interactive rotation in real-time) are provided.
For simpler, more fundamental graphs, see the Basic Graphs section.
I have been a hardcore SAS and SPSS programmer for more than 25 years, a Systat programmer for 15 years and a Stata programmer for 2 years.
But when I started learning R recently, I found it frustratingly difficult.
Why?
I think that there are two reasons why R can be challenging to learn quickly.
First, while there are many introductory tutorials (covering data types, basic commands, the interface), none alone are comprehensive.
In part, this is because much of the advanced functionality of R comes from hundreds of user contributed packages.
Hunting for what you want can be time consuming, and it can be hard to get a clear overview of what procedures are available.
The second reason is more ephemeral.
As users of statistical packages, we tend to run one prescribed procedure for each type of analysis.
Think of PROC GLM in SAS.
We can carefully set up the run with all the parameters and options that we need.
When we run the procedure, the resulting output may be a hundred pages long.
We then sift through this output pulling out what we need and discarding the rest.
The paradigm in R is different.
Rather than setting up a complete analysis at once, the process is highly interactive.
You run a command (say fit a model), take the results and process it through another command (say a set of diagnostic plots), take those results and process it through another command (say cross-validation), etc.
The cycle may include transforming the data, and looping back through the whole process again.
You stop when you feel that you have fully analyzed the data.
It may sound trite, but this reminds me of the paradigm shift from top-down procedural programming to object-oriented programming we saw a few years ago.
It is not an easy mental shift for many of us to make.
In the end, however, I believe that you will feel much more intimately in touch with your data and in control of your work.
And it's fun!
To Practice
This free interactive course covers the basics of R.
Commands are entered interactively at the R user prompt.
Up and down arrow keys scroll through your command history.
You will probably want to keep different projects in different physical directories.
Here are some standard commands for managing your workspace.
getwd() # print the current working directory - cwd
ls() # list the objects in the current workspace
setwd(mydirectory) # change to mydirectory
setwd("c:/docs/mydir") # note / instead of \ in windows
setwd("/usr/rob/mydir") # on linux
# view and set options for the session
help(options) # learn about available options
options() # view current option settings
options(digits=3) # number of digits to print on output
# work with your previous commands
history() # display last 25 commands
history(max.show=Inf) # display all previous commands
# save your command history
savehistory(file="myfile") # default is ".Rhistory"
# recall your command history
loadhistory(file="myfile") # default is ".Rhistory"
# save the workspace to the file .RData in the cwd
save.image()
# save specific objects to a file
# if you don't specify the path, the cwd is assumed
save(object list, file="myfile.RData")
# load a workspace into the current session
# if you don't specify the path, the cwd is assumed
load("myfile.RData")
q() # quit R.
You will be prompted to save the workspace.
Important Note to Windows Users:
R gets confused if you use a path in your code like:
c:\mydocuments\myfile.txt
This is because R sees "\" as an escape character.
Instead, use:
c:\\mydocuments\\myfile.txt
c:/mydocuments/myfile.txt
Either will work.
I use the second convention throughout this website.
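A small sketch of why the doubling is needed: inside an R string literal, "\\" denotes a single backslash character.

```r
p <- "c:\\mydocuments\\myfile.txt"
cat(p, "\n")        # prints c:\mydocuments\myfile.txt
print(nchar("\\"))  # 1 - the escaped sequence is a single character
```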
To Practice
This free intro to R course will get you familiar with the R workspace.
R is a command line driven program.
The user enters commands at the prompt ( > by default ) and each command is executed one at a time.
There have been a number of attempts to create a more graphical interface, ranging from code editors that interact with R, to full-blown GUIs that present the user with menus and dialog boxes.
RStudio is my favorite example of a code editor that interfaces with R for Windows, MacOS, and Linux platforms.
Perhaps the most stable, full-blown GUI is R Commander, which can also run under Windows, Linux, and MacOS (see the documentation for technical requirements).
Both of these programs can make R a lot easier to use.
To Practice
This interactive course gives an overview of installing and working with RStudio.
Arithmetic Operators
Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
^ or **     exponentiation
x %% y      modulus (x mod y): 5%%2 is 1
x %/% y     integer division: 5%/%2 is 2
Logical Operators
Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10
# How it works
x <- c(1:10)
x
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
x < 5
T T T T F F F F F F
x > 8 | x < 5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector
Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
  dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows.
byrow=FALSE indicates that the matrix should be filled by columns (the default).
dimnames provides optional labels for the columns and rows.
# generates 5 x 4 numeric matrix
y <- matrix(1:20, nrow=5, ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
dimnames=list(rnames, cnames))
Identify rows, columns or elements using subscripts.
x[,4] # 4th column of matrix
x[3,] # 3rd row of matrix
x[2:4,1:3] # rows 2,3,4 of columns 1,2,3
Arrays
Arrays are similar to matrices but can have more than two dimensions.
See help(array) for details.
Data Frames
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
This is similar to SAS and SPSS datasets.
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names
There are a variety of ways to identify the elements of a data frame.
myframe[3:5] # columns 3,4,5 of data frame
myframe[c("ID","Age")] # columns ID and Age from data frame
myframe$x1 # variable x1 in the data frame
Lists
An ordered collection of objects (components).
A list allows you to gather a variety of (possibly unrelated) objects under one name.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of a list containing two lists
v <- c(list1,list2)
Identify elements of a list using the [[]] convention.
mylist[[2]] # 2nd component of the list
mylist[["mynumbers"]] # component named mynumbers in list
Factors
Tell R that a variable is nominal by making it a factor.
The factor stores the nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
An ordered factor is used to represent an ordinal variable.
# variable rating coded as "large", "medium", "small"
rating <- ordered(rating)
# recodes rating to 1,2,3 and associates
# 1=large, 2=medium, 3=small internally
# R now treats rating as ordinal
R will treat factors as nominal variables and ordered factors as ordinal variables in statistical procedures and graphical analyses.
You can use options in the factor( ) and ordered( ) functions to control the mapping of integers to strings (overriding the alphabetical ordering).
You can also use factors to create value labels.
For more on factors see the UCLA page.
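A sketch of overriding the default alphabetical ordering with an explicit levels argument:

```r
# levels= fixes the integer mapping: small=1, medium=2, large=3
rating <- factor(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"),
                 ordered = TRUE)
print(as.integer(rating))  # 1 3 2
print(levels(rating))      # "small" "medium" "large"
```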
Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
object # prints the object
ls() # list current objects
rm(object) # delete an object
newobject <- edit(object) # edit copy and save as newobject
fix(object) # edit in place
To Practice
To explore data types in R, try this free interactive introduction to R course
Use the assignment operator <- to create new variables.
A wide array of operators and functions are available here.
# Three examples for doing the same computations
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2
attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)
mydata <- transform( mydata,
sum = x1 + x2,
mean = (x1 + x2)/2
)
(To practice working with variables in R, try the first chapter of this free interactive course.)
Recoding variables
In order to recode data, you will probably use one or more of R's control structures.
# create 2 age categories
mydata$agecat <- ifelse(mydata$age > 70,
c("older"), c("younger"))
# another example: create 3 age categories
attach(mydata)
mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"
detach(mydata)
Renaming variables
You can rename variables programmatically or interactively.
# rename interactively
fix(mydata) # results are saved on close
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
# you can re-enter all the variable names in order,
# changing the ones you need to change. The limitation
# is that you need to enter all of them!
names(mydata) <- c("x1","age","y", "ses")
Almost everything in R is done through functions.
Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables.
(To practice working with functions, try the functions sections of this interactive course.)
Numeric Functions
Function                Description
abs(x)                  absolute value
sqrt(x)                 square root
ceiling(x)              ceiling(3.475) is 4
floor(x)                floor(3.475) is 3
trunc(x)                trunc(5.99) is 5
round(x, digits=n)      round(3.475, digits=2) is 3.48
signif(x, digits=n)     signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)  also acos(x), cosh(x), acosh(x), etc.
log(x)                  natural logarithm
log10(x)                common logarithm
exp(x)                  e^x
Character Functions
Function    Description
substr(x, start=n1, stop=n2)
    Extract or replace substrings in a character vector.
    x <- "abcdef"
    substr(x, 2, 4) is "bcd"
    substr(x, 2, 4) <- "22222" is "a222ef"
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
    Search for pattern in x.
    If fixed=FALSE then pattern is a regular expression.
    If fixed=TRUE then pattern is a text string.
    Returns matching indices.
    grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
    Find pattern in x and replace with replacement text.
    If fixed=FALSE then pattern is a regular expression.
    If fixed=TRUE then pattern is a text string.
    sub("\\s",".","Hello There") returns "Hello.There"
strsplit(x, split)
    Split the elements of character vector x at split.
    strsplit("abc", "") returns a 3 element vector "a","b","c"
paste(..., sep="")
    Concatenate strings after using sep string to separate them.
    paste("x",1:3,sep="") returns c("x1","x2","x3")
    paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
    paste("Today is", date())
toupper(x)  Uppercase
tolower(x)  Lowercase
Statistical Probability Functions
The following table describes functions related to probability distributions.
For random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.
Function    Description
dnorm(x)
    normal density function (by default m=0 sd=1)
    # plot standard normal curve
    x <- pretty(c(-3,3), 30)
    y <- dnorm(x)
    plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)
    cumulative normal probability for q
    (area under the normal curve to the left of q)
    pnorm(1.96) is 0.975
qnorm(p)
    normal quantile.
    value at the p percentile of normal distribution
    qnorm(.9) is 1.28 # 90th percentile
rnorm(n, m=0, sd=1)
    n random normal deviates with mean m
    and standard deviation sd.
    # 50 random normal variates with mean=50, sd=10
    x <- rnorm(50, m=50, sd=10)
dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
    binomial distribution where size is the sample size
    and prob is the probability of a heads (pi)
    # prob of 0 to 5 heads of fair coin out of 10 flips
    dbinom(0:5, 10, .5)
    # prob of 5 or less heads of fair coin out of 10 flips
    pbinom(5, 10, .5)
dpois(x, lambda)
ppois(q, lambda)
qpois(p, lambda)
rpois(n, lambda)
    poisson distribution with mean=variance=lambda
    # probability of 0, 1, or 2 events with lambda=4
    dpois(0:2, 4)
    # probability of at least 3 events with lambda=4
    1 - ppois(2, 4)
dunif(x, min=0, max=1)
punif(q, min=0, max=1)
qunif(p, min=0, max=1)
runif(n, min=0, max=1)
    uniform distribution, follows the same pattern
    as the normal distribution above.
    # 10 uniform random variates
    x <- runif(10)
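The set.seed() point made above can be sketched quickly: resetting the seed replays the same pseudo-random sequence.

```r
set.seed(1234)  # any integer works; 1234 is just an example
a <- runif(3)
set.seed(1234)  # resetting the seed replays the same sequence
b <- runif(3)
print(identical(a, b))  # TRUE
```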
Other Statistical Functions
Other useful statistical functions are provided in the following table.
Each has the option na.rm to strip missing values before calculations.
Otherwise the presence of missing values will lead to a missing result.
Object can be a numeric vector or data frame.
Function
Description
mean(x, trim=0,
na.rm=FALSE)
mean of object x
# trimmed mean, removing any missing values and
# 5 percent of highest and lowest scores
mx <- mean(x,trim=.05,na.rm=TRUE)
sd(x)
standard deviation of object(x).
also look at var(x) for variance and mad(x) for median absolute deviation.
median(x)
median
quantile(x, probs)
quantiles where x is the numeric vector whose quantiles are desired and probs is a numeric vector with probabilities in [0,1].
# 30th and 84th percentiles of x
y <- quantile(x, c(.3,.84))
range(x)
range
sum(x)
sum
diff(x, lag=1)
lagged differences, with lag indicating which lag to use
min(x)
minimum
max(x)
maximum
scale(x, center=TRUE, scale=TRUE)
column center or standardize a matrix.
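For example, scale() can standardize the columns of a matrix (a minimal sketch; the values are invented for illustration):

```r
# standardize the columns of a small matrix
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), nrow = 4)
z <- scale(m)                    # center=TRUE, scale=TRUE by default
round(colMeans(z), 10)           # each column now has mean 0
apply(z, 2, sd)                  # and standard deviation 1
# center only, leaving the spread unchanged
centered <- scale(m, center = TRUE, scale = FALSE)
```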
Other Useful Functions
Function
Description
seq(from , to, by)
generate a sequence
indices <- seq(1,10,2)
#indices is c(1, 3, 5, 7, 9)
rep(x, ntimes)
repeat x n times
y <- rep(1:3, 2)
# y is c(1, 2, 3, 1, 2, 3)
cut(x, n)
divide a continuous variable into a factor with n levels
y <- cut(x, 5)
Note that while the examples on this page apply functions to individual variables, many can be applied to vectors and matrices as well.
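For instance, apply() runs any of these summary functions over the rows or columns of a matrix (a small sketch):

```r
# apply a summary function across the margins of a matrix
m <- matrix(1:12, nrow = 3)   # 3 rows, 4 columns, filled by column
apply(m, 2, mean)             # column means: 2 5 8 11
apply(m, 1, sum)              # row sums: 22 26 30
sapply(as.data.frame(m), sd)  # per-column sd of the same data
```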
Importing data into R is fairly simple.
For Stata and Systat, use the foreign package.
For SPSS and SAS I would recommend the Hmisc package for ease and functionality.
See the Quick-R section on packages for information on obtaining and installing these packages.
Examples of importing data are provided below.
From A Comma Delimited Text File
# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on MS Windows systems
mydata <- read.table("c:/mydata.csv", header=TRUE,
sep=",", row.names="id")
(To practice importing a csv file, try this exercise.)
From Excel
One of the best ways to read an Excel file is to export it to a comma delimited file and import it using the method above.
Alternatively you can use the xlsx package to access Excel files.
The first row should contain variable/column names.
# read in the first worksheet from the workbook myexcel.xlsx
# first row contains variable names
library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", 1)
# read in the worksheet named mysheet
mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")
(To practice, try this exercise on importing an Excel worksheet into R.)
From SPSS
# save SPSS dataset in transport format
get file='c:\mydata.sav'.
export outfile='c:\mydata.por'.
# in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors
(To practice importing SPSS data with the foreign package, try this exercise.)
From SAS
# save SAS dataset in transport format
libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;
# in R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
# character variables are converted to R factors
From Stata
# input Stata file
library(foreign)
mydata <- read.dta("c:/mydata.dta")
(To practice importing Stata data with the foreign package, try this exercise.)
Try this interactive course: Importing Data in R (Part 1), to work with csv and xlsx files in R.
To work with SAS, Stata, and other formats try Part 2.
R provides a wide range of functions for obtaining summary statistics.
One method of obtaining descriptive statistics is to use the sapply( ) function with a specified summary statistic.
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.
There are also numerous R functions designed to provide a range of descriptive statistics at once.
For example
# mean,median,25th and 75th quartiles,min,max
summary(mydata)
# Tukey min,lower-hinge, median,upper-hinge,max
fivenum(x)
Using the Hmisc package
library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores
Using the pastecs package
library(pastecs)
stat.desc(mydata)
# nbr.val, nbr.null, nbr.na, min, max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, coef.var
Using the psych package
library(psych)
describe(mydata)
# item name, item number, nvalid, mean, sd,
# median, mad, min, max, skew, kurtosis, se
Summary Statistics by Group
A simple way of generating summary statistics by grouping variable is available in the psych package.
library(psych)
describe.by(mydata, group,...)
The doBy package provides much of the functionality of SAS PROC SUMMARY.
It defines the desired table using a model formula and a function.
Here is a simple example.
library(doBy)
summaryBy(mpg + wt ~ cyl + vs, data = mtcars,
FUN = function(x) {
c(m = mean(x), s = sd(x))
} )
# produces mpg.m wt.m mpg.s wt.s for each
# combination of the levels of cyl and vs
See also: aggregating data.
To Practice
Want to practice interactively? Try this free course on statistics and R
In R, graphs are typically created interactively.
# Creating a Graph
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
The plot( ) function opens a graph window and plots weight vs. miles per gallon.
The next line of code adds a regression line to this graph.
The final line adds a title.
Saving Graphs
You can save the graph in a variety of formats from the menu
File -> Save As.
You can also save the graph via code using one of the following functions.
Creating a new graph by issuing a high level plotting command (plot, hist, boxplot, etc.) will typically overwrite a previous graph.
To avoid this, open a new graph window before creating a new graph.
To open a new graph window use one of the functions below.
Function
Platform
windows()
Windows
X11()
Unix
quartz()
Mac
You can have multiple graph windows open at one time.
See help(dev.cur) for more details.
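A sketch of managing devices from code (dev.new() is the portable way to open an additional device; in a non-interactive session it may open a file device rather than a window):

```r
dev.new()            # open a first device
plot(1:10)
dev.new()            # open a second device; it becomes active
plot(10:1)
dev.cur()            # reports the active device
dev.set(dev.prev())  # switch back to the first device
dev.off()            # close the active device
```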
Alternatively, after opening the first graph window, choose History -> Recording from the graph window menu.
Then you can use Previous and Next to step through the graphs you have created.
Try the creating graph exercises in this course on data visualization in R.
Packages are collections of R functions, data, and compiled code in a well-defined format.
The directory where packages are stored is called the library.
R comes with a standard set of packages.
Others are available for download and installation.
Once installed, they have to be loaded into the session to be used.
.libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded
Adding Packages
You can expand the types of analyses you do by adding other packages.
A complete list of contributed packages is available from CRAN.
Follow these steps:
Download and install a package (you only need to do this once). To use the package, invoke the library(package) command to load it into the current session.
(You need to do this once in each session, unless you customize your environment to automatically load it each time.)
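In code, the two steps look like this (the package name boot is just the example used below):

```r
# one-time install from CRAN
install.packages("boot")
# load it in each session where it is needed
library(boot)
```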
On MS Windows:
Choose Install Packages from the Packages menu.
Select a CRAN Mirror (e.g. Norway).
Select a package (e.g. boot).
Then use the library(package) function to load it for use (e.g. library(boot)).
On Linux:
Download the package of interest as a compressed file.
At the command prompt, install it using
R CMD INSTALL [options] [-l lib] pkgs
Use the library(package) function within R to load it for use in the session.
This free interactive course covers the basics of R.
Once R is installed, there is a comprehensive built-in help system.
At the program's command prompt you can use any of the following:
help.start() # general help
help(foo) # help about function foo
?foo # same thing
apropos("foo")
# list all functions containing string foo
example(foo) # show an example of function foo
# search for foo in help manuals and archived mailing lists
RSiteSearch("foo")
# get vignettes on using installed packages
vignette() # show available vignettes
vignette("foo") # show specific vignette
Sample Datasets
R comes with a number of sample datasets that you can experiment with.
Type data( ) to see the available datasets.
The results will depend on which packages you have loaded.
Type help(datasetname) for details on a sample dataset.
To Practice
This free interactive course covers the basics of R.
The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions).
At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started.
Commands are entered interactively at the R user prompt.
Up and down arrow keys scroll through your command history.
You will probably want to keep different projects in different physical directories.
Here are some standard commands for managing your workspace.
getwd() # print the current working directory - cwd
ls() # list the objects in the current workspace
setwd(mydirectory) # change to mydirectory
setwd("c:/docs/mydir") # note / instead of \ in windows
setwd("/usr/rob/mydir") # on linux
# view and set options for the session
help(options) # learn about available options
options() # view current option settings
options(digits=3) # number of digits to print on output
# work with your previous commands
history() # display last 25 commands
history(max.show=Inf) # display all previous commands
# save your command history
savehistory(file="myfile") # default is ".Rhistory"
# recall your command history
loadhistory(file="myfile") # default is ".Rhistory"
# save the workspace to the file .RData in the cwd
save.image()
# save specific objects to a file
# if you don't specify the path, the cwd is assumed
save(objectlist, file="myfile.RData")
# load a workspace into the current session
# if you don't specify the path, the cwd is assumed
load("myfile.RData")
q() # quit R.
You will be prompted to save the workspace.
Important Note to Windows Users:
R gets confused if you use a path in your code like:
c:\mydocuments\myfile.txt
This is because R sees "\" as an escape character.
Instead, use:
c:\\mydocuments\\myfile.txt
c:/mydocuments/myfile.txt
Either will work.
I use the second convention throughout this website.
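A quick sketch showing that the doubled backslashes are stored as single characters:

```r
p1 <- "c:\\mydocuments\\myfile.txt"   # escaped backslashes
p2 <- "c:/mydocuments/myfile.txt"     # forward slashes
cat(p1, "\n")   # prints c:\mydocuments\myfile.txt
cat(p2, "\n")   # prints c:/mydocuments/myfile.txt
nchar(p1) == nchar(p2)   # TRUE: each \\ is one stored character
```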
To Practice
This free intro to R course will get you familiar with the R workspace.
By default, launching R starts an interactive session with input from the keyboard and output to the screen.
However, you can have input come from a script file (a file containing R commands) and direct output to a variety of destinations.
Input
The source( ) function runs a script in the current session.
If the filename does not include a path, the file is taken from the current working directory.
# input a script
source("myfile")
Output
The sink( ) function defines the direction of the output.
# direct output to a file
sink("myfile", append=FALSE, split=FALSE)
# return output to the terminal
sink()
The append option controls whether output overwrites or adds to a file.
The split option determines if output is also sent to the screen as well as the output file.
Here are some examples of the sink() function.
# output directed to output.txt in c:\projects directory.
# output overwrites existing file. no output to terminal.
sink("c:/projects/output.txt")
# output directed to myfile.txt in cwd. output is appended
# to existing file. output also sent to terminal.
sink("myfile.txt", append=TRUE, split=TRUE)
When redirecting output, use the cat( ) function to annotate the output.
Graphs
sink( ) will not redirect graphic output.
To redirect graphic output use one of the following functions.
Use dev.off( ) to return output to the terminal.
Function
Output to
pdf("mygraph.pdf")
pdf file
win.metafile("mygraph.wmf")
windows metafile
png("mygraph.png")
png file
jpeg("mygraph.jpg")
jpeg file
bmp("mygraph.bmp")
bmp file
postscript("mygraph.ps")
postscript file
Use a full path in the file name to save the graph outside of the current working directory.
# example - output graph to jpeg file
jpeg("c:/mygraphs/myplot.jpg")
plot(x)
dev.off()
To Practice
To start running scripts in R, try this free interactive introduction to R course.
R is a command line driven program.
The user enters commands at the prompt ( > by default ) and each command is executed one at a time.
There have been a number of attempts to create a more graphical interface, ranging from code editors that interact with R, to full-blown GUIs that present the user with menus and dialog boxes.
RStudio is my favorite example of a code editor that interfaces with R for Windows, MacOS, and Linux platforms.
Perhaps the most stable, full-blown GUI is R Commander, which can also run under Windows, Linux, and MacOS (see the documentation for technical requirements).
Both of these programs can make R a lot easier to use.
To Practice
This interactive course gives an overview of installing and working with RStudio.
You can customize the R environment through a site initialization file or a directory initialization file.
R will always source the Rprofile.site file first.
On Windows, the file is in the C:\Program Files\R\R-n.n.n\etc directory.
You can also place a .Rprofile file in any directory that you are going to run R from or in the user home directory.
At startup, R will source the Rprofile.site file.
It will then look for a .Rprofile file to source in the current working directory.
If it doesn't find it, it will look for one in the user's home directory.
There are two special functions you can place in these files.
.First( ) will be run at the start of the R session and .Last( ) will be run at the end of the session.
# Sample Rprofile.site file
# Things you might want to change
# options(papersize="a4")
# options(editor="notepad")
# options(pager="internal")
# R interactive prompt
# options(prompt="> ")
# options(continue="+ ")
# to prefer Compiled HTML help
options(chmhelp=TRUE)
# to prefer HTML help
# options(htmlhelp=TRUE)
# General options
options(tab.width = 2)
options(width = 130)
options(graphics.record=TRUE)
.First <- function(){
library(Hmisc)
library(R2HTML)
cat("\nWelcome at", date(), "\n")
}
.Last <- function(){
cat("\nGoodbye at ", date(), "\n")
}
Going Further
To explore customizing the RStudio interface, try this RStudio course which is taught by Garrett Grolemund, data scientist for RStudio.
Compared with SAS and SPSS, R's ability to output results for publication quality reports is somewhat rudimentary (although this is evolving).
The R2HTML package lets you output text, tables, and graphs in HTML format.
Here is a sample session, followed by an explanation.
# Sample Session
library(R2HTML)
HTMLStart(outdir="c:/mydir", file="myreport",
extension="html", echo=FALSE, HTMLframe=TRUE)
HTML.title("My Report", HR=1)
HTML.title("Description of my data", HR=3)
summary(mydata)
HTMLhr()
HTML.title("X Y Scatter Plot", HR=2)
plot(mydata$y~mydata$x)
HTMLplot()
HTMLStop()
Once you invoke HTMLStart( ), the prompt will change to HTML> until you end with HTMLStop().
The echo=TRUE option copies commands to the same file as the output.
HTMLframe=TRUE creates framed output, with commands in the left frame, linked to output in the right frame.
By default, a CSS file named R2HTML.css controlling page look and feel is output to the same directory.
Optionally, you can include a CSSFile= option to use your own formatting file.
Use HTML.title() to annotate the output.
The HR option refers to HTML title types (H1, H2, H3, etc.).
The default is HR=2.
HTMLhr() creates a horizontal rule.
Since several interactive commands may be necessary to create a finished graph, invoke the HTMLplot() function when each graph is ready to output.
The RNews article The R2HTML Package has more complex examples using titles, annotations, header and footer files, and cascading style sheets.
Other Options
The R Markdown Package from R Studio supports dozens of static and dynamic output formats including HTML, PDF, MS Word, scientific articles, websites, and more.
(To practice R Markdown, try this tutorial taught by Garrett Grolemund, Data Scientist for R Studio.)
Sweave allows you to embed R code in LaTeX, producing attractive reports if you know that markup language.
The odfWeave package has functions that allow you to embed R output in Open Document Format (ODF) files.
These are the types of files created by OpenOffice software.
The SWordInstaller package allows you to add R output to Microsoft Word documents.
The R2PPT package provides wrappers for adding R output to Microsoft PowerPoint presentations.
You can run R non-interactively with input from infile and send output (stdout/stderr) to another file.
Here are examples.
# on Linux
R CMD BATCH [options] my_script.R [outfile]
# on Microsoft Windows (adjust the path to R.exe as needed)
"C:\Program Files\R\R-2.13.1\bin\R.exe" CMD BATCH
--vanilla --slave "c:\my projects\my_script.R"
Be sure to look at the section on I/O for help writing R scripts.
See an Introduction to R (Appendix B) for information on the command line options.
To Practice
To start running scripts in R, try this free interactive introduction to R course.
In SAS, you can save the results of statistical analyses using the Output Delivery System (ODS).
While ODS is a vast improvement over PROC PRINTTO, its sophistication can make some features very hard to learn (just try mastering PROC TEMPLATE).
In SPSS you can do the same thing with the Output Management System (OMS).
Again, not one of the easiest topics to learn.
One of the most useful design features of R is that the output of analyses can easily be saved and used as input to additional analyses.
# Example 1
lm(mpg~wt, data=mtcars)
This will run a simple linear regression of miles per gallon on car weight using the data frame mtcars.
Results are sent to the screen.
Nothing is saved.
# Example 2
fit <- lm(mpg~wt, data=mtcars)
This time, the same regression is performed but the results are saved under the name fit.
No output is sent to the screen.
However, you now can manipulate the results.
# Example 2 (continued...)
str(fit) # view the contents/structure of "fit"
The assignment has actually created a list called "fit" that contains a wide range of information (including the predicted values, residuals, coefficients, and more).
# Example 2 (continued again)
# plot residuals by fitted values
plot(fit$fitted.values, fit$residuals)
To see what a function returns, look at the value section of the online help for that function.
Here we would look at help(lm).
The results can also be used by a wide range of other functions.
# Example 2 (one last time, I promise)
# produce diagnostic plots
plot(fit)
# predict mpg from wt in a new set of data
predict(fit, mynewdata)
# get and save influence statistics
cook <- cooks.distance(fit)
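Generic extractor functions offer a cleaner route to the same components than indexing into the list directly (a short sketch with the same mtcars fit):

```r
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                # intercept and slope
head(residuals(fit))     # same values as fit$residuals
head(fitted(fit))        # fitted values
summary(fit)$r.squared   # R-squared from the summary object
```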
To Practice
To practice reusing results in variables, try this interactive course on the introduction to R programming from DataCamp.
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.
Vectors
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector
Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows.
byrow=FALSE indicates that the matrix should be filled by columns (the default).
dimnames provides optional labels for the columns and rows.
# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
dimnames=list(rnames, cnames))
Identify rows, columns or elements using subscripts.
y[,4] # 4th column of matrix
y[3,] # 3rd row of matrix
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
Arrays
Arrays are similar to matrices but can have more than two dimensions.
See help(array) for details.
Data Frames
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
This is similar to SAS and SPSS datasets.
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names
There are a variety of ways to identify the elements of a data frame.
myframe[3:5] # columns 3,4,5 of data frame
myframe[c("ID","Age")] # columns ID and Age from data frame
myframe$X1 # variable X1 in the data frame
Lists
An ordered collection of objects (components).
A list allows you to gather a variety of (possibly unrelated) objects under one name.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of a list containing two lists
v <- c(list1,list2)
Identify elements of a list using the [[]] convention.
mylist[[2]] # 2nd component of the list
mylist[["mynumbers"]] # component named mynumbers in list
Factors
Tell R that a variable is nominal by making it a factor.
The factor stores the nominal values as a vector of integers in the range [1...k]
(where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
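The internal coding can be checked directly (a quick sketch):

```r
gender <- factor(c(rep("male", 20), rep("female", 30)))
levels(gender)          # "female" "male" (alphabetical order)
as.integer(gender)[1]   # first entry, "male", is stored as 2
table(gender)           # counts per level
```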
An ordered factor is used to represent an ordinal variable.
# variable rating coded as "large", "medium", "small"
rating <- ordered(rating)
# recodes rating to 1,2,3 and associates
# 1=large, 2=medium, 3=small internally
# R now treats rating as ordinal
R will treat factors as nominal variables and ordered factors as ordinal variables in statistical procedures and graphical analyses.
You can use options in the factor( ) and ordered( ) functions to control the mapping of integers to strings (overriding the alphabetical ordering).
You can also use factors to create value labels.
For more on factors see the UCLA page.
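For example, the levels argument overrides the default alphabetical ordering (a minimal sketch with invented category names):

```r
sizes <- c("small", "large", "medium", "small")
# force small < medium < large instead of alphabetical order
f <- factor(sizes, levels = c("small", "medium", "large"))
levels(f)   # "small" "medium" "large"
# ordered() takes the same argument and adds the ordering
o <- ordered(sizes, levels = c("small", "medium", "large"))
o[1] < o[2]   # TRUE: small < large
```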
Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
object # prints the object
ls() # list current objects
rm(object) # delete an object
newobject <- edit(object) # edit copy and save as newobject
fix(object) # edit in place
To Practice
To explore data types in R, try this free interactive introduction to R course
Usually you will obtain a data frame by importing it from SAS, SPSS, Excel, Stata, a database, or an ASCII file.
To create it interactively, you can do something like the following.
# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age,gender,weight)
You can also use R's built-in spreadsheet to enter the data interactively, as in the following example.
# enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!
The RODBC package provides access to databases (including Microsoft Access and Microsoft SQL Server) through an ODBC interface.
The primary functions are given below.
Function
Description
odbcConnect(dsn, uid="", pwd="")
Open a connection to an ODBC database
sqlFetch(channel, sqtable)
Read a table from an ODBC database into a data frame
sqlQuery(channel, query)
Submit a query to an ODBC database and return the results
sqlSave(channel, mydf, tablename = sqtable, append = FALSE)
Write or update (append=TRUE) a data frame to a table in the ODBC database
sqlDrop(channel, sqtable)
Remove a table from the ODBC database
close(channel)
Close the connection
# RODBC Example
# import 2 tables (Crime and Punishment) from a DBMS
# into R data frames (and call them crimedat and pundat)
library(RODBC)
myconn <-odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)
Other Interfaces
The RMySQL package provides an interface to MySQL.
The ROracle package provides an interface for Oracle.
The RJDBC package provides access to databases through a JDBC interface.
Going Further
This tutorial at DataCamp has another example with the RODBC package.
There are numerous methods for exporting R objects into other formats.
For SPSS, SAS and Stata, you will need the foreign package.
For Excel, you will need the xlsReadWrite package.
To SPSS
# write out text datafile and
# an SPSS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
(Alternatively, to practice importing SPSS data with the foreign package, try this exercise.)
To SAS
# write out text datafile and
# a SAS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")
To Stata
# export data frame to Stata binary format
library(foreign)
write.dta(mydata, "c:/mydata.dta")
(Alternatively, to practice importing Stata data with the foreign package, try this exercise.)
There are a number of functions for listing the contents of an object or dataset.
# list objects in the working environment
ls()
# list the variables in mydata
names(mydata)
# list the structure of mydata
str(mydata)
# list levels of factor v1 in mydata
levels(mydata$v1)
# dimensions of an object
dim(object)
# class of an object (numeric, matrix, data frame, etc)
class(object)
# print mydata
mydata
# print first 10 rows of mydata
head(mydata, n=10)
# print last 5 rows of mydata
tail(mydata, n=5)
To Practice
Try the free first chapter of this course on cleaning data.
R's ability to handle variable labels is somewhat unsatisfying.
If you use the Hmisc package, you can take advantage of some labeling features.
library(Hmisc)
label(mydata$myvar) <- "Variable label for variable myvar"
describe(mydata)
Unfortunately the label is only in effect for functions provided by the Hmisc package, such as describe().
Your other option is to use the variable label as the variable name and then refer to the variable by position index.
names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable
To Practice
Want to practice more? Try this exercise on variable recoding from DataCamp.
To understand value labels in R, you need to understand the data structure factor.
You can use the factor function to create your own value labels.
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1,
levels = c(1,2,3),
labels = c("red", "blue", "green"))
# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
mydata$y <- ordered(mydata$y,
levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))
Use the factor() function for nominal data and the ordered() function for ordinal data.
R statistical and graphic functions will then treat the data appropriately.
Note: factor and ordered are used the same way, with the same arguments.
The former creates factors and the latter creates ordered factors.
To Practice
Factors are covered in the fourth chapter of this free interactive introduction to R course.
In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number).
Unlike SAS, R uses the same symbol for character and numeric data.
For more practice on working with missing data, try this course on cleaning data in R.
Testing for Missing Values
is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
Recoding Values to Missing
# recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata$v1[mydata$v1==99] <- NA
Excluding Missing Values from Analyses
Arithmetic functions on missing values yield missing values.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
The function complete.cases() returns a logical vector indicating which cases are complete.
# list rows of data that have missing values
mydata[!complete.cases(mydata),]
The function na.omit() returns the object with listwise deletion of missing values.
# create new dataset without missing data
newdata <- na.omit(mydata)
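The pieces above can be combined on a small invented data frame:

```r
# toy data frame with two incomplete rows
mydf <- data.frame(x = c(1, 2, NA, 4),
                   y = c("a", NA, "c", "d"))
is.na(mydf$x)                  # FALSE FALSE  TRUE FALSE
mean(mydf$x, na.rm = TRUE)     # 2.333333
mydf[!complete.cases(mydf), ]  # lists the incomplete rows
newdf <- na.omit(mydf)         # keeps only the complete rows 1 and 4
nrow(newdf)                    # 2
```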
Advanced Handling of Missing Data
Most modeling functions in R offer options for dealing with missing values.
You can go beyond pairwise or listwise deletion of missing values through methods such as multiple imputation.
Good implementations that can be accessed through R include Amelia II, Mice, and mitools.
Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
Sys.Date() returns today's date.
date() returns the current date and time.
The following symbols can be used with the format( ) function to print dates.
Symbol   Meaning                   Example
%d       day as a number (0-31)    01-31
%a       abbreviated weekday       Mon
%A       unabbreviated weekday     Monday
%m       month (00-12)             00-12
%b       abbreviated month         Jan
%B       unabbreviated month       January
%y       2-digit year              07
%Y       4-digit year              2007
Here is an example.
# print today's date
today <- Sys.Date()
format(today, format="%B %d %Y")
"June 20 2007"
Date Conversion
Character to Date
You can use the as.Date() function to convert character data to dates.
The format is as.Date(x,"format"), where x is the character data and format gives the appropriate format.
# convert date info in format 'mm/dd/yyyy'
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
The default format is yyyy-mm-dd:
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
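A short round trip through the conversions above, assuming nothing beyond base R:

```r
# character -> Date using an explicit format
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
format(dates[1], "%Y-%m-%d")        # "1965-01-05"
dates[2] - dates[1]                 # a difftime measured in days
# dates are stored as days since 1970-01-01
as.numeric(as.Date("1970-01-02"))   # 1
```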
Date to Character
You can convert dates to character data using the as.character() function.
# convert dates to character data
strDates <- as.character(dates)
Learning More
See help(as.Date) and help(strftime) for details on converting character data to dates.
See help(ISOdatetime) for more information about formatting date/times.
Use the assignment operator <- to create new variables.
A wide array of operators and functions are available for this.
# Three examples for doing the same computations
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2
attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)
mydata <- transform( mydata,
sum = x1 + x2,
mean = (x1 + x2)/2
)
Recoding variables
In order to recode data, you will probably use one or more of R's control structures.
# create 2 age categories
mydata$agecat <- ifelse(mydata$age > 70, "older", "younger")
# another example: create 3 age categories
attach(mydata)
mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"
detach(mydata)
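cut() offers a vectorized alternative to the chained comparisons above; the breakpoints below mirror that example (the age vector is invented for illustration):

```r
age <- c(22, 50, 80, 45, 76)
# default right=TRUE gives intervals (-Inf,45], (45,75], (75,Inf)
agecat <- cut(age,
              breaks = c(-Inf, 45, 75, Inf),
              labels = c("Young", "Middle Aged", "Elder"))
as.character(agecat)   # "Young" "Middle Aged" "Elder" "Young" "Elder"
```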
Renaming variables
You can rename variables programmatically or interactively.
# rename interactively
fix(mydata) # results are saved on close
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
# you can re-enter all the variable names in order,
# changing the ones you need to change. The limitation
# is that you need to enter all of them!
names(mydata) <- c("x1","age","y", "ses")
R's binary and logical operators will look very familiar to programmers.
Note that binary operators work on vectors and matrices as well as scalars.
Arithmetic Operators
Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
^ or **     exponentiation
x %% y      modulus (x mod y): 5 %% 2 is 1
x %/% y     integer division: 5 %/% 2 is 2
Logical Operators
Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10
# How it works
x <- c(1:10)
x
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
x < 5
T T T T F F F F F F
x > 8 | x < 5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
Almost everything in R is done through functions.
Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables.
Numeric Functions
Function                  Description
abs(x)                    absolute value
sqrt(x)                   square root
ceiling(x)                ceiling(3.475) is 4
floor(x)                  floor(3.475) is 3
trunc(x)                  trunc(5.99) is 5
round(x, digits=n)        round(3.475, digits=2) is 3.48
signif(x, digits=n)       signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)    also acos(x), cosh(x), acosh(x), etc.
log(x)                    natural logarithm
log10(x)                  common logarithm
exp(x)                    e^x
Character Functions
substr(x, start=n1, stop=n2)
  Extract or replace substrings in a character vector.
  x <- "abcdef"
  substr(x, 2, 4) is "bcd"
  substr(x, 2, 4) <- "22222" makes x "a222ef"
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
  Search for pattern in x. If fixed=FALSE then pattern is a regular
  expression; if fixed=TRUE then pattern is a text string. Returns
  matching indices.
  grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
  Find pattern in x and replace with replacement text. If fixed=FALSE
  then pattern is a regular expression; if fixed=TRUE then pattern is
  a text string.
  sub("\\s", ".", "Hello There") returns "Hello.There"
strsplit(x, split)
  Split the elements of character vector x at split.
  strsplit("abc", "") returns a 3-element vector "a","b","c"
paste(..., sep="")
  Concatenate strings after using the sep string to separate them.
  paste("x", 1:3, sep="") returns c("x1","x2","x3")
  paste("x", 1:3, sep="M") returns c("xM1","xM2","xM3")
  paste("Today is", date())
toupper(x)
  Uppercase
tolower(x)
  Lowercase
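A few of these character functions in action on invented strings:

```r
x <- "abcdef"
substr(x, 2, 4)                     # "bcd"
grep("b", c("abc", "def", "bcd"))   # 1 3
sub("\\s", ".", "Hello There")      # "Hello.There"
strsplit("a,b,c", ",")[[1]]         # "a" "b" "c"
paste("x", 1:3, sep = "")           # "x1" "x2" "x3"
toupper(x)                          # "ABCDEF"
```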
Statistical Probability Functions
The following table describes functions related to probability distributions.
For random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.
dnorm(x)
  normal density function (by default m=0, sd=1)
  # plot standard normal curve
  x <- pretty(c(-3,3), 30)
  y <- dnorm(x)
  plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)
  cumulative normal probability for q
  (area under the normal curve to the left of q)
  pnorm(1.96) is 0.975
qnorm(p)
  normal quantile: value at the p percentile of the normal distribution
  qnorm(.9) is 1.28 # 90th percentile
rnorm(n, m=0, sd=1)
  n random normal deviates with mean m and standard deviation sd
  # 50 random normal variates with mean=50, sd=10
  x <- rnorm(50, m=50, sd=10)
dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
  binomial distribution, where size is the sample size
  and prob is the probability of a heads (pi)
  # prob of 0 to 5 heads of fair coin out of 10 flips
  dbinom(0:5, 10, .5)
  # prob of 5 or less heads of fair coin out of 10 flips
  pbinom(5, 10, .5)
dpois(x, lambda)
ppois(q, lambda)
qpois(p, lambda)
rpois(n, lambda)
  Poisson distribution with mean and variance equal to lambda
  # probability of 0, 1, or 2 events with lambda=4
  dpois(0:2, 4)
  # probability of at least 3 events with lambda=4
  1 - ppois(2, 4)
dunif(x, min=0, max=1)
punif(q, min=0, max=1)
qunif(p, min=0, max=1)
runif(n, min=0, max=1)
  uniform distribution, following the same pattern as the
  normal distribution above
  # 10 uniform random variates
  x <- runif(10)
Other Statistical Functions
Other useful statistical functions are provided in the following table.
Each has the option na.rm to strip missing values before calculations.
Otherwise the presence of missing values will lead to a missing result.
Object can be a numeric vector or data frame.
mean(x, trim=0, na.rm=FALSE)
  mean of object x
  # trimmed mean, removing any missing values and
  # 5 percent of highest and lowest scores
  mx <- mean(x, trim=.05, na.rm=TRUE)
sd(x)
  standard deviation of object x. Also see var(x) for variance
  and mad(x) for median absolute deviation.
median(x)
  median
quantile(x, probs)
  quantiles, where x is the numeric vector whose quantiles are desired
  and probs is a numeric vector with probabilities in [0,1]
  # 30th and 84th percentiles of x
  y <- quantile(x, c(.3,.84))
range(x)
  range
sum(x)
  sum
diff(x, lag=1)
  lagged differences, with lag indicating which lag to use
min(x)
  minimum
max(x)
  maximum
scale(x, center=TRUE, scale=TRUE)
  column center or standardize a matrix
Other Useful Functions
seq(from, to, by)
  generate a sequence
  indices <- seq(1,10,2)  # indices is c(1, 3, 5, 7, 9)
rep(x, ntimes)
  repeat x n times
  y <- rep(1:3, 2)  # y is c(1, 2, 3, 1, 2, 3)
cut(x, n)
  divide a continuous variable into a factor with n levels
  y <- cut(x, 5)
Note that while the examples on this page apply functions to individual variables, many can be applied to vectors and matrices as well.
R has the standard control structures you would expect.
expr can be multiple (compound) statements by enclosing them in braces { }.
It is more efficient to use built-in functions rather than control structures whenever possible.
if-else
if (cond) expr
if (cond) expr1 else expr2
for
for (var in seq) expr
while
while (cond) expr
switch
switch(expr, ...)
ifelse
ifelse(test,yes,no)
Example
# transpose of a matrix
# a poor alternative to built-in t() function
mytrans <- function(x) {
if (!is.matrix(x)) {
warning("argument is not a matrix: returning NA")
return(NA_real_)
}
y <- matrix(1, nrow=ncol(x), ncol=nrow(x))
for (i in 1:nrow(x)) {
for (j in 1:ncol(x)) {
y[j,i] <- x[i,j]
}
}
return(y)
}
# try it
z <- matrix(1:10, nrow=5, ncol=2)
tz <- mytrans(z)
One of the great strengths of R is the user's ability to add functions.
In fact, many of the functions in R are actually functions of functions.
The structure of a function is given below.
myfunction <- function(arg1, arg2, ...) {
statements
return(object)
}
Objects in the function are local to the function.
The object returned can be any data type.
Here is an example.
# function example - get measures of central tendency
# and spread for a numeric vector x. The user has a
# choice of measures and whether the results are printed.
mysummary <- function(x,npar=TRUE,print=TRUE) {
if (!npar) {
center <- mean(x); spread <- sd(x)
} else {
center <- median(x); spread <- mad(x)
}
if (print & !npar) {
cat("Mean=", center, "\n", "SD=", spread, "\n")
}
else if (print & npar) {
cat("Median=", center, "\n", "MAD=", spread, "\n")
}
result <- list(center=center,spread=spread)
return(result)
}
# invoking the function
set.seed(1234)
x <- rpois(500, 4)
y <- mysummary(x)
Median= 4
MAD= 1.4826
# y$center is the median (4)
# y$spread is the median absolute deviation (1.4826)
y <- mysummary(x, npar=FALSE, print=FALSE)
# no output
# y$center is the mean (4.052)
# y$spread is the standard deviation (2.01927)
It can be instructive to look at the code of a function.
In R, you can view a function's code by typing the function name without the ( ).
If this method fails, look at the following R Wiki link for hints on viewing function source code.
Finally, you may want to store your own functions, and have them available in every session.
You can customize the R environment to load your functions at start-up.
To sort a data frame in R, use the order( ) function.
By default, sorting is ASCENDING.
Prefix the sorting variable with a minus sign to indicate DESCENDING order.
Here are some examples.
# sorting examples using the mtcars dataset
attach(mtcars)
# sort by mpg
newdata <- mtcars[order(mpg),]
# sort by mpg and cyl
newdata <- mtcars[order(mpg, cyl),]
#sort by mpg (ascending) and cyl (descending)
newdata <- mtcars[order(mpg, -cyl),]
detach(mtcars)
Adding Columns
To merge two data frames (datasets) horizontally, use the merge function.
In most cases, you join two data frames by one or more common key variables (i.e., an inner join).
# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by="ID")
# merge two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
Adding Rows
To join two data frames (datasets) vertically, use the rbind function.
The two data frames must have the same variables, but they do not have to be in the same order.
total <- rbind(dataframeA, dataframeB)
If dataframeA has variables that dataframeB does not, then either:
delete the extra variables in dataframeA, or
create the additional variables in dataframeB and set them to NA (missing) before joining them with rbind().
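A sketch of the second option with two invented data frames:

```r
dfA <- data.frame(id = 1:2, x = c(10, 20), extra = c("p", "q"))
dfB <- data.frame(id = 3:4, x = c(30, 40))
dfB$extra <- NA          # create the missing variable, filled with NA
total <- rbind(dfA, dfB) # now both frames have the same variables
nrow(total)              # 4
total$extra              # "p" "q" NA NA
```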
It is relatively easy to collapse data in R using one or more BY variables and a defined function.
# aggregate data frame mtcars by cyl and vs, returning means
# for numeric variables
attach(mtcars)
aggdata <- aggregate(mtcars, by=list(cyl, vs), FUN=mean, na.rm=TRUE)
print(aggdata)
detach(mtcars)
When using the aggregate() function, the by variables must be in a list (even if there is only one).
The function can be built-in or user provided.
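Naming the list elements gives the grouping columns readable names in the output; this variation on the example above is a sketch using the built-in mtcars data:

```r
# named list elements become the group column names
aggdata <- aggregate(mtcars[c("mpg", "wt")],
                     by = list(Cylinders = mtcars$cyl, VS = mtcars$vs),
                     FUN = mean, na.rm = TRUE)
names(aggdata)   # "Cylinders" "VS" "mpg" "wt"
```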
R provides a variety of methods for reshaping data prior to analysis.
Transpose
Use the t() function to transpose a matrix or a data frame.
In the latter case, rownames become variable (column) names.
# example using built-in dataset
mtcars
t(mtcars)
The Reshape Package
Hadley Wickham has created a comprehensive package called reshape to massage data.
Both an introduction and article are available.
There is even a video!
Basically, you "melt" data so that each row is a unique id-variable combination.
Then you "cast" the melted data into any shape you would like.
Here is a very simple example.
mydata

id  time  x1  x2
1   1     5   6
1   2     3   5
2   1     6   1
2   2     2   4
# example of melt function
library(reshape)
mdata <- melt(mydata, id=c("id","time"))
mdata

id  time  variable  value
1   1     x1        5
1   2     x1        3
2   1     x1        6
2   2     x1        2
1   1     x2        6
1   2     x2        5
2   1     x2        1
2   2     x2        4
# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(mdata, id~variable, mean)
timemeans <- cast(mdata, time~variable, mean)
subjmeans

id  x1  x2
1   4   5.5
2   4   2.5

timemeans

time  x1   x2
1     5.5  3.5
2     2.5  4.5
There is much more that you can do with the melt( ) and cast( ) functions.
See the documentation for more details.
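The whole melt-then-cast round trip above can be run as one script (this assumes the reshape package is installed):

```r
library(reshape)
# the same toy data frame shown above
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))
# melt: one row per id-variable combination
mdata <- melt(mydata, id = c("id", "time"))
# cast back into the two summary shapes
subjmeans <- cast(mdata, id ~ variable, mean)
timemeans <- cast(mdata, time ~ variable, mean)
subjmeans$x2    # 5.5 2.5, the per-subject means shown above
timemeans$x1    # 5.5 2.5, the per-time means shown above
```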
R has powerful indexing features for accessing object elements.
These features can be used to select and exclude variables and observations.
The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.
Selecting (Keeping) Variables
# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
# another method
myvars <- paste("v", 1:3, sep="")
newdata <- mydata[myvars]
# select 1st and 5th thru 10th variables
newdata <- mydata[c(1,5:10)]
Selecting Observations
# first 5 observations
newdata <- mydata[1:5,]
# based on variable values
newdata <- mydata[ which(mydata$gender=='F'
& mydata$age > 65), ]
# or
attach(mydata)
newdata <- mydata[ which(gender=='F' & age > 65),]
detach(mydata)
Selection using the Subset Function
The subset( ) function is the easiest way to select variables and observations.
In the following example, we select all rows that have a value of age greater than or equal to 20 or age less than 10.
We keep the ID and Weight columns.
# using subset function
newdata <- subset(mydata, age >= 20 | age < 10,
select=c(ID, Weight))
In the next example, we select all men over the age of 25 and we keep variables weight through income (weight, income and all columns between them).
# using subset function (part 2)
newdata <- subset(mydata, sex=="m" & age > 25,
select=weight:income)
Random Samples
Use the sample( ) function to take a random sample of size n from a dataset.
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50,
replace=FALSE),]
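Adding set.seed() makes the draw reproducible; in this sketch the built-in mtcars data stands in for mydata:

```r
set.seed(1234)   # reproducible pseudo-random draw
mysample <- mtcars[sample(1:nrow(mtcars), 10, replace = FALSE), ]
nrow(mysample)                      # 10
anyDuplicated(rownames(mysample))   # 0: no row drawn twice
```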
Type conversions in R work as you would expect.
For example, adding a character string to a numeric vector converts all the elements in the vector to character.
Use is.foo to test for data type foo; it returns TRUE or FALSE.
Use as.foo to explicitly convert.
is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
Examples

                  to one long vector    to matrix                   to data frame
from vector       c(x,y)                cbind(x,y) or rbind(x,y)    data.frame(x,y)
from matrix       as.vector(mymatrix)                               as.data.frame(mymatrix)
from data frame                         as.matrix(myframe)
Dates
You can convert dates to and from character or numeric data.
See date values for more information.
R provides a wide range of functions for obtaining summary statistics.
One method of obtaining descriptive statistics is to use the sapply( ) function with a specified summary statistic.
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.
There are also numerous R functions designed to provide a range of descriptive statistics at once.
For example
# mean,median,25th and 75th quartiles,min,max
summary(mydata)
# Tukey min,lower-hinge, median,upper-hinge,max
fivenum(x)
Using the Hmisc package
library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores
Using the pastecs package
library(pastecs)
stat.desc(mydata)
# nbr.val, nbr.null, nbr.na, min, max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, coef.var
Using the psych package
library(psych)
describe(mydata)
# item name, item number, nvalid, mean, sd,
# median, mad, min, max, skew, kurtosis, se
Summary Statistics by Group
A simple way of generating summary statistics by grouping variable is available in the psych package.
library(psych)
describe.by(mydata, group,...)
The doBy package provides much of the functionality of SAS PROC SUMMARY.
It defines the desired table using a model formula and a function.
Here is a simple example.
library(doBy)
summaryBy(mpg + wt ~ cyl + vs, data = mtcars,
FUN = function(x) {
c(m = mean(x), s = sd(x))
} )
# produces mpg.m wt.m mpg.s wt.s for each
# combination of the levels of cyl and vs
See also: aggregating data.
This section describes the creation of frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results.
Generating Frequency Tables
R provides many methods for creating frequency and contingency tables.
Three are described below.
In the following examples, assume that A, B, and C represent categorical variables.
table
You can generate frequency tables using the table( ) function, tables of proportions using the prop.table( ) function, and marginal frequencies using margin.table( ).
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
table() can also generate multidimensional tables based on 3 or more categorical variables.
In this case, use the ftable( ) function to print the results more attractively.
# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)
table() ignores missing values.
To include NA as a category in counts, include the table option exclude=NULL if the variable is a vector.
If the variable is a factor you have to create a new factor using newfactor <- factor(oldfactor, exclude=NULL).
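A small sketch of both NA-handling cases, on an invented vector:

```r
x <- c("a", "b", NA, "a")
table(x)                        # NA dropped: a=2, b=1
table(x, exclude = NULL)        # <NA> counted as its own category
f  <- factor(x)                 # NA is not a level
f2 <- factor(x, exclude = NULL) # NA kept as a level
length(levels(f))               # 2
length(levels(f2))              # 3
```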
xtabs
The xtabs( ) function allows you to create crosstabulations using formula style input.
# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of independence
If a variable is included on the left side of the formula, it is assumed to be a vector of frequencies (useful if the data have already been tabulated).
CrossTable
The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ in SAS or CROSSTABS in SPSS.
It has a wealth of options.
# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
There are options to report percentages (row, column, cell), specify decimal places, produce Chi-square, Fisher, and McNemar tests of independence, report expected and residual values (pearson, standardized, adjusted standardized), include missing values as valid, annotate with row and column titles, and format as SAS or SPSS style output!
See help(CrossTable) for details.
Tests of Independence
Chi-Square Test
For 2-way tables you can use chisq.test(mytable) to test independence of the row and column variable.
By default, the p-value is calculated from the asymptotic chi-squared distribution of the test statistic.
Optionally, the p-value can be derived via Monte Carlo simulation.
Fisher Exact Test
fisher.test(x) provides an exact test of independence.
x is a two dimensional contingency table in matrix form.
Mantel-Haenszel test
Use the mantelhaen.test(x) function to perform a Cochran-Mantel-Haenszel chi-squared test of the null hypothesis that two nominal variables are conditionally independent in each stratum, assuming that there is no three-way interaction. x is a 3 dimensional contingency table, where the last dimension refers to the strata.
Loglinear Models
You can use the loglm( ) function in the MASS package to produce log-linear models.
For example, let's assume we have a 3-way contingency table based on variables A, B, and C.
library(MASS)
mytable <- xtabs(~A+B+C, data=mydata)
We can perform the following tests:
Mutual Independence: A, B, and C are pairwise independent.
loglm(~A+B+C, mytable)
Partial Independence: A is partially independent of B and C (i.e., A is independent of the composite variable BC).
loglm(~A+B+C+B*C, mytable)
Conditional Independence: A is independent of B, given C.
loglm(~A+B+C+A*C+B*C, mytable)
No Three-Way Interaction:
loglm(~A+B+C+A*B+A*C+B*C, mytable)
Martin Theus and Stephan Lauer have written an excellent article on Visualizing Loglinear Models, using mosaic plots.
Measures of Association
The assocstats(mytable) function in the vcd package calculates the phi coefficient, contingency coefficient, and Cramer's V for an rxc table.
The Kappa(mytable) function in the vcd package calculates Cohen's kappa and weighted kappa for a confusion matrix.
See Richard Darlington's article on Measures of Association in Crosstab Tables for an excellent review of these statistics.
Visualizing results
Use bar and pie charts for visualizing frequencies in one dimension.
Use the vcd package for visualizing relationships among categorical data (e.g.
mosaic and association plots).
Use the ca package for correspondence analysis (visually exploring relationships between rows and columns in contingency tables).
Converting Frequency Tables to an "Original" Flat file
Finally, there may be times that you will need the original "flat file" data frame rather than the frequency table.
Marc Schwartz has provided code on the R-help mailing list for converting a table back into a data frame.
You can use the cor() function to produce correlations and the cov() function to produce covariances.
A simplified format is cor(x, use=, method=), where:

Option    Description
x         matrix or data frame
use       specifies the handling of missing data. Options are all.obs
          (assumes no missing data; missing data will produce an error),
          complete.obs (listwise deletion), and pairwise.complete.obs
          (pairwise deletion).
method    specifies the type of correlation. Options are pearson,
          spearman, or kendall.
# Correlations/covariances among numeric variables in
# data frame mtcars. Use listwise deletion of missing data.
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
Unfortunately, neither cor( ) or cov( ) produce tests of significance, although you can use the cor.test( ) function to test a single correlation coefficient.
The rcorr( ) function in the Hmisc package produces correlations/covariances and significance levels for pearson and spearman correlations.
However, input must be a matrix and pairwise deletion is used.
# Correlations with significance levels
library(Hmisc)
rcorr(x, type="pearson")
# type can be pearson or spearman
#mtcars is a data frame
rcorr(as.matrix(mtcars))
You can use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the columns of Y.
This is similar to the VAR and WITH commands in SAS PROC CORR.
# Correlation matrix from mtcars
# with mpg, cyl, and disp as rows
# and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
Other Types of Correlations
# polychoric correlation
# x is a contingency table of counts
library(polycor)
polychor(x)
# heterogeneous correlations in one matrix
# pearson (numeric-numeric),
# polyserial (numeric-ordinal),
# and polychoric (ordinal-ordinal)
# x is a data frame with ordered factors
# and numeric variables
library(polycor)
hetcor(x)
# partial correlations
library(ggm)
data(mydata)
pcor(c("a", "b", "x", "y", "z"), var(mydata))
# partial corr between a and b controlling for x, y, z
The t.test( ) function produces a variety of t-tests.
Unlike most statistical packages, the default assumes unequal variance and applies the Welch df modification.
# independent 2-group t-test
t.test(y~x) # where y is numeric and x is a binary factor
# independent 2-group t-test
t.test(y1,y2) # where y1 and y2 are numeric
# paired t-test
t.test(y1,y2,paired=TRUE) # where y1 & y2 are numeric
# one sample t-test
t.test(y,mu=3) # Ho: mu=3
You can use the var.equal = TRUE option to specify equal variances and a pooled variance estimate.
You can use the alternative="less" or alternative="greater" option to specify a one tailed test.
Nonparametric and resampling alternatives to t-tests are available.
R provides functions for carrying out Mann-Whitney U, Wilcoxon Signed Rank, Kruskal Wallis, and Friedman tests.
# independent 2-group Mann-Whitney U Test
wilcox.test(y~A)
# where y is numeric and A is a binary factor
# independent 2-group Mann-Whitney U Test
wilcox.test(y,x) # where y and x are numeric
# dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
# Kruskal Wallis Test - One Way ANOVA by Ranks
kruskal.test(y~A) # where y is numeric and A is a factor
# Randomized Block Design - Friedman Test
friedman.test(y~A|B)
# where y are the data values, A is a grouping factor
# and B is a blocking factor
For the wilcox.test you can use the alternative="less" or alternative="greater" option to specify a one tailed test.
Parametric and resampling alternatives are available.
The pgirmess package provides nonparametric multiple comparisons, as does the npmc package used below.
(Note: npmc has been withdrawn from CRAN but is still available in the archives.)
library(npmc)
npmc(x)
# where x is a data frame containing variable 'var'
# (response variable) and 'class' (grouping variable)
R provides comprehensive support for multiple linear regression.
The topics below are provided in order of increasing complexity.
Fitting the Model
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
# Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
fitted(fit) # predicted values
residuals(fit) # residuals
anova(fit) # anova table
vcov(fit) # covariance matrix for model parameters
influence(fit) # regression diagnostics
Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
For a more comprehensive evaluation of model fit, see regression diagnostics.
Comparing Models
You can compare nested models with the anova( ) function.
The following code provides a simultaneous test that x3 and x4 add to linear prediction above and beyond x1 and x2.
# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
Cross Validation
You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package.
# K-fold cross-validation
library(DAAG)
cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation
Sum the MSE for each fold, divide by the number of observations, and take the square root to get the cross-validated standard error of estimate.
You can assess R2 shrinkage via K-fold cross-validation.
Using the crossval() function from the bootstrap package, do the following:
# Assessing R2 shrinkage using 10-Fold Cross-Validation
fit <- lm(y~x1+x2+x3,data=mydata)
library(bootstrap)
# define functions
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
# matrix of predictors
X <- as.matrix(mydata[c("x1","x2","x3")])
# vector of observed values
y <- as.matrix(mydata[c("y")])
results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)
cor(y, fit$fitted.values)**2 # raw R2
cor(y,results$cv.fit)**2 # cross-validated R2
Variable Selection
Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial topic.
You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package.
stepAIC( ) performs stepwise model selection by exact AIC.
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Alternatively, you can perform all-subsets regression using the leaps( ) function from the leaps package.
In the following code nbest indicates the number of subsets of each size to report.
Here, the ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.).
# All Subsets Regression
library(leaps)
attach(mydata)
leaps<-regsubsets(y~x1+x2+x3+x4,data=mydata,nbest=10)
# view results
summary(leaps)
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps,scale="r2")
# plot statistic by subset size
library(car)
subsets(leaps, statistic="rsq")
Other options for plot( ) are bic, Cp, and adjr2.
Other options for plotting with subsets( ) are bic, cp, adjr2, and rss.
Relative Importance
The relaimpo package provides measures of relative importance for each of the predictors in the model.
See help(calc.relimp) for details on the four measures of relative importance provided.
# Calculate Relative Importance for Each Predictor
library(relaimpo)
calc.relimp(fit,type=c("lmg","last","first","pratt"),
rela=TRUE)
# Bootstrap Measures of Relative Importance (1000 samples)
boot <- boot.relimp(fit, b = 1000, type = c("lmg",
"last", "first", "pratt"), rank = TRUE,
diff = TRUE, rela = TRUE)
booteval.relimp(boot) # print result
plot(booteval.relimp(boot,sort=TRUE)) # plot result
Graphic Enhancements
The car package offers a wide variety of plots for regression, including added-variable plots and enhanced diagnostic and scatter plots.
There are many functions in R to aid with robust regression.
For example, you can perform robust regression with the rlm( ) function in the MASS package.
John Fox's (who else?) Robust Regression provides a good starting overview.
The UCLA Statistical Computing website has Robust Regression Examples.
The robust package provides a comprehensive library of robust methods, including regression.
The robustbase package also provides basic robust statistics including model selection methods.
And David Olive has provided a detailed online review of Applied Robust Statistics with sample R code.
To Practice
This course in machine learning in R includes exercises in multiple regression and cross-validation.
An excellent review of regression diagnostics is provided in John Fox's aptly named Overview of Regression Diagnostics.
Dr. Fox's car package provides advanced utilities for regression modeling.
# Assume that we are fitting a multiple linear regression
# on the MTCARS data
library(car)
fit <- lm(mpg~disp+hp+wt+drat, data=mtcars)
This example is for exposition only.
We will ignore the fact that this may not be a great way of modeling this particular set of data!
Outliers
# Assessing Outliers
outlierTest(fit) # Bonferroni p-value for most extreme obs
qqPlot(fit, main="QQ Plot") #qq plot for studentized resid
leveragePlots(fit) # leverage plots
Influential Observations
# Influential Observations
# added variable plots
avPlots(fit)
# Cook's D plot
# identify D values > 4/(n-k-1)
cutoff <- 4/(nrow(mtcars)-length(fit$coefficients))
plot(fit, which=4, cook.levels=cutoff)
# Influence Plot
influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
Non-normality
# Normality of Residuals
# qq plot for studentized resid
qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
library(MASS)
sresid <- studres(fit)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
Non-constant Error Variance
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(fit)
# plot studentized residuals vs. fitted values
spreadLevelPlot(fit)
# Test for Autocorrelated Errors
durbinWatsonTest(fit)
Additional Diagnostic Help
The gvlma( ) function in the gvlma package performs a global validation of linear model assumptions as well as separate evaluations of skewness, kurtosis, and heteroscedasticity.
# Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)
Going Further
If you would like to delve deeper into regression diagnostics, two books written by John Fox can help: Applied regression analysis and generalized linear models (2nd ed) and An R and S-Plus companion to applied regression.
If you have been analyzing ANOVA designs in traditional statistical packages, you are likely to find R's approach less coherent and user-friendly.
A good online presentation on ANOVA in R can be found in the ANOVA section of the Personality Project.
(Note: I have found that these pages render fine in Chrome and Safari browsers, but can appear distorted in Internet Explorer.)
1. Fit a Model
In the following examples lower case letters are numeric variables and upper case letters are factors.
# One Way Anova (Completely Randomized Design)
fit <- aov(y ~ A, data=mydataframe)
# Randomized Block Design (B is the blocking factor)
fit <- aov(y ~ A + B, data=mydataframe)
# Two Way Factorial Design
fit <- aov(y ~ A + B + A:B, data=mydataframe)
fit <- aov(y ~ A*B, data=mydataframe)
# same thing
# Analysis of Covariance
fit <- aov(y ~ A + x, data=mydataframe)
For within subjects designs, the data frame has to be rearranged so that each measurement on a subject is a separate observation.
See R and Analysis of Variance.
# One Within Factor
fit <- aov(y~A+Error(Subject/A),data=mydataframe)
# Two Within Factors W1 W2, Two Between Factors B1 B2
fit <- aov(y~(W1*W2*B1*B2)+Error(Subject/(W1*W2))+(B1*B2),
data=mydataframe)
2. Look at Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
layout(matrix(c(1,2,3,4),2,2)) # optional layout
plot(fit) # diagnostic plots
For details on the evaluation of test requirements, see (M)ANOVA Assumptions.
3. Evaluate Model Effects
WARNING: R provides Type I sequential SS, not the default Type III marginal SS reported by SAS and SPSS.
In a nonorthogonal design with more than one term on the right hand side of the equation, order will matter (i.e., A+B and B+A will produce different results)! We will need to use the drop1( ) function to produce the familiar Type III results.
It will compare each term with the full model.
Alternatively, we can use anova(fit.model1, fit.model2) to compare nested models directly.
summary(fit) # display Type I ANOVA table
drop1(fit,~.,test="F") # type III SS and F Tests
Nonparametric and resampling alternatives are available.
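The nested-model comparison mentioned above can be sketched with the built-in warpbreaks data (used here purely for illustration):

```r
# F test comparing a reduced model to a fuller nested model
fit1 <- aov(breaks ~ wool, data=warpbreaks)
fit2 <- aov(breaks ~ wool + tension, data=warpbreaks)
anova(fit1, fit2)  # tests the contribution of tension
```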
Multiple Comparisons
You can get Tukey HSD tests using the function below.
By default, it calculates post hoc comparisons on each factor in the model.
You can specify specific factors as an option.
Again, remember that results are based on Type I SS!
# Tukey Honestly Significant Differences
TukeyHSD(fit) # where fit comes from aov()
Visualizing Results
Use box plots and line plots to visualize group differences.
There are also two functions specifically designed for visualizing mean differences in ANOVA layouts.
interaction.plot( ) in the base stats package produces plots for two-way interactions.
plotmeans( ) in the gplots package produces mean plots for single factors, and includes confidence intervals.
# Two-way Interaction Plot
attach(mtcars)
gear <- factor(gear)
cyl <- factor(cyl)
interaction.plot(cyl, gear, mpg, type="b", col=c(1:3),
leg.bty="o", leg.bg="beige", lwd=2, pch=c(18,24,22),
xlab="Number of Cylinders",
ylab="Mean Miles Per Gallon",
main="Interaction Plot")
# Plot Means with Error Bars
library(gplots)
attach(mtcars)
cyl <- factor(cyl)
plotmeans(mpg~cyl,xlab="Number of Cylinders",
ylab="Miles Per Gallon", main="Mean Plot\nwith 95% CI")
MANOVA
If there is more than one dependent (outcome) variable, you can test them simultaneously using a multivariate analysis of variance (MANOVA).
In the following example, let Y be a matrix whose columns are the dependent variables.
# 2x2 Factorial MANOVA with 3 Dependent Variables.
Y <- cbind(y1,y2,y3)
fit <- manova(Y ~ A*B)
summary(fit, test="Pillai")
Other test options are "Wilks", "Hotelling-Lawley", and "Roy".
Use summary.aov( ) to get univariate statistics.
TukeyHSD( ) and plot( ) will not work with a MANOVA fit.
Run each dependent variable separately to obtain them.
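For example, with the built-in iris data (chosen only for illustration):

```r
# univariate follow-ups after a MANOVA
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data=iris)
summary.aov(fit)  # one ANOVA table per dependent variable
# TukeyHSD() must be run on a separate aov() fit per outcome
TukeyHSD(aov(Sepal.Length ~ Species, data=iris))
```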
Like ANOVA, MANOVA results in R are based on Type I SS.
To obtain Type III SS, vary the order of variables in the model and rerun the analyses.
For example, fit y~A*B for the Type III B effect and y~B*A for the Type III A effect.
Going Further
R has excellent facilities for fitting linear and generalized linear mixed-effects models.
The latest implementation is in package lme4.
See the R News Article on Fitting Mixed Linear Models in R for details.
In classical parametric procedures we often assume normality and constant variance for the model error term.
Methods of exploring these assumptions in an ANOVA/ANCOVA/MANOVA framework are discussed here.
Regression diagnostics are covered under multiple linear regression.
Outliers
Since outliers can severely affect normality and homogeneity of variance, methods for detecting disparate observations are described first.
The aq.plot() function in the mvoutlier package allows you to identify multivariate outliers by plotting the ordered squared robust Mahalanobis distances of the observations against the empirical distribution function of the squared Mahalanobis distances.
Input consists of a matrix or data frame.
The function produces 4 graphs and returns a boolean vector identifying the outliers.
# Detect Outliers in the MTCARS Data
library(mvoutlier)
outliers <- aq.plot(mtcars[c("mpg","disp","hp","drat","wt","qsec")])
outliers # show list of outliers
Univariate Normality
You can evaluate the normality of a variable using a Q-Q plot.
# Q-Q Plot for variable MPG
attach(mtcars)
qqnorm(mpg)
qqline(mpg)
Significant departures from the line suggest violations of normality.
You can also perform a Shapiro-Wilk test of normality with the shapiro.test(x) function, where x is a numeric vector.
Additional functions for testing normality are available in the nortest package.
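For example, using the mpg variable from the built-in mtcars data:

```r
# Shapiro-Wilk test of normality for a numeric vector;
# a large p-value is consistent with normality
shapiro.test(mtcars$mpg)
```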
Multivariate Normality
MANOVA assumes multivariate normality.
The function mshapiro.test( ) in the mvnormtest package produces the Shapiro-Wilk test for multivariate normality.
Input must be a numeric matrix.
# Test Multivariate Normality
library(mvnormtest)
# mshapiro.test() expects variables in rows, so transpose
# an n x p data matrix M before testing
mshapiro.test(t(M))
If x is a p x 1 multivariate normal random vector with mean vector μ and covariance matrix Σ, then the squared Mahalanobis distance between x and μ is chi-square distributed with p degrees of freedom.
We can use this fact to construct a Q-Q plot to assess multivariate normality.
# Graphical Assessment of Multivariate Normality
x <- as.matrix(mydata) # n x p numeric matrix
center <- colMeans(x) # centroid
n <- nrow(x); p <- ncol(x); cov <- cov(x);
d <- mahalanobis(x,center,cov) # distances
qqplot(qchisq(ppoints(n),df=p),d,
main="QQ Plot Assessing Multivariate Normality",
ylab="Mahalanobis D2")
abline(a=0,b=1)
Homogeneity of Variances
The bartlett.test( ) function provides a parametric K-sample test of the equality of variances.
The fligner.test( ) function provides a non-parametric test of the same.
In the following examples y is a numeric variable and G is the grouping variable.
# Bartlett Test of Homogeneity of Variances
bartlett.test(y~G, data=mydata)
# Fligner-Killeen Test of Homogeneity of Variances
fligner.test(y~G, data=mydata)
The hovPlot( ) function in the HH package provides a graphic test of homogeneity of variances based on Brown-Forsyth.
In the following example, y is numeric and G is a grouping factor.
Note that G must be of type factor.
# Homogeneity of Variance Plot
library(HH)
hov(y~G, data=mydata)
hovPlot(y~G,data=mydata)
Homogeneity of Covariance Matrices
MANOVA and LDF assume homogeneity of variance-covariance matrices.
The assumption is usually tested with Box's M.
Unfortunately the test is very sensitive to violations of normality, leading to rejection in most typical cases.
Box's M is available via the boxM function in the biotools package.
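A minimal sketch with the built-in iris data (requires the biotools package):

```r
# Box's M test of homogeneity of covariance matrices
# across the three iris species
library(biotools)
boxM(iris[, 1:4], iris$Species)
```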
To Practice
Try the free first chapter of this course on ANOVA with R.
The coin package provides the ability to perform a wide variety of re-randomization or permutation based statistical tests.
These tests do not assume random sampling from well-defined populations.
They can be a reasonable alternative to classical procedures when test assumptions can not be met.
See coin: A Computational Framework for Conditional Inference for details.
In the examples below, lower case letters represent numerical variables and upper case letters represent categorical factors.
Monte-Carlo simulations are available for all tests.
Exact tests are available for 2 group procedures.
Independent Two- and K-Sample Location Tests
# Exact Wilcoxon Mann Whitney Rank Sum Test
# where y is numeric and A is a binary factor
library(coin)
wilcox_test(y~A, data=mydata, distribution="exact")
# One-Way Permutation Test based on 9999 Monte-Carlo
# resamplings. y is numeric and A is a categorical factor
library(coin)
oneway_test(y~A, data=mydata,
distribution=approximate(B=9999))
Symmetry of a response for repeated measurements
# Exact Wilcoxon Signed Rank Test
# where y1 and y2 are repeated measures
library(coin)
wilcoxsign_test(y1~y2, data=mydata, distribution="exact")
# Friedman Test based on 9999 Monte-Carlo resamplings.
# y is numeric, A is a grouping factor, and B is a
# blocking factor.
library(coin)
friedman_test(y~A|B, data=mydata,
distribution=approximate(B=9999))
Independence of Two Numeric Variables
# Spearman Test of Independence based on 9999 Monte-Carlo
# resamplings. x and y are numeric variables.
library(coin)
spearman_test(y~x, data=mydata,
distribution=approximate(B=9999))
Independence in Contingency Tables
# Independence in 2-way Contingency Table based on
# 9999 Monte-Carlo resamplings. A and B are factors.
library(coin)
chisq_test(A~B, data=mydata,
distribution=approximate(B=9999))
# Cochran-Mantel-Haenszel Test of 3-way Contingency Table
# based on 9999 Monte-Carlo resamplings. A and B are factors
# and C is a stratifying factor.
library(coin)
mh_test(A~B|C, data=mydata,
distribution=approximate(B=9999))
# Linear by Linear Association Test based on 9999
# Monte-Carlo resamplings. A and B are ordered factors.
library(coin)
lbl_test(A~B, data=mydata,
distribution=approximate(B=9999))
Many other univariate and multivariate tests are possible using the functions in the coin package.
See A Lego System for Conditional Inference for more details.
Power analysis is an important aspect of experimental design.
It allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence.
Conversely, it allows us to determine the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints.
If the probability is unacceptably low, we would be wise to alter or abandon the experiment.
The following four quantities have an intimate relationship:
sample size
effect size
significance level = P(Type I error) = probability of finding an effect that is not there
power = 1 - P(Type II error) = probability of finding an effect that is there
Given any three, we can determine the fourth.
Power Analysis in R
The pwr package, developed by Stéphane Champely, implements power analysis as outlined by Cohen (1988).
Some of the more important functions are listed below.
function          power calculations for
pwr.2p.test       two proportions (equal n)
pwr.2p2n.test     two proportions (unequal n)
pwr.anova.test    balanced one way ANOVA
pwr.chisq.test    chi-square test
pwr.f2.test       general linear model
pwr.p.test        proportion (one sample)
pwr.r.test        correlation
pwr.t.test        t-tests (one sample, 2 sample, paired)
pwr.t2n.test      t-test (two samples with unequal n)
For each of these functions, you enter three of the four quantities (effect size, sample size, significance level, power) and the fourth is calculated.
The significance level defaults to 0.05.
Therefore, to calculate the significance level, given an effect size, sample size, and power, use the option "sig.level=NULL".
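For example (a sketch; the n, d, and power values are arbitrary):

```r
library(pwr)
# solve for the significance level implied by n, d, and power
pwr.t.test(n=30, d=0.5, power=0.80, sig.level=NULL,
           type="two.sample")
```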
Specifying an effect size can be a daunting task.
ES formulas and Cohen's suggestions (based on social science research) are provided below.
Cohen's suggestions should only be seen as very rough guidelines.
Your own subject matter experience should be brought to bear.
(To explore confidence intervals and drawing conclusions from samples try this interactive course on the foundations of inference.)
t-tests
For t-tests, use the following functions:
pwr.t.test(n = , d = , sig.level = , power = ,
type = c("two.sample", "one.sample", "paired"))
where n is the sample size, d is the effect size, and type indicates a two-sample t-test, one-sample t-test or paired t-test.
If you have unequal sample sizes, use
pwr.t2n.test(n1 = , n2= , d = , sig.level =, power = )
where n1 and n2 are the sample sizes.
For t-tests, the effect size is assessed as Cohen's d = (mean1 - mean2) / (pooled standard deviation).
Cohen suggests that d values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes respectively.
You can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed, or one-tailed test.
A two tailed test is the default.
ANOVA
For a one-way analysis of variance use
pwr.anova.test(k = , n = , f = , sig.level = , power = )
where k is the number of groups and n is the common sample size in each group.
For a one-way ANOVA, effect size is measured by f, the standard deviation of the group means divided by the common within-group standard deviation.
Cohen suggests that f values of 0.1, 0.25, and 0.4 represent small, medium, and large effect sizes respectively.
Correlations
For correlation coefficients use
pwr.r.test(n = , r = , sig.level = , power = )
where n is the sample size and r is the correlation.
We use the population correlation coefficient as the effect size measure.
Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes respectively.
Linear Models
For linear models (e.g., multiple regression) use
pwr.f2.test(u =, v = , f2 = , sig.level = , power = )
where u and v are the numerator and denominator degrees of freedom.
We use f2 as the effect size measure.
When evaluating the impact of a set of predictors on an outcome, f2 = R2 / (1 - R2).
When evaluating the impact of one set of predictors above and beyond a second set of predictors (or covariates), f2 = (R2AB - R2A) / (1 - R2AB).
Cohen suggests f2 values of 0.02, 0.15, and 0.35 represent small, medium, and large effect sizes.
Tests of Proportions
When comparing two proportions use
pwr.2p.test(h = , n = , sig.level =, power = )
where n is the common sample size in each group and h is the effect size, h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)).
Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes respectively.
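The pwr package provides ES.h( ) to compute h from two proportions (the proportions below are arbitrary):

```r
library(pwr)
h <- ES.h(0.65, 0.45)  # arcsine-transformed difference in proportions
pwr.2p.test(h=h, sig.level=0.05, power=0.80)  # solve for n
```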
For unequal n's use
pwr.2p2n.test(h = , n1 = , n2 = , sig.level = , power = )
To test a single proportion use
pwr.p.test(h = , n = , sig.level = , power = )
For both two sample and one sample proportion tests, you can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed, or one-tailed test.
A two tailed test is the default.
Chi-square Tests
For chi-square tests use
pwr.chisq.test(w =, N = , df = , sig.level =, power = )
where w is the effect size, N is the total sample size, and df is the degrees of freedom.
The effect size w is defined as w = sqrt(sum((p0i - p1i)^2 / p0i)), where p0i and p1i are the cell probabilities under the null and alternative hypotheses respectively.
Cohen suggests that w values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes respectively.
Some Examples
library(pwr)
# For a one-way ANOVA comparing 5 groups, calculate the
# sample size needed in each group to obtain a power of
# 0.80, when the effect size is moderate (0.25) and a
# significance level of 0.05 is employed.
pwr.anova.test(k=5,f=.25,sig.level=.05,power=.8)
# What is the power of a one-tailed t-test, with a
# significance level of 0.01, 25 people in each group,
# and an effect size equal to 0.75?
pwr.t.test(n=25,d=0.75,sig.level=.01,alternative="greater")
# Using a two-tailed test of proportions, and assuming a
# significance level of 0.01 and a common sample size of
# 30 for each proportion, what effect size can be detected
# with a power of .75?
pwr.2p.test(n=30,sig.level=0.01,power=0.75)
Creating Power or Sample Size Plots
The functions in the pwr package can be used to generate power and sample size graphs.
# Plot sample size curves for detecting correlations of
# various sizes.
library(pwr)
# range of correlations
r <- seq(.1,.5,.01)
nr <- length(r)
# power values
p <- seq(.4,.9,.1)
np <- length(p)
# obtain sample sizes
samsize <- array(numeric(nr*np), dim=c(nr,np))
for (i in 1:np){
for (j in 1:nr){
result <- pwr.r.test(n = NULL, r = r[j],
sig.level = .05, power = p[i],
alternative = "two.sided")
samsize[j,i] <- ceiling(result$n)
}
}
# set up graph
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
xlab="Correlation Coefficient (r)",
ylab="Sample Size (n)" )
# add power curves
for (i in 1:np){
lines(r, samsize[,i], type="l", lwd=2, col=colors[i])
}
# add annotation (grid lines, title, legend)
abline(v=0, h=seq(0,yrange[2],50), lty=2, col="grey89")
abline(h=0, v=seq(xrange[1],xrange[2],.02), lty=2,
col="grey89")
title("Sample Size Estimation for Correlation Studies\n
Sig=0.05 (Two-tailed)")
legend("topright", title="Power",
as.character(p),
fill=colors)
There are two functions that can help write simpler and more efficient code.
With
The with( ) function applies an expression to a dataset.
It is similar to DATA= in SAS.
# with(data, expression)
# example applying a t-test to a data frame mydata
with(mydata, t.test(y ~ group))
By
The by( ) function applies a function to each level of a factor or factors.
It is similar to BY processing in SAS.
# by(data, factorlist, function)
# example obtain variable means separately for
# each level of byvar in data frame mydata
by(mydata, mydata$byvar, function(x) mean(x))
To Practice
This data manipulation tutorial in R includes exercises on using the by() function.
Generalized linear models are fit using the glm( ) function.
The form of the glm function is
glm(formula, family=familytype(link=linkfunction), data=)
Family
Default Link Function
binomial
(link = "logit")
gaussian
(link = "identity")
Gamma
(link = "inverse")
inverse.gaussian
(link = "1/mu^2")
poisson
(link = "log")
quasi
(link = "identity", variance = "constant")
quasibinomial
(link = "logit")
quasipoisson
(link = "log")
See help(glm) for other modeling options.
See help(family) for other allowable link functions for each family.
Three subtypes of generalized linear models will be covered here: logistic regression, poisson regression, and survival analysis.
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome from a set of continuous predictor variables.
It is frequently preferred over discriminant function analysis because of its less restrictive assumptions.
# Logistic Regression
# where F is a binary factor and
# x1-x3 are continuous predictors
fit <- glm(F~x1+x2+x3,data=mydata,family=binomial())
summary(fit) # display results
confint(fit) # 95% CI for the coefficients
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals
You can use anova(fit1,fit2, test="Chisq") to compare nested models.
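A runnable sketch using the built-in mtcars data, with am (transmission) as the binary outcome:

```r
# likelihood ratio (chi-square) test of nested logistic models
fit1 <- glm(am ~ wt, data=mtcars, family=binomial())
fit2 <- glm(am ~ wt + hp, data=mtcars, family=binomial())
anova(fit1, fit2, test="Chisq")
```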
Additionally, cdplot(F~x, data=mydata) will display the conditional density plot of the binary outcome F on the continuous x variable.
Poisson Regression
Poisson regression is useful when predicting an outcome variable representing counts from a set of continuous predictor variables.
# Poisson Regression
# where count is a count and
# x1-x3 are continuous predictors
fit <- glm(count ~ x1+x2+x3, data=mydata, family=poisson())
summary(fit) # display results
If you have overdispersion (see if residual deviance is much larger than degrees of freedom), you may want to use quasipoisson() instead of poisson().
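The check itself is a one-liner, sketched here with the built-in warpbreaks data (breaks is a count):

```r
# ratio of residual deviance to residual degrees of freedom;
# values much greater than 1 suggest overdispersion
fit <- glm(breaks ~ wool + tension, data=warpbreaks,
           family=poisson())
fit$deviance / fit$df.residual
```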
Survival Analysis
Survival analysis (also called event history analysis or reliability analysis) covers a set of techniques for modeling the time to an event.
Data may be right censored - the event may not have occurred by the end of the study, or we may have incomplete information on an observation but know that up to a certain time the event had not occurred (e.g., the participant dropped out of the study in week 10 but was alive at that time).
While generalized linear models are typically analyzed using the glm( ) function, survival analysis is typically carried out using functions from the survival package.
The survival package can handle one and two sample problems, parametric accelerated failure models, and the Cox proportional hazards model.
Data are typically entered in the format start time, stop time, and status (1=event occurred, 0=event did not occur).
Alternatively, the data may be in the format time to event and status (1=event occurred, 0=event did not occur).
A status=0 indicates that the observation is right censored.
Data are bundled into a Surv object via the Surv( ) function prior to further analyses.
survfit( ) is used to estimate a survival distribution for one or more groups.
survdiff( ) tests for differences in survival distributions between two or more groups.
coxph( ) models the hazard function on a set of predictor variables.
# Mayo Clinic Lung Cancer Data
library(survival)
# learn about the dataset
help(lung)
# create a Surv object
survobj <- with(lung, Surv(time,status))
# Plot survival distribution of the total sample
# Kaplan-Meier estimator
fit0 <- survfit(survobj~1, data=lung)
summary(fit0)
plot(fit0, xlab="Survival Time in Days",
ylab="% Surviving", yscale=100,
main="Survival Distribution (Overall)")
# Compare the survival distributions of men and women
fit1 <- survfit(survobj~sex,data=lung)
# plot the survival distributions by sex
plot(fit1, xlab="Survival Time in Days",
ylab="% Surviving", yscale=100, col=c("red","blue"),
main="Survival Distributions by Gender")
legend("topright", title="Gender", c("Male", "Female"),
fill=c("red", "blue"))
# test for difference between male and female
# survival curves (logrank test)
survdiff(survobj~sex, data=lung)
# predict male survival from age and medical scores
MaleMod <- coxph(survobj~age+ph.ecog+ph.karno+pat.karno,
data=lung, subset=sex==1)
# display results
MaleMod
# evaluate the proportional hazards assumption
cox.zph(MaleMod)
See Thomas Lumley's R news article on the survival package for more information.
Other good sources include Mai Zhou's Use R Software to do Survival Analysis and Simulation and M. J. Crawley's chapter on Survival Analysis.
To Practice
Try this interactive exercise on basic logistic regression with R using age as a predictor for credit risk.
The MASS package contains functions for performing linear and quadratic discriminant function analysis.
Unless prior probabilities are specified, each assumes proportional prior probabilities (i.e., prior probabilities are based on sample sizes).
In the examples below, lower case letters are numeric variables and upper case letters are categorical factors.
Linear Discriminant Function
# Linear Discriminant Analysis with Jackknifed Prediction
library(MASS)
fit <- lda(G ~ x1 + x2 + x3, data=mydata,
na.action="na.omit", CV=TRUE)
fit # show results
The code above performs an LDA, using listwise deletion of missing data.
CV=TRUE generates jackknifed (i.e., leave-one-out) predictions.
The code below assesses the accuracy of the prediction.
# Assess the accuracy of the prediction
# percent correct for each category of G
ct <- table(mydata$G, fit$class)
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))
lda() prints discriminant functions based on centered (not standardized) variables.
The "proportion of trace" that is printed is the proportion of between-class variance that is explained by successive discriminant functions.
No significance tests are produced.
Refer to the section on MANOVA for such tests.
Quadratic Discriminant Function
To obtain a quadratic discriminant function use qda( ) instead of lda( ).
Quadratic discriminant function does not assume homogeneity of variance-covariance matrices.
# Quadratic Discriminant Analysis with 3 groups applying
# resubstitution prediction and equal prior probabilities.
library(MASS)
fit <- qda(G ~ x1 + x2 + x3 + x4, data=na.omit(mydata),
prior=c(1,1,1)/3)
Note the alternate way of specifying listwise deletion of missing data.
Re-substitution (using the same data to derive the functions and evaluate their prediction accuracy) is the default method unless CV=TRUE is specified.
Re-substitution will be overly optimistic.
Visualizing the Results
You can plot each observation in the space of the first 2 linear discriminant functions using the following code.
Points are identified with the group ID.
# Scatter plot using the 1st two discriminant dimensions
plot(fit) # fit from lda
The following code displays histograms and density plots for the observations in each group on the first linear discriminant dimension.
There is one panel for each group and they all appear lined up on the same graph.
# Panels of histograms and overlayed density plots
# for 1st discriminant function
plot(fit, dimen=1, type="both") # fit from lda
The partimat( ) function in the klaR package can display the results of a linear or quadratic classification, two variables at a time.
# Exploratory Graph for LDA or QDA
library(klaR)
partimat(G~x1+x2+x3,data=mydata,method="lda")
You can also produce a scatterplot matrix with color coding by group.
# Scatterplot for 3 Group Problem
pairs(mydata[c("x1","x2","x3")], main="My Title ", pch=22,
bg=c("red", "yellow", "blue")[unclass(mydata$G)])
Test Assumptions
See (M)ANOVA Assumptions for methods of evaluating multivariate normality and homogeneity of covariance matrices.
To Practice
To practice improving predictions, try the Kaggle R Tutorial on Machine Learning
R has extensive facilities for analyzing time series data.
This section describes the creation of a time series, seasonal decomposition, modeling with exponential and ARIMA models, and forecasting with the forecast package.
Creating a time series
The ts() function will convert a numeric vector into an R time series object.
The format is ts(vector, start=, end=, frequency=) where start and end are the times of the first and last observation and frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).
# save a numeric vector containing 72 monthly observations
# from Jan 2009 to Dec 2014 as a time series object
myts <- ts(myvector, start=c(2009, 1), end=c(2014, 12), frequency=12)
# subset the time series (June 2014 to December 2014)
myts2 <- window(myts, start=c(2014, 6), end=c(2014, 12))
# plot series
plot(myts)
Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the stl() function.
Note that a series with multiplicative effects can often be transformed into a series with additive effects through a log transformation (i.e., newts <- log(myts)).
# Seasonal decomposition
fit <- stl(myts, s.window="period")
plot(fit)
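For example, the built-in AirPassengers series has multiplicative seasonality, so it is logged before decomposition:

```r
# log transform makes the seasonal component additive,
# which stl() can then decompose
fit <- stl(log(AirPassengers), s.window="periodic")
plot(fit)
```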
Both the HoltWinters() function in the base installation, and the ets() function in the forecast package, can be used to fit exponential models.
# simple exponential - models level
fit <- HoltWinters(myts, beta=FALSE, gamma=FALSE)
# double exponential - models level and trend
fit <- HoltWinters(myts, gamma=FALSE)
# triple exponential - models level, trend, and seasonal components
fit <- HoltWinters(myts)
# predict next three future values
library(forecast)
forecast(fit, 3)
plot(forecast(fit, 3))
ARIMA Models
The arima() function can be used to fit an autoregressive integrated moving averages model.
Other useful functions include:
lag(ts, k)                      lagged version of time series, shifted back k observations
diff(ts, differences=d)         difference the time series d times
ndiffs(ts)                      number of differences required to achieve stationarity (from the forecast package)
acf(ts)                         autocorrelation function
pacf(ts)                        partial autocorrelation function
adf.test(ts)                    Augmented Dickey-Fuller test; rejecting the null hypothesis suggests that a time series is stationary (from the tseries package)
Box.test(x, type="Ljung-Box")   Portmanteau test that observations in vector or time series x are independent
Note that the forecast package has somewhat nicer versions of acf() and pacf() called Acf() and Pacf() respectively.
# fit an ARIMA model of order P, D, Q
fit <- arima(myts, order=c(p, d, q))
# predict next 5 observations
library(forecast)
forecast(fit, 5)
plot(forecast(fit, 5))
Automated Forecasting
The forecast package provides functions for the automatic selection of exponential and ARIMA models.
The ets() function supports both additive and multiplicative models.
The auto.arima() function can handle both seasonal and nonseasonal ARIMA models.
Models are chosen to optimize one of several fit criteria.
library(forecast)
# Automated forecasting using an exponential model
fit <- ets(myts)
# Automated forecasting using an ARIMA model
fit <- auto.arima(myts)
Going Further
There are many good online resources for learning time series analysis with R.
These include A Little Book of R for Time Series by Avril Coghlan and DataCamp's Manipulating Time Series in R course by Jeffrey Ryan.
This section covers principal components and factor analysis.
The latter includes both exploratory and confirmatory methods.
Principal Components
The princomp( ) function produces an unrotated principal component analysis.
# Principal Components Analysis
# entering raw data and extracting PCs
# from the correlation matrix
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit,type="lines") # scree plot
fit$scores # the principal components
biplot(fit)
Use cor=FALSE to base the principal components on the covariance matrix.
Use the covmat= option to enter a correlation or covariance matrix directly.
If entering a covariance matrix, include the option n.obs=.
The principal( ) function in the psych package can be used to extract and rotate principal components.
# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results
mydata can be a raw data matrix or a covariance matrix.
Pairwise deletion of missing data is used.
rotate= can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".
Exploratory Factor Analysis
The factanal( ) function produces maximum likelihood factor analysis.
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation
fit <- factanal(mydata, 3, rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names
The rotation= options include "varimax", "promax", and "none".
Add the option scores="regression" or "Bartlett" to produce factor scores.
Use the covmat= option to enter a correlation or covariance matrix directly.
If entering a covariance matrix, include the option n.obs=.
The factor.pa( ) function in the psych package offers a number of factor analysis related functions, including principal axis factoring.
# Principal Axis Factor Analysis
library(psych)
fit <- factor.pa(mydata, nfactors=3, rotation="varimax")
fit # print results
mydata can be a raw data matrix or a covariance matrix.
Pairwise deletion of missing data is used.
Rotation can be "varimax" or "promax".
Determining the Number of Factors to Extract
A crucial decision in exploratory factor analysis is how many factors to extract.
The nFactors package offers a suite of functions to aid in this decision.
Details on this methodology can be found in a PowerPoint presentation by Raiche, Riopel, and Blais.
Of course, any factor solution must be interpretable to be useful.
# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),
rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
Going Further
The FactoMineR package offers a large number of additional functions for exploratory factor analysis.
This includes the use of both quantitative and qualitative variables, as well as the inclusion of supplementary variables and observations.
Here is an example of the types of graphs that you can create with this package.
# PCA Variable Factor Map
library(FactoMineR)
result <- PCA(mydata) # graphs generated automatically
The GPArotation package offers a wealth of rotation options beyond varimax and promax.
Structural Equation Modeling
Confirmatory Factor Analysis (CFA) is a subset of the much wider Structural Equation Modeling (SEM) methodology.
SEM is provided in R via the sem package.
Models are entered via RAM specification (similar to PROC CALIS in SAS).
While sem is a comprehensive package, my recommendation is that if you are doing significant SEM work, you spring for a copy of AMOS.
It can be much more user-friendly and creates more attractive and publication ready output.
Having said that, here is a CFA example using sem.
Assume that we have six observed variables (X1, X2, ..., X6).
We hypothesize that there are two unobserved latent factors (F1, F2) that underlie the observed variables as described in this diagram.
X1, X2, and X3 load on F1 (with loadings lam1, lam2, and lam3).
X4, X5, and X6 load on F2 (with loadings lam4, lam5, and lam6).
The double headed arrow indicates the covariance between the two latent factors (F1F2).
e1 through e6 represent the residual variances (variance in the observed variables not accounted for by the two latent factors).
We set the variances of F1 and F2 equal to one so that the parameters will have a scale.
This will result in F1F2 representing the correlation between the two latent factors.
For sem, we need the covariance matrix of the observed variables - thus the cov( ) statement in the code below.
The CFA model is specified using the specify.model( ) function.
The format is arrow specification, parameter name, start value.
Choosing a start value of NA tells the program to choose a start value rather than supplying one yourself.
Note that the variances of F1 and F2 are fixed at 1 (NA in the second column).
The blank line is required to end the RAM specification.
# Simple CFA Model
library(sem)
mydata.cov <- cov(mydata)
model.mydata <- specify.model()
F1 -> X1, lam1, NA
F1 -> X2, lam2, NA
F1 -> X3, lam3, NA
F2 -> X4, lam4, NA
F2 -> X5, lam5, NA
F2 -> X6, lam6, NA
X1 <-> X1, e1, NA
X2 <-> X2, e2, NA
X3 <-> X3, e3, NA
X4 <-> X4, e4, NA
X5 <-> X5, e5, NA
X6 <-> X6, e6, NA
F1 <-> F1, NA, 1
F2 <-> F2, NA, 1
F1 <-> F2, F1F2, NA
mydata.sem <- sem(model.mydata, mydata.cov, nrow(mydata))
# print results (fit indices, parameters, hypothesis tests)
summary(mydata.sem)
# print standardized coefficients (loadings)
std.coef(mydata.sem)
You can use the boot.sem( ) function to bootstrap the structural equation model.
See help(boot.sem) for details.
Additionally, the function mod.indices() will produce modification indices.
Using modification indices to improve model fit by respecifying the parameters moves you from a confirmatory to an exploratory analysis.
For more information on sem, see Structural Equation Modeling with the sem Package in R, by John Fox.
To Practice
To practice improving predictions, try the Kaggle R Tutorial on Machine Learning
Correspondence analysis provides a graphic method of exploring the relationship between variables in a contingency table.
There are many options for correspondence analysis in R.
I recommend the ca package by Nenadic and Greenacre because it supports supplementary points, subset analyses, and comprehensive graphics.
You can obtain the package here.
Although ca can perform multiple correspondence analysis (more than two categorical variables), only simple correspondence analysis is covered here.
See their article for details on multiple CA.
Simple Correspondence Analysis
In the following example, A and B are categorical factors.
# Correspondence Analysis
library(ca)
mytable <- with(mydata, table(A,B)) # create a 2 way table
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
fit <- ca(mytable)
print(fit) # basic results
summary(fit) # extended results
plot(fit) # symmetric map
plot(fit, mass = TRUE, contrib = "absolute", map =
"rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map
The first graph is the standard symmetric representation of a simple correspondence analysis with rows and columns represented by points.
Row points (column points) that are closer together have more similar column profiles (row profiles).
Keep in mind that you cannot interpret the distance between row and column points directly.
The second graph is asymmetric, with rows in the principal coordinates and columns in reconstructions of the standardized residuals.
Additionally, mass is represented by point size and columns are represented by arrows.
Point intensity (shading) corresponds to the absolute contributions for the rows.
This example is included to highlight some of the available options.
Going Further
Try this interactive course on exploratory data analysis.
R provides functions for both classical and nonmetric multidimensional scaling.
Assume that we have N objects measured on p numeric variables.
We want to represent the distances among the objects in a parsimonious (and visual) way (i.e., a lower k-dimensional space).
Classical MDS
You can perform a classical MDS using the cmdscale( ) function.
# Classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim
fit # view results
# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="Metric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
Nonmetric MDS
Nonmetric MDS is performed using the isoMDS( ) function in the MASS package.
# Nonmetric MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
library(MASS)
d <- dist(mydata) # euclidean distances between the rows
fit <- isoMDS(d, k=2) # k is the number of dim
fit # view results
# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="Nonmetric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
Individual Difference Scaling
3-way or individual difference scaling can be completed using the indscal() function in the SensoMineR package.
The smacof package offers a three-way analysis of individual differences based on stress minimization by means of majorization.
To Practice
This tutorial on ggplot2 includes exercises on Distance matrices and Multi-Dimensional Scaling (MDS).
R has an amazing variety of functions for cluster analysis.
In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based.
While there are no best solutions for the problem of determining the number of clusters to extract, several approaches are given below.
Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability.
# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
Partitioning
K-means clustering is the most popular partitioning method.
It requires the analyst to specify the number of clusters to extract.
A plot of the within groups sum of squares by number of clusters extracted can help determine the appropriate number of clusters.
The analyst looks for a bend in the plot similar to a scree test in factor analysis.
See Everitt & Hothorn (pg. 251).
# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
A robust version of K-means based on medoids can be invoked by using pam( ) instead of kmeans( ).
The function pamk( ) in the fpc package is a wrapper for pam that also prints the suggested number of clusters based on optimum average silhouette width.
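As a minimal sketch of a medoid-based solution (using the built-in mtcars data and an arbitrary choice of three variables as a stand-in for your own data):

```r
# Partitioning Around Medoids (PAM) - a robust alternative to k-means
library(cluster)                                 # ships with R
mydat <- scale(mtcars[, c("mpg", "hp", "wt")])   # standardize variables
fit <- pam(mydat, k = 3)                         # 3-cluster solution
fit$medoids                                      # the representative observations
plot(fit)                                        # cluster and silhouette plots
```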
Hierarchical Agglomerative
There are a wide range of hierarchical clustering approaches.
I have had good luck with Ward's method described below.
# Ward Hierarchical Clustering
d <- dist(mydata,
method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
click to view
The pvclust( ) function in the pvclust package provides p-values for hierarchical clustering based on multiscale bootstrap resampling.
Clusters that are highly supported by the data will have large p-values.
Interpretation details are provided by Suzuki.
Be aware that pvclust clusters columns, not rows.
Transpose your data before using.
# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward",
   method.dist="euclidean")
plot(fit) # dendrogram with p-values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)
Model Based
Model based approaches assume a variety of data models and apply maximum likelihood estimation and Bayes criteria to identify the most likely model and number of clusters.
Specifically, the Mclust( ) function in the mclust package selects the optimal model according to BIC for EM initialized by hierarchical clustering for parameterized Gaussian mixture models (phew!).
One chooses the model and number of clusters with the largest BIC.
See help(mclustModelNames) for details on the model chosen as best.
# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model
Plotting Cluster Solutions
It is always a good idea to look at the cluster results.
# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)
Validating cluster solutions
The function cluster.stats() in the fpc package provides a mechanism for comparing the similarity of two cluster solutions using a variety of validation criteria (Hubert's gamma coefficient, the Dunn index, and the corrected Rand index).
# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)
where d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer vectors containing classification results from two different clusterings of the same data.
To Practice
Try the clustering exercise in this introduction to machine learning course.
Recursive partitioning is a fundamental tool in data mining.
It helps us explore the structure of a set of data, while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome.
This section briefly describes CART modeling, conditional inference trees, and random forests.
CART Modeling via rpart
Classification and regression trees (as described by Breiman, Friedman, Olshen, and Stone) can be generated through the rpart package.
Detailed information on rpart is available in An Introduction to Recursive Partitioning Using the RPART Routines.
The general steps are provided below followed by two examples.
1. Grow the Tree
To grow a tree, use
rpart(formula, data=, method=,control=) where
formula
is in the format
outcome ~ predictor1+predictor2+predictor3+etc.
data=
specifies the data frame
method=
"class" for a classification tree
"anova" for a regression tree
control=
optional parameters for controlling tree growth.
For example, control=rpart.control(minsplit=30, cp=0.001) requires that the minimum number of observations in a node be 30 before attempting a split and that a split must decrease the overall lack of fit by a factor of 0.001 (cost complexity factor) before being attempted.
2. Examine the results
The following functions help us to examine the results.
printcp(fit)
display cp table
plotcp(fit)
plot cross-validation results
rsq.rpart(fit)
plot approximate R-squared and relative error for different splits (2 plots).
Labels are only appropriate for the "anova" method.
print(fit)
print results
summary(fit)
detailed results including surrogate splits
plot(fit)
plot decision tree
text(fit)
label the decision tree plot
post(fit, file=)
create postscript plot of decision tree
In trees created by rpart( ), move to the LEFT branch when the stated condition is true (see the graphs below).
3. Prune the tree
Prune back the tree to avoid overfitting the data.
Typically, you will want to select a tree size that minimizes the cross-validated error, the xerror column printed by printcp( ).
Prune the tree to the desired size using
prune(fit, cp=)
Specifically, use printcp( ) to examine the cross-validated error results, select the complexity parameter associated with minimum error, and place it into the prune( ) function.
Alternatively, you can use the code fragment
fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
to automatically select the complexity parameter associated with the smallest cross-validated error.
Thanks to HSAUR for this idea.
Classification Tree example
Let's use the data frame kyphosis to predict a type of deformation (kyphosis) after surgery, from age in months (Age), number of vertebrae involved (Number), and the highest vertebrae operated on (Start).
# Classification Tree with rpart
library(rpart)
# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
method="class", data=kyphosis)
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
# plot tree
plot(fit, uniform=TRUE,
main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fit, file = "c:/tree.ps",
title = "Classification Tree for Kyphosis")
# prune the tree
pfit<- prune(fit, cp= fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
# plot the pruned tree
plot(pfit, uniform=TRUE,
main="Pruned Classification Tree for Kyphosis")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree.ps",
title = "Pruned Classification Tree for Kyphosis")
Regression Tree example
In this example we will predict car mileage from price, country, reliability, and car type.
The data frame is cu.summary.
# Regression Tree Example
library(rpart)
# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
method="anova", data=cu.summary)
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # visualize cross-validation results
# plot tree
plot(fit, uniform=TRUE,
main="Regression Tree for Mileage ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fit, file = "c:/tree2.ps",
title = "Regression Tree for Mileage ")
# prune the tree
pfit<- prune(fit, cp=0.01160389) # from cptable
# plot the pruned tree
plot(pfit, uniform=TRUE,
main="Pruned Regression Tree for Mileage")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree2.ps",
title = "Pruned Regression Tree for Mileage")
It turns out that this produces the same tree as the original.
Conditional inference trees via party
The party package provides nonparametric regression trees for nominal, ordinal, numeric, censored, and multivariate responses.
party: A Laboratory for Recursive Partitioning provides details.
You can create a regression or classification tree via the function
ctree(formula, data=)
The type of tree created will depend on the outcome variable (nominal factor, ordered factor, numeric, etc.).
Tree growth is based on statistical stopping rules, so pruning should not be required.
The previous two examples are re-analyzed below.
# Conditional Inference Tree for Kyphosis
library(party)
fit <- ctree(Kyphosis ~ Age + Number + Start,
data=kyphosis)
plot(fit, main="Conditional Inference Tree for Kyphosis")
# Conditional Inference Tree for Mileage
library(party)
fit2 <- ctree(Mileage~Price + Country + Reliability + Type,
data=na.omit(cu.summary))
plot(fit2, main="Conditional Inference Tree for Mileage")
Random Forests
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).
Breiman and Cutler's random forest approach is implemented via the randomForest package.
Here is an example.
# Random Forest prediction of Kyphosis data
library(randomForest)
fit <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit) # view results
importance(fit) # importance of each predictor
For more details see the comprehensive Random Forest website.
Going Further
This section has only touched on the options available.
To learn more, see the CRAN Task View on Machine & Statistical Learning.
The boot package provides extensive facilities for bootstrapping and related resampling methods.
You can bootstrap a single statistic (e.g., a median), or a vector (e.g., regression weights).
This section will get you started with basic nonparametric bootstrapping.
The main bootstrapping function is boot( ) and has the following format:
bootobject <- boot(data= , statistic= , R=, ...) where
parameter
description
data
A vector, matrix, or data frame
statistic
A function that produces the k statistics to be bootstrapped (k=1 if bootstrapping a single statistic).
The function should include an indices parameter that the boot() function can use to select cases for each replication (see examples below).
R
Number of bootstrap replicates
...
Additional parameters to be passed to the function that produces the statistic of interest
boot( ) calls the statistic function R times.
Each time, it generates a set of random indices, with replacement, from the integers 1:nrow(data).
These indices are used within the statistic function to select a sample.
The statistics are calculated on the sample and the results are accumulated in the bootobject.
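The resampling step can be sketched in base R; this is an illustration of the mechanism only, not how you would call boot( ) in practice:

```r
# one bootstrap replicate by hand: draw row indices with replacement,
# then apply the statistic to the resampled rows
set.seed(42)
n <- nrow(mtcars)
idx <- sample(1:n, size = n, replace = TRUE)
mean(mtcars$mpg[idx])  # a single bootstrap replicate of mean mpg
```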
The bootobject structure includes
element
description
t0
The observed values of k statistics applied to the original data.
t
An R x k matrix where each row is a bootstrap replicate of the k statistics.
You can access these as bootobject$t0 and bootobject$t.
Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to examine the results.
If the results look reasonable, you can use the boot.ci( ) function to obtain confidence intervals for the statistic(s).
The format is
boot.ci(bootobject, conf=, type= ) where
parameter
description
bootobject
The object returned by the boot function
conf
The desired confidence interval (default: conf=0.95)
type
The type of confidence interval returned.
Possible values are "norm", "basic", "stud", "perc", "bca" and "all" (default: type="all")
Bootstrapping a Single Statistic (k=1)
The following example generates the bootstrapped 95% confidence interval for R-squared in the linear regression of miles per gallon (mpg) on car weight (wt) and displacement (disp).
The data source is mtcars.
The bootstrapped confidence interval is based on 1000 replications.
# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices)
{
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(summary(fit)$r.squared)
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=rsq,
R=1000, formula=mpg~wt+disp)
# view results
results
plot(results)
# get 95% confidence interval
boot.ci(results, type="bca")
Bootstrapping several Statistics (k>1)
In the example above, the function rsq returned a number and boot.ci returned a single confidence interval.
The statistics function you provide can also return a vector.
In the next example we get the 95% CI for the three model regression coefficients (intercept, car weight, displacement).
In this case we add an index parameter to plot( ) and boot.ci( ) to indicate which column in bootobject$t is to be analyzed.
# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices)
{
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=bs,
R=1000, formula=mpg~wt+disp)
# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp
# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept
boot.ci(results, type="bca", index=2) # wt
boot.ci(results, type="bca", index=3) # disp
Going Further
The boot( ) function can generate both nonparametric and parametric resampling.
For the nonparametric bootstrap, resampling methods include ordinary, balanced, antithetic, and permutation; stratified resampling is also supported.
Importance resampling weights can also be specified.
The boot.ci( ) function takes a bootobject and generates 5 different types of two-sided nonparametric confidence intervals.
These include the first order normal approximation, the basic bootstrap interval, the studentized bootstrap interval, the bootstrap percentile interval, and the adjusted bootstrap percentile (BCa) interval.
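As a small sketch of requesting specific interval types, here the median of the built-in rivers data is bootstrapped; the choice of statistic and data is illustrative only:

```r
library(boot)                              # ships with R
med <- function(d, i) median(d[i])         # statistic with an indices argument
b <- boot(data = rivers, statistic = med, R = 999)
boot.ci(b, conf = 0.95, type = c("norm", "basic", "perc"))
```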
Look at help(boot), help(boot.ci), and help(plot.boot) for more details.
Try this interactive exercise with the boot package from DataCamp's Intro to Computational Finance with R course.
Most of the methods on this website actually describe the programming of matrices.
Matrix operations are built deeply into the R language.
This section will simply cover operators and functions specifically suited to linear algebra.
Before proceeding you may want to review the sections on Data Types and Operators.
Matrix facilities
In the following examples, A and B are matrices and x and b are vectors.
Operator or Function
Description
A * B
Element-wise multiplication
A %*% B
Matrix multiplication
A %o% B
Outer product: AB'
crossprod(A,B)
crossprod(A)
A'B and A'A respectively.
t(A)
Transpose
diag(x)
Creates diagonal matrix with elements of x in the principal diagonal
diag(A)
Returns a vector containing the elements of the principal diagonal
diag(k)
If k is a scalar, this creates a k x k identity matrix.
Go figure.
solve(A, b)
Returns vector x in the equation b = Ax (i.e., x = A⁻¹b)
solve(A)
Inverse of A where A is a square matrix.
ginv(A)
Moore-Penrose Generalized Inverse of A.
ginv(A) requires loading the MASS package.
y<-eigen(A)
y$val are the eigenvalues of A
y$vec are the eigenvectors of A
y<-svd(A)
Singular value decomposition of A.
y$d = vector containing the singular values of A
y$u = matrix whose columns contain the left singular vectors of A
y$v = matrix whose columns contain the right singular vectors of A
R <- chol(A)
Cholesky factorization of A.
Returns the upper triangular factor, such that R'R = A.
y <- qr(A)
QR decomposition of A.
y$qr has an upper triangle that contains the decomposition and a lower triangle that contains information on the Q decomposition.
y$rank is the rank of A.
y$qraux is a vector containing additional information on Q.
y$pivot contains information on the pivoting strategy used.
cbind(A,B,...)
Combine matrices (vectors) horizontally.
Returns a matrix.
rbind(A,B,...)
Combine matrices (vectors) vertically.
Returns a matrix.
rowMeans(A)
Returns vector of row means.
rowSums(A)
Returns vector of row sums.
colMeans(A)
Returns vector of column means.
colSums(A)
Returns vector of column sums.
Matlab Emulation
The matlab package contains wrapper functions and variables used to replicate MATLAB function calls as closely as possible.
This can ease the porting of MATLAB applications and code to R.
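A brief sketch, assuming the matlab package is installed (function names follow their MATLAB counterparts):

```r
library(matlab)
ones(2, 3)        # 2 x 3 matrix of ones
zeros(3)          # 3 x 3 matrix of zeros
eye(3)            # 3 x 3 identity matrix
linspace(0, 1, 5) # 5 evenly spaced values from 0 to 1
```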
Going Further
The Matrix package contains functions that extend R to support highly dense or sparse matrices.
It provides efficient access to BLAS (Basic Linear Algebra Subroutines), LAPACK (dense matrix), TAUCS (sparse matrix), and UMFPACK (sparse matrix) routines.
To Practice
Try some of the exercises in matrix algebra in this course on intro to statistics with R.
In R, graphs are typically created interactively.
# Creating a Graph
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
The plot( ) function opens a graph window and plots weight vs. miles per gallon.
The next line of code adds a regression line to this graph.
The final line adds a title.
Saving Graphs
You can save the graph in a variety of formats from the menu
File -> Save As.
You can also save the graph via code using one of the following functions.
Creating a new graph by issuing a high level plotting command (plot, hist, boxplot, etc.) will typically overwrite a previous graph.
To avoid this, open a new graph window before creating a new graph.
To open a new graph window use one of the functions below.
Function
Platform
windows()
Windows
X11()
Unix
quartz()
Mac
You can have multiple graph windows open at one time.
See help(dev.cur) for more details.
Alternatively, after opening the first graph window, choose History -> Recording from the graph window menu.
Then you can use Previous and Next to step through the graphs you have created.
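As a portable sketch, base R's dev.new( ) (not listed in the table above) opens whichever device is appropriate for your platform:

```r
# keep the first graph by opening a new device before plotting again
plot(mtcars$wt, mtcars$mpg)  # first graph window
dev.new()                    # opens a new, platform-appropriate graphics device
hist(mtcars$mpg)             # second graph; the first remains open
```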
You can create histograms with the function hist(x) where x is a numeric vector of values to be plotted.
The option freq=FALSE plots probability densities instead of frequencies.
The option breaks= controls the number of bins.
# Simple Histogram
hist(mtcars$mpg)
# Colored Histogram with Different Number of Bins
hist(mtcars$mpg, breaks=12, col="red")
# Add a Normal Curve (Thanks to Peter Dalgaard)
x <- mtcars$mpg
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
Histograms can be a poor method for determining the shape of a distribution because it is so strongly affected by the number of bins used.
To practice making a density plot with the hist() function, try this exercise.
Kernel Density Plots
Kernel density plots are usually a much more effective way to view the distribution of a variable.
Create the plot using plot(density(x)) where x is a numeric vector.
# Kernel Density Plot
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results
# Filled Density Plot
d <- density(mtcars$mpg)
plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue")
Comparing Groups via Kernel Density
The sm.density.compare( ) function in the sm package allows you to superimpose the kernel density plots of two or more groups.
The format is sm.density.compare(x, factor) where x is a numeric vector and factor is the grouping variable.
# Compare MPG distributions for cars with
# 4, 6, or 8 cylinders
library(sm)
attach(mtcars)
# create value labels
cyl.f <- factor(cyl, levels= c(4,6,8),
labels = c("4 cylinder", "6 cylinder", "8 cylinder"))
# plot densities
sm.density.compare(mpg, cyl, xlab="Miles Per Gallon")
title(main="MPG Distribution by Car Cylinders")
# add legend via mouse click
colfill<-c(2:(2+length(levels(cyl.f))))
legend(locator(1), levels(cyl.f), fill=colfill)
Create dotplots with the dotchart(x, labels=) function, where x is a numeric vector and labels is a vector of labels for each point.
You can add a groups= option to designate a factor specifying how the elements of x are grouped.
If so, the option gcolor= controls the color of the groups label.
cex controls the size of the labels.
# Simple Dotplot
dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
main="Gas Mileage for Car Models",
xlab="Miles Per Gallon")
# Dotplot: Grouped, Sorted, and Colored
# Sort by mpg, group and color by cylinder
x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
xlab="Miles Per Gallon", gcolor="black", color=x$color)
Going Further
Advanced dotplots can be created with the dotchart2( ) function in the Hmisc package and with the panel.dotplot( ) function in the lattice package.
Create barplots with the barplot(height) function, where height is a vector or matrix.
If height is a vector, the values determine the heights of the bars in the plot.
If height is a matrix and the option beside=FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked “sub-bars”.
If height is a matrix and beside=TRUE, then the values in each column are juxtaposed rather than stacked.
Include option names.arg=(character vector) to label the bars.
Include the option horiz=TRUE to create a horizontal barplot.
Simple Bar Plot
# Simple Bar Plot
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution",
xlab="Number of Gears")
# Simple Horizontal Bar Plot with Added Labels
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"))
Stacked Bar Plot
# Stacked Bar Plot with Colors and Legend
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts))
Grouped Bar Plot
# Grouped Bar Plot
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE)
Notes
Bar plots need not be based on counts or frequencies.
You can create bar plots that represent means, medians, standard deviations, etc.
Use the aggregate( ) function and pass the results to the barplot( ) function.
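For instance, a minimal sketch of a bar plot of group means using the built-in mtcars data (grouping by gear is an arbitrary illustration):

```r
# Bar plot of mean mpg by number of gears:
# compute the means with aggregate(), then pass them to barplot()
means <- aggregate(mpg ~ gear, data = mtcars, FUN = mean)
barplot(means$mpg, names.arg = means$gear,
        main = "Mean MPG by Number of Gears",
        xlab = "Number of Gears", ylab = "Mean Miles Per Gallon")
```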
By default, the categorical axis line is suppressed.
Include the option axis.lty=1 to draw it.
With many bars, bar labels may start to overlap.
You can decrease the font size using the cex.names = option.
Values smaller than one will shrink the size of the label.
Additionally, you can use graphical parameters such as the following to help text spacing:
# Fitting Labels
par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"),
cex.names=0.8)
Overview
Line charts are created with the function lines(x, y, type=) where x and y are numeric vectors of (x,y) points to connect.
type= can take the following values:
type
description
p
points
l
lines
o
overplotted points and lines
b, c
points (empty if "c") joined by lines
s, S
stair steps
h
histogram-like vertical lines
n
does not produce any points or lines
The lines( ) function adds information to a graph.
It cannot produce a graph on its own.
Usually it follows a plot(x, y) command that produces a graph.
By default, plot( ) plots the (x,y) points.
Use the type="n" option in the plot( ) command, to create the graph with axes, titles, etc., but without plotting the points.
Example
In the following code each of the type= options is applied to the same dataset.
The plot( ) command sets up the graph, but does not plot the points.
x <- c(1:5); y <- x # create some data
par(pch=22, col="red") # plotting symbol and color
par(mfrow=c(2,4)) # all plots on one page
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
heading = paste("type=",opts[i])
plot(x, y, type="n", main=heading)
lines(x, y, type=opts[i])
}
Next, we demonstrate each of the type= options when plot( ) sets up the graph and does plot the points.
x <- c(1:5); y <- x # create some data
par(pch=22, col="blue") # plotting symbol and color
par(mfrow=c(2,4)) # all plots on one page
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
heading = paste("type=",opts[i])
plot(x, y, main=heading)
lines(x, y, type=opts[i])
}
As you can see, the type="c" option only looks different from the type="b" option if the plotting of points is suppressed in the plot( ) command.
To demonstrate the creation of a more complex line chart, let's plot the growth of 5 orange trees over time.
Each tree will have its own distinctive line.
The data come from the dataset Orange.
# Create Line Chart
# convert factor to numeric for convenience
Orange$Tree <- as.numeric(Orange$Tree)
ntrees <- max(Orange$Tree)
# get the range for the x and y axis
xrange <- range(Orange$age)
yrange <- range(Orange$circumference)
# set up the plot
plot(xrange, yrange, type="n", xlab="Age (days)",
ylab="Circumference (mm)" )
colors <- rainbow(ntrees)
linetype <- c(1:ntrees)
plotchar <- seq(18,18+ntrees,1)
# add lines
for (i in 1:ntrees) {
tree <- subset(Orange, Tree==i)
lines(tree$age, tree$circumference, type="b", lwd=1.5,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
# add a title and subtitle
title("Tree Growth", "example of line plot")
# add a legend
legend(xrange[1], yrange[2], 1:ntrees, cex=0.8, col=colors,
pch=plotchar, lty=linetype, title="Tree")
Pie charts are not recommended in the R documentation, and their features are somewhat limited.
The authors recommend bar or dot plots over pie charts because people are able to judge length more accurately than volume.
Pie charts are created with the function pie(x, labels=) where x is a non-negative numeric vector indicating the area of each slice and labels= notes a character vector of names for the slices.
Simple Pie Chart
# Simple Pie Chart
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
Pie Chart with Annotated Percentages
# Pie Chart with Percentages
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls, "%", sep="") # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Countries")
3D Pie Chart
The pie3D( ) function in the plotrix package provides 3D exploded pie charts.
# 3D Exploded Pie Chart
library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie3D(slices,labels=lbls,explode=0.1,
main="Pie Chart of Countries")
Creating Annotated Pies from a data frame
# Pie Chart from data frame with Appended Sample Sizes
mytable <- table(iris$Species)
lbls <- paste(names(mytable), "\n", mytable, sep="")
pie(mytable, labels = lbls,
main="Pie Chart of Species\n (with sample sizes)")
Boxplots can be created for individual variables or for variables by group.
The format is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data.
An example of a formula is y~group where a separate boxplot for numeric variable y is generated for each value of group.
Add varwidth=TRUE to make boxplot widths proportional to the square root of the sample sizes.
Add horizontal=TRUE to reverse the axis orientation.
# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")
# Notched Boxplot of Tooth Growth Against 2 Crossed Factors
# boxes colored for ease of interpretation
boxplot(len~supp*dose, data=ToothGrowth, notch=TRUE,
col=(c("gold","darkgreen")),
main="Tooth Growth", xlab="Supplement and Dose")
In the notched boxplot, if two boxes' notches do not overlap this is ‘strong evidence’ their medians differ (Chambers et al., 1983, p. 62).
Colors recycle.
In the example above, if I had listed 6 colors, each box would have its own color.
Earl F. Glynn has created an easy-to-use list of colors in PDF format.
Other Options
The boxplot.matrix( ) function in the sfsmisc package draws a boxplot for each column (row) in a matrix.
The boxplot.n( ) function in the gplots package annotates each boxplot with its sample size.
The bplot( ) function in the Rlab package offers many more options controlling the positioning and labeling of boxes in the output.
Violin Plots
A violin plot is a combination of a boxplot and a kernel density plot.
Violin plots can be created using the vioplot( ) function from the vioplot package.
# Violin Plots
library(vioplot)
x1 <- mtcars$mpg[mtcars$cyl==4]
x2 <- mtcars$mpg[mtcars$cyl==6]
x3 <- mtcars$mpg[mtcars$cyl==8]
vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"),
col="gold")
title("Violin Plots of Miles Per Gallon")
Bagplot - A 2D Boxplot Extension
The bagplot(x, y) function in the aplpack package provides a bivariate version of the univariate boxplot.
The bag contains 50% of all points.
The bivariate median is approximated.
The fence separates points inside the fence from outliers.
Outliers are displayed.
# Example of a Bagplot
library(aplpack)
attach(mtcars)
bagplot(wt,mpg, xlab="Car Weight", ylab="Miles Per Gallon",
main="Bagplot Example")
There are many ways to create a scatterplot in R.
The basic function is plot(x, y), where x and y are numeric vectors denoting the (x,y) points to plot.
# Simple Scatterplot
attach(mtcars)
plot(wt, mpg, main="Scatterplot Example",
xlab="Car Weight", ylab="Miles Per Gallon", pch=19)
# Add fit lines
abline(lm(mpg~wt), col="red") # regression line (y~x)
lines(lowess(wt,mpg), col="blue") # lowess line (x,y)
The scatterplot( ) function in the car package offers many enhanced features, including fit lines, marginal box plots, conditioning on a factor, and interactive point identification.
Each of these features is optional.
# Enhanced Scatterplot of MPG vs. Weight
# by Number of Car Cylinders
library(car)
scatterplot(mpg ~ wt | cyl, data=mtcars,
xlab="Weight of Car", ylab="Miles Per Gallon",
main="Enhanced Scatter Plot",
labels=row.names(mtcars))
Scatterplot Matrices
There are at least 4 useful functions for creating scatterplot matrices.
Analysts must love scatterplot matrices!
# Basic Scatterplot Matrix
pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
The lattice package provides options to condition the scatterplot matrix on a factor.
# Scatterplot Matrices from the lattice Package
library(lattice)
splom(mtcars[c(1,3,5,6)], groups=cyl, data=mtcars,
panel=panel.superpose,
key=list(title="Three Cylinder Options",
columns=3,
points=list(pch=super.sym$pch[1:3],
col=super.sym$col[1:3]),
text=list(c("4 Cylinder","6 Cylinder","8 Cylinder"))))
The car package can condition the scatterplot matrix on a factor, and optionally include lowess and linear best fit lines, and boxplot, densities, or histograms in the principal diagonal, as well as rug plots in the margins of the cells.
# Scatterplot Matrices from the car Package
library(car)
scatterplotMatrix(~mpg+disp+drat+wt|cyl, data=mtcars,
main="Three Cylinder Options")
The gclus package provides options to rearrange the variables so that those with higher correlations are closer to the principal diagonal.
It can also color code the cells to reflect the size of the correlations.
# Scatterplot Matrices from the gclus Package
library(gclus)
dta <- mtcars[c(1,3,5,6)] # get data
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Variables Ordered and Colored by Correlation"
)
High Density Scatterplots
When there are many data points and significant overlap, scatterplots become less useful.
There are several approaches that can be used when this occurs.
The hexbin(x, y) function in the hexbin package provides bivariate binning into hexagonal cells (it looks better than it sounds).
# High Density Scatterplot with Binning
library(hexbin)
x <- rnorm(1000)
y <- rnorm(1000)
bin<-hexbin(x, y, xbins=50)
plot(bin, main="Hexagonal Binning")
Another option for a scatterplot with significant point overlap is the sunflowerplot.
See help(sunflowerplot) for details.
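A minimal sketch (the rounded random data is just a way to force overlapping points):

```r
# Sunflower plot: repeated (x,y) points are drawn as "sunflowers"
# with one petal per duplicate observation
set.seed(42)             # arbitrary seed for reproducibility
x <- round(rnorm(150))   # rounding forces many identical points
y <- round(rnorm(150))
sunflowerplot(x, y, main = "Sunflower Plot Example")
```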
Finally, you can save the scatterplot in PDF format and use color transparency to allow points that overlap to show through (this idea comes from B. S. Everitt in HSAUR).
# High Density Scatterplot with Color Transparency
pdf("c:/scatterplot.pdf")
x <- rnorm(1000)
y <- rnorm(1000)
plot(x,y, main="PDF Scatterplot Example", col=rgb(0,100,0,50,maxColorValue=255), pch=16)
dev.off()
Note: You can use the col2rgb( ) function to get the rgb values for R colors.
For example, col2rgb("darkgreen") yields r=0, g=100, b=0.
Then add the alpha transparency level as the 4th number in the color vector.
A value of zero means fully transparent.
See help(rgb) for more information.
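Putting those pieces together, a small sketch that derives a semi-transparent color from a named one (the alpha value of 127, roughly half opacity, is an arbitrary choice):

```r
# Build a semi-transparent version of "darkgreen"
vals <- col2rgb("darkgreen")        # r=0, g=100, b=0
mycol <- rgb(vals[1], vals[2], vals[3],
             alpha = 127, maxColorValue = 255)  # alpha as the 4th value
plot(rnorm(500), rnorm(500), pch = 16, col = mycol,
     main = "Transparent darkgreen Points")
```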
3D Scatterplots
You can create a 3D scatterplot with the scatterplot3d package.
Use the function scatterplot3d(x,y,z).
# 3D Scatterplot
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, main="3D Scatterplot")
# 3D Scatterplot with Coloring and Vertical Drop Lines
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
# 3D Scatterplot with Coloring and Vertical Lines
# and Regression Plane
library(scatterplot3d)
attach(mtcars)
s3d <-scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
fit <- lm(mpg ~ wt+disp)
s3d$plane3d(fit)
Spinning 3D Scatterplots
You can also create an interactive 3D scatterplot using the plot3d(x, y, z) function in the rgl package.
It creates a spinning 3D scatterplot that can be rotated with the mouse.
The first three arguments are the x, y, and z numeric vectors representing points.
col= and size= control the color and size of the points respectively.
# Spinning 3d Scatterplot
library(rgl)
attach(mtcars)
plot3d(wt, disp, mpg, col="red", size=3)
You can perform a similar function with the scatter3d(x, y, z) function in the Rcmdr package.
# Another Spinning 3d Scatterplot
library(Rcmdr)
attach(mtcars)
scatter3d(wt, disp, mpg)
You can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.
One way is to specify these options through the par( ) function.
If you set parameter values here, the changes will be in effect for the rest of the session or until you change them again.
The format is par(optionname=value, optionname=value, ...)
# Set a graphical parameter using par()
par() # view current settings
opar <- par() # make a copy of current settings
par(col.lab="red") # red x and y labels
hist(mtcars$mpg) # create a plot with these new settings
par(opar) # restore original settings
A second way to specify graphical parameters is by providing the optionname=value pairs directly to a high level plotting function.
In this case, the options are only in effect for that specific graph.
# Set a graphical parameter within the plotting function
hist(mtcars$mpg, col.lab="red")
See the help for a specific high level plotting function (e.g. plot, hist, boxplot) to determine which graphical parameters can be set this way.
The remainder of this section describes some of the more important graphical parameters that you can set.
Text and Symbol Size
The following options can be used to control text and symbol size in graphs.
option
description
cex
number indicating the amount by which plotting text and symbols should be scaled relative to the default.
1=default, 1.5 is 50% larger, 0.5 is 50% smaller, etc.
cex.axis
magnification of axis annotation relative to cex
cex.lab
magnification of x and y labels relative to cex
cex.main
magnification of titles relative to cex
cex.sub
magnification of subtitles relative to cex
Plotting Symbols
Use the pch= option to specify symbols to use when plotting points.
For symbols 21 through 25, specify border color (col=) and fill color (bg=).
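For example, a short sketch showing the filled symbols with separate border and fill colors (the colors and layout are illustrative choices):

```r
# Symbols 21-25 take col= for the border and bg= for the fill
plot(1:5, rep(1, 5), pch = 21:25, cex = 3,
     col = "blue", bg = "lightblue",
     xlab = "pch 21 to 25", ylab = "", yaxt = "n",
     main = "Filled Plotting Symbols")
```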
Lines
You can change lines using the following options.
This is particularly useful for reference lines, axes, and fit lines.
option
description
lty
line type; see the chart below.
lwd
line width relative to the default (default=1).
2 is twice as wide.
Colors
Options that specify colors include the following.
option
description
col
Default plotting color.
Some functions (e.g. lines) accept a vector of values that are recycled.
col.axis
color for axis annotation
col.lab
color for x and y labels
col.main
color for titles
col.sub
color for subtitles
fg
plot foreground color (axes, boxes - also sets col= to same)
bg
plot background color
You can specify colors in R by index, name, hexadecimal, or RGB.
For example, col="white" and col="#FFFFFF" are equivalent, while col=1 refers to the first color in the current palette (black by default).
The following chart was produced with code developed by Earl F. Glynn.
See his Color Chart for all the details you would ever need about using colors in R.
You can also create a vector of n contiguous colors using the functions rainbow(n), heat.colors(n), terrain.colors(n), topo.colors(n), and cm.colors(n).
colors() returns all available color names.
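A quick sketch comparing the five palette functions, five colors each (the strip layout is an arbitrary way to display them):

```r
# Draw a strip of 5 colors from each built-in palette function
n <- 5
pals <- list(rainbow        = rainbow(n),
             heat.colors    = heat.colors(n),
             terrain.colors = terrain.colors(n),
             topo.colors    = topo.colors(n),
             cm.colors      = cm.colors(n))
par(mfrow = c(5, 1), mar = c(1, 8, 1, 1))
for (nm in names(pals)) {
  barplot(rep(1, n), col = pals[[nm]], border = NA, axes = FALSE)
  mtext(nm, side = 2, las = 2, cex = 0.8)  # palette name in left margin
}
```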
Fonts
You can easily set font size and style, but font family is a bit more complicated.
option
description
font
Integer specifying font to use for text.
1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol
font.axis
font for axis annotation
font.lab
font for x and y labels
font.main
font for titles
font.sub
font for subtitles
ps
font point size (roughly 1/72 inch)
text size=ps*cex
family
font family for drawing text.
Standard values are "serif", "sans", "mono", "symbol".
Mapping is device dependent.
On Windows, mono is mapped to "TT Courier New", serif to "TT Times New Roman", sans to "TT Arial", and symbol to "TT Symbol" (TT=TrueType).
You can add your own mappings.
# Type family examples - creating new mappings
plot(1:10,1:10,type="n")
windowsFonts(
A=windowsFont("Arial Black"),
B=windowsFont("Bookman Old Style"),
C=windowsFont("Comic Sans MS"),
D=windowsFont("Symbol")
)
text(3,3,"Hello World Default")
text(4,4,family="A","Hello World from Arial Black")
text(5,5,family="B","Hello World from Bookman Old Style")
text(6,6,family="C","Hello World from Comic Sans MS")
text(7,7,family="D", "Hello World from Symbol")
Margins and Graph Size
You can control the margin size using the following parameters.
For complete information on margins, see Earl F. Glynn's margin tutorial.
Going Further
See help(par) for more information on graphical parameters.
The customization of plotting axes and text annotations is covered in the next section.
Many high level plotting functions (plot, hist, boxplot, etc.) allow you to include axis and text options (as well as other graphical parameters).
For example
# Specify axis options within plot()
plot(x, y, main="title", sub="subtitle",
xlab="X-axis label", ylab="Y-axis label",
xlim=c(xmin, xmax), ylim=c(ymin, ymax))
For finer control or for modularization, you can use the functions described below.
Titles
Use the title( ) function to add labels to a plot.
title(main="main title", sub="sub-title",
xlab="x-axis label", ylab="y-axis label")
Many other graphical parameters (such as text size, font, rotation, and color) can also be specified in the title( ) function.
# Add a red title and a blue subtitle. Make x and y
# labels 25% smaller than the default and green.
title(main="My Title", col.main="red",
sub="My Sub-title", col.sub="blue",
xlab="My X label", ylab="My Y label",
col.lab="green", cex.lab=0.75)
Text Annotations
Text can be added to graphs using the text( ) and mtext( ) functions.
text( ) places text within the graph while mtext( ) places text in one of the four margins.
text(location, "text to place", pos, ...)
mtext("text to place", side, line=n, ...)
Common options are described below.
option
description
location
location can be an x,y coordinate.
Alternatively, the text can be placed interactively via mouse by specifying location as locator(1).
pos
position relative to location.
1=below, 2=left, 3=above, 4=right.
If you specify pos, you can specify offset= in percent of character width.
side
which margin to place text.
1=bottom, 2=left, 3=top, 4=right.
you can specify line= to indicate the line in the margin starting with 0 and moving out.
you can also specify adj=0 for left/bottom alignment or adj=1 for top/right alignment.
Other common options are cex, col, and font (for size, color, and font style respectively).
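A brief sketch of both functions (the coordinates, margin sides, and line numbers are illustrative choices):

```r
# text() draws inside the plot region; mtext() draws in a margin
plot(1:10, 1:10, main = "Annotation Example")
text(4, 8, "inside the plot region")       # at (x=4, y=8)
text(6, 2, "right of the point", pos = 4)  # pos=4 places text to the right
mtext("in the bottom margin", side = 1, line = 4)
mtext("in the right margin", side = 4, line = 0)
```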
Labeling points
You can use the text( ) function (see above) for labeling point as well as for adding other text annotations.
Specify location as a set of x, y coordinates and specify the text to place as a vector of labels.
The x, y, and label vectors should all be the same length.
# Example of labeling points
attach(mtcars)
plot(wt, mpg, main="Mileage vs. Car Weight",
xlab="Weight", ylab="Mileage", pch=18, col="blue")
text(wt, mpg, row.names(mtcars), cex=0.6, pos=4, col="red")
Math Annotations
You can add mathematical formulas to a graph using TeX-like rules.
See help(plotmath) for details and examples.
Axes
You can create custom axes using the axis( ) function.
axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...)
where
option
description
side
an integer indicating the side of the graph to draw the axis (1=bottom, 2=left, 3=top, 4=right)
at
a numeric vector indicating where tic marks should be drawn
labels
a character vector of labels to be placed at the tickmarks
(if NULL, the at values will be used)
pos
the coordinate at which the axis line is to be drawn.
(i.e., the value on the other axis where it crosses)
lty
line type
col
the line and tick mark color
las
labels are parallel (=0) or perpendicular (=2) to axis
tck
length of tick mark as fraction of plotting region (negative number is outside graph, positive number is inside, 0 suppresses ticks, 1 creates gridlines) default is -0.01
If you are going to create a custom axis, you should suppress the axis automatically generated by your high level plotting function.
The option axes=FALSE suppresses both x and y axes.
xaxt="n" and yaxt="n" suppress the x and y axis respectively.
Here is a (somewhat overblown) example.
# A Silly Axis Example
# specify the data
x <- c(1:10); y <- x; z <- 10/x
# create extra margin room on the right for an axis
par(mar=c(5, 4, 4, 8) + 0.1)
# plot x vs. y
plot(x, y,type="b", pch=21, col="red",
yaxt="n", lty=3, xlab="", ylab="")
# add x vs. 1/x
lines(x, z, type="b", pch=22, col="blue", lty=2)
# draw an axis on the left
axis(2, at=x,labels=x, col.axis="red", las=2)
# draw an axis on the right, with smaller text and ticks
axis(4, at=z,labels=round(z,digits=2),
col.axis="blue", las=2, cex.axis=0.7, tck=-.01)
# add a title for the right axis
mtext("y=1/x", side=4, line=3, cex.lab=1,las=2, col="blue")
# add a main title and bottom and left axis labels
title("An Example of Creative Axes", xlab="X values",
ylab="Y=X")
Minor Tick Marks
The minor.tick( ) function in the Hmisc package adds minor tick marks.
# Add minor tick marks
library(Hmisc)
minor.tick(nx=n, ny=n, tick.ratio=n)
nx is the number of minor tick marks to place between x-axis major tick marks.
ny does the same for the y-axis.
tick.ratio is the size of the minor tick mark relative to the major tick mark.
The length of the major tick mark is retrieved from par("tck").
Reference Lines
Add reference lines to a graph using the abline( ) function.
abline(h=yvalues, v=xvalues)
Other graphical parameters (such as line type, color, and width) can also be specified in the abline( ) function.
# add solid horizontal lines at y=1,5,7
abline(h=c(1,5,7))
# add dashed blue vertical lines at x = 1,3,5,7,9
abline(v=seq(1,10,2),lty=2,col="blue")
Note: You can also use the grid( ) function to add reference lines.
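For instance, a small sketch; by default grid( ) aligns dotted lines with the axis tick marks:

```r
# Add light gridlines aligned with the tick marks
plot(1:10, (1:10)^2, pch = 19, main = "Gridlines with grid()")
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")  # NULL = match ticks
```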
Legend
Add a legend with the legend() function.
legend(location, title, legend, ...)
Common options are described below.
option
description
location
There are several ways to indicate the location of the legend.
You can give an x,y coordinate for the upper left hand corner of the legend.
You can use locator(1), in which case you use the mouse to indicate the location of the legend.
You can also use the keywords "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", "bottomright", or "center".
If you use a keyword, you may want to use inset= to specify an amount to move the legend into the graph (as fraction of plot region).
title
A character string for the legend title (optional)
legend
A character vector with the labels
...
Other options.
If the legend labels colored lines, specify col= and a vector of colors.
If the legend labels point symbols, specify pch= and a vector of point symbols.
If the legend labels line width or line style, use lwd= or lty= and a vector of widths or styles.
To create colored boxes for the legend (common in bar, box, or pie charts), use fill= and a vector of colors.
Other common legend options include bty for box type, bg for background color, cex for size, and text.col for text color.
Setting horiz=TRUE sets the legend horizontally rather than vertically.
# Legend Example
attach(mtcars)
boxplot(mpg~cyl, main="Mileage by Number of Cylinders",
yaxt="n", xlab="Mileage", horizontal=TRUE,
col=terrain.colors(3))
legend("topright", inset=.05, title="Number of Cylinders",
c("4","6","8"), fill=terrain.colors(3), horiz=TRUE)
For more on legends, see help(legend).
The examples in the help are particularly informative.
R makes it easy to combine multiple plots into one overall graph, using either the par( ) or layout( ) function.
With the par( ) function, you can include the option mfrow=c(nrows, ncols) to create a matrix of nrows x ncols plots that are filled in by row.
mfcol=c(nrows, ncols) fills in the matrix by columns.
# 4 figures arranged in 2 rows and 2 columns
attach(mtcars)
par(mfrow=c(2,2))
plot(wt,mpg, main="Scatterplot of wt vs. mpg")
plot(wt,disp, main="Scatterplot of wt vs disp")
hist(wt, main="Histogram of wt")
boxplot(wt, main="Boxplot of wt")
# 3 figures arranged in 3 rows and 1 column
attach(mtcars)
par(mfrow=c(3,1))
hist(wt)
hist(mpg)
hist(disp)
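For contrast, a sketch of the mfcol variant; with mfcol=c(2,2) the first two plots fill the left column top to bottom before the right column is used:

```r
# 4 figures arranged in a 2x2 matrix, filled by columns
attach(mtcars)
par(mfcol=c(2,2))
plot(wt, mpg, main="1: wt vs mpg")     # row 1, column 1
plot(wt, disp, main="2: wt vs disp")   # row 2, column 1
hist(wt, main="3: Histogram of wt")    # row 1, column 2
boxplot(wt, main="4: Boxplot of wt")   # row 2, column 2
```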
The layout( ) function has the form layout(mat), where mat is a matrix object specifying the location of the N figures to plot.
# One figure in row 1 and two figures in row 2
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
Optionally, you can include widths= and heights= options in the layout( ) function to control the size of each figure more precisely.
These options have the form
widths= a vector of values for the widths of columns
heights= a vector of values for the heights of rows.
Relative widths are specified with numeric values.
Absolute widths (in centimetres) are specified with the lcm() function.
# One figure in row 1 and two figures in row 2
# row 1 is 1/3 the height of row 2
# column 2 is 1/4 the width of column 1
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE),
widths=c(3,1), heights=c(1,2))
hist(wt)
hist(mpg)
hist(disp)
See help(layout) for more details.
Creating a figure arrangement with fine control
In the following example, two box plots are added to scatterplot to create an enhanced graph.
# Add boxplots to a scatterplot
par(fig=c(0,0.8,0,0.8), new=TRUE)
plot(mtcars$wt, mtcars$mpg, xlab="Car Weight",
ylab="Miles Per Gallon")
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(mtcars$wt, horizontal=TRUE, axes=FALSE)
par(fig=c(0.65,1,0,0.8),new=TRUE)
boxplot(mtcars$mpg, axes=FALSE)
mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3)
To understand this graph, think of the full graph area as going from (0,0) in the lower left corner to (1,1) in the upper right corner.
The format of the fig= parameter is a numerical vector of the form c(x1, x2, y1, y2).
The first fig= sets up the scatterplot going from 0 to 0.8 on the x axis and 0 to 0.8 on the y axis.
The top boxplot goes from 0 to 0.8 on the x axis and 0.55 to 1 on the y axis.
I chose 0.55 rather than 0.8 so that the top figure will be pulled closer to the scatter plot.
The right hand boxplot goes from 0.65 to 1 on the x axis and 0 to 0.8 on the y axis.
Again, I chose a value to pull the right hand boxplot closer to the scatterplot.
You have to experiment to get it just right.
fig= starts a new plot, so to add to an existing plot use new=TRUE.
You can use this to combine several plots in any arrangement into one graph.
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships.
In particular, the package supports the creation of trellis graphs - graphs that display a variable or the relationship between variables, conditioned on one or more other variables.
The typical format is
graph_type(formula, data=)
where graph_type is selected from those listed below.
formula specifies the variable(s) to display and any conditioning variables.
For example ~x|A means display numeric variable x for each level of factor A.
y~x | A*B means display the relationship between numeric variables y and x separately for every combination of factor A and B levels.
~x means display numeric variable x alone.
graph_type
description
formula examples
barchart
bar chart
x~A or A~x
bwplot
boxplot
x~A or A~x
cloud
3D scatterplot
z~x*y|A
contourplot
3D contour plot
z~x*y
densityplot
kernel density plot
~x|A*B
dotplot
dotplot
~x|A
histogram
histogram
~x
levelplot
3D level plot
z~y*x
parallel
parallel coordinates plot
data frame
splom
scatterplot matrix
data frame
stripplot
strip plots
A~x or x~A
xyplot
scatterplot
y~x|A
wireframe
3D wireframe graph
z~y*x
Here are some examples.
They use the car data (mileage, weight, number of gears, number of cylinders, etc.) from the mtcars data frame.
# Lattice Examples
library(lattice)
attach(mtcars)
# create factors with value labels
gear.f<-factor(gear,levels=c(3,4,5),
labels=c("3gears","4gears","5gears"))
cyl.f <-factor(cyl,levels=c(4,6,8),
labels=c("4cyl","6cyl","8cyl"))
# kernel density plot
densityplot(~mpg,
main="Density Plot",
xlab="Miles per Gallon")
# kernel density plots by factor level
densityplot(~mpg|cyl.f,
main="Density Plot by Number of Cylinders",
xlab="Miles per Gallon")
# kernel density plots by factor level (alternate layout)
densityplot(~mpg|cyl.f,
main="Density Plot by Number of Cylinders",
xlab="Miles per Gallon",
layout=c(1,3))
# boxplots for each combination of two factors
bwplot(cyl.f~mpg|gear.f,
ylab="Cylinders", xlab="Miles per Gallon",
main="Mileage by Cylinders and Gears",
layout=c(1,3))
# scatterplots for each combination of two factors
xyplot(mpg~wt|cyl.f*gear.f,
main="Scatterplots by Cylinders and Gears",
ylab="Miles per Gallon", xlab="Car Weight")
# 3d scatterplot by factor level
cloud(mpg~wt*qsec|cyl.f,
main="3D Scatterplot by Cylinders")
# dotplot for each combination of two factors
dotplot(cyl.f~mpg|gear.f,
main="Dotplot Plot by Number of Gears and Cylinders",
xlab="Miles Per Gallon")
# scatterplot matrix
splom(mtcars[c(1,3,4,5,6)],
main="MTCARS Data")
Note, as in the first graph, that specifying a conditioning variable is optional.
The difference between graphs 2 and 3 is the use of the layout option to control the placement of panels.
Customizing Lattice Graphs
Unlike base R graphs, lattice graphs are not affected by many of the options set in the par( ) function.
To view the options that can be changed, look at help(xyplot).
It is frequently easiest to set these options within the high level plotting functions described above.
Additionally, you can write functions that modify the rendering of panels.
Here is an example.
# Customized Lattice Example
library(lattice)
panel.smoother <- function(x, y) {
panel.xyplot(x, y) # show points
panel.loess(x, y) # show smoothed line
}
attach(mtcars)
hp <- cut(hp,3) # divide horse power into three bands
xyplot(mpg~wt|hp, scales=list(cex=.8, col="red"),
panel=panel.smoother,
xlab="Weight", ylab="Miles per Gallon",
main="MGP vs Weight by Horse Power")
Going Further
Lattice graphics are a comprehensive graphical system in their own right.
Deepayan Sarkar's book Lattice: Multivariate Data Visualization with R is the definitive reference.
Additionally, see the Trellis User's Guide.
Dr. Ihaka has created a wonderful set of slides on the subject.
An excellent early consideration of trellis graphs can be found in W.S. Cleveland's classic book Visualizing Data.
The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating elegant and complex plots.
Its popularity in the R community has exploded in recent years.
Originally based on Leland Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner.
Grouping can be represented by color, symbol, size, and transparency.
The creation of trellis plots (i.e., conditioning) is relatively simple.
Mastering the ggplot2 language can be challenging (see the Going Further section below for helpful resources).
There is a helper function called qplot() (for quick plot) that can hide much of this complexity when creating standard graphs.
qplot()
The qplot() function can be used to create the most common graph types.
While it does not expose ggplot's full power, it can create a very wide range of useful plots.
The format is:
qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
where the options are:
option
description
alpha
Alpha transparency for overlapping elements expressed as a fraction between 0 (complete transparency) and 1 (complete opacity)
color, shape, size, fill
Associates the levels of a variable with symbol color, shape, or size.
For line plots, color associates levels of a variable with line color.
For density and box plots, fill associates fill colors with a variable.
Legends are drawn automatically.
data
Specifies a data frame
facets
Creates a trellis graph by specifying conditioning variables.
Its value is expressed as rowvar ~ colvar.
To create trellis graphs based on a single conditioning variable, use rowvar ~ . or . ~ colvar.
geom
Specifies the geometric objects that define the graph type.
The geom option is expressed as a character vector with one or more entries.
geom values include "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".
main, sub
Character vectors specifying the title and subtitle
method, formula
If geom="smooth", a loess fit line and confidence limits are added by default.
When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed.
Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression.
The formula parameter gives the form of the fit.
For example, to add simple linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x.
Changing the formula to y~poly(x,2) would produce a quadratic fit.
Note that the formula uses the letters x and y, not the names of the variables.
For method="gam", be sure to load the mgcv package.
For method="rlm", load the MASS package.
x, y
Specifies the variables placed on the horizontal and vertical axis.
For univariate plots (for example, histograms), omit y
xlab, ylab
Character vectors specifying horizontal and vertical axis labels
xlim,ylim
Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively
Notes:
At present, ggplot2 cannot be used to create 3D graphs or mosaic plots.
Use I(value) to indicate a specific value.
For example, size=z makes the size of the plotted points or lines proportional to the values of a variable z.
In contrast, size=I(3) sets each point or line to three times the default size.
Here are some examples using automotive data (car mileage, weight, number of gears, number of cylinders, etc.) contained in the mtcars data frame.
# ggplot2 examples
library(ggplot2)
# create factors with value labels
mtcars$gear <- factor(mtcars$gear,levels=c(3,4,5),
labels=c("3gears","4gears","5gears"))
mtcars$am <- factor(mtcars$am,levels=c(0,1),
labels=c("Automatic","Manual"))
mtcars$cyl <- factor(mtcars$cyl,levels=c(4,6,8),
labels=c("4cyl","6cyl","8cyl"))
# Kernel density plots for mpg
# grouped by number of gears (indicated by color)
qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
main="Distribution of Gas Mileage", xlab="Miles Per Gallon",
ylab="Density")
# Scatterplot of mpg vs. hp for each combination of gears and cylinders
# in each facet, transmission type is represented by shape and color
qplot(hp, mpg, data=mtcars, shape=am, color=am,
facets=gear~cyl, size=I(3),
xlab="Horsepower", ylab="Miles per Gallon")
# Separate regressions of mpg on weight for each number of cylinders
qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
method="lm", formula=y~x, color=cyl,
main="Regression of MPG on Weight",
xlab="Weight", ylab="Miles per Gallon")
# Boxplots of mpg by number of gears
# observations (points) are overlaid and jittered
qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"),
fill=gear, main="Mileage by Gear Number",
xlab="", ylab="Miles per Gallon")
Customizing ggplot2 Graphs
Unlike base R graphs, ggplot2 graphs are not affected by many of the options set in the par( ) function.
They can be modified using the theme() function, and by adding graphic parameters within the qplot() function.
For greater control, use ggplot() and other functions provided by the package.
Note that ggplot2 functions can be chained with "+" signs to generate the final plot.
library(ggplot2)
p <- qplot(hp, mpg, data=mtcars, shape=am, color=am,
facets=gear~cyl, main="Scatterplots of MPG vs. Horsepower",
xlab="Horsepower", ylab="Miles per Gallon")
# White background and black grid lines
p + theme_bw()
# Large brown bold italics labels
# and legend placed at top of plot
p + theme(axis.title=element_text(face="bold.italic",
size=12, color="brown"), legend.position="top")
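For comparison, here is an illustrative sketch of a similar plot built with ggplot() itself rather than qplot(); the layer and mapping choices are examples, and it assumes am, gear, and cyl have been converted to factors as in the earlier examples:

```r
# Equivalent construction with the full ggplot() grammar
library(ggplot2)
ggplot(mtcars, aes(x=hp, y=mpg, shape=am, color=am)) +
  geom_point(size=3) +
  facet_grid(gear~cyl) +
  labs(title="Scatterplots of MPG vs. Horsepower",
       x="Horsepower", y="Miles per Gallon") +
  theme_bw()
```

Each call added with "+" contributes one component (points, facets, labels, theme), which is what makes the full grammar more flexible than qplot().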
This section describes creating probability plots in R for both didactic purposes and for data analyses.
Probability Plots for Teaching and Demonstration
When I was a college professor teaching statistics, I used to have to draw normal distributions by hand.
They always came out looking like bunny rabbits.
What can I say?
R makes it easy to draw probability distributions and demonstrate statistical concepts.
Some of the more common probability distributions available in R are given below.
distribution     R name     distribution          R name
Beta             beta       Lognormal             lnorm
Binomial         binom      Negative Binomial     nbinom
Cauchy           cauchy     Normal                norm
Chi-square       chisq      Poisson               pois
Exponential      exp        Student t             t
F                f          Uniform               unif
Gamma            gamma      Tukey                 tukey
Geometric        geom       Weibull               weibull
Hypergeometric   hyper      Wilcoxon              wilcox
Logistic         logis
For a comprehensive list, see Statistical Distributions on the R wiki.
The functions available for each distribution follow this format:
name       description
dname( )   density or probability mass function
pname( )   cumulative distribution function
qname( )   quantile function
rname( )   random deviates
For example, pnorm(0) = 0.5 (the area under the standard normal curve to the left of zero).
qnorm(0.9) = 1.28 (1.28 is the 90th percentile of the standard normal distribution).
rnorm(100) generates 100 random deviates from a standard normal distribution.
Each function has parameters specific to that distribution.
For example, rnorm(100, m=50, sd=10) generates 100 random deviates from a normal distribution with mean 50 and standard deviation 10.
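These calls can be tried directly at the console; the approximate values noted in the comments follow from the standard normal distribution:

```r
dnorm(0)                  # density at zero, 1/sqrt(2*pi), about 0.3989
pnorm(0)                  # area to the left of zero: 0.5
qnorm(0.9)                # 90th percentile, about 1.2816
rnorm(5, mean=50, sd=10)  # five random deviates, mean 50, sd 10
```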
You can use these functions to demonstrate various aspects of probability distributions.
Two common examples are given below.
# Display the Student's t distributions with various
# degrees of freedom and compare to the normal distribution
x <- seq(-4, 4, length=100)
hx <- dnorm(x)
degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")
plot(x, hx, type="l", lty=2, xlab="x value",
ylab="Density", main="Comparison of t Distributions")
for (i in 1:4){
lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}
legend("topright", inset=.05, title="Distributions",
labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)
# Children's IQ scores are normally distributed with a
# mean of 100 and a standard deviation of 15. What
# proportion of children are expected to have an IQ between
# 80 and 120?
mean=100; sd=15
lb=80; ub=120
x <- seq(-4,4,length=100)*sd + mean
hx <- dnorm(x,mean,sd)
plot(x, hx, type="n", xlab="IQ Values", ylab="",
main="Normal Distribution", axes=FALSE)
i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="red")
area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(",lb,"< IQ <",ub,") =",
signif(area, digits=3))
mtext(result,3)
axis(1, at=seq(40, 160, 20), pos=0)
For a comprehensive view of probability plotting in R, see Vincent Zonekynd's Probability Distributions.
Fitting Distributions
There are several methods of fitting distributions in R.
Here are some options.
You can use the qqnorm( ) function to create a Quantile-Quantile plot evaluating the fit of sample data to the normal distribution.
More generally, the qqplot( ) function creates a Quantile-Quantile plot for any theoretical distribution.
# Q-Q plots
par(mfrow=c(1,2))
# create sample data
x <- rt(100, df=3)
# normal fit
qqnorm(x);
qqline(x)
# t(3Df) fit
qqplot(rt(1000,df=3), x, main="t(3) Q-Q Plot",
ylab="Sample Quantiles")
abline(0,1)
The fitdistr() function in the MASS package provides maximum-likelihood fitting of univariate distributions.
The format is fitdistr(x, densityfunction) where x is the sample data and densityfunction is one of the following: "beta", "cauchy", "chi-squared", "exponential", "f", "gamma", "geometric", "log-normal", "lognormal", "logistic", "negative binomial", "normal", "Poisson", "t" or "weibull".
# Estimate parameters assuming log-Normal distribution
# create some sample data
x <- rlnorm(100)
# estimate parameters
library(MASS)
fitdistr(x, "lognormal")
Finally, R has a wide range of goodness-of-fit tests for evaluating whether it is reasonable to assume that a random sample comes from a specified theoretical distribution.
These include chi-square, Kolmogorov-Smirnov, and Anderson-Darling.
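For example, the Kolmogorov-Smirnov test is available in base R as ks.test(). The sketch below is illustrative; note that estimating the distribution's parameters from the same sample, as done here for simplicity, makes the reported p-value only approximate:

```r
# Kolmogorov-Smirnov test: is the sample consistent with a normal distribution?
set.seed(123)
x <- rnorm(100, mean=10, sd=2)
ks.test(x, "pnorm", mean=mean(x), sd=sd(x))
```

A large p-value indicates no evidence against the hypothesized distribution. An Anderson-Darling test for normality is provided by ad.test() in the nortest package.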
For more details on fitting distributions, see Vito Ricci's Fitting Distributions with R.
For general (non R) advice, see Bill Huber's Fitting Distributions to Data.
The vcd package provides a variety of methods for visualizing multivariate categorical data, inspired by Michael Friendly's wonderful "Visualizing Categorical Data".
Extended mosaic and association plots are described here.
Each provides a method of visualizing complex data and evaluating deviations from a specified independence model.
For more details, see The Strucplot Framework.
Mosaic Plots
For extended mosaic plots, use mosaic(x, condvar=, data=) where x is a table or formula, condvar= is an optional conditioning variable, and data= specifies a data frame or a table.
Include shade=TRUE to color the figure, and legend=TRUE to display a legend for the Pearson residuals.
# Mosaic Plot Example
library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)
Association Plots
To produce an extended association plot use assoc(x, row_vars, col_vars) where x is a contingency table, row_vars is a vector of integers giving the indices of the variables to be used for the rows, and col_vars is a vector of integers giving the indices of the variables to be used for the columns of the association plot.
# Association Plot Example
library(vcd)
assoc(HairEyeColor, shade=TRUE)
Going Further
Both functions are complex and offer multiple input and output options.
See help(mosaic) and help(assoc) for more details.
The corrgram( ) function in the corrgram package produces correlograms from a data frame x with one observation per row.
order=TRUE will cause the variables to be ordered using principal component analysis of the correlation matrix.
panel= refers to the off-diagonal panels.
You can use lower.panel= and upper.panel= to choose different options below and above the main diagonal respectively.
text.panel= and diag.panel= refer to the main diagonal.
Allowable parameters are given below.
off-diagonal panels:
panel.pie (the filled portion of the pie indicates the magnitude of the correlation)
panel.shade (the depth of the shading indicates the magnitude of the correlation)
panel.ellipse (confidence ellipse and smoothed line)
panel.pts (scatterplot)
main diagonal panels:
panel.minmax (min and max values of the variable)
panel.txt (variable name)
# First Correlogram Example
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Car Mileage Data in PC2/PC1 Order")
# Second Correlogram Example
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.ellipse,
upper.panel=panel.pts,
text.panel=panel.txt,
diag.panel=panel.minmax,
main="Car Mileage Data in PC2/PC1 Order")
# Third Correlogram Example
library(corrgram)
corrgram(mtcars, order=NULL, lower.panel=panel.shade,
upper.panel=NULL, text.panel=panel.txt,
main="Car Mileage Data (unsorted)")
Changing the colors in a correlogram
You can control the colors in a correlogram by specifying 4 colors in the colorRampPalette( ) function within the col.corrgram( ) function.
Here is an example.
# Changing Colors in a Correlogram
library(corrgram)
col.corrgram <- function(ncol){
colorRampPalette(c("darkgoldenrod4", "burlywood1",
"darkkhaki", "darkgreen"))(ncol)}
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Correlogram of Car Mileage Data (PC2/PC1 Order)")
There are several ways to interact with R graphics in real time.
Three methods are described below.
GGobi
GGobi is an open source visualization program for exploring high-dimensional data.
It is freely available for MS Windows, Linux, and Mac platforms.
It supports linked interactive scatterplots, barcharts, parallel coordinate plots and tours, with both brushing and identification.
A good tutorial is included with the GGobi manual.
You can download the software here.
Once GGobi is installed, you can use the ggobi( ) function in the rggobi package to run GGobi from within R. This gives you interactive graphics access to all of your R data! See An Introduction to RGGOBI.
# Interact with R data using GGobi
library(rggobi)
g <- ggobi(mydata)
iPlots
The iplots package provides interactive mosaic plots, bar plots, box plots, parallel plots, scatter plots and histograms that can be linked and color brushed.
iplots is implemented through the Java GUI for R.
For more information, see the iplots website.
# Install iplots
install.packages("iplots",dep=TRUE)
# Create some linked plots
library(iplots)
cyl.f <- factor(mtcars$cyl)
gear.f <- factor(mtcars$gear)
attach(mtcars)
ihist(mpg) # histogram
ibar(carb) # barchart
iplot(mpg, wt) # scatter plot
ibox(mtcars[c("qsec","disp","hp")]) # boxplots
ipcp(mtcars[c("mpg","wt","hp")]) # parallel coordinates
imosaic(cyl.f,gear.f) # mosaic plot
On Windows platforms, hold down the Ctrl key and move the mouse over each graph to get identifying information from points, bars, etc.
Interacting with Plots (Identifying Points)
R offers two functions for identifying points and coordinate locations in plots.
With identify(), clicking the mouse over points in a graph will display the row number or (optionally) the rowname for the point.
This continues until you select stop.
With locator() you can add points or lines to the plot using the mouse.
The function returns a list of the (x,y) coordinates.
Again, this continues until you select stop.
# Interacting with a scatterplot
attach(mydata)
plot(x, y) # scatterplot
identify(x, y, labels=row.names(mydata)) # identify points
coords <- locator(type="l") # add lines
coords # display list
Other Interactive Graphs
See scatterplots for a description of rotating 3D scatterplots in R.