R is a programming language and software environment for statistical analysis, graphics representation and reporting.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions.
R allows integration with procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft, and an official part of the GNU project called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand.
R made its first appearance in 1993.
A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.
Features of R
As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting.
The following are the important features of R -
R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly at the computer or on hardcopy.
In conclusion, R is the world’s most widely used statistics programming language.
It is the #1 choice of data scientists and is supported by a vibrant and talented community of contributors.
R is taught in universities and deployed in mission critical business applications.
This tutorial will teach you R programming along with suitable examples in simple and easy steps.
Local Environment Setup
If you want to set up your environment for R, you can follow the steps given below.
Windows Installation
You can download the Windows installer version of R from R-3.2.2 for Windows (32/64 bit) and save it in a local directory.
As it is a Windows installer (.exe) with a name like "R-version-win.exe", you can just double-click and run the installer, accepting the default settings.
If your Windows is a 32-bit version, it installs the 32-bit version.
But if your Windows is 64-bit, then it installs both the 32-bit and 64-bit versions.
After installation, you can locate the icon to run the program in a directory structure "R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files.
Clicking this icon brings up the R-GUI which is the R console to do R Programming.
Linux Installation
R is available as a binary for many versions of Linux at the location R Binaries.
The instructions to install R vary from flavor to flavor of Linux.
These steps are mentioned under each type of Linux version at the mentioned link.
However, if you are in a hurry, then you can use the yum command to install R as follows -
$ yum install R
The above command will install the core functionality of R programming along with standard packages. If you need an additional package, you can launch the R prompt as follows -
$ R
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
Now you can use the install.packages() command at the R prompt to install a required package.
For example, the following command will install the plotrix package which is required for 3D charts.
> install.packages("plotrix")
R - Basic Syntax
As a convention, we will start learning R programming by writing a "Hello, World!" program.
Depending on the needs, you can program either at R command prompt or you can use an R script file to write your program.
Let's check both one by one.
R Command Prompt
Once you have the R environment set up, it’s easy to start the R command prompt by just typing the following command at your command prompt -
$ R
This will launch the R interpreter and you will get a prompt > where you can start typing your program as follows -
> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"
Here the first statement defines a string variable myString, to which we assign the string "Hello, World!"; the next statement uses print() to print the value stored in the variable myString.
R Script File
Usually, you will do your programming by writing your programs in script files and then executing those scripts at your command prompt with the help of the R interpreter called Rscript.
So let's start with writing the following code in a text file called test.R -
# My first program in R Programming
myString <- "Hello, World!"
print(myString)
Save the above code in a file test.R and execute it at Linux command prompt as given below.
Even if you are using Windows or another system, the syntax will remain the same.
$ Rscript test.R
When we run the above program, it produces the following result.
[1] "Hello, World!"
Comments
Comments are like helping text in your R program and they are ignored by the interpreter while executing your actual program.
A single comment is written using # at the beginning of the statement as follows -
# My first program in R Programming
R does not support multi-line comments, but you can use a trick which is something as follows -
if(FALSE) {
   "This is a demo for multi-line comments and it should be put inside either a
   single OR double quote"
}
myString <- "Hello, World!"
print(myString)
[1] "Hello, World!"
Though the above comment will be evaluated by the R interpreter, it will not interfere with your actual program.
You should put such comments inside either single or double quotes.
R - Data Types
Generally, while programming in any language, you need to use various variables to store various information.
Variables are nothing but reserved memory locations to store values.
This means that when you create a variable, you reserve some space in memory.
You may like to store information of various data types like character, wide character, integer, floating point, double floating point, Boolean etc.
Based on the data type of a variable, the operating system allocates memory and decides what can be stored in the reserved memory.
In contrast to other programming languages like C and Java, in R the variables are not declared as some data type.
The variables are assigned with R-Objects and the data type of the R-object becomes the data type of the variable.
There are many types of R-objects.
The frequently used ones are -
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object, and there are six data types of these atomic vectors, also termed as the six classes of vectors.
The other R-Objects are built upon the atomic vectors.
Data Type
Example
Verify
Logical
TRUE, FALSE
v <- TRUE
print(class(v))
it produces the following result -
[1] "logical"
Numeric
12.3, 5, 999
v <- 23.5
print(class(v))
it produces the following result -
[1] "numeric"
Integer
2L, 34L, 0L
v <- 2L
print(class(v))
it produces the following result -
[1] "integer"
Complex
3 + 2i
v <- 2+5i
print(class(v))
it produces the following result -
[1] "complex"
Character
'a', "good", "TRUE", '23.4'
v <- "TRUE"
print(class(v))
it produces the following result -
[1] "character"
Raw
"Hello" is stored as 48 65 6c 6c 6f
v <- charToRaw("Hello")
print(class(v))
it produces the following result -
[1] "raw"
In R programming, the very basic data types are the R-objects called vectors, each of which holds elements of one of the classes shown above.
Please note that in R the number of classes is not confined to only the above six types.
For example, we can use many atomic vectors and create an array whose class will become array.
Vectors
When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
When we execute the above code, it produces the following result -
[1] "red" "green" "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements inside it like vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
When we execute the above code, it produces the following result -
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set.
It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result -
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions.
The array function takes a dim attribute which creates the required number of dimensions.
In the below example we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result -
, , 1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 2
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
Factors
Factors are R-objects which are created using a vector.
A factor stores the vector along with the distinct values of the elements in the vector as labels.
The labels are always character, irrespective of whether the input vector is numeric, character or Boolean.
Factors are useful in statistical modeling.
Factors are created using the factor() function.
The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple <- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result -
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
Data Frames
Data frames are tabular data objects.
Unlike a matrix, in a data frame each column can contain a different mode of data.
The first column can be numeric while the second column can be character and the third column can be logical.
It is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   Age = c(42, 38, 26)
)
print(BMI)
When we execute the above code, it produces the following result -
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
R - Variables
A variable provides us with named storage that our programs can manipulate.
A variable in R can store an atomic vector, a group of atomic vectors or a combination of many R-objects.
A valid variable name consists of letters, numbers and the dot or underline characters.
The variable name starts with a letter or a dot not followed by a number.
Variable Name         Validity   Reason
var_name2.            valid      Has letters, numbers, dot and underscore.
var_name%             invalid    Has the character '%'. Only dot (.) and underscore are allowed.
2var_name             invalid    Starts with a number.
.var_name, var.name   valid      Can start with a dot (.), but the dot should not be followed by a number.
.2var_name            invalid    The starting dot is followed by a number, making it invalid.
_var_name             invalid    Starts with _, which is not valid.
Variable Assignment
The variables can be assigned values using the leftward, rightward and equal-to operators.
The values of the variables can be printed using print() or cat() function.
The cat() function combines multiple items into a continuous print output.
# Assignment using equal operator.
var.1 = c(0,1,2,3)
# Assignment using leftward operator.
var.2 <- c("learn","R")
# Assignment using rightward operator.
c(TRUE,1) -> var.3
print(var.1)
cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")
When we execute the above code, it produces the following result -
[1] 0 1 2 3
var.1 is 0 1 2 3
var.2 is learn R
var.3 is 1 1
Note - The vector c(TRUE,1) has a mix of the logical and numeric classes.
So the logical class is coerced to the numeric class, turning TRUE into 1.
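The coercion described in the note above can be checked directly with class(); a minimal sketch (the variable name is illustrative):

```r
# Mixing logical and numeric values in c() coerces the
# logical values to numeric, so TRUE becomes 1.
v <- c(TRUE, 1)
print(class(v))   # [1] "numeric"
print(v)          # [1] 1 1
```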
Data Type of a Variable
In R, a variable itself is not declared of any data type, rather it gets the data type of the R - object assigned to it.
So R is called a dynamically typed language, which means that we can change the data type of the same variable again and again when using it in a program.
var_x <- "Hello"
cat("The class of var_x is ",class(var_x),"\n")
var_x <- 34.5
cat(" Now the class of var_x is ",class(var_x),"\n")
var_x <- 27L
cat(" Next the class of var_x becomes ",class(var_x),"\n")
When we execute the above code, it produces the following result -
The class of var_x is character
Now the class of var_x is numeric
Next the class of var_x becomes integer
Finding Variables
To know all the variables currently available in the workspace, we use the ls() function.
print(ls())
When we execute the above code, it produces the following result -
[1] "my var" "my_new_var" "my_var" "var.1"
[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"
Note - It is a sample output depending on what variables are declared in your environment.
The ls() function can use patterns to match the variable names.
# List the variables starting with the pattern "var".
print(ls(pattern = "var"))
When we execute the above code, it produces the following result -
[1] "my var" "my_new_var" "my_var" "var.1"
[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"
The variables starting with a dot (.) are hidden; they can be listed using the "all.names = TRUE" argument to the ls() function.
print(ls(all.names = TRUE))
When we execute the above code, it produces the following result -
[1] ".cars"        ".Random.seed" ".var_name"    ".varname"     ".varname2"
[6] "my var"       "my_new_var"   "my_var"       "var.1"        "var.2"
[11] "var.3"       "var.name"     "var_name2."   "var_x"
Deleting Variables
Variables can be deleted by using the rm() function.
Below we delete the variable var.3.
On printing the value of the deleted variable, an error is thrown.
rm(var.3)
print(var.3)
When we execute the above code, it produces the following result -
Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm() and ls() function together.
rm(list = ls())
print(ls())
When we execute the above code, it produces the following result -
character(0)
R - Operators
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical manipulation.
The R language is rich in built-in operators and provides the following types of operators.
Types of Operators
We have the following types of operators in R programming -
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
Following table shows the arithmetic operators supported by R language.
The operators act on each element of the vector.
Operator
Description
Example
+
Adds two vectors
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v+t)
it produces the following result -
[1] 10.0 8.5 10.0
-
Subtracts second vector from the first
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v-t)
it produces the following result -
[1] -6.0 2.5 2.0
*
Multiplies both vectors
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v*t)
it produces the following result -
[1] 16.0 16.5 24.0
/
Divide the first vector with the second
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v/t)
When we execute the above code, it produces the following result -
[1] 0.250000 1.833333 1.500000
%%
Gives the remainder of the first vector divided by the second
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%%t)
it produces the following result -
[1] 2.0 2.5 2.0
%/%
The result of division of the first vector by the second (quotient)
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%/%t)
it produces the following result -
[1] 0 1 1
^
The first vector raised to the exponent of second vector
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v^t)
it produces the following result -
[1] 256.000 166.375 1296.000
Relational Operators
Following table shows the relational operators supported by R language.
Each element of the first vector is compared with the corresponding element of the second vector.
The result of comparison is a Boolean value.
Operator
Description
Example
>
Checks if each element of the first vector is greater than the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>t)
it produces the following result -
[1] FALSE TRUE FALSE FALSE
<
Checks if each element of the first vector is less than the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v < t)
it produces the following result -
[1] TRUE FALSE TRUE FALSE
==
Checks if each element of the first vector is equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
it produces the following result -
[1] FALSE FALSE FALSE TRUE
<=
Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v<=t)
it produces the following result -
[1] TRUE FALSE TRUE TRUE
>=
Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>=t)
it produces the following result -
[1] FALSE TRUE FALSE TRUE
!=
Checks if each element of the first vector is unequal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
it produces the following result -
[1] TRUE TRUE TRUE FALSE
Logical Operators
Following table shows the logical operators supported by R language.
They are applicable only to vectors of type logical, numeric or complex.
All non-zero numbers are treated as the logical value TRUE, and zero as FALSE.
Each element of the first vector is compared with the corresponding element of the second vector.
The result of comparison is a Boolean value.
Operator
Description
Example
&
It is called the Element-wise Logical AND operator.
It combines each element of the first vector with the corresponding element of the second vector and gives an output of TRUE if both elements are TRUE.
v <- c(3,1,TRUE,2+3i)
t <- c(4,1,FALSE,2+3i)
print(v&t)
it produces the following result -
[1] TRUE TRUE FALSE TRUE
|
It is called the Element-wise Logical OR operator.
It combines each element of the first vector with the corresponding element of the second vector and gives an output of TRUE if at least one of the elements is TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
print(v|t)
it produces the following result -
[1] TRUE FALSE TRUE TRUE
!
It is called the Logical NOT operator.
It takes each element of the vector and gives the opposite logical value.
v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result -
[1] FALSE TRUE FALSE FALSE
The logical operators && and || consider only the first element of the vectors and give a vector of a single element as output.
Operator
Description
Example
&&
Called the Logical AND operator.
Takes the first element of both vectors and gives TRUE only if both are TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v&&t)
it produces the following result -
[1] TRUE
||
Called the Logical OR operator.
Takes the first element of both vectors and gives TRUE if either one is TRUE.
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v||t)
it produces the following result -
[1] FALSE
Assignment Operators
These operators are used to assign values to vectors.
Operator
Description
Example
<-
or
=
or
<<-
Called Left Assignment
v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
it produces the following result -
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
->
or
->>
Called Right Assignment
c(3,1,TRUE,2+3i) -> v1
c(3,1,TRUE,2+3i) ->> v2
print(v1)
print(v2)
it produces the following result -
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical computation.
Operator
Description
Example
:
Colon operator.
It creates a series of numbers in sequence for a vector.
v <- 2:8
print(v)
it produces the following result -
[1] 2 3 4 5 6 7 8
%in%
This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result -
[1] TRUE
[1] FALSE
%*%
This operator is used for matrix multiplication. In the example below, a matrix is multiplied by its transpose.
M = matrix( c(2,6,5,1,10,4), nrow = 2,ncol = 3,byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result -
[,1] [,2]
[1,] 65 82
[2,] 82 117
R - Decision making
Decision making structures require the programmer to specify one or more conditions to be evaluated or tested by the program, along with a statement or statements to be executed if the condition is determined to be true, and optionally, other statements to be executed if the condition is determined to be false.
R provides the following types of decision making statements.
Click the following links to check their detail.
Sr.No.
Statement & Description
1
if statement
An if statement consists of a Boolean expression followed by one or more statements.
2
if...else statement
An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.
3
switch statement
A switch statement allows a variable to be tested for equality against a list of values.
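As a quick sketch of the statements listed above (the variable names and values here are illustrative, not from the linked pages):

```r
x <- 15

# if...else: executes one branch based on a condition.
if(x > 10) {
   print("x is greater than 10")
} else {
   print("x is 10 or less")
}

# switch: tests a value against a list of alternatives
# and returns the matching one.
grade <- switch("B",
   "A" = "Excellent",
   "B" = "Good",
   "C" = "Average"
)
print(grade)   # [1] "Good"
```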
R - Loops
There may be a situation when you need to execute a block of code several times.
In general, statements are executed sequentially.
The first statement in a function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple times.
R programming language provides the following kinds of loop to handle looping requirements.
Click the following links to check their detail.
Sr.No.
Loop Type & Description
1
repeat loop
Executes a sequence of statements repeatedly until a break statement is encountered; the body is executed at least once.
2
while loop
Repeats a statement or group of statements while a given condition is true.
It tests the condition before executing the loop body.
3
for loop
Executes a group of statements once for each element of a vector or list.
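The three loop types above can be sketched as follows (the vectors and counters are illustrative):

```r
# for loop: runs once for each element of a vector.
for(v in c("Hello", "loop")) {
   print(v)
}

# while loop: tests the condition before each pass.
cnt <- 1
while(cnt <= 3) {
   print(cnt)
   cnt <- cnt + 1
}

# repeat loop: runs at least once; stops only via break.
i <- 1
repeat {
   i <- i * 2
   if(i > 8) break
}
print(i)   # [1] 16
```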
Loop Control Statements
Loop control statements change execution from its normal sequence.
R supports the following control statements.
Click the following links to check their detail.
Sr.No.
Control Statement & Description
1
break statement
Terminates the loop statement and transfers execution to the statement immediately following the loop.
2
next statement
The next statement skips the remainder of the current iteration and causes the loop to continue with the next one.
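A minimal sketch showing break and next together (the loop range is illustrative):

```r
for(i in 1:6) {
   if(i == 3) next    # skip the rest of this iteration
   if(i == 5) break   # terminate the loop entirely
   print(i)
}
# Prints 1, 2 and 4: the value 3 is skipped by next,
# and break stops the loop before 5 and 6 are printed.
```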
R - Functions
A function is a set of statements organized together to perform a specific task.
R has a large number of in-built functions and the user can create their own functions.
In R, a function is an object so the R interpreter is able to pass control to the function, along with arguments that may be necessary for the function to accomplish the actions.
The function in turn performs its task and returns control to the interpreter as well as any result which may be stored in other objects.
Function Definition
An R function is created by using the keyword function.
The basic syntax of an R function definition is as follows -
function_name <- function(arg_1, arg_2, ...) {
   # Function body
}
Function Components
The different parts of a function are -
Function Name - This is the actual name of the function.
It is stored in R environment as an object with this name.
Arguments - An argument is a placeholder.
When a function is invoked, you pass a value to the argument.
Arguments are optional; that is, a function may contain no arguments.
Also arguments can have default values.
Function Body - The function body contains a collection of statements that defines what the function does.
Return Value - The return value of a function is the last expression in the function body to be evaluated.
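The return-value rule above means an explicit return() is optional; a minimal sketch (the function name is illustrative):

```r
# The last evaluated expression in the body is the return value.
square <- function(x) {
   x^2   # returned implicitly; return(x^2) would be equivalent
}
print(square(4))   # [1] 16
```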
R has many in-built functions which can be directly called in the program without defining them first.
We can also create and use our own functions, referred to as user-defined functions.
Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum() and paste(), etc.
They are directly called by user written programs.
You can refer to the most widely used R functions.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
# Find mean of numbers from 25 to 82.
print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))
When we execute the above code, it produces the following result -
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
User-defined Function
We can create user-defined functions in R.
They are specific to what a user wants and once created they can be used like the built-in functions.
Below is an example of how a function is created and used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
# Call the function new.function supplying 6 as an argument.
new.function(6)
When we execute the above code, it produces the following result -
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
Calling a Function without an Argument
# Create a function without an argument.
new.function <- function() {
   for(i in 1:5) {
      print(i^2)
   }
}
# Call the function without supplying an argument.
new.function()
When we execute the above code, it produces the following result -
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
Calling a Function with Argument Values (by position and by name)
The arguments to a function call can be supplied in the same sequence as defined in the function or they can be supplied in a different sequence but assigned to the names of the arguments.
# Create a function with arguments.
new.function <- function(a, b, c) {
   result <- a * b + c
   print(result)
}
# Call the function by position of arguments.
new.function(5,3,11)
# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)
When we execute the above code, it produces the following result -
[1] 26
[1] 58
Calling a Function with Default Argument
We can define the value of the arguments in the function definition and call the function without supplying any argument to get the default result.
But we can also call such functions by supplying new values of the argument and get non default result.
# Create a function with arguments.
new.function <- function(a = 3, b = 6) {
   result <- a * b
   print(result)
}
# Call the function without giving any argument.
new.function()
# Call the function supplying new values for the arguments.
new.function(9,5)
When we execute the above code, it produces the following result -
[1] 18
[1] 45
Lazy Evaluation of Function
Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the function body.
# Create a function with arguments.
new.function <- function(a, b) {
   print(a^2)
   print(a)
   print(b)
}
# Evaluate the function without supplying one of the arguments.
new.function(6)
When we execute the above code, it produces the following result -
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
R - Strings
Any value written within a pair of single quotes or double quotes in R is treated as a string.
Internally R stores every string within double quotes, even when you create it with single quotes.
Rules Applied in String Construction
The quotes at the beginning and end of a string should be either both double quotes or both single quotes.
They cannot be mixed.
Double quotes can be inserted into a string starting and ending with single quotes.
Single quotes can be inserted into a string starting and ending with double quotes.
Double quotes cannot be inserted into a string starting and ending with double quotes.
Single quotes cannot be inserted into a string starting and ending with single quotes.
Examples of Valid Strings
Following examples clarify the rules about creating a string in R.
a <- 'Start and end with single quote'
print(a)
b <- "Start and end with double quotes"
print(b)
c <- "single quote ' in between double quotes"
print(c)
d <- 'Double quotes " in between single quote'
print(d)
When the above code is run we get the following output -
[1] "Start and end with single quote"
[1] "Start and end with double quotes"
[1] "single quote ' in between double quotes"
[1] "Double quotes \" in between single quote"
Examples of Invalid Strings
e <- 'Mixed quotes"
print(e)
f <- 'Single quote ' inside single quote'
print(f)
g <- "Double quotes " inside double quotes"
print(g)
When we run the script it fails giving below results.
Error: unexpected symbol in:
"print(e)
f <- 'Single"
Execution halted
String Manipulation
Concatenating Strings - paste() function
Many strings in R are combined using the paste() function.
It can take any number of arguments to be combined together.
Syntax
The basic syntax for paste function is -
paste(..., sep = " ", collapse = NULL)
Following is the description of the parameters used -
... represents any number of arguments to be combined.
sep represents any separator between the arguments.
It is optional.
collapse is used to eliminate the space between two strings, but not the space within the two words of one string.
Example
a <- "Hello"
b <- 'How'
c <- "are you? "
print(paste(a,b,c))
print(paste(a,b,c, sep = "-"))
print(paste(a,b,c, sep = "", collapse = ""))
When we execute the above code, it produces the following result -
[1] "Hello How are you? "
[1] "Hello-How-are you? "
[1] "HelloHoware you? "
Formatting numbers & strings - format() function
Numbers and strings can be formatted to a specific style using the format() function.
Syntax
The basic syntax for format function is -
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
Following is the description of the parameters used -
x is the vector input.
digits is the total number of digits displayed.
nsmall is the minimum number of digits to the right of the decimal point.
scientific is set to TRUE to display scientific notation.
width indicates the minimum width to be displayed by padding blanks in the beginning.
justify is the display of the string to left, right or center.
Example
# Total number of digits displayed.
# The last digit is rounded off.
result <- format(23.123456789, digits = 9)
print(result)
# Display numbers in scientific notation.
result <- format(c(6, 13.14521), scientific = TRUE)
print(result)
# The minimum number of digits to the right of the decimal point.
result <- format(23.47, nsmall = 5)
print(result)
# Format treats everything as a string.
result <- format(6)
print(result)
# Numbers are padded with blank in the beginning for width.
result <- format(13.7, width = 6)
print(result)
# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)
# Justify the string with center.
result <- format("Hello", width = 8, justify = "c")
print(result)
When we execute the above code, it produces the following result -
[1] "23.1234568"
[1] "6.000000e+00" "1.314521e+01"
[1] "23.47000"
[1] "6"
[1] " 13.7"
[1] "Hello "
[1] " Hello "
Counting number of characters in a string - nchar() function
This function counts the number of characters including spaces in a string.
Syntax
The basic syntax for nchar() function is -
nchar(x)
Following is the description of the parameters used -
x is the vector input.
Example
result <- nchar("Count the number of characters")
print(result)
When we execute the above code, it produces the following result -
[1] 30
Changing the case - toupper() & tolower() functions
These functions change the case of characters of a string.
Syntax
The basic syntax for the toupper() & tolower() functions is -
toupper(x)
tolower(x)
Following is the description of the parameters used -
x is the vector input.
Example
# Changing to Upper case.
result <- toupper("Changing To Upper")
print(result)
# Changing to lower case.
result <- tolower("Changing To Lower")
print(result)
When we execute the above code, it produces the following result -
[1] "CHANGING TO UPPER"
[1] "changing to lower"
Extracting parts of a string - substring() function
This function extracts parts of a string.
Syntax
The basic syntax for substring() function is -
substring(x,first,last)
Following is the description of the parameters used -
x is the character vector input.
first is the position of the first character to be extracted.
last is the position of the last character to be extracted.
Example
# Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)
When we execute the above code, it produces the following result -
[1] "act"
R - Vectors
Vectors are the most basic R data objects and there are six types of atomic vectors.
They are logical, integer, double, complex, character and raw.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of the above vector types.
# Atomic vector of type character.
print("abc");
# Atomic vector of type double.
print(12.5)
# Atomic vector of type integer.
print(63L)
# Atomic vector of type logical.
print(TRUE)
# Atomic vector of type complex.
print(2+3i)
# Atomic vector of type raw.
print(charToRaw('hello'))
When we execute the above code, it produces the following result -
[1] "abc"
[1] 12.5
[1] 63
[1] TRUE
[1] 2+3i
[1] 68 65 6c 6c 6f
Multiple Elements Vector
Using the colon operator with numeric data
# Creating a sequence from 5 to 13.
v <- 5:13
print(v)
# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
When we execute the above code, it produces the following result -
[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
Using the seq() function
# Create vector with elements from 5 to 9 incrementing by 0.4.
print(seq(5, 9, by = 0.4))
When we execute the above code, it produces the following result -
[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
Using the c() function
The non-character values are coerced to character type if one of the elements is a character.
# The logical and numeric values are converted to characters.
s <- c('apple','red',5,TRUE)
print(s)
When we execute the above code, it produces the following result -
[1] "apple" "red" "5" "TRUE"
Accessing Vector Elements
Elements of a Vector are accessed using indexing.
The [ ] brackets are used for indexing.
Indexing starts with position 1.
Giving a negative value in the index drops that element from the result. TRUE, FALSE or 0 and 1 can also be used for indexing.
# Accessing vector elements using position.
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)
# Accessing vector elements using logical indexing.
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
# Accessing vector elements using negative indexing.
x <- t[c(-2,-5)]
print(x)
# Accessing vector elements using 0/1 indexing.
y <- t[c(0,0,0,0,0,0,1)]
print(y)
When we execute the above code, it produces the following result -
[1] "Mon" "Tue" "Fri"
[1] "Sun" "Fri"
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
[1] "Sun"
Vector Manipulation
Vector arithmetic
Two vectors of the same length can be added, subtracted, multiplied or divided, giving a vector as the result.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
When we execute the above code, it produces the following result -
[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
Vector Element Recycling
If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
add.result <- v1+v2
print(add.result)
sub.result <- v1-v2
print(sub.result)
When we execute the above code, it produces the following result -
[1] 7 19 8 16 4 22
[1] -1 -3 0 -6 -4 0
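When the longer vector's length is not a multiple of the shorter one's, R still recycles but issues a warning. A sketch using a length-4 vector (not from the example above) against the length-6 v1:

```r
v1 <- c(3,8,4,5,0,11)
v3 <- c(4,11,2,8)
# v3 is recycled to c(4,11,2,8,4,11); a warning is raised because
# 6 is not a multiple of 4.
print(v1 + v3)   # 7 19 6 13 4 22
```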
Vector Element Sorting
Elements in a vector can be sorted using the sort() function.
v <- c(3,8,4,5,0,11, -9, 304)
# Sort the elements of the vector.
sort.result <- sort(v)
print(sort.result)
# Sort the elements in the reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)
# Sorting character vectors in reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
When we execute the above code, it produces the following result -
[1] -9 0 3 4 5 8 11 304
[1] 304 11 8 5 4 3 0 -9
[1] "Blue" "Red" "violet" "yellow"
[1] "yellow" "violet" "Red" "Blue"
R - Lists
Lists are the R objects which contain elements of different types, such as numbers, strings, vectors and even another list inside them.
A list can also contain a matrix or a function as its elements.
List is created using list() function.
Creating a List
Following is an example to create a list containing strings, numbers, vectors and logical values.
# Create a list containing strings, numbers, vectors and a logical
# values.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
When we execute the above code, it produces the following result -
[[1]]
[1] "Red"
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
Naming List Elements
The list elements can be given names and they can be accessed using these names.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Show the list.
print(list_data)
When we execute the above code, it produces the following result -
$`1st Quarter`
[1] "Jan" "Feb" "Mar"
$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
$`A Inner list`
$`A Inner list`[[1]]
[1] "green"
$`A Inner list`[[2]]
[1] 12.3
Accessing List Elements
Elements of the list can be accessed by the index of the element in the list.
In case of named lists it can also be accessed using the names.
We continue to use the list in the above example -
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Access the first element of the list.
print(list_data[1])
# Access the third element.
# As it is also a list, all its elements will be printed.
print(list_data[3])
# Access the list element using the name of the element.
print(list_data$A_Matrix)
When we execute the above code, it produces the following result -
$`1st Quarter`
[1] "Jan" "Feb" "Mar"
$`A Inner list`
$`A Inner list`[[1]]
[1] "green"
$`A Inner list`[[2]]
[1] 12.3
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
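Worth noting here is the difference between single-bracket and double-bracket indexing: [ returns a sub-list, while [[ returns the element itself. A minimal sketch with a made-up list:

```r
# list_data[1] wraps the element in a list of length 1;
# list_data[[1]] extracts the element directly.
list_data <- list(c("Jan","Feb","Mar"), 12.3)
print(class(list_data[1]))     # "list"
print(class(list_data[[1]]))   # "character"
```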
Manipulating List Elements
We can add, delete and update list elements as shown below.
Here we add an element at the end of the list, remove it again by setting it to NULL, and then update an existing element.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Add element at the end of the list.
list_data[4] <- "New element"
print(list_data[4])
# Remove the last element.
list_data[4] <- NULL
# Print the 4th Element.
print(list_data[4])
# Update the 3rd Element.
list_data[3] <- "updated element"
print(list_data[3])
When we execute the above code, it produces the following result -
[[1]]
[1] "New element"
$<NA>
NULL
$`A Inner list`
[1] "updated element"
Merging Lists
You can merge many lists into one list by placing all the lists inside one list() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
# Merge the two lists.
merged.list <- c(list1,list2)
# Print the merged list.
print(merged.list)
When we execute the above code, it produces the following result -
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
Converting List to Vector
A list can be converted to a vector so that the elements of the vector can be used for further manipulation.
All the arithmetic operations on vectors can be applied after the list is converted into vectors.
To do this conversion, we use the unlist() function.
It takes the list as input and produces a vector.
# Create lists.
list1 <- list(1:5)
print(list1)
list2 <-list(10:14)
print(list2)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
# Now add the vectors
result <- v1+v2
print(result)
When we execute the above code, it produces the following result -
[[1]]
[1] 1 2 3 4 5
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
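When the same operation is needed on every element, the list does not have to be unlisted first; lapply() and sapply(), which are not covered above, apply a function to each element directly:

```r
# lapply() returns a list; sapply() simplifies the result
# to a vector where possible.
list1 <- list(1:5, 10:14)
print(sapply(list1, sum))   # 15 60
```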
R - Matrices
Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout.
They contain elements of the same atomic types.
Though we can create a matrix containing only characters or only logical values, they are not of much use.
We use matrices containing numeric elements to be used in mathematical calculations.
A Matrix is created using the matrix() function.
Syntax
The basic syntax for creating a matrix in R is -
matrix(data, nrow, ncol, byrow, dimnames)
Following is the description of the parameters used -
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical value.
If TRUE then the input vector elements are arranged by row.
dimnames is a list of the names assigned to the rows and columns.
Example
Create a matrix taking a vector of numbers as input.
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)
When we execute the above code, it produces the following result -
[,1] [,2] [,3]
[1,] 3 4 5
[2,] 6 7 8
[3,] 9 10 11
[4,] 12 13 14
[,1] [,2] [,3]
[1,] 3 7 11
[2,] 4 8 12
[3,] 5 9 13
[4,] 6 10 14
col1 col2 col3
row1 3 4 5
row2 6 7 8
row3 9 10 11
row4 12 13 14
Accessing Elements of a Matrix
Elements of a matrix can be accessed by using the column and row index of the element.
We consider the matrix P above to find the specific elements below.
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
# Create the matrix.
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
# Access the element at 3rd column and 1st row.
print(P[1,3])
# Access the element at 2nd column and 4th row.
print(P[4,2])
# Access only the 2nd row.
print(P[2,])
# Access only the 3rd column.
print(P[,3])
When we execute the above code, it produces the following result -
[1] 5
[1] 13
col1 col2 col3
6 7 8
row1 row2 row3 row4
5 8 11 14
Matrix Computations
Various mathematical operations are performed on the matrices using the R operators.
The result of the operation is also a matrix.
The dimensions (number of rows and columns) should be the same for the matrices involved in the operation.
Matrix Addition & Subtraction
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
# Add the matrices.
result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result)
# Subtract the matrices
result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result)
When we execute the above code, it produces the following result -
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 9 4 6
[,1] [,2] [,3]
[1,] 5 0 3
[2,] 2 9 4
Result of addition
[,1] [,2] [,3]
[1,] 8 -1 5
[2,] 11 13 10
Result of subtraction
[,1] [,2] [,3]
[1,] -2 -1 -1
[2,] 7 -5 2
Matrix Multiplication & Division
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
# Multiply the matrices (element-wise).
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)
# Divide the matrices (element-wise).
result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)
When we execute the above code, it produces the following result -
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 9 4 6
[,1] [,2] [,3]
[1,] 5 0 3
[2,] 2 9 4
Result of multiplication
[,1] [,2] [,3]
[1,] 15 0 6
[2,] 18 36 24
Result of division
[,1] [,2] [,3]
[1,] 0.6 -Inf 0.6666667
[2,] 4.5 0.4444444 1.5000000
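Note that * and / above operate element by element. True matrix multiplication uses the %*% operator, where the column count of the first matrix must equal the row count of the second; a sketch using the transpose of matrix2 to make the dimensions agree:

```r
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
# A 2x3 matrix times a 3x2 matrix gives a 2x2 matrix:
# rows 21 5 and 63 78.
result <- matrix1 %*% t(matrix2)
print(result)
```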
R - Arrays
Arrays are the R data objects which can store data in more than two dimensions.
For example - If we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns.
Arrays can store only one data type.
An array is created using the array() function.
It takes vectors as input and uses the values in the dim parameter to create an array.
Example
The following example creates an array made up of two 3x3 matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2))
print(result)
When we execute the above code, it produces the following result -
, , 1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
, , 2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
Naming Columns and Rows
We can give names to the rows, columns and matrices in the array by using the dimnames parameter.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names,column.names,
matrix.names))
print(result)
When we execute the above code, it produces the following result -
, , Matrix1
COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15
, , Matrix2
COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15
Accessing Array Elements
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names,
column.names, matrix.names))
# Print the third row of the second matrix of the array.
print(result[3,,2])
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])
# Print the 2nd Matrix.
print(result[,,2])
When we execute the above code, it produces the following result -
COL1 COL2 COL3
3 12 15
[1] 13
COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15
Manipulating Array Elements
As an array is made up of matrices in multiple dimensions, operations on the elements of an array are carried out by accessing elements of those matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
array1 <- array(c(vector1,vector2),dim = c(3,3,2))
# Create two more vectors of different lengths.
vector3 <- c(9,1,0)
vector4 <- c(6,0,11,3,14,1,2,6,9)
array2 <- array(c(vector3,vector4),dim = c(3,3,2))
# Create matrices from these arrays.
matrix1 <- array1[,,2]
matrix2 <- array2[,,2]
# Add the matrices.
result <- matrix1+matrix2
print(result)
When we execute the above code, it produces the following result -
     [,1] [,2] [,3]
[1,]    7   19   19
[2,]   15   12   14
[3,]   12   12   26
Calculations Across Array Elements
We can do calculations across the elements in an array using the apply() function.
Syntax
apply(x, margin, fun)
Following is the description of the parameters used -
x is an array.
margin specifies the dimension over which the function is applied - 1 for rows, 2 for columns.
fun is the function to be applied across the elements of the array.
Example
We use the apply() function below to calculate the sum of the elements in the rows of an array across all the matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
new.array <- array(c(vector1,vector2),dim = c(3,3,2))
print(new.array)
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
When we execute the above code, it produces the following result -
, , 1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
, , 2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
[1] 56 68 60
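Changing the margin changes the direction of the calculation; with margin 2, apply() sums down the columns across all the matrices instead. A short sketch on the same array:

```r
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
new.array <- array(c(vector1,vector2), dim = c(3,3,2))
# Column sums across both matrices: 17+17, 33+33, 42+42.
result <- apply(new.array, c(2), sum)
print(result)   # 34 66 84
```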
R - Factors
Factors are the data objects which are used to categorize the data and store it as levels.
They can store both strings and integers.
They are useful in the columns which have a limited number of unique values.
Like "Male, "Female" and True, False etc.
They are useful in data analysis for statistical modeling.
Factors are created using the factor() function by taking a vector as input.
Example
# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
print(is.factor(data))
# Apply the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
When we execute the above code, it produces the following result -
[1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East" "North"
[1] FALSE
[1] East West East North North East West West West East North
Levels: East North West
[1] TRUE
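Once a factor is created, two common follow-up steps (not shown above) are listing its levels with levels() and counting how often each level occurs with table():

```r
data <- c("East","West","East","North","North","East","West","West","West","East","North")
factor_data <- factor(data)
print(levels(factor_data))   # "East" "North" "West"
print(table(factor_data))    # East: 4, North: 3, West: 4
```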
Factors in Data Frame
On creating a data frame with a column of text data, R treats the text column as categorical data and creates factors on it, provided stringsAsFactors is TRUE (the default in R versions before 4.0; from R 4.0 onwards the default is FALSE, so the factor must be requested explicitly).
# Create the vectors for data frame.
height <- c(132,151,162,139,166,147,122)
weight <- c(48,49,66,53,67,52,40)
gender <- c("male","male","female","female","male","female","male")
# Create the data frame, asking for the text column to be a factor.
input_data <- data.frame(height,weight,gender,stringsAsFactors = TRUE)
print(input_data)
# Test if the gender column is a factor.
print(is.factor(input_data$gender))
# Print the gender column to see the levels.
print(input_data$gender)
When we execute the above code, it produces the following result -
height weight gender
1 132 48 male
2 151 49 male
3 162 66 female
4 139 53 female
5 166 67 male
6 147 52 female
7 122 40 male
[1] TRUE
[1] male male female female male female male
Levels: female male
Changing the Order of Levels
The order of the levels in a factor can be changed by applying the factor function again with new order of the levels.
data <- c("East","West","East","North","North","East","West",
"West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)
# Apply the factor function with required order of the level.
new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
When we execute the above code, it produces the following result -
[1] East West East North North East West West West East North
Levels: East North West
[1] East West East North North East West West West East North
Levels: East West North
Generating Factor Levels
We can generate factor levels by using the gl() function.
It takes two integers as input which indicate how many levels there are and how many times each level is repeated.
Syntax
gl(n, k, labels)
Following is the description of the parameters used -
n is an integer giving the number of levels.
k is an integer giving the number of replications.
labels is a vector of labels for the resulting factor levels.
Example
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
When we execute the above code, it produces the following result -
[1] Tampa Tampa Tampa Tampa Seattle Seattle Seattle Seattle Boston
[10] Boston Boston Boston
Levels: Tampa Seattle Boston
R - Data Frames
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.
The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain the same number of data items.
Create Data Frame
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
When we execute the above code, it produces the following result -
emp_id emp_name salary start_date
1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27
Get the Structure of the Data Frame
The structure of the data frame can be seen by using the str() function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)
When we execute the above code, it produces the following result -
'data.frame': 5 obs. of 4 variables:
$ emp_id : int 1 2 3 4 5
$ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
$ salary : num 623 515 611 729 843
$ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying summary() function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))
When we execute the above code, it produces the following result -
     emp_id  emp_name            salary          start_date
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27
Extract Data from Data Frame
Extract a specific column from a data frame using the column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
When we execute the above code, it produces the following result -
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract the first two rows and then all columns
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract first two rows.
result <- emp.data[1:2,]
print(result)
When we execute the above code, it produces the following result -
emp_id emp_name salary start_date
1 1 Rick 623.3 2012-01-01
2 2 Dan 515.2 2013-09-23
Extract 3rd and 5th row with 2nd and 4th column
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
When we execute the above code, it produces the following result -
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
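Rows can also be selected with a logical condition on a column, a common pattern not shown above; a sketch on the same data frame:

```r
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   stringsAsFactors = FALSE
)
# Keep only the rows where salary exceeds 620.
result <- emp.data[emp.data$salary > 620, ]
print(result)
```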
Expand Data Frame
A data frame can be expanded by adding columns and rows.
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Add the "dept" coulmn.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
When we execute the above code, it produces the following result -
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame to create the final data frame.
# Create the first data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)
# Create the second data frame
emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Fianance"),
stringsAsFactors = FALSE
)
# Bind the two data frames.
emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)
When we execute the above code, it produces the following result -
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Rasmi 578.00 2013-05-21 IT
7 7 Pranab 722.50 2013-07-30 Operations
8 8 Tusar 632.80 2014-06-17 Fianance
R - Packages
R packages are a collection of R functions, compiled code and sample data.
They are stored under a directory called "library" in the R environment.
By default, R installs a set of packages during installation.
More packages are added later, when they are needed for some specific purpose.
When we start the R console, only the default packages are available.
Other packages which are already installed have to be loaded explicitly before they can be used.
All the packages available in R language are listed at R Packages.
Below is a list of commands to be used to check, verify and use the R packages.
Check Available R Packages
Get library locations containing R packages
.libPaths()
When we execute the above code, it produces the following result.
It may vary depending on the local settings of your PC.
[2] "C:/Program Files/R/R-3.2.2/library"
Get the list of all the packages installed
library()
When we execute the above code, it produces the following result.
It may vary depending on the local settings of your PC.
Packages in library ‘C:/Program Files/R/R-3.2.2/library’:
base The R Base Package
boot Bootstrap Functions (Originally by Angelo Canty
for S)
class Functions for Classification
cluster "Finding Groups in Data": Cluster Analysis
Extended Rousseeuw et al.
codetools Code Analysis Tools for R
compiler The R Compiler Package
datasets The R Datasets Package
foreign Read Data Stored by 'Minitab', 'S', 'SAS',
'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
graphics The R Graphics Package
grDevices The R Graphics Devices and Support for Colours
and Fonts
grid The Grid Graphics Package
KernSmooth Functions for Kernel Smoothing Supporting Wand
& Jones (1995)
lattice Trellis Graphics for R
MASS Support Functions and Datasets for Venables and
Ripley's MASS
Matrix Sparse and Dense Matrix Classes and Methods
methods Formal Methods and Classes
mgcv Mixed GAM Computation Vehicle with GCV/AIC/REML
Smoothness Estimation
nlme Linear and Nonlinear Mixed Effects Models
nnet Feed-Forward Neural Networks and Multinomial
Log-Linear Models
parallel Support for Parallel computation in R
rpart Recursive Partitioning and Regression Trees
spatial Functions for Kriging and Point Pattern
Analysis
splines Regression Spline Functions and Classes
stats The R Stats Package
stats4 Statistical Functions using S4 Classes
survival Survival Analysis
tcltk Tcl/Tk Interface
tools Tools for Package Development
utils The R Utils Package
Get all packages currently loaded in the R environment
search()
When we execute the above code, it produces the following result.
It may vary depending on the local settings of your PC.
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
Install a New Package
There are two ways to add new R packages.
One is installing directly from the CRAN repository; the other is downloading the package to your local system and installing it manually.
Install directly from CRAN
The following command fetches the package directly from the CRAN repository and installs it in the R environment.
You may be prompted to choose a mirror.
Choose the one appropriate to your location.
install.packages("Package Name")
# Install the package named "XML".
install.packages("XML")
Install package manually
Go to the link R Packages to download the package needed.
Save the package as a .zip file in a suitable location in the local system.
Now you can run the following command to install this package in the R environment.
install.packages(file_name_with_path, repos = NULL, type = "source")
# Install the package named "XML"
install.packages("E:/XML_3.98-1.3.zip", repos = NULL, type = "source")
Load Package to Library
Before a package can be used in the code, it must be loaded to the current R environment.
You also need to load a package that was installed previously but is not available in the current environment.
A package is loaded using the following command -
library("package Name", lib.loc = "path to library")
# Load the package named "XML"
library("XML")
R - Data Reshaping
Data Reshaping in R is about changing the way data is organized into rows and columns.
Most of the time data processing in R is done by taking the input data as a data frame.
It is easy to extract data from the rows and columns of a data frame but there are situations when we need the data frame in a format that is different from format in which we received it.
R has many functions to split, merge and change the rows to columns and vice-versa in a data frame.
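A minimal illustration of changing rows to columns is base R's t() function, which transposes a matrix:

```r
# The simplest reshape: t() swaps the rows and columns of a matrix.
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
print(dim(m))                # 2 3
print(dim(t(m)))             # 3 2
```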
Joining Columns and Rows in a Data Frame
We can join multiple vectors to create a data frame using the cbind() function.
We can also append the rows of one data frame to another using the rbind() function.
# Create vector objects.
city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
zipcode <- c(33602,98104,06161,80294)   # note: numeric literals drop the leading zero, so 06161 becomes 6161
# Combine above three vectors into one data frame.
addresses <- cbind(city,state,zipcode)
# Print a header.
cat("# # # # The First data frame\n")
# Print the data frame.
print(addresses)
# Create another data frame with similar columns
new.address <- data.frame(
city = c("Lowry","Charlotte"),
state = c("CO","FL"),
zipcode = c("80230","33949"),
stringsAsFactors = FALSE
)
# Print a header.
cat("# # # The Second data frame\n")
# Print the data frame.
print(new.address)
# Combine rows from both the data frames.
all.addresses <- rbind(addresses,new.address)
# Print a header.
cat("# # # The combined data frame\n")
# Print the result.
print(all.addresses)
When we execute the above code, it produces the following result -
# # # # The First data frame
city state zipcode
[1,] "Tampa" "FL" "33602"
[2,] "Seattle" "WA" "98104"
[3,] "Hartford" "CT" "6161"
[4,] "Denver" "CO" "80294"
# # # The Second data frame
city state zipcode
1 Lowry CO 80230
2 Charlotte FL 33949
# # # The combined data frame
city state zipcode
1 Tampa FL 33602
2 Seattle WA 98104
3 Hartford CT 6161
4 Denver CO 80294
5 Lowry CO 80230
6 Charlotte FL 33949
Merging Data Frames
We can merge two data frames by using the merge() function.
The data frames must have the same column names on which the merging happens.
In the example below, we consider the data sets about diabetes in Pima Indian women available in the library named "MASS".
We merge the two data sets based on the values of blood pressure ("bp") and body mass index ("bmi").
On choosing these two columns for merging, the records where the values of these two variables match in both data sets are combined together into a single data frame.
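Before turning to the Pima data, here is a minimal, self-contained sketch of merge() using two made-up data frames (df1 and df2 are hypothetical, not part of the MASS data):

```r
# Two made-up data frames sharing an "id" column.
df1 <- data.frame(id = c(1, 2, 3), score = c(90, 85, 70))
df2 <- data.frame(id = c(2, 3, 4), grade = c("B", "C", "F"))
# Only rows whose "id" value appears in both data frames are kept.
merged <- merge(x = df1, y = df2, by = "id")
print(merged)
print(nrow(merged))  # 2 rows remain
```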
library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
by.x = c("bp", "bmi"),
by.y = c("bp", "bmi")
)
print(merged.Pima)
nrow(merged.Pima)
When we execute the above code, it produces the following result -
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
age.y type.y
1 31 No
2 21 No
3 24 No
4 21 No
5 21 No
6 43 Yes
7 36 Yes
8 40 No
9 29 Yes
10 28 No
11 55 No
12 39 No
13 39 No
14 49 Yes
15 49 Yes
16 38 No
17 28 No
[1] 17
Melting and Casting
One of the most interesting aspects of R programming is changing the shape of the data in multiple steps to get a desired form.
The functions used to do this, melt() and cast(), come from the reshape package.
We "melt" the data so that each row is a unique id-variable combination.
Then we "cast" the melted data into any shape we would like.
mydata
id time x1 x2
1 1 5 6
1 2 3 5
2 1 6 1
2 2 2 4
library(reshape)
melteddata <- melt(mydata, id=c("id","time"))
melteddata
id time variable value
1 1 x1 5
1 2 x1 3
2 1 x1 6
2 2 x1 2
1 1 x2 6
1 2 x2 5
2 1 x2 1
2 2 x2 4
# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(melteddata, id~variable, mean)
timemeans <- cast(melteddata, time~variable, mean)
subjmeans
id x1 x2
1 4 5.5
2 4 2.5
timemeans
time x1 x2
1 5.5 3.5
2 2.5 4.5
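The same subject and time means can be reproduced with base R's aggregate() function, assuming mydata is first built as a data frame with the values shown above:

```r
# Rebuild the toy data set used above.
mydata <- data.frame(id = c(1, 1, 2, 2), time = c(1, 2, 1, 2),
                     x1 = c(5, 3, 6, 2), x2 = c(6, 5, 1, 4))
# Subject means (one row per id), matching the subjmeans table.
subjmeans <- aggregate(cbind(x1, x2) ~ id, data = mydata, FUN = mean)
print(subjmeans)
# Time means (one row per time), matching the timemeans table.
timemeans <- aggregate(cbind(x1, x2) ~ time, data = mydata, FUN = mean)
print(timemeans)
```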
Another example:
We consider the dataset called ships present in the library called "MASS".
library(MASS)
print(ships)
When we execute the above code, it produces the following result -
type year period service incidents
1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
.............
Melt the Data
Now we melt the data to organize it, converting all columns other than type and year into multiple rows.
library(reshape)
molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)
When we execute the above code, it produces the following result with a different structure -
type year variable value
1 A 60 period 60
2 A 60 period 75
............
41 A 60 service 127
...........
101 C 70 incidents 6
102 C 70 incidents 2
...........
Cast the Molten Data
We can cast the molten data into a new form where the aggregate of each type of ship for each year is created.
It is done using the cast() function.
recasted.ship <- cast(molten.ships, type+year~variable,sum)
print(recasted.ship)
When we execute the above code, it produces the following result -
type year period service incidents
1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68
6 B 65 135 48979 111
7 B 70 135 20163 56
8 B 75 135 7117 18
9 C 60 135 1731 2
10 C 65 135 1457 1
11 C 70 135 2731 8
12 C 75 135 274 1
13 D 60 135 356 0
14 D 65 135 480 0
15 D 70 135 1557 13
16 D 75 135 2051 4
17 E 60 135 45 0
18 E 65 135 1226 14
19 E 70 135 3318 17
20 E 75 135 542 1
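As a cross-check that needs no reshaping package, base R's aggregate() computes the same per-type, per-year sums directly from the ships data:

```r
# Base-R check of the cast: sum period, service and incidents per type and year.
library(MASS)   # provides the 'ships' data set
totals <- aggregate(cbind(period, service, incidents) ~ type + year,
                    data = ships, FUN = sum)
# e.g. type "A", year 60: period 135, service 190, incidents 0,
# matching the first row of the recast table above.
print(subset(totals, type == "A" & year == 60))
```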
R - CSV Files
In R, we can read data from files stored outside the R environment.
We can also write data into files which will be stored and accessed by the operating system.
R can read and write into various file formats like CSV, Excel, XML etc.
In this chapter we will learn to read data from a CSV file and then write data into a CSV file.
The file should be present in the current working directory so that R can read it.
Of course we can also set our own directory and read files from there.
Getting and Setting the Working Directory
You can check which directory the R workspace is pointing to using the getwd() function.
You can also set a new working directory using the setwd() function.
# Get and print current working directory.
print(getwd())
# Set current working directory.
setwd("/web/com")
# Get and print current working directory.
print(getwd())
When we execute the above code, it produces the following result -
[1] "/web/com/1441086124_2016"
[1] "/web/com"
This result depends on your OS and your current directory where you are working.
Input as CSV File
A CSV file is a text file in which the values in the columns are separated by commas.
Let's consider the following data present in the file named input.csv.
You can create this file using Windows Notepad by copying and pasting this data.
Save the file as input.csv using the Save As option with the "All Files (*.*)" file type in Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
Reading a CSV File
Following is a simple example of the read.csv() function, used to read a CSV file available in your current working directory -
data <- read.csv("input.csv")
print(data)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
Analyzing the CSV File
By default the read.csv() function gives the output as a data frame.
This can be easily checked as follows.
Also we can check the number of columns and rows.
data <- read.csv("input.csv")
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
When we execute the above code, it produces the following result -
[1] TRUE
[1] 5
[1] 8
Once we read data into a data frame, we can apply all the functions applicable to data frames, as explained in the subsequent sections.
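For a quick overview of any data frame, base R also offers str() and summary(); the sketch below uses a small made-up frame rather than the full input.csv contents:

```r
# A small made-up data frame standing in for the CSV contents.
emp <- data.frame(id = 1:3, salary = c(623.3, 515.2, 611))
str(emp)                     # structure: column names, types and first values
print(summary(emp$salary))   # min, quartiles, mean, max
```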
Get the maximum salary
# Create a data frame.
data <- read.csv("input.csv")
# Get the max salary from data frame.
sal <- max(data$salary)
print(sal)
When we execute the above code, it produces the following result -
[1] 843.25
Get the details of the person with max salary
We can fetch rows meeting specific filter criteria similar to a SQL where clause.
# Create a data frame.
data <- read.csv("input.csv")
# Get the max salary from data frame.
sal <- max(data$salary)
# Get the person detail having max salary.
retval <- subset(data, salary == max(salary))
print(retval)
When we execute the above code, it produces the following result -
id name salary start_date dept
5 5 Gary 843.25 2015-03-27 Finance
Get all the people working in IT department
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset( data, dept == "IT")
print(retval)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
6 6 Nina 578.0 2013-05-21 IT
Get the persons in IT department whose salary is greater than 600
# Create a data frame.
data <- read.csv("input.csv")
info <- subset(data, salary > 600 & dept == "IT")
print(info)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
Get the people who joined after the start of 2014
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
print(retval)
When we execute the above code, it produces the following result -
id name salary start_date dept
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
8 8 Guru 722.50 2014-06-17 Finance
Writing into a CSV File
R can create a CSV file from an existing data frame.
The write.csv() function is used to create the csv file.
This file gets created in the working directory.
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
# Write filtered data into a new file.
write.csv(retval,"output.csv")
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result -
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 5 Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
Here the column X contains the row names that write.csv() adds by default.
This column can be dropped by passing an additional parameter while writing the file.
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
# Write filtered data into a new file.
write.csv(retval,"output.csv", row.names = FALSE)
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 3 Michelle 611.00 2014-11-15 IT
2 4 Ryan 729.00 2014-05-11 HR
3 5 Gary 843.25 2015-03-27 Finance
4 8 Guru 722.50 2014-06-17 Finance
R - Excel File
Microsoft Excel is the most widely used spreadsheet program, which stores data in the .xls or .xlsx format.
R can read directly from these files using some Excel-specific packages.
A few such packages are XLConnect, xlsx, gdata etc.
We will be using the xlsx package.
R can also write into an Excel file using this package.
Install xlsx Package
You can use the following command in the R console to install the "xlsx" package.
It may ask you to install some additional packages on which this package depends.
Run the same command with the required package names to install the additional packages.
install.packages("xlsx")
Verify and Load the "xlsx" Package
Use the following command to verify and load the "xlsx" package.
# Verify the package is installed.
any(grepl("xlsx",installed.packages()))
# Load the library into R workspace.
library("xlsx")
When the script is run we get the following output.
[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars
Input as xlsx File
Open Microsoft excel.
Copy and paste the following data in the work sheet named as sheet1.
id name salary start_date dept
1 Rick 623.3 1/1/2012 IT
2 Dan 515.2 9/23/2013 Operations
3 Michelle 611 11/15/2014 IT
4 Ryan 729 5/11/2014 HR
5 Gary 843.25 3/27/2015 Finance
6 Nina 578 5/21/2013 IT
7 Simon 632.8 7/30/2013 Operations
8 Guru 722.5 6/17/2014 Finance
Also copy and paste the following data to another worksheet and rename this worksheet to "city".
name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas
Save the Excel file as "input.xlsx".
You should save it in the current working directory of the R workspace.
Reading the Excel File
The input.xlsx is read by using the read.xlsx() function as shown below.
The result is stored as a data frame in the R environment.
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
When we execute the above code, it produces the following result -
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
R - Binary Files
A binary file is a file that contains information stored only in the form of bits and bytes (0's and 1's).
Binary files are not human readable, because the bytes in them translate to characters and symbols that include many non-printable characters.
Attempting to open a binary file in a text editor will show characters like Ø and ð.
A binary file has to be read by specific programs to be usable.
For example, the binary file of a Microsoft Word document can be read into human-readable form only by the Word program.
This indicates that, besides the human-readable text, a lot more information, such as character formatting and page numbers, is stored along with the alphanumeric characters.
Finally, a binary file is a continuous sequence of bytes; the line break we see in a text file is just a character joining the first line to the next.
Sometimes, the data generated by other programs needs to be processed by R as a binary file.
R may also be required to create binary files which can be shared with other programs.
R has two functions, writeBin() and readBin(), to create and read binary files.
Syntax
writeBin(object, con)
readBin(con, what, n)
Following is the description of the parameters used -
con is the connection object used to read or write the binary file.
object is the R object to be written to the file.
what is the mode, like character, integer etc., representing the type of data to be read.
n is the (maximal) number of elements to read from the binary file.
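A minimal round trip through a temporary file illustrates these parameters using only base R:

```r
# Write a small integer vector to a temporary binary file and read it back.
vals <- c(6L, 6L, 4L, 6L, 8L)
tmp <- tempfile(fileext = ".dat")
# Open the connection in binary write mode ("wb") and write the integers.
con <- file(tmp, "wb")
writeBin(vals, con)
close(con)
# Re-open in binary read mode ("rb"); n is the number of elements to read back.
con <- file(tmp, "rb")
back <- readBin(con, what = integer(), n = length(vals))
close(con)
print(identical(back, vals))  # TRUE
```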
Example
We consider the R inbuilt data "mtcars".
First we create a CSV file from it, convert it to a binary file and store it as an OS file.
Next we read this binary file back into R.
Writing the Binary File
We write the data frame "mtcars" to a CSV file, read part of it back, and then write it as a binary file to the OS.
# Read the "mtcars" data frame as a csv file and store only the columns
"cyl", "am" and "gear".
write.table(mtcars, file = "mtcars.csv",row.names = FALSE, na = "",
col.names = TRUE, sep = ",")
# Store 5 records from the csv file as a new data frame.
new.mtcars <- read.table("mtcars.csv",sep = ",",header = TRUE,nrows = 5)
# Create a connection object to write the binary file using mode "wb".
write.filename = file("/web/com/binmtcars.dat", "wb")
# Write the column names of the data frame to the connection object.
writeBin(colnames(new.mtcars), write.filename)
# Write the records in each of the column to the file.
writeBin(c(new.mtcars$cyl,new.mtcars$am,new.mtcars$gear), write.filename)
# Close the file for writing so that it can be read by other program.
close(write.filename)
Reading the Binary File
The binary file created above stores all the data as continuous bytes.
So we will read it back by choosing appropriate values of n for the column names as well as the column values.
# Create a connection object to read the file in binary mode using "rb".
read.filename <- file("/web/com/binmtcars.dat", "rb")
# First read the column names. n = 3 as we have 3 columns.
column.names <- readBin(read.filename, character(), n = 3)
# Next read the column values. n = 18 as the 3 column-name strings occupy
# the first 3 integer-sized blocks, followed by the 15 values.
read.filename <- file("/web/com/binmtcars.dat", "rb")
bindata <- readBin(read.filename, integer(), n = 18)
# Print the data.
print(bindata)
# Extract the values at positions 4 to 8, which represent "cyl".
cyldata = bindata[4:8]
print(cyldata)
# Extract the values at positions 9 to 13, which represent "am".
amdata = bindata[9:13]
print(amdata)
# Extract the values at positions 14 to 18, which represent "gear".
geardata = bindata[14:18]
print(geardata)
# Combine all the read values into a data frame.
finaldata = cbind(cyldata, amdata, geardata)
colnames(finaldata) = column.names
print(finaldata)
When we execute the above code, it produces the following result -
[1] 7108963 1728081249 7496037 6 6 4
[7] 6 8 1 1 1 0
[13] 0 4 4 4 3 3
[1] 6 6 4 6 8
[1] 1 1 1 0 0
[1] 4 4 4 3 3
cyl am gear
[1,] 6 1 4
[2,] 6 1 4
[3,] 4 1 4
[4,] 6 0 3
[5,] 8 0 3
As we can see, we got the original data back by reading the binary file in R.
R - XML Files
XML stands for Extensible Markup Language.
It is a file format which shares both the file format and the data on the World Wide Web, intranets, and elsewhere using standard ASCII text.
Similar to HTML, it contains markup tags.
But unlike HTML, where the markup tags describe the structure of the page, in XML the markup tags describe the meaning of the data contained in the file.
You can read an XML file in R using the "XML" package.
This package can be installed using the following command.
install.packages("XML")
Input Data
Create an XML file by copying the below data into a text editor like Notepad.
Save the file with a .xml extension, choosing the file type as All Files (*.*).
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
Reading XML File
The XML file is read by R using the function xmlParse().
The parsed result is stored as an XML document object in R.
# Load the package required to read XML files.
library("XML")
# Also load the other required package.
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Print the result.
print(result)
When we execute the above code, it produces the following result -
<?xml version="1.0"?>
<RECORDS>
  <EMPLOYEE>
    <ID>1</ID>
    <NAME>Rick</NAME>
    <SALARY>623.3</SALARY>
    <STARTDATE>1/1/2012</STARTDATE>
    <DEPT>IT</DEPT>
  </EMPLOYEE>
  ... (the remaining seven EMPLOYEE records follow, mirroring the contents of input.xml shown above) ...
</RECORDS>
Get Number of Nodes Present in XML File
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Find number of nodes in the root.
rootsize <- xmlSize(rootnode)
# Print the result.
print(rootsize)
When we execute the above code, it produces the following result -
[1] 8
Details of the First Node
Let's look at the first record of the parsed file.
It will give us an idea of the various elements present in the top level node.
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Print the result.
print(rootnode[1])
When we execute the above code, it produces the following result -
$EMPLOYEE
<EMPLOYEE>
  <ID>1</ID>
  <NAME>Rick</NAME>
  <SALARY>623.3</SALARY>
  <STARTDATE>1/1/2012</STARTDATE>
  <DEPT>IT</DEPT>
</EMPLOYEE>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
Get Different Elements of a Node
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Get the first element of the first node.
print(rootnode[[1]][[1]])
# Get the fifth element of the first node.
print(rootnode[[1]][[5]])
# Get the second element of the third node.
print(rootnode[[3]][[2]])
When we execute the above code, it produces the following result -
<ID>1</ID>
<DEPT>IT</DEPT>
<NAME>Michelle</NAME>
XML to Data Frame
To handle the data effectively in large files, we read the data in the XML file as a data frame.
Then we process the data frame for data analysis.
# Load the packages required to read XML files.
library("XML")
library("methods")
# Convert the input xml file to a data frame.
xmldataframe <- xmlToDataFrame("input.xml")
print(xmldataframe)
When we execute the above code, it produces the following result -
ID NAME SALARY STARTDATE DEPT
1 1 Rick 623.3 1/1/2012 IT
2 2 Dan 515.2 9/23/2013 Operations
3 3 Michelle 611 11/15/2014 IT
4 4 Ryan 729 5/11/2014 HR
5 5 Gary 843.25 3/27/2015 Finance
6 6 Nina 578 5/21/2013 IT
7 7 Simon 632.8 7/30/2013 Operations
8 8 Guru 722.5 6/17/2014 Finance
As the data is now available as a data frame, we can use the data frame related functions to read and manipulate it.
R - JSON Files
A JSON file stores data as text in human-readable format.
JSON stands for JavaScript Object Notation.
R can read JSON files using the rjson package.
Install rjson Package
In the R console, you can issue the following command to install the rjson package.
install.packages("rjson")
Input Data
Create a JSON file by copying the below data into a text editor like Notepad.
Save the file with a .json extension, choosing the file type as All Files (*.*).
{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
Read the JSON File
The JSON file is read by R using the function fromJSON().
It is stored as a list in R.
# Load the package required to read JSON files.
library("rjson")
# Give the input file name to the function.
result <- fromJSON(file = "input.json")
# Print the result.
print(result)
When we execute the above code, it produces the following result -
$ID
[1] "1" "2" "3" "4" "5" "6" "7" "8"
$Name
[1] "Rick" "Dan" "Michelle" "Ryan" "Gary" "Nina" "Simon" "Guru"
$Salary
[1] "623.3" "515.2" "611" "729" "843.25" "578" "632.8" "722.5"
$StartDate
[1] "1/1/2012" "9/23/2013" "11/15/2014" "5/11/2014" "3/27/2015" "5/21/2013"
"7/30/2013" "6/17/2014"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance" "IT"
"Operations" "Finance"
Convert JSON to a Data Frame
We can convert the extracted data above to an R data frame for further analysis using the as.data.frame() function.
# Load the package required to read JSON files.
library("rjson")
# Give the input file name to the function.
result <- fromJSON(file = "input.json")
# Convert JSON file to a data frame.
json_data_frame <- as.data.frame(result)
print(json_data_frame)
When we execute the above code, it produces the following result -
ID Name Salary StartDate Dept
1 1 Rick 623.3 1/1/2012 IT
2 2 Dan 515.2 9/23/2013 Operations
3 3 Michelle 611 11/15/2014 IT
4 4 Ryan 729 5/11/2014 HR
5 5 Gary 843.25 3/27/2015 Finance
6 6 Nina 578 5/21/2013 IT
7 7 Simon 632.8 7/30/2013 Operations
8 8 Guru 722.5 6/17/2014 Finance
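The conversion step itself needs only base R; as a sketch, as.data.frame() turns any named list of equal-length vectors into a data frame (the list below is a made-up subset mimicking fromJSON()'s return value):

```r
# A made-up list with the same shape as fromJSON()'s output.
result <- list(ID = c("1", "2"), Name = c("Rick", "Dan"))
# Each list element becomes a column of the data frame.
df <- as.data.frame(result)
print(df)
print(ncol(df))  # 2
```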
R - Web Data
Many websites provide data for consumption by their users.
For example, the World Health Organization (WHO) provides reports on health and medical information in the form of CSV, TXT and XML files.
Using R programs, we can programmatically extract specific data from such websites.
Some packages in R which are used to scrape data from the web are "RCurl", "XML", and "stringr".
They are used to connect to the URLs, identify the required links for the files and download them to the local environment.
Install R Packages
The following packages are required for processing the URLs and the links to the files.
If they are not available in your R environment, you can install them using the following commands.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
Input Data
We will visit the URL weather data and use R to download the CSV files for the year 2015.
Example
We will use the function getHTMLLinks() to gather the URLs of the files.
Then we will use the function download.file() to save the files to the local system.
As we will be applying the same code again and again for multiple files, we will create a function to be called multiple times.
The filenames are passed as parameters in the form of an R list object to this function.
# Load the required packages.
library(RCurl)
library(XML)
library(stringr)
library(plyr)
# Read the URL.
url <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"
# Gather the html links present in the webpage.
links <- getHTMLLinks(url)
# Identify only the links which point to the JCMB 2015 files.
filenames <- links[str_detect(links, "JCMB_2015")]
# Store the file names as a list.
filenames_list <- as.list(filenames)
# Create a function to download the files by passing the URL and filename list.
downloadcsv <- function (mainurl,filename) {
filedetails <- str_c(mainurl,filename)
download.file(filedetails,filename)
}
# Now apply the l_ply function and save the files into the current R working directory.
l_ply(filenames,downloadcsv,mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
Verify the File Download
After running the above code, you can locate the following files in the current R working directory.
"JCMB_2015.csv" "JCMB_2015_Apr.csv" "JCMB_2015_Feb.csv" "JCMB_2015_Jan.csv"
"JCMB_2015_Mar.csv"
R - Databases
The data in relational database systems is stored in a normalized format.
So, to carry out statistical computing, we would need very advanced and complex SQL queries.
But R can connect easily to many relational databases like MySQL, Oracle, SQL Server etc. and fetch records from them as a data frame.
Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using all the powerful packages and functions.
In this tutorial we will be using MySQL as our reference database for connecting to R.
RMySQL Package
R has a package named "RMySQL" which provides native connectivity between R and MySQL databases.
You can install this package in the R environment using the following command.
install.packages("RMySQL")
Connecting R to MySQL
Once the package is installed we create a connection object in R to connect to the database.
It takes the username, password, database name and host name as input.
# Load the RMySQL package.
library(RMySQL)
# Create a connection object to the MySQL database.
# We will connect to the sample database named "sakila" that comes with the MySQL installation.
mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila',
host = 'localhost')
# List the tables available in this database.
dbListTables(mysqlconnection)
When we execute the above code, it produces the following result -
[1] "actor" "actor_info"
[3] "address" "category"
[5] "city" "country"
[7] "customer" "customer_list"
[9] "film" "film_actor"
[11] "film_category" "film_list"
[13] "film_text" "inventory"
[15] "language" "nicer_but_slower_film_list"
[17] "payment" "rental"
[19] "sales_by_film_category" "sales_by_store"
[21] "staff" "staff_list"
[23] "store"
Querying the Tables
We can query the database tables in MySQL using the function dbSendQuery().
The query gets executed in MySQL and the result set is returned using the R fetch() function.
Finally it is stored as a data frame in R.
# Query the "actor" tables to get all the rows.
result = dbSendQuery(mysqlconnection, "select * from actor")
# Store the result in an R data frame object. n = 5 is used to fetch the first 5 rows.
data.frame = fetch(result, n = 5)
print(data.frame)
When we execute the above code, it produces the following result -
actor_id first_name last_name last_update
1 1 PENELOPE GUINESS 2006-02-15 04:34:33
2 2 NICK WAHLBERG 2006-02-15 04:34:33
3 3 ED CHASE 2006-02-15 04:34:33
4 4 JENNIFER DAVIS 2006-02-15 04:34:33
5 5 JOHNNY LOLLOBRIGIDA 2006-02-15 04:34:33
Query with Filter Clause
We can pass any valid select query to get the result.
result = dbSendQuery(mysqlconnection, "select * from actor where last_name = 'TORN'")
# Fetch all the records (with n = -1) and store them as a data frame.
data.frame = fetch(result, n = -1)
print(data.frame)
When we execute the above code, it produces the following result -
actor_id first_name last_name last_update
1 18 DAN TORN 2006-02-15 04:34:33
2 94 KENNETH TORN 2006-02-15 04:34:33
3 102 WALTER TORN 2006-02-15 04:34:33
Updating Rows in the Tables
We can update the rows in a MySQL table by passing the update query to the dbSendQuery() function.
dbSendQuery(mysqlconnection, "update mtcars set disp = 168.5 where hp = 110")
After executing the above code we can see the table updated in the MySQL environment.
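To confirm the change from within R, we can read the affected rows back with dbGetQuery(), which sends a query and fetches the complete result in one step (a running MySQL server with this table is assumed):

```r
# Read back the rows affected by the update.
updated <- dbGetQuery(mysqlconnection, "select disp, hp from mtcars where hp = 110")
print(updated)
# Every disp value in the result should now be 168.5.
```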
Inserting Data into the Tables
We can insert new rows into a MySQL table by passing an insert query to the dbSendQuery() function in the same way.
dbSendQuery(mysqlconnection,
"insert into mtcars(row_names, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb)
values('New Mazda RX4 Wag', 21, 6, 168.5, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4)"
)
After executing the above code we can see the row inserted into the table in the MySQL environment.
Creating Tables in MySQL
We can create tables in MySQL using the function dbWriteTable().
It takes a data frame as input and overwrites the table if it already exists.
# Create the connection object to the database where we want to create the table.
mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila',
host = 'localhost')
# Use the R data frame "mtcars" to create the table in MySQL.
# All the rows of mtcars are taken into MySQL.
dbWriteTable(mysqlconnection, "mtcars", mtcars[, ], overwrite = TRUE)
After executing the above code we can see the table created in the MySQL environment.
Dropping Tables in MySQL
We can drop tables in the MySQL database by passing a drop table statement to dbSendQuery(), in the same way we used it for querying data from tables.
dbSendQuery(mysqlconnection, 'drop table if exists mtcars')
After executing the above code we can see that the table is dropped in the MySQL environment.
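When we are done working with the database, it is good practice to release any pending result sets and close the connection. A minimal sketch using the standard DBI functions:

```r
# Free the result set held by an earlier dbSendQuery() call.
dbClearResult(result)
# Close the connection to the MySQL server.
dbDisconnect(mysqlconnection)
```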
R - Pie Charts
R Programming language has numerous libraries to create charts and graphs.
A pie-chart is a representation of values as slices of a circle with different colors.
The slices are labeled, and the number corresponding to each slice is also represented in the chart.
In R a pie chart is created using the pie() function, which takes positive numbers as a vector input.
The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using R is -
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used -
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart (a value between -1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anti-clockwise.
Example
A very simple pie-chart is created using just the input vector and labels.
The below script will create and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city.png")
# Plot the chart.
pie(x,labels)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Pie Chart Title and Colors
We can expand the features of the chart by adding more parameters to the function.
We will use the parameter main to add a title to the chart, and the parameter col to make use of a rainbow color palette while drawing the chart.
The length of the palette should be the same as the number of values we have for the chart, hence we use length(x).
Example
The below script will create and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city_title_colours.jpg")
# Plot the chart with a title and a rainbow color palette.
pie(x, labels, main = "City pie chart", col = rainbow(length(x)))
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Slice Percentages and Chart Legend
We can add slice percentage and a chart legend by creating additional chart variables.
# Create data for the graph.
x <- c(21, 62, 10,53)
labels <- c("London","New York","Singapore","Mumbai")
piepercent<- round(100*x/sum(x), 1)
# Give the chart file a name.
png(file = "city_percentage_legends.jpg")
# Plot the chart.
pie(x, labels = piepercent, main = "City pie chart",col = rainbow(length(x)))
legend("topright", c("London","New York","Singapore","Mumbai"), cex = 0.8,
fill = rainbow(length(x)))
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages.
The package plotrix has a function called pie3D() that is used for this.
# Get the library.
library(plotrix)
# Create data for the graph.
x <- c(21, 62, 10,53)
lbl <- c("London","New York","Singapore","Mumbai")
# Give the chart file a name.
png(file = "3d_pie_chart.jpg")
# Plot the chart.
pie3D(x, labels = lbl, explode = 0.1, main = "Pie Chart of Cities")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Bar Charts
A bar chart represents data in rectangular bars, with the length of each bar proportional to the value of the variable.
R uses the function barplot() to create bar charts.
R can draw both vertical and horizontal bars in the bar chart.
In a bar chart each of the bars can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is -
barplot(H,xlab,ylab,main, names.arg,col)
Following is the description of the parameters used -
H is a vector or matrix containing numeric values used in bar chart.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the bar chart.
names.arg is a vector of names appearing under each bar.
col is used to give colors to the bars in the graph.
Example
A simple bar chart is created using just the input vector and the name of each bar.
The below script will create and save the bar chart in the current R working directory.
# Create the data for the chart
H <- c(7,12,28,3,41)
# Give the chart file a name
png(file = "barchart.png")
# Plot the bar chart
barplot(H)
# Save the file
dev.off()
When we execute the above code, it produces the following result -
Bar Chart Labels, Title and Colors
The features of the bar chart can be expanded by adding more parameters.
The main parameter is used to add a title.
The col parameter is used to add colors to the bars.
The names.arg parameter is a vector with the same number of values as the input vector, giving the name of each bar.
Example
The below script will create and save the bar chart in the current R working directory.
# Create the data for the chart
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
# Give the chart file a name
png(file = "barchart_months_revenue.png")
# Plot the bar chart
barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
main="Revenue chart",border="red")
# Save the file
dev.off()
When we execute the above code, it produces the following result -
Group Bar Chart and Stacked Bar Chart
We can create a bar chart with groups of bars, and with stacks in each bar, by using a matrix as the input values.
More than two variables are represented as a matrix, which is used to create the grouped bar chart and the stacked bar chart.
# Create the input vectors.
colors = c("green","orange","brown")
months <- c("Mar","Apr","May","Jun","Jul")
regions <- c("East","West","North")
# Create the matrix of the values.
Values <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11), nrow = 3, ncol = 5, byrow = TRUE)
# Give the chart file a name
png(file = "barchart_stacked.png")
# Create the bar chart
barplot(Values, main = "total revenue", names.arg = months, xlab = "month", ylab = "revenue", col = colors)
# Add the legend to the chart
legend("topleft", regions, cex = 1.3, fill = colors)
# Save the file
dev.off()
R - Boxplots
Boxplots are a measure of how well the data in a data set is distributed.
A boxplot divides the data set into quartiles.
This graph represents the minimum, maximum, median, first quartile and third quartile of the data set.
It is also useful in comparing the distribution of data across data sets by drawing boxplots for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is -
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used -
x is a vector or a formula.
data is the data frame.
notch is a logical value.
Set it to TRUE to draw a notch.
varwidth is a logical value.
Set it to TRUE to draw the width of each box proportional to the sample size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
Example
We use the data set "mtcars" available in the R environment to create a basic boxplot.
Let's look at the columns "mpg" and "cyl" in mtcars.
input <- mtcars[,c('mpg','cyl')]
print(head(input))
When we execute the above code, it produces the following result -
mpg cyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Datsun 710 22.8 4
Hornet 4 Drive 21.4 6
Hornet Sportabout 18.7 8
Valiant 18.1 6
Creating the Boxplot
The below script will create a boxplot graph for the relation between mpg (miles per gallon) and cyl (number of cylinders).
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Boxplot with Notch
We can draw a boxplot with a notch to find out how the medians of different data groups match with each other.
The below script will create a boxplot graph with a notch for each data group.
# Give the chart file a name.
png(file = "boxplot_with_notch.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars,
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon",
main = "Mileage Data",
notch = TRUE,
varwidth = TRUE,
col = c("green","yellow","purple"),
names = c("High","Medium","Low")
)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Histograms
A histogram represents the frequencies of values of a variable bucketed into ranges.
A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges.
The height of each bar in a histogram represents the number of values present in that range.
R creates histograms using the hist() function.
This function takes a vector as input and uses some more parameters to plot the histogram.
Syntax
The basic syntax for creating a histogram using R is -
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used -
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
col is used to set color of the bars.
border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to control the number of bars, either as a suggested number of cells or as a vector of break points.
Example
A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
# Give the chart file a name.
png(file = "histogram.png")
# Create the histogram.
hist(v,xlab = "Weight",col = "yellow",border = "blue")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Range of X and Y values
To specify the range of values allowed on the X axis and the Y axis, we can use the xlim and ylim parameters.
The width of the bars can be controlled by using breaks, which suggests the number of cells for the histogram.
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
# Give the chart file a name.
png(file = "histogram_lim_breaks.png")
# Create the histogram.
hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim = c(0,5),
breaks = 5)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Line Graphs
A line chart is a graph that connects a series of points by drawing line segments between them.
These points are ordered by the value of one of their coordinates (usually the x-coordinate).
Line charts are usually used for identifying trends in data.
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is -
plot(v,type,col,xlab,ylab)
Following is the description of the parameters used -
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the chart.
col is used to give colors to both the points and lines.
Example
A simple line chart is created using the input vector and the type parameter as "o".
The below script will create and save a line chart in the current R working directory.
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "line_chart.jpg")
# Plot the line chart.
plot(v,type = "o")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Line Chart Title, Color and Labels
The features of the line chart can be expanded by using additional parameters.
We add color to the points and lines, give a title to the chart and add labels to the axes.
Example
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "line_chart_label_colored.jpg")
# Plot the line chart.
plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Multiple Lines in a Line Chart
More than one line can be drawn on the same chart by using the lines() function.
After the first line is plotted, the lines() function can use an additional vector as input to draw the second line on the chart.
# Create the data for the chart.
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
# Give the chart file a name.
png(file = "line_chart_2_lines.jpg")
# Plot the line chart.
plot(v,type = "o",col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
lines(t, type = "o", col = "blue")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
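A legend makes it clear which line is which. The script below redraws the two-series chart and labels the lines with the legend() function; the series names used here are placeholders for illustration.

```r
# Create the data for the chart.
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
# Give the chart file a name.
png(file = "line_chart_legend.jpg")
# Plot the two lines.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
   main = "Rain fall chart")
lines(t, type = "o", col = "blue")
# Label the lines (placeholder series names).
legend("topright", legend = c("Series 1", "Series 2"),
   col = c("red", "blue"), lty = 1, pch = 1, cex = 0.8)
# Save the file.
dev.off()
```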
R - Scatterplots
Scatterplots show many points plotted in the Cartesian plane.
Each point represents the values of two variables.
One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is -
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used -
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot.
Let's use the columns "wt" and "mpg" in mtcars.
input <- mtcars[,c('wt','mpg')]
print(head(input))
When we execute the above code, it produces the following result -
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1
Creating the Scatterplot
The below script will create a scatterplot graph for the relation between wt(weight) and mpg(miles per gallon).
# Get the input values.
input <- mtcars[,c('wt','mpg')]
# Give the chart file a name.
png(file = "scatterplot.png")
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = input$wt, y = input$mpg,
   xlab = "Weight",
   ylab = "Mileage",
   xlim = c(2.5,5),
   ylim = c(15,30),
   main = "Weight vs Mileage"
)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
Scatterplot Matrices
When we have more than two variables and we want to find the correlation between one variable and the remaining ones, we use a scatterplot matrix.
We use pairs() function to create matrices of scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is -
pairs(formula, data)
Following is the description of the parameters used -
formula represents the series of variables used in pairs.
data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables.
A scatterplot is plotted for each pair.
# Give the chart file a name.
png(file = "scatterplot_matrices.png")
# Plot the matrices between 4 variables, giving 12 scatterplots.
# Each of the 4 variables is plotted against the 3 others.
pairs(~wt+mpg+disp+cyl,data = mtcars,
main = "Scatterplot Matrix")
# Save the file.
dev.off()
When the above code is executed we get the following output.
R - Mean, Median and Mode
Statistical analysis in R is performed by using many in-built functions.
Most of these functions are part of the R base package.
These functions take an R vector as input, along with further arguments, and give the result.
The functions we are discussing in this chapter are mean, median and mode.
Mean
The mean is calculated by taking the sum of the values and dividing by the number of values in a data series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is -
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used -
x is the input vector.
trim is used to drop some observations from both ends of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
When we execute the above code, it produces the following result -
[1] 8.22
Applying Trim Option
When the trim parameter is supplied, the values in the vector are sorted and then the required number of observations is dropped from each end while calculating the mean.
When trim = 0.3, 3 values from each end will be dropped from the calculation to find the mean (0.3 times 10 values).
In this case the sorted vector is (-21, -5, 2, 3, 4.2, 7, 8, 12, 18, 54), and the values removed from the vector for calculating the mean are (-21, -5, 2) from the left and (12, 18, 54) from the right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)
When we execute the above code, it produces the following result -
[1] 5.55
Applying NA Option
If there are missing values, then the mean() function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA values.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean.
result.mean <- mean(x)
print(result.mean)
# Find mean dropping NA values.
result.mean <- mean(x,na.rm = TRUE)
print(result.mean)
When we execute the above code, it produces the following result -
[1] NA
[1] 8.22
Median
The middle-most value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is -
median(x, na.rm = FALSE)
Following is the description of the parameters used -
x is the input vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
median.result <- median(x)
print(median.result)
When we execute the above code, it produces the following result -
[1] 5.6
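When the vector contains an even number of values, there is no single middle value; median() then returns the average of the two middle values of the sorted vector:

```r
# Create a vector with an even number of values.
x <- c(12,7,3,4.2,18,2,54,-21)
# Sorted: -21, 2, 3, 4.2, 7, 12, 18, 54; the middle pair is 4.2 and 7.
median.result <- median(x)
print(median.result)   # (4.2 + 7) / 2 = 5.6
```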
Mode
The mode is the value that has the highest number of occurrences in a set of data.
Unlike mean and median, the mode can be found for both numeric and character data.
R does not have a standard in-built function to calculate the mode.
So we create a user function to calculate the mode of a data set in R.
This function takes the vector as input and gives the mode value as output.
Example
# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Create the vector with numbers.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
# Calculate the mode using the user function.
result <- getmode(v)
print(result)
# Create the vector with characters.
charv <- c("o","it","the","it","it")
# Calculate the mode using the user function.
result <- getmode(charv)
print(result)
When we execute the above code, it produces the following result -
[1] 2
[1] "it"
R - Linear Regression
Regression analysis is a very widely used statistical tool to establish a relationship model between two variables.
One of these variables is called the predictor variable, whose value is gathered through experiments.
The other variable is called the response variable, whose value is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent (power) of both variables is 1.
Mathematically, a linear relationship represents a straight line when plotted as a graph.
A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is -
y = ax + b
Following is the description of the parameters used -
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when their height is known.
To do this we need the relationship between the height and weight of a person.
The steps to create the relationship are -
Carry out the experiment of gathering a sample of observed values of height and corresponding weight.
Create a relationship model using the lm() functions in R.
Find the coefficients from the model created and create the mathematical equation using these.
Get a summary of the relationship model to know the average error in prediction (the prediction errors are also called residuals).
To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations -
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is -
lm(formula,data)
Following is the description of the parameters used -
formula is a symbol presenting the relation between x and y.
data is the data frame on which the formula will be applied.
Create Relationship Model & get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
When we execute the above code, it produces the following result -
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
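For simple linear regression, the same coefficients can be derived by hand: the slope is cov(x, y)/var(x), and the intercept is mean(y) minus slope times mean(x). A quick check against the lm() output above:

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Least-squares slope and intercept computed directly.
b <- cov(x, y) / var(x)      # slope, about 0.6746
a <- mean(y) - b * mean(x)   # intercept, about -38.4551
print(c(intercept = a, slope = b))
```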
Get the Summary of the Relationship
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(summary(relation))
When we execute the above code, it produces the following result -
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
predict() Function
Syntax
The basic syntax for predict() in linear regression is -
predict(object, newdata)
Following is the description of the parameters used -
object is the model which has already been created using the lm() function.
newdata is the data frame containing the new value for the predictor variable.
Predict the weight of new persons
# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)
When we execute the above code, it produces the following result -
1
76.22869
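Since the fitted model is just the line y = a + b*x, the same prediction can be reproduced from the coefficients returned by coef():

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
# Evaluate the line equation at the new height value 170.
manual <- coef(relation)[1] + coef(relation)[2] * 170
print(unname(manual))   # about 76.22869
```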
Visualize the Regression Graphically
# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
# Give the chart file a name.
png(file = "linearregression.png")
# Plot the chart.
plot(y,x,col = "blue",main = "Height & Weight Regression",
abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Multiple Regression
Multiple regression is an extension of linear regression to the relationship between more than two variables.
In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
The general mathematical equation for multiple regression is -
y = a + b1x1 + b2x2 +...bnxn
Following is the description of the parameters used -
y is the response variable.
a, b1, b2...bn are the coefficients.
x1, x2, ...xn are the predictor variables.
We create the regression model using the lm() function in R.
The model determines the value of the coefficients using the input data.
Next we can predict the value of the response variable for a given set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is -
lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used -
formula is a symbol presenting the relation between the response variable and the predictor variables.
data is the data frame on which the formula will be applied.
Example
Input Data
Consider the data set "mtcars" available in the R environment.
It gives a comparison between different car models in terms of mileage per gallon ("mpg"), cylinder displacement ("disp"), horse power ("hp"), weight of the car ("wt") and some more parameters.
The goal of the model is to establish the relationship between "mpg" as a response variable with "disp","hp" and "wt" as predictor variables.
We create a subset of these variables from the mtcars data set for this purpose.
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
When we execute the above code, it produces the following result -
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
Create Relationship Model & get the Coefficients
input <- mtcars[,c("mpg","disp","hp","wt")]
# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
# Show the model.
print(model)
# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)
When we execute the above code, it produces the following result -
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
# # # # The Coefficient Values # # #
(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891
Create Equation for Regression Model
Based on the above intercept and coefficient values, we create the mathematical equation.
Y = a + Xdisp.x1 + Xhp.x2 + Xwt.x3
or
Y = 37.1055 + (-0.000937)*x1 + (-0.0312)*x2 + (-3.8009)*x3
Apply Equation for predicting New Values
We can use the regression equation created above to predict the mileage when a new set of values for displacement, horse power and weight is provided.
For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is -
Y = 37.1055 + (-0.000937)*221 + (-0.0312)*102 + (-3.8009)*2.91 = 22.66 (approximately)
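Instead of substituting the rounded coefficients by hand, the predict() function gives the prediction from the fitted model at full precision:

```r
input <- mtcars[,c("mpg","disp","hp","wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)
# Predict the mileage for disp = 221, hp = 102 and wt = 2.91.
new.car <- data.frame(disp = 221, hp = 102, wt = 2.91)
print(predict(model, new.car))   # about 22.66
```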
R - Logistic Regression
Logistic regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1.
It actually measures the probability of a binary response, as the value of the response variable, based on the mathematical equation relating it to the predictor variables.
The general mathematical equation for logistic regression is -
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
Following is the description of the parameters used -
y is the response variable.
x is the predictor variable.
a and b are the coefficients which are numeric constants.
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for glm() function in logistic regression is -
glm(formula,data,family)
Following is the description of the parameters used -
formula is the symbol presenting the relationship between the variables.
data is the data set giving the values of these variables.
family is an R object used to specify the details of the model.
Its value is binomial for logistic regression.
Example
The in-built data set "mtcars" describes different models of a car with their various engine specifications.
In "mtcars" data set, the transmission mode (automatic or manual) is described by the column am which is a binary value (0 or 1).
We can create a logistic regression model between the columns "am" and 3 other columns - hp, wt and cyl.
# Select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))
When we execute the above code, it produces the following result -
am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460
Create Regression Model
We use the glm() function to create the regression model and get its summary for analysis.
input <- mtcars[,c("am","cyl","hp","wt")]
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
When we execute the above code, it produces the following result -
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
Conclusion
In the summary, as the p-value in the last column is more than 0.05 for the variables "cyl" and "hp", we consider them to be insignificant in contributing to the value of the variable "am".
Only weight (wt) impacts the "am" value in this regression model.
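Once fitted, the model can also be used for prediction: calling predict() with type = "response" returns the estimated probability on the 0-1 scale rather than the log-odds. A brief sketch; the car specification below is hypothetical and not a row from the data set.

```r
# Refit the logistic regression model from above.
input <- mtcars[,c("am","cyl","hp","wt")]
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

# Hypothetical car: 4 cylinders, 100 hp, weight 2.5 (1000 lbs).
new.car <- data.frame(cyl = 4, hp = 100, wt = 2.5)

# Estimated probability of a manual transmission (am = 1).
print(predict(am.data, newdata = new.car, type = "response"))
```

A light car with modest power gets a probability well above 0.5, consistent with the conclusion that weight drives the transmission type in this model.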
R - Normal Distribution
In a random collection of data from independent sources, it is generally observed that the distribution of data is normal.
This means that on plotting a graph with the value of the variable on the horizontal axis and the count of the values on the vertical axis, we get a bell-shaped curve.
The center of the curve represents the mean of the data set.
In the graph, fifty percent of the values lie to the left of the mean and the other fifty percent lie to the right of it.
This is referred to as the normal distribution in statistics.
R has four in-built functions to generate the normal distribution.
They are described below.
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)
Following is the description of the parameters used in above functions -
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations (sample size).
mean is the mean value of the sample data.
Its default value is zero.
sd is the standard deviation.
Its default value is 1.
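These four functions are related: pnorm() is the cumulative integral of dnorm(), qnorm() is the inverse of pnorm(), and rnorm() draws random samples. A quick sanity check for the standard normal (mean 0, sd 1):

```r
# Height of the standard normal density at its mean.
print(dnorm(0))                 # 1/sqrt(2*pi), about 0.3989

# Half of the values lie below the mean.
print(pnorm(0))                 # 0.5

# qnorm() undoes pnorm().
print(qnorm(pnorm(1.5)))        # 1.5
```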
dnorm()
This function gives the height of the probability density at each point for a given mean and standard deviation.
# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
# Give the chart file a name.
png(file = "dnorm.png")
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
pnorm()
This function gives the probability of a normally distributed random number being less than the value of a given number.
It is also called "Cumulative Distribution Function".
# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10,10,by = .2)
# Choose the mean as 2.5 and standard deviation as 2.
y <- pnorm(x, mean = 2.5, sd = 2)
# Give the chart file a name.
png(file = "pnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
qnorm()
This function takes the probability value and gives a number whose cumulative value matches the probability value.
# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
# Choose the mean as 2 and standard deviation as 1.
y <- qnorm(x, mean = 2, sd = 1)
# Give the chart file a name.
png(file = "qnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
rnorm()
This function is used to generate random numbers whose distribution is normal.
It takes the sample size as input and generates that many random numbers.
We draw a histogram to show the distribution of the generated numbers.
# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)
# Give the chart file a name.
png(file = "rnorm.png")
# Plot the histogram for this sample.
hist(y, main = "Normal Distribution")
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
R - Binomial Distribution
The binomial distribution model deals with finding the probability of success of an event which has only two possible outcomes in a series of experiments.
For example, tossing of a coin always gives a head or a tail.
The probability of finding exactly 3 heads in 10 repeated tosses of a coin can be computed using the binomial distribution.
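For instance, that probability can be computed directly with the dbinom() function described below (a quick sketch using base R):

```r
# P(exactly 3 heads in 10 tosses of a fair coin)
p <- dbinom(3, size = 10, prob = 0.5)
print(p)   # choose(10, 3) / 2^10 = 0.1171875
```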
R has four in-built functions to generate binomial distribution.
They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
Following is the description of the parameters used -
x is a vector of numbers.
p is a vector of probabilities.
n is number of observations.
size is the number of trials.
prob is the probability of success of each trial.
dbinom()
This function gives the probability mass at each point, i.e. the probability of each possible number of successes.
# Create a sequence of numbers from 0 to 50 incremented by 1.
x <- seq(0,50,by = 1)
# Create the binomial distribution.
y <- dbinom(x,50,0.5)
# Give the chart file a name.
png(file = "dbinom.png")
# Plot the graph for this sample.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
pbinom()
This function gives the cumulative probability of an event.
It is a single value representing the probability.
# Probability of getting 26 or fewer heads in 51 tosses of a coin.
x <- pbinom(26,51,0.5)
print(x)
When we execute the above code, it produces the following result -
[1] 0.610116
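Since the cumulative probability is just the sum of the individual probabilities, the same number can be recovered from dbinom() (a quick cross-check, not part of the original example):

```r
# P(X <= 26) is the sum of P(X = 0) .. P(X = 26).
p.cumulative <- pbinom(26, size = 51, prob = 0.5)
p.summed <- sum(dbinom(0:26, size = 51, prob = 0.5))
print(all.equal(p.cumulative, p.summed))   # TRUE
```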
qbinom()
This function takes the probability value and gives a number whose cumulative value matches the probability value.
# Find the number of heads that has a cumulative probability of 0.25
# when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)
print(x)
When we execute the above code, it produces the following result -
[1] 23
rbinom()
This function generates the required number of random values from a binomial distribution with a given size and probability.
# Find 8 random values from a sample of 150 with probability of 0.4.
x <- rbinom(8,150,.4)
print(x)
When we execute the above code, it produces the following result -
[1] 58 61 59 66 55 60 61 67
R - Poisson Regression
Poisson Regression involves regression models in which the response variable is in the form of counts and not fractional numbers.
For example, the count of births or the number of wins in a football match series.
Also the values of the response variables follow a Poisson distribution.
The general mathematical equation for Poisson regression is -
log(y) = a + b1x1 + b2x2 + ... + bnxn
Following is the description of the parameters used -
y is the response variable.
a and b are the numeric coefficients.
x is the predictor variable.
The function used to create the Poisson regression model is the glm() function.
Syntax
The basic syntax for glm() function in Poisson regression is -
glm(formula,data,family)
Following is the description of the parameters used in above functions -
formula is the symbol representing the relationship between the variables.
data is the data set giving the values of these variables.
family is the R object used to specify the details of the model.
Its value is poisson for Poisson regression.
Example
We have the in-built data set "warpbreaks" which describes the effect of wool type (A or B) and tension (low, medium or high) on the number of warp breaks per loom.
Let's consider "breaks" as the response variable which is a count of number of breaks.
The wool "type" and "tension" are taken as predictor variables.
Input Data
input <- warpbreaks
print(head(input))
When we execute the above code, it produces the following result -
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
Create Regression Model
output <-glm(formula = breaks ~ wool+tension, data = warpbreaks,
family = poisson)
print(summary(output))
When we execute the above code, it produces the following result -
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 297.37 on 53 degrees of freedom
Residual deviance: 210.39 on 50 degrees of freedom
AIC: 493.06
Number of Fisher Scoring iterations: 4
In the summary we look for the p-value in the last column to be less than 0.05 to conclude that a predictor variable has an impact on the response variable.
As seen above, wool type B and the tension levels M and H each have a significant impact on the count of breaks.
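Because of the log link in the equation above, predictions on the response scale are the exponent of the linear predictor. The following sketch verifies this on the fitted model:

```r
# Refit the Poisson regression model from above.
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Predictions on the link (log) scale and on the response (count) scale.
eta <- predict(output, type = "link")
mu  <- predict(output, type = "response")

# exp() of the linear predictor gives the expected counts.
print(all.equal(exp(eta), mu))   # TRUE
```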
R - Analysis of Covariance
We use Regression analysis to create models which describe the effect of variation in predictor variables on the response variable.
Sometimes we also have a categorical variable with values like Yes/No or Male/Female.
A simple regression analysis then gives multiple results, one for each value of the categorical variable.
In such a scenario, we can study the effect of the categorical variable by using it along with the predictor variable and comparing the regression lines for each level of the categorical variable.
Such an analysis is termed Analysis of Covariance, also called ANCOVA.
Example
Consider the R built in data set mtcars.
In it we observe that the field "am" represents the type of transmission (auto or manual).
It is a categorical variable with values 0 and 1.
The miles per gallon value (mpg) of a car can also depend on it besides the value of horse power ("hp").
We study the effect of the value of "am" on the regression between "mpg" and "hp".
It is done by using the aov() function followed by the anova() function to compare the multiple regressions.
Input Data
Create a data frame containing the fields "mpg", "hp" and "am" from the data set mtcars.
Here we take "mpg" as the response variable, "hp" as the predictor variable and "am" as the categorical variable.
input <- mtcars[,c("am","mpg","hp")]
print(head(input))
When we execute the above code, it produces the following result -
am mpg hp
Mazda RX4 1 21.0 110
Mazda RX4 Wag 1 21.0 110
Datsun 710 1 22.8 93
Hornet 4 Drive 0 21.4 110
Hornet Sportabout 0 18.7 175
Valiant 0 18.1 105
ANCOVA Analysis
We create a regression model taking "hp" as the predictor variable and "mpg" as the response variable, taking into account the interaction between "am" and "hp".
Model with interaction between categorical variable and predictor variable
# Get the dataset.
input <- mtcars
# Create the regression model.
result <- aov(mpg~hp*am,data = input)
print(summary(result))
When we execute the above code, it produces the following result -
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 678.4 678.4 77.391 1.50e-09 ***
am 1 202.2 202.2 23.072 4.75e-05 ***
hp:am 1 0.0 0.0 0.001 0.981
Residuals 28 245.4 8.8
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than 0.05.
But the interaction between these two variables is not significant, as the p-value is more than 0.05.
Model without interaction between categorical variable and predictor variable
# Get the dataset.
input <- mtcars
# Create the regression model.
result <- aov(mpg~hp+am,data = input)
print(summary(result))
When we execute the above code, it produces the following result -
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 678.4 678.4 80.15 7.63e-10 ***
am 1 202.2 202.2 23.89 3.46e-05 ***
Residuals 29 245.4 8.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than 0.05.
Comparing Two Models
Now we can compare the two models to conclude whether the interaction of the variables is truly insignificant.
For this we use the anova() function.
# Get the dataset.
input <- mtcars
# Create the regression models.
result1 <- aov(mpg~hp*am,data = input)
result2 <- aov(mpg~hp+am,data = input)
# Compare the two models.
print(anova(result1,result2))
When we execute the above code, it produces the following result -
Model 1: mpg ~ hp * am
Model 2: mpg ~ hp + am
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 245.43
2 29 245.44 -1 -0.0052515 6e-04 0.9806
As the p-value is greater than 0.05 we conclude that the interaction between horse power and transmission type is not significant.
So the mileage per gallon will depend in a similar manner on the horse power of the car in both auto and manual transmission mode.
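The same additive model can also be fitted with lm() to read off the two parallel regression lines directly: the hp slope is shared by both transmission modes, while the am coefficient shifts the intercept for manual cars. A sketch on the same data:

```r
# Additive model: common hp slope, separate intercepts per am level.
model <- lm(mpg ~ hp + am, data = mtcars)
print(coef(model))

# For automatic cars (am = 0): mpg = intercept + hp.slope * hp
# For manual cars (am = 1):    mpg = (intercept + am.coef) + hp.slope * hp
```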
R - Time Series Analysis
Time series is a series of data points in which each data point is associated with a timestamp.
A simple example is the price of a stock in the stock market at different points of time on a given day.
Another example is the amount of rainfall in a region at different months of the year.
R language uses many functions to create, manipulate and plot the time series data.
The data for the time series is stored in an R object called time-series object.
It is also an R data object like a vector or data frame.
The time series object is created by using the ts() function.
Syntax
The basic syntax for ts() function in time series analysis is -
timeseries.object.name <- ts(data, start, end, frequency)
Following is the description of the parameters used -
data is a vector or matrix containing the values used in the time series.
start specifies the start time for the first observation in time series.
end specifies the end time for the last observation in time series.
frequency specifies the number of observations per unit time.
Except the parameter "data" all other parameters are optional.
Example
Consider the annual rainfall details at a place starting from January 2012.
We create an R time series object for a period of 12 months and plot it.
# Get the data points in form of a R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
# Convert it to a time series object.
rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall.png")
# Plot a graph of the time series.
plot(rainfall.timeseries)
# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart -
Jan Feb Mar Apr May Jun Jul Aug Sep
2012 799.0 1174.8 865.1 1334.6 635.4 918.5 685.5 998.6 784.2
Oct Nov Dec
2012 985.0 882.8 1071.0
The Time series chart -
Different Time Intervals
The value of the frequency parameter in the ts() function decides the time intervals at which the data points are measured.
A value of 12 indicates that the time series is for 12 months.
Other values and their meanings are as below -
frequency = 12 pegs the data points for every month of a year.
frequency = 4 pegs the data points for every quarter of a year.
frequency = 6 pegs the data points for every 10 minutes of an hour.
frequency = 24*6 pegs the data points for every 10 minutes of a day.
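As a brief sketch of a non-monthly frequency, the following creates a quarterly series covering two years; the sales figures are made up purely for illustration:

```r
# Eight quarterly observations starting in Q1 2012.
sales <- c(120, 135, 150, 142, 160, 171, 180, 176)
sales.timeseries <- ts(sales, start = c(2012, 1), frequency = 4)

# Printed with Qtr1..Qtr4 columns instead of month names.
print(sales.timeseries)
```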
Multiple Time Series
We can plot multiple time series in one chart by combining both the series into a matrix.
# Get the data points in form of a R vector.
rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <-
c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)
# Convert them to a matrix.
combined.rainfall <- matrix(c(rainfall1,rainfall2),nrow = 12)
# Convert it to a time series object.
rainfall.timeseries <- ts(combined.rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall_combined.png")
# Plot a graph of the time series.
plot(rainfall.timeseries, main = "Multiple Time Series")
# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart -
Series 1 Series 2
Jan 2012 799.0 655.0
Feb 2012 1174.8 1306.9
Mar 2012 865.1 1323.4
Apr 2012 1334.6 1172.2
May 2012 635.4 562.2
Jun 2012 918.5 824.0
Jul 2012 685.5 822.4
Aug 2012 998.6 1265.5
Sep 2012 784.2 799.6
Oct 2012 985.0 1105.6
Nov 2012 882.8 1106.7
Dec 2012 1071.0 1337.8
The Multiple Time series chart -
R - Nonlinear Least Square
When modeling real world data for regression analysis, we observe that it is rarely the case that the equation of the model is a linear equation giving a linear graph.
Most of the time, the equation of the model of real world data involves mathematical functions of a higher degree like an exponent of 3 or a sine function.
In such a scenario, the plot of the model gives a curve rather than a line.
The goal of both linear and non-linear regression is to adjust the values of the model's parameters to find the line or curve that comes closest to your data.
On finding these values we will be able to estimate the response variable with good accuracy.
In Least Square regression, we establish a regression model in which the sum of the squares of the vertical distances of different points from the regression curve is minimized.
We generally start with a defined model and assume some values for the coefficients.
We then apply the nls() function of R to get more accurate values along with the confidence intervals.
Syntax
The basic syntax for creating a nonlinear least square test in R is -
nls(formula, data, start)
Following is the description of the parameters used -
formula is a nonlinear model formula including variables and parameters.
data is a data frame used to evaluate the variables in the formula.
start is a named list or named numeric vector of starting estimates.
Example
We will consider a nonlinear model with assumption of initial values of its coefficients.
Next we will see what the confidence intervals of these assumed values are, so that we can judge how well these values fit into the model.
So let's consider the below equation for this purpose -
a = b1*x^2+b2
Let's assume the initial coefficients to be 1 and 3 and fit these values into nls() function.
xvalues <- c(1.6,2.1,2,2.23,3.71,3.25,3.4,3.86,1.19,2.21)
yvalues <- c(5.19,7.43,6.94,8.11,18.75,14.88,16.06,19.12,3.21,7.58)
# Give the chart file a name.
png(file = "nls.png")
# Plot these values.
plot(xvalues,yvalues)
# Take the assumed values and fit into the model.
model <- nls(yvalues ~ b1*xvalues^2+b2,start = list(b1 = 1,b2 = 3))
# Plot the chart with new data by fitting it to a prediction from 100 data points.
new.data <- data.frame(xvalues = seq(min(xvalues),max(xvalues),len = 100))
lines(new.data$xvalues,predict(model,newdata = new.data))
# Save the file.
dev.off()
# Get the sum of the squared residuals.
print(sum(resid(model)^2))
# Get the confidence intervals on the chosen values of the coefficients.
print(confint(model))
When we execute the above code, it produces the following result -
[1] 1.081935
Waiting for profiling to be done...
2.5% 97.5%
b1 1.137708 1.253135
b2 1.497364 2.496484
We can conclude that the value of b1 is closer to 1 while the value of b2 is closer to 2 and not 3.
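The point estimates themselves can be extracted with coef(); comparing them with the confidence intervals above shows that the fitted b1 and b2 lie well inside their bands. A quick sketch using the same data:

```r
xvalues <- c(1.6,2.1,2,2.23,3.71,3.25,3.4,3.86,1.19,2.21)
yvalues <- c(5.19,7.43,6.94,8.11,18.75,14.88,16.06,19.12,3.21,7.58)

# Fit the same model as above.
model <- nls(yvalues ~ b1*xvalues^2 + b2, start = list(b1 = 1, b2 = 3))

# Fitted values of the coefficients.
print(coef(model))
```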
R - Decision Tree
Decision tree is a graph to represent choices and their results in form of a tree.
The nodes in the graph represent an event or choice and the edges of the graph represent the decision rules or conditions.
It is mostly used in Machine Learning and Data Mining applications using R.
Examples of uses of decision trees are - predicting whether an email is spam or not, predicting whether a tumor is cancerous, or predicting whether a loan is a good or bad credit risk based on the factors in each of these.
Generally, a model is created with observed data also called training data.
Then a set of validation data is used to verify and improve the model.
R has packages which are used to create and visualize decision trees.
For a new set of predictor variables, we use this model to arrive at a decision on the category (yes/no, spam/not spam) of the data.
The R package "party" is used to create decision trees.
Install R Package
Use the below command in R console to install the package.
You also have to install the dependent packages if any.
install.packages("party")
The package "party" has the function ctree() which is used to create and analyze decision trees.
Syntax
The basic syntax for creating a decision tree in R is -
ctree(formula, data)
Following is the description of the parameters used -
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
Input Data
We will use the R in-built data set named readingSkills to create a decision tree.
It gives a person's reading skills score along with the variables "age", "shoeSize" and whether the person is a native speaker or not.
Here is the sample data.
# Load the party package. It will automatically load other
# dependent packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))
When we execute the above code, it produces the following result and chart -
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
We will use the ctree() function to create the decision tree and see its graph.
# Load the party package. It will automatically load other
# dependent packages.
library(party)
# Create the input data frame.
input.dat <- readingSkills[c(1:105),]
# Give the chart file a name.
png(file = "decision_tree.png")
# Create the tree.
output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)
# Plot the tree.
plot(output.tree)
# Save the file.
dev.off()
When we execute the above code, it produces the following result -
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
Loading required package: sandwich
Conclusion
From the decision tree shown above we can conclude that anyone whose readingSkills score is less than 38.3 and whose age is more than 6 is not a native speaker.
R - Random Forest
In the random forest approach, a large number of decision trees are created.
Every observation is fed into every decision tree.
The most common outcome for each observation is used as the final output.
A new observation is fed into all the trees and a majority vote is taken over the individual classification results.
An error estimate is made for the cases which were not used while building the tree.
That is called an OOB (Out-of-bag) error estimate which is mentioned as a percentage.
The R package "randomForest" is used to create random forests.
Install R Package
Use the below command in R console to install the package.
You also have to install the dependent packages if any.
install.packages("randomForest")
The package "randomForest" has the function randomForest() which is used to create and analyze random forests.
Syntax
The basic syntax for creating a random forest in R is -
randomForest(formula, data)
Following is the description of the parameters used -
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
Input Data
We will use the R in-built data set named readingSkills to create a random forest.
It gives a person's reading skills score along with the variables "age", "shoeSize" and whether the person is a native speaker.
Here is the sample data.
# Load the party package. It will automatically load other
# required packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))
When we execute the above code, it produces the following result and chart -
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
We will use the randomForest() function to create the random forest and view its results.
# Load the party package. It will automatically load other
# required packages.
library(party)
library(randomForest)
# Create the forest.
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
data = readingSkills)
# View the forest results.
print(output.forest)
# Importance of each predictor.
print(importance(output.forest, type = 2))
When we execute the above code, it produces the following result -
Call:
randomForest(formula = nativeSpeaker ~ age + shoeSize + score,
data = readingSkills)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 1%
Confusion matrix:
no yes class.error
no 99 1 0.01
yes 1 99 0.01
MeanDecreaseGini
age 13.95406
shoeSize 18.91006
score 56.73051
Conclusion
From the random forest shown above we can conclude that shoe size and score are the important factors deciding if someone is a native speaker or not.
Also the model has only 1% error which means we can predict with 99% accuracy.
R - Survival Analysis
Survival analysis deals with predicting the time when a specific event is going to occur.
It is also known as failure time analysis or analysis of time to death.
For example, predicting the number of days a person with cancer will survive or predicting the time when a mechanical system is going to fail.
The R package named survival is used to carry out survival analysis.
This package contains the function Surv() which takes the input data as an R formula and creates a survival object among the chosen variables for analysis.
Then we use the function survfit() to create a plot for the analysis.
Install Package
install.packages("survival")
Syntax
The basic syntax for creating survival analysis in R is -
Surv(time,event)
survfit(formula)
Following is the description of the parameters used -
time is the follow up time until the event occurs.
event indicates the status of occurrence of the expected event.
formula is the relationship between the predictor variables.
Example
We will consider the data set named "pbc" present in the survival packages installed above.
It describes the survival data points about people affected with primary biliary cirrhosis (PBC) of the liver.
Among the many columns present in the data set we are primarily concerned with the fields "time" and "status".
Time represents the number of days between registration of the patient and the earlier of the patient receiving a liver transplant or the death of the patient.
# Load the library.
library("survival")
# Print first few rows.
print(head(pbc))
When we execute the above code, it produces the following result and chart -
id time status trt age sex ascites hepato spiders edema bili chol
1 1 400 2 1 58.76523 f 1 1 1 1.0 14.5 261
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302
3 3 1012 2 1 70.07255 m 0 0 0 0.5 1.4 176
4 4 1925 2 1 54.74059 f 0 1 1 0.5 1.8 244
5 5 1504 1 2 38.10541 f 0 1 1 0.0 3.4 279
6 6 2503 2 2 66.25873 f 0 1 0 0.0 0.8 248
albumin copper alk.phos ast trig platelet protime stage
1 2.60 156 1718.0 137.95 172 190 12.2 4
2 4.14 54 7394.8 113.52 88 221 10.6 3
3 3.48 210 516.0 96.10 55 151 12.0 4
4 2.54 64 6121.8 60.63 92 183 10.3 4
5 3.53 143 671.0 113.15 72 136 10.9 3
6 3.98 50 944.0 93.00 63 NA 11.0 3
From the above data we are considering time and status for our analysis.
Applying Surv() and survfit() Function
Now we proceed to apply the Surv() function to the above data set and create a plot that will show the trend.
# Load the library.
library("survival")
# Create the survival object.
survfit(Surv(pbc$time,pbc$status == 2)~1)
# Give the chart file a name.
png(file = "survival.png")
# Plot the graph.
plot(survfit(Surv(pbc$time,pbc$status == 2)~1))
# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart -
Call: survfit(formula = Surv(pbc$time, pbc$status == 2) ~ 1)
n events median 0.95LCL 0.95UCL
418 161 3395 3090 3853
The trend in the above graph helps us predict the probability of survival at the end of a certain number of days.
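The numbers printed by survfit() can also be read programmatically, for example the number of subjects and the median survival time (a brief sketch, assuming the survival package installed above):

```r
# Load the library.
library("survival")

# Fit the survival curve as above.
fit <- survfit(Surv(pbc$time, pbc$status == 2) ~ 1)

# Number of subjects and estimated median survival time in days.
print(fit$n)
print(summary(fit)$table["median"])
```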
R - Chi Square Test
Chi-Square test is a statistical method to determine if two categorical variables have a significant correlation between them.
Both those variables should be from the same population and they should be categorical like - Yes/No, Male/Female, Red/Green etc.
For example, we can build a data set with observations on people's ice-cream buying pattern and try to correlate the gender of a person with the flavor of the ice-cream they prefer.
If a correlation is found we can plan for an appropriate stock of flavors by knowing the gender distribution of the people visiting.
Syntax
The function used for performing chi-Square test is chisq.test().
The basic syntax for creating a chi-square test in R is -
chisq.test(data)
Following is the description of the parameters used -
data is the data in form of a table containing the count value of the variables in the observation.
Example
We will take the Cars93 data in the "MASS" library which represents the sales of different models of cars in the year 1993.
library("MASS")
print(str(Cars93))
When we execute the above code, it produces the following result -
'data.frame': 93 obs. of 27 variables:
$ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
$ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
$ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Width : int 68 71 67 70 69 69 74 78 73 73 ...
$ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
$ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
$ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
$ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
$ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
The above result shows the dataset has many Factor variables which can be considered as categorical variables.
For our model we will consider the variables "AirBags" and "Type".
Here we aim to find out any significant correlation between the types of car sold and the type of Air bags it has.
If correlation is observed we can estimate which types of cars can sell better with what types of air bags.
# Load the library.
library("MASS")
# Create a data frame from the main data set.
car.data <- data.frame(Cars93$AirBags, Cars93$Type)
# Create a table with the needed variables.
car.data = table(Cars93$AirBags, Cars93$Type)
print(car.data)
# Perform the Chi-Square test.
print(chisq.test(car.data))
When we execute the above code, it produces the following result -
Compact Large Midsize Small Sporty Van
Driver & Passenger 2 4 7 0 3 0
Driver only 9 7 11 5 8 3
None 5 0 4 16 3 6
Pearson's Chi-squared test
data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723
Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
Conclusion
The result shows a p-value of less than 0.05, which indicates a strong correlation between the two variables.
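The p-value can also be checked programmatically rather than read off the printout. Below is a minimal sketch using a small made-up 2x2 table; the counts are purely illustrative and are not taken from Cars93.

```r
# A contrived 2x2 table - counts are made up for illustration
m <- matrix(c(20, 5, 10, 25), nrow = 2)
test <- chisq.test(m)
# chisq.test() returns a list; the p-value is a named component
print(test$p.value)
if (test$p.value < 0.05) {
  print("Significant association at the 5% level")
}
```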
R - Interview Questions
Dear readers, these R Interview Questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview for the subject of R programming.
In my experience, good interviewers hardly plan to ask any particular question during your interview; normally questions start with some basic concept of the subject and later continue based on further discussion and what you answer -
R is a programming language meant for statistical analysis and creating graphs for this purpose. Instead of data types, it has data objects which are used for calculations.
It is used in the fields of data mining, Regression analysis, Probability estimation etc., using many packages available in it.
There are 6 data objects in R.
They are vectors, lists, arrays, matrices, data frames and tables.
A valid variable name consists of letters, numbers and the dot or underline characters.
The variable name must start with a letter, or with a dot that is not followed by a number.
A matrix is always two dimensional as it has only rows and columns.
But an array can be of any number of dimensions and each dimension is a matrix.
For example a 3x3x2 array represents 2 matrices each of dimension 3x3.
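The 3x3x2 example can be reproduced with the array() function; a quick sketch:

```r
# two 3x3 matrices stacked along a third dimension
a <- array(1:18, dim = c(3, 3, 2))
print(dim(a))    # 3 3 2
print(a[, , 1])  # the first 3x3 matrix
print(a[, , 2])  # the second 3x3 matrix
```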
The factor data objects in R are used to store and process categorical data.
A csv file can be loaded using the read.csv function.
R creates a data frame on reading the csv files using this function.
The command getwd() gives the current working directory in the R environment.
This is the package which is loaded by default when the R environment is set up.
It provides the basic functionalities like input/output, arithmetic calculations etc. in the R environment.
Logistic regression deals with measuring the probability of a binary response variable.
In R the function glm() is used to create the logistic regression.
The expression M[4,2] gives the element at 4th row and 2nd column.
When two vectors of different lengths are involved in an operation, the elements of the shorter vector are reused to complete the operation.
This is called element recycling.
Example - v1 <- c(4,1,0,6) and v2 <- c(2,4); then v1*v2 gives (8,4,0,24).
The elements 2 and 4 are repeated.
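The recycling example above runs as-is:

```r
v1 <- c(4, 1, 0, 6)
v2 <- c(2, 4)
# v2 is recycled to c(2, 4, 2, 4) to match the length of v1
print(v1 * v2)  # 8 4 0 24
```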
We can call a function in R in 3 ways.
First method is to call by using position of the arguments.
The second method is to call by using the names of the arguments, and the third method is to call with default arguments.
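A short sketch of the three calling styles, using a made-up function f for illustration:

```r
# f is a hypothetical function used only to demonstrate call styles
f <- function(a, b = 10) a - b
print(f(12, 5))          # by position of the arguments: 7
print(f(b = 5, a = 12))  # by name of the arguments: 7
print(f(12))             # by default argument (b falls back to 10): 2
```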
The lazy evaluation of a function means, the argument is evaluated only if it is used inside the body of the function.
If there is no reference to the argument in the body of the function then it is simply ignored.
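Lazy evaluation can be demonstrated with a small sketch: an argument that is never touched in the body causes no error even when it is missing.

```r
# y is declared but never used in the body, so it is never evaluated
f <- function(x, y) x * 2
print(f(5))  # works even though y was not supplied - returns 10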
To install a package in R we use the below command.
install.packages("package Name")
The package named "XML" is used to read and process the XML files.
We can update any of the elements in a list, but we can delete only the element at the end of the list.
The general expression to create a matrix in R is - matrix(data, nrow, ncol, byrow, dimnames)
The boxplot() function is used to create boxplots in R.
It takes a formula and a data frame as inputs to create the boxplots.
Frequency 6 indicates the time interval for the time series data is every 10 minutes of an hour.
In R the data objects can be converted from one form to another.
For example we can create a data frame by merging many lists.
This involves a series of R commands to bring the data into the new format.
This is called data reshaping.
It generates 4 random numbers between 0 and 1.
Use the command
installed.packages()
It splits the strings in vector x into substrings at the position of letter e.
x <- "The quick brown fox jumps over the lazy dog"
split.string <- strsplit(x, " ")
extract.words <- split.string[[1]]
result <- unique(tolower(extract.words))
print(result)
Error in v * x[1] : non-numeric argument to binary operator
[1] 5 12 21 32
It converts a list to a vector.
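The answer above presumably refers to the unlist() function; a quick sketch:

```r
x <- list(1, 2, 3)
v <- unlist(x)  # flattens the list into a numeric vector
print(v)        # 1 2 3
print(is.vector(v))
```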
x <- pbinom(26,51,0.5)
print(x)
NA
Using the function as.data.frame()
function(x) { x[is.na(x)] <- sum(x, na.rm = TRUE); x }
It is used to apply the same function to each of the elements in an array.
For example, finding the mean of every row.
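This presumably describes the apply() function; a minimal sketch of taking the mean of every row:

```r
m <- matrix(1:6, nrow = 2)  # rows are (1, 3, 5) and (2, 4, 6)
# MARGIN = 1 applies the function over rows
print(apply(m, 1, mean))    # 3 4
```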
Every matrix can be called an array but not the reverse.
Matrix is always two dimensional but array can be of any dimension.
?NA
sd(x, na.rm=TRUE)
setwd("Path")
"%%" gives remainder of the division of first vector with second while "%/%" gives the quotient of the division of first vector with second.
It finds the column that has the maximum value for each row.
hist()
rm(x)
data(package = "MASS")
data(package = .packages(all.available = TRUE))
It is used to install an R package from a local directory by browsing and selecting the file.
15 %in% x
pairs(formula, data)
Where formula represents the series of variables used in pairs and data represents the data set from which the variables will be taken.
The subset() function is used to select variables and observations.
The sample() function is used to choose a random sample of size n from a dataset.
is.matrix(m) should return TRUE.
[1] NA
The function t() is used for transposing a matrix.
Example - t(m) , where m is a matrix.
The "next" statement in R programming language is useful when we want to skip the current iteration of a loop without terminating it.
What is Next?
Further, you can go through the past assignments you have done with the subject and make sure you are able to speak confidently about them.
If you are a fresher, the interviewer does not expect you to answer very complex questions; rather, you have to make your basic concepts very strong.
Second, it really doesn't matter much if you could not answer a few questions, but it matters that whatever you answered, you answered with confidence.
So just feel confident during your interview.
We at tutorialspoint wish you the best of luck for a good interviewer and all the very best for your future endeavor.
Cheers :-)
The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions).
At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started.
Graphic User Interfaces
Aside from the built in R console, RStudio is the most popular R code editor, and it interfaces with R for Windows, MacOS, and Linux platforms.
R's binary and logical operators will look very familiar to programmers.
Note that binary operators work on vectors and matrices as well as scalars.
Arithmetic Operators include:
Use the assignment operator <- to create new variables.
# An example of computing a sum and a mean with variables
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2
Almost everything in R is done through functions.
A function is a piece of code written to carry out a specified task; it may accept arguments or parameters (or not) and it may return one or more values (or not!).
In R, a function is defined with the construct:
function ( arglist ) {body}
The code in between the curly braces is the body of the function.
Note that by using built-in functions, the only thing you need to worry about is how to effectively communicate the correct input arguments (arglist) and manage the return value/s (if any).
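As a sketch of the construct above, here is a tiny user-defined function with one argument in arglist and one return value:

```r
# square takes one argument and returns its square
square <- function(x) {
  x * x
}
print(square(4))  # 16
```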
Importing data into R is fairly simple.
R offers options to import many file types, from CSVs to databases.
For example, this is how to import a CSV into R.
# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on mswindows systems
mydata <- read.table("c:/mydata.csv", header=TRUE,
sep=",", row.names="id")
R provides a wide range of functions for obtaining summary statistics.
One way to get descriptive statistics is to use the sapply( ) function with a specified summary statistic.
Below is how to get the mean with the sapply( ) function:
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.
In R, graphs are typically created interactively.
Here is an example:
# Creating a Graph
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
The plot( ) function opens a graph window and plots weight vs. miles per gallon.
The next line of code adds a regression line to this graph.
The final line adds a title.
Packages
Packages are collections of R functions, data, and compiled code in a well-defined format.
The directory where packages are stored is called the library.
R comes with a standard set of packages.
Others are available for download and installation.
Once installed, they have to be loaded into the session to be used.
.libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded
Once R is installed, there is a comprehensive built-in help system.
At the program's command prompt you can use any of the following:
help.start() # general help
help(foo) # help about function foo
?foo # same thing
apropos("foo") # list all functions containing string foo
example(foo) # show an example of function foo
Going Further
If you prefer an online interactive environment to learn R, this free R tutorial by DataCamp is a great way to get started.
R is a dialect of the S language.
It is a case-sensitive, interpreted language.
You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file.
There is a wide variety of data types, including vectors (numerical, character, logical), matrices, data frames, and lists.
Most functionality is provided through built-in and user-created functions, and all data objects are kept in memory during an interactive session.
Basic functions are available by default.
Other functions are contained in packages that can be attached to a current session as needed.
R is a case-sensitive language.
FOO, Foo, and foo are three different objects!
This section describes working with the R interface.
A key skill to using R effectively is learning how to use the built-in help system.
Other sections describe the working environment, inputting programs and outputting results, installing new functionality through packages, GUIs that have been developed for R, customizing the environment, producing high quality output, and running programs in batch.
A fundamental design feature of R is that the output from most functions can be used as input to other functions.
This is described in reusing results.
Unlike SAS, which has DATA and PROC steps, R has data structures (vectors, matrices, arrays, data frames) that you can operate on through functions that perform statistical analyses and create graphs.
In this way, R is similar to PROC IML.
This section describes how to enter or import data into R, and how to prepare it for use in statistical analyses.
Topics include R data structures, importing data
(from Excel, SPSS, SAS, Stata, and ASCII Text Files), entering data from the keyboard, creating an
interface with a database management system, exporting data
(to Excel, SPSS, SAS, Stata, and Tab Delimited Text Files), annotating data (with variable
labels and value labels), and listing data.
In addition,
methods for handling missing values and date values are presented.
To Practice
Loading data into R is covered in the free first chapter of this interactive course: Introduction to Data.
Once you have access to your data, you will want to massage it into useful form.
This includes creating new variables (including recoding and renaming existing variables), sorting and merging datasets, aggregating data, reshaping data, and subsetting datasets (including selecting observations that meet criteria, randomly sampling observations, and dropping or keeping variables).
Each of these activities usually involves the use of R's built-in operators (arithmetic and logical) and functions (numeric, character, and statistical).
Additionally, you may need to use control structures (if-then, for, while, switch) in your programs and/or create your own functions.
Finally you may need to convert variables or datasets from one type to another (e.g.
numeric to character or matrix to data frame).
This section describes each task from an R perspective.
To Practice
To practice managing data in R, try the first chapter of this interactive course.
This section describes basic (and not so basic) statistics.
It includes code for obtaining descriptive statistics, frequency counts and crosstabulations (including tests of independence), correlations (pearson, spearman, kendall, polychoric), t-tests (with equal and unequal variances), nonparametric tests of group differences (Mann-Whitney U, Wilcoxon Signed Rank, Kruskal-Wallis Test, Friedman Test), multiple linear regression (including diagnostics, cross-validation and variable selection), analysis of variance (including ANCOVA and MANOVA), and statistics based on resampling.
Since modern data analyses almost always involve graphical assessments of relationships and assumptions, links to appropriate graphical methods are provided throughout.
It is always important to check model assumptions before making statistical inferences.
Although it is somewhat artificial to separate regression modeling and an ANOVA framework in this regard, many people learn these topics separately, so I've followed the same convention here.
Regression diagnostics cover outliers, influential observations, non-normality, non-constant error variance, multicollinearity, nonlinearity, and non-independence of errors.
Classical test assumptions for ANOVA/ANCOVA/MANCOVA include the assessment of normality and homogeneity of variances in the univariate case, and multivariate normality and homogeneity of covariance matrices in the multivariate case.
The identification of multivariate outliers is also considered.
Power analysis provides methods of statistical power analysis and sample size estimation for a variety of designs.
Finally, two functions that aid in efficient processing (with and by) are described.
More advanced statistical modeling can be found in the Advanced Statistics section.
Going Further
To practice statistics in R interactively, try this course on the introduction to statistics.
This section describes more advanced statistical methods.
This includes the discovery and exploration of complex multivariate relationships among variables.
Links to appropriate graphical methods are also provided throughout.
Basic statistics are described in the previous section.
It is difficult to order these topics in a straightforward way.
I have chosen the following (admittedly arbitrary) headings.
Cluster Analysis includes partitioning (k-means), hierarchical agglomerative, and model based approaches.
Tree-Based methods (which could easily have gone under predictive models!) include classification and regression trees, random forests, and other partitioning methodologies.
Other Tools
This section includes tools that are broadly useful including bootstrapping in R and matrix algebra programming (think MATRIX in SPSS or PROC IML in SAS).
Going Further
Try the Kaggle R Tutorial on Machine Learning which includes an exercise with Random Forests.
One of the main reasons data analysts turn to R is for its strong graphic capabilities.
Creating a Graph provides an overview of creating and saving graphs in R.
The remainder of the section describes how to create basic graph types.
These include density plots (histograms and kernel density plots), dot plots, bar charts (simple, stacked, grouped), line charts, pie charts (simple, annotated, 3D), boxplots (simple, notched, violin plots, bagplots) and Scatterplots (simple, with fit lines, scatterplot matrices, high density plots, and 3D plots).
The Advanced Graphs section describes how to customize and annotate graphs, and covers more statistically complex types of graphs.
To Practice
To practice the basics of plotting in R interactively, try this course from DataCamp.
This section describes how to customize your graphs.
It also covers more statistically sophisticated graphs.
This is one of the many places that R really shines.
Customization
Graphical parameters describes how to change a graph's symbols, fonts, colors, and lines.
Axes and text describe how to customize a graph's axes, add reference lines, text annotations and a legend.
Combining plots describes how to organize multiple plots into a single graph.
Advanced Graph Types
The lattice package provides a comprehensive system for visualizing multivariate data, including the ability to create plots conditioned on one or more variables.
The ggplot2 package offers an elegant system for generating univariate and multivariate graphs based on a grammar of graphics.
Other graph types include probability plots, mosaic plots, and correlograms.
Finally, methods of interacting with graphs (e.g. linking multiple graphs with color brushing, or interactive rotation in real-time) are provided.
For simpler, more fundamental graphs, see the Basic Graphs section.
I have been a hardcore SAS and SPSS programmer for more than 25 years, a Systat programmer for 15 years and a Stata programmer for 2 years.
But when I started learning R recently, I found it frustratingly difficult.
Why?
I think that there are two reasons why R can be challenging to learn quickly.
First, while there are many introductory tutorials (covering data types, basic commands, the interface), none alone are comprehensive.
In part, this is because much of the advanced functionality of R comes from hundreds of user contributed packages.
Hunting for what you want can be time consuming, and it can be hard to get a clear overview of what procedures are available.
The second reason is more ephemeral.
As users of statistical packages, we tend to run one prescribed procedure for each type of analysis.
Think of PROC GLM in SAS.
We can carefully set up the run with all the parameters and options that we need.
When we run the procedure, the resulting output may be a hundred pages long.
We then sift through this output pulling out what we need and discarding the rest.
The paradigm in R is different.
Rather than setting up a complete analysis at once, the process is highly interactive.
You run a command (say fit a model), take the results and process it through another command (say a set of diagnostic plots), take those results and process it through another command (say cross-validation), etc.
The cycle may include transforming the data, and looping back through the whole process again.
You stop when you feel that you have fully analyzed the data.
It may sound trite, but this reminds me of the paradigm shift from top-down procedural programming to object-oriented programming we saw a few years ago.
It is not an easy mental shift for many of us to make.
In the end, however, I believe that you will feel much more intimately in touch with your data and in control of your work.
And it's fun!
To Practice
This free interactive course covers the basics of R.
Commands are entered interactively at the R user prompt.
Up and down arrow keys scroll through your command history.
You will probably want to keep different projects in different physical directories.
Here are some standard commands for managing your workspace.
getwd() # print the current working directory - cwd
ls() # list the objects in the current workspace
setwd(mydirectory) # change to mydirectory
setwd("c:/docs/mydir") # note / instead of \ in windows
setwd("/usr/rob/mydir") # on linux
# view and set options for the session
help(options) # learn about available options
options() # view current option settings
options(digits=3) # number of digits to print on output
# work with your previous commands
history() # display last 25 commands
history(max.show=Inf) # display all previous commands
# save your command history
savehistory(file="myfile") # default is ".Rhistory"
# recall your command history
loadhistory(file="myfile") # default is ".Rhistory"
# save the workspace to the file .RData in the cwd
save.image()
# save specific objects to a file
# if you don't specify the path, the cwd is assumed
save(object list, file="myfile.RData")
# load a workspace into the current session
# if you don't specify the path, the cwd is assumed
load("myfile.RData")
q() # quit R.
You will be prompted to save the workspace.
Important Note to Windows Users:
R gets confused if you use a path in your code like:
c:\mydocuments\myfile.txt
This is because R sees "\" as an escape character.
Instead, use:
c:\\mydocuments\\myfile.txt
c:/mydocuments/myfile.txt
Either will work.
I use the second convention throughout this website.
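A small sketch of why the doubling is needed: inside an R string literal, "\\" denotes a single backslash character.

```r
p <- "c:\\mydocuments\\myfile.txt"
cat(p, "\n")        # prints c:\mydocuments\myfile.txt
print(nchar("\\"))  # 1 - the escaped sequence is a single character
```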
To Practice
This free intro to R course will get you familiar with the R workspace.
R is a command line driven program.
The user enters commands at the prompt ( > by default ) and each command is executed one at a time.
There have been a number of attempts to create a more graphical interface, ranging from code editors that interact with R, to full-blown GUIs that present the user with menus and dialog boxes.
RStudio is my favorite example of a code editor that interfaces with R for Windows, MacOS, and Linux platforms.
Perhaps the most stable, full-blown GUI is R Commander, which can also run under Windows, Linux, and MacOS (see the documentation for technical requirements).
Both of these programs can make R a lot easier to use.
To Practice
This interactive course gives an overview of installing and working with RStudio.
Arithmetic Operators
Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
^ or **     exponentiation
x %% y      modulus (x mod y): 5%%2 is 1
x %/% y     integer division: 5%/%2 is 2
Logical Operators
Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10
# How it works
x <- c(1:10)
x
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
x < 5
T T T T F F F F F F
x > 8 | x < 5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector
Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
  dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows.
byrow=FALSE indicates that the matrix should be filled by columns (the default).
dimnames provides optional labels for the columns and rows.
# generates 5 x 4 numeric matrix
y <- matrix(1:20, nrow=5, ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
dimnames=list(rnames, cnames))
Identify rows, columns or elements using subscripts.
x[,4] # 4th column of matrix
x[3,] # 3rd row of matrix
x[2:4,1:3] # rows 2,3,4 of columns 1,2,3
Arrays
Arrays are similar to matrices but can have more than two dimensions.
See help(array) for details.
Data Frames
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
This is similar to SAS and SPSS datasets.
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names
There are a variety of ways to identify the elements of a data frame.
myframe[3:5] # columns 3,4,5 of data frame
myframe[c("ID","Age")] # columns ID and Age from data frame
myframe$x1 # variable x1 in the data frame
Lists
An ordered collection of objects (components).
A list allows you to gather a variety of (possibly unrelated) objects under one name.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of a list containing two lists
v <- c(list1,list2)
Identify elements of a list using the [[]] convention.
mylist[[2]] # 2nd component of the list
mylist[["mynumbers"]] # component named mynumbers in list
Factors
Tell R that a variable is nominal by making it a factor.
The factor stores the nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
An ordered factor is used to represent an ordinal variable.
# variable rating coded as "large", "medium", "small"
rating <- ordered(rating)
# recodes rating to 1,2,3 and associates
# 1=large, 2=medium, 3=small internally
# R now treats rating as ordinal
R will treat factors as nominal variables and ordered factors as ordinal variables in statistical procedures and graphical analyses.
You can use options in the factor( ) and ordered( ) functions to control the mapping of integers to strings (overriding the alphabetical ordering).
You can also use factors to create value labels.
For more on factors see the UCLA page.
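A sketch of overriding the default alphabetical ordering with an explicit levels argument:

```r
# levels= fixes the integer mapping: small=1, medium=2, large=3
rating <- factor(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"),
                 ordered = TRUE)
print(as.integer(rating))  # 1 3 2
print(levels(rating))      # "small" "medium" "large"
```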
Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
object # prints the object
ls() # list current objects
rm(object) # delete an object
newobject <- edit(object) # edit copy and save as newobject
fix(object) # edit in place
To Practice
To explore data types in R, try this free interactive introduction to R course
Use the assignment operator <- to create new variables.
A wide array of operators and functions are available here.
# Three examples for doing the same computations
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2
attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)
mydata <- transform( mydata,
sum = x1 + x2,
mean = (x1 + x2)/2
)
(To practice working with variables in R, try the first chapter of this free interactive course.)
Recoding variables
In order to recode data, you will probably use one or more of R's control structures.
# create 2 age categories
mydata$agecat <- ifelse(mydata$age > 70,
c("older"), c("younger"))
# another example: create 3 age categories
attach(mydata)
mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"
detach(mydata)
Renaming variables
You can rename variables programmatically or interactively.
# rename interactively
fix(mydata) # results are saved on close
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
# you can re-enter all the variable names in order,
# changing the ones you need to change. The limitation
# is that you need to enter all of them!
names(mydata) <- c("x1","age","y", "ses")
Almost everything in R is done through functions.
Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables.
(To practice working with functions, try the functions sections of this interactive course.)
Numeric Functions
Function                Description
abs(x)                  absolute value
sqrt(x)                 square root
ceiling(x)              ceiling(3.475) is 4
floor(x)                floor(3.475) is 3
trunc(x)                trunc(5.99) is 5
round(x, digits=n)      round(3.475, digits=2) is 3.48
signif(x, digits=n)     signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)  also acos(x), cosh(x), acosh(x), etc.
log(x)                  natural logarithm
log10(x)                common logarithm
exp(x)                  e^x
Character Functions
Function    Description
substr(x, start=n1, stop=n2)
    Extract or replace substrings in a character vector.
    x <- "abcdef"
    substr(x, 2, 4) is "bcd"
    substr(x, 2, 4) <- "22222" is "a222ef"
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
    Search for pattern in x.
    If fixed=FALSE then pattern is a regular expression.
    If fixed=TRUE then pattern is a text string.
    Returns matching indices.
    grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
    Find pattern in x and replace with replacement text.
    If fixed=FALSE then pattern is a regular expression.
    If fixed=TRUE then pattern is a text string.
    sub("\\s",".","Hello There") returns "Hello.There"
strsplit(x, split)
    Split the elements of character vector x at split.
    strsplit("abc", "") returns a 3 element vector "a","b","c"
paste(..., sep="")
    Concatenate strings after using sep string to separate them.
    paste("x",1:3,sep="") returns c("x1","x2","x3")
    paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
    paste("Today is", date())
toupper(x)  Uppercase
tolower(x)  Lowercase
Statistical Probability Functions
The following table describes functions related to probability distributions.
For random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.
Function    Description
dnorm(x)
    normal density function (by default m=0 sd=1)
    # plot standard normal curve
    x <- pretty(c(-3,3), 30)
    y <- dnorm(x)
    plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)
    cumulative normal probability for q
    (area under the normal curve to the left of q)
    pnorm(1.96) is 0.975
qnorm(p)
    normal quantile.
    value at the p percentile of normal distribution
    qnorm(.9) is 1.28 # 90th percentile
rnorm(n, m=0, sd=1)
    n random normal deviates with mean m
    and standard deviation sd.
    # 50 random normal variates with mean=50, sd=10
    x <- rnorm(50, m=50, sd=10)
dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
    binomial distribution where size is the sample size
    and prob is the probability of a heads (pi)
    # prob of 0 to 5 heads of fair coin out of 10 flips
    dbinom(0:5, 10, .5)
    # prob of 5 or less heads of fair coin out of 10 flips
    pbinom(5, 10, .5)
dpois(x, lambda)
ppois(q, lambda)
qpois(p, lambda)
rpois(n, lambda)
    poisson distribution with mean=variance=lambda
    # probability of 0, 1, or 2 events with lambda=4
    dpois(0:2, 4)
    # probability of at least 3 events with lambda=4
    1 - ppois(2, 4)
dunif(x, min=0, max=1)
punif(q, min=0, max=1)
qunif(p, min=0, max=1)
runif(n, min=0, max=1)
    uniform distribution, follows the same pattern
    as the normal distribution above.
    # 10 uniform random variates
    x <- runif(10)
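The set.seed() point made above can be sketched quickly: resetting the seed replays the same pseudo-random sequence.

```r
set.seed(1234)  # any integer works; 1234 is just an example
a <- runif(3)
set.seed(1234)  # resetting the seed replays the same sequence
b <- runif(3)
print(identical(a, b))  # TRUE
```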
Other Statistical Functions
Other useful statistical functions are provided in the following table.
Each has the option na.rm to strip missing values before calculations.
Otherwise the presence of missing values will lead to a missing result.
Object can be a numeric vector or data frame.
Function
Description
mean(x, trim=0,
na.rm=FALSE)
mean of object x
# trimmed mean, removing any missing values and
# 5 percent of highest and lowest scores
mx <- mean(x,trim=.05,na.rm=TRUE)
sd(x)
standard deviation of object(x).
also look at var(x) for variance and mad(x) for median absolute deviation.
median(x)
median
quantile(x, probs)
quantiles where x is the numeric vector whose quantiles are desired and probs is a numeric vector with probabilities in [0,1].
# 30th and 84th percentiles of x
y <- quantile(x, c(.3,.84))
range(x)
range
sum(x)
sum
diff(x, lag=1)
lagged differences, with lag indicating which lag to use
min(x)
minimum
max(x)
maximum
scale(x, center=TRUE, scale=TRUE)
column center or standardize a matrix.
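For example, scale() can standardize the columns of a matrix (a minimal sketch; the values are invented for illustration):

```r
# standardize the columns of a small matrix
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), nrow = 4)
z <- scale(m)                    # center=TRUE, scale=TRUE by default
round(colMeans(z), 10)           # each column now has mean 0
apply(z, 2, sd)                  # and standard deviation 1
# center only, leaving the spread unchanged
centered <- scale(m, center = TRUE, scale = FALSE)
```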
Other Useful Functions
Function
Description
seq(from , to, by)
generate a sequence
indices <- seq(1,10,2)
#indices is c(1, 3, 5, 7, 9)
rep(x, ntimes)
repeat x n times
y <- rep(1:3, 2)
# y is c(1, 2, 3, 1, 2, 3)
cut(x, n)
divide a continuous variable into a factor with n levels
y <- cut(x, 5)
Note that while the examples on this page apply functions to individual variables, many can be applied to vectors and matrices as well.
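For instance, apply() runs any of these summary functions over the rows or columns of a matrix (a small sketch):

```r
# apply a summary function across the margins of a matrix
m <- matrix(1:12, nrow = 3)   # 3 rows, 4 columns, filled by column
apply(m, 2, mean)             # column means: 2 5 8 11
apply(m, 1, sum)              # row sums: 22 26 30
sapply(as.data.frame(m), sd)  # per-column sd of the same data
```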
Importing data into R is fairly simple.
For Stata and Systat, use the foreign package.
For SPSS and SAS I would recommend the Hmisc package for ease and functionality.
See the Quick-R section on packages for information on obtaining and installing these packages.
Examples of importing data are provided below.
From A Comma Delimited Text File
# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on MS Windows systems
mydata <- read.table("c:/mydata.csv", header=TRUE,
sep=",", row.names="id")
(To practice importing a csv file, try this exercise.)
From Excel
One of the best ways to read an Excel file is to export it to a comma delimited file and import it using the method above.
Alternatively you can use the xlsx package to access Excel files.
The first row should contain variable/column names.
# read in the first worksheet from the workbook myexcel.xlsx
# first row contains variable names
library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", 1)
# read in the worksheet named mysheet
mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")
(To practice, try this exercise on importing an Excel worksheet into R.)
From SPSS
# save SPSS dataset in transport format
get file='c:\mydata.sav'.
export outfile='c:\mydata.por'.
# in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors
(To practice importing SPSS data with the foreign package, try this exercise.)
From SAS
# save SAS dataset in transport format
libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;
# in R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
# character variables are converted to R factors
From Stata
# input Stata file
library(foreign)
mydata <- read.dta("c:/mydata.dta")
(To practice importing Stata data with the foreign package, try this exercise.)
Try this interactive course: Importing Data in R (Part 1), to work with csv and xlsx files in R.
To work with SAS, Stata, and other formats try Part 2.
R provides a wide range of functions for obtaining summary statistics.
One method of obtaining descriptive statistics is to use the sapply( ) function with a specified summary statistic.
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.
There are also numerous R functions designed to provide a range of descriptive statistics at once.
For example
# mean,median,25th and 75th quartiles,min,max
summary(mydata)
# Tukey min,lower-hinge, median,upper-hinge,max
fivenum(x)
Using the Hmisc package
library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores
Using the pastecs package
library(pastecs)
stat.desc(mydata)
# nbr.val, nbr.null, nbr.na, min, max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, coef.var
Using the psych package
library(psych)
describe(mydata)
# item name, item number, nvalid, mean, sd,
# median, mad, min, max, skew, kurtosis, se
Summary Statistics by Group
A simple way of generating summary statistics by grouping variable is available in the psych package.
library(psych)
describe.by(mydata, group,...)
The doBy package provides much of the functionality of SAS PROC SUMMARY.
It defines the desired table using a model formula and a function.
Here is a simple example.
library(doBy)
summaryBy(mpg + wt ~ cyl + vs, data = mtcars,
FUN = function(x) {
c(m = mean(x), s = sd(x))
} )
# produces mpg.m wt.m mpg.s wt.s for each
# combination of the levels of cyl and vs
See also: aggregating data.
To Practice
Want to practice interactively? Try this free course on statistics and R
In R, graphs are typically created interactively.
# Creating a Graph
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
The plot( ) function opens a graph window and plots weight vs. miles per gallon.
The next line of code adds a regression line to this graph.
The final line adds a title.
Saving Graphs
You can save the graph in a variety of formats from the menu
File -> Save As.
You can also save the graph via code using one of the following functions.
Creating a new graph by issuing a high level plotting command (plot, hist, boxplot, etc.) will typically overwrite a previous graph.
To avoid this, open a new graph window before creating a new graph.
To open a new graph window use one of the functions below.
Function
Platform
windows()
Windows
X11()
Unix
quartz()
Mac
You can have multiple graph windows open at one time.
See help(dev.cur) for more details.
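A sketch of managing devices from code (dev.new() is the portable way to open an additional device; in a non-interactive session it may open a file device rather than a window):

```r
dev.new()            # open a first device
plot(1:10)
dev.new()            # open a second device; it becomes active
plot(10:1)
dev.cur()            # reports the active device
dev.set(dev.prev())  # switch back to the first device
dev.off()            # close the active device
```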
Alternatively, after opening the first graph window, choose History -> Recording from the graph window menu.
Then you can use Previous and Next to step through the graphs you have created.
Try the creating graph exercises in this course on data visualization in R.
Packages are collections of R functions, data, and compiled code in a well-defined format.
The directory where packages are stored is called the library.
R comes with a standard set of packages.
Others are available for download and installation.
Once installed, they have to be loaded into the session to be used.
.libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded
Adding Packages
You can expand the types of analyses you do by adding other packages.
A complete list of contributed packages is available from CRAN.
Follow these steps:
Download and install a package (you only need to do this once). To use the package, invoke the library(package) command to load it into the current session.
(You need to do this once in each session, unless you customize your environment to automatically load it each time.)
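In code, the two steps look like this (the package name boot is just the example used below):

```r
# one-time install from CRAN
install.packages("boot")
# load it in each session where it is needed
library(boot)
```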
On MS Windows:
Choose Install Packages from the Packages menu.
Select a CRAN Mirror (e.g. Norway).
Select a package (e.g. boot).
Then use the library(package) function to load it for use (e.g. library(boot)).
On Linux:
Download the package of interest as a compressed file.
At the command prompt, install it using
R CMD INSTALL [options] [-l lib] pkgs
Use the library(package) function within R to load it for use in the session.
This free interactive course covers the basics of R.
Once R is installed, there is a comprehensive built-in help system.
At the program's command prompt you can use any of the following:
help.start() # general help
help(foo) # help about function foo
?foo # same thing
apropos("foo")
# list all functions containing string foo
example(foo) # show an example of function foo
# search for foo in help manuals and archived mailing lists
RSiteSearch("foo")
# get vignettes on using installed packages
vignette() # show available vignettes
vignette("foo") # show specific vignette
Sample Datasets
R comes with a number of sample datasets that you can experiment with.
Type data( ) to see the available datasets.
The results will depend on which packages you have loaded.
Type help(datasetname) for details on a sample dataset.
To Practice
This free interactive course covers the basics of R.
The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions).
At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started.
Commands are entered interactively at the R user prompt.
Up and down arrow keys scroll through your command history.
You will probably want to keep different projects in different physical directories.
Here are some standard commands for managing your workspace.
getwd() # print the current working directory - cwd
ls() # list the objects in the current workspace
setwd(mydirectory) # change to mydirectory
setwd("c:/docs/mydir") # note / instead of \ in windows
setwd("/usr/rob/mydir") # on linux
# view and set options for the session
help(options) # learn about available options
options() # view current option settings
options(digits=3) # number of digits to print on output
# work with your previous commands
history() # display last 25 commands
history(max.show=Inf) # display all previous commands
# save your command history
savehistory(file="myfile") # default is ".Rhistory"
# recall your command history
loadhistory(file="myfile") # default is ".Rhistory"
# save the workspace to the file .RData in the cwd
save.image()
# save specific objects to a file
# if you don't specify the path, the cwd is assumed
save(objectlist, file="myfile.RData")
# load a workspace into the current session
# if you don't specify the path, the cwd is assumed
load("myfile.RData")
q() # quit R.
You will be prompted to save the workspace.
Important Note to Windows Users:
R gets confused if you use a path in your code like:
c:\mydocuments\myfile.txt
This is because R sees "\" as an escape character.
Instead, use:
c:\\mydocuments\\myfile.txt
c:/mydocuments/myfile.txt
Either will work.
I use the second convention throughout this website.
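A quick sketch showing that the doubled backslashes are stored as single characters:

```r
p1 <- "c:\\mydocuments\\myfile.txt"   # escaped backslashes
p2 <- "c:/mydocuments/myfile.txt"     # forward slashes
cat(p1, "\n")   # prints c:\mydocuments\myfile.txt
cat(p2, "\n")   # prints c:/mydocuments/myfile.txt
nchar(p1) == nchar(p2)   # TRUE: each \\ is one stored character
```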
To Practice
This free intro to R course will get you familiar with the R workspace.
By default, launching R starts an interactive session with input from the keyboard and output to the screen.
However, you can have input come from a script file (a file containing R commands) and direct output to a variety of destinations.
Input
The source( ) function runs a script in the current session.
If the filename does not include a path, the file is taken from the current working directory.
# input a script
source("myfile")
Output
The sink( ) function defines the direction of the output.
# direct output to a file
sink("myfile", append=FALSE, split=FALSE)
# return output to the terminal
sink()
The append option controls whether output overwrites or adds to a file.
The split option determines if output is also sent to the screen as well as the output file.
Here are some examples of the sink() function.
# output directed to output.txt in c:\projects directory.
# output overwrites existing file. no output to terminal.
sink("c:/projects/output.txt")
# output directed to myfile.txt in cwd. output is appended
# to existing file. output also sent to terminal.
sink("myfile.txt", append=TRUE, split=TRUE)
When redirecting output, use the cat( ) function to annotate the output.
Graphs
sink( ) will not redirect graphic output.
To redirect graphic output use one of the following functions.
Use dev.off( ) to return output to the terminal.
Function
Output to
pdf("mygraph.pdf")
pdf file
win.metafile("mygraph.wmf")
windows metafile
png("mygraph.png")
png file
jpeg("mygraph.jpg")
jpeg file
bmp("mygraph.bmp")
bmp file
postscript("mygraph.ps")
postscript file
Use a full path in the file name to save the graph outside of the current working directory.
# example - output graph to jpeg file
jpeg("c:/mygraphs/myplot.jpg")
plot(x)
dev.off()
To Practice
To start running scripts in R, try this free interactive introduction to R course.
R is a command line driven program.
The user enters commands at the prompt ( > by default ) and each command is executed one at a time.
There have been a number of attempts to create a more graphical interface, ranging from code editors that interact with R, to full-blown GUIs that present the user with menus and dialog boxes.
RStudio is my favorite example of a code editor that interfaces with R for Windows, MacOS, and Linux platforms.
Perhaps the most stable, full-blown GUI is R Commander, which can also run under Windows, Linux, and MacOS (see the documentation for technical requirements).
Both of these programs can make R a lot easier to use.
To Practice
This interactive course gives an overview of installing and working with RStudio.
You can customize the R environment through a site initialization file or a directory initialization file.
R will always source the Rprofile.site file first.
On Windows, the file is in the C:\Program Files\R\R-n.n.n\etc directory.
You can also place a .Rprofile file in any directory that you are going to run R from or in the user home directory.
At startup, R will source the Rprofile.site file.
It will then look for a .Rprofile file to source in the current working directory.
If it doesn't find it, it will look for one in the user's home directory.
There are two special functions you can place in these files.
.First( ) will be run at the start of the R session and .Last( ) will be run at the end of the session.
# Sample Rprofile.site file
# Things you might want to change
# options(papersize="a4")
# options(editor="notepad")
# options(pager="internal")
# R interactive prompt
# options(prompt="> ")
# options(continue="+ ")
# to prefer Compiled HTML help
options(chmhelp=TRUE)
# to prefer HTML help
# options(htmlhelp=TRUE)
# General options
options(tab.width = 2)
options(width = 130)
options(graphics.record=TRUE)
.First <- function(){
library(Hmisc)
library(R2HTML)
cat("\nWelcome at", date(), "\n")
}
.Last <- function(){
cat("\nGoodbye at ", date(), "\n")
}
Going Further
To explore customizing the RStudio interface, try this RStudio course which is taught by Garrett Grolemund, data scientist for RStudio.
Compared with SAS and SPSS, R's ability to output results for publication quality reports is somewhat rudimentary (although this is evolving).
The R2HTML package lets you output text, tables, and graphs in HTML format.
Here is a sample session, followed by an explanation.
# Sample Session
library(R2HTML)
HTMLStart(outdir="c:/mydir", file="myreport",
extension="html", echo=FALSE, HTMLframe=TRUE)
HTML.title("My Report", HR=1)
HTML.title("Description of my data", HR=3)
summary(mydata)
HTMLhr()
HTML.title("X Y Scatter Plot", HR=2)
plot(mydata$y~mydata$x)
HTMLplot()
HTMLStop()
Once you invoke HTMLStart( ), the prompt will change to HTML> until you end with HTMLStop().
The echo=TRUE option copies commands to the same file as the output.
HTMLframe=TRUE creates framed output, with commands in the left frame, linked to output in the right frame.
By default, a CSS file named R2HTML.css controlling page look and feel is output to the same directory.
Optionally, you can include a CSSFile= option to use your own formatting file.
Use HTML.title() to annotate the output.
The HR option refers to HTML title types (H1, H2, H3, etc.).
The default is HR=2.
HTMLhr() creates a horizontal rule.
Since several interactive commands may be necessary to create a finished graph, invoke the HTMLplot() function when each graph is ready to output.
The RNews article The R2HTML Package has more complex examples using titles, annotations, header and footer files, and cascading style sheets.
Other Options
The R Markdown Package from R Studio supports dozens of static and dynamic output formats including HTML, PDF, MS Word, scientific articles, websites, and more.
(To practice R Markdown, try this tutorial taught by Garrett Grolemund, Data Scientist for R Studio.)
Sweave allows you to embed R code in LaTeX, producing attractive reports if you know that markup language.
The odfWeave package has functions that allow you to embed R output in Open Document Format (ODF) files.
These are the types of files created by OpenOffice software.
The SWordInstaller package allows you to add R output to Microsoft Word documents.
The R2PPT package provides wrappers for adding R output to Microsoft PowerPoint presentations.
You can run R non-interactively with input from infile and send output (stdout/stderr) to another file.
Here are examples.
# on Linux
R CMD BATCH [options] my_script.R [outfile]
# on Microsoft Windows (adjust the path to R.exe as needed)
"C:\Program Files\R\R-2.13.1\bin\R.exe" CMD BATCH
--vanilla --slave "c:\my projects\my_script.R"
Be sure to look at the section on I/O for help writing R scripts.
See an Introduction to R (Appendix B) for information on the command line options.
To Practice
To start running scripts in R, try this free interactive introduction to R course.
In SAS, you can save the results of statistical analyses using the Output Delivery System (ODS).
While ODS is a vast improvement over PROC PRINTTO, its sophistication can make some features very hard to learn (just try mastering PROC TEMPLATE).
In SPSS you can do the same thing with the Output Management System (OMS).
Again, not one of the easiest topics to learn.
One of the most useful design features of R is that the output of analyses can easily be saved and used as input to additional analyses.
# Example 1
lm(mpg~wt, data=mtcars)
This will run a simple linear regression of miles per gallon on car weight using the data frame mtcars.
Results are sent to the screen.
Nothing is saved.
# Example 2
fit <- lm(mpg~wt, data=mtcars)
This time, the same regression is performed but the results are saved under the name fit.
No output is sent to the screen.
However, you now can manipulate the results.
# Example 2 (continued...)
str(fit) # view the contents/structure of "fit"
The assignment has actually created a list called "fit" that contains a wide range of information (including the predicted values, residuals, coefficients, and more).
# Example 2 (continued again)
# plot residuals by fitted values
plot(fit$fitted.values, fit$residuals)
To see what a function returns, look at the value section of the online help for that function.
Here we would look at help(lm).
The results can also be used by a wide range of other functions.
# Example 2 (one last time, I promise)
# produce diagnostic plots
plot(fit)
# predict mpg from wt in a new set of data
predict(fit, mynewdata)
# get and save influence statistics
cook <- cooks.distance(fit)
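Generic extractor functions offer a cleaner route to the same components than indexing into the list directly (a short sketch with the same mtcars fit):

```r
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                # intercept and slope
head(residuals(fit))     # same values as fit$residuals
head(fitted(fit))        # fitted values
summary(fit)$r.squared   # R-squared from the summary object
```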
To Practice
To practice reusing results in variables, try this interactive course on the introduction to R programming from DataCamp.
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.
Vectors
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector
Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows.
byrow=FALSE indicates that the matrix should be filled by columns (the default).
dimnames provides optional labels for the columns and rows.
# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
dimnames=list(rnames, cnames))
Identify rows, columns or elements using subscripts.
y[,4] # 4th column of matrix
y[3,] # 3rd row of matrix
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
Arrays
Arrays are similar to matrices but can have more than two dimensions.
See help(array) for details.
Data Frames
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
This is similar to SAS and SPSS datasets.
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names
There are a variety of ways to identify the elements of a data frame.
myframe[3:5] # columns 3,4,5 of data frame
myframe[c("ID","Age")] # columns ID and Age from data frame
myframe$X1 # variable X1 in the data frame
Lists
An ordered collection of objects (components).
A list allows you to gather a variety of (possibly unrelated) objects under one name.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of a list containing two lists
v <- c(list1,list2)
Identify elements of a list using the [[]] convention.
mylist[[2]] # 2nd component of the list
mylist[["mynumbers"]] # component named mynumbers in list
Factors
Tell R that a variable is nominal by making it a factor.
The factor stores the nominal values as a vector of integers in the range [1...k]
(where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
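The internal coding can be checked directly (a quick sketch):

```r
gender <- factor(c(rep("male", 20), rep("female", 30)))
levels(gender)          # "female" "male" (alphabetical order)
as.integer(gender)[1]   # first entry, "male", is stored as 2
table(gender)           # counts per level
```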
An ordered factor is used to represent an ordinal variable.
# variable rating coded as "large", "medium", "small"
rating <- ordered(rating)
# recodes rating to 1,2,3 and associates
# 1=large, 2=medium, 3=small internally
# R now treats rating as ordinal
R will treat factors as nominal variables and ordered factors as ordinal variables in statistical procedures and graphical analyses.
You can use options in the factor( ) and ordered( ) functions to control the mapping of integers to strings (overriding the alphabetical ordering).
You can also use factors to create value labels.
For more on factors see the UCLA page.
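For example, the levels argument overrides the default alphabetical ordering (a minimal sketch with invented category names):

```r
sizes <- c("small", "large", "medium", "small")
# force small < medium < large instead of alphabetical order
f <- factor(sizes, levels = c("small", "medium", "large"))
levels(f)   # "small" "medium" "large"
# ordered() takes the same argument and adds the ordering
o <- ordered(sizes, levels = c("small", "medium", "large"))
o[1] < o[2]   # TRUE: small < large
```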
Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
object # prints the object
ls() # list current objects
rm(object) # delete an object
newobject <- edit(object) # edit copy and save as newobject
fix(object) # edit in place
To Practice
To explore data types in R, try this free interactive introduction to R course
Usually you will obtain a data frame by importing it from SAS, SPSS, Excel, Stata, a database, or an ASCII file.
To create it interactively, you can do something like the following.
# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age,gender,weight)
You can also use R's built-in spreadsheet to enter the data interactively, as in the following example.
# enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!
The RODBC package provides access to databases (including Microsoft Access and Microsoft SQL Server) through an ODBC interface.
The primary functions are given below.
Function
Description
odbcConnect(dsn, uid="", pwd="")
Open a connection to an ODBC database
sqlFetch(channel, sqtable)
Read a table from an ODBC database into a data frame
sqlQuery(channel, query)
Submit a query to an ODBC database and return the results
sqlSave(channel, mydf, tablename = sqtable, append = FALSE)
Write or update (append=TRUE) a data frame to a table in the ODBC database
sqlDrop(channel, sqtable)
Remove a table from the ODBC database
close(channel)
Close the connection
# RODBC Example
# import 2 tables (Crime and Punishment) from a DBMS
# into R data frames (and call them crimedat and pundat)
library(RODBC)
myconn <-odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)
Other Interfaces
The RMySQL package provides an interface to MySQL.
The ROracle package provides an interface for Oracle.
The RJDBC package provides access to databases through a JDBC interface.
Going Further
This tutorial at DataCamp has another example with the RODBC package.
There are numerous methods for exporting R objects into other formats.
For SPSS, SAS and Stata, you will need the foreign package.
For Excel, you will need the xlsReadWrite package.
To SPSS
# write out text datafile and
# an SPSS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
(Alternatively, to practice importing SPSS data with the foreign package, try this exercise.)
To SAS
# write out text datafile and
# a SAS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")
To Stata
# export data frame to Stata binary format
library(foreign)
write.dta(mydata, "c:/mydata.dta")
(Alternatively, to practice importing Stata data with the foreign package, try this exercise.)
There are a number of functions for listing the contents of an object or dataset.
# list objects in the working environment
ls()
# list the variables in mydata
names(mydata)
# list the structure of mydata
str(mydata)
# list levels of factor v1 in mydata
levels(mydata$v1)
# dimensions of an object
dim(object)
# class of an object (numeric, matrix, data frame, etc)
class(object)
# print mydata
mydata
# print first 10 rows of mydata
head(mydata, n=10)
# print last 5 rows of mydata
tail(mydata, n=5)
To Practice
Try the free first chapter of this course on cleaning data.
R's ability to handle variable labels is somewhat unsatisfying.
If you use the Hmisc package, you can take advantage of some labeling features.
library(Hmisc)
label(mydata$myvar) <- "Variable label for variable myvar"
describe(mydata)
Unfortunately the label is only in effect for functions provided by the Hmisc package, such as describe().
Your other option is to use the variable label as the variable name and then refer to the variable by position index.
names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable
To Practice
Want to practice more? Try this exercise on variable recoding from DataCamp.
To understand value labels in R, you need to understand the data structure factor.
You can use the factor function to create your own value labels.
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1,
levels = c(1,2,3),
labels = c("red", "blue", "green"))
# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
mydata$y <- ordered(mydata$y,
levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))
Use the factor() function for nominal data and the ordered() function for ordinal data.
R statistical and graphic functions will then treat the data appropriately.
Note: factor and ordered are used the same way, with the same arguments.
The former creates factors and the latter creates ordered factors.
To Practice
Factors are covered in the fourth chapter of this free interactive introduction to R course.
In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number).
Unlike SAS, R uses the same symbol for character and numeric data.
For more practice on working with missing data, try this course on cleaning data in R.
Testing for Missing Values
is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
Recoding Values to Missing
# recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata$v1[mydata$v1==99] <- NA
Excluding Missing Values from Analyses
Arithmetic functions on missing values yield missing values.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
The function complete.cases() returns a logical vector indicating which cases are complete.
# list rows of data that have missing values
mydata[!complete.cases(mydata),]
The function na.omit() returns the object with listwise deletion of missing values.
# create new dataset without missing data
newdata <- na.omit(mydata)
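The pieces above can be combined on a small invented data frame:

```r
# toy data frame with two incomplete rows
mydf <- data.frame(x = c(1, 2, NA, 4),
                   y = c("a", NA, "c", "d"))
is.na(mydf$x)                  # FALSE FALSE  TRUE FALSE
mean(mydf$x, na.rm = TRUE)     # 2.333333
mydf[!complete.cases(mydf), ]  # lists the incomplete rows
newdf <- na.omit(mydf)         # keeps only the complete rows 1 and 4
nrow(newdf)                    # 2
```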
Advanced Handling of Missing Data
Most modeling functions in R offer options for dealing with missing values.
You can go beyond pairwise or listwise deletion of missing values through methods such as multiple imputation.
Good implementations that can be accessed through R include Amelia II, Mice, and mitools.
Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
Sys.Date() returns today's date.
date() returns the current date and time.
The following symbols can be used with the format( ) function to print dates.
Symbol   Meaning                   Example
%d       day as a number (0-31)    01-31
%a       abbreviated weekday       Mon
%A       unabbreviated weekday     Monday
%m       month (00-12)             00-12
%b       abbreviated month         Jan
%B       unabbreviated month       January
%y       2-digit year              07
%Y       4-digit year              2007
Here is an example.
# print today's date
today <- Sys.Date()
format(today, format="%B %d %Y")
"June 20 2007"
Date Conversion
Character to Date
You can use the as.Date() function to convert character data to dates.
The format is as.Date(x,"format"), where x is the character data and format gives the appropriate format.
# convert date info in format 'mm/dd/yyyy'
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
The default format is yyyy-mm-dd:
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
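A short round trip through the conversions above, assuming nothing beyond base R:

```r
# character -> Date using an explicit format
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
format(dates[1], "%Y-%m-%d")        # "1965-01-05"
dates[2] - dates[1]                 # a difftime measured in days
# dates are stored as days since 1970-01-01
as.numeric(as.Date("1970-01-02"))   # 1
```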
Date to Character
You can convert dates to character data using the as.character() function.
# convert dates to character data
strDates <- as.character(dates)
Learning More
See help(as.Date) and help(strftime) for details on converting character data to dates.
See help(ISOdatetime) for more information about formatting date/times.
Use the assignment operator <- to create new variables.
A wide array of operators and functions are available for this.
# Three examples for doing the same computations
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2
attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)
mydata <- transform( mydata,
sum = x1 + x2,
mean = (x1 + x2)/2
)
Recoding variables
In order to recode data, you will probably use one or more of R's control structures.
# create 2 age categories
mydata$agecat <- ifelse(mydata$age > 70, "older", "younger")
# another example: create 3 age categories
attach(mydata)
mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"
detach(mydata)
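cut() offers a vectorized alternative to the chained comparisons above; the breakpoints below mirror that example (the age vector is invented for illustration):

```r
age <- c(22, 50, 80, 45, 76)
# default right=TRUE gives intervals (-Inf,45], (45,75], (75,Inf)
agecat <- cut(age,
              breaks = c(-Inf, 45, 75, Inf),
              labels = c("Young", "Middle Aged", "Elder"))
as.character(agecat)   # "Young" "Middle Aged" "Elder" "Young" "Elder"
```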
Renaming variables
You can rename variables programmatically or interactively.
# rename interactively
fix(mydata) # results are saved on close
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
# you can re-enter all the variable names in order,
# changing the ones you need to change. The limitation
# is that you need to enter all of them!
names(mydata) <- c("x1","age","y", "ses")
R's binary and logical operators will look very familiar to programmers.
Note that binary operators work on vectors and matrices as well as scalars.
Arithmetic Operators
Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
^ or **     exponentiation
x %% y      modulus (x mod y): 5 %% 2 is 1
x %/% y     integer division: 5 %/% 2 is 2
Logical Operators
Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10
# How it works
x <- c(1:10)
x
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
x < 5
T T T T F F F F F F
x > 8 | x < 5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
Almost everything in R is done through functions.
Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables.
Numeric Functions
Function                  Description
abs(x)                    absolute value
sqrt(x)                   square root
ceiling(x)                ceiling(3.475) is 4
floor(x)                  floor(3.475) is 3
trunc(x)                  trunc(5.99) is 5
round(x, digits=n)        round(3.475, digits=2) is 3.48
signif(x, digits=n)       signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)    also acos(x), cosh(x), acosh(x), etc.
log(x)                    natural logarithm
log10(x)                  common logarithm
exp(x)                    e^x
Character Functions
substr(x, start=n1, stop=n2)
  Extract or replace substrings in a character vector.
  x <- "abcdef"
  substr(x, 2, 4) is "bcd"
  substr(x, 2, 4) <- "22222" makes x "a222ef"
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
  Search for pattern in x. If fixed=FALSE then pattern is a regular
  expression; if fixed=TRUE then pattern is a text string. Returns
  matching indices.
  grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
  Find pattern in x and replace with replacement text. If fixed=FALSE
  then pattern is a regular expression; if fixed=TRUE then pattern is
  a text string.
  sub("\\s", ".", "Hello There") returns "Hello.There"
strsplit(x, split)
  Split the elements of character vector x at split.
  strsplit("abc", "") returns a 3-element vector "a","b","c"
paste(..., sep="")
  Concatenate strings after using the sep string to separate them.
  paste("x", 1:3, sep="") returns c("x1","x2","x3")
  paste("x", 1:3, sep="M") returns c("xM1","xM2","xM3")
  paste("Today is", date())
toupper(x)
  Uppercase
tolower(x)
  Lowercase
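A few of these character functions in action on invented strings:

```r
x <- "abcdef"
substr(x, 2, 4)                     # "bcd"
grep("b", c("abc", "def", "bcd"))   # 1 3
sub("\\s", ".", "Hello There")      # "Hello.There"
strsplit("a,b,c", ",")[[1]]         # "a" "b" "c"
paste("x", 1:3, sep = "")           # "x1" "x2" "x3"
toupper(x)                          # "ABCDEF"
```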
Statistical Probability Functions
The following table describes functions related to probability distributions.
For random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.
dnorm(x)
  normal density function (by default m=0, sd=1)
  # plot standard normal curve
  x <- pretty(c(-3,3), 30)
  y <- dnorm(x)
  plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)
  cumulative normal probability for q
  (area under the normal curve to the left of q)
  pnorm(1.96) is 0.975
qnorm(p)
  normal quantile: value at the p percentile of the normal distribution
  qnorm(.9) is 1.28 # 90th percentile
rnorm(n, m=0, sd=1)
  n random normal deviates with mean m and standard deviation sd
  # 50 random normal variates with mean=50, sd=10
  x <- rnorm(50, m=50, sd=10)
dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
  binomial distribution, where size is the sample size
  and prob is the probability of a heads (pi)
  # prob of 0 to 5 heads of fair coin out of 10 flips
  dbinom(0:5, 10, .5)
  # prob of 5 or less heads of fair coin out of 10 flips
  pbinom(5, 10, .5)
dpois(x, lambda)
ppois(q, lambda)
qpois(p, lambda)
rpois(n, lambda)
  Poisson distribution with mean and variance equal to lambda
  # probability of 0, 1, or 2 events with lambda=4
  dpois(0:2, 4)
  # probability of at least 3 events with lambda=4
  1 - ppois(2, 4)
dunif(x, min=0, max=1)
punif(q, min=0, max=1)
qunif(p, min=0, max=1)
runif(n, min=0, max=1)
  uniform distribution, following the same pattern as the
  normal distribution above
  # 10 uniform random variates
  x <- runif(10)
Other Statistical Functions
Other useful statistical functions are provided in the following table.
Each has the option na.rm to strip missing values before calculations.
Otherwise the presence of missing values will lead to a missing result.
Object can be a numeric vector or data frame.
mean(x, trim=0, na.rm=FALSE)
  mean of object x
  # trimmed mean, removing any missing values and
  # 5 percent of highest and lowest scores
  mx <- mean(x, trim=.05, na.rm=TRUE)
sd(x)
  standard deviation of object x. Also see var(x) for variance
  and mad(x) for median absolute deviation.
median(x)
  median
quantile(x, probs)
  quantiles, where x is the numeric vector whose quantiles are desired
  and probs is a numeric vector with probabilities in [0,1]
  # 30th and 84th percentiles of x
  y <- quantile(x, c(.3,.84))
range(x)
  range
sum(x)
  sum
diff(x, lag=1)
  lagged differences, with lag indicating which lag to use
min(x)
  minimum
max(x)
  maximum
scale(x, center=TRUE, scale=TRUE)
  column center or standardize a matrix
Other Useful Functions
seq(from, to, by)
  generate a sequence
  indices <- seq(1,10,2)  # indices is c(1, 3, 5, 7, 9)
rep(x, ntimes)
  repeat x n times
  y <- rep(1:3, 2)  # y is c(1, 2, 3, 1, 2, 3)
cut(x, n)
  divide a continuous variable into a factor with n levels
  y <- cut(x, 5)
Note that while the examples on this page apply functions to individual variables, many can be applied to vectors and matrices as well.
R has the standard control structures you would expect.
expr can be multiple (compound) statements by enclosing them in braces { }.
It is more efficient to use built-in functions rather than control structures whenever possible.
if-else
if (cond) expr
if (cond) expr1 else expr2
for
for (var in seq) expr
while
while (cond) expr
switch
switch(expr, ...)
ifelse
ifelse(test,yes,no)
Example
# transpose of a matrix
# a poor alternative to built-in t() function
mytrans <- function(x) {
if (!is.matrix(x)) {
warning("argument is not a matrix: returning NA")
return(NA_real_)
}
y <- matrix(1, nrow=ncol(x), ncol=nrow(x))
for (i in 1:nrow(x)) {
for (j in 1:ncol(x)) {
y[j,i] <- x[i,j]
}
}
return(y)
}
# try it
z <- matrix(1:10, nrow=5, ncol=2)
tz <- mytrans(z)
One of the great strengths of R is the user's ability to add functions.
In fact, many of the functions in R are actually functions of functions.
The structure of a function is given below.
myfunction <- function(arg1, arg2, ...) {
statements
return(object)
}
Objects in the function are local to the function.
The object returned can be any data type.
Here is an example.
# function example - get measures of central tendency
# and spread for a numeric vector x. The user has a
# choice of measures and whether the results are printed.
mysummary <- function(x,npar=TRUE,print=TRUE) {
if (!npar) {
center <- mean(x); spread <- sd(x)
} else {
center <- median(x); spread <- mad(x)
}
if (print & !npar) {
cat("Mean=", center, "\n", "SD=", spread, "\n")
}
else if (print & npar) {
cat("Median=", center, "\n", "MAD=", spread, "\n")
}
result <- list(center=center,spread=spread)
return(result)
}
# invoking the function
set.seed(1234)
x <- rpois(500, 4)
y <- mysummary(x)
Median= 4
MAD= 1.4826
# y$center is the median (4)
# y$spread is the median absolute deviation (1.4826)
y <- mysummary(x, npar=FALSE, print=FALSE)
# no output
# y$center is the mean (4.052)
# y$spread is the standard deviation (2.01927)
It can be instructive to look at the code of a function.
In R, you can view a function's code by typing the function name without the ( ).
If this method fails, look at the following R Wiki link for hints on viewing function source code.
Finally, you may want to store your own functions, and have them available in every session.
You can customize the R environment to load your functions at start-up.
To sort a data frame in R, use the order( ) function.
By default, sorting is ASCENDING.
Prefix the sorting variable with a minus sign to indicate DESCENDING order.
Here are some examples.
# sorting examples using the mtcars dataset
attach(mtcars)
# sort by mpg
newdata <- mtcars[order(mpg),]
# sort by mpg and cyl
newdata <- mtcars[order(mpg, cyl),]
#sort by mpg (ascending) and cyl (descending)
newdata <- mtcars[order(mpg, -cyl),]
detach(mtcars)
Adding Columns
To merge two data frames (datasets) horizontally, use the merge function.
In most cases, you join two data frames by one or more common key variables (i.e., an inner join).
# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by="ID")
# merge two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
Adding Rows
To join two data frames (datasets) vertically, use the rbind function.
The two data frames must have the same variables, but they do not have to be in the same order.
total <- rbind(dataframeA, dataframeB)
If dataframeA has variables that dataframeB does not, then either:
delete the extra variables in dataframeA, or
create the additional variables in dataframeB and set them to NA (missing) before joining them with rbind().
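A sketch of the second option with two invented data frames:

```r
dfA <- data.frame(id = 1:2, x = c(10, 20), extra = c("p", "q"))
dfB <- data.frame(id = 3:4, x = c(30, 40))
dfB$extra <- NA          # create the missing variable, filled with NA
total <- rbind(dfA, dfB) # now both frames have the same variables
nrow(total)              # 4
total$extra              # "p" "q" NA NA
```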
It is relatively easy to collapse data in R using one or more BY variables and a defined function.
# aggregate data frame mtcars by cyl and vs, returning means
# for numeric variables
attach(mtcars)
aggdata <- aggregate(mtcars, by=list(cyl, vs), FUN=mean, na.rm=TRUE)
print(aggdata)
detach(mtcars)
When using the aggregate() function, the by variables must be in a list (even if there is only one).
The function can be built-in or user provided.
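Naming the list elements gives the grouping columns readable names in the output; this variation on the example above is a sketch using the built-in mtcars data:

```r
# named list elements become the group column names
aggdata <- aggregate(mtcars[c("mpg", "wt")],
                     by = list(Cylinders = mtcars$cyl, VS = mtcars$vs),
                     FUN = mean, na.rm = TRUE)
names(aggdata)   # "Cylinders" "VS" "mpg" "wt"
```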
R provides a variety of methods for reshaping data prior to analysis.
Transpose
Use the t() function to transpose a matrix or a data frame.
In the latter case, rownames become variable (column) names.
# example using built-in dataset
mtcars
t(mtcars)
The Reshape Package
Hadley Wickham has created a comprehensive package called reshape to massage data.
Both an introduction and article are available.
There is even a video!
Basically, you "melt" data so that each row is a unique id-variable combination.
Then you "cast" the melted data into any shape you would like.
Here is a very simple example.
mydata

id  time  x1  x2
1   1     5   6
1   2     3   5
2   1     6   1
2   2     2   4
# example of melt function
library(reshape)
mdata <- melt(mydata, id=c("id","time"))
mdata

id  time  variable  value
1   1     x1        5
1   2     x1        3
2   1     x1        6
2   2     x1        2
1   1     x2        6
1   2     x2        5
2   1     x2        1
2   2     x2        4
# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(mdata, id~variable, mean)
timemeans <- cast(mdata, time~variable, mean)
subjmeans

id  x1  x2
1   4   5.5
2   4   2.5

timemeans

time  x1   x2
1     5.5  3.5
2     2.5  4.5
There is much more that you can do with the melt( ) and cast( ) functions.
See the documentation for more details.
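The whole melt-then-cast round trip above can be run as one script (this assumes the reshape package is installed):

```r
library(reshape)
# the same toy data frame shown above
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))
# melt: one row per id-variable combination
mdata <- melt(mydata, id = c("id", "time"))
# cast back into the two summary shapes
subjmeans <- cast(mdata, id ~ variable, mean)
timemeans <- cast(mdata, time ~ variable, mean)
subjmeans$x2    # 5.5 2.5, the per-subject means shown above
timemeans$x1    # 5.5 2.5, the per-time means shown above
```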
R has powerful indexing features for accessing object elements.
These features can be used to select and exclude variables and observations.
The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.
Selecting (Keeping) Variables
# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
# another method
myvars <- paste("v", 1:3, sep="")
newdata <- mydata[myvars]
# select 1st and 5th thru 10th variables
newdata <- mydata[c(1,5:10)]
Selecting Observations
# first 5 observations
newdata <- mydata[1:5,]
# based on variable values
newdata <- mydata[ which(mydata$gender=='F'
& mydata$age > 65), ]
# or
attach(mydata)
newdata <- mydata[ which(gender=='F' & age > 65),]
detach(mydata)
Selection using the Subset Function
The subset( ) function is the easiest way to select variables and observations.
In the following example, we select all rows that have a value of age greater than or equal to 20 or age less than 10.
We keep the ID and Weight columns.
# using subset function
newdata <- subset(mydata, age >= 20 | age < 10,
select=c(ID, Weight))
In the next example, we select all men over the age of 25 and we keep variables weight through income (weight, income and all columns between them).
# using subset function (part 2)
newdata <- subset(mydata, sex=="m" & age > 25,
select=weight:income)
Random Samples
Use the sample( ) function to take a random sample of size n from a dataset.
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50,
replace=FALSE),]
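Adding set.seed() makes the draw reproducible; in this sketch the built-in mtcars data stands in for mydata:

```r
set.seed(1234)   # reproducible pseudo-random draw
mysample <- mtcars[sample(1:nrow(mtcars), 10, replace = FALSE), ]
nrow(mysample)                      # 10
anyDuplicated(rownames(mysample))   # 0: no row drawn twice
```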
Type conversions in R work as you would expect.
For example, adding a character string to a numeric vector converts all the elements in the vector to character.
Use is.foo to test for data type foo; it returns TRUE or FALSE.
Use as.foo to explicitly convert.
is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
Examples

                  to one long vector    to matrix                   to data frame
from vector       c(x,y)                cbind(x,y) or rbind(x,y)    data.frame(x,y)
from matrix       as.vector(mymatrix)                               as.data.frame(mymatrix)
from data frame                         as.matrix(myframe)
Dates
You can convert dates to and from character or numeric data.
See date values for more information.
R provides a wide range of functions for obtaining summary statistics.
One method of obtaining descriptive statistics is to use the sapply( ) function with a specified summary statistic.
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.
There are also numerous R functions designed to provide a range of descriptive statistics at once.
For example
# mean,median,25th and 75th quartiles,min,max
summary(mydata)
# Tukey min,lower-hinge, median,upper-hinge,max
fivenum(x)
Using the Hmisc package
library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores
Using the pastecs package
library(pastecs)
stat.desc(mydata)
# nbr.val, nbr.null, nbr.na, min, max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, coef.var
Using the psych package
library(psych)
describe(mydata)
# item name, item number, nvalid, mean, sd,
# median, mad, min, max, skew, kurtosis, se
Summary Statistics by Group
A simple way of generating summary statistics by grouping variable is available in the psych package.
library(psych)
describe.by(mydata, group,...)
The doBy package provides much of the functionality of SAS PROC SUMMARY.
It defines the desired table using a model formula and a function.
Here is a simple example.
library(doBy)
summaryBy(mpg + wt ~ cyl + vs, data = mtcars,
FUN = function(x) {
c(m = mean(x), s = sd(x))
} )
# produces mpg.m wt.m mpg.s wt.s for each
# combination of the levels of cyl and vs
See also: aggregating data.
This section describes the creation of frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results.
Generating Frequency Tables
R provides many methods for creating frequency and contingency tables.
Three are described below.
In the following examples, assume that A, B, and C represent categorical variables.
table
You can generate frequency tables using the table( ) function, tables of proportions using the prop.table( ) function, and marginal frequencies using margin.table( ).
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
table() can also generate multidimensional tables based on 3 or more categorical variables.
In this case, use the ftable( ) function to print the results more attractively.
# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)
table() ignores missing values.
To include NA as a category in counts, include the table option exclude=NULL if the variable is a vector.
If the variable is a factor you have to create a new factor using newfactor <- factor(oldfactor, exclude=NULL).
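A small sketch of both NA-handling cases, on an invented vector:

```r
x <- c("a", "b", NA, "a")
table(x)                        # NA dropped: a=2, b=1
table(x, exclude = NULL)        # <NA> counted as its own category
f  <- factor(x)                 # NA is not a level
f2 <- factor(x, exclude = NULL) # NA kept as a level
length(levels(f))               # 2
length(levels(f2))              # 3
```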
xtabs
The xtabs( ) function allows you to create crosstabulations using formula style input.
# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of independence
If a variable is included on the left side of the formula, it is assumed to be a vector of frequencies (useful if the data have already been tabulated).
CrossTable
The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ in SAS or CROSSTABS in SPSS.
It has a wealth of options.
# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
There are options to report percentages (row, column, cell), specify decimal places, produce Chi-square, Fisher, and McNemar tests of independence, report expected and residual values (pearson, standardized, adjusted standardized), include missing values as valid, annotate with row and column titles, and format as SAS or SPSS style output!
See help(CrossTable) for details.
Tests of Independence
Chi-Square Test
For 2-way tables you can use chisq.test(mytable) to test independence of the row and column variable.
By default, the p-value is calculated from the asymptotic chi-squared distribution of the test statistic.
Optionally, the p-value can be derived via Monte Carlo simulation.
Fisher Exact Test
fisher.test(x) provides an exact test of independence.
x is a two dimensional contingency table in matrix form.
Mantel-Haenszel test
Use the mantelhaen.test(x) function to perform a Cochran-Mantel-Haenszel chi-squared test of the null hypothesis that two nominal variables are conditionally independent in each stratum, assuming that there is no three-way interaction. x is a 3 dimensional contingency table, where the last dimension refers to the strata.
Loglinear Models
You can use the loglm( ) function in the MASS package to produce log-linear models.
For example, let's assume we have a 3-way contingency table based on variables A, B, and C.
library(MASS)
mytable <- xtabs(~A+B+C, data=mydata)
We can perform the following tests:
Mutual Independence: A, B, and C are pairwise independent.
loglm(~A+B+C, mytable)
Partial Independence: A is partially independent of B and C (i.e., A is independent of the composite variable BC).
loglm(~A+B+C+B*C, mytable)
Conditional Independence: A is independent of B, given C.
loglm(~A+B+C+A*C+B*C, mytable)
No Three-Way Interaction:
loglm(~A+B+C+A*B+A*C+B*C, mytable)
Martin Theus and Stephan Lauer have written an excellent article on Visualizing Loglinear Models, using mosaic plots.
Measures of Association
The assocstats(mytable) function in the vcd package calculates the phi coefficient, contingency coefficient, and Cramer's V for an rxc table.
The Kappa(mytable) function in the vcd package calculates Cohen's kappa and weighted kappa for a confusion matrix.
See Richard Darlington's article on Measures of Association in Crosstab Tables for an excellent review of these statistics.
Visualizing results
Use bar and pie charts for visualizing frequencies in one dimension.
Use the vcd package for visualizing relationships among categorical data (e.g.
mosaic and association plots).
Use the ca package for correspondence analysis (visually exploring relationships between rows and columns in contingency tables).
Converting Frequency Tables to an "Original" Flat file
Finally, there may be times that you will need the original "flat file" data frame rather than the frequency table.
Marc Schwartz has provided code on the R-help mailing list for converting a table back into a data frame.
You can use the cor() function to produce correlations and the cov() function to produce covariances.
A simplified format is cor(x, use=, method=), where:

Option    Description
x         matrix or data frame
use       specifies the handling of missing data. Options are all.obs
          (assumes no missing data; missing data will produce an error),
          complete.obs (listwise deletion), and pairwise.complete.obs
          (pairwise deletion).
method    specifies the type of correlation. Options are pearson,
          spearman, or kendall.
# Correlations/covariances among numeric variables in
# data frame mtcars. Use listwise deletion of missing data.
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
Unfortunately, neither cor( ) or cov( ) produce tests of significance, although you can use the cor.test( ) function to test a single correlation coefficient.
The rcorr( ) function in the Hmisc package produces correlations/covariances and significance levels for pearson and spearman correlations.
However, input must be a matrix and pairwise deletion is used.
# Correlations with significance levels
library(Hmisc)
rcorr(x, type="pearson")
# type can be pearson or spearman
#mtcars is a data frame
rcorr(as.matrix(mtcars))
You can use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the columns of Y.
This is similar to the VAR and WITH commands in SAS PROC CORR.
# Correlation matrix from mtcars
# with mpg, cyl, and disp as rows
# and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
Other Types of Correlations
# polychoric correlation
# x is a contingency table of counts
library(polycor)
polychor(x)
# heterogeneous correlations in one matrix
# pearson (numeric-numeric),
# polyserial (numeric-ordinal),
# and polychoric (ordinal-ordinal)
# x is a data frame with ordered factors
# and numeric variables
library(polycor)
hetcor(x)
# partial correlations
library(ggm)
data(mydata)
pcor(c("a", "b", "x", "y", "z"), var(mydata))
# partial corr between a and b controlling for x, y, z
The t.test( ) function produces a variety of t-tests.
Unlike most statistical packages, the default assumes unequal variance and applies the Welch df modification.
# independent 2-group t-test
t.test(y~x) # where y is numeric and x is a binary factor
# independent 2-group t-test
t.test(y1,y2) # where y1 and y2 are numeric
# paired t-test
t.test(y1,y2,paired=TRUE) # where y1 & y2 are numeric
# one sample t-test
t.test(y,mu=3) # Ho: mu=3
You can use the var.equal = TRUE option to specify equal variances and a pooled variance estimate.
You can use the alternative="less" or alternative="greater" option to specify a one tailed test.
Nonparametric and resampling alternatives to t-tests are available.
R provides functions for carrying out Mann-Whitney U, Wilcoxon Signed Rank, Kruskal Wallis, and Friedman tests.
# independent 2-group Mann-Whitney U Test
wilcox.test(y~A)
# where y is numeric and A is a binary factor
# independent 2-group Mann-Whitney U Test
wilcox.test(y,x) # where y and x are numeric
# dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
# Kruskal Wallis Test - One Way ANOVA by Ranks
kruskal.test(y~A) # where y is numeric and A is a factor
# Randomized Block Design - Friedman Test
friedman.test(y~A|B)
# where y are the data values, A is a grouping factor
# and B is a blocking factor
For the wilcox.test you can use the alternative="less" or alternative="greater" option to specify a one tailed test.
Parametric and resampling alternatives are available.
The pgirmess package provides nonparametric multiple comparisons, as does the npmc package used below.
(Note: npmc has been withdrawn from CRAN but is still available in the archives.)
library(npmc)
npmc(x)
# where x is a data frame containing variable 'var'
# (response variable) and 'class' (grouping variable)
R provides comprehensive support for multiple linear regression.
The topics below are provided in order of increasing complexity.
Fitting the Model
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
# Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
fitted(fit) # predicted values
residuals(fit) # residuals
anova(fit) # anova table
vcov(fit) # covariance matrix for model parameters
influence(fit) # regression diagnostics
Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
For a more comprehensive evaluation of model fit, see regression diagnostics.
Comparing Models
You can compare nested models with the anova( ) function.
The following code provides a simultaneous test that x3 and x4 add to linear prediction above and beyond x1 and x2.
# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
Cross Validation
You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package.
# K-fold cross-validation
library(DAAG)
cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation
Sum the MSE for each fold, divide by the number of observations, and take the square root to get the cross-validated standard error of estimate.
You can assess R2 shrinkage via K-fold cross-validation.
Using the crossval() function from the bootstrap package, do the following:
# Assessing R2 shrinkage using 10-Fold Cross-Validation
fit <- lm(y~x1+x2+x3,data=mydata)
library(bootstrap)
# define functions
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
# matrix of predictors
X <- as.matrix(mydata[c("x1","x2","x3")])
# vector of observed values
y <- as.matrix(mydata[c("y")])
results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)
cor(y, fit$fitted.values)**2 # raw R2
cor(y,results$cv.fit)**2 # cross-validated R2
Variable Selection
Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial topic.
You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package.
stepAIC( ) performs stepwise model selection by exact AIC.
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Alternatively, you can perform all-subsets regression using the leaps( ) function from the leaps package.
In the following code nbest indicates the number of subsets of each size to report.
Here, the ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.).
# All Subsets Regression
library(leaps)
attach(mydata)
leaps<-regsubsets(y~x1+x2+x3+x4,data=mydata,nbest=10)
# view results
summary(leaps)
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps,scale="r2")
# plot statistic by subset size
library(car)
subsets(leaps, statistic="rsq")
Other options for plot( ) are bic, Cp, and adjr2.
Other options for plotting with subsets( ) are bic, cp, adjr2, and rss.
Relative Importance
The relaimpo package provides measures of relative importance for each of the predictors in the model.
See help(calc.relimp) for details on the four measures of relative importance provided.
# Calculate Relative Importance for Each Predictor
library(relaimpo)
calc.relimp(fit,type=c("lmg","last","first","pratt"),
rela=TRUE)
# Bootstrap Measures of Relative Importance (1000 samples)
boot <- boot.relimp(fit, b = 1000, type = c("lmg",
"last", "first", "pratt"), rank = TRUE,
diff = TRUE, rela = TRUE)
booteval.relimp(boot) # print result
plot(booteval.relimp(boot,sort=TRUE)) # plot result
Graphic Enhancements
The car package offers a wide variety of plots for regression, including added-variable plots and enhanced diagnostic and scatter plots.
There are many functions in R to aid with robust regression.
For example, you can perform robust regression with the rlm( ) function in the MASS package.
John Fox's (who else?) Robust Regression provides a good starting overview.
The UCLA Statistical Computing website has Robust Regression Examples.
The robust package provides a comprehensive library of robust methods, including regression.
The robustbase package also provides basic robust statistics including model selection methods.
And David Olive has provided a detailed online review of Applied Robust Statistics with sample R code.
To Practice
This course in machine learning in R includes exercises in multiple regression and cross-validation.
An excellent review of regression diagnostics is provided in John Fox's aptly named Overview of Regression Diagnostics.
Dr. Fox's car package provides advanced utilities for regression modeling.
# Assume that we are fitting a multiple linear regression
# on the MTCARS data
library(car)
fit <- lm(mpg~disp+hp+wt+drat, data=mtcars)
This example is for exposition only.
We will ignore the fact that this may not be a great way of modeling this particular set of data!
Outliers
# Assessing Outliers
outlierTest(fit) # Bonferroni p-value for most extreme obs
qqPlot(fit, main="QQ Plot") #qq plot for studentized resid
leveragePlots(fit) # leverage plots
Influential Observations
# Influential Observations
# added variable plots
avPlots(fit)
# Cook's D plot
# identify D values > 4/(n-k-1)
cutoff <- 4/(nrow(mtcars)-length(fit$coefficients))
plot(fit, which=4, cook.levels=cutoff)
# Influence Plot
influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
Non-normality
# Normality of Residuals
# qq plot for studentized resid
qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
library(MASS)
sresid <- studres(fit)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
Non-constant Error Variance
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(fit)
# plot studentized residuals vs. fitted values
spreadLevelPlot(fit)
# Test for Autocorrelated Errors
durbinWatsonTest(fit)
Additional Diagnostic Help
The gvlma( ) function in the gvlma package performs a global validation of linear model assumptions as well as separate evaluations of skewness, kurtosis, and heteroscedasticity.
# Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)
Going Further
If you would like to delve deeper into regression diagnostics, two books written by John Fox can help: Applied regression analysis and generalized linear models (2nd ed) and An R and S-Plus companion to applied regression.
If you have been analyzing ANOVA designs in traditional statistical packages, you are likely to find R's approach less coherent and user-friendly.
A good online presentation on ANOVA in R can be found in the ANOVA section of the Personality Project.
(Note: I have found that these pages render fine in Chrome and Safari browsers, but can appear distorted in Internet Explorer.)
1. Fit a Model
In the following examples lower case letters are numeric variables and upper case letters are factors.
# One Way Anova (Completely Randomized Design)
fit <- aov(y ~ A, data=mydataframe)
# Randomized Block Design (B is the blocking factor)
fit <- aov(y ~ A + B, data=mydataframe)
# Two Way Factorial Design
fit <- aov(y ~ A + B + A:B, data=mydataframe)
fit <- aov(y ~ A*B, data=mydataframe)
# same thing
# Analysis of Covariance
fit <- aov(y ~ A + x, data=mydataframe)
For within subjects designs, the data frame has to be rearranged so that each measurement on a subject is a separate observation.
See R and Analysis of Variance.
# One Within Factor
fit <- aov(y~A+Error(Subject/A),data=mydataframe)
# Two Within Factors W1 W2, Two Between Factors B1 B2
fit <- aov(y~(W1*W2*B1*B2)+Error(Subject/(W1*W2))+(B1*B2),
data=mydataframe)
2. Look at Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
layout(matrix(c(1,2,3,4),2,2)) # optional layout
plot(fit) # diagnostic plots
For details on the evaluation of test requirements, see (M)ANOVA Assumptions.
3. Evaluate Model Effects
WARNING: R provides Type I sequential SS, not the default Type III marginal SS reported by SAS and SPSS.
In a nonorthogonal design with more than one term on the right hand side of the equation, order will matter (i.e., A+B and B+A will produce different results)! We will need to use the drop1( ) function to produce the familiar Type III results.
It will compare each term with the full model.
Alternatively, we can use anova(fit.model1, fit.model2) to compare nested models directly.
summary(fit) # display Type I ANOVA table
drop1(fit,~.,test="F") # type III SS and F Tests
Nonparametric and resampling alternatives are available.
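The nested-model comparison mentioned above can be sketched with the built-in warpbreaks data (used here purely for illustration):

```r
# F test comparing a reduced model to a fuller nested model
fit1 <- aov(breaks ~ wool, data=warpbreaks)
fit2 <- aov(breaks ~ wool + tension, data=warpbreaks)
anova(fit1, fit2)  # tests the contribution of tension
```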
Multiple Comparisons
You can get Tukey HSD tests using the function below.
By default, it calculates post hoc comparisons on each factor in the model.
You can specify specific factors as an option.
Again, remember that results are based on Type I SS!
# Tukey Honestly Significant Differences
TukeyHSD(fit) # where fit comes from aov()
Visualizing Results
Use box plots and line plots to visualize group differences.
There are also two functions specifically designed for visualizing mean differences in ANOVA layouts.
interaction.plot( ) in the base stats package produces plots for two-way interactions.
plotmeans( ) in the gplots package produces mean plots for single factors, and includes confidence intervals.
# Two-way Interaction Plot
attach(mtcars)
gear <- factor(gear)
cyl <- factor(cyl)
interaction.plot(cyl, gear, mpg, type="b", col=c(1:3),
leg.bty="o", leg.bg="beige", lwd=2, pch=c(18,24,22),
xlab="Number of Cylinders",
ylab="Mean Miles Per Gallon",
main="Interaction Plot")
# Plot Means with Error Bars
library(gplots)
attach(mtcars)
cyl <- factor(cyl)
plotmeans(mpg~cyl,xlab="Number of Cylinders",
ylab="Miles Per Gallon", main="Mean Plot\nwith 95% CI")
MANOVA
If there is more than one dependent (outcome) variable, you can test them simultaneously using a multivariate analysis of variance (MANOVA).
In the following example, let Y be a matrix whose columns are the dependent variables.
# 2x2 Factorial MANOVA with 3 Dependent Variables.
Y <- cbind(y1,y2,y3)
fit <- manova(Y ~ A*B)
summary(fit, test="Pillai")
Other test options are "Wilks", "Hotelling-Lawley", and "Roy".
Use summary.aov( ) to get univariate statistics.
TukeyHSD( ) and plot( ) will not work with a MANOVA fit.
Run each dependent variable separately to obtain them.
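For example, with the built-in iris data (chosen only for illustration):

```r
# univariate follow-ups after a MANOVA
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data=iris)
summary.aov(fit)  # one ANOVA table per dependent variable
# TukeyHSD() must be run on a separate aov() fit per outcome
TukeyHSD(aov(Sepal.Length ~ Species, data=iris))
```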
Like ANOVA, MANOVA results in R are based on Type I SS.
To obtain Type III SS, vary the order of variables in the model and rerun the analyses.
For example, fit y~A*B for the Type III B effect and y~B*A for the Type III A effect.
Going Further
R has excellent facilities for fitting linear and generalized linear mixed-effects models.
The latest implementation is in package lme4.
See the R News Article on Fitting Mixed Linear Models in R for details.
In classical parametric procedures we often assume normality and constant variance for the model error term.
Methods of exploring these assumptions in an ANOVA/ANCOVA/MANOVA framework are discussed here.
Regression diagnostics are covered under multiple linear regression.
Outliers
Since outliers can severely affect normality and homogeneity of variance, methods for detecting disparate observations are described first.
The aq.plot() function in the mvoutlier package allows you to identify multivariate outliers by plotting the ordered squared robust Mahalanobis distances of the observations against the empirical distribution function of the squared Mahalanobis distances.
Input consists of a matrix or data frame.
The function produces 4 graphs and returns a boolean vector identifying the outliers.
# Detect Outliers in the MTCARS Data
library(mvoutlier)
outliers <- aq.plot(mtcars[c("mpg","disp","hp","drat","wt","qsec")])
outliers # show list of outliers
Univariate Normality
You can evaluate the normality of a variable using a Q-Q plot.
# Q-Q Plot for variable MPG
attach(mtcars)
qqnorm(mpg)
qqline(mpg)
Significant departures from the line suggest violations of normality.
You can also perform a Shapiro-Wilk test of normality with the shapiro.test(x) function, where x is a numeric vector.
Additional functions for testing normality are available in the nortest package.
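For example, using the mpg variable from the built-in mtcars data:

```r
# Shapiro-Wilk test of normality for a numeric vector;
# a large p-value is consistent with normality
shapiro.test(mtcars$mpg)
```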
Multivariate Normality
MANOVA assumes multivariate normality.
The function mshapiro.test( ) in the mvnormtest package produces the Shapiro-Wilk test for multivariate normality.
Input must be a numeric matrix.
# Test Multivariate Normality
library(mvnormtest)
# mshapiro.test() expects variables in rows, so transpose
# an n x p data matrix M before testing
mshapiro.test(t(M))
If x is a p x 1 multivariate normal random vector with mean vector μ and covariance matrix Σ, then the squared Mahalanobis distance between x and μ is chi-square distributed with p degrees of freedom.
We can use this fact to construct a Q-Q plot to assess multivariate normality.
# Graphical Assessment of Multivariate Normality
x <- as.matrix(mydata) # n x p numeric matrix
center <- colMeans(x) # centroid
n <- nrow(x); p <- ncol(x); cov <- cov(x);
d <- mahalanobis(x,center,cov) # distances
qqplot(qchisq(ppoints(n),df=p),d,
main="QQ Plot Assessing Multivariate Normality",
ylab="Mahalanobis D2")
abline(a=0,b=1)
Homogeneity of Variances
The bartlett.test( ) function provides a parametric K-sample test of the equality of variances.
The fligner.test( ) function provides a non-parametric test of the same.
In the following examples y is a numeric variable and G is the grouping variable.
# Bartlett Test of Homogeneity of Variances
bartlett.test(y~G, data=mydata)
# Fligner-Killeen Test of Homogeneity of Variances
fligner.test(y~G, data=mydata)
The hovPlot( ) function in the HH package provides a graphic test of homogeneity of variances based on Brown-Forsyth.
In the following example, y is numeric and G is a grouping factor.
Note that G must be of type factor.
# Homogeneity of Variance Plot
library(HH)
hov(y~G, data=mydata)
hovPlot(y~G,data=mydata)
Homogeneity of Covariance Matrices
MANOVA and LDF assume homogeneity of variance-covariance matrices.
The assumption is usually tested with Box's M.
Unfortunately the test is very sensitive to violations of normality, leading to rejection in most typical cases.
Box's M is available via the boxM function in the biotools package.
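A minimal sketch with the built-in iris data (requires the biotools package):

```r
# Box's M test of homogeneity of covariance matrices
# across the three iris species
library(biotools)
boxM(iris[, 1:4], iris$Species)
```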
To Practice
Try the free first chapter of this course on ANOVA with R.
The coin package provides the ability to perform a wide variety of re-randomization or permutation based statistical tests.
These tests do not assume random sampling from well-defined populations.
They can be a reasonable alternative to classical procedures when test assumptions can not be met.
See coin: A Computational Framework for Conditional Inference for details.
In the examples below, lower case letters represent numerical variables and upper case letters represent categorical factors.
Monte-Carlo simulations are available for all tests.
Exact tests are available for 2 group procedures.
Independent Two- and K-Sample Location Tests
# Exact Wilcoxon Mann Whitney Rank Sum Test
# where y is numeric and A is a binary factor
library(coin)
wilcox_test(y~A, data=mydata, distribution="exact")
# One-Way Permutation Test based on 9999 Monte-Carlo
# resamplings. y is numeric and A is a categorical factor
library(coin)
oneway_test(y~A, data=mydata,
distribution=approximate(B=9999))
Symmetry of a response for repeated measurements
# Exact Wilcoxon Signed Rank Test
# where y1 and y2 are repeated measures
library(coin)
wilcoxsign_test(y1~y2, data=mydata, distribution="exact")
# Friedman Test based on 9999 Monte-Carlo resamplings.
# y is numeric, A is a grouping factor, and B is a
# blocking factor.
library(coin)
friedman_test(y~A|B, data=mydata,
distribution=approximate(B=9999))
Independence of Two Numeric Variables
# Spearman Test of Independence based on 9999 Monte-Carlo
# resamplings. x and y are numeric variables.
library(coin)
spearman_test(y~x, data=mydata,
distribution=approximate(B=9999))
Independence in Contingency Tables
# Independence in 2-way Contingency Table based on
# 9999 Monte-Carlo resamplings. A and B are factors.
library(coin)
chisq_test(A~B, data=mydata,
distribution=approximate(B=9999))
# Cochran-Mantel-Haenszel Test of 3-way Contingency Table
# based on 9999 Monte-Carlo resamplings. A and B are factors
# and C is a stratifying factor.
library(coin)
mh_test(A~B|C, data=mydata,
distribution=approximate(B=9999))
# Linear by Linear Association Test based on 9999
# Monte-Carlo resamplings. A and B are ordered factors.
library(coin)
lbl_test(A~B, data=mydata,
distribution=approximate(B=9999))
Many other univariate and multivariate tests are possible using the functions in the coin package.
See A Lego System for Conditional Inference for more details.
Power analysis is an important aspect of experimental design.
It allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence.
Conversely, it allows us to determine the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints.
If the probability is unacceptably low, we would be wise to alter or abandon the experiment.
The following four quantities have an intimate relationship:
sample size
effect size
significance level = P(Type I error) = probability of finding an effect that is not there
power = 1 - P(Type II error) = probability of finding an effect that is there
Given any three, we can determine the fourth.
Power Analysis in R
The pwr package, developed by Stéphane Champely, implements power analysis as outlined by Cohen (1988).
Some of the more important functions are listed below.
function          power calculations for
pwr.2p.test       two proportions (equal n)
pwr.2p2n.test     two proportions (unequal n)
pwr.anova.test    balanced one way ANOVA
pwr.chisq.test    chi-square test
pwr.f2.test       general linear model
pwr.p.test        proportion (one sample)
pwr.r.test        correlation
pwr.t.test        t-tests (one sample, 2 sample, paired)
pwr.t2n.test      t-test (two samples with unequal n)
For each of these functions, you enter three of the four quantities (effect size, sample size, significance level, power) and the fourth is calculated.
The significance level defaults to 0.05.
Therefore, to calculate the significance level, given an effect size, sample size, and power, use the option "sig.level=NULL".
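For example (a sketch; the n, d, and power values are arbitrary):

```r
library(pwr)
# solve for the significance level implied by n, d, and power
pwr.t.test(n=30, d=0.5, power=0.80, sig.level=NULL,
           type="two.sample")
```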
Specifying an effect size can be a daunting task.
ES formulas and Cohen's suggestions (based on social science research) are provided below.
Cohen's suggestions should only be seen as very rough guidelines.
Your own subject matter experience should be brought to bear.
(To explore confidence intervals and drawing conclusions from samples try this interactive course on the foundations of inference.)
t-tests
For t-tests, use the following functions:
pwr.t.test(n = , d = , sig.level = , power = ,
type = c("two.sample", "one.sample", "paired"))
where n is the sample size, d is the effect size, and type indicates a two-sample t-test, one-sample t-test or paired t-test.
If you have unequal sample sizes, use
pwr.t2n.test(n1 = , n2= , d = , sig.level =, power = )
where n1 and n2 are the sample sizes.
For t-tests, the effect size is assessed as Cohen's d = (mean1 - mean2) / (pooled standard deviation).
Cohen suggests that d values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes respectively.
You can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed, or one-tailed test.
A two tailed test is the default.
ANOVA
For a one-way analysis of variance use
pwr.anova.test(k = , n = , f = , sig.level = , power = )
where k is the number of groups and n is the common sample size in each group.
For a one-way ANOVA, effect size is measured by f, the standard deviation of the group means divided by the common within-group standard deviation.
Cohen suggests that f values of 0.1, 0.25, and 0.4 represent small, medium, and large effect sizes respectively.
Correlations
For correlation coefficients use
pwr.r.test(n = , r = , sig.level = , power = )
where n is the sample size and r is the correlation.
We use the population correlation coefficient as the effect size measure.
Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes respectively.
Linear Models
For linear models (e.g., multiple regression) use
pwr.f2.test(u =, v = , f2 = , sig.level = , power = )
where u and v are the numerator and denominator degrees of freedom.
We use f2 as the effect size measure.
When evaluating the impact of a set of predictors on an outcome, f2 = R2 / (1 - R2).
When evaluating the impact of one set of predictors above and beyond a second set of predictors (or covariates), f2 = (R2AB - R2A) / (1 - R2AB).
Cohen suggests f2 values of 0.02, 0.15, and 0.35 represent small, medium, and large effect sizes.
Tests of Proportions
When comparing two proportions use
pwr.2p.test(h = , n = , sig.level =, power = )
where n is the common sample size in each group and h is the effect size, h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)).
Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes respectively.
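The pwr package provides ES.h( ) to compute h from two proportions (the proportions below are arbitrary):

```r
library(pwr)
h <- ES.h(0.65, 0.45)  # arcsine-transformed difference in proportions
pwr.2p.test(h=h, sig.level=0.05, power=0.80)  # solve for n
```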
For unequal n's use
pwr.2p2n.test(h = , n1 = , n2 = , sig.level = , power = )
To test a single proportion use
pwr.p.test(h = , n = , sig.level = , power = )
For both two sample and one sample proportion tests, you can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed, or one-tailed test.
A two tailed test is the default.
Chi-square Tests
For chi-square tests use
pwr.chisq.test(w =, N = , df = , sig.level =, power = )
where w is the effect size, N is the total sample size, and df is the degrees of freedom.
The effect size w is defined as w = sqrt(sum((p0i - p1i)^2 / p0i)), where p0i and p1i are the cell probabilities under the null and alternative hypotheses respectively.
Cohen suggests that w values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes respectively.
Some Examples
library(pwr)
# For a one-way ANOVA comparing 5 groups, calculate the
# sample size needed in each group to obtain a power of
# 0.80, when the effect size is moderate (0.25) and a
# significance level of 0.05 is employed.
pwr.anova.test(k=5,f=.25,sig.level=.05,power=.8)
# What is the power of a one-tailed t-test, with a
# significance level of 0.01, 25 people in each group,
# and an effect size equal to 0.75?
pwr.t.test(n=25,d=0.75,sig.level=.01,alternative="greater")
# Using a two-tailed test of proportions, and assuming a
# significance level of 0.01 and a common sample size of
# 30 for each proportion, what effect size can be detected
# with a power of .75?
pwr.2p.test(n=30,sig.level=0.01,power=0.75)
Creating Power or Sample Size Plots
The functions in the pwr package can be used to generate power and sample size graphs.
# Plot sample size curves for detecting correlations of
# various sizes.
library(pwr)
# range of correlations
r <- seq(.1,.5,.01)
nr <- length(r)
# power values
p <- seq(.4,.9,.1)
np <- length(p)
# obtain sample sizes
samsize <- array(numeric(nr*np), dim=c(nr,np))
for (i in 1:np){
for (j in 1:nr){
result <- pwr.r.test(n = NULL, r = r[j],
sig.level = .05, power = p[i],
alternative = "two.sided")
samsize[j,i] <- ceiling(result$n)
}
}
# set up graph
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
xlab="Correlation Coefficient (r)",
ylab="Sample Size (n)" )
# add power curves
for (i in 1:np){
lines(r, samsize[,i], type="l", lwd=2, col=colors[i])
}
# add annotation (grid lines, title, legend)
abline(v=0, h=seq(0,yrange[2],50), lty=2, col="grey89")
abline(h=0, v=seq(xrange[1],xrange[2],.02), lty=2,
col="grey89")
title("Sample Size Estimation for Correlation Studies\n
Sig=0.05 (Two-tailed)")
legend("topright", title="Power",
as.character(p),
fill=colors)
There are two functions that can help write simpler and more efficient code.
With
The with( ) function applies an expression to a dataset.
It is similar to DATA= in SAS.
# with(data, expression)
# example applying a t-test to a data frame mydata
with(mydata, t.test(y ~ group))
By
The by( ) function applies a function to each level of a factor or factors.
It is similar to BY processing in SAS.
# by(data, factorlist, function)
# example obtain variable means separately for
# each level of byvar in data frame mydata
by(mydata, mydata$byvar, function(x) mean(x))
To Practice
This data manipulation tutorial in R includes exercises on using the by() function.
Generalized linear models are fit using the glm( ) function.
The form of the glm function is
glm(formula, family=familytype(link=linkfunction), data=)
Family
Default Link Function
binomial
(link = "logit")
gaussian
(link = "identity")
Gamma
(link = "inverse")
inverse.gaussian
(link = "1/mu^2")
poisson
(link = "log")
quasi
(link = "identity", variance = "constant")
quasibinomial
(link = "logit")
quasipoisson
(link = "log")
See help(glm) for other modeling options.
See help(family) for other allowable link functions for each family.
Three subtypes of generalized linear models will be covered here: logistic regression, poisson regression, and survival analysis.
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome from a set of continuous predictor variables.
It is frequently preferred over discriminant function analysis because of its less restrictive assumptions.
# Logistic Regression
# where F is a binary factor and
# x1-x3 are continuous predictors
fit <- glm(F~x1+x2+x3,data=mydata,family=binomial())
summary(fit) # display results
confint(fit) # 95% CI for the coefficients
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals
You can use anova(fit1,fit2, test="Chisq") to compare nested models.
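A runnable sketch using the built-in mtcars data, with am (transmission) as the binary outcome:

```r
# likelihood ratio (chi-square) test of nested logistic models
fit1 <- glm(am ~ wt, data=mtcars, family=binomial())
fit2 <- glm(am ~ wt + hp, data=mtcars, family=binomial())
anova(fit1, fit2, test="Chisq")
```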
Additionally, cdplot(F~x, data=mydata) will display the conditional density plot of the binary outcome F on the continuous x variable.
Poisson Regression
Poisson regression is useful when predicting an outcome variable representing counts from a set of continuous predictor variables.
# Poisson Regression
# where count is a count and
# x1-x3 are continuous predictors
fit <- glm(count ~ x1+x2+x3, data=mydata, family=poisson())
summary(fit) # display results
If you have overdispersion (see if residual deviance is much larger than degrees of freedom), you may want to use quasipoisson() instead of poisson().
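The check itself is a one-liner, sketched here with the built-in warpbreaks data (breaks is a count):

```r
# ratio of residual deviance to residual degrees of freedom;
# values much greater than 1 suggest overdispersion
fit <- glm(breaks ~ wool + tension, data=warpbreaks,
           family=poisson())
fit$deviance / fit$df.residual
```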
Survival Analysis
Survival analysis (also called event history analysis or reliability analysis) covers a set of techniques for modeling the time to an event.
Data may be right censored - the event may not have occurred by the end of the study, or we may have incomplete information on an observation but know that up to a certain time the event had not occurred (e.g., the participant dropped out of the study in week 10 but was alive at that time).
While generalized linear models are typically analyzed using the glm( ) function, survival analysis is typically carried out using functions from the survival package.
The survival package can handle one and two sample problems, parametric accelerated failure models, and the Cox proportional hazards model.
Data are typically entered in the format start time, stop time, and status (1=event occurred, 0=event did not occur).
Alternatively, the data may be in the format time to event and status (1=event occurred, 0=event did not occur).
A status=0 indicates that the observation is right censored.
Data are bundled into a Surv object via the Surv( ) function prior to further analyses.
survfit( ) is used to estimate a survival distribution for one or more groups.
survdiff( ) tests for differences in survival distributions between two or more groups.
coxph( ) models the hazard function on a set of predictor variables.
# Mayo Clinic Lung Cancer Data
library(survival)
# learn about the dataset
help(lung)
# create a Surv object
survobj <- with(lung, Surv(time,status))
# Plot survival distribution of the total sample
# Kaplan-Meier estimator
fit0 <- survfit(survobj~1, data=lung)
summary(fit0)
plot(fit0, xlab="Survival Time in Days",
ylab="% Surviving", yscale=100,
main="Survival Distribution (Overall)")
# Compare the survival distributions of men and women
fit1 <- survfit(survobj~sex,data=lung)
# plot the survival distributions by sex
plot(fit1, xlab="Survival Time in Days",
ylab="% Surviving", yscale=100, col=c("red","blue"),
main="Survival Distributions by Gender")
legend("topright", title="Gender", c("Male", "Female"),
fill=c("red", "blue"))
# test for difference between male and female
# survival curves (logrank test)
survdiff(survobj~sex, data=lung)
# predict male survival from age and medical scores
MaleMod <- coxph(survobj~age+ph.ecog+ph.karno+pat.karno,
data=lung, subset=sex==1)
# display results
MaleMod
# evaluate the proportional hazards assumption
cox.zph(MaleMod)
See Thomas Lumley's R news article on the survival package for more information.
Other good sources include Mai Zhou's Use R Software to do Survival Analysis and Simulation and M. J. Crawley's chapter on Survival Analysis.
To Practice
Try this interactive exercise on basic logistic regression with R using age as a predictor for credit risk.
The MASS package contains functions for performing linear and quadratic discriminant function analysis.
Unless prior probabilities are specified, each assumes proportional prior probabilities (i.e., prior probabilities are based on sample sizes).
In the examples below, lower case letters are numeric variables and upper case letters are categorical factors.
Linear Discriminant Function
# Linear Discriminant Analysis with Jackknifed Prediction
library(MASS)
fit <- lda(G ~ x1 + x2 + x3, data=mydata,
na.action="na.omit", CV=TRUE)
fit # show results
The code above performs an LDA, using listwise deletion of missing data.
CV=TRUE generates jackknifed (i.e., leave-one-out) predictions.
The code below assesses the accuracy of the prediction.
# Assess the accuracy of the prediction
# percent correct for each category of G
ct <- table(mydata$G, fit$class)
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))
lda() prints discriminant functions based on centered (not standardized) variables.
The "proportion of trace" that is printed is the proportion of between-class variance that is explained by successive discriminant functions.
No significance tests are produced.
Refer to the section on MANOVA for such tests.
Quadratic Discriminant Function
To obtain a quadratic discriminant function use qda( ) instead of lda( ).
Quadratic discriminant function does not assume homogeneity of variance-covariance matrices.
# Quadratic Discriminant Analysis with 3 groups applying
# resubstitution prediction and equal prior probabilities.
library(MASS)
fit <- qda(G ~ x1 + x2 + x3 + x4, data=na.omit(mydata),
prior=c(1,1,1)/3)
Note the alternate way of specifying listwise deletion of missing data.
Re-substitution (using the same data to derive the functions and evaluate their prediction accuracy) is the default method unless CV=TRUE is specified.
Re-substitution will be overly optimistic.
Visualizing the Results
You can plot each observation in the space of the first 2 linear discriminant functions using the following code.
Points are identified with the group ID.
# Scatter plot using the 1st two discriminant dimensions
plot(fit) # fit from lda
The following code displays histograms and density plots for the observations in each group on the first linear discriminant dimension.
There is one panel for each group and they all appear lined up on the same graph.
# Panels of histograms and overlayed density plots
# for 1st discriminant function
plot(fit, dimen=1, type="both") # fit from lda
The partimat( ) function in the klaR package can display the results of a linear or quadratic classification, two variables at a time.
# Exploratory Graph for LDA or QDA
library(klaR)
partimat(G~x1+x2+x3,data=mydata,method="lda")
You can also produce a scatterplot matrix with color coding by group.
# Scatterplot for 3 Group Problem
pairs(mydata[c("x1","x2","x3")], main="My Title ", pch=22,
bg=c("red", "yellow", "blue")[unclass(mydata$G)])
Test Assumptions
See (M)ANOVA Assumptions for methods of evaluating multivariate normality and homogeneity of covariance matrices.
To Practice
To practice improving predictions, try the Kaggle R Tutorial on Machine Learning
R has extensive facilities for analyzing time series data.
This section describes the creation of a time series, seasonal decomposition, modeling with exponential and ARIMA models, and forecasting with the forecast package.
Creating a time series
The ts() function will convert a numeric vector into an R time series object.
The format is ts(vector, start=, end=, frequency=) where start and end are the times of the first and last observation and frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).
# save a numeric vector containing 72 monthly observations
# from Jan 2009 to Dec 2014 as a time series object
myts <- ts(myvector, start=c(2009, 1), end=c(2014, 12), frequency=12)
# subset the time series (June 2014 to December 2014)
myts2 <- window(myts, start=c(2014, 6), end=c(2014, 12))
# plot series
plot(myts)
Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the stl() function.
Note that a series with multiplicative effects can often be transformed into a series with additive effects through a log transformation (i.e., newts <- log(myts)).
# Seasonal decomposition
fit <- stl(myts, s.window="period")
plot(fit)
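For example, the built-in AirPassengers series has multiplicative seasonality, so it is logged before decomposition:

```r
# log transform makes the seasonal component additive,
# which stl() can then decompose
fit <- stl(log(AirPassengers), s.window="periodic")
plot(fit)
```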
Both the HoltWinters() function in the base installation, and the ets() function in the forecast package, can be used to fit exponential models.
# simple exponential - models level
fit <- HoltWinters(myts, beta=FALSE, gamma=FALSE)
# double exponential - models level and trend
fit <- HoltWinters(myts, gamma=FALSE)
# triple exponential - models level, trend, and seasonal components
fit <- HoltWinters(myts)
# predict next three future values
library(forecast)
forecast(fit, 3)
plot(forecast(fit, 3))
ARIMA Models
The arima() function can be used to fit an autoregressive integrated moving averages model.
Other useful functions include:
lag(ts, k)                      lagged version of time series, shifted back k observations
diff(ts, differences=d)         difference the time series d times
ndiffs(ts)                      number of differences required to achieve stationarity (from the forecast package)
acf(ts)                         autocorrelation function
pacf(ts)                        partial autocorrelation function
adf.test(ts)                    Augmented Dickey-Fuller test; rejecting the null hypothesis suggests that a time series is stationary (from the tseries package)
Box.test(x, type="Ljung-Box")   Portmanteau test that observations in vector or time series x are independent
Note that the forecast package has somewhat nicer versions of acf() and pacf() called Acf() and Pacf() respectively.
# fit an ARIMA model of order P, D, Q
fit <- arima(myts, order=c(p, d, q))
# predict next 5 observations
library(forecast)
forecast(fit, 5)
plot(forecast(fit, 5))
Automated Forecasting
The forecast package provides functions for the automatic selection of exponential and ARIMA models.
The ets() function supports both additive and multiplicative models.
The auto.arima() function can handle both seasonal and nonseasonal ARIMA models.
Models are chosen to optimize one of several fit criteria.
library(forecast)
# Automated forecasting using an exponential model
fit <- ets(myts)
# Automated forecasting using an ARIMA model
fit <- auto.arima(myts)
Going Further
There are many good online resources for learning time series analysis with R.
These include A Little Book of R for Time Series by Avril Coghlan and DataCamp's Manipulating Time Series in R course by Jeffrey Ryan.
This section covers principal components and factor analysis.
The latter includes both exploratory and confirmatory methods.
Principal Components
The princomp( ) function produces an unrotated principal component analysis.
# Principal Components Analysis
# entering raw data and extracting PCs
# from the correlation matrix
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit,type="lines") # scree plot
fit$scores # the principal components
biplot(fit)
Use cor=FALSE to base the principal components on the covariance matrix.
Use the covmat= option to enter a correlation or covariance matrix directly.
If entering a covariance matrix, include the option n.obs=.
The principal( ) function in the psych package can be used to extract and rotate principal components.
# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results
mydata can be a raw data matrix or a covariance matrix.
Pairwise deletion of missing data is used.
rotate= can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".
Exploratory Factor Analysis
The factanal( ) function produces maximum likelihood factor analysis.
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation
fit <- factanal(mydata, 3, rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names
The rotation= options include "varimax", "promax", and "none".
Add the option scores="regression" or "Bartlett" to produce factor scores.
Use the covmat= option to enter a correlation or covariance matrix directly.
If entering a covariance matrix, include the option n.obs=.
The factor.pa( ) function in the psych package offers a number of factor analysis related functions, including principal axis factoring.
# Principal Axis Factor Analysis
library(psych)
fit <- factor.pa(mydata, nfactors=3, rotation="varimax")
fit # print results
mydata can be a raw data matrix or a covariance matrix.
Pairwise deletion of missing data is used.
Rotation can be "varimax" or "promax".
Determining the Number of Factors to Extract
A crucial decision in exploratory factor analysis is how many factors to extract.
The nFactors package offers a suite of functions to aid in this decision.
Details on this methodology can be found in a PowerPoint presentation by Raiche, Riopel, and Blais.
Of course, any factor solution must be interpretable to be useful.
# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),
rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
Going Further
The FactoMineR package offers a large number of additional functions for exploratory factor analysis.
This includes the use of both quantitative and qualitative variables, as well as the inclusion of supplementary variables and observations.
Here is an example of the types of graphs that you can create with this package.
# PCA Variable Factor Map
library(FactoMineR)
result <- PCA(mydata) # graphs generated automatically
The GPArotation package offers a wealth of rotation options beyond varimax and promax.
Structural Equation Modeling
Confirmatory Factor Analysis (CFA) is a subset of the much wider Structural Equation Modeling (SEM) methodology.
SEM is provided in R via the sem package.
Models are entered via RAM specification (similar to PROC CALIS in SAS).
While sem is a comprehensive package, my recommendation is that if you are doing significant SEM work, you spring for a copy of AMOS.
It can be much more user-friendly and creates more attractive and publication ready output.
Having said that, here is a CFA example using sem.
Assume that we have six observed variables (X1, X2, ..., X6).
We hypothesize that there are two unobserved latent factors (F1, F2) that underlie the observed variables as described in this diagram.
X1, X2, and X3 load on F1 (with loadings lam1, lam2, and lam3).
X4, X5, and X6 load on F2 (with loadings lam4, lam5, and lam6).
The double headed arrow indicates the covariance between the two latent factors (F1F2).
e1 through e6 represent the residual variances (variance in the observed variables not accounted for by the two latent factors).
We set the variances of F1 and F2 equal to one so that the parameters will have a scale.
This will result in F1F2 representing the correlation between the two latent factors.
For sem, we need the covariance matrix of the observed variables - thus the cov( ) statement in the code below.
The CFA model is specified using the specify.model( ) function.
The format is arrow specification, parameter name, start value.
Choosing a start value of NA tells the program to choose a start value rather than supplying one yourself.
Note that the variances of F1 and F2 are fixed at 1 (NA in the second column).
The blank line is required to end the RAM specification.
# Simple CFA Model
library(sem)
mydata.cov <- cov(mydata)
model.mydata <- specify.model()
F1 -> X1, lam1, NA
F1 -> X2, lam2, NA
F1 -> X3, lam3, NA
F2 -> X4, lam4, NA
F2 -> X5, lam5, NA
F2 -> X6, lam6, NA
X1 <-> X1, e1, NA
X2 <-> X2, e2, NA
X3 <-> X3, e3, NA
X4 <-> X4, e4, NA
X5 <-> X5, e5, NA
X6 <-> X6, e6, NA
F1 <-> F1, NA, 1
F2 <-> F2, NA, 1
F1 <-> F2, F1F2, NA
mydata.sem <- sem(model.mydata, mydata.cov, nrow(mydata))
# print results (fit indices, parameters, hypothesis tests)
summary(mydata.sem)
# print standardized coefficients (loadings)
std.coef(mydata.sem)
You can use the boot.sem( ) function to bootstrap the structural equation model.
See help(boot.sem) for details.
Additionally, the function mod.indices() will produce modification indices.
Using modification indices to improve model fit by respecifying the parameters moves you from a confirmatory to an exploratory analysis.
For more information on sem, see Structural Equation Modeling with the sem Package in R, by John Fox.
To Practice
To practice improving predictions, try the Kaggle R Tutorial on Machine Learning
Correspondence analysis provides a graphic method of exploring the relationship between variables in a contingency table.
There are many options for correspondence analysis in R.
I recommend the ca package by Nenadic and Greenacre because it supports supplementary points, subset analyses, and comprehensive graphics.
You can obtain the package here.
Although ca can perform multiple correspondence analysis (more than two categorical variables), only simple correspondence analysis is covered here.
See their article for details on multiple CA.
Simple Correspondence Analysis
In the following example, A and B are categorical factors.
# Correspondence Analysis
library(ca)
mytable <- with(mydata, table(A,B)) # create a 2 way table
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
fit <- ca(mytable)
print(fit) # basic results
summary(fit) # extended results
plot(fit) # symmetric map
plot(fit, mass = TRUE, contrib = "absolute", map =
"rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map
The first graph is the standard symmetric representation of a simple correspondence analysis with rows and columns represented by points.
Row points (column points) that are closer together have more similar column profiles (row profiles).
Keep in mind that you cannot interpret the distance between row and column points directly.
The second graph is asymmetric, with rows in the principal coordinates and columns in reconstructions of the standardized residuals.
Additionally, mass is represented by point size and columns are represented by arrows.
Point intensity (shading) corresponds to the absolute contributions for the rows.
This example is included to highlight some of the available options.
Going Further
Try this interactive course on exploratory data analysis.
R provides functions for both classical and nonmetric multidimensional scaling.
Assume that we have N objects measured on p numeric variables.
We want to represent the distances among the objects in a parsimonious (and visual) way (i.e., a lower k-dimensional space).
Classical MDS
You can perform a classical MDS using the cmdscale( ) function.
# Classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim
fit # view results
# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="Metric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
Nonmetric MDS
Nonmetric MDS is performed using the isoMDS( ) function in the MASS package.
# Nonmetric MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
library(MASS)
d <- dist(mydata) # euclidean distances between the rows
fit <- isoMDS(d, k=2) # k is the number of dim
fit # view results
# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="Nonmetric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
Individual Difference Scaling
3-way or individual difference scaling can be completed using the indscal() function in the SensoMineR package.
The smacof package offers a three-way analysis of individual differences based on stress minimization by means of majorization.
To Practice
This tutorial on ggplot2 includes exercises on Distance matrices and Multi-Dimensional Scaling (MDS).
R has an amazing variety of functions for cluster analysis.
In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based.
While there are no best solutions for the problem of determining the number of clusters to extract, several approaches are given below.
Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability.
# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
Partitioning
K-means clustering is the most popular partitioning method.
It requires the analyst to specify the number of clusters to extract.
A plot of the within groups sum of squares by number of clusters extracted can help determine the appropriate number of clusters.
The analyst looks for a bend in the plot similar to a scree test in factor analysis.
See Everitt & Hothorn (pg. 251).
# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
A robust version of K-means based on medoids can be invoked by using pam( ) instead of kmeans( ).
The function pamk( ) in the fpc package is a wrapper for pam that also prints the suggested number of clusters based on optimum average silhouette width.
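As a minimal sketch of a medoid-based solution (using the built-in mtcars data and an arbitrary choice of three variables as a stand-in for your own data):

```r
# Partitioning Around Medoids (PAM) - a robust alternative to k-means
library(cluster)                                 # ships with R
mydat <- scale(mtcars[, c("mpg", "hp", "wt")])   # standardize variables
fit <- pam(mydat, k = 3)                         # 3-cluster solution
fit$medoids                                      # the representative observations
plot(fit)                                        # cluster and silhouette plots
```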
Hierarchical Agglomerative
There are a wide range of hierarchical clustering approaches.
I have had good luck with Ward's method described below.
# Ward Hierarchical Clustering
d <- dist(mydata,
method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
click to view
The pvclust( ) function in the pvclust package provides p-values for hierarchical clustering based on multiscale bootstrap resampling.
Clusters that are highly supported by the data will have large p-values.
Interpretation details are provided by Suzuki.
Be aware that pvclust clusters columns, not rows.
Transpose your data before using.
# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward",
   method.dist="euclidean")
plot(fit) # dendrogram with p-values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)
Model Based
Model based approaches assume a variety of data models and apply maximum likelihood estimation and Bayes criteria to identify the most likely model and number of clusters.
Specifically, the Mclust( ) function in the mclust package selects the optimal model according to BIC for EM initialized by hierarchical clustering for parameterized Gaussian mixture models (phew!).
One chooses the model and number of clusters with the largest BIC.
See help(mclustModelNames) for details on the model chosen as best.
# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model
Plotting Cluster Solutions
It is always a good idea to look at the cluster results.
# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)
Validating cluster solutions
The function cluster.stats() in the fpc package provides a mechanism for comparing the similarity of two cluster solutions using a variety of validation criteria (Hubert's gamma coefficient, the Dunn index, and the corrected Rand index).
# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)
where d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer vectors containing classification results from two different clusterings of the same data.
To Practice
Try the clustering exercise in this introduction to machine learning course.
Recursive partitioning is a fundamental tool in data mining.
It helps us explore the structure of a set of data, while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome.
This section briefly describes CART modeling, conditional inference trees, and random forests.
CART Modeling via rpart
Classification and regression trees (as described by Breiman, Friedman, Olshen, and Stone) can be generated through the rpart package.
Detailed information on rpart is available in An Introduction to Recursive Partitioning Using the RPART Routines.
The general steps are provided below followed by two examples.
1. Grow the Tree
To grow a tree, use
rpart(formula, data=, method=,control=) where
formula
is in the format
outcome ~ predictor1+predictor2+predictor3+etc.
data=
specifies the data frame
method=
"class" for a classification tree
"anova" for a regression tree
control=
optional parameters for controlling tree growth.
For example, control=rpart.control(minsplit=30, cp=0.001) requires that the minimum number of observations in a node be 30 before attempting a split and that a split must decrease the overall lack of fit by a factor of 0.001 (cost complexity factor) before being attempted.
2. Examine the results
The following functions help us to examine the results.
printcp(fit)
display cp table
plotcp(fit)
plot cross-validation results
rsq.rpart(fit)
plot approximate R-squared and relative error for different splits (2 plots).
Labels are only appropriate for the "anova" method.
print(fit)
print results
summary(fit)
detailed results including surrogate splits
plot(fit)
plot decision tree
text(fit)
label the decision tree plot
post(fit, file=)
create postscript plot of decision tree
In trees created by rpart( ), move to the LEFT branch when the stated condition is true (see the graphs below).
3. Prune the tree
Prune back the tree to avoid overfitting the data.
Typically, you will want to select a tree size that minimizes the cross-validated error, the xerror column printed by printcp( ).
Prune the tree to the desired size using
prune(fit, cp=)
Specifically, use printcp( ) to examine the cross-validated error results, select the complexity parameter associated with minimum error, and place it into the prune( ) function.
Alternatively, you can use the code fragment
fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
to automatically select the complexity parameter associated with the smallest cross-validated error.
Thanks to HSAUR for this idea.
Classification Tree example
Let's use the data frame kyphosis to predict a type of deformation (kyphosis) after surgery, from age in months (Age), number of vertebrae involved (Number), and the highest vertebrae operated on (Start).
# Classification Tree with rpart
library(rpart)
# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
method="class", data=kyphosis)
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
# plot tree
plot(fit, uniform=TRUE,
main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fit, file = "c:/tree.ps",
title = "Classification Tree for Kyphosis")
# prune the tree
pfit<- prune(fit, cp= fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
# plot the pruned tree
plot(pfit, uniform=TRUE,
main="Pruned Classification Tree for Kyphosis")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree.ps",
title = "Pruned Classification Tree for Kyphosis")
Regression Tree example
In this example we will predict car mileage from price, country, reliability, and car type.
The data frame is cu.summary.
# Regression Tree Example
library(rpart)
# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
method="anova", data=cu.summary)
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # visualize cross-validation results
# plot tree
plot(fit, uniform=TRUE,
main="Regression Tree for Mileage ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fit, file = "c:/tree2.ps",
title = "Regression Tree for Mileage ")
# prune the tree
pfit<- prune(fit, cp=0.01160389) # from cptable
# plot the pruned tree
plot(pfit, uniform=TRUE,
main="Pruned Regression Tree for Mileage")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree2.ps",
title = "Pruned Regression Tree for Mileage")
It turns out that this produces the same tree as the original.
Conditional inference trees via party
The party package provides nonparametric regression trees for nominal, ordinal, numeric, censored, and multivariate responses.
party: A Laboratory for Recursive Partitioning provides details.
You can create a regression or classification tree via the function
ctree(formula, data=)
The type of tree created will depend on the outcome variable (nominal factor, ordered factor, numeric, etc.).
Tree growth is based on statistical stopping rules, so pruning should not be required.
The previous two examples are re-analyzed below.
# Conditional Inference Tree for Kyphosis
library(party)
fit <- ctree(Kyphosis ~ Age + Number + Start,
data=kyphosis)
plot(fit, main="Conditional Inference Tree for Kyphosis")
# Conditional Inference Tree for Mileage
library(party)
fit2 <- ctree(Mileage~Price + Country + Reliability + Type,
data=na.omit(cu.summary))
plot(fit2, main="Conditional Inference Tree for Mileage")
Random Forests
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).
Breiman and Cutler's random forest approach is implemented via the randomForest package.
Here is an example.
# Random Forest prediction of Kyphosis data
library(randomForest)
fit <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit) # view results
importance(fit) # importance of each predictor
For more details see the comprehensive Random Forest website.
Going Further
This section has only touched on the options available.
To learn more, see the CRAN Task View on Machine & Statistical Learning.
The boot package provides extensive facilities for bootstrapping and related resampling methods.
You can bootstrap a single statistic (e.g., a median), or a vector (e.g., regression weights).
This section will get you started with basic nonparametric bootstrapping.
The main bootstrapping function is boot( ) and has the following format:
bootobject <- boot(data= , statistic= , R=, ...) where
parameter
description
data
A vector, matrix, or data frame
statistic
A function that produces the k statistics to be bootstrapped (k=1 if bootstrapping a single statistic).
The function should include an indices parameter that the boot() function can use to select cases for each replication (see examples below).
R
Number of bootstrap replicates
...
Additional parameters to be passed to the function that produces the statistic of interest
boot( ) calls the statistic function R times.
Each time, it generates a set of random indices, with replacement, from the integers 1:nrow(data).
These indices are used within the statistic function to select a sample.
The statistics are calculated on the sample and the results are accumulated in the bootobject.
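The resampling step can be sketched in base R; this is an illustration of the mechanism only, not how you would call boot( ) in practice:

```r
# one bootstrap replicate by hand: draw row indices with replacement,
# then apply the statistic to the resampled rows
set.seed(42)
n <- nrow(mtcars)
idx <- sample(1:n, size = n, replace = TRUE)
mean(mtcars$mpg[idx])  # a single bootstrap replicate of mean mpg
```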
The bootobject structure includes
element
description
t0
The observed values of k statistics applied to the original data.
t
An R x k matrix where each row is a bootstrap replicate of the k statistics.
You can access these as bootobject$t0 and bootobject$t.
Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to examine the results.
If the results look reasonable, you can use the boot.ci( ) function to obtain confidence intervals for the statistic(s).
The format is
boot.ci(bootobject, conf=, type= ) where
parameter
description
bootobject
The object returned by the boot function
conf
The desired confidence interval (default: conf=0.95)
type
The type of confidence interval returned.
Possible values are "norm", "basic", "stud", "perc", "bca" and "all" (default: type="all")
Bootstrapping a Single Statistic (k=1)
The following example generates the bootstrapped 95% confidence interval for R-squared in the linear regression of miles per gallon (mpg) on car weight (wt) and displacement (disp).
The data source is mtcars.
The bootstrapped confidence interval is based on 1000 replications.
# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices)
{
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(summary(fit)$r.squared)
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=rsq,
R=1000, formula=mpg~wt+disp)
# view results
results
plot(results)
# get 95% confidence interval
boot.ci(results, type="bca")
Bootstrapping several Statistics (k>1)
In the example above, the function rsq returned a number and boot.ci returned a single confidence interval.
The statistics function you provide can also return a vector.
In the next example we get the 95% CI for the three model regression coefficients (intercept, car weight, displacement).
In this case we add an index parameter to plot( ) and boot.ci( ) to indicate which column in bootobject$t is to be analyzed.
# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices)
{
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=bs,
R=1000, formula=mpg~wt+disp)
# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp
# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept
boot.ci(results, type="bca", index=2) # wt
boot.ci(results, type="bca", index=3) # disp
Going Further
The boot( ) function can generate both nonparametric and parametric resampling.
For the nonparametric bootstrap, resampling methods include ordinary, balanced, antithetic, and permutation; stratified resampling is also supported.
Importance resampling weights can also be specified.
The boot.ci( ) function takes a bootobject and generates 5 different types of two-sided nonparametric confidence intervals.
These include the first order normal approximation, the basic bootstrap interval, the studentized bootstrap interval, the bootstrap percentile interval, and the adjusted bootstrap percentile (BCa) interval.
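As a small sketch of requesting specific interval types, here the median of the built-in rivers data is bootstrapped; the choice of statistic and data is illustrative only:

```r
library(boot)                              # ships with R
med <- function(d, i) median(d[i])         # statistic with an indices argument
b <- boot(data = rivers, statistic = med, R = 999)
boot.ci(b, conf = 0.95, type = c("norm", "basic", "perc"))
```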
Look at help(boot), help(boot.ci), and help(plot.boot) for more details.
Try this interactive exercise with the boot package from DataCamp's Intro to Computational Finance with R course.
Most of the methods on this website actually describe the programming of matrices.
Matrix operations are built deeply into the R language.
This section will simply cover operators and functions specifically suited to linear algebra.
Before proceeding you may want to review the sections on Data Types and Operators.
Matrix facilities
In the following examples, A and B are matrices and x and b are vectors.
Operator or Function
Description
A * B
Element-wise multiplication
A %*% B
Matrix multiplication
A %o% B
Outer product: AB'
crossprod(A,B)
crossprod(A)
A'B and A'A respectively.
t(A)
Transpose
diag(x)
Creates diagonal matrix with elements of x in the principal diagonal
diag(A)
Returns a vector containing the elements of the principal diagonal
diag(k)
If k is a scalar, this creates a k x k identity matrix.
Go figure.
solve(A, b)
Returns vector x in the equation b = Ax (i.e., x = A⁻¹b)
solve(A)
Inverse of A where A is a square matrix.
ginv(A)
Moore-Penrose Generalized Inverse of A.
ginv(A) requires loading the MASS package.
y<-eigen(A)
y$val are the eigenvalues of A
y$vec are the eigenvectors of A
y<-svd(A)
Singular value decomposition of A.
y$d = vector containing the singular values of A
y$u = matrix whose columns contain the left singular vectors of A
y$v = matrix whose columns contain the right singular vectors of A
R <- chol(A)
Cholesky factorization of A.
Returns the upper triangular factor, such that R'R = A.
y <- qr(A)
QR decomposition of A.
y$qr has an upper triangle that contains the decomposition and a lower triangle that contains information on the Q decomposition.
y$rank is the rank of A.
y$qraux is a vector containing additional information on Q.
y$pivot contains information on the pivoting strategy used.
cbind(A,B,...)
Combine matrices (vectors) horizontally.
Returns a matrix.
rbind(A,B,...)
Combine matrices (vectors) vertically.
Returns a matrix.
rowMeans(A)
Returns vector of row means.
rowSums(A)
Returns vector of row sums.
colMeans(A)
Returns vector of column means.
colSums(A)
Returns vector of column sums.
Matlab Emulation
The matlab package contains wrapper functions and variables used to replicate MATLAB function calls as closely as possible.
This can ease the porting of MATLAB applications and code to R.
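A brief sketch, assuming the matlab package is installed (function names follow their MATLAB counterparts):

```r
library(matlab)
ones(2, 3)        # 2 x 3 matrix of ones
zeros(3)          # 3 x 3 matrix of zeros
eye(3)            # 3 x 3 identity matrix
linspace(0, 1, 5) # 5 evenly spaced values from 0 to 1
```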
Going Further
The Matrix package contains functions that extend R to support highly dense or sparse matrices.
It provides efficient access to BLAS (Basic Linear Algebra Subroutines), LAPACK (dense matrix), TAUCS (sparse matrix), and UMFPACK (sparse matrix) routines.
To Practice
Try some of the exercises in matrix algebra in this course on intro to statistics with R.
In R, graphs are typically created interactively.
# Creating a Graph
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
The plot( ) function opens a graph window and plots weight vs. miles per gallon.
The next line of code adds a regression line to this graph.
The final line adds a title.
Saving Graphs
You can save the graph in a variety of formats from the menu
File -> Save As.
You can also save the graph via code using one of the following functions.
Creating a new graph by issuing a high level plotting command (plot, hist, boxplot, etc.) will typically overwrite a previous graph.
To avoid this, open a new graph window before creating a new graph.
To open a new graph window use one of the functions below.
Function
Platform
windows()
Windows
X11()
Unix
quartz()
Mac
You can have multiple graph windows open at one time.
See help(dev.cur) for more details.
Alternatively, after opening the first graph window, choose History -> Recording from the graph window menu.
Then you can use Previous and Next to step through the graphs you have created.
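As a portable sketch, base R's dev.new( ) (not listed in the table above) opens whichever device is appropriate for your platform:

```r
# keep the first graph by opening a new device before plotting again
plot(mtcars$wt, mtcars$mpg)  # first graph window
dev.new()                    # opens a new, platform-appropriate graphics device
hist(mtcars$mpg)             # second graph; the first remains open
```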
You can create histograms with the function hist(x) where x is a numeric vector of values to be plotted.
The option freq=FALSE plots probability densities instead of frequencies.
The option breaks= controls the number of bins.
# Simple Histogram
hist(mtcars$mpg)
# Colored Histogram with Different Number of Bins
hist(mtcars$mpg, breaks=12, col="red")
# Add a Normal Curve (Thanks to Peter Dalgaard)
x <- mtcars$mpg
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
Histograms can be a poor method for determining the shape of a distribution because it is so strongly affected by the number of bins used.
To practice making a density plot with the hist() function, try this exercise.
Kernel Density Plots
Kernel density plots are usually a much more effective way to view the distribution of a variable.
Create the plot using plot(density(x)) where x is a numeric vector.
# Kernel Density Plot
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results
# Filled Density Plot
d <- density(mtcars$mpg)
plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue")
Comparing Groups via Kernel Density
The sm.density.compare( ) function in the sm package allows you to superimpose the kernel density plots of two or more groups.
The format is sm.density.compare(x, factor) where x is a numeric vector and factor is the grouping variable.
# Compare MPG distributions for cars with
# 4, 6, or 8 cylinders
library(sm)
attach(mtcars)
# create value labels
cyl.f <- factor(cyl, levels= c(4,6,8),
labels = c("4 cylinder", "6 cylinder", "8 cylinder"))
# plot densities
sm.density.compare(mpg, cyl, xlab="Miles Per Gallon")
title(main="MPG Distribution by Car Cylinders")
# add legend via mouse click
colfill<-c(2:(2+length(levels(cyl.f))))
legend(locator(1), levels(cyl.f), fill=colfill)
Create dotplots with the dotchart(x, labels=) function, where x is a numeric vector and labels is a vector of labels for each point.
You can add a groups= option to designate a factor specifying how the elements of x are grouped.
If so, the option gcolor= controls the color of the groups label.
cex controls the size of the labels.
# Simple Dotplot
dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
main="Gas Mileage for Car Models",
xlab="Miles Per Gallon")
# Dotplot: Grouped, Sorted, and Colored
# Sort by mpg, group and color by cylinder
x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
xlab="Miles Per Gallon", gcolor="black", color=x$color)
Going Further
Advanced dotplots can be created with the dotchart2( ) function in the Hmisc package and with the panel.dotplot( ) function in the lattice package.
Create barplots with the barplot(height) function, where height is a vector or matrix.
If height is a vector, the values determine the heights of the bars in the plot.
If height is a matrix and the option beside=FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked “sub-bars”.
If height is a matrix and beside=TRUE, then the values in each column are juxtaposed rather than stacked.
Include option names.arg=(character vector) to label the bars.
Include the option horiz=TRUE to create a horizontal barplot.
Simple Bar Plot
# Simple Bar Plot
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution",
xlab="Number of Gears")
# Simple Horizontal Bar Plot with Added Labels
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"))
Stacked Bar Plot
# Stacked Bar Plot with Colors and Legend
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts))
Grouped Bar Plot
# Grouped Bar Plot
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE)
Notes
Bar plots need not be based on counts or frequencies.
You can create bar plots that represent means, medians, standard deviations, etc.
Use the aggregate( ) function and pass the results to the barplot( ) function.
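For instance, a minimal sketch of a bar plot of group means using the built-in mtcars data (grouping by gear is an arbitrary illustration):

```r
# Bar plot of mean mpg by number of gears:
# compute the means with aggregate(), then pass them to barplot()
means <- aggregate(mpg ~ gear, data = mtcars, FUN = mean)
barplot(means$mpg, names.arg = means$gear,
        main = "Mean MPG by Number of Gears",
        xlab = "Number of Gears", ylab = "Mean Miles Per Gallon")
```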
By default, the categorical axis line is suppressed.
Include the option axis.lty=1 to draw it.
With many bars, bar labels may start to overlap.
You can decrease the font size using the cex.names = option.
Values smaller than one will shrink the size of the label.
Additionally, you can use graphical parameters such as the following to help text spacing:
# Fitting Labels
par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"),
cex.names=0.8)
Overview
Line charts are created with the function lines(x, y, type=) where x and y are numeric vectors of (x,y) points to connect.
type= can take the following values:
type
description
p
points
l
lines
o
overplotted points and lines
b, c
points (empty if "c") joined by lines
s, S
stair steps
h
histogram-like vertical lines
n
does not produce any points or lines
The lines( ) function adds information to a graph.
It cannot produce a graph on its own.
Usually it follows a plot(x, y) command that produces a graph.
By default, plot( ) plots the (x,y) points.
Use the type="n" option in the plot( ) command, to create the graph with axes, titles, etc., but without plotting the points.
Example
In the following code each of the type= options is applied to the same dataset.
The plot( ) command sets up the graph, but does not plot the points.
x <- c(1:5); y <- x # create some data
par(pch=22, col="red") # plotting symbol and color
par(mfrow=c(2,4)) # all plots on one page
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
heading = paste("type=",opts[i])
plot(x, y, type="n", main=heading)
lines(x, y, type=opts[i])
}
Next, we demonstrate each of the type= options when plot( ) sets up the graph and does plot the points.
x <- c(1:5); y <- x # create some data
par(pch=22, col="blue") # plotting symbol and color
par(mfrow=c(2,4)) # all plots on one page
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
heading = paste("type=",opts[i])
plot(x, y, main=heading)
lines(x, y, type=opts[i])
}
As you can see, the type="c" option only looks different from the type="b" option if the plotting of points is suppressed in the plot( ) command.
To demonstrate the creation of a more complex line chart, let's plot the growth of 5 orange trees over time.
Each tree will have its own distinctive line.
The data come from the dataset Orange.
# Create Line Chart
# convert factor to numeric for convenience
Orange$Tree <- as.numeric(Orange$Tree)
ntrees <- max(Orange$Tree)
# get the range for the x and y axis
xrange <- range(Orange$age)
yrange <- range(Orange$circumference)
# set up the plot
plot(xrange, yrange, type="n", xlab="Age (days)",
ylab="Circumference (mm)" )
colors <- rainbow(ntrees)
linetype <- c(1:ntrees)
plotchar <- seq(18,18+ntrees,1)
# add lines
for (i in 1:ntrees) {
tree <- subset(Orange, Tree==i)
lines(tree$age, tree$circumference, type="b", lwd=1.5,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
# add a title and subtitle
title("Tree Growth", "example of line plot")
# add a legend
legend(xrange[1], yrange[2], 1:ntrees, cex=0.8, col=colors,
pch=plotchar, lty=linetype, title="Tree")
Pie charts are not recommended in the R documentation, and their features are somewhat limited.
The authors recommend bar or dot plots over pie charts because people are able to judge length more accurately than volume.
Pie charts are created with the function pie(x, labels=) where x is a non-negative numeric vector indicating the area of each slice and labels= notes a character vector of names for the slices.
Simple Pie Chart
# Simple Pie Chart
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
Pie Chart with Annotated Percentages
# Pie Chart with Percentages
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls, "%", sep="") # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Countries")
3D Pie Chart
The pie3D( ) function in the plotrix package provides 3D exploded pie charts.
# 3D Exploded Pie Chart
library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie3D(slices,labels=lbls,explode=0.1,
main="Pie Chart of Countries")
Creating Annotated Pies from a data frame
# Pie Chart from data frame with Appended Sample Sizes
mytable <- table(iris$Species)
lbls <- paste(names(mytable), "\n", mytable, sep="")
pie(mytable, labels = lbls,
main="Pie Chart of Species\n (with sample sizes)")
Boxplots can be created for individual variables or for variables by group.
The format is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data.
An example of a formula is y~group where a separate boxplot for numeric variable y is generated for each value of group.
Add varwidth=TRUE to make boxplot widths proportional to the square root of the sample sizes.
Add horizontal=TRUE to reverse the axis orientation.
# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")
# Notched Boxplot of Tooth Growth Against 2 Crossed Factors
# boxes colored for ease of interpretation
boxplot(len~supp*dose, data=ToothGrowth, notch=TRUE,
col=(c("gold","darkgreen")),
main="Tooth Growth", xlab="Supplement and Dose")
In the notched boxplot, if two boxes' notches do not overlap this is ‘strong evidence’ their medians differ (Chambers et al., 1983, p. 62).
Colors recycle.
In the example above, if I had listed 6 colors, each box would have its own color.
Earl F. Glynn has created an easy-to-use list of colors in PDF format.
Other Options
The boxplot.matrix( ) function in the sfsmisc package draws a boxplot for each column (row) in a matrix.
The boxplot.n( ) function in the gplots package annotates each boxplot with its sample size.
The bplot( ) function in the Rlab package offers many more options controlling the positioning and labeling of boxes in the output.
Violin Plots
A violin plot is a combination of a boxplot and a kernel density plot.
Violin plots can be created using the vioplot( ) function from the vioplot package.
# Violin Plots
library(vioplot)
x1 <- mtcars$mpg[mtcars$cyl==4]
x2 <- mtcars$mpg[mtcars$cyl==6]
x3 <- mtcars$mpg[mtcars$cyl==8]
vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"),
col="gold")
title("Violin Plots of Miles Per Gallon")
Bagplot - A 2D Boxplot Extension
The bagplot(x, y) function in the aplpack package provides a bivariate version of the univariate boxplot.
The bag contains 50% of all points.
The bivariate median is approximated.
The fence separates points inside the fence from outliers.
Outliers are displayed.
# Example of a Bagplot
library(aplpack)
attach(mtcars)
bagplot(wt,mpg, xlab="Car Weight", ylab="Miles Per Gallon",
main="Bagplot Example")
There are many ways to create a scatterplot in R.
The basic function is plot(x, y), where x and y are numeric vectors denoting the (x,y) points to plot.
# Simple Scatterplot
attach(mtcars)
plot(wt, mpg, main="Scatterplot Example",
xlab="Car Weight", ylab="Miles Per Gallon", pch=19)
# Add fit lines
abline(lm(mpg~wt), col="red") # regression line (y~x)
lines(lowess(wt,mpg), col="blue") # lowess line (x,y)
The scatterplot( ) function in the car package offers many enhanced features, including fit lines, marginal box plots, conditioning on a factor, and interactive point identification.
Each of these features is optional.
# Enhanced Scatterplot of MPG vs. Weight
# by Number of Car Cylinders
library(car)
scatterplot(mpg ~ wt | cyl, data=mtcars,
xlab="Weight of Car", ylab="Miles Per Gallon",
main="Enhanced Scatter Plot",
labels=row.names(mtcars))
Scatterplot Matrices
There are at least 4 useful functions for creating scatterplot matrices.
Analysts must love scatterplot matrices!
# Basic Scatterplot Matrix
pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
The lattice package provides options to condition the scatterplot matrix on a factor.
# Scatterplot Matrices from the lattice Package
library(lattice)
splom(mtcars[c(1,3,5,6)], groups=cyl, data=mtcars,
panel=panel.superpose,
key=list(title="Three Cylinder Options",
columns=3,
points=list(pch=super.sym$pch[1:3],
col=super.sym$col[1:3]),
text=list(c("4 Cylinder","6 Cylinder","8 Cylinder"))))
The car package can condition the scatterplot matrix on a factor, and optionally include lowess and linear best fit lines, and boxplot, densities, or histograms in the principal diagonal, as well as rug plots in the margins of the cells.
# Scatterplot Matrices from the car Package
library(car)
scatterplotMatrix(~mpg+disp+drat+wt|cyl, data=mtcars,
main="Three Cylinder Options")
The gclus package provides options to rearrange the variables so that those with higher correlations are closer to the principal diagonal.
It can also color code the cells to reflect the size of the correlations.
# Scatterplot Matrices from the gclus Package
library(gclus)
dta <- mtcars[c(1,3,5,6)] # get data
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Variables Ordered and Colored by Correlation"
)
High Density Scatterplots
When there are many data points and significant overlap, scatterplots become less useful.
There are several approaches that can be used when this occurs.
The hexbin(x, y) function in the hexbin package provides bivariate binning into hexagonal cells (it looks better than it sounds).
# High Density Scatterplot with Binning
library(hexbin)
x <- rnorm(1000)
y <- rnorm(1000)
bin<-hexbin(x, y, xbins=50)
plot(bin, main="Hexagonal Binning")
Another option for a scatterplot with significant point overlap is the sunflowerplot.
See help(sunflowerplot) for details.
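A minimal sketch (the rounded random data is just a way to force overlapping points):

```r
# Sunflower plot: repeated (x,y) points are drawn as "sunflowers"
# with one petal per duplicate observation
set.seed(42)             # arbitrary seed for reproducibility
x <- round(rnorm(150))   # rounding forces many identical points
y <- round(rnorm(150))
sunflowerplot(x, y, main = "Sunflower Plot Example")
```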
Finally, you can save the scatterplot in PDF format and use color transparency to allow points that overlap to show through (this idea comes from B. S. Everitt in HSAUR).
# High Density Scatterplot with Color Transparency
pdf("c:/scatterplot.pdf")
x <- rnorm(1000)
y <- rnorm(1000)
plot(x,y, main="PDF Scatterplot Example", col=rgb(0,100,0,50,maxColorValue=255), pch=16)
dev.off()
Note: You can use the col2rgb( ) function to get the rgb values for R colors.
For example, col2rgb("darkgreen") yields r=0, g=100, b=0.
Then add the alpha transparency level as the 4th number in the color vector.
A value of zero means fully transparent.
See help(rgb) for more information.
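Putting those pieces together, a small sketch that derives a semi-transparent color from a named one (the alpha value of 127, roughly half opacity, is an arbitrary choice):

```r
# Build a semi-transparent version of "darkgreen"
vals <- col2rgb("darkgreen")        # r=0, g=100, b=0
mycol <- rgb(vals[1], vals[2], vals[3],
             alpha = 127, maxColorValue = 255)  # alpha as the 4th value
plot(rnorm(500), rnorm(500), pch = 16, col = mycol,
     main = "Transparent darkgreen Points")
```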
3D Scatterplots
You can create a 3D scatterplot with the scatterplot3d package.
Use the function scatterplot3d(x,y,z).
# 3D Scatterplot
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, main="3D Scatterplot")
# 3D Scatterplot with Coloring and Vertical Drop Lines
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
# 3D Scatterplot with Coloring and Vertical Lines
# and Regression Plane
library(scatterplot3d)
attach(mtcars)
s3d <-scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
fit <- lm(mpg ~ wt+disp)
s3d$plane3d(fit)
Spinning 3D Scatterplots
You can also create an interactive 3D scatterplot using the plot3d(x, y, z) function in the rgl package.
It creates a spinning 3D scatterplot that can be rotated with the mouse.
The first three arguments are the x, y, and z numeric vectors representing points.
col= and size= control the color and size of the points respectively.
# Spinning 3d Scatterplot
library(rgl)
attach(mtcars)
plot3d(wt, disp, mpg, col="red", size=3)
You can perform a similar function with the scatter3d(x, y, z) function in the Rcmdr package.
# Another Spinning 3d Scatterplot
library(Rcmdr)
attach(mtcars)
scatter3d(wt, disp, mpg)
You can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.
One way is to specify these options through the par( ) function.
If you set parameter values here, the changes will be in effect for the rest of the session or until you change them again.
The format is par(optionname=value, optionname=value, ...)
# Set a graphical parameter using par()
par() # view current settings
opar <- par() # make a copy of current settings
par(col.lab="red") # red x and y labels
hist(mtcars$mpg) # create a plot with these new settings
par(opar) # restore original settings
A second way to specify graphical parameters is by providing the optionname=value pairs directly to a high level plotting function.
In this case, the options are only in effect for that specific graph.
# Set a graphical parameter within the plotting function
hist(mtcars$mpg, col.lab="red")
See the help for a specific high level plotting function (e.g. plot, hist, boxplot) to determine which graphical parameters can be set this way.
The remainder of this section describes some of the more important graphical parameters that you can set.
Text and Symbol Size
The following options can be used to control text and symbol size in graphs.
option
description
cex
number indicating the amount by which plotting text and symbols should be scaled relative to the default.
1=default, 1.5 is 50% larger, 0.5 is 50% smaller, etc.
cex.axis
magnification of axis annotation relative to cex
cex.lab
magnification of x and y labels relative to cex
cex.main
magnification of titles relative to cex
cex.sub
magnification of subtitles relative to cex
Plotting Symbols
Use the pch= option to specify symbols to use when plotting points.
For symbols 21 through 25, specify border color (col=) and fill color (bg=).
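For example, a short sketch showing the filled symbols with separate border and fill colors (the colors and layout are illustrative choices):

```r
# Symbols 21-25 take col= for the border and bg= for the fill
plot(1:5, rep(1, 5), pch = 21:25, cex = 3,
     col = "blue", bg = "lightblue",
     xlab = "pch 21 to 25", ylab = "", yaxt = "n",
     main = "Filled Plotting Symbols")
```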
Lines
You can change lines using the following options.
This is particularly useful for reference lines, axes, and fit lines.
option
description
lty
line type; see the chart below.
lwd
line width relative to the default (default=1).
2 is twice as wide.
Colors
Options that specify colors include the following.
option
description
col
Default plotting color.
Some functions (e.g. lines) accept a vector of values that are recycled.
col.axis
color for axis annotation
col.lab
color for x and y labels
col.main
color for titles
col.sub
color for subtitles
fg
plot foreground color (axes, boxes - also sets col= to same)
bg
plot background color
You can specify colors in R by index, name, hexadecimal, or RGB.
For example, col="white" and col="#FFFFFF" are equivalent, while col=1 refers to the first color in the current palette (black by default).
The following chart was produced with code developed by Earl F. Glynn.
See his Color Chart for all the details you would ever need about using colors in R.
You can also create a vector of n contiguous colors using the functions rainbow(n), heat.colors(n), terrain.colors(n), topo.colors(n), and cm.colors(n).
colors() returns all available color names.
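A quick sketch comparing the five palette functions, five colors each (the strip layout is an arbitrary way to display them):

```r
# Draw a strip of 5 colors from each built-in palette function
n <- 5
pals <- list(rainbow        = rainbow(n),
             heat.colors    = heat.colors(n),
             terrain.colors = terrain.colors(n),
             topo.colors    = topo.colors(n),
             cm.colors      = cm.colors(n))
par(mfrow = c(5, 1), mar = c(1, 8, 1, 1))
for (nm in names(pals)) {
  barplot(rep(1, n), col = pals[[nm]], border = NA, axes = FALSE)
  mtext(nm, side = 2, las = 2, cex = 0.8)  # palette name in left margin
}
```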
Fonts
You can easily set font size and style, but font family is a bit more complicated.
option
description
font
Integer specifying font to use for text.
1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol
font.axis
font for axis annotation
font.lab
font for x and y labels
font.main
font for titles
font.sub
font for subtitles
ps
font point size (roughly 1/72 inch)
text size=ps*cex
family
font family for drawing text.
Standard values are "serif", "sans", "mono", "symbol".
Mapping is device dependent.
On Windows, mono is mapped to "TT Courier New", serif to "TT Times New Roman", sans to "TT Arial", and symbol to "TT Symbol" (TT=TrueType).
You can add your own mappings.
# Type family examples - creating new mappings
plot(1:10,1:10,type="n")
windowsFonts(
A=windowsFont("Arial Black"),
B=windowsFont("Bookman Old Style"),
C=windowsFont("Comic Sans MS"),
D=windowsFont("Symbol")
)
text(3,3,"Hello World Default")
text(4,4,family="A","Hello World from Arial Black")
text(5,5,family="B","Hello World from Bookman Old Style")
text(6,6,family="C","Hello World from Comic Sans MS")
text(7,7,family="D", "Hello World from Symbol")
Margins and Graph Size
You can control the margin size using the following parameters.
For complete information on margins, see Earl F. Glynn's margin tutorial.
Going Further
See help(par) for more information on graphical parameters.
The customization of plotting axes and text annotations is covered in the next section.
Many high level plotting functions (plot, hist, boxplot, etc.) allow you to include axis and text options (as well as other graphical parameters).
For example
# Specify axis options within plot()
plot(x, y, main="title", sub="subtitle",
xlab="X-axis label", ylab="Y-axis label",
xlim=c(xmin, xmax), ylim=c(ymin, ymax))
For finer control or for modularization, you can use the functions described below.
Titles
Use the title( ) function to add labels to a plot.
title(main="main title", sub="sub-title",
xlab="x-axis label", ylab="y-axis label")
Many other graphical parameters (such as text size, font, rotation, and color) can also be specified in the title( ) function.
# Add a red title and a blue subtitle. Make x and y
# labels 25% smaller than the default and green.
title(main="My Title", col.main="red",
sub="My Sub-title", col.sub="blue",
xlab="My X label", ylab="My Y label",
col.lab="green", cex.lab=0.75)
Text Annotations
Text can be added to graphs using the text( ) and mtext( ) functions.
text( ) places text within the graph while mtext( ) places text in one of the four margins.
text(location, "text to place", pos, ...)
mtext("text to place", side, line=n, ...)
Common options are described below.
option
description
location
location can be an x,y coordinate.
Alternatively, the text can be placed interactively via mouse by specifying location as locator(1).
pos
position relative to location.
1=below, 2=left, 3=above, 4=right.
If you specify pos, you can specify offset= in percent of character width.
side
which margin to place text.
1=bottom, 2=left, 3=top, 4=right.
you can specify line= to indicate the line in the margin starting with 0 and moving out.
you can also specify adj=0 for left/bottom alignment or adj=1 for top/right alignment.
Other common options are cex, col, and font (for size, color, and font style respectively).
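A brief sketch of both functions (the coordinates, margin sides, and line numbers are illustrative choices):

```r
# text() draws inside the plot region; mtext() draws in a margin
plot(1:10, 1:10, main = "Annotation Example")
text(4, 8, "inside the plot region")       # at (x=4, y=8)
text(6, 2, "right of the point", pos = 4)  # pos=4 places text to the right
mtext("in the bottom margin", side = 1, line = 4)
mtext("in the right margin", side = 4, line = 0)
```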
Labeling points
You can use the text( ) function (see above) for labeling point as well as for adding other text annotations.
Specify location as a set of x, y coordinates and specify the text to place as a vector of labels.
The x, y, and label vectors should all be the same length.
# Example of labeling points
attach(mtcars)
plot(wt, mpg, main="Mileage vs. Car Weight",
xlab="Weight", ylab="Mileage", pch=18, col="blue")
text(wt, mpg, row.names(mtcars), cex=0.6, pos=4, col="red")
Math Annotations
You can add mathematical formulas to a graph using TeX-like rules.
See help(plotmath) for details and examples.
Axes
You can create custom axes using the axis( ) function.
axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...)
where
option
description
side
an integer indicating the side of the graph to draw the axis (1=bottom, 2=left, 3=top, 4=right)
at
a numeric vector indicating where tic marks should be drawn
labels
a character vector of labels to be placed at the tickmarks
(if NULL, the at values will be used)
pos
the coordinate at which the axis line is to be drawn.
(i.e., the value on the other axis where it crosses)
lty
line type
col
the line and tick mark color
las
labels are parallel (=0) or perpendicular (=2) to axis
tck
length of tick mark as fraction of plotting region (negative number is outside graph, positive number is inside, 0 suppresses ticks, 1 creates gridlines) default is -0.01
If you are going to create a custom axis, you should suppress the axis automatically generated by your high level plotting function.
The option axes=FALSE suppresses both x and y axes.
xaxt="n" and yaxt="n" suppress the x and y axis respectively.
Here is a (somewhat overblown) example.
# A Silly Axis Example
# specify the data
x <- c(1:10); y <- x; z <- 10/x
# create extra margin room on the right for an axis
par(mar=c(5, 4, 4, 8) + 0.1)
# plot x vs. y
plot(x, y,type="b", pch=21, col="red",
yaxt="n", lty=3, xlab="", ylab="")
# add x vs. 1/x
lines(x, z, type="b", pch=22, col="blue", lty=2)
# draw an axis on the left
axis(2, at=x,labels=x, col.axis="red", las=2)
# draw an axis on the right, with smaller text and ticks
axis(4, at=z,labels=round(z,digits=2),
col.axis="blue", las=2, cex.axis=0.7, tck=-.01)
# add a title for the right axis
mtext("y=1/x", side=4, line=3, cex.lab=1,las=2, col="blue")
# add a main title and bottom and left axis labels
title("An Example of Creative Axes", xlab="X values",
ylab="Y=X")
Minor Tick Marks
The minor.tick( ) function in the Hmisc package adds minor tick marks.
# Add minor tick marks
library(Hmisc)
minor.tick(nx=n, ny=n, tick.ratio=n)
nx is the number of minor tick marks to place between x-axis major tick marks.
ny does the same for the y-axis.
tick.ratio is the size of the minor tick mark relative to the major tick mark.
The length of the major tick mark is retrieved from par("tck").
Reference Lines
Add reference lines to a graph using the abline( ) function.
abline(h=yvalues, v=xvalues)
Other graphical parameters (such as line type, color, and width) can also be specified in the abline( ) function.
# add solid horizontal lines at y=1,5,7
abline(h=c(1,5,7))
# add dashed blue vertical lines at x = 1,3,5,7,9
abline(v=seq(1,10,2),lty=2,col="blue")
Note: You can also use the grid( ) function to add reference lines.
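For instance, a small sketch; by default grid( ) aligns dotted lines with the axis tick marks:

```r
# Add light gridlines aligned with the tick marks
plot(1:10, (1:10)^2, pch = 19, main = "Gridlines with grid()")
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")  # NULL = match ticks
```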
Legend
Add a legend with the legend() function.
legend(location, title, legend, ...)
Common options are described below.
option
description
location
There are several ways to indicate the location of the legend.
You can give an x,y coordinate for the upper left hand corner of the legend.
You can use locator(1), in which case you use the mouse to indicate the location of the legend.
You can also use the keywords "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", "bottomright", or "center".
If you use a keyword, you may want to use inset= to specify an amount to move the legend into the graph (as fraction of plot region).
title
A character string for the legend title (optional)
legend
A character vector with the labels
...
Other options.
If the legend labels colored lines, specify col= and a vector of colors.
If the legend labels point symbols, specify pch= and a vector of point symbols.
If the legend labels line width or line style, use lwd= or lty= and a vector of widths or styles.
To create colored boxes for the legend (common in bar, box, or pie charts), use fill= and a vector of colors.
Other common legend options include bty for box type, bg for background color, cex for size, and text.col for text color.
Setting horiz=TRUE sets the legend horizontally rather than vertically.
# Legend Example
attach(mtcars)
boxplot(mpg~cyl, main="Mileage by Number of Cylinders",
yaxt="n", xlab="Mileage", horizontal=TRUE,
col=terrain.colors(3))
legend("topright", inset=.05, title="Number of Cylinders",
c("4","6","8"), fill=terrain.colors(3), horiz=TRUE)
For more on legends, see help(legend).
The examples in the help are particularly informative.
R makes it easy to combine multiple plots into one overall graph, using either the par( ) or layout( ) function.
With the par( ) function, you can include the option mfrow=c(nrows, ncols) to create a matrix of nrows x ncols plots that are filled in by row.
mfcol=c(nrows, ncols) fills in the matrix by columns.
# 4 figures arranged in 2 rows and 2 columns
attach(mtcars)
par(mfrow=c(2,2))
plot(wt,mpg, main="Scatterplot of wt vs. mpg")
plot(wt,disp, main="Scatterplot of wt vs disp")
hist(wt, main="Histogram of wt")
boxplot(wt, main="Boxplot of wt")
# 3 figures arranged in 3 rows and 1 column
attach(mtcars)
par(mfrow=c(3,1))
hist(wt)
hist(mpg)
hist(disp)
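For contrast, a sketch of the mfcol variant; with mfcol=c(2,2) the first two plots fill the left column top to bottom before the right column is used:

```r
# 4 figures arranged in a 2x2 matrix, filled by columns
attach(mtcars)
par(mfcol=c(2,2))
plot(wt, mpg, main="1: wt vs mpg")     # row 1, column 1
plot(wt, disp, main="2: wt vs disp")   # row 2, column 1
hist(wt, main="3: Histogram of wt")    # row 1, column 2
boxplot(wt, main="4: Boxplot of wt")   # row 2, column 2
```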
The layout( ) function has the form layout(mat), where mat is a matrix object specifying the location of the N figures to plot.
# One figure in row 1 and two figures in row 2
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
Optionally, you can include widths= and heights= options in the layout( ) function to control the size of each figure more precisely.
These options have the form
widths= a vector of values for the widths of columns
heights= a vector of values for the heights of rows.
Relative widths are specified with numeric values.
Absolute widths (in centimetres) are specified with the lcm() function.
# One figure in row 1 and two figures in row 2
# row 1 is 1/3 the height of row 2
# column 2 is 1/4 the width of column 1
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE),
widths=c(3,1), heights=c(1,2))
hist(wt)
hist(mpg)
hist(disp)
See help(layout) for more details.
Creating a figure arrangement with fine control
In the following example, two box plots are added to scatterplot to create an enhanced graph.
# Add boxplots to a scatterplot
par(fig=c(0,0.8,0,0.8), new=TRUE)
plot(mtcars$wt, mtcars$mpg, xlab="Car Weight",
ylab="Miles Per Gallon")
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(mtcars$wt, horizontal=TRUE, axes=FALSE)
par(fig=c(0.65,1,0,0.8),new=TRUE)
boxplot(mtcars$mpg, axes=FALSE)
mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3)
To understand this graph, think of the full graph area as going from (0,0) in the lower left corner to (1,1) in the upper right corner.
The format of the fig= parameter is a numerical vector of the form c(x1, x2, y1, y2).
The first fig= sets up the scatterplot going from 0 to 0.8 on the x axis and 0 to 0.8 on the y axis.
The top boxplot goes from 0 to 0.8 on the x axis and 0.55 to 1 on the y axis.
I chose 0.55 rather than 0.8 so that the top figure will be pulled closer to the scatter plot.
The right hand boxplot goes from 0.65 to 1 on the x axis and 0 to 0.8 on the y axis.
Again, I chose a value to pull the right hand boxplot closer to the scatterplot.
You have to experiment to get it just right.
fig= starts a new plot, so to add to an existing plot use new=TRUE.
You can use this to combine several plots in any arrangement into one graph.
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships.
In particular, the package supports the creation of trellis graphs - graphs that display a variable or the relationship between variables, conditioned on one or more other variables.
The typical format is
graph_type(formula, data=)
where graph_type is selected from those listed below.
formula specifies the variable(s) to display and any conditioning variables.
For example ~x|A means display numeric variable x for each level of factor A.
y~x | A*B means display the relationship between numeric variables y and x separately for every combination of factor A and B levels.
~x means display numeric variable x alone.
graph_type
description
formula examples
barchart
bar chart
x~A or A~x
bwplot
boxplot
x~A or A~x
cloud
3D scatterplot
z~x*y|A
contourplot
3D contour plot
z~x*y
densityplot
kernel density plot
~x|A*B
dotplot
dotplot
~x|A
histogram
histogram
~x
levelplot
3D level plot
z~y*x
parallel
parallel coordinates plot
data frame
splom
scatterplot matrix
data frame
stripplot
strip plots
A~x or x~A
xyplot
scatterplot
y~x|A
wireframe
3D wireframe graph
z~y*x
Here are some examples.
They use the car data (mileage, weight, number of gears, number of cylinders, etc.) from the mtcars data frame.
# Lattice Examples
library(lattice)
attach(mtcars)
# create factors with value labels
gear.f<-factor(gear,levels=c(3,4,5),
labels=c("3gears","4gears","5gears"))
cyl.f <-factor(cyl,levels=c(4,6,8),
labels=c("4cyl","6cyl","8cyl"))
# kernel density plot
densityplot(~mpg,
main="Density Plot",
xlab="Miles per Gallon")
# kernel density plots by factor level
densityplot(~mpg|cyl.f,
main="Density Plot by Number of Cylinders",
xlab="Miles per Gallon")
# kernel density plots by factor level (alternate layout)
densityplot(~mpg|cyl.f,
main="Density Plot by Number of Cylinders",
xlab="Miles per Gallon",
layout=c(1,3))
# boxplots for each combination of two factors
bwplot(cyl.f~mpg|gear.f,
ylab="Cylinders", xlab="Miles per Gallon",
main="Mileage by Cylinders and Gears",
layout=c(1,3))
# scatterplots for each combination of two factors
xyplot(mpg~wt|cyl.f*gear.f,
main="Scatterplots by Cylinders and Gears",
ylab="Miles per Gallon", xlab="Car Weight")
# 3d scatterplot by factor level
cloud(mpg~wt*qsec|cyl.f,
main="3D Scatterplot by Cylinders")
# dotplot for each combination of two factors
dotplot(cyl.f~mpg|gear.f,
main="Dotplot Plot by Number of Gears and Cylinders",
xlab="Miles Per Gallon")
# scatterplot matrix
splom(mtcars[c(1,3,4,5,6)],
main="MTCARS Data")
Note, as in the first graph, that specifying a conditioning variable is optional.
The difference between graphs 2 and 3 is the use of the layout option to control the placement of panels.
Customizing Lattice Graphs
Unlike base R graphs, lattice graphs are not affected by many of the options set in the par( ) function.
To view the options that can be changed, look at help(xyplot).
It is frequently easiest to set these options within the high level plotting functions described above.
Additionally, you can write functions that modify the rendering of panels.
Here is an example.
# Customized Lattice Example
library(lattice)
panel.smoother <- function(x, y) {
panel.xyplot(x, y) # show points
panel.loess(x, y) # show smoothed line
}
attach(mtcars)
hp <- cut(hp,3) # divide horse power into three bands
xyplot(mpg~wt|hp, scales=list(cex=.8, col="red"),
panel=panel.smoother,
xlab="Weight", ylab="Miles per Gallon",
main="MGP vs Weight by Horse Power")
Going Further
Lattice graphics are a comprehensive graphical system in their own right.
Deepayan Sarkar's book Lattice: Multivariate Data Visualization with R is the definitive reference.
Additionally, see the Trellis User's Guide.
Dr. Ihaka has created a wonderful set of slides on the subject.
An excellent early consideration of trellis graphs can be found in W.S. Cleveland's classic book Visualizing Data.
The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating elegant and complex plots.
Its popularity in the R community has exploded in recent years.
Originally based on Leland Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner.
Grouping can be represented by color, symbol, size, and transparency.
The creation of trellis plots (i.e., conditioning) is relatively simple.
Mastering the ggplot2 language can be challenging (see the Going Further section below for helpful resources).
There is a helper function called qplot() (for quick plot) that can hide much of this complexity when creating standard graphs.
qplot()
The qplot() function can be used to create the most common graph types.
While it does not expose ggplot's full power, it can create a very wide range of useful plots.
The format is:
qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
where the options are:
option
description
alpha
Alpha transparency for overlapping elements expressed as a fraction between 0 (complete transparency) and 1 (complete opacity)
color, shape, size, fill
Associates the levels of a variable with symbol color, shape, or size.
For line plots, color associates levels of a variable with line color.
For density and box plots, fill associates fill colors with a variable.
Legends are drawn automatically.
data
Specifies a data frame
facets
Creates a trellis graph by specifying conditioning variables.
Its value is expressed as rowvar ~ colvar.
To create trellis graphs based on a single conditioning variable, use rowvar ~ . or . ~ colvar.
geom
Specifies the geometric objects that define the graph type.
The geom option is expressed as a character vector with one or more entries.
geom values include "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".
main, sub
Character vectors specifying the title and subtitle
method, formula
If geom="smooth", a loess fit line and confidence limits are added by default.
When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed.
Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression.
The formula parameter gives the form of the fit.
For example, to add simple linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x.
Changing the formula to y~poly(x,2) would produce a quadratic fit.
Note that the formula uses the letters x and y, not the names of the variables.
For method="gam", be sure to load the mgcv package.
For method="rlm", load the MASS package.
x, y
Specifies the variables placed on the horizontal and vertical axis.
For univariate plots (for example, histograms), omit y
xlab, ylab
Character vectors specifying horizontal and vertical axis labels
xlim,ylim
Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively
Notes:
At present, ggplot2 cannot be used to create 3D graphs or mosaic plots.
Use I(value) to indicate a specific value.
For example, size=z makes the size of the plotted points or lines proportional to the values of a variable z.
In contrast, size=I(3) sets each point or line to three times the default size.
Here are some examples using automotive data (car mileage, weight, number of gears, number of cylinders, etc.) contained in the mtcars data frame.
# ggplot2 examples
library(ggplot2)
# create factors with value labels
mtcars$gear <- factor(mtcars$gear,levels=c(3,4,5),
labels=c("3gears","4gears","5gears"))
mtcars$am <- factor(mtcars$am,levels=c(0,1),
labels=c("Automatic","Manual"))
mtcars$cyl <- factor(mtcars$cyl,levels=c(4,6,8),
labels=c("4cyl","6cyl","8cyl"))
# Kernel density plots for mpg
# grouped by number of gears (indicated by color)
qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
main="Distribution of Gas Mileage", xlab="Miles Per Gallon",
ylab="Density")
# Scatterplot of mpg vs. hp for each combination of gears and cylinders
# in each facet, transmission type is represented by shape and color
qplot(hp, mpg, data=mtcars, shape=am, color=am,
facets=gear~cyl, size=I(3),
xlab="Horsepower", ylab="Miles per Gallon")
# Separate regressions of mpg on weight for each number of cylinders
qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
method="lm", formula=y~x, color=cyl,
main="Regression of MPG on Weight",
xlab="Weight", ylab="Miles per Gallon")
# Boxplots of mpg by number of gears
# observations (points) are overlaid and jittered
qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"),
fill=gear, main="Mileage by Gear Number",
xlab="", ylab="Miles per Gallon")
Customizing ggplot2 Graphs
Unlike base R graphs, ggplot2 graphs are not affected by many of the options set in the par( ) function.
They can be modified using the theme() function, and by adding graphic parameters within the qplot() function.
For greater control, use ggplot() and other functions provided by the package.
Note that ggplot2 functions can be chained with "+" signs to generate the final plot.
library(ggplot2)
p <- qplot(hp, mpg, data=mtcars, shape=am, color=am,
facets=gear~cyl, main="Scatterplots of MPG vs. Horsepower",
xlab="Horsepower", ylab="Miles per Gallon")
# White background and black grid lines
p + theme_bw()
# Large brown bold italics labels
# and legend placed at top of plot
p + theme(axis.title=element_text(face="bold.italic",
size=12, color="brown"), legend.position="top")
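For comparison, here is an illustrative sketch of a similar plot built with ggplot() itself rather than qplot(); the layer and mapping choices are examples, and it assumes am, gear, and cyl have been converted to factors as in the earlier examples:

```r
# Equivalent construction with the full ggplot() grammar
library(ggplot2)
ggplot(mtcars, aes(x=hp, y=mpg, shape=am, color=am)) +
  geom_point(size=3) +
  facet_grid(gear~cyl) +
  labs(title="Scatterplots of MPG vs. Horsepower",
       x="Horsepower", y="Miles per Gallon") +
  theme_bw()
```

Each call added with "+" contributes one component (points, facets, labels, theme), which is what makes the full grammar more flexible than qplot().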
This section describes creating probability plots in R for both didactic purposes and for data analyses.
Probability Plots for Teaching and Demonstration
When I was a college professor teaching statistics, I used to have to draw normal distributions by hand.
They always came out looking like bunny rabbits.
What can I say?
R makes it easy to draw probability distributions and demonstrate statistical concepts.
Some of the more common probability distributions available in R are given below.
distribution     R name     distribution          R name
Beta             beta       Lognormal             lnorm
Binomial         binom      Negative Binomial     nbinom
Cauchy           cauchy     Normal                norm
Chi-square       chisq      Poisson               pois
Exponential      exp        Student t             t
F                f          Uniform               unif
Gamma            gamma      Tukey                 tukey
Geometric        geom       Weibull               weibull
Hypergeometric   hyper      Wilcoxon              wilcox
Logistic         logis
For a comprehensive list, see Statistical Distributions on the R wiki.
The functions available for each distribution follow this format:
name       description
dname( )   density or probability mass function
pname( )   cumulative distribution function
qname( )   quantile function
rname( )   random deviates
For example, pnorm(0) = 0.5 (the area under the standard normal curve to the left of zero).
qnorm(0.9) = 1.28 (1.28 is the 90th percentile of the standard normal distribution).
rnorm(100) generates 100 random deviates from a standard normal distribution.
Each function has parameters specific to that distribution.
For example, rnorm(100, m=50, sd=10) generates 100 random deviates from a normal distribution with mean 50 and standard deviation 10.
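These calls can be tried directly at the console; the approximate values noted in the comments follow from the standard normal distribution:

```r
dnorm(0)                  # density at zero, 1/sqrt(2*pi), about 0.3989
pnorm(0)                  # area to the left of zero: 0.5
qnorm(0.9)                # 90th percentile, about 1.2816
rnorm(5, mean=50, sd=10)  # five random deviates, mean 50, sd 10
```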
You can use these functions to demonstrate various aspects of probability distributions.
Two common examples are given below.
# Display the Student's t distributions with various
# degrees of freedom and compare to the normal distribution
x <- seq(-4, 4, length=100)
hx <- dnorm(x)
degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")
plot(x, hx, type="l", lty=2, xlab="x value",
ylab="Density", main="Comparison of t Distributions")
for (i in 1:4){
lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}
legend("topright", inset=.05, title="Distributions",
labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)
# Children's IQ scores are normally distributed with a
# mean of 100 and a standard deviation of 15. What
# proportion of children are expected to have an IQ between
# 80 and 120?
mean=100; sd=15
lb=80; ub=120
x <- seq(-4,4,length=100)*sd + mean
hx <- dnorm(x,mean,sd)
plot(x, hx, type="n", xlab="IQ Values", ylab="",
main="Normal Distribution", axes=FALSE)
i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="red")
area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(",lb,"< IQ <",ub,") =",
signif(area, digits=3))
mtext(result,3)
axis(1, at=seq(40, 160, 20), pos=0)
For a comprehensive view of probability plotting in R, see Vincent Zonekynd's Probability Distributions.
Fitting Distributions
There are several methods of fitting distributions in R.
Here are some options.
You can use the qqnorm( ) function to create a Quantile-Quantile plot evaluating the fit of sample data to the normal distribution.
More generally, the qqplot( ) function creates a Quantile-Quantile plot for any theoretical distribution.
# Q-Q plots
par(mfrow=c(1,2))
# create sample data
x <- rt(100, df=3)
# normal fit
qqnorm(x);
qqline(x)
# t(3Df) fit
qqplot(rt(1000,df=3), x, main="t(3) Q-Q Plot",
ylab="Sample Quantiles")
abline(0,1)
The fitdistr() function in the MASS package provides maximum-likelihood fitting of univariate distributions.
The format is fitdistr(x, densityfunction) where x is the sample data and densityfunction is one of the following: "beta", "cauchy", "chi-squared", "exponential", "f", "gamma", "geometric", "log-normal", "lognormal", "logistic", "negative binomial", "normal", "Poisson", "t" or "weibull".
# Estimate parameters assuming log-Normal distribution
# create some sample data
x <- rlnorm(100)
# estimate parameters
library(MASS)
fitdistr(x, "lognormal")
Finally, R has a wide range of goodness-of-fit tests for evaluating whether it is reasonable to assume that a random sample comes from a specified theoretical distribution.
These include chi-square, Kolmogorov-Smirnov, and Anderson-Darling.
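For example, the Kolmogorov-Smirnov test is available in base R as ks.test(). The sketch below is illustrative; note that estimating the distribution's parameters from the same sample, as done here for simplicity, makes the reported p-value only approximate:

```r
# Kolmogorov-Smirnov test: is the sample consistent with a normal distribution?
set.seed(123)
x <- rnorm(100, mean=10, sd=2)
ks.test(x, "pnorm", mean=mean(x), sd=sd(x))
```

A large p-value indicates no evidence against the hypothesized distribution. An Anderson-Darling test for normality is provided by ad.test() in the nortest package.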
For more details on fitting distributions, see Vito Ricci's Fitting Distributions with R.
For general (non R) advice, see Bill Huber's Fitting Distributions to Data.
The vcd package provides a variety of methods for visualizing multivariate categorical data, inspired by Michael Friendly's wonderful "Visualizing Categorical Data".
Extended mosaic and association plots are described here.
Each provides a method of visualizing complex data and evaluating deviations from a specified independence model.
For more details, see The Strucplot Framework.
Mosaic Plots
For extended mosaic plots, use mosaic(x, condvar=, data=) where x is a table or formula, condvar= is an optional conditioning variable, and data= specifies a data frame or a table.
Include shade=TRUE to color the figure, and legend=TRUE to display a legend for the Pearson residuals.
# Mosaic Plot Example
library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)
Association Plots
To produce an extended association plot use assoc(x, row_vars, col_vars) where x is a contingency table, row_vars is a vector of integers giving the indices of the variables to be used for the rows, and col_vars is a vector of integers giving the indices of the variables to be used for the columns of the association plot.
# Association Plot Example
library(vcd)
assoc(HairEyeColor, shade=TRUE)
Going Further
Both functions are complex and offer multiple input and output options.
See help(mosaic) and help(assoc) for more details.
The corrgram( ) function in the corrgram package produces correlograms from a data frame x with one observation per row.
order=TRUE will cause the variables to be ordered using principal component analysis of the correlation matrix.
panel= refers to the off-diagonal panels.
You can use lower.panel= and upper.panel= to choose different options below and above the main diagonal respectively.
text.panel= and diag.panel= refer to the main diagonal.
Allowable parameters are given below.
off-diagonal panels:
panel.pie (the filled portion of the pie indicates the magnitude of the correlation)
panel.shade (the depth of the shading indicates the magnitude of the correlation)
panel.ellipse (confidence ellipse and smoothed line)
panel.pts (scatterplot)
main diagonal panels:
panel.minmax (min and max values of the variable)
panel.txt (variable name)
# First Correlogram Example
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Car Mileage Data in PC2/PC1 Order")
# Second Correlogram Example
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.ellipse,
upper.panel=panel.pts,
text.panel=panel.txt,
diag.panel=panel.minmax,
main="Car Mileage Data in PC2/PC1 Order")
# Third Correlogram Example
library(corrgram)
corrgram(mtcars, order=NULL, lower.panel=panel.shade,
upper.panel=NULL, text.panel=panel.txt,
main="Car Mileage Data (unsorted)")
Changing the colors in a correlogram
You can control the colors in a correlogram by specifying 4 colors in the colorRampPalette( ) function within the col.corrgram( ) function.
Here is an example.
# Changing Colors in a Correlogram
library(corrgram)
col.corrgram <- function(ncol){
colorRampPalette(c("darkgoldenrod4", "burlywood1",
"darkkhaki", "darkgreen"))(ncol)}
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Correlogram of Car Mileage Data (PC2/PC1 Order)")
There are several ways to interact with R graphics in real time.
Three methods are described below.
GGobi
GGobi is an open source visualization program for exploring high-dimensional data.
It is freely available for MS Windows, Linux, and Mac platforms.
It supports linked interactive scatterplots, barcharts, parallel coordinate plots and tours, with both brushing and identification.
A good tutorial is included with the GGobi manual.
You can download the software here.
Once GGobi is installed, you can use the ggobi( ) function in the rggobi package to run GGobi from within R. This gives you interactive graphics access to all of your R data! See An Introduction to RGGOBI.
# Interact with R data using GGobi
library(rggobi)
g <- ggobi(mydata)
iPlots
The iplots package provides interactive mosaic plots, bar plots, box plots, parallel plots, scatter plots and histograms that can be linked and color brushed.
iplots is implemented through the Java GUI for R.
For more information, see the iplots website.
# Install iplots
install.packages("iplots",dep=TRUE)
# Create some linked plots
library(iplots)
cyl.f <- factor(mtcars$cyl)
gear.f <- factor(mtcars$gear)
attach(mtcars)
ihist(mpg) # histogram
ibar(carb) # barchart
iplot(mpg, wt) # scatter plot
ibox(mtcars[c("qsec","disp","hp")]) # boxplots
ipcp(mtcars[c("mpg","wt","hp")]) # parallel coordinates
imosaic(cyl.f,gear.f) # mosaic plot
On Windows platforms, hold down the Ctrl key and move the mouse over each graph to get identifying information from points, bars, etc.
Interacting with Plots (Identifying Points)
R offers two functions for identifying points and coordinate locations in plots.
With identify(), clicking the mouse over points in a graph will display the row number or (optionally) the rowname for the point.
This continues until you select stop.
With locator() you can add points or lines to the plot using the mouse.
The function returns a list of the (x,y) coordinates.
Again, this continues until you select stop.
# Interacting with a scatterplot
attach(mydata)
plot(x, y) # scatterplot
identify(x, y, labels=row.names(mydata)) # identify points
coords <- locator(type="l") # add lines
coords # display list
Other Interactive Graphs
See scatterplots for a description of rotating 3D scatterplots in R.