R examples

R Basics
Variables
Comments
print and cat
if statements
Functions
Defining functions
Return values
Exponentiation
Named arguments
Automatic printing at the console
Countdown
While loops
Logical operators
Sys.sleep
Reading user input
The readline function
The as.integer function
Preventing failure if no number is entered
Guess a random number game
Rounding
Getting a random number
Lists
Generating lists
Accessing list members
Appending to lists
Operating on lists
Predefined lists
What are vectors?
Reading data
Reading a CSV file
Indexing data frames
A short equivalent of the code
The summary function
Filtering data
Picking specific columns out of a data frame
Finding the lowest value in a list
Selecting the row with the lowest population value
More Examples
Function examples
How to run the code
Using RStudio
Using the R console
Running a source file with R
Finding data sources
mtcars
Quickly generating a data set
Real US data: data.gov
International data


R Basics

To run the code, paste it into an R console window.

a <- 42
A <- a * 2  # R is case sensitive
print(a)
cat(A, "\n") # "84" is concatenated with "\n"
if(A>a) # true, 84 > 42
{
  cat(A, ">", a, "\n")
} 
[1] 42
84 
84 > 42

Variables

R uses <- for variable assignment.

Don't call your variables any of the following:

Howard Seltman provides more information about reserved terms in this "Learning R" lesson.
You can use underscores and periods in your identifiers. Google suggests tree.count for variables and DoSomething for functions.
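As a rough illustration of these naming rules, the sketch below uses a period-separated variable name, an underscore name and a capitalized function name. All three names are made up for this example. It also shows that R refuses an assignment to a reserved word such as TRUE; the attempt is wrapped in tryCatch so the script keeps running.

```r
tree.count <- 12                  # period-separated variable name
tree_height <- 3.5                # underscores are allowed too
DoubleIt <- function(x) x * 2     # hypothetical function name

print(DoubleIt(tree.count))

# Reserved words cannot be used as variable names.
# reserved.ok ends up FALSE because the assignment raises an error.
reserved.ok <- tryCatch({
  eval(parse(text = "TRUE <- 1"))
  TRUE
}, error = function(e) FALSE)
print(reserved.ok)
```

Typing ?Reserved at the console shows R's own list of reserved words.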

Comments

Comments start with a # sign. R has no block comment syntax, so every commented line needs its own #.

print and cat

print() automatically appends a new line character to the output. With cat() you have to append it manually. print() can also show more types of content, such as functions:

print(cat)
> print(cat)
function (..., file = "", sep = " ", fill = FALSE, labels = NULL, 
    append = FALSE) 
{
    if (is.character(file)) 
        if (file == "") 
  file <- stdout()
        else if (substring(file, 1L, 1L) == "|") {
  file <- pipe(substring(file, 2L), "w")
  on.exit(close(file))
        }
        else {
  file <- file(file, ifelse(append, "a", "w"))
  on.exit(close(file))
        }
    .Internal(cat(list(...), file, sep, fill, labels, append))
}
<environment: namespace:base>
You can get the documentation of a function from the R console if you type help(print).

if statements

If statements are quite straightforward:

if(condition)
{
	doSth()
}

Functions

Square <- function(x) {
  return(x^2)
}

print(Square(4))
print(Square(x=4)) # same thing
[1] 16
[1] 16

Defining functions

Functions can be defined like this: function(parameter1, parameter2, parameter3){code}.
To use them, assign them to a variable with the normal <- assignment operator.

Return values

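The return function is used for return values: return(value). If return is called without a value, NULL is returned. If a function never calls return, it returns the value of the last expression it evaluated.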
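A minimal sketch of the different ways a function can hand back a value. The function names are invented for this example.

```r
Explicit <- function(x) {
  return(x + 1)     # explicit return value
}

Implicit <- function(x) {
  x + 1             # the last evaluated expression is returned
}

Nothing <- function() {
  return()          # return without a value gives NULL
}

print(Explicit(1))
print(Implicit(1))
print(is.null(Nothing()))
```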

Exponentiation

Raise a to the power of b: a^b. It's also possible to use ** instead of ^.
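A short sketch of both spellings. It also shows one thing that is easy to trip over: exponentiation binds tighter than the unary minus, so parentheses are needed to square a negative number.

```r
a <- 2^10       # 1024
b <- 2**10      # same result, ** is read as ^
c1 <- -2^2      # -4, evaluated as -(2^2)
c2 <- (-2)^2    # 4

print(c(a, b, c1, c2))
```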

Named arguments

Named arguments work like this: DoSomething(color="red",number=55).

You can also give the first arguments in order and then use named arguments: DoSth(value1,value2,arg4=value4,arg3=value3).
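The following sketch shows both calling styles. Describe and its parameters are hypothetical and only serve to make the argument matching visible.

```r
# Hypothetical function with two required and two optional parameters
Describe <- function(color, number, size = "small", shape = "round") {
  paste(number, size, shape, color, "things")
}

# All arguments named; size and shape keep their defaults
first <- Describe(color = "red", number = 55)

# First two arguments by position, the rest by name in any order
second <- Describe("red", 55, shape = "square", size = "large")

print(first)    # "55 small round red things"
print(second)   # "55 large square red things"
```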

Automatic printing at the console

When you type a statement at the console, R automatically prints its value unless you assign it to a variable. This doesn't happen inside loops or functions, where you have to call print explicitly.

You can use invisible(CalculateSth()) if you don't want the return value to be printed.
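A small sketch of this behaviour. CalculateSth is given a made-up body here, and withVisible is used only to make the "would this be printed?" flag inspectable from code.

```r
CalculateSth <- function() {
  6 * 7
}

CalculateSth()              # typed at the console, this prints [1] 42
invisible(CalculateSth())   # prints nothing

# withVisible reports whether a value would be auto-printed
shown  <- withVisible(CalculateSth())$visible             # TRUE
hidden <- withVisible(invisible(CalculateSth()))$visible  # FALSE

for (i in 1:2) {
  i * 2          # not printed: no automatic printing inside loops
  print(i * 2)   # printed explicitly
}
```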

Countdown

countdown <- function(from)
{
  print(from)
  while(from>0)  # >0 instead of !=0 so the loop also ends for negative or fractional input
  {
    Sys.sleep(1)
    from <- from - 1
    print(from)
  }
}

countdown(5)
> countdown(5)
[1] 5
[1] 4
[1] 3
[1] 2
[1] 1
[1] 0

While loops

Use while(condition){code}.

Logical operators

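You can use the usual C-style operators: ==,<,||,&&,....
However, there are differences under some circumstances: & and | compare vectors element by element, while && and || work on single TRUE/FALSE values and skip the second operand when the first one already decides the result.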
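A brief sketch of the difference between the single and the doubled operators. The variable names are arbitrary.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

both   <- a & b    # element by element: TRUE FALSE FALSE
either <- a | b    # element by element: TRUE TRUE TRUE

x <- 5
in.range <- x > 0 && x < 10   # single values, as used in if and while

# && stops as soon as the result is known, so stop() is never called
skipped <- FALSE && stop("never evaluated")

print(both)
print(either)
print(in.range)
print(skipped)
```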

Sys.sleep

Suspends execution for the given number of seconds.

Reading user input

readinteger <- function()
{ 
  n <- readline(prompt="Enter an integer: ")
  return(as.integer(n))
}

print(readinteger())
Enter an integer: 88
[1] 88

The readline function

readline() lets the user enter a one-line string at the terminal.
The prompt argument is printed in front of the user input. It usually ends with ": ".

The as.integer function

as.integer makes an integer out of the string.
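A few sample conversions, as a sketch. suppressWarnings is used here only to keep the output clean for the input that is not a number.

```r
whole   <- as.integer("88")    # 88
decimal <- as.integer("3.7")   # 3, the fractional part is dropped
bad     <- suppressWarnings(as.integer("boo"))  # NA

print(whole)
print(decimal)
print(is.na(bad))
```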

Preventing failure if no number is entered

Right now if the user doesn't enter an integer, as.integer will return NA (Not Available). We can avoid this by using is.na to check the user input and asking again if the value is NA:

readinteger <- function()
{ 
  n <- readline(prompt="Enter an integer: ")
  n <- as.integer(n)
  if (is.na(n)){
    n <- readinteger()
  }
  return(n)
}

print(readinteger())
 
Enter an integer: 
Enter an integer: boo
Enter an integer: 44
[1] 44
Warning message:
In readinteger() : NAs introduced by coercion
 

However, a warning message is still shown. This is how to avoid it:

readinteger <- function()
{ 
  n <- readline(prompt="Enter an integer: ")
  if(!grepl("^[0-9]+$",n))
  {
    return(readinteger())
  }
  
  return(as.integer(n))
}

print(readinteger())
 
Enter an integer: 
Enter an integer: 31r132weq
Enter an integer: effasdf
Enter an integer: 222
[1] 222 
 

grepl returns TRUE if the regular expression "^[0-9]+$" is matched. (The expression checks for a string that consists of nothing but one or more digits.)

! negates the result and the if branch is executed if the user-entered string isn't an integer.
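To see what the pattern accepts, it can be tried on a few sample strings at once, since grepl works on whole vectors. The sample inputs are made up.

```r
inputs <- c("222", "31r132weq", "", "-5", "007")
is.digits <- grepl("^[0-9]+$", inputs)

# Show each input next to the result of the check
print(data.frame(input = inputs, is.digits = is.digits))
```

Note that the pattern rejects a leading minus sign, so negative numbers count as invalid input in readinteger.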

Guess a random number game

Use source("filename.r") to run this.

#utility functions

readinteger <- function()
{ 
  n <- readline(prompt="Enter an integer: ")
  if(!grepl("^[0-9]+$",n))
  {
    return(readinteger())
  }
  return(as.integer(n))
}

# real program starts here
  
num <- round(runif(1) * 100, digits = 0)
guess <- -1

cat("Guess a number between 0 and 100.\n")

while(guess != num)
{ 
  guess <- readinteger()
  if (guess == num)
  {
    cat("Congratulations,", num, "is right.\n")
  }
  else if (guess < num)
  {
    cat("It's bigger!\n")
  }
  else if(guess > num)
  {
    cat("It's smaller!\n")
  }
}
> source("random-number-game.r")
Guess a number between 0 and 100.
Enter an integer: 50
It's smaller!
Enter an integer: 20
It's bigger!
Enter an integer: 40
It's bigger!
Enter an integer: 45
It's smaller!
Enter an integer: 43
It's bigger!
Enter an integer: 44
Congratulations, 44 is right.

The readinteger function has been explained in a previous example.

Rounding

The round function rounds the first argument to the specified number of digits.

> round(22.5,0)  # rounds to even number
[1] 22
> round(3.14,1)
[1] 3.1

Getting a random number

By default runif generates random numbers between 0 and 1. The first argument specifies how many numbers you want, and the min and max arguments change the range.

runif(2)
[1] 0.8379240 0.1773677

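The "unif" part of the function name indicates that the numbers are uniformly distributed: every value in the range is equally likely, so a histogram of many draws looks flat.

In the game the distribution isn't completely uniform at the sides because of our use of round. A round to 0 only happens between 0 and 0.5 because we don't have negative numbers. Likewise a round to 100 only happens between 99.5 and 100, so 0 and 100 come up about half as often as the other numbers.

A normal distribution, which is not uniform, would instead give a bell-shaped histogram.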

Learn how to make these graphs here.
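The effect at the sides can be checked by counting, as in the sketch below. It also shows sample as one possible alternative for the game; using sample is a suggestion of this example, not something the game code above does.

```r
set.seed(1)  # fixed seed so the run is repeatable

# The approach from the game: scale to 0..100, then round
rounded <- round(runif(100000) * 100)
counts <- as.vector(table(factor(rounded, levels = 0:100)))

edge.mean  <- mean(counts[c(1, 101)])  # how often 0 and 100 came up
inner.mean <- mean(counts[2:100])      # how often 1 to 99 came up
cat("average count for 0 and 100:", edge.mean, "\n")
cat("average count for 1 to 99:  ", inner.mean, "\n")

# Alternative: sample picks each of the 101 integers with equal probability
num <- sample(0:100, 1)
print(num)
```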

Lists

sum(0:9)
append(LETTERS[1:13],letters[14:26])
c(1,6,4,9)*2
something <- c(1,4,letters[2])  # indices start at one, you get ("1","4","b") because c converts everything to strings
length(something)
> sum(0:9)
[1] 45
> append(LETTERS[1:13],letters[14:26])
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "n" "o"
[16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> c(1,6,4,9)*2
[1]  2 12  8 18
> something <- c(1,4,letters[2]) #indices start at one, you get ("1","4","b")
> length(something)
[1] 3

Generating lists

You can use a colon to generate a list of numbers:
-5:5
[1] -5 -4 -3 -2 -1  0  1  2  3  4  5

The c function

The c function combines the parameters into a list and converts them to the same type.

c("test",3)
[1] "test" "3"
typeof("3")
[1] "character"
Here 3 is converted to a string.

The seq function

seq generates more complex regular sequences:

> seq(from=1,to=4,by=.6)
[1] 1.0 1.6 2.2 2.8 3.4 4.0  

Accessing list members

List members can be accessed using brackets as in most languages: (3:5)[2]. This returns 4 because indices start with 1.
You can also extract multiple list members from a list: letters[2:4] returns [1] "b" "c" "d".
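Besides single positions and ranges, brackets also accept negative numbers and conditions. A short sketch with made-up values:

```r
x <- c(10, 20, 30, 40, 50)

second   <- x[2]         # 20
middle   <- x[2:4]       # 20 30 40
no.first <- x[-1]        # everything except the first element
large    <- x[x > 25]    # elements for which the condition is TRUE
ends     <- x[c(1, 5)]   # first and last element

print(large)
```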

Appending to lists

You can use the append function for this. Its return value has to be reassigned to the variable.
By default the new value is appended at the end of the list. You can use the after argument to change that:

a <- 1:4
append(a,2.4,after=2)
[1] 1.0 2.0 2.4 3.0 4.0
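The sketch below shows why the reassignment matters: append returns a new value and leaves the original variable untouched.

```r
a <- 1:4

append(a, 5)         # result is discarded, a is unchanged
unchanged.length <- length(a)   # still 4

a <- append(a, 5)    # reassign to keep the new element
a <- c(a, 6)         # c can be used for the same purpose

print(a)
```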

Operating on lists

R allows you to easily operate on all list values at once.

c(1,2,3) + 3
[1] 4 5 6

This and the apply function allow you to avoid most for loops.
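A sketch of loop-free operations. sapply, a relative of apply that works on plain sequences of values, is brought in here as an addition to the text above; Classify is a made-up helper.

```r
values <- c(1, 2, 3)

# Arithmetic applies to every element
squared <- values^2

# sapply calls a function once per element and collects the results
Classify <- function(x) if (x %% 2 == 0) "even" else "odd"
labels <- sapply(values, Classify)

# apply works on the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
row.sums <- apply(m, 1, sum)

print(squared)
print(labels)
print(row.sums)
```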

Predefined lists

Lists for letters and month names are predefined:

letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
[16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct"
[11] "Nov" "Dec"

month.name
[1] "January"   "February"  "March"     "April"     "May"      
 [6] "June"      "July"      "August"    "September" "October"  
[11] "November"  "December" 

What are vectors?

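Strictly speaking, everything called a list above is a vector: a sequence in which all elements have the same type. For example, the c function creates a vector. R also has a separate list type, created with the list function, whose elements can have different types.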
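The difference shows up as soon as values of different types are mixed. A minimal sketch:

```r
v <- c(1, "b", TRUE)      # vector: everything is converted to strings
l <- list(1, "b", TRUE)   # list: every element keeps its own type

print(v)
print(typeof(v))
element.types <- sapply(l, typeof)
print(element.types)

# Single list elements are extracted with double brackets
second <- l[[2]]
```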

Reading data

This code uses a dataset file with population estimates by the US Census Bureau (more info).

tbl <- read.table(file.choose(),header=TRUE,sep=",")
population <- tbl["POPESTIMATE2009"]
print(summary(population[-1:-5,]))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  544300  1734000  4141000  5980000  6613000 36960000 

Reading a CSV file

read.table can read a variety of basic data formats into tables or "data frames".
sep specifies the separator for the data, which is a comma for CSV files.
header indicates whether the first row contains the names of the data columns.

The first argument contains the file name. In this case file.choose is used to show a dialog.

(The user's home folder is the default working directory in RStudio.)
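To try read.table without downloading anything, the sketch below first writes a tiny CSV file to a temporary location and then reads it back. The file contents and column names are invented for this example.

```r
# Create a small CSV file in a temporary location
csv.path <- tempfile(fileext = ".csv")
writeLines(c("NAME,POPULATION",
             "Alpha,100",
             "Beta,250",
             "Gamma,75"), csv.path)

# Read it the same way as the census file, but with a fixed path
tbl <- read.table(csv.path, header = TRUE, sep = ",")
print(tbl)

total <- sum(tbl$POPULATION)
unlink(csv.path)  # remove the temporary file again
```

read.csv is a wrapper around read.table that uses a comma separator and header=TRUE by default.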

Indexing data frames

See the R documentation for more information.

Getting a specific column

You can use the column name as a string in brackets: tbl["POPESTIMATE2009"]:

   POPESTIMATE2009
1        307006550
2         55283679
3         66836911
[...]
	
Using the column number also works: tbl[17].

Getting a column as a list

You can use the dollar sign for this: tbl$POPESTIMATE2009

[1] 307006550  55283679  66836911 113317879  71568081   4708708    698473
[8]   6595778   2889450  36961664   5024748   3518288    885122    599657
[...]

Fetching specific rows and columns

Here the table will be treated as a 2-dimensional matrix.
To get the first 5 rows from the population table:

population[1:5,]  #  first the rows, then the columns
[1] 307006550  55283679  66836911 113317879  71568081

The comma after the row information indicates that we want all columns. In this case we could also have written [1:5,1] because we only have 1 column in population.

Look at this data from the first 5 rows in the population column:

[1] 307006550  55283679  66836911 113317879  71568081

These are too big to be population values for US States. They are the total US population and that of the US Census Bureau regions: Northeast, Midwest, South and West.
Since we are only interested in the states we can drop them like this:

population[-1:-5,]

Negative numbers in matrix indices can be used to omit specific rows or columns.
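The same indexing can be tried on a small hand-made data frame. The names and numbers below are made up and only mimic the structure of the census data, with a summary row on top.

```r
df <- data.frame(NAME = c("Total", "Alpha", "Beta", "Gamma"),
                 POP  = c(425, 100, 250, 75))

first.two <- df[1:2, ]       # rows 1 and 2, all columns
no.total  <- df[-1, ]        # all rows except the first
pop.only  <- df[-1, "POP"]   # same rows, one column, returned as a vector

print(no.total)
print(pop.only)
```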

A short equivalent of the code

You can also fetch the population column at the same time as you remove the multi-state rows. Replace

population <- tbl["POPESTIMATE2009"]
print(summary(population[-1:-5,]))
with
print(summary(tbl[-1:-5,"POPESTIMATE2009"]))

The summary function

summary calculates a few values based on the data passed as the first argument. The exact values calculated depend on the class of the data.

summary(1:10)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00 

Filtering data

This code uses a dataset file with population estimates by the US Census Bureau (more info).

tbl <- read.table(file.choose(),header=TRUE,sep=',')
population <- tbl[c("NAME","POPESTIMATE2009","NPOPCHG_2009")]
smallest.state.pop <- min(population$POPESTIMATE2009)
print(population[population$POPESTIMATE2009==smallest.state.pop,])
      NAME POPESTIMATE2009 NPOPCHG_2009
56 Wyoming          544270        11289

This piece of code extracts the data about the smallest state from the data frame.
The first line reads the data from the CSV file (as explained under "Reading a CSV file").

Picking specific columns out of a data frame

The second line limits the columns to the state name, the population estimate for 2009 and the total population change for 2009.
Let's use the head function to look at what we get:

head(population)
           NAME POPESTIMATE2009 NPOPCHG_2009
1 United States       307006550      2631704
2     Northeast        55283679       223483
3       Midwest        66836911       241314
4         South       113317879      1296857
5          West        71568081       870050
6       Alabama         4708708        31244

Finding the lowest value in a list

First the POPESTIMATE2009 column is selected:

population$POPESTIMATE2009
 [1] 307006550  55283679  66836911 113317879  71568081   4708708    698473
 [8]   6595778   2889450  36961664   5024748   3518288    885122    599657
[...]
[50]   2784572    621760   7882590   6664195   1819777   5654774    544270
[57]   3967288

Then the min function is used to find the minimum:

min(population$POPESTIMATE2009)
[1] 544270

Selecting the row with the lowest population value

You can use something like an SQL WHERE clause in the row index of a data frame:

data.frame[condition,]
This condition works because it creates a vector of booleans depending on whether the field value is a match:

population$POPESTIMATE2009==smallest.state.pop
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

In this case only the second to last row should be selected. We use a comma after the row index because we want all the columns:

population[population$POPESTIMATE2009==smallest.state.pop,]
      NAME POPESTIMATE2009 NPOPCHG_2009
56 Wyoming          544270        11289
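The filtering steps can be replayed on a small made-up data frame. which.min is an alternative that this example adds; it returns the position of the first minimum directly.

```r
population <- data.frame(NAME = c("Alpha", "Beta", "Gamma"),
                         POP  = c(100, 250, 75),
                         stringsAsFactors = FALSE)

# One TRUE/FALSE value per row
is.smallest <- population$POP == min(population$POP)
smallest <- population[is.smallest, ]

# Any other condition works the same way
big <- population[population$POP > 90, ]

# Alternative: pick the row by the position of the minimum
also.smallest <- population[which.min(population$POP), ]

print(smallest)
print(big)
```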

More Examples

Plotting a uniform distribution

Function examples

match()
sample()

How to run the code

I find it easiest to use RStudio. You can also paste the code into a normal R console or let R run a source file. However, these approaches are a bit less fail-safe.

Using RStudio

Download and install RStudio.
Open RStudio and do this:

  1. Click on the menu: File -> New -> R Script
  2. Paste the code in the new source code area
  3. Click the "Source" button above the code area.

You can also use the console in RStudio. If you click "Run" instead of "Source" user input might not work properly.

You can use the R documentation like this: help(function.name).

Using the R console

Running the R program on the command line or elsewhere will start the console. You can paste your code there.

Problems with this approach

This will generally work if you use source("filename.r") to run your code. If you paste the code some of it might be read as user input.

Running a source file with R

You can run a source file like this: R -f filename.r.
R also provides a lot of other command line arguments.

Problems with this approach

I have not managed to get user input to work here.
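One workaround that is often suggested for scripts started with Rscript is to read from standard input with readLines instead of readline. The sketch below is an assumption along those lines, not something tested with R -f; ReadLine is a hypothetical helper, and the demonstration feeds it from a text connection so it runs without a keyboard.

```r
# Hypothetical helper: reads one line from standard input.
# con defaults to "stdin", which readLines opens as the standard input
# of the process; pass another connection to try it without typing.
ReadLine <- function(prompt = "", con = "stdin") {
  cat(prompt)
  line <- readLines(con, n = 1)
  if (length(line) == 0) {
    return(NA_character_)  # end of input, nothing was read
  }
  line
}

# Demonstration with a text connection standing in for the keyboard
demo.input <- textConnection("44")
answer <- ReadLine("Enter an integer: ", con = demo.input)
close(demo.input)
cat("\nYou entered", answer, "\n")
```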

Finding data sources

Here are a few data sets to play around with.

mtcars

print(head(mtcars))
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

This is a built-in data set. Use it for sorting, filtering etc.
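A few starting points for sorting and filtering, as a sketch. It uses order and aggregate from base R, which the text above does not cover.

```r
# Filtering: only the cars with 4 cylinders
four.cyl <- mtcars[mtcars$cyl == 4, ]

# Sorting: highest miles per gallon first
by.mpg <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
print(head(by.mpg[c("mpg", "cyl", "hp")], 3))

# Grouping: average miles per gallon for each cylinder count
mean.mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mean.mpg)
```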

Quickly generating a data set

You can use matrix to generate a quick dataset. The first argument must contain your data. You can use a distribution function or sample for it.

print(matrix(runif(6*3), nrow=6, ncol=3))
           [,1]       [,2]       [,3]
[1,] 0.94210093 0.23582446 0.19571104
[2,] 0.45026399 0.77989358 0.69763985
[3,] 0.03567169 0.40572983 0.83394039
[4,] 0.31246289 0.08076585 0.74957412
[5,] 0.61316957 0.94886782 0.90769685
[6,] 0.94545758 0.48658449 0.03396954
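The same idea works with sample or with another distribution function. In the sketch below the column names and the parameters of rnorm are arbitrary choices for illustration.

```r
set.seed(42)  # fixed seed so the run is repeatable

# Whole numbers between 1 and 100 drawn with sample
scores <- matrix(sample(1:100, 6 * 3, replace = TRUE), nrow = 6, ncol = 3)
colnames(scores) <- c("a", "b", "c")

# A matrix can be turned into a data frame for filtering and summary
scores.df <- as.data.frame(scores)
print(summary(scores.df))

# Normally distributed values instead of uniform ones
heights <- matrix(rnorm(6 * 3, mean = 50, sd = 10), nrow = 6, ncol = 3)
```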

Real US data: data.gov

The US government publishes data sets on data.gov. This includes US census data. Most files are in CSV format and you can use read.table to read them.

International data

The UN publishes data at data.un.org. This data is also available in CSV format.