Regular Expressions in R




We will use the Gapminder data as a running example to demonstrate regular expressions in R.
All solutions are at the end of this file. We load the stringr package, read in the Gapminder data, and define a character vector of hypothetical filenames.

library(stringr)
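# stringsAsFactors = TRUE keeps country as a factor (the default changed in R 4.0.0), so levels() works below
gDat <- read.delim("gapminderDataFiveYear.txt", stringsAsFactors = TRUE)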
files <- c("block0_dplyr-fake.rmd", "block000_dplyr-fake.rmd.txt", "gapminderDataFiveYear.txt", 
"regex.html", "regex.md", "regex.R", "regex.Rmd", "regex.Rpres", 
"xblock000_dplyr-fake.rmd")

String functions related to regular expression

A regular expression is a pattern that describes a set of strings with a common structure.
Regular expressions are used for string matching and replacing in most programming languages, although the specific syntax may differ a bit.
They are truly the heart and soul of string operations.
In R, many string functions in base R and in the stringr package use regular expressions; even RStudio’s search and replace supports them.
Both base R and stringr provide functions for these tasks (stringr functions are written with the stringr:: prefix below).
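
As a rough sketch of how the two families line up, the common tasks (detect, extract, replace, split) each have a base R and a stringr counterpart. The `fruit` vector is made up for illustration and is not part of the Gapminder data; the selection of functions is only a sample, not a complete list.

```r
library(stringr)

# Hypothetical example vector (not from the Gapminder data)
fruit <- c("apple", "banana", "cherry")

# Detect a match: logical vector, one element per string
grepl("an", fruit)
stringr::str_detect(fruit, "an")

# Extract the first match from each string (NA where there is none)
stringr::str_extract(fruit, "[aeiou]")

# Replace the first match (sub) or all matches (gsub, str_replace_all)
sub("a", "A", fruit)
gsub("a", "A", fruit)
stringr::str_replace_all(fruit, "a", "A")

# Split each string at the pattern; both return a list
strsplit(fruit, "an")
stringr::str_split(fruit, "an")
```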

Regular expression syntax

Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string.
This is accomplished with the help of metacharacters that have specific meaning: $ * + . ? [ ] ^ { } | ( ) \.
We will use some small examples to introduce regular expression syntax and what these metacharacters mean.
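
To make the idea of a metacharacter concrete, here is a small sketch using a few of the hypothetical filenames from above. The vector name `some_files` and the entry "regex_R_notes.txt" are additions made up for this illustration. An unescaped . matches any character, so it must be escaped (or the pattern marked as fixed) to match a literal dot.

```r
# A few filenames from the `files` vector, plus one made-up entry
# ("regex_R_notes.txt") added only to show the difference
some_files <- c("regex.html", "regex.md", "regex.R", "regex.Rmd",
                "regex_R_notes.txt")

# "." is a metacharacter: it matches any single character, including "_"
grep("regex.R", some_files, value = TRUE)

# Escaping it ("\\." inside an R string) matches a literal dot only
grep("regex\\.R", some_files, value = TRUE)

# fixed = TRUE switches off regular expression interpretation entirely
grep("regex.R", some_files, value = TRUE, fixed = TRUE)
```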

Escape sequences

Some characters cannot be typed directly into an R string.
The apostrophe is one example: because single quotes, like double quotes, delimit strings in R, name <- 'Cote d'Ivoire' returns an error.
To use an apostrophe as an apostrophe rather than as a string delimiter, “escape” it by preceding it with a backslash, \, so it is clear that it is not part of the string-specifying machinery.
So name <- 'Cote d\'Ivoire' will work. Let’s search the country names for those with an apostrophe:

grep('\'', levels(gDat$country))
grep('\'', levels(gDat$country), value = TRUE)
str_detect(levels(gDat$country), '\'')
str_detect(levels(gDat$country), '\'') %>% levels(gDat$country)[.]

Other characters in R strings also require escaping, and this rule applies to all string functions in R, including those that use regular expressions.
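
A minimal sketch of a few escape sequences that R recognizes in string literals. The example strings are made up, and the set shown here is a sample rather than the full list.

```r
# Hypothetical strings containing a tab, a newline, a backslash and double quotes
x <- c("tab\there", "new\nline", "back\\slash", "double \"quote\"")

# print() shows the escaped form; writeLines() shows the actual characters
print(x)
writeLines(x)

# Each escape sequence is a single character in the string
nchar("\t")
nchar("\\")

# To match a literal backslash the regex needs \\, which is written "\\\\" in R
grepl("\\\\", x)
```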

Quantifiers

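Quantifiers specify how many times the preceding item may repeat:

  • *: zero or more times
  • +: one or more times
  • ?: at most once
  • {n}: exactly n times
  • {n,}: n or more times
  • {n,m}: between n and m times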

Run the following while taking the time to understand the logic:

strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
strings
grep("ac*b", strings, value = TRUE)
grep("ac+b", strings, value = TRUE)
grep("ac?b", strings, value = TRUE)
grep("ac{2}b", strings, value = TRUE)
grep("ac{2,}b", strings, value = TRUE)
grep("ac{2,3}b", strings, value = TRUE)
stringr::str_extract_all(strings, "ac{2,3}b", simplify = TRUE)

Exercise

Using quantifiers, find all countries with ee, but not eee, in their names.

Position of pattern within the string

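Anchors tie the pattern to a position in the string: ^ matches at the start, $ at the end, and \b at a word boundary.
In the last example below, the pattern is written "\\bab" because R reads "\b" inside a string as a backspace character; doubling the backslash passes a literal \b through to the regular expression engine.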

strings <- c("abcd", "cdab", "cabd", "c abd")
strings
grep("ab", strings, value = TRUE)
grep("^ab", strings, value = TRUE)
grep("ab$", strings, value = TRUE)
grep("\\bab", strings, value = TRUE)

Exercise

Find the country names that

  • Start with “South”
  • End in “land”
  • Have a word in their name that starts with “Ga”

Character classes

Character classes allow you to – surprise! – specify entire classes of characters, such as digits, letters, etc.
There are two flavors of character classes: one uses [: and :] around a predefined name inside square brackets, and the other uses \ followed by a special character.
They are sometimes interchangeable.
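
A small sketch of the two flavors side by side, using made-up strings. It assumes R's default regular expression engine, which accepts both the POSIX names and the \d and \s shorthands.

```r
# Hypothetical example strings
x <- c("abc", "123", "a1 b2", "tab\there", "semi;colon")

# Digits: POSIX class and backslash shorthand give the same matches here
grep("[[:digit:]]", x, value = TRUE)
grep("\\d", x, value = TRUE)

# Whitespace (a space or a tab in these strings)
grep("[[:space:]]", x, value = TRUE)
grep("\\s", x, value = TRUE)

# Classes work in replacements too: mask every digit with "#"
gsub("[[:digit:]]", "#", x)
```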

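Note: a [:name:] class only works inside square brackets, for example [[:digit:]], and the backslash of a \ class must itself be escaped in an R string, for example "\\d".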

Exercise

  • Find all countries that have punctuation in their names
  • Rewrite the clean.text() function from HW04 so that it takes in a string and
    • keeps only alphanumeric characters
    • removes all spaces
    • converts to lower case and returns the newly formatted string
For example, clean.text("Coeur d'Alene") should return coeurdalene.
clean.text <- function(string){
 
  return(string)
}

Advanced: Operators

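strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")
strings
grep("ab.", strings, value = TRUE)      # . matches any single character
grep("ab[c-e]", strings, value = TRUE)  # [c-e] matches one of c, d, e
grep("ab[^c]", strings, value = TRUE)   # [^c] matches any character except c
grep("^ab", strings, value = TRUE)      # outside brackets, ^ anchors to the start
grep("\\^ab", strings, value = TRUE)    # \\^ matches a literal ^
grep("abc|abd", strings, value = TRUE)  # | means "or"
gsub("(ab) 12", "\\1 34", strings)      # ( ) captures a group; \\1 refers back to it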
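
Capture groups and backreferences can do more than substitute a constant. As a hedged sketch with made-up names, the following reorders two captured pieces and wraps whichever alternative matched.

```r
# Hypothetical "last, first" names
people <- c("Doe, Jane", "Smith, John")

# Two groups are captured; \\1 and \\2 refer back to the first and second
sub("^(.+), (.+)$", "\\2 \\1", people)

# | inside a group offers alternatives; \\1 holds whichever one matched
sub("(Jane|John)", "<\\1>", people)
```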

Exercise

Find the countries in Gapminder whose names contain the letter i or t and end with land, and replace land with LAND using a backreference.

## [1] "FinLAND"     "IceLAND"     "IreLAND"     "SwaziLAND"   "SwitzerLAND"
## [6] "ThaiLAND"

Solutions

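# Countries with `ee` but not `eee`: e{2} alone also matches "eee", so drop names matching e{3}
ee <- grep("e{2}", levels(gDat$country), value = TRUE)
grep("e{3}", ee, value = TRUE, invert = TRUE)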

# Position of pattern within the string
grep("^South", levels(gDat$country), value = TRUE)
grep("land$", levels(gDat$country), value = TRUE)
grep("\\bGa", levels(gDat$country), value = TRUE)

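# Function to clean text
clean.text <- function(text){
  text <- gsub("[^[:alnum:]]", "", text)  # keep only alphanumeric characters
  text <- gsub(" ", "", text)             # remove spaces (already dropped above; kept to mirror the exercise)
  text <- tolower(text)                   # convert to lower case
  return(text)
}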

# Punctuation in its name
grep("[[:punct:]]", levels(gDat$country), value = TRUE)

# Backreference
countries <- gsub("(.*[it].*)land$", "\\1LAND", levels(gDat$country), ignore.case = TRUE)
grep("LAND", countries, value = TRUE)