rlist Tutorial

rlist is a set of tools for working with list objects. It provides a wide range of functions for non-tabular data.





Mapping
Suppose we load the data which is represented by the following table:



Name   Age  Interests               Expertise
-----  ---  ----------------------  --------------------
Ken    24   reading, music, movies  R:2, C#:4, Python:3
James  25   sports, music           R:3, Java:2, C++:5
Penny  24   movies, reading         R:1, C++:4, Python:2



list.load() loads data from a given data source. The source can be either local or remote, and by default the function uses the file extension to decide how to read it.

library(rlist)
people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json")
str(people)

# List of 3
#  $ :List of 4
#   ..$ Name     : chr "Ken"
#   ..$ Age      : int 24
#   ..$ Interests: chr [1:3] "reading" "music" "movies"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 2
#   .. ..$ CSharp: int 4
#   .. ..$ Python: int 3
#  $ :List of 4
#   ..$ Name     : chr "James"
#   ..$ Age      : int 25
#   ..$ Interests: chr [1:2] "sports" "music"
#   ..$ Expertise:List of 3
#   .. ..$ R   : int 3
#   .. ..$ Java: int 2
#   .. ..$ Cpp : int 5
#  $ :List of 4
#   ..$ Name     : chr "Penny"
#   ..$ Age      : int 24
#   ..$ Interests: chr [1:2] "movies" "reading"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 1
#   .. ..$ Cpp   : int 4
#   .. ..$ Python: int 2


NOTE: str() previews the structure of an object. We may use this function more often to avoid verbose representation of list objects.
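
If the remote file is not reachable, a list with the same structure can be built by hand. The following is a minimal sketch that simply transcribes the str() output above into list() calls; it does not show how list.load() works internally.

```r
# Hand-built copy of the sample data, transcribed from the str() output above.
# The L suffix keeps Age and the Expertise values as integers, as in the output.
people <- list(
  list(Name = "Ken", Age = 24L,
       Interests = c("reading", "music", "movies"),
       Expertise = list(R = 2L, CSharp = 4L, Python = 3L)),
  list(Name = "James", Age = 25L,
       Interests = c("sports", "music"),
       Expertise = list(R = 3L, Java = 2L, Cpp = 5L)),
  list(Name = "Penny", Age = 24L,
       Interests = c("movies", "reading"),
       Expertise = list(R = 1L, Cpp = 4L, Python = 2L))
)

# Preview only the top level to keep the output short.
str(people, max.level = 1)
```
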



To extract the name of each person (list element), we would traditionally call lapply() as follows:

lapply(people, function(x) {
  x$Name
})

# [[1]]
# [1] "Ken"
# 
# [[2]]
# [1] "James"
# 
# [[3]]
# [1] "Penny"

Using rlist's list.map() the task is made extremely easy:

list.map(people, Name)

# [[1]]
# [1] "Ken"
# 
# [[2]]
# [1] "James"
# 
# [[3]]
# [1] "Penny"

List mapping evaluates an expression for each list member. It is the fundamental operation in rlist: almost all functions in this package that work with expressions use mapping, but in different ways. The following examples demonstrate several types of mapping in more detail.

 - list.map

The simplest way of mapping is provided by list.map() as we have just demonstrated. Basically, it evaluates an expression for each list element. 

The function makes it easier to query a list by putting all fields of the list member being mapped into the environment where the expression is evaluated. In other words, the expression is evaluated in the context of one list member at a time.

For example, the following code maps each list member in people by the expression Age. The result is a list in which each item is the value of that expression for the corresponding member of people.

list.map(people, Age)

# [[1]]
# [1] 24
# 
# [[2]]
# [1] 25
# 
# [[3]]
# [1] 24
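
The idea of evaluating an expression in the context of a list member can be sketched in base R. The helper map_sketch() below is hypothetical and is not rlist's implementation; it only assumes that each member is a named list, which eval() can use as the place to look up field names.

```r
# Hypothetical helper: evaluate `expr` once per member, with the member's
# fields visible as variables. Not the actual rlist implementation.
map_sketch <- function(.data, expr) {
  expr <- substitute(expr)   # capture the expression without evaluating it
  caller <- parent.frame()   # fallback scope for names that are not fields
  lapply(.data, function(member) eval(expr, envir = member, enclos = caller))
}

people <- list(list(Name = "Ken", Age = 24L), list(Name = "James", Age = 25L))
offset <- 1L

# `Age` is resolved inside each member, `offset` in the calling environment.
ages_next_year <- map_sketch(people, Age + offset)
str(ages_next_year)
```
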

Since the expression does not have to be a field name of the list member, we can evaluate whatever we want in the context of a list member.

The following code maps each list member to the sum of years of the programming languages they use.

list.map(people, sum(as.numeric(Expertise)))

# [[1]]
# [1] 9
# 
# [[2]]
# [1] 10
# 
# [[3]]
# [1] 7

If we need more than one value for each member, we can evaluate a vector or list expression.

The following code maps each list member to a new list containing the age and the range of the number of years spent using programming languages.

list.map(people, list(age=Age, range=range(as.numeric(Expertise))))

# [[1]]
# [[1]]$age
# [1] 24
# 
# [[1]]$range
# [1] 2 4
# 
# 
# [[2]]
# [[2]]$age
# [1] 25
# 
# [[2]]$range
# [1] 2 5
# 
# 
# [[3]]
# [[3]]$age
# [1] 24
# 
# [[3]]$range
# [1] 1 4

In some cases we need to refer to the item itself, or its index in the list, or even its name. In the expression, . represents the item itself, .i represents its index, and .name represents its name.

For example,

nums <- c(a=3, b=2, c=1)
list.map(nums, . + 1)

# $a
# [1] 4
# 
# $b
# [1] 3
# 
# $c
# [1] 2

list.map(nums, .i)

# $a
# [1] 1
# 
# $b
# [1] 2
# 
# $c
# [1] 3

list.map(nums, paste0("name: ", .name))

# $a
# [1] "name: a"
# 
# $b
# [1] "name: b"
# 
# $c
# [1] "name: c"
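
For comparison, the three meta-symbols can be imitated in base R by passing the element, its index, and its name to Map() in parallel. This is only a sketch of the idea; naming the function arguments ., .i and .name is a choice made here for illustration.

```r
nums <- c(a = 3, b = 2, c = 1)

# Map() walks the three inputs in parallel: the value, its position, its name.
labelled <- Map(
  function(., .i, .name) paste0(.name, "[", .i, "] = ", . + 1),
  nums, seq_along(nums), names(nums)
)

# The result is a list that keeps the names of the first input.
str(labelled)
```
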

If the default symbols clash with the data, we can use a lambda expression to specify other symbols. We will cover this later.



NOTE: rlist functions are general enough to work smoothly with vectors. list.map() works very much like lapply(), so the input is ultimately transformed into a list.


 - list.mapv

If we want to get the mapping results as a vector rather than a list, we can use list.mapv(), which basically calls unlist() on the list returned by list.map().

list.mapv(people, Age)

# [1] 24 25 24

list.mapv(people, sum(as.numeric(Expertise)))

# [1]  9 10  7
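
A base R counterpart, shown here only as a rough comparison, is either unlist(lapply(...)) or vapply(). The reduced people list below is an assumption that keeps just the fields needed for the example.

```r
# Reduced copy of the sample data with only the fields used here.
people <- list(
  list(Name = "Ken", Age = 24L),
  list(Name = "James", Age = 25L),
  list(Name = "Penny", Age = 24L)
)

# unlist() flattens whatever lapply() returns.
ages_unlist <- unlist(lapply(people, function(p) p$Age))

# vapply() additionally checks that every result is a single integer.
ages_vapply <- vapply(people, function(p) p$Age, integer(1))

ages_vapply
```
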
 - list.select

In contrast to list.map(), list.select() provides an easier way to map each list member to a new list. This function basically evaluates all given expressions and puts the results into a list.

If a field of a list member is selected by name, its name is automatically preserved. If an item evaluated from some other expression is selected, it is better to give it a name; otherwise it will only have an index.

list.select(people, Name, Age)

# [[1]]
# [[1]]$Name
# [1] "Ken"
# 
# [[1]]$Age
# [1] 24
# 
# 
# [[2]]
# [[2]]$Name
# [1] "James"
# 
# [[2]]$Age
# [1] 25
# 
# 
# [[3]]
# [[3]]$Name
# [1] "Penny"
# 
# [[3]]$Age
# [1] 24

list.select(people, Name, Age, nlang=length(Expertise))

# [[1]]
# [[1]]$Name
# [1] "Ken"
# 
# [[1]]$Age
# [1] 24
# 
# [[1]]$nlang
# [1] 3
# 
# 
# [[2]]
# [[2]]$Name
# [1] "James"
# 
# [[2]]$Age
# [1] 25
# 
# [[2]]$nlang
# [1] 3
# 
# 
# [[3]]
# [[3]]$Name
# [1] "Penny"
# 
# [[3]]$Age
# [1] 24
# 
# [[3]]$nlang
# [1] 3
 - list.iter

Sometimes we don't really need the result of a mapping but only its side effects. For example, if we only need to print something about each list member, we don't need to keep the output of the mapping.

list.iter() performs iterations over a list and returns the input data invisibly for further data transformation.

list.iter(people, cat(Name, ":", Age, "\n"))

# Ken : 24 
# James : 25 
# Penny : 24
 - list.maps

All the previous functions work with a single list. However, there are scenarios where mapping multiple lists is needed. list.maps() evaluates an expression with multiple lists each of which is represented by a user-defined symbol at the function call.

l1 <- list(p1=list(x=1,y=2), p2=list(x=3,y=4), p3=list(x=1,y=3))
l2 <- list(2, 3, 5)
list.maps(a$x*b+a$y, a=l1, b=l2)

# $p1
# [1] 4
# 
# $p2
# [1] 13
# 
# $p3
# [1] 8

list.maps() does not follow the convention of many other functions, such as list.map() and list.iter(), where the data comes first and the expression second. Since list.maps() supports multi-mapping with a group of lists, only implicit lambda expressions are supported to avoid ambiguity. The lists themselves are passed in ..., where users can also define the symbol that represents each list being mapped.

In the example above, ... means a = l1, b = l2, so that a and b are meaningful in the first expression a$x*b+a$y, where a and b refer to the current element of each list, respectively.

The function does not require the lists to be supplied as named arguments. If they are unnamed, we can use ..1, ..2, etc. to refer to the first, second, or other lists.

list.maps(..1$x*..2 + ..1$y, l1, l2)

# $p1
# [1] 4
# 
# $p2
# [1] 13
# 
# $p3
# [1] 8
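
As a point of comparison, the same multi-list mapping can be written in base R with Map(), which also iterates over several lists in parallel. This is a sketch of the equivalent computation, not a description of how list.maps() is implemented.

```r
l1 <- list(p1 = list(x = 1, y = 2), p2 = list(x = 3, y = 4), p3 = list(x = 1, y = 3))
l2 <- list(2, 3, 5)

# a and b are the current elements of l1 and l2; names come from l1.
res <- Map(function(a, b) a$x * b + a$y, l1, l2)
str(res)
```
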

                    

Filtering

List filtering selects list elements by given criteria. In the rlist package, more than ten functions are related to list filtering. Basically, they all perform mapping first and then aggregate the results in different ways.

First, we load the sample data.

library(rlist)
people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json")

 - list.filter

list.filter() filters a list by an expression that returns TRUE or FALSE. The results only contain the list elements for which the value of that expression turns out to be TRUE.

Unlike list mapping, which evaluates an expression for each list element and returns its value, list filtering evaluates an expression to decide whether to include the entire element in the results.

str(list.filter(people, Age >= 25))

# List of 1
#  $ :List of 4
#   ..$ Name     : chr "James"
#   ..$ Age      : int 25
#   ..$ Interests: chr [1:2] "sports" "music"
#   ..$ Expertise:List of 3
#   .. ..$ R   : int 3
#   .. ..$ Java: int 2
#   .. ..$ Cpp : int 5

Note that list.filter() returns the list elements that satisfy the given conditions. We call str() on the results to shorten the output.

Using a pipeline, we can first filter the data and then map the resulting elements by an expression. For example, we can get the names of those whose age is no less than 25.

library(pipeR)
people %>>%
  list.filter(Age >= 25) %>>%
  list.mapv(Name)

# [1] "James"

Written in the traditional approach, the code would be

list.mapv(list.filter(people, Age >= 25), Name)

or 

people_filtered <- list.filter(people, Age >= 25)
list.mapv(people_filtered, Name)

Both versions are more verbose than the pipeline. Therefore, from now on we will rely heavily on pipelines when demonstrating features, both to make the data processing look more elegant and to reduce the amount of information in the output.

Similarly, we can also get the names of those who are interested in music.

people %>>%
  list.filter("music" %in% Interests) %>>%
  list.mapv(Name)

# [1] "Ken"   "James"

We can get the names of those who have been using programming languages for at least three years on average.

people %>>%
  list.filter(mean(as.numeric(Expertise)) >= 3) %>>%
  list.mapv(Name)

# [1] "Ken"   "James"

Meta-symbols like ., .i, and .name can also be used. The following code picks out the list elements whose index is even.

people %>>%
  list.filter(.i %% 2 == 0) %>>%
  list.mapv(Name)

# [1] "James"
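
For readers who want a base R reference point, the filtering pipelines above correspond roughly to Filter() followed by vapply(). The reduced people list and the helper names_of() are assumptions made for this sketch.

```r
# Reduced copy of the sample data with only the fields used here.
people <- list(
  list(Name = "Ken", Age = 24L, Interests = c("reading", "music", "movies")),
  list(Name = "James", Age = 25L, Interests = c("sports", "music")),
  list(Name = "Penny", Age = 24L, Interests = c("movies", "reading"))
)

# Hypothetical helper: extract the Name field as a character vector.
names_of <- function(x) vapply(x, function(p) p$Name, character(1))

# Keep whole elements for which the predicate is TRUE, then map to names.
older_names <- names_of(Filter(function(p) p$Age >= 25, people))
music_names <- names_of(Filter(function(p) "music" %in% p$Interests, people))

# Filtering on the index needs no predicate at all in base R.
even_names <- names_of(people[seq_along(people) %% 2 == 0])
```
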
 - list.find

In some cases, we don't need to find all the instances that meet the criteria. Rather, we only need to find a few, sometimes only one. list.find() avoids searching across all list elements and stops once a specified number of items has been found.

people %>>%
  list.find(Age >= 25, 1) %>>%
  list.mapv(Name)

# [1] "James"
 - list.findi

list.findi() is similar to list.find(), except that it returns only the indices of the elements found.

list.findi(people, Age >= 23, 2)

# [1] 1 2

You may verify that if the number of instances to find is greater than the actual number of instances in the data, all qualified instances will be returned.
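
The early-stopping behaviour can be illustrated with a small base R loop. The function find_first_n() is a hypothetical helper written for this sketch; it is not part of rlist and may differ from how list.findi() is implemented.

```r
# Hypothetical helper: return the indices of the first n elements that
# satisfy pred, scanning no further than necessary.
find_first_n <- function(x, pred, n = 1L) {
  found <- integer(0)
  for (i in seq_along(x)) {
    if (isTRUE(pred(x[[i]]))) found <- c(found, i)
    if (length(found) >= n) break   # enough matches, stop scanning
  }
  found
}

ages <- list(24L, 25L, 24L)

first_two <- find_first_n(ages, function(a) a >= 23, n = 2L)

# Asking for more matches than exist simply returns all of them.
all_matches <- find_first_n(ages, function(a) a >= 25, n = 5L)
```
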

 - list.first, list.last

list.first() and list.last() are used to find the first and last element that meets certain condition if specified, respectively.

str(list.first(people, Age >= 23))

# List of 4
#  $ Name     : chr "Ken"
#  $ Age      : int 24
#  $ Interests: chr [1:3] "reading" "music" "movies"
#  $ Expertise:List of 3
#   ..$ R     : int 2
#   ..$ CSharp: int 4
#   ..$ Python: int 3

str(list.last(people, Age >= 23))

# List of 4
#  $ Name     : chr "Penny"
#  $ Age      : int 24
#  $ Interests: chr [1:2] "movies" "reading"
#  $ Expertise:List of 3
#   ..$ R     : int 1
#   ..$ Cpp   : int 4
#   ..$ Python: int 2

These two functions also work when the condition is missing. In this case, they simply take the first or last element from the list or vector.

list.first(1:10)

# [1] 1

list.last(1:10)

# [1] 10
 - list.take

list.take() takes at most a given number of elements from a list. If the number is even larger than the length of the list, the function will by default return all elements in the list.

list.take(1:10, 3)

# [1] 1 2 3

list.take(1:5, 8)

# [1] 1 2 3 4 5
 - list.skip

As opposed to list.take(), list.skip() skips at most a given number of elements in the list and takes all the rest as the result. If the number of elements to skip is equal to or greater than the length of the list, an empty one will be returned.

list.skip(1:10, 3)

# [1]  4  5  6  7  8  9 10

list.skip(1:5, 8)

# integer(0)
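
In base R, head() and tail() with a negative count behave in a comparable way for these four cases, which may help when reading code that mixes both styles. This is a comparison only, not a statement about how list.take() and list.skip() are implemented.

```r
# Take at most n elements from the front.
taken     <- head(1:10, 3)
taken_all <- head(1:5, 8)     # asking for more than exists returns everything

# A negative n in tail() drops that many elements from the front.
skipped     <- tail(1:10, -3)
skipped_all <- tail(1:5, -8)  # skipping more than exists leaves nothing
```
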
 - list.takeWhile

Similar to list.take(), list.takeWhile() is also designed to take out some elements from a list but subject to a condition. Basically, it keeps taking elements while a condition holds true.

people %>>%
  list.takeWhile(Expertise$R >= 2) %>>%
  list.map(list(Name = Name, R = Expertise$R)) %>>%
  str

# List of 2
#  $ :List of 2
#   ..$ Name: chr "Ken"
#   ..$ R   : int 2
#  $ :List of 2
#   ..$ Name: chr "James"
#   ..$ R   : int 3
 - list.skipWhile

list.skipWhile() keeps skipping elements while a condition holds true.

people %>>%
  list.skipWhile(Expertise$R <= 2) %>>%
  list.map(list(Name = Name, R = Expertise$R)) %>>%
  str

# List of 2
#  $ :List of 2
#   ..$ Name: chr "James"
#   ..$ R   : int 3
#  $ :List of 2
#   ..$ Name: chr "Penny"
#   ..$ R   : int 1
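
The difference between taking while and skipping while can be made explicit with a short base R sketch. The helpers take_while() and skip_while() are hypothetical; unlike a lazy implementation, they evaluate the predicate on every element and then use cumsum() to locate the first failure.

```r
# Keep elements up to (not including) the first one that fails pred.
take_while <- function(x, pred) {
  ok <- vapply(x, function(el) isTRUE(pred(el)), logical(1))
  x[cumsum(!ok) == 0]
}

# Drop elements up to the first one that fails pred, keep the rest.
skip_while <- function(x, pred) {
  ok <- vapply(x, function(el) isTRUE(pred(el)), logical(1))
  x[cumsum(!ok) > 0]
}

# Years of R experience, taken from the sample data.
r_years <- c(Ken = 2, James = 3, Penny = 1)

taken   <- take_while(r_years, function(v) v >= 2)
skipped <- skip_while(r_years, function(v) v <= 2)
```

Note that Penny appears in the skipped result even though her value is below 2: once the condition fails for James, everything after him is kept.
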
 - list.is

list.is() returns a logical vector that indicates whether a condition holds for each member of a list.

list.is(people, "music" %in% Interests)

# [1]  TRUE  TRUE FALSE

list.is(people, "Java" %in% names(Expertise))

# [1] FALSE  TRUE FALSE
 - list.which

list.which() returns an integer vector of the indices of the elements of a list that meet a given condition.

list.which(people, "music" %in% Interests)

# [1] 1 2

list.which(people, "Java" %in% names(Expertise))

# [1] 2
 - list.all

list.all() returns TRUE if all the elements of a list satisfy a given condition, or FALSE otherwise.

list.all(people, mean(as.numeric(Expertise)) >= 3)

# [1] FALSE

list.all(people, "R" %in% names(Expertise))

# [1] TRUE
 - list.any

list.any() returns TRUE if at least one of the elements of a list satisfies a given condition, or FALSE otherwise.

list.any(people, mean(as.numeric(Expertise)) >= 3)

# [1] TRUE

list.any(people, "Python" %in% names(Expertise))

# [1] TRUE
 - list.count

list.count() returns a scalar integer that indicates the number of elements of a list that satisfy a given condition.

list.count(people, mean(as.numeric(Expertise)) >= 3)

# [1] 2

list.count(people, "R" %in% names(Expertise))

# [1] 3
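
The five functions above (list.is(), list.which(), list.all(), list.any(), list.count()) all summarise the same logical vector in different ways. The base R sketch below makes that relationship visible; the reduced people list is an assumption that keeps only the Interests field.

```r
# Reduced copy of the sample data with only the field used here.
people <- list(
  list(Name = "Ken", Interests = c("reading", "music", "movies")),
  list(Name = "James", Interests = c("sports", "music")),
  list(Name = "Penny", Interests = c("movies", "reading"))
)

# One logical value per element: the common starting point.
likes_music <- vapply(people, function(p) "music" %in% p$Interests, logical(1))

flags   <- likes_music          # comparable to list.is()
indices <- which(likes_music)   # comparable to list.which()
every   <- all(likes_music)     # comparable to list.all()
some    <- any(likes_music)     # comparable to list.any()
count   <- sum(likes_music)     # comparable to list.count()
```
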
 - list.match

list.match() filters a list by matching the names of the list elements by a regular expression pattern.

data <- list(p1 = 1, p2 = 2, a1 = 3, a2 = 4)
list.match(data, "p[12]")

# $p1
# [1] 1
# 
# $p2
# [1] 2
 - list.remove

list.remove() removes list elements by index or name.

list.remove(data, c("p1","p2"))

# $a1
# [1] 3
# 
# $a2
# [1] 4

list.remove(data, c(2,3))

# $p1
# [1] 1
# 
# $a2
# [1] 4
 - list.exclude

list.exclude() removes list elements that satisfy a given condition.

people %>>%
  list.exclude("sports" %in% Interests) %>>%
  list.mapv(Name)

# [1] "Ken"   "Penny"
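
Matching by name, removing by name or index, and excluding by condition each have a short base R counterpart, sketched below for comparison. The interests list is a reduced, named version of the sample data introduced for this example.

```r
data <- list(p1 = 1, p2 = 2, a1 = 3, a2 = 4)

# Keep elements whose name matches a regular expression.
matched <- data[grepl("p[12]", names(data))]

# Remove elements by name or by position.
removed_by_name  <- data[setdiff(names(data), c("p1", "p2"))]
removed_by_index <- data[-c(2, 3)]

# Exclude elements that satisfy a condition by negating the predicate.
interests <- list(
  Ken   = c("reading", "music", "movies"),
  James = c("sports", "music"),
  Penny = c("movies", "reading")
)
no_sports <- Filter(Negate(function(v) "sports" %in% v), interests)
```
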
 - list.clean

list.clean() cleans a list using a function, either recursively or not. The function can be a built-in function such as is.null() to remove all NULL values from the list, or a user-defined function such as function(x) length(x) == 0 to remove all empty objects like NULL, character(0L), etc.

x <- list(a=1, b=NULL, c=list(x=1,y=NULL,z=logical(0L),w=c(NA,1)))
str(x)

# List of 3
#  $ a: num 1
#  $ b: NULL
#  $ c:List of 4
#   ..$ x: num 1
#   ..$ y: NULL
#   ..$ z: logi(0) 
#   ..$ w: num [1:2] NA 1

To clear all NULL values in the list recursively, we can call

str(list.clean(x, recursive = TRUE))

# List of 2
#  $ a: num 1
#  $ c:List of 3
#   ..$ x: num 1
#   ..$ z: logi(0) 
#   ..$ w: num [1:2] NA 1

To remove all empty values including NULL and zero-length vectors, we can call

str(list.clean(x, function(x) length(x) == 0L, recursive = TRUE))

# List of 2
#  $ a: num 1
#  $ c:List of 2
#   ..$ x: num 1
#   ..$ w: num [1:2] NA 1

The function can also deal with missing values. For example, the following code excludes all empty values as well as vectors with at least one NA.

str(list.clean(x, function(x) length(x) == 0L || anyNA(x), recursive = TRUE))

# List of 2
#  $ a: num 1
#  $ c:List of 1
#   ..$ x: num 1
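
Recursive cleaning can be sketched in a few lines of base R. The function clean_sketch() is hypothetical and cleans nested lists before testing the current level; whether list.clean() uses the same order internally is not stated in this tutorial, so treat this as an illustration of the idea only.

```r
# Hypothetical helper: drop every element for which `drop` returns TRUE,
# descending into nested lists first.
clean_sketch <- function(x, drop = is.null) {
  x <- lapply(x, function(el) if (is.list(el)) clean_sketch(el, drop) else el)
  x[!vapply(x, drop, logical(1))]
}

x <- list(a = 1, b = NULL, c = list(x = 1, y = NULL, z = logical(0L), w = c(NA, 1)))

no_null  <- clean_sketch(x)                                   # drops b and c$y
no_empty <- clean_sketch(x, function(v) length(v) == 0L)      # also drops c$z
no_na    <- clean_sketch(x, function(v) length(v) == 0L || anyNA(v))
str(no_na)
```
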
 - subset

subset() is implemented for list objects in a way that combines list.filter() and list.map(). This function basically filters a list and, at the same time, maps the qualifying list elements by an expression.

people %>>%
  subset(Age >= 24, Name)

# [[1]]
# [1] "Ken"
# 
# [[2]]
# [1] "James"
# 
# [[3]]
# [1] "Penny"

people %>>%
  subset("reading" %in% Interests, sum(as.numeric(Expertise)))

# [[1]]
# [1] 9
# 
# [[2]]
# [1] 7

                    

Updating

list.update() partially modifies the given list using a number of lists that result from evaluating expressions.

First, we load the data without any modification.

library(rlist)
library(pipeR)
people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json")
people %>>%
  list.select(Name, Age) %>>%
  list.stack

#    Name Age
# 1   Ken  24
# 2 James  25
# 3 Penny  24

list.stack() converts a list to a data frame with equivalent structure. We will introduce this function later.

Suppose we find that each person's age was mistakenly recorded as, say, one year less than the actual age. We then need to update the original data by correcting the age of each element.

people %>>%
  list.update(Age = Age + 1) %>>%
  list.select(Name, Age) %>>%
  list.stack

#    Name Age
# 1   Ken  25
# 2 James  26
# 3 Penny  25

list.update() can also be used to exclude certain fields of the elements. Once we update the fields we want to exclude to NULL, those fields are removed.

people %>>%
  list.update(Interests = NULL, Expertise = NULL, N = length(Expertise)) %>>%
  list.stack

#    Name Age N
# 1   Ken  24 3
# 2 James  25 3
# 3 Penny  24 3
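
A comparable partial update in base R uses utils::modifyList(), which replaces the named fields and removes any field set to NULL. The sketch below works on a reduced copy of the data and is meant as a comparison, not as the implementation of list.update().

```r
# Reduced copy of the sample data with only the fields used here.
people <- list(
  list(Name = "Ken", Age = 24L, Interests = c("reading", "music", "movies")),
  list(Name = "James", Age = 25L, Interests = c("sports", "music")),
  list(Name = "Penny", Age = 24L, Interests = c("movies", "reading"))
)

# For each person: bump Age by one and drop Interests (NULL removes a field).
updated <- lapply(people, function(p) {
  utils::modifyList(p, list(Age = p$Age + 1L, Interests = NULL))
})

str(updated)
```
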

                    

Sorting

The rlist package provides functions for sorting list elements by a series of criteria.

library(rlist)
library(pipeR)
people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json")

 - list.order

list.order() evaluates the given lambda expressions and returns the order of the list elements, ascending by default. If the values for some members tie, the next expression, if any, is used to break the tie.

To get descending order, enclose the expression in () or, if its value is numeric, simply write a minus operator (-) before it.

Get the order of people by Age in ascending order.

list.order(people, Age)

# [1] 1 3 2

Get the order of people by number of interests in ascending order.

list.order(people, length(Interests))

# [1] 2 3 1

Get the order of people by the number of years using R in descending order.

list.order(people, (Expertise$R))

# [1] 2 1 3

Get the order of people by the maximal number of years using a programming language in ascending order.

list.order(people, max(unlist(Expertise)))

# [1] 1 3 2

Get the order of people by the number of interests in descending order. If two people have the same number of interests, then the one who has been using R for more years should rank higher, thus ordering by R descending.

list.order(people, (length(Interests)), (Expertise$R))

# [1] 1 2 3
 - list.sort

list.sort() produces a sorted list of the original list members. Its usage is exactly the same as list.order().

people %>>%
  list.sort(Age) %>>%
  list.select(Name, Age) %>>%
  str

# List of 3
#  $ :List of 2
#   ..$ Name: chr "Ken"
#   ..$ Age : int 24
#  $ :List of 2
#   ..$ Name: chr "Penny"
#   ..$ Age : int 24
#  $ :List of 2
#   ..$ Name: chr "James"
#   ..$ Age : int 25

people %>>%
  list.sort(length(Interests)) %>>%
  list.select(Name, nint = length(Interests)) %>>%
  str

# List of 3
#  $ :List of 2
#   ..$ Name: chr "James"
#   ..$ nint: int 2
#  $ :List of 2
#   ..$ Name: chr "Penny"
#   ..$ nint: int 2
#  $ :List of 2
#   ..$ Name: chr "Ken"
#   ..$ nint: int 3

people %>>%
  list.sort((Expertise$R)) %>>%
  list.select(Name, R = Expertise$R) %>>%
  str

# List of 3
#  $ :List of 2
#   ..$ Name: chr "James"
#   ..$ R   : int 3
#  $ :List of 2
#   ..$ Name: chr "Ken"
#   ..$ R   : int 2
#  $ :List of 2
#   ..$ Name: chr "Penny"
#   ..$ R   : int 1

people %>>%
  list.sort(max(unlist(Expertise))) %>>%
  list.mapv(Name)

# [1] "Ken"   "Penny" "James"

people %>>%
  list.sort((length(Interests)), (Expertise$R)) %>>%
  list.select(Name, nint = length(Interests), R = Expertise$R) %>>%
  str

# List of 3
#  $ :List of 3
#   ..$ Name: chr "Ken"
#   ..$ nint: int 3
#   ..$ R   : int 2
#  $ :List of 3
#   ..$ Name: chr "James"
#   ..$ nint: int 2
#   ..$ R   : int 3
#  $ :List of 3
#   ..$ Name: chr "Penny"
#   ..$ nint: int 2
#   ..$ R   : int 1
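
The orderings above can be cross-checked with base R's order(), where a minus sign on a numeric key gives descending order and later keys break ties. The vectors below are transcribed from the sample data; extracting them by hand is a simplification made for this sketch.

```r
# Keys transcribed from the sample data (Ken, James, Penny).
name    <- c("Ken", "James", "Penny")
age     <- c(24L, 25L, 24L)
nint    <- c(3L, 2L, 2L)   # number of interests
r_years <- c(2L, 3L, 1L)   # years using R

by_age      <- order(age)                        # ascending, ties keep input order
by_r_desc   <- order(r_years, decreasing = TRUE) # descending
by_nint_r   <- order(-nint, -r_years)            # both keys descending

# Sorting is just indexing by the order.
sorted_names <- name[by_age]
```
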

                    

Grouping

rlist supports multiple types of grouping. 

First, we load the sample data.

library(rlist)
library(pipeR)
people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json")

 - list.group

list.group() is used to put list elements into subgroups by evaluating an expression. The expression often produces a scalar value such as a logical value, a character value, or a number. Each group denotes a unique value that expression takes for at least one list element, and all elements are put into one and only one group.

Divide numbers from 1 to 10 into even and odd numbers.

list.group(1:10, . %% 2 == 0)

# $`FALSE`
# [1] 1 3 5 7 9
# 
# $`TRUE`
# [1]  2  4  6  8 10

The result is a list of two elements, one for each possible outcome of evaluating . %% 2 == 0 for each number in 1:10. The FALSE group contains all odd numbers in 1:10 and the TRUE group contains all even numbers in 1:10.

This simple example demonstrates that the result of list.group() is always a list containing sublists with names of all possible outcomes, and the value of each sub-list is a subset of the original data in which each element evaluates the grouping expression to the same value.

With the same logic, we can divide all elements in people into groups by their ages:

str(list.group(people, Age))

# List of 2
#  $ 24:List of 2
#   ..$ :List of 4
#   .. ..$ Name     : chr "Ken"
#   .. ..$ Age      : int 24
#   .. ..$ Interests: chr [1:3] "reading" "music" "movies"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R     : int 2
#   .. .. ..$ CSharp: int 4
#   .. .. ..$ Python: int 3
#   ..$ :List of 4
#   .. ..$ Name     : chr "Penny"
#   .. ..$ Age      : int 24
#   .. ..$ Interests: chr [1:2] "movies" "reading"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R     : int 1
#   .. .. ..$ Cpp   : int 4
#   .. .. ..$ Python: int 2
#  $ 25:List of 1
#   ..$ :List of 4
#   .. ..$ Name     : chr "James"
#   .. ..$ Age      : int 25
#   .. ..$ Interests: chr [1:2] "sports" "music"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R   : int 3
#   .. .. ..$ Java: int 2
#   .. .. ..$ Cpp : int 5

The result is another list whose first-level elements are the groups and the elements in each group are exactly the elements that belong to the group.

Since the grouped result is also a list, we can always use rlist functions on the sublists as groups. Therefore, to get the names of people in each group, we can map each group to the names in it.

people %>>%
  list.group(Age) %>>%
  list.map(. %>>% list.mapv(Name))

# $`24`
# [1] "Ken"   "Penny"
# 
# $`25`
# [1] "James"

The mapping runs at the first level, that is, once for each group. The mapper expression . %>>% list.mapv(Name) maps each person in the group to their name.

The same logic allows us to do another grouping by the number of Interests and then to see their names.

people %>>%
  list.group(length(Interests)) %>>%
  list.map(. %>>% list.mapv(Name))

# $`2`
# [1] "James" "Penny"
# 
# $`3`
# [1] "Ken"
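
For single-valued keys, base R's split() produces the same kind of grouping: one named element per distinct key value. The sketch below reproduces the two groupings by hand, using vectors transcribed from the sample data.

```r
# Keys transcribed from the sample data (Ken, James, Penny).
name <- c("Ken", "James", "Penny")
age  <- c(24L, 25L, 24L)

# Group names by age; group labels become the list names.
by_age <- split(name, age)

# The even/odd example from above, written with split().
parity <- split(1:10, 1:10 %% 2 == 0)

str(by_age)
```
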
 - list.ungroup

list.group() produces a nested list in which the first level are groups and the second level are the original list elements put into different groups. 

list.ungroup() reverts this process. In other words, the function eradicates the group level of a list.

ageGroups <- list.group(people, Age)
str(list.ungroup(ageGroups))

# List of 3
#  $ :List of 4
#   ..$ Name     : chr "Ken"
#   ..$ Age      : int 24
#   ..$ Interests: chr [1:3] "reading" "music" "movies"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 2
#   .. ..$ CSharp: int 4
#   .. ..$ Python: int 3
#  $ :List of 4
#   ..$ Name     : chr "Penny"
#   ..$ Age      : int 24
#   ..$ Interests: chr [1:2] "movies" "reading"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 1
#   .. ..$ Cpp   : int 4
#   .. ..$ Python: int 2
#  $ :List of 4
#   ..$ Name     : chr "James"
#   ..$ Age      : int 25
#   ..$ Interests: chr [1:2] "sports" "music"
#   ..$ Expertise:List of 3
#   .. ..$ R   : int 3
#   .. ..$ Java: int 2
#   .. ..$ Cpp : int 5
 - list.cases

In non-relational data structures, a field can be a vector of multiple values. list.cases() is used to find out all possible cases by evaluating a vector-valued expression for each list element.

In data people, field Interests is usually a character vector of multiple values. The following code will find out all possible Interests for all list elements.

list.cases(people, Interests)

# [1] "movies"  "music"   "reading" "sports"

Similar code finds all the programming languages the developers use.

list.cases(people, names(Expertise))

# [1] "Cpp"    "CSharp" "Java"   "Python" "R"
 - list.class

list.class() groups list elements by cases, that is, it categorizes them by examining whether the value of a given expression for each list element includes the case. As a result, the function produces a long and nested list in which the first level denotes all the cases, and the second level includes the original list elements.

Since each list element may belong to multiple cases, the classification is not exclusive. You may find one list element belonging to multiple cases in the resulting list.

If the expression is itself single-valued and thus exclusive, then the result is the same as that produced by list.group(). For example,

1:10 %>>%
  list.class(. %% 2 == 0)

# $`FALSE`
# [1] 1 3 5 7 9
# 
# $`TRUE`
# [1]  2  4  6  8 10

If the value of the expression is not single-valued, then list.class() and list.group() behave differently. For example, we perform case classification by Interests:

str(list.class(people, Interests))

# List of 4
#  $ movies :List of 2
#   ..$ :List of 4
#   .. ..$ Name     : chr "Ken"
#   .. ..$ Age      : int 24
#   .. ..$ Interests: chr [1:3] "reading" "music" "movies"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R     : int 2
#   .. .. ..$ CSharp: int 4
#   .. .. ..$ Python: int 3
#   ..$ :List of 4
#   .. ..$ Name     : chr "Penny"
#   .. ..$ Age      : int 24
#   .. ..$ Interests: chr [1:2] "movies" "reading"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R     : int 1
#   .. .. ..$ Cpp   : int 4
#   .. .. ..$ Python: int 2
#  $ music  :List of 2
#   ..$ :List of 4
#   .. ..$ Name     : chr "Ken"
#   .. ..$ Age      : int 24
#   .. ..$ Interests: chr [1:3] "reading" "music" "movies"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R     : int 2
#   .. .. ..$ CSharp: int 4
#   .. .. ..$ Python: int 3
#   ..$ :List of 4
#   .. ..$ Name     : chr "James"
#   .. ..$ Age      : int 25
#   .. ..$ Interests: chr [1:2] "sports" "music"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R   : int 3
#   .. .. ..$ Java: int 2
#   .. .. ..$ Cpp : int 5
#  $ reading:List of 2
#   ..$ :List of 4
#   .. ..$ Name     : chr "Ken"
#   .. ..$ Age      : int 24
#   .. ..$ Interests: chr [1:3] "reading" "music" "movies"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R     : int 2
#   .. .. ..$ CSharp: int 4
#   .. .. ..$ Python: int 3
#   ..$ :List of 4
#   .. ..$ Name     : chr "Penny"
#   .. ..$ Age      : int 24
#   .. ..$ Interests: chr [1:2] "movies" "reading"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R     : int 1
#   .. .. ..$ Cpp   : int 4
#   .. .. ..$ Python: int 2
#  $ sports :List of 1
#   ..$ :List of 4
#   .. ..$ Name     : chr "James"
#   .. ..$ Age      : int 25
#   .. ..$ Interests: chr [1:2] "sports" "music"
#   .. ..$ Expertise:List of 3
#   .. .. ..$ R   : int 3
#   .. .. ..$ Java: int 2
#   .. .. ..$ Cpp : int 5

We get a list of sub-lists named after all possible interests, and each sub-list contains all list elements whose interests include the corresponding interest.

Similar to building nested pipelines in list.group() examples, we can get the people's names in each class.

people %>>%
  list.class(Interests) %>>%
  list.map(. %>>% list.mapv(Name))

# $movies
# [1] "Ken"   "Penny"
# 
# $music
# [1] "Ken"   "James"
# 
# $reading
# [1] "Ken"   "Penny"
# 
# $sports
# [1] "James"

Exactly the same logic applies when we want to know the people's names classified by the programming languages listed as their expertise:

people %>>%
  list.class(names(Expertise)) %>>%
  list.map(. %>>% list.mapv(Name))

# $Cpp
# [1] "James" "Penny"
# 
# $CSharp
# [1] "Ken"
# 
# $Java
# [1] "James"
# 
# $Python
# [1] "Ken"   "Penny"
# 
# $R
# [1] "Ken"   "James" "Penny"
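
The non-exclusive nature of classification is easy to see in a base R sketch: first collect all distinct cases, then test membership of each case in each element. The named interests list is a reduced version of the sample data introduced for this example; this is not how list.cases() or list.class() are implemented.

```r
interests <- list(
  Ken   = c("reading", "music", "movies"),
  James = c("sports", "music"),
  Penny = c("movies", "reading")
)

# All distinct cases, sorted (comparable to list.cases()).
cases <- sort(unique(unlist(interests, use.names = FALSE)))

# For each case, the people whose interests include it
# (comparable to list.class() followed by mapping to names).
by_case <- lapply(stats::setNames(cases, cases), function(case) {
  names(Filter(function(v) case %in% v, interests))
})

str(by_case)
```

Ken appears under three cases, which is exactly what distinguishes this from the one-group-per-element result of grouping.
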
 - list.common

This function returns the common cases by evaluating a given expression for all list elements.

Get the common Interests of all developers.

list.common(people, Interests)

# character(0)

We conclude that no interest is common to everyone. Let's see if there is any programming language they all use.

list.common(people, names(Expertise))

# [1] "R"
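
Finding the common cases amounts to intersecting the values across all elements. In base R this can be sketched with Reduce() and intersect(); the two input lists below are transcribed from the sample data.

```r
interests <- list(
  c("reading", "music", "movies"),
  c("sports", "music"),
  c("movies", "reading")
)
languages <- list(
  c("R", "CSharp", "Python"),
  c("R", "Java", "Cpp"),
  c("R", "Cpp", "Python")
)

# Fold intersect() over the list: keep only values present in every element.
common_interests <- Reduce(intersect, interests)
common_languages <- Reduce(intersect, languages)
```
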
 - list.table

table() builds a contingency table of the counts at each combination of factor levels using cross-classifying factors. list.table() is a wrapper that creates a table in which each dimension results from the values for an expression.

The function is handy as a counter. The following example shows an easy way to count how many integers from 1 to 1000 leave each possible remainder when divided by 3.

list.table(1:1000, . %% 3)

# 
#   0   1   2 
# 333 334 333

For the people dataset, we can build a two-dimensional table to show the distribution of number of interests and age.

list.table(people, Interests=length(Interests), Age)

#          Age
# Interests 24 25
#         2  1  1
#         3  1  0
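
Since list.table() wraps table(), both results can be reproduced by evaluating the expressions by hand and passing the vectors to table() directly. The vectors below are transcribed from the sample data.

```r
# Remainder counts for 1..1000 divided by 3.
remainders <- table((1:1000) %% 3)

# Number of interests and age for Ken, James, Penny.
nint <- c(3L, 2L, 2L)
age  <- c(24L, 25L, 24L)

# Argument names become the dimension names of the table.
tab <- table(Interests = nint, Age = age)
tab
```
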

                    

Joining

list.join() joins two lists by certain expressions and list.merge() merges a series of named lists.

library(rlist)
library(pipeR)
people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json") %>>%
  list.names(Name)

 - list.join

list.join() is used to join two lists by a key evaluated from either a common expression for the two lists or two separate expressions for each list.

newinfo <-
  list(
    list(Name="Ken", Email="ken@xyz.com"),
    list(Name="Penny", Email="penny@xyz.com"),
    list(Name="James", Email="james@xyz.com"))
str(list.join(people, newinfo, Name))

# List of 3
#  $ Ken  :List of 5
#   ..$ Name     : chr "Ken"
#   ..$ Age      : int 24
#   ..$ Interests: chr [1:3] "reading" "music" "movies"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 2
#   .. ..$ CSharp: int 4
#   .. ..$ Python: int 3
#   ..$ Email    : chr "ken@xyz.com"
#  $ James:List of 5
#   ..$ Name     : chr "James"
#   ..$ Age      : int 25
#   ..$ Interests: chr [1:2] "sports" "music"
#   ..$ Expertise:List of 3
#   .. ..$ R   : int 3
#   .. ..$ Java: int 2
#   .. ..$ Cpp : int 5
#   ..$ Email    : chr "james@xyz.com"
#  $ Penny:List of 5
#   ..$ Name     : chr "Penny"
#   ..$ Age      : int 24
#   ..$ Interests: chr [1:2] "movies" "reading"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 1
#   .. ..$ Cpp   : int 4
#   .. ..$ Python: int 2
#   ..$ Email    : chr "penny@xyz.com"
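
The join can be pictured as a key lookup followed by a field-wise merge. The base R sketch below assumes that every key in people has exactly one match in newinfo; it is an illustration of the idea rather than the algorithm used by list.join().

```r
# Reduced, named copy of the sample data.
people <- list(
  Ken   = list(Name = "Ken", Age = 24L),
  James = list(Name = "James", Age = 25L),
  Penny = list(Name = "Penny", Age = 24L)
)
newinfo <- list(
  list(Name = "Ken", Email = "ken@xyz.com"),
  list(Name = "Penny", Email = "penny@xyz.com"),
  list(Name = "James", Email = "james@xyz.com")
)

# Evaluate the key (Name) on both sides and find each person's partner.
key_x <- vapply(people, function(p) p$Name, character(1))
key_y <- vapply(newinfo, function(p) p$Name, character(1))
pos <- match(key_x, key_y)   # assumes no NA, i.e. every key has a match

# Merge the matched record into each person; names come from `people`.
joined <- Map(function(p, extra) utils::modifyList(p, extra), people, newinfo[pos])
str(joined)
```
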
 - list.merge

list.merge() recursively merges a series of lists, with later lists always updating earlier ones. It works with two lists, as in the example below, where a revision is merged into the original list.

More specifically, the merge works in a way that lists are partially updated, which allows us to specify only the fields we want to update or add for a list element, or use NULL to remove a field.

rev1 <-
  list(
    Ken = list(Age=25),
    James = list(Expertise = list(R=2, Cpp=4)),
    Penny = list(Expertise = list(R=2, Python=NULL)))
str(list.merge(people,rev1))

# List of 3
#  $ Ken  :List of 4
#   ..$ Name     : chr "Ken"
#   ..$ Age      : num 25
#   ..$ Interests: chr [1:3] "reading" "music" "movies"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 2
#   .. ..$ CSharp: int 4
#   .. ..$ Python: int 3
#  $ James:List of 4
#   ..$ Name     : chr "James"
#   ..$ Age      : int 25
#   ..$ Interests: chr [1:2] "sports" "music"
#   ..$ Expertise:List of 3
#   .. ..$ R   : num 2
#   .. ..$ Java: int 2
#   .. ..$ Cpp : num 4
#  $ Penny:List of 4
#   ..$ Name     : chr "Penny"
#   ..$ Age      : int 24
#   ..$ Interests: chr [1:2] "movies" "reading"
#   ..$ Expertise:List of 2
#   .. ..$ R  : num 2
#   .. ..$ Cpp: int 4

The function also works with more than two lists. When a second revision arrives, the three lists can be merged in order.

rev2 <-
  list(
    James = list(Expertise=list(CSharp = 5)),
    Penny = list(Age = 24,Expertise=list(R = 3)))
str(list.merge(people,rev1, rev2))

# List of 3
#  $ Ken  :List of 4
#   ..$ Name     : chr "Ken"
#   ..$ Age      : num 25
#   ..$ Interests: chr [1:3] "reading" "music" "movies"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 2
#   .. ..$ CSharp: int 4
#   .. ..$ Python: int 3
#  $ James:List of 4
#   ..$ Name     : chr "James"
#   ..$ Age      : int 25
#   ..$ Interests: chr [1:2] "sports" "music"
#   ..$ Expertise:List of 4
#   .. ..$ R     : num 2
#   .. ..$ Java  : int 2
#   .. ..$ Cpp   : num 4
#   .. ..$ CSharp: num 5
#  $ Penny:List of 4
#   ..$ Name     : chr "Penny"
#   ..$ Age      : num 24
#   ..$ Interests: chr [1:2] "movies" "reading"
#   ..$ Expertise:List of 2
#   .. ..$ R  : num 3
#   .. ..$ Cpp: int 4

Note that list.merge() only works with named lists; otherwise the function cannot know which list elements correspond to each other.
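
The partial-update rules described above (overwrite, add, remove with NULL) can be tried out on a small scale with base R's modifyList(), which behaves analogously for named lists. This is a minimal sketch on hypothetical toy data (`original`, `revision`), not a statement about how list.merge() is implemented.

```r
original <- list(
  Ken = list(Age = 24L, Expertise = list(R = 2L, Python = 3L)))
revision <- list(
  Ken = list(Age = 25L, Expertise = list(Python = NULL, Cpp = 1L)))

# Fields present in the revision overwrite or extend the original;
# a NULL field removes the corresponding entry; other fields are kept.
merged <- modifyList(original, revision)
str(merged)
```

Elements are matched purely by name here, which is why unnamed lists cannot be merged this way.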

                    

Searching

rlist can search a list recursively for values that satisfy a condition. list.search() covers a variety of such searches.

library(rlist)
library(pipeR)
friends <- list.load("http://renkun.me/rlist-tutorial/data/friends.json")

If the expression results in a single-valued logical vector and its value is TRUE, the whole vector will be collected. If it results in a multi-valued or non-logical vector, the non-NA results will be collected.

Search recursively for all elements equal to Ken.

list.search(friends, . == "Ken")

# $Ken.Name
# [1] "Ken"
# 
# $James.Friends
# [1]  TRUE FALSE
# 
# $Penny.Friends
# [1] FALSE FALSE

Note that . represents every atomic vector in the list and its sublists. For a single-valued vector, the search expression results in TRUE or FALSE, indicating whether or not to return the vector. For a multi-valued vector, the search expression instead results in a multi-valued logical vector, which is collected as is (see James.Friends and Penny.Friends above) and is not a meaningful search result.

To find all vectors that include Ken, we can use %in%, which always returns a single TRUE or FALSE for this dataset.

list.search(friends, "Ken" %in% .)

# $Ken.Name
# [1] "Ken"
# 
# $James.Friends
# [1] "Ken"   "Penny"

If the search expression returns a non-logical vector with non-NA values, then these values are returned. For example, extract every value equal to Ken.

list.search(friends, .[. == "Ken"])

# $Ken.Name
# [1] "Ken"
# 
# $James.Friends
# [1] "Ken"

The selector can be very flexible. We can use a regular expression in the search expression. For example, search for all values that match the pattern en, that is, contain en in the text.

list.search(friends, .[grepl("en",.)])

# $Ken.Name
# [1] "Ken"
# 
# $James.Friends
# [1] "Ken"   "Penny"
# 
# $Penny.Name
# [1] "Penny"
# 
# $David.Friends
# [1] "Penny"
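
The search expressions used so far differ only in what they return for a given vector. The base R sketch below evaluates each of them on a single vector `v` (a hypothetical stand-in for the Friends field of James) to show the kind of value list.search() receives in each case.

```r
v <- c("Ken", "Penny")

by_value   <- v == "Ken"          # element-wise: one logical per value
contains   <- "Ken" %in% v        # a single TRUE/FALSE for the whole vector
equal_vals <- v[v == "Ken"]       # the matching values themselves
regex_vals <- v[grepl("en", v)]   # values that match a regular expression

by_value
contains
equal_vals
regex_vals
```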

The above examples demonstrate how searching can be done recursively using list.search(). However, by default the function evaluates the expression against sub-elements of all types. For example, if we look for character values of 24,

list.search(friends, . == "24")

# $Ken.Age
# [1] 24
# 
# $James.Friends
# [1] FALSE FALSE
# 
# $Penny.Age
# [1] 24
# 
# $Penny.Friends
# [1] FALSE FALSE

the integer values are returned too. This is because, when R evaluates the following expression,

24 == "24"

# [1] TRUE

the number 24 is coerced to the string "24", so the two values compare equal. This is how == compares atomic vectors of different types. However, this behavior is not always desirable in practice. If we want to limit the search to character vectors rather than vectors of any class, we have to specify the classes = argument of list.search().

list.search(friends, . == "24", classes = "character")

# $James.Friends
# [1] FALSE FALSE
# 
# $Penny.Friends
# [1] FALSE FALSE

This time no character value is found to equal 24. To improve the search performance and safety, it is always recommended to explicitly specify the classes to search so as to avoid undesired coercion which might lead to unexpected results.
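
The effect of restricting classes can be reproduced with base R's rapply(), which also visits every atomic vector of a nested list and accepts a classes argument. The sketch below uses hypothetical toy data (`friends_small`) and a helper `is_24`; it illustrates the coercion issue only and makes no claim about how list.search() works internally.

```r
friends_small <- list(
  Ken   = list(Name = "Ken",   Age = 24L, Friends = "James"),
  James = list(Name = "James", Age = 25L, Friends = c("Ken", "Penny")))

# TRUE when every value of the vector compares equal to the string "24".
is_24 <- function(v) all(v == "24")

# Without a class restriction the integer 24L is coerced and matches.
hits_all <- rapply(friends_small, is_24, how = "unlist")

# Restricted to character vectors, the integer ages are never evaluated.
hits_chr <- rapply(friends_small, is_24, classes = "character", how = "unlist")

hits_all
hits_chr
```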

In some cases, the search results are deeply nested and hard to read. We can then set unlist = TRUE so that an atomic vector is returned.

list.search(friends, .[grepl("en",.)], "character", unlist = TRUE)

#       Ken.Name James.Friends1 James.Friends2     Penny.Name  David.Friends 
#          "Ken"          "Ken"        "Penny"        "Penny"        "Penny"

Sometimes we don't need that many results. We can set n = to limit the number of results. As the output below shows, n counts matching vectors rather than individual values, so n = 3 returns four values here.

list.search(friends, .[grepl("en",.)], "character", n = 3, unlist = TRUE)

#       Ken.Name James.Friends1 James.Friends2     Penny.Name 
#          "Ken"          "Ken"        "Penny"        "Penny"

Like other rlist functions, the search expression can be a lambda expression. However, list.search() does not support the name meta-symbol in the search expression yet. In other words, you cannot use .name to represent the name of the element. You can use .i to represent the number of vectors that have been checked, and .n to represent the number of vectors that satisfy the condition.

                    

Comparers

list.filter() and list.search() are two major functions to find values that meet certain conditions. The condition is most likely to be a comparison, which can be done by exact comparing, atomic comparing, pattern matching by regular expression, string distance comparing, and so on.

On this page, we introduce these comparers together with the filtering and searching functions, and show how to perform logical and fuzzy data selection.

library(rlist)
library(pipeR)
people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json")
friends <- list.load("http://renkun.me/rlist-tutorial/data/friends.json")

 - Precise comparers

Precise comparers are functions that compare the source value with a specific target value and tell whether the two are equal.

 - Exact comparer

Exact comparing can be done with identical(), a built-in function that tells whether two objects are exactly the same in terms of type, value, and attributes.



NOTE: identical() is perhaps the strictest comparer that returns FALSE if any difference is spotted.



Two vectors that have equal values may not be identical: they may not have the same type or the same attributes. For example, two vectors having equal values

c(1,2,3) == 1:3

# [1] TRUE TRUE TRUE

may not be identical.

identical(c(1,2,3), 1:3)

# [1] FALSE

This happens because c(1,2,3) is a numeric vector while 1:3 produces an integer vector. == first coerces the integer vector to a numeric vector and then compares the values, but identical() directly checks whether they are exactly the same.

In addition, the names of the vector make a difference too. Even if the values are exactly the same, a difference in names also fails the check in identical().

c(a=1,b=2,c=3) == c(1,2,3)

#    a    b    c 
# TRUE TRUE TRUE

identical(c(a=1,b=2,c=3), c(1,2,3))

# [1] FALSE

This happens because names are in fact one of the common attributes of a vector object, and attributes are on the checklist of identical().

Now that we know the difference between exact comparing (identical()) and value comparing (==), we filter the people by whether their Name is exactly identical to Ken.
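
The two sources of mismatch, type and attributes, can be removed one at a time to see identical() flip to TRUE. A short base R sketch; the variable names are introduced here for illustration only.

```r
x <- c(a = 1, b = 2, c = 3)

same_with_names    <- identical(x, c(1, 2, 3))                 # names differ
same_without_names <- identical(unname(x), c(1, 2, 3))         # names dropped
same_type          <- identical(as.integer(c(1, 2, 3)), 1:3)   # both integer

# The names are stored as an attribute of the vector.
attributes(x)

c(same_with_names, same_without_names, same_type)
```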

people %>>%
  list.filter(identical(Name, "Ken")) %>>%
  str

# List of 1
#  $ :List of 4
#   ..$ Name     : chr "Ken"
#   ..$ Age      : int 24
#   ..$ Interests: chr [1:3] "reading" "music" "movies"
#   ..$ Expertise:List of 3
#   .. ..$ R     : int 2
#   .. ..$ CSharp: int 4
#   .. ..$ Python: int 3

Only people whose Name is exactly the same as the character vector "Ken" will be singled out.

We can also use it in searching. For example, search all vectors exactly identical to "Ken".

list.search(friends, identical(., "Ken"))

# $Ken.Name
# [1] "Ken"

Only values that are identical to character vector "Ken" will be put in the resulting list. We can also unlist the result.

list.search(friends, identical(., "Ken"), unlist = TRUE)

# Ken.Name 
#    "Ken"

Then, we search all values identical to c("Ken","Penny").

list.search(friends, identical(., c("Ken","Penny")))

# $James.Friends
# [1] "Ken"   "Penny"

Next, we search values exactly identical to numeric value 24.

list.search(friends, identical(., 24))

# named list()

Nothing is found. If you are familiar with how identical() works, as described above, this should not be surprising. If you take a look at the data,

str(friends)

# List of 4
#  $ Ken  :List of 3
#   ..$ Name   : chr "Ken"
#   ..$ Age    : int 24
#   ..$ Friends: chr "James"
#  $ James:List of 3
#   ..$ Name   : chr "James"
#   ..$ Age    : int 25
#   ..$ Friends: chr [1:2] "Ken" "Penny"
#  $ Penny:List of 3
#   ..$ Name   : chr "Penny"
#   ..$ Age    : int 24
#   ..$ Friends: chr [1:2] "James" "David"
#  $ David:List of 3
#   ..$ Name   : chr "David"
#   ..$ Age    : int 25
#   ..$ Friends: chr "Penny"

you will find that the ages are all stored as integers rather than numerics. Therefore, searching exact integers will work.

list.search(friends, identical(., 24L))

# $Ken.Age
# [1] 24
# 
# $Penny.Age
# [1] 24

 - Value comparer

== compares two atomic vectors by value and returns a logical vector indicating whether each pair of values, coerced to a common type, is equal. This comparison mechanism allows for more flexibility and can be useful in a wide range of situations.

For example, we search for all vectors that include at least one of "Ken" and "Penny".

list.search(friends, any(c("Ken","Penny") %in% .), unlist = TRUE)

#       Ken.Name James.Friends1 James.Friends2     Penny.Name  David.Friends 
#          "Ken"          "Ken"        "Penny"        "Penny"        "Penny"

Similarly, we search all numeric and integer values equal to 24.

list.search(friends, . == 24, c("numeric","integer"), unlist = TRUE)

#   Ken.Age Penny.Age 
#        24        24

When the code above is being evaluated, all numeric vectors and integer vectors are evaluated by . == 24 recursively in friends where . represents the vector.

 - Fuzzy comparers

Fuzzy comparers can be useful in a wide range of situations. In many cases, the filtering of data, mainly string data, is not clear-cut, that is, we don't know exactly the value or range to select. We will cover two main types of fuzzy comparers.

 - Regular expression

One type of fuzzy filtering device is the string pattern. It uses meta-symbols to represent a range of possible strings. Then all values that match this pattern can be selected.

For example, if we need to find all companies in a list with a domain name that ends with .com or .org, we can use a regular expression to tell whether a string matches a specific pattern.
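
As a minimal sketch of that domain check, the pattern below anchors the match at the end of the string. The `domains` vector is hypothetical sample data.

```r
domains <- c("example.com", "r-project.org", "example.net", "example.com.cn")

# "\\." matches a literal dot, (com|org) either suffix, "$" the end of string.
is_com_or_org <- grepl("\\.(com|org)$", domains)

domains[is_com_or_org]
```

Without the trailing $, "example.com.cn" would match as well, because ".com" appears inside it.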

For the people dataset, we can find the names and ages of all those whose name includes "en" using grepl(), which returns TRUE or FALSE indicating whether the string matches a given pattern.

people %>>%
  list.filter(grepl("en", Name)) %>>%
  list.select(Name, Age) %>>%
  list.stack

#    Name Age
# 1   Ken  24
# 2 Penny  24

Regular expressions are flexible enough to represent a wide range of string patterns. Many websites introduce them, for example:


RegexOne: An interactive tutorial
RegExr: An online string pattern tester


Learning more about regular expressions will certainly be rewarding in string-related data manipulation.

 - String distance

The other type of fuzzy comparer is the string distance measure. It is particularly useful if the data source is not clean enough to contain only consistent texts. For example, if an object appears under many names with very close spellings but slight differences or misspellings, a string-distance comparer can be useful.

stringdist is an R package that implements a rich collection of string distance measures. Basically, a string distance measure can tell you if two strings are close or not.

library(stringdist)
stringdist("a","b")

# [1] 1

The distance between "a" and "b" is 1 because "a" can be transformed to "b" in a single elementary step in terms of the restricted Damerau-Levenshtein distance, which is the default string distance measure used by stringdist().

stringdist("helo","hello")

# [1] 1

The distance between "helo" and "hello" is also 1 because one only needs to add a letter to transform the first string to the second, or vice versa. The string distance measure largely tolerates minor mis-spellings or slight variants between strings.

If you prefer another distance measure, specify the method = argument. All possible values are listed in the documentation of the stringdist package.

stringdist("hao","hello",method = "dl")

# [1] 3
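
If stringdist is not installed, base R's adist() (from the utils package) offers a related measure, the generalized Levenshtein distance. It is a swapped-in alternative rather than the same algorithm: it agrees with the values above for these inputs, but it has no transposition step, so swapping two adjacent letters costs two edits instead of one.

```r
d_ab    <- adist("a", "b")[1, 1]
d_helo  <- adist("helo", "hello")[1, 1]
d_hao   <- adist("hao", "hello")[1, 1]

# A swap of adjacent letters is counted as two single-character edits here.
d_swap  <- adist("ab", "ba")[1, 1]

c(d_ab, d_helo, d_hao, d_swap)
```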

The string distance functions work with filtering functions in rlist. Consider the following data.

people1 <- list(
    p1 = list(name="Ken",age=24),
    p2 = list(name="Kent",age=26),
    p3 = list(name="Sam",age=24),
    p4 = list(name="Keynes",age=30),
    p5 = list(name="Kwen",age=31))

We can use stringdist() from the stringdist package with list.filter(). For example, find all list elements whose name is within distance 1 of "Ken", and output their names as a named character vector.

people1 %>>%
  list.filter(stringdist(name,"Ken") <= 1) %>>%
  list.mapv(name)

#     p1     p2     p5 
#  "Ken" "Kent" "Kwen"

Consider the following list.

people2 <- list(
    p1 = list(name=c("Ken", "Ren"),age=24),
    p2 = list(name=c("Kent", "Potter"),age=26),
    p3 = list(name=c("Sam", "Lee"),age=24),
    p4 = list(name=c("Keynes", "Bond"),age=30),
    p5 = list(name=c("Kwen", "Hu"),age=31))

If we want to find the name vectors in which at least one name is similar to "ken", with maximum distance 2, we can run

people2 %>>%
  list.search(any(stringdist(., "ken") <= 2), "character") %>>%
  str

# List of 4
#  $ p1.name: chr [1:2] "Ken" "Ren"
#  $ p2.name: chr [1:2] "Kent" "Potter"
#  $ p3.name: chr [1:2] "Sam" "Lee"
#  $ p5.name: chr [1:2] "Kwen" "Hu"

We can also search the character vectors for terms within distance 1 of "Ken" and single out only the matching values.

people2 %>>%
  list.search(.[stringdist(., "Ken") <= 1], "character") %>>%
  str

# List of 3
#  $ p1.name: chr [1:2] "Ken" "Ren"
#  $ p2.name: chr "Kent"
#  $ p5.name: chr "Kwen"

stringdist even provides a Soundex-based string distance measure. We can use it to find texts that sound alike. For example, we can find all people whose first name or last name sounds like Li.

people2 %>>%
  list.filter(any(stringdist(name, "Li", method = "soundex") == 0)) %>>%
  list.mapv(name %>>% paste0(collapse = " "))

#        p3 
# "Sam Lee"

                    

Input/Output

rlist provides various mechanisms for list data input and output. 

 - list.parse

list.parse() is used to convert an object to a list. For example, this function can convert a data.frame or matrix to a list with an identical structure.

library(rlist)
df1 <- data.frame(name=c("Ken","Ashley","James"),
  age=c(24,25,23), stringsAsFactors = FALSE)
str(list.parse(df1))

# List of 3
#  $ 1:List of 2
#   ..$ name: chr "Ken"
#   ..$ age : num 24
#  $ 2:List of 2
#   ..$ name: chr "Ashley"
#   ..$ age : num 25
#  $ 3:List of 2
#   ..$ name: chr "James"
#   ..$ age : num 23

This function also parses JSON or YAML format text.

jsontext <- '
[{ "name": "Ken", "age": 24 },
 { "name": "Ashley", "age": 25},
 { "name": "James", "age": 23 }]'
str(list.parse(jsontext, "json"))

# List of 3
#  $ :List of 2
#   ..$ name: chr "Ken"
#   ..$ age : int 24
#  $ :List of 2
#   ..$ name: chr "Ashley"
#   ..$ age : int 25
#  $ :List of 2
#   ..$ name: chr "James"
#   ..$ age : int 23

yamltext <- "
p1:
  name: Ken
  age: 24
p2:
  name: Ashley
  age: 25
p3:
  name: James
  age: 23
"
str(list.parse(yamltext, "yaml"))

# List of 3
#  $ p1:List of 2
#   ..$ name: chr "Ken"
#   ..$ age : int 24
#  $ p2:List of 2
#   ..$ name: chr "Ashley"
#   ..$ age : int 25
#  $ p3:List of 2
#   ..$ name: chr "James"
#   ..$ age : int 23

 - list.stack

list.stack() reverses list.parse() on a data frame, that is, it converts a list of homogeneous elements to a data frame with corresponding columns. In other words, the function stacks all list elements together, resulting in a data frame.

jsontext <- '
[{ "name": "Ken", "age": 24 },
 { "name": "Ashley", "age": 25},
 { "name": "James", "age": 23 }]'
data <- list.parse(jsontext, "json")
list.stack(data)

#     name age
# 1    Ken  24
# 2 Ashley  25
# 3  James  23

Note that a data frame is a much more efficient way to store tabular data, in which each column is a vector holding values of the same type. In R, a data frame is in essence a list of vectors. However, the data that rlist functions are designed to deal with can be non-tabular, which allows more flexible and looser data structures.

If we are sure about the structure of the resulting list and want to convert it to a data frame with an equivalent structure, list.stack() does the work.
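
What stacking amounts to can be sketched in base R: turn each record into a one-row data frame and bind the rows. This is an illustration under the assumption that all records share the same single-valued fields; `records` is hypothetical data and the code is not rlist's implementation.

```r
records <- list(
  list(name = "Ken",    age = 24L),
  list(name = "Ashley", age = 25L),
  list(name = "James",  age = 23L))

# One data frame row per record, then bind all rows together.
rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
stacked <- do.call(rbind, rows)
stacked
```

If the records had different fields, rbind() would fail, which is why the list has to be homogeneous before it is stacked.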

 - list.load, list.save

list.load() loads data from a JSON, YAML, RData, or RDS file. Its default behavior is to first look at file extension and then determine which data loader is used. If the file extension does not match JSON or YAML, it will use RData loader.

list.save() saves a list to a JSON, YAML, RData, or RDS file. Its default behavior is similar to that of list.load().

If the data are read or written by these two functions in JSON or YAML format, the data will be text-based and thus friendly to human readers. However, if a list contains complex objects such as S4 objects and language objects, a text-based format may not be appropriate for storing them. You should consider storing them in a binary format, i.e. an RData or RDS file.



NOTE: An RData file is created by save() and can be loaded by load(). It usually stores an environment in which multiple objects are bound. An RDS file is created by saveRDS() and can be loaded by readRDS(). It usually stores a single R object directly.
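
The difference between the two binary formats is easiest to see with the base R functions named in the note. A minimal sketch using temporary files; the list `x` is hypothetical and deliberately contains a language object, which a text-based format would not preserve faithfully.

```r
x <- list(name = "Ken", expr = quote(a + b))

# RDS: one object per file, returned directly by readRDS().
rds_file <- tempfile(fileext = ".rds")
saveRDS(x, rds_file)
from_rds <- readRDS(rds_file)

# RData: objects are stored under their names and restored into an environment.
rdata_file <- tempfile(fileext = ".RData")
save(x, file = rdata_file)
env <- new.env()
restored_names <- load(rdata_file, envir = env)

identical(from_rds, x)
restored_names
identical(env$x, x)
```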



list.load() in the latest version supports loading files specified by a character vector. It also supports loading files without file extensions by iteratively loading files by JSON and YAML loader.

 - list.serialize, list.unserialize

Serialization is the process of storing an object in a fully recoverable data format. list.serialize() and list.unserialize() provide a mechanism that builds on R's native serializer/unserializer and the JSON serializer/unserializer provided by jsonlite.
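
"Fully recoverable" can be checked directly with R's native serializer, which the text above refers to. The sketch below uses base R only (serialize() and unserialize() to a raw vector instead of a file) and hypothetical data `x`; it does not show the rlist wrappers themselves.

```r
x <- list(name = "Ken", scores = c(R = 2L, Python = 3L))

# Passing connection = NULL returns the serialized bytes as a raw vector.
bytes <- serialize(x, connection = NULL)
restored <- unserialize(bytes)

is.raw(bytes)
identical(restored, x)
```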

                    

Misc functions

rlist provides miscellaneous functions to assist data manipulation. These functions are mainly designed to alter the structure of a list object.

 - list.append, list.prepend

list.append() appends an element to a list and list.prepend() prepends an element to a list.

library(rlist)
list.append(list(a=1, b=1), c=1)

# $a
# [1] 1
# 
# $b
# [1] 1
# 
# $c
# [1] 1

list.prepend(list(b=1, c=2), a=0)

# $a
# [1] 0
# 
# $b
# [1] 1
# 
# $c
# [1] 2

Both functions also work with vectors.

list.append(1:3, 4)

# [1] 1 2 3 4

list.prepend(1:3, 0)

# [1] 0 1 2 3

The names of vector elements are preserved.

list.append(c(a=1,b=2), c=3)

# a b c 
# 1 2 3

list.prepend(c(b=2,c=3), a=1)

# a b c 
# 1 2 3

 - list.reverse

list.reverse() simply reverses a list or vector.

list.reverse(1:10)

#  [1] 10  9  8  7  6  5  4  3  2  1

 - list.zip

list.zip() combines multiple lists element-wise. In other words, the function takes the first element from each argument, then the second, and so on.

str(list.zip(a=c(1,2,3), b=c(4,5,6)))

# List of 3
#  $ :List of 2
#   ..$ a: num 1
#   ..$ b: num 4
#  $ :List of 2
#   ..$ a: num 2
#   ..$ b: num 5
#  $ :List of 2
#   ..$ a: num 3
#   ..$ b: num 6

The arguments need not be atomic vectors. They can be lists of any kind.

str(list.zip(x=list(1,"x"), y=list("y",2)))

# List of 2
#  $ :List of 2
#   ..$ x: num 1
#   ..$ y: chr "y"
#  $ :List of 2
#   ..$ x: chr "x"
#   ..$ y: num 2

The parameters do not have to be the same type.

str(list.zip(x=c(1,2), y=list("x","y")))

# List of 2
#  $ :List of 2
#   ..$ x: num 1
#   ..$ y: chr "x"
#  $ :List of 2
#   ..$ x: num 2
#   ..$ y: chr "y"
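
For comparison, the same element-wise pairing can be sketched in base R with Map(), which calls a function on the first elements of all arguments, then on the second, and so on. This is an analogy for the zipping idea, not the rlist implementation.

```r
# Each call receives one element per argument, under the argument's name.
zipped <- Map(list, a = c(1, 2, 3), b = c(4, 5, 6))
str(zipped)

# Arguments of different types can be mixed in the same way.
mixed <- Map(list, x = c(1, 2), y = list("x", "y"))
str(mixed)
```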

 - list.rbind, list.cbind

list.rbind() binds atomic vectors by row and list.cbind() by column.

scores <- list(score1=c(10,9,10),score2=c(8,9,6),score3=c(9,8,10))
list.rbind(scores)

#        [,1] [,2] [,3]
# score1   10    9   10
# score2    8    9    6
# score3    9    8   10

list.cbind(scores)

#      score1 score2 score3
# [1,]     10      8      9
# [2,]      9      9      8
# [3,]     10      6     10

Note that the two functions ultimately call rbind() and cbind(), respectively, which return a matrix or a data frame.

If a list of lists is supplied, then a matrix of lists will be created.

scores2 <- list(score1=list(10,9,10),
  score2=list(8,9,6),type=list("a","b","a"))
rscores2 <- list.rbind(scores2)
rscores2

#        [,1] [,2] [,3]
# score1 10   9    10  
# score2 8    9    6   
# type   "a"  "b"  "a"

rscores2 is a matrix of lists rather than atomic values.

rscores2[1,1]

# $score1
# [1] 10

rscores2[,1]

# $score1
# [1] 10
# 
# $score2
# [1] 8
# 
# $type
# [1] "a"

This is not a common practice and may lead to unexpected mistakes if one is not fully aware of it and takes for granted that the extracted value is an atomic value like a number or string. Therefore, it is not recommended to apply either list.rbind() or list.cbind() to a list of lists.
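
The pitfall can be reproduced with base rbind(), which the text above says these functions call. In the sketch below (hypothetical data), single-bracket indexing returns a list of length one, and double brackets are needed to reach the actual value.

```r
m <- rbind(score1 = list(10, 9, 10), type = list("a", "b", "a"))

typeof(m)            # the matrix is stored as a list, not as numbers
dim(m)

cell  <- m[1, 1]     # still a list, so arithmetic on it would fail
value <- m[[1, 1]]   # the underlying number

is.list(cell)
value + 1
```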

 - list.stack

To create a data.frame from a list of lists, use list.stack(). It is particularly useful when we want to transform non-tabular data to a stage where it actually fits a tabular form.

For example, a list of lists with the same single-entry fields can be transformed to an equivalent data frame.

nontab <- list(list(type="A",score=10),list(type="B",score=9))
list.stack(nontab)

#   type score
# 1    A    10
# 2    B     9

For non-tabular data, we can select fields or columns in the data and stack the records together to create a data frame.

library(pipeR)
list.load("http://renkun.me/rlist-tutorial/data/sample.json") %>>%
  list.select(Name, Age) %>>%
  list.stack

#    Name Age
# 1   Ken  24
# 2 James  25
# 3 Penny  24

 - list.flatten

A list is powerful because of its recursive nature. Sometimes, however, we don't need this recursive structure and want to flatten the list so that all its child elements are placed at the first level.

list.flatten() recursively extracts all elements at all levels and puts them at the first level.

data <- list(list(a=1,b=2),list(c=1,d=list(x=1,y=2)))
str(data)

# List of 2
#  $ :List of 2
#   ..$ a: num 1
#   ..$ b: num 2
#  $ :List of 2
#   ..$ c: num 1
#   ..$ d:List of 2
#   .. ..$ x: num 1
#   .. ..$ y: num 2

list.flatten(data)

# $a
# [1] 1
# 
# $b
# [1] 2
# 
# $c
# [1] 1
# 
# $d.x
# [1] 1
# 
# $d.y
# [1] 2
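
The recursion behind flattening can be sketched in a few lines of base R. The function `flatten_sketch` is a hypothetical helper written for illustration, not the code used by list.flatten(); it relies on c() to build the dotted names such as d.x.

```r
flatten_sketch <- function(x) {
  # A non-list value is a leaf: wrap it so that c() can combine leaves.
  if (!is.list(x)) return(list(x))
  # Flatten every child, then concatenate the results at one level.
  do.call(c, lapply(x, flatten_sketch))
}

nested <- list(list(a = 1, b = 2), list(c = 1, d = list(x = 1, y = 2)))
flat <- flatten_sketch(nested)
str(flat)
```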

 - list.names

list.names() can be used to set names of list elements by expression.

people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json") %>>%
  list.select(Name, Age)
str(people)

# List of 3
#  $ :List of 2
#   ..$ Name: chr "Ken"
#   ..$ Age : int 24
#  $ :List of 2
#   ..$ Name: chr "James"
#   ..$ Age : int 25
#  $ :List of 2
#   ..$ Name: chr "Penny"
#   ..$ Age : int 24

Note that the elements in people currently do not have names. In some cases, it would be nice to assign appropriate names to those elements so that the distinctive information can be preserved in list transformations.

npeople <- people %>>% 
  list.names(Name)
str(npeople)

# List of 3
#  $ Ken  :List of 2
#   ..$ Name: chr "Ken"
#   ..$ Age : int 24
#  $ James:List of 2
#   ..$ Name: chr "James"
#   ..$ Age : int 25
#  $ Penny:List of 2
#   ..$ Name: chr "Penny"
#   ..$ Age : int 24

The names of the list elements can be preserved in various types of data manipulation. For example,

npeople %>>%
  list.mapv(Age)

#   Ken James Penny 
#    24    25    24

The names of the resulting vector come directly from the names of the list elements.

 - list.sample

Sometimes it is useful to take a sample from a list. In weighted sampling, the weights are in most cases related to the individual elements. list.sample() is a wrapper of the built-in sample() that accepts the weight argument as an expression, which is evaluated for each list element to determine the weight of that element.

The following example shows a simple sampling from integers 1-10 by weight of squares.

set.seed(0)
list.sample(1:10, size = 3, weight = .^2)

# [1]  5 10  8
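
Since list.sample() is described as a wrapper of sample(), the weighting can be pictured as evaluating the weight expression for each element and passing the result as sampling probabilities. The base R sketch below assumes the weights are normalized to sum to one; the drawn values depend on the R version's random number generator, so no output is shown.

```r
set.seed(0)
values <- 1:10

# Evaluate the weight for every element, here the square of the value.
weights <- values^2

# Larger values are proportionally more likely to be drawn.
picked <- sample(values, size = 3, prob = weights / sum(weights))
picked
```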

                    

Lambda expression

Although the fields of each list element are directly accessible in the expression, sometimes we still need to access the list element itself, usually for its meta-information. Lambda expressions provide a mechanism that allows you to use default or customized meta-symbols to access the meta-information of the list element.

In rlist package, all functions that work with expressions support implicit lambda expressions, that is, an ordinary expression with no special syntax yet the fields of elements are directly accessible. All functions working with expressions except list.select() also support explicit lambda expression including


Univariate lambda expression: In contrast to implicit lambda expression, the symbol that refers to the element is customized in the following formats:
x ~ expression
f(x) ~ expression


Multivariate lambda expression: In contrast to univariate lambda expression, the symbols of element, index, and member name are customized in the following formats:
f(x,i) ~ expression
f(x,i,name) ~ expression




library(rlist)

 - Implicit lambda expression

An implicit lambda expression is an ordinary expression with no special syntax like ~. In this case, the meta-symbols are implicitly defined by default, that is, . represents the element, .i represents the index, and .name represents the name of the element.

For example,

x <- list(a=list(x=1,y=2),b=list(x=2,y=3))
list.map(x,y)

# $a
# [1] 2
# 
# $b
# [1] 3

list.map(x,sum(as.numeric(.)))

# $a
# [1] 3
# 
# $b
# [1] 5

In the second mapping above, . represents each element. For the first member, the meta-symbols take the following values:

. = x[[1]] = list(x=1,y=2)
.i = 1
.name = "a"
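
One way to picture how fields and meta-symbols become visible inside the expression is to evaluate it in a scope built from each element. The helper `map_sketch` below is a hypothetical base R illustration of that idea, not the mechanism rlist actually uses.

```r
map_sketch <- function(lst, expr) {
  out <- lapply(seq_along(lst), function(i) {
    el <- lst[[i]]
    # Fields of the element plus the three default meta-symbols.
    scope <- c(as.list(el), list(. = el, .i = i, .name = names(lst)[i]))
    eval(expr, scope)
  })
  names(out) <- names(lst)
  out
}

x <- list(a = list(x = 1, y = 2), b = list(x = 2, y = 3))
sums   <- map_sketch(x, quote(x + y))              # fields used directly
labels <- map_sketch(x, quote(paste(.name, .i)))   # meta-symbols used
str(sums)
str(labels)
```

Note that inside the expression x refers to the field x of the current element, not to the outer list of the same name.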

 - Explicit lambda expression

To use other symbols to represent the metadata of an element, we can use explicit lambda expressions.

x <- list(a=list(x=1,y=2),b=list(x=2,y=3))
list.map(x, f(item,index) ~ unlist(item) * index)

# $a
# x y 
# 1 2 
# 
# $b
# x y 
# 4 6

list.map(x, f(item,index,name) ~ list(name=name,sum=sum(unlist(item))))

# $a
# $a$name
# [1] "a"
# 
# $a$sum
# [1] 3
# 
# 
# $b
# $b$name
# [1] "b"
# 
# $b$sum
# [1] 5
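
For readers used to base R, the multivariate form f(item,index,name) ~ expression corresponds roughly to a three-argument function mapped over the elements, their positions, and their names. The sketch below reproduces the last result with Map(); it is an analogy rather than rlist's implementation.

```r
x <- list(a = list(x = 1, y = 2), b = list(x = 2, y = 3))

res <- Map(
  function(item, index, name) list(name = name, sum = sum(unlist(item))),
  x, seq_along(x), names(x))
str(res)
```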

For unnamed vector members, it is almost necessary to use lambda expressions.

x <- list(a=c(1,2),b=c(3,4))
list.map(x,sum(.))

# $a
# [1] 3
# 
# $b
# [1] 7

list.map(x,item ~ sum(item))

# $a
# [1] 3
# 
# $b
# [1] 7

list.map(x,f(m,i) ~ m+i)

# $a
# [1] 2 3
# 
# $b
# [1] 5 6

For named vector members, their name can also be directly used in the expression.

x <- list(a=c(x=1,y=2),b=c(x=3,y=4))
list.map(x,sum(y))

# $a
# [1] 2
# 
# $b
# [1] 4

list.map(x,x*y)

# $a
# [1] 2
# 
# $b
# [1] 12

list.map(x,.i)

# $a
# [1] 1
# 
# $b
# [1] 2

list.map(x,x+.i)

# $a
# [1] 2
# 
# $b
# [1] 5

list.map(x,f(.,i) ~ . + i)

# $a
# x y 
# 2 3 
# 
# $b
# x y 
# 5 6

list.map(x,.name)

# $a
# [1] "a"
# 
# $b
# [1] "b"


NOTE: list.select does not support explicit lambda expressions.



                    

List environment

List environment is an alternative construct designed for easier command chaining. The List() function wraps a list within an environment in which almost all functions in this package are defined, each returning the next List environment for further operations.

Suppose we work with the following list.

library(rlist)
library(pipeR)
people <- list.load("http://renkun.me/rlist-tutorial/data/sample.json")

To create a List environment, run

m <- List(people)

Then we can operate on the environment-based object m with map(), filter() and other functions, or extract the inner data with m$data. All inner functions return a List environment, which facilitates command chaining.

For example, map each member to their name.

m$map(Name)

# $data : list 
# ------
# [[1]]
# [1] "Ken"
# 
# [[2]]
# [1] "James"
# 
# [[3]]
# [1] "Penny"

Note that the resulting object is also a List environment, although its printed output includes the inner data. To use the result with external functions, we need to extract the inner data with $data.

Get all the possible cases of interests for those whose R experience is longer than 1 year.

m$filter(Expertise$R > 1)$
  cases(Interests)$
  data

# [1] "movies"  "music"   "reading" "sports"

Calculate an integer vector of the number of people in each interest class.

m$class(Interests)$
  map(case ~ length(case))$
  call(unlist)$
  data

#  movies   music reading  sports 
#       2       2       2       1

A handier way to extract data from the List environment is to use [].

m$class(Interests)$
  map(case ~ length(case))$
  call(unlist) []

#  movies   music reading  sports 
#       2       2       2       1

                    

Examples

The power of rlist functionality does not lie in a single function but in the combination of functions in a chain, which makes non-tabular data manipulation much easier. 

This chapter contains a few examples to demonstrate comprehensive uses of rlist functionality. The examples use real-world non-tabular data sources from GitHub API and OpenWeatherMap.

                    

GitHub API

GitHub is the most famous web-based source code hosting service in the world. Millions of developers choose GitHub to host their public code repositories. Many R packages are hosted by GitHub too.

In addition to the rich features it provides in project development and collaboration, GitHub also opens its API for developers to query the meta-data of users and repos. For example, we can directly get the following data:


My public profile on GitHub
Profile of rlist package
All my repos on GitHub


If you visit the links in your web browser, you will see the data presented in JSON format.

To make the data exploration more interesting, in the following examples we will explore Hadley Wickham's GitHub data with functions provided in rlist and see how rlist makes it easier to work with such non-tabular data structures.

We load rlist and pipeR packages first and then retrieve the repos.

library(rlist)
library(pipeR)
repos <- "https://api.github.com/users/hadley/repos?per_page=100&page=%d" %>>%
  sprintf(1:2) %>>%
  list.load("json") %>>%
  list.ungroup

Since the GitHub API limits the amount of data an ordinary user can retrieve at a time, we use page=%d to specify the page of data and request pages 1 and 2 via sprintf(1:2). Finally we turn the list of pages into a list of repos with list.ungroup().
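
The two steps around the download can be tried offline. In the base R sketch below, sprintf() expands the URL template into one URL per page, and a hypothetical `pages` list stands in for the downloaded data to show what removing one level of nesting does; unlist(recursive = FALSE) is used here as a stand-in for the ungrouping step.

```r
# One URL per page number; sprintf() is vectorized over its arguments.
urls <- sprintf(
  "https://api.github.com/users/hadley/repos?per_page=100&page=%d", 1:2)
urls

# Toy stand-in for the loaded pages: a list of pages, each a list of repos.
pages <- list(
  list(list(name = "repo1"), list(name = "repo2")),
  list(list(name = "repo3")))

# Drop the page level so that all repos sit in one list.
repos_toy <- unlist(pages, recursive = FALSE)
length(repos_toy)
```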

Before going into details, we review some figures and statistics. First, the number of repos:

list.count(repos)

# [1] 150

Then the structure of repos in terms of forks and non-forks:

repos %>>%
  list.table(fork)

# fork
# FALSE  TRUE 
#   114    36

GitHub shows the language structure of each individual repo. Here we summarize the language structure of all of Hadley's projects.

repos %>>% 
  list.filter(!is.null(language)) %>>%
  list.table(language) %>>%
  list.sort(-.)

# language
#          R        C++ JavaScript          C        TeX       Ruby 
#         95          9          8          6          5          3 
#      Shell       HTML     Python      Rebol        CSS     Turing 
#          3          2          2          2          1          1

Or we can show the table of language by fork:

repos %>>%
  list.table(language, fork)

#             fork
# language     FALSE TRUE
#   C              3    3
#   C++            7    2
#   CSS            0    1
#   HTML           1    1
#   JavaScript     5    3
#   Python         0    2
#   R             79   16
#   Rebol          2    0
#   Ruby           2    1
#   Shell          1    2
#   TeX            5    0
#   Turing         0    1
#   <NA>           9    4

Hadley has created several top-ranked popular packages. Let's build a bar chart to show the top 10 R repos with most stargazers.

repos %>>%
  list.filter(!fork, language == "R") %>>%
  list.names(name) %>>%
  list.mapv(stargazers_count) %>>%
  list.sort(-.) %>>%
  list.take(10) %>>%
  print %>>%
  barplot(main = "Hadley's top 10 R repos with most stargazers")

#   ggplot2  devtools      plyr     rvest      httr  testthat     tidyr 
#      1223       976       368       308       285       204       167 
# lubridate   reshape     purrr 
#       148       108        94



The pipeline itself is clear enough to show what happens in each step. We first filter the repos and pick out the non-fork R repos. Then we give names to the repo elements by their name field. Next we map each element to the count of stargazers, sort them in descending order, and take the top 10 elements. Finally, we build a bar chart from the named integer vector we created.
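
To follow the steps without network access, here is a small sketch of the same pipeline on made-up data. The repos_demo list and its repo names and star counts are hypothetical; only the field names mirror those of the GitHub data.

```r
library(rlist)
library(pipeR)

# Hypothetical repos with the fields used by the pipeline above.
repos_demo <- list(
  list(name = "alpha", fork = FALSE, language = "R",   stargazers_count = 120L),
  list(name = "beta",  fork = TRUE,  language = "R",   stargazers_count = 300L),
  list(name = "gamma", fork = FALSE, language = "C++", stargazers_count = 80L),
  list(name = "delta", fork = FALSE, language = "R",   stargazers_count = 45L),
  list(name = "eps",   fork = FALSE, language = "R",   stargazers_count = 210L)
)

top_stars <- repos_demo %>>%
  list.filter(!fork, language == "R") %>>%  # keep non-fork R repos
  list.names(name) %>>%                     # name each element by its repo name
  list.mapv(stargazers_count) %>>%          # named integer vector of star counts
  list.sort(-.) %>>%                        # descending order
  list.take(2)                              # top 2 instead of top 10

print(top_stars)
```

The fork beta is dropped even though it has the most stars, and gamma is dropped because it is not an R repo, so the result holds eps and alpha.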

Hadley is famous for his great contribution of ggplot2, so it should be no surprise that the bar chart shows ggplot2 as the package with the most stargazers.

Using exactly the same method, we can see the repos with the most open issues.

repos %>>%
  list.filter(has_issues, !fork, language == "R") %>>%
  list.names(name) %>>%
  list.mapv(open_issues) %>>%
  list.sort(-.) %>>%
  list.take(10) %>>%
  print %>>%
  barplot(main = "Hadley's top 10 R repos with most open issues")

#    ggplot2   devtools staticdocs  lubridate       plyr      tidyr 
#        113         62         41         37         34         32 
#     scales     gtable   testthat   roxygen3 
#         26         24         23         22



This time you should be able to figure out what is done in each step.

In addition to ggplot2, Hadley has some other visualization-related repos. To find them, we can filter the repo names and descriptions for plot or vis with a regular expression.

repos %>>%
  list.filter(any(grepl("plot|vis", c(name, description)))) %>>%
  list.sort(-stargazers_count) %>>%
  list.mapv(name)

#  [1] "ggplot2"         "bigvis"          "r2d3"           
#  [4] "ggplot2-book"    "gg2v"            "productplots"   
#  [7] "boxplots-paper"  "clusterfly"      "lvplot"         
# [10] "bigvis-infovis"  "densityvis"      "ggplot2-bayarea"
# [13] "layers"          "r-travis"        "toc-vis"        
# [16] "lvplot-paper"    "prodplotpaper"   "rblocks"        
# [19] "rminds"          "spatialVis"      "classifly"      
# [22] "fortify"         "ggplot"          "ggplot2-docs"   
# [25] "vis-migration"   "ggmap"           "imvisoned"      
# [28] "syuzhet"         "vega"

The quality of data filtering depends on your conditions. Not every repo shown above is related to data visualization. For example, r-travis has nothing to do with visualization although its name contains vis. To do better data analysis, we would have to think hard about the data. rlist functions attempt to lift the data processing burden from our shoulders so that we do not easily get stuck on such problems.
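
One way to tighten such a condition is sketched below with base R on a handful of names taken from the output above. The explicit exclusion of travis and the case-insensitive match are illustrative choices, not part of the original analysis; the same logical expression could be placed inside list.filter().

```r
# A few repo names from the result above.
repo_names <- c("ggplot2", "bigvis", "r-travis", "spatialVis", "lvplot")

# The loose pattern matches "vis" anywhere, including inside "travis",
# and misses "spatialVis" by name because of the capital V.
loose <- grepl("plot|vis", repo_names)

# Stricter condition: ignore case, then drop the known false positive.
strict <- grepl("plot|vis", repo_names, ignore.case = TRUE) &
  !grepl("travis", repo_names, ignore.case = TRUE)

print(repo_names[loose])
print(repo_names[strict])
```

Excluding known false positives by hand does not scale, but it shows that the filter is only as good as the condition it is given.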

To compute the sums of the stargazers, watchers and forks of all repos, we can first select the fields, stack them into a data frame, and sum by column.

repos %>>%
  list.select(stargazers_count, watchers_count, forks_count) %>>%
  list.stack %>>%
  colSums

# stargazers_count   watchers_count      forks_count 
#             7402             7402             3375
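
The intermediate objects are easier to see on a tiny example. This is a sketch with two made-up repos (the names and counts are hypothetical) that runs the same select, stack and sum steps one at a time.

```r
library(rlist)

# Hypothetical repos carrying more fields than we need.
repos_demo <- list(
  list(name = "a", stargazers_count = 10L, watchers_count = 10L, forks_count = 3L),
  list(name = "b", stargazers_count = 5L,  watchers_count = 5L,  forks_count = 1L)
)

# Step 1: keep only the three count fields of each repo.
selected <- list.select(repos_demo, stargazers_count, watchers_count, forks_count)

# Step 2: stack the list elements into a data frame, one row per repo.
stacked <- list.stack(selected)
print(stacked)

# Step 3: sum each column.
totals <- colSums(stacked)
print(totals)
```

Stacking works here because every selected element has the same fields, each holding a single value.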

We can also use fuzzy matching when we are not exactly sure about the term we need to find. For example, suppose you hear from a friend that Hadley's dplayer package is awesome, but you cannot find a package by that name. To find the exact name of the package, we can use the soundex method in the stringdist package.

repos %>>%
  list.filter(stringdist::stringdist("dplayer", name, method = "soundex") == 0) %>>%
  list.mapv(name)

# [1] "dplyr"         "dplyrimpaladb"

Cheers! Now we know the package that sounds like dplayer is actually named dplyr.
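
To see why the match works, the sketch below compares soundex codes directly. It assumes the stringdist package is installed; the candidates vector is a made-up sample of names, not the full repo list.

```r
# Candidate names; "dplayer" is the misheard name from the text above.
candidates <- c("dplayer", "dplyr", "plyr", "tidyr")

# phonetic() returns the soundex code of each string.
codes <- stringdist::phonetic(candidates, method = "soundex")
names(codes) <- candidates
print(codes)

# With method = "soundex", the distance is 0 when two codes are identical.
same_sound <- stringdist::stringdist("dplayer", candidates, method = "soundex") == 0
print(candidates[same_sound])
```

Soundex keeps the first letter and encodes the following consonants, so names that differ mainly in their vowels, such as dplayer and dplyr, receive the same code.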

                    

Weather API

OpenWeatherMap provides a set of weather APIs that are simple, clear and free. Using the API, we get access to current weather data, forecasts, historical data, and so on. The returned data is by default presented in JSON format, which can be easily loaded and processed by rlist functions.

 - Current weather data

The following code downloads the latest weather data of New York and London.

library(rlist)
library(pipeR)
weather <- "http://api.openweathermap.org/data/2.5/weather?q=%s" %>>%
  sprintf(c("New York,us", "London,uk")) %>>%
  list.load("json") %>>%
  list.names(name)

list.load() in the latest development version of rlist supports loading multiple files given by a character vector. Here we use sprintf() to construct that character vector from the URL template of a weather data query.

str(weather)

# List of 2
#  $ New York:List of 12
#   ..$ coord  :List of 2
#   .. ..$ lon: num -75.5
#   .. ..$ lat: int 43
#   ..$ sys    :List of 6
#   .. ..$ type   : int 3
#   .. ..$ id     : int 54023
#   .. ..$ message: num 0.361
#   .. ..$ country: chr "US"
#   .. ..$ sunrise: int 1427626102
#   .. ..$ sunset : int 1427671495
#   ..$ weather:List of 1
#   .. ..$ :List of 4
#   .. .. ..$ id         : int 800
#   .. .. ..$ main       : chr "Clear"
#   .. .. ..$ description: chr "sky is clear"
#   .. .. ..$ icon       : chr "02n"
#   ..$ base   : chr "stations"
#   ..$ main   :List of 5
#   .. ..$ temp    : num 266
#   .. ..$ pressure: int 1022
#   .. ..$ temp_min: num 264
#   .. ..$ temp_max: num 266
#   .. ..$ humidity: int 46
#   ..$ wind   :List of 3
#   .. ..$ speed: num 3.08
#   .. ..$ gust : num 4.11
#   .. ..$ deg  : int 293
#   ..$ snow   :List of 1
#   .. ..$ 3h: int 0
#   ..$ clouds :List of 1
#   .. ..$ all: int 8
#   ..$ dt     : int 1427590591
#   ..$ id     : int 5128638
#   ..$ name   : chr "New York"
#   ..$ cod    : int 200
#  $ London  :List of 12
#   ..$ coord  :List of 2
#   .. ..$ lon: num -0.13
#   .. ..$ lat: num 51.5
#   ..$ sys    :List of 6
#   .. ..$ type   : int 3
#   .. ..$ id     : int 40047
#   .. ..$ message: num 0.556
#   .. ..$ country: chr "GB"
#   .. ..$ sunrise: int 1427607706
#   .. ..$ sunset : int 1427653719
#   ..$ weather:List of 1
#   .. ..$ :List of 4
#   .. .. ..$ id         : int 802
#   .. .. ..$ main       : chr "Clouds"
#   .. .. ..$ description: chr "scattered clouds"
#   .. .. ..$ icon       : chr "03n"
#   ..$ base   : chr "stations"
#   ..$ main   :List of 5
#   .. ..$ temp    : num 283
#   .. ..$ humidity: int 68
#   .. ..$ pressure: num 1006
#   .. ..$ temp_min: num 282
#   .. ..$ temp_max: num 284
#   ..$ wind   :List of 3
#   .. ..$ speed: num 3.7
#   .. ..$ gust : num 6.5
#   .. ..$ deg  : int 181
#   ..$ rain   :List of 1
#   .. ..$ 3h: int 0
#   ..$ clouds :List of 1
#   .. ..$ all: int 48
#   ..$ dt     : int 1427590573
#   ..$ id     : int 2643743
#   ..$ name   : chr "London"
#   ..$ cod    : int 200

We can see that weather includes information about the city as well as the weather.

The weather API also supports box searching, that is, searching data from cities within a rectangle defined by geographic coordinates. The bbox parameter specifies the bounding box with the following values: lat of the top left point, lon of the top left point, lat of the bottom right point, lon of the bottom right point, map zoom.

zone <- "http://api.openweathermap.org/data/2.5/box/city?bbox=%s&cluster=yes" %>>%
  sprintf("12,32,15,37,10") %>>%
  list.load("json")

# Error in open.connection(con, "rb"): HTTP error 510.

Once we get the data, we can see the names of the cities in the zone.

zone$list %>>% 
  list.mapv(name)

# Error in zone$list %>>% list.mapv(name): object 'zone' not found

We can also build a table that shows the weather condition of these cities.

zone$list %>>% 
  list.table(weather[[1L]]$main)

# Error in zone$list %>>% list.table(weather[[1L]]$main): object 'zone' not found

For more details, we can group the data by weather condition and see the name list for each type of weather.

zone$list %>>%
  list.group(weather[[1L]]$main) %>>%
  list.map(. %>>% list.mapv(name))

# Error in zone$list %>>% list.group(weather[[1L]]$main): object 'zone' not found
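
The table and grouping steps can also be tried on local data. The following is a sketch under stated assumptions: cities_demo is a made-up list whose city names are hypothetical, and only its nesting (a weather list holding one element with a main field) mirrors the structure shown earlier by str(weather).

```r
library(rlist)

# Hypothetical cities shaped like elements of zone$list.
cities_demo <- list(
  list(name = "Avalon",  weather = list(list(main = "Clear"))),
  list(name = "Brindle", weather = list(list(main = "Clouds"))),
  list(name = "Corvid",  weather = list(list(main = "Clear")))
)

# Count cities per weather condition.
condition_counts <- list.table(cities_demo, weather[[1L]]$main)
print(condition_counts)

# Group the cities by condition, then reduce each group to its city names.
grouped <- list.group(cities_demo, weather[[1L]]$main)
names_by_condition <- list.map(grouped, list.mapv(., name))
print(names_by_condition)
```

Inside list.map(), the dot refers to the current group, so list.mapv(., name) extracts the names of the cities in that group.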

Sometimes it is easier to work with data frame for vectorization and model research. For example, we can build a data frame from the non-tabular data by stacking the list elements with selected fields.

zonedf <- zone$list %>>%
  list.select(id, name, 
    coord_lon = coord$lon, coord_lat = coord$lat, 
    temp = main$temp, weather = weather[[1L]]$main) %>>%
  list.stack %>>%
  print

# Error in zone$list %>>% list.select(id, name, coord_lon = coord$lon, coord_lat = coord$lat, : object 'zone' not found

The data frame well fits the input of most models.

zonedf %>>%
  lm(formula = temp ~ coord_lon + coord_lat) %>>%
  summary

# Error in zonedf %>>% lm(formula = temp ~ coord_lon + coord_lat): object 'zonedf' not found

 - Forecast data

The weather API also gives access to forecast data. Here we get the forecast data for London.

forecast <- "http://api.openweathermap.org/data/2.5/forecast?q=London,uk" %>>%
  list.load("json")

The forecast incorporates some meta-information such as the city data and message retrieval data. We can easily transform the forecast points into an xts time series object.

fxts <- forecast$list %>>%
  list.select(dt = as.POSIXct(dt_txt), 
    temp = main$temp, humidity = main$humidity) %>>%
  list.stack %>>%
  (xts::xts(x = .[-1L], order.by = .$dt))
head(fxts)

#                        temp humidity
# 2015-03-29 00:00:00 283.100       71
# 2015-03-29 03:00:00 283.590       76
# 2015-03-29 06:00:00 282.810       73
# 2015-03-29 09:00:00 282.280       86
# 2015-03-29 12:00:00 283.600       96
# 2015-03-29 15:00:00 283.708       91

Once the data we are interested in is converted to a time series, we can easily create graphics from it.

par(mfrow=c(2,1))
plot(fxts$temp, main = "Forecast temperature of London")
plot(fxts$humidity, main = "Forecast humidity of London")



 - Historical data

The weather API allows us to access the historical weather database. The database uses UNIX timestamps, so we define unixdt() to convert human-readable date/times into the numbers used in the data query.

unixdt <- function(date) {
  as.integer(as.POSIXct(date, tz = "UTC"))
}
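
As a quick check of the conversion, the sketch below repeats the definition so that it runs on its own, converts the start time used in the next query, and converts it back. The shortened query string in the last step is only an illustration of how the number is substituted.

```r
# Same definition as above, repeated so this block runs on its own.
unixdt <- function(date) {
  as.integer(as.POSIXct(date, tz = "UTC"))
}

# Seconds since 1970-01-01 00:00:00 UTC.
start <- unixdt("2014-10-01 00:00:00")
print(start)  # 1412121600

# Converting back recovers the original date/time in UTC.
back <- as.POSIXct(start, origin = "1970-01-01", tz = "UTC")
print(format(back, "%Y-%m-%d %H:%M:%S"))

# The integer can be substituted into a query string with %d.
query <- sprintf("start=%d&cnt=200", start)
print(query)
```

Fixing tz = "UTC" makes the result independent of the time zone of the machine running the code.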

The following code queries the hourly historical data of New York starting from 2014-10-01 00:00:00 and gets the maximal number of records a free account is allowed.

history <- "http://api.openweathermap.org/data/2.5/history/city?&q=%s&start=%d&cnt=200" %>>%
  sprintf("New York,us", unixdt("2014-10-01 00:00:00")) %>>%
  list.load("json")

Once the historical data is ready, we can get a first impression of it. For example, we can see the distribution of weather conditions.

history$list %>>%
  list.table(weather = weather[[1L]]$main) %>>%
  list.sort(-.)

# integer(0)

We can also inspect the location statistics of humidity data for each weather condition.

history$list %>>%
  list.group(weather[[1L]]$main) %>>%
  list.map(. %>>% 
      list.mapv(main$humidity) %>>% 
      summary)

# Warning in is.na(x): is.na() applied to non-(list or vector) of type
# 'NULL'

# list()

Or we can create an xts object from it.

nyxts <- history$list %>>%
  list.select(dt = as.POSIXct(dt, origin = "1970-01-01"), 
    temp = main$temp, humidity = main$humidity) %>>%
  list.stack %>>%
  (xts::xts(x = .[-1L], order.by = .$dt))

# Error in xts::xts(x = .[-1L], order.by = .$dt): order.by requires an appropriate time-based object

head(nyxts)

# Error in head(nyxts): object 'nyxts' not found
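
The recorded output above suggests that the query returned no records, so the conversion has nothing to work on. As an offline sketch of the intended result, the code below builds an xts object from three made-up hourly records. The values in history_demo are hypothetical, and the timestamp is converted after stacking rather than inside list.select(), which is a simplification made here.

```r
library(rlist)

# Hypothetical hourly records shaped like elements of history$list.
history_demo <- list(
  list(dt = 1412121600, main = list(temp = 289.2, humidity = 72L)),
  list(dt = 1412125200, main = list(temp = 288.7, humidity = 75L)),
  list(dt = 1412128800, main = list(temp = 288.1, humidity = 78L))
)

# Select the fields of interest and stack them into a data frame.
stacked <- list.stack(list.select(history_demo,
  dt, temp = main$temp, humidity = main$humidity))

# Convert UNIX timestamps to POSIXct and use them as the time index.
times <- as.POSIXct(stacked$dt, origin = "1970-01-01", tz = "UTC")
nyxts_demo <- xts::xts(as.matrix(stacked[c("temp", "humidity")]),
  order.by = times)
print(nyxts_demo)
```

With real data, checking that length(history$list) is greater than zero before this step would avoid the order.by error shown above.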

The object not only facilitates time series operations but can also be used in time series model fitting.

forecast::auto.arima(nyxts$temp)

# Error in as.ts(x): object 'nyxts' not found