The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the following:
library("seplyr") names <- c('a', 'b') names := c('x', 'y') #> a b #> "x" "y"
This can be very useful when programming in R, as it allows indirection or abstraction on the left-hand side of inline name assignments (unlike c(a = 'x', b = 'y'), where all left-hand sides are concrete values even if not quoted).
A nifty property of the named map builder is that it commutes (in the sense of algebra or category theory) with R’s “c()” combine/concatenate function. That is, c('a' := 'x', 'b' := 'y') is the same as c('a', 'b') := c('x', 'y'). Roughly, this means the two operations play well with each other.
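Since := essentially wraps stats::setNames(), the same mapping can be built in base R; a quick sketch for comparison:
# Base-R equivalent of c('a', 'b') := c('x', 'y')
setNames(c("x", "y"), c("a", "b"))
#>   a   b
#> "x" "y"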
The second notation is an operator called the “anonymous function builder”. For technical reasons we use the same “:=” notation for this (and, as is common in R, pick the correct behavior based on runtime types). The function construction is written as “variables := { code }” (the braces are required), and the semantics are roughly the same as “function(variables) { code }”. This is derived from some of the work of Konrad Rudolph, who noted that most functional languages have a more concise “lambda syntax” than “function(){}” (please see here and here for some details, and be aware the seplyr notation is not as concise as is possible).
This notation allows us to write the squares of 1 through 4 as:
sapply(1:4, x:={x^2})
instead of writing:
sapply(1:4, function(x) x^2)
It is only a few characters of savings, but being able to choose notation can be a big deal. A real victory would be being able to directly use lambda-calculus notation such as “(λx.x^2)”. In the development version of seplyr we are experimenting with the following additional notations:
sapply(1:4, lambda(x)(x^2))
sapply(1:4, λ(x, x^2))
(Both of these currently work in the development version, though we are not sure about submitting source files with non-ASCII characters to CRAN.)
How to Create an Online Choice Simulator
A choice simulator is an online app or an Excel workbook that allows users to specify different scenarios and get predictions. Here is an example of a choice simulator.
Choice simulators have many names: decision support systems, market simulators, preference simulators, desktop simulators, conjoint simulators, and choice model simulators.
In this post, I show how to create an online choice simulator, with the calculations done in R and the simulator hosted in Displayr.
First of all, choice simulators are based on models. So, the first step in building a choice simulator is to obtain the model results that are to be used in the simulator. For example, here I use respondent-level parameters from a latent class model, but there are many other types of data that could have been used (e.g., parameters from a GLM, draws from the posterior distribution, beta draws from a maximum simulated likelihood model).
If practical, it is usually a good idea to have model results at the case level (e.g., respondent level), as the resulting simulator can then be easily automatically weighted and/or filtered. If you have case level data, the model results should be imported into Displayr as a Data Set. See Introduction to Displayr 2: Getting your data into Displayr for an overview of ways of getting data into Displayr.
The table below shows estimated parameters of respondents from a discrete choice experiment of the market for eggs. You can work your way through the choice simulator example used in this post here (the link will first take you to a login page in Displayr and then to a document that contains the data in the variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM).
Variable sets are a novel and very useful aspect of Displayr: a variable set is a group of related variables. We can simplify the calculations of a choice simulator by using variable sets, with one variable set for each attribute. In this step, we group the variables for each attribute into separate variable sets.
In my choice simulator, I have a separate column of controls (i.e., combo boxes) for each of the brands. The fast way to do this is to first create them for the first alternative (column), and then copy and paste them.
See also Adding a Combo Box to a Displayr Dashboard for an intro to creating combo boxes.
The code below can easily be modified for other models. The key aspects of the code are the utility calculation for each alternative, the logit transformation of utilities into preference shares, and the handling of filters and weights:
# Computing the utility for each alternative
u1 = Weight[, Weight.1] + Organic[, Organic.1] + Charity[, Charity.1] +
     Quality[, Quality.1] + Uniformity[, Uniformity.1] + Feed[, Feed.1] +
     Price * as.numeric(gsub("\\$", "", Price.1))
u2 = Weight[, Weight.2] + Organic[, Organic.2] + Charity[, Charity.2] +
     Quality[, Quality.2] + Uniformity[, Uniformity.2] + Feed[, Feed.2] +
     Price * as.numeric(gsub("\\$", "", Price.2))
u3 = Weight[, Weight.3] + Organic[, Organic.3] + Charity[, Charity.3] +
     Quality[, Quality.3] + Uniformity[, Uniformity.3] + Feed[, Feed.3] +
     Price * as.numeric(gsub("\\$", "", Price.3))
u4 = Weight[, Weight.4] + Organic[, Organic.4] + Charity[, Charity.4] +
     Quality[, Quality.4] + Uniformity[, Uniformity.4] + Feed[, Feed.4] +
     Price * as.numeric(gsub("\\$", "", Price.4))

# Computing preference shares
utilities = as.matrix(cbind(u1, u2, u3, u4))
eutilities = exp(utilities)
shares = prop.table(eutilities, 1)

# Filtering the shares, if a filter is applied.
shares = shares[QFilter, ]

# Filtering the weight variable, if required.
weight = if (is.null(QPopulationWeight)) rep(1, length(u1)) else QPopulationWeight
weight = weight[QFilter]

# Computing shares for the total sample
shares = sweep(shares, 1, weight, "*")
shares = as.matrix(apply(shares, 2, sum))
shares = 100 * prop.table(shares, 2)[1]
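Outside of Displayr, the core of this calculation is just a multinomial logit over the alternative utilities. Here is a minimal standalone sketch with made-up utilities (the respondent count and the random utilities are purely illustrative, not taken from the egg study):
# Hypothetical utilities: 5 respondents x 4 alternatives (illustrative only)
set.seed(42)
utilities <- matrix(rnorm(20), nrow = 5, ncol = 4)
shares <- prop.table(exp(utilities), 1)  # logit choice probabilities; rows sum to 1
round(100 * colMeans(shares), 1)         # average preference share of each alternative, in percent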
If you wish, you can make your choice simulator prettier. The R Outputs and the controls all have formatting options. In my example, I got our designer, Nat, to create the pretty background screen, which she did in Photoshop, and then added using Insert > Image.
If you have stored the data as variable sets, you can quickly create filters. Note that the calculations will automatically update when the viewer selects the filters.
To share the dashboard, go to the Export tab in the ribbon (at the top of the screen), and click on the black triangle under the Web Page button. Next, check the option for Hide page navigation on exported page and then click Export… and follow the prompts.
Note, the URL for the choice simulator I am using in this example is https://app.displayr.com/
You can see the choice simulator in View Mode here (as an end-user will see it), or you can create your own choice simulator here (first log into Displayr and then edit or modify a copy of the document used to create this post).
Understanding gender roles in movies with text mining
I have a new visual essay up at The Pudding today, using text mining to explore how women are portrayed in film.
The R code behind this analysis is publicly available on GitHub.
I was so glad to work with the talented Russell Goldenberg and Amber Thomas on this project, and many thanks to Matt Daniels for inviting me to contribute to The Pudding. I’ve been a big fan of their work for a long time!
Tidyer BLS data with the blscrapeR package
The recent release of the blscrapeR package brings the “tidyverse” into the fold. Inspired by my recent collaboration with Kyle Walker on his excellent tidycensus package, blscrapeR has been optimized for use within the tidyverse as of the current version 3.0.0. The key changes:
All data now returned as tibbles.
dplyr and purrr are now imported packages, along with magrittr and ggplot2, which were imported from the start.
No need to call any packages other than tidyverse and blscrapeR.
Switched from base R to dplyr in instances where performance could be increased.
Standard apply functions replaced with purrr map() functions where performance could be increased.
install.packages("blscrapeR")
The American Time Use Survey is one of the BLS’ more interesting data sets. Below is an API query that compares the time Americans spend watching TV on a daily basis compared to the time spent socializing and communicating.
It should be noted that some familiarity with BLS series ID numbers is required here. The BLS Data Finder is a nice tool for finding series ID numbers.
library(blscrapeR)
library(tidyverse)
tbl <- bls_api(c("TUU10101AA01014236", "TUU10101AA01013951")) %>%
spread(seriesID, value) %>%
dateCast() %>%
rename(watching_tv = TUU10101AA01014236, socializing_communicating = TUU10101AA01013951)
tbl
## # A tibble: 3 x 7
##    year socializing_communicating watching_tv       date
## 1  2014                      0.71        2.82 2014-01-01
## 2  2015                      0.68        2.78 2015-01-01
## 3  2016                      0.65        2.73 2016-01-01
## # (period, periodName and footnotes columns not shown)
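Because the result is an ordinary tibble, any further manipulation stays inside the tidyverse. For example, a quick illustrative ratio of TV time to socializing time by year (tv_to_social is just a throwaway column name):
tbl %>%
  mutate(tv_to_social = watching_tv / socializing_communicating) %>%
  select(year, tv_to_social)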
The main attraction of the BLS is its monthly employment and unemployment data. Below is an API query and plot of three of the major BLS unemployment rates.
U-3: The “official unemployment rate.” Total unemployed, as a percent of the civilian labor force.
U-5: Total unemployed, plus discouraged workers, plus all other marginally attached workers, as a percent of the civilian labor force plus all marginally attached workers.
U-6: Total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers.
library(blscrapeR)
library(tidyverse)
tbl <- bls_api(c("LNS14000000", "LNS13327708", "LNS13327709")) %>%
spread(seriesID, value) %>%
dateCast() %>%
rename(u3_unemployment = LNS14000000, u5_unemployment = LNS13327708, u6_unemployment = LNS13327709)
ggplot(data = tbl, aes(x = date)) +
geom_line(aes(y = u3_unemployment, color = "U-3 Unemployment")) +
geom_line(aes(y = u5_unemployment, color = "U-5 Unemployment")) +
geom_line(aes(y = u6_unemployment, color = "U-6 Unemployment")) +
labs(title = "Monthly Unemployment Rates") + ylab("value") +
theme(legend.position="top", plot.title = element_text(hjust = 0.5))
For more information and examples, please see the package vignettes.
Learning things we already know about stocks
This example groups stocks together in a network that highlights associations within and between the groups using only historical price data. The result is far from ground-breaking; you can already guess the output. For the most part, the stocks get grouped together into pretty obvious business sectors.
Despite the obvious result, the process of teasing out latent groupings from historic price data is interesting. That’s the focus of this example. A central idea of the approach taken here comes from the great paper of Ledoit and Wolf, “Honey, I Shrunk the Sample Covariance Matrix” (http://www.ledoit.net/honey.
This note follows an informal, how-to format. Rather than focus on mathematical analysis, which is well-detailed in the references, I try to spell out the hows and whys: how to do things step by step (using R), and a somewhat non-rigorous rationale for each step that’s hopefully at least convincing and intuitive.
For emphasis, allow me to restate the first sentence as an objective: group stocks together in a network that highlights associations within and between the groups, using only historical price data. That’s what the rest of this example will do, hopefully illuminating some key ideas about regularization along the way.
The example uses R, of course, and the following R packages, all available on CRAN (some of the packages themselves have dependencies):
quantmod (at least version 0.4-10)
igraph (at least version 1.1.2)
threejs (at least version 0.3.1)
NOTE: You can skip ahead to the Sample correlation section by simply downloading a sample copy of processed log(return) data as follows:
library(quantmod)
load(url("http://illposed.net/logreturns.rdata"))
Otherwise, follow the next two sections to download the raw stock daily price data and process those data into log(returns).
The quantmod
package (Ulrich and Ryan, http://www.quantmod.com/) makes it ridiculously easy to download (and visualize) financial time series data. The following code uses quantmod
to download daily stock price data for about 100 companies with the largest market capitalizations listed on the Standard & Poor’s 500 index at the time of this writing. The code downloads daily closing prices from 2012 until the present. Modify the code to experiment with different time periods or stocks as desired!
Because stock symbol names may change and companies may come and go, it’s possible that some of the data for some time periods are not available. The tryCatch()
block in the code checks for a download error and flags problems by returning NA
, later removed from the result. The upshot is that the output number of columns of stock price time series may be smaller than the input list of stock symbols.
The output of the following code is an xts time series matrix of stock prices called prices
, whose rows correspond to days and columns to stock symbols.
library(quantmod)
from="2012-05-17"
sym = c("AAPL", "ABBV", "ABT", "ACN", "AGN", "AIG", "ALL", "AMGN", "AMZN", "AXP",
"BA", "BAC", "BIIB", "BK", "BLK", "BMY", "BRK.B", "C", "CAT", "CELG", "CL",
"CMCSA", "COF", "COP", "COST", "CSCO", "CVS", "CVX", "DD", "DHR", "DIS", "DOW",
"DUK", "EMR", "EXC", "F", "FB", "FDX", "FOX", "FOXA", "GD", "GE", "GILD", "GM",
"GOOG", "GOOGL", "GS", "HAL", "HD", "HON", "IBM", "INTC", "JNJ", "JPM", "KHC",
"KMI", "KO", "LLY", "LMT", "LOW", "MA", "MCD", "MDLZ", "MDT", "MET", "MMM",
"MO", "MON", "MRK", "MS", "MSFT", "NEE", "NKE", "ORCL", "OXY", "PCLN", "PEP",
"PFE", "PG", "PM", "PYPL", "QCOM", "RTN", "SBUX", "SLB", "SO", "SPG", "T",
"TGT", "TWX", "TXN", "UNH", "UNP", "UPS", "USB", "UTX", "V", "VZ", "WBA",
"WFC", "WMT", "XOM")
prices = Map(function(n)
{
print(n)
tryCatch(getSymbols(n, src="google", env=NULL, from=from)[, 4], error = function(e) NA)
}, sym)
N = length(prices)
# identify symbols returning valid data
i = ! unlist(Map(function(i) is.na(prices[i]), seq(N)))
# combine returned prices list into a matrix, one column for each symbol with valid data
prices = Reduce(cbind, prices[i])
colnames(prices) = sym[i]
Not every stock symbol may have prices available for every day. Trading can be suspended for some reason, companies get acquired or go private, new companies form, etc.
Let’s fill in missing values going forward in time using the last reported price (piecewise constant interpolation) – a reasonable approach for stock price time series. After that, if there are still missing values, just remove those symbols that contain them, possibly further reducing the universe of stock symbols we’re working with.
for(j in 1:ncol(prices)) prices[, j] = na.locf(prices[, j]) # fill in
prices = prices[, apply(prices, 2, function(x) ! any(is.na(x)))] # omit stocks with missing data
Now that we have a universe of stocks with valid price data, convert those prices to log(returns) for the remaining analysis (by returns I mean simply the ratio of prices relative to the first price).
Why log(returns) instead of prices?
The log(returns) are closer to normally distributed than prices especially in the long run. Pat Burns wrote a note about this (with a Tom Waits soundtrack): http://www.portfolioprobe.com/
But why care about getting data closer to normally distributed?
That turns out to be important to us because later we’ll use a technique called partial correlation. That technique generally works better for normally distributed data than otherwise, see for example a nice technical discussion about this by Baba, Shibata, and Sibuya here: https://doi.org/10.1111%2Fj.
The following simple code converts our prices
matrix into a matrix of log(returns):
log_returns = apply(prices, 2, function(x) diff(log(x)))
It’s easy to convert the downloaded log(returns) data into a Pearson’s sample correlation matrix X
:
X = cor(log_returns)
The (i, j)th entry of the sample correlation matrix X
above is a measurement of the degree of linear dependence between the log(return) series for the stocks in columns i and j.
There exist at least two issues that can lead to serious problems with the interpretation of the sample correlation values:
1. The sample correlation coefficients are estimates computed from noisy, finite data, so they contain estimation error.
2. Two price series can appear correlated simply because both are driven by common trends or other confounding variables, rather than because they are directly associated.
A Nobel-prize winning approach to dealing with the second problem considers cointegration between series instead of correlation; see for example notes by Eric Zivot (https://faculty.washington.
Cointegration is a wonderful but fairly technical topic. Instead, let’s try a simpler approach.
We can try to address issue 2 above by controlling for confounding variables, at least partially. One approach considers partial correlation instead of correlation (see for example the nice description in Wikipedia https://en.wikipedia.org/wiki/
It’s worth stating that our simple approach basically treats the log(returns) series as a bunch of vectors and not so much bona fide time series, and can’t handle as many pathologies that might occur as well as cointegration can. But as we will see, this simple technique is still pretty effective at finding structure in our data. (And, indeed, related methods as discussed by Ledoit and Wolf and elsewhere are widely used in portfolio and risk analyses in practice.)
The partial correlation coefficients between all stock log(returns) series are the entries of the inverse of the sample correlation matrix (https://www.statlect.com/).
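As a toy sanity check of that relationship (strictly, the partial correlation is the suitably scaled negative off-diagonal entry of the inverse), here is a small sketch on three random series:
# Toy check: for P = solve(cor(M)), the partial correlation of columns i and j
# given the rest is -P[i, j] / sqrt(P[i, i] * P[j, j]).
set.seed(1)
M <- matrix(rnorm(300), ncol = 3)
P_toy <- solve(cor(M))
-P_toy[1, 2] / sqrt(P_toy[1, 1] * P_toy[2, 2])  # partial correlation of series 1 and 2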
Market trading of our universe of companies, with myriad known and unknown associations between them and the larger economy, produced the stock prices we downloaded. Our objective is a kind of inverse problem: given a bunch of historical stock prices, produce a network of associations.
You may recall from some long ago class that, numerically speaking, inverting matrices is generally a bad idea. Even worse, issue 1 above says that our estimated correlation coefficients contain error (noise). Even a tiny amount of noise can be hugely amplified if we invert the matrix. That’s because, as we will soon see, the sample correlation matrix contains tiny eigenvalues, and matrix inversion effectively divides the noise by those tiny values. Simply stated, dividing by a tiny number returns a big number; that is, matrix inversion tends to blow the noise up. This is a fundamental issue (in a sense, the fundamental issue) common to many inverse problems.
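One quick way to see the problem numerically is the condition number of the sample correlation matrix, which roughly measures how much inversion can amplify relative error (just a sketch; the exact value depends on the downloaded data):
# Condition number of X: large values mean a direct inverse would badly amplify noise
kappa(X, exact = TRUE)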
Ledoit and Wolf’s sensible answer to reducing the influence of noise is regularization. Regularization replaces models with different, but related, models designed to reduce the influence of noise on their output. LW use a form of regularization related to ridge regression (a.k.a., Tikhonov regularization) with a peculiar regularization operator based on a highly structured estimate of the covariance. We will use a simpler kind of regularization based on an eigenvalue decomposition of the sample correlation matrix X
.
Here is an eigenvalue decomposition of the sample correlation matrix:
L = eigen(X, symmetric=TRUE)
Note that R’s eigen()
function takes care to return the (real-valued) eigenvalues of a symmetric matrix in decreasing order for us. (Technically, the correlation matrix is symmetric positive semi-definite, and will have only non-negative real eigenvalues.)
Each eigenvector represents an orthogonal projection of the sample correlation matrix into a line (a 1-d shadow of the data). The first two eigenvectors define a projection of the sample correlation matrix into a plane (2-d), and so on. The eigenvalues estimate the proportion of information (or variability, if you prefer) from the original sample correlation matrix contained in each eigenvector. Because the eigenvectors are orthogonal, these measurements of projected information are additive.
Here is a plot of all the sample correlation matrix eigenvalues (along with a vertical line that will be explained in a moment):
plot(L$values, ylab="eigenvalues")
abline(v=10)
The eigenvalues fall off rather quickly in our example! That means that a lot of the information in the sample correlation matrix is contained in the first few eigenvectors.
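A quick numeric companion to the plot is the cumulative share of the spectrum captured by the leading eigenvalues (exact values will depend on your downloaded data):
# Cumulative proportion of total variability captured by the first k eigenvectors
round(cumsum(L$values) / sum(L$values), 2)[1:15]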
Let’s assume, perhaps unreasonably, that the errors in our estimate of the correlation matrix are equally likely to occur in any direction (that the errors are white noise, basically). As we can see above, most of the information is concentrated in the subspace corresponding to the first few eigenvectors. But white noise will have information content in all the dimensions more or less equally.
One regularization technique replaces the sample correlation matrix with an approximation defined by only its first few eigenvectors. Because they represent a large amount of the information content, the approximation can be pretty good. More importantly, because we assumed noise to be more or less equally represented across the eigenvector directions and we’re cutting most of those off, this approximation tends to damp the noise more than the underlying information. Most importantly, we’re cutting off the subspace associated with tiny eigenvalues, avoiding the problem of division by tiny values and significantly reducing amplified noise in the inverse of the sample correlation matrix (the precision matrix).
The upshot is, we regularize the sample correlation matrix by approximating it by a low-rank matrix that substantially reduces the influence of noise on the precision matrix. See Per Christian Hansen’s classic paperback, “Rank-Deficient and Discrete Ill-Posed Problems” (http://epubs.siam.org/doi/
There is substantial mathematical literature for just this topic (regularization parameter choice selection), complete with deep theory as well as lots of heuristics. Let’s keep things simple for this example and form our approximation by cutting off eigenvectors beyond where the eigenvalue plot starts to flatten out – close to the vertical line in the above plot.
Alternatively, consider the lovely short 2004 paper by Chris Ding and Xiaofeng He (http://dl.acm.org/citation.
Finally, we form the precision matrix P
from the regularized sample correlation matrix. The inversion is less numerically problematic now because of regularization. Feel free to experiment with the projected rank N
below!
N = 10 # (use 1st 10 eigenvectors, set N larger to reduce regularization)
P = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N]))
I’m not qualified to write about them, but you should be aware that Bayesian approaches to solving problems like this are also effectively (and effective!) regularization methods. I hope to someday better understand the connections between classical inverse problem solution methods that I know a little bit about, and Bayesian methods that I know substantially less about.
There is a carefully written R package for constructing regularized correlation and precision matrices: the corpcor package (https://cran.r-project.org/package=corpcor). The corpcor package, like Ledoit and Wolf, includes ways to use sophisticated regularization operators, and it applies more broadly than the simple approach taken in this post.
You can use the corpcor
package to form a Ledoit-Wolf-like regularized precision matrix P
, and you should try it! The result is pretty similar to what we get from our simple truncated eigenvalue decomposition regularization in this example.
The (i, j)th entry of the precision matrix P
is a measure of association between the log(return) time series for the stocks in columns i and j, with larger values corresponding to more association.
An interesting way to group related stocks together is to think of the precision matrix as an adjacency matrix defining a weighted, undirected network of stock associations. Thresholding entries of the precision matrix to include, say, only the top 10% results in a network of only the most strongly associated stocks.
Thinking in terms of networks opens up a huge and useful toolbox: graph theory. We gain access to all kinds of nifty ways to analyze and visualize data, including methods for clustering and community detection.
R’s comprehensive igraph package by Gábor Csárdi (https://cran.r-project.org/package=igraph) gives us the tools we need. The code below uses igraph’s cluster_louvain() function to segment the thresholded precision matrix of stocks into groups, and it produces an igraph graph object g, with vertices colored by group membership.
suppressMessages(library(igraph))
threshold = 0.90
Q = P * (P > quantile(P, probs=threshold)) # thresholded precision matrix
g = graph.adjacency(Q, mode="undirected", weighted=TRUE, diag=FALSE) # ...expressed as a graph
# The rest of the code lumps any singletons lacking edges into a single 'unassociated' group shown in gray
# (also assigning distinct colors to the other groups).
x = groups(cluster_louvain(g))
i = unlist(lapply(x, length))
d = order(i, decreasing=TRUE)
x = x[d]
i = i[d]
j = i > 1
s = sum(j)
names(x)[j] = seq(1, s)
names(x)[! j] = s + 1
grp = as.integer(rep(names(x), i))
clrs = c(rainbow(s), "gray")[grp[order(unlist(x))]]
g = set_vertex_attr(g, "color", value=clrs)
Use the latest threejs
package to make a nice interactive visualization of the network (you can use your mouse/trackpad to rotate, zoom and pan the visualization).
library(threejs)
graphjs(g, vertex.size=0.2, vertex.shape=colnames(X), edge.alpha=0.5)
The stock groups identified by this method are uncanny, but hardly all that surprising really. Look closely and you will see clusters made up of bank-like companies (AIG, BAC, BK, C, COF, GS, JPM, MET, MS, USB, WFC), pharmaceutical companies (ABT, AMGN, BIIB, BMY, CELG, GILD, JNJ, LLY, MRK, PFE), computer/technology-driven companies (AAPL, ACN, CSCO, IBM, INTC, MSFT, ORCL, QCOM, T, TXN, VZ – except oddly, the inclusion of CAT in this list), and so on. With the threshold value of 0.9 above, a few stocks aren’t connected to any others; they appear in gray.
The groups more or less correspond to what we already know!
The group that includes FB, GOOG, and AMZN (Facebook, Alphabet/Google, and Amazon) is interesting and a bit mysterious. It includes credit card companies V (Visa), MA (Mastercard) and American Express (AXP). Perhaps the returns of FB, GOOG and AMZN are more closely connected to consumer spending than technology! But oddly, this group also includes a few energy companies (DUK, EXC, NEE), and I’m not sure what to make of that…
This way of looking at things also nicely highlights connections between groups. For instance, we see that a group containing consumer products companies (PEP, KO, PG, CL, etc.) is connected to both the Pharma group, and the credit card company group. And see the appendix below for a visualization that explores different precision matrix threshold values, including lower values with far greater network connectivity.
We downloaded daily closing stock prices for 100 stocks from the S&P 500, and, using basic tools of statistics and analysis like correlation and regularization, we grouped the stocks together in a network that highlights associations within and between the groups. The structure teased out of the stock price data is reasonably intuitive.
Appendix: threejs tricks
The following self-contained example shows how the network changes with the threshold value. It performs the same steps as we did above, but uses some tricks in threejs and an experimental extension to the crosstalk package, plus a few additional R packages, to present an interactive animation. Enjoy!
suppressMessages({
library(quantmod)
library(igraph)
library(threejs)
library(crosstalk)
library(htmltools)
# using an experimental extension to crosstalk:
library(crosstool) # devtools::install_github('bwlewis/crosstool')
})
# Download the processed log(returns) data:
suppressMessages(load(url("http://illposed.net/logreturns.rdata")))
X = cor(log_returns)
L = eigen(X, symmetric=TRUE)
N = 10 # (use 1st 10 eigenvectors, set N larger to reduce regularization)
P = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N]))
colnames(P) = colnames(X)
# A function that creates a network for a given threshold and precision matrix
f = function(threshold, P)
{
Q = P * (P > quantile(P, probs=threshold)) # thresholded precision matrix
g = graph.adjacency(Q, mode="undirected", weighted=TRUE, diag=FALSE) # ...expressed as a graph
x = groups(cluster_louvain(g))
i = unlist(lapply(x, length))
d = order(i, decreasing=TRUE)
x = x[d]
i = i[d]
j = i > 1
s = sum(j)
names(x)[j] = seq(1, s)
names(x)[! j] = s + 1
grp = as.integer(rep(names(x), i))
clrs = c(rainbow(s), "gray")[grp[order(unlist(x))]]
g = set_vertex_attr(g, "color", value=clrs)
set_vertex_attr(g, "shape", value=colnames(P))
}
threshold = c(0.99, 0.95, 0.90, 0.85, 0.8)
g = Map(f, threshold, MoreArgs=list(P=P)) # list of graphs, one for each threshold
# Compute force-directed network layouts for each threshold value
# A bit expensive to compute, so run in parallel!
library(parallel)
l = mcMap(function(x) layout_with_fr(x, dim=3, niter=150), g, mc.cores=detectCores())
sdf = SharedData$new(data.frame(key=paste(seq(0, length(threshold) - 1))), key=~key)
slider = crosstool(sdf, "transmitter",
        # NOTE: the HTML template string passed to sprintf() was lost in extraction;
        # the <input type="range"> element below is a plausible reconstruction.
        sprintf("<input type='range' min='0' max='%d' value='0' style='width: 100%%;'/>",
                length(threshold) - 1), width="100%", height=20, channel="filter")
vis = graphjs(g, l, vertex.size=0.2, main=as.list(threshold), defer=TRUE, edge.alpha=0.5, deferfps=30,
crosstalk=sdf, width="100%", height=900)
browsable(div(list(HTML(""), tags$h3("Precision matrix quantile threshold (adjust slider to change)"), slider, vis)))
Using regression trees for forecasting double-seasonal time series with trend in R
After a blogging break caused by writing research papers, I managed to find time to write something new about time series forecasting. This time I want to share my experience with seasonal-trend time series forecasting using simple regression trees. Classification and regression trees (also called decision trees) are a broadly used machine learning method for modeling, popular because they are simple, interpretable and fast.
But! There is always some “but”: they adapt poorly when new, unexpected situations (values) appear. In other words, they cannot detect and adapt to change or concept drift well (if at all). This is because, during learning, a tree creates only simple rules based on the training data. A simple decision tree does not compute regression coefficients the way linear regression does, so trend modeling is not possible. You might ask, then, why talk about time series forecasting with regression trees at all? I will explain how to deal with this in more detail later in this post.
In this post you will learn how to detrend a double-seasonal time series, engineer seasonal and lag features, and forecast it one day ahead with regression trees.
As in previous posts, I will use smart meter data of electricity consumption to demonstrate forecasting of seasonal time series. I created a new dataset of aggregated electricity load of consumers from an anonymous area. The time series has a length of 17 weeks.
Firstly, let's load all of the packages needed for data analysis, modeling and visualization.
library(feather) # data import
library(data.table) # data handle
library(rpart) # decision tree method
library(rpart.plot) # tree plot
library(party) # decision tree method
library(forecast) # forecasting methods
library(ggplot2) # visualizations
library(ggforce) # visualization tools
library(plotly) # interactive visualizations
library(grid) # visualizations
library(animation) # gif
Now read the time series data with read_feather and convert it to a data.table.
DT <- as.data.table(read_feather("DT_load_17weeks"))
And store the vector of unique dates and the period of the time series, which is 48 (measurements per day).
n_date <- unique(DT[, date])
period <- 48
For data visualization, store my favorite ggplot theme settings using the theme function.
theme_ts <- theme(panel.border = element_rect(fill = NA,
colour = "grey10"),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey85"),
panel.grid.major = element_line(colour = "grey85"),
panel.grid.major.x = element_line(colour = "grey85"),
axis.text = element_text(size = 13, face = "bold"),
axis.title = element_text(size = 15, face = "bold"),
plot.title = element_text(size = 16, face = "bold"),
strip.text = element_text(size = 16, face = "bold"),
strip.background = element_rect(colour = "black"),
legend.text = element_text(size = 15),
legend.title = element_text(size = 16, face = "bold"),
legend.background = element_rect(fill = "white"),
legend.key = element_rect(fill = "white"))
Now, pick a range of dates of length three weeks from the dataset to split the data into train and test parts. The test set has a length of only one day, because we will perform a one-day-ahead forecast of electricity consumption.
data_train <- DT[date %in% n_date[43:63]]
data_test <- DT[date %in% n_date[64]]
Let’s plot the train set and corresponding average weekly values of electricity load.
averages <- data.table(value = rep(sapply(0:2, function(i)
mean(data_train[((i*period*7)+1 ):((i+1)*period*7), value])),
each = period * 7),
date_time = data_train$date_time)
ggplot(data_train, aes(date_time, value)) +
geom_line() +
geom_line(data = averages, aes(date_time, value),
linetype = 5, alpha = 0.75, size = 1.2, color = "firebrick2") +
labs(x = "Date", y = "Load (kW)") +
theme_ts
We can see a trend increasing over time; maybe air conditioning is used more as it gets hotter in summer. The double-seasonal (daily and weekly) character of the time series is obvious.
A very useful method for visualization and analysis of time series is STL decomposition.
STL decomposition is based on Loess regression, and it decomposes a time series into three parts: seasonal, trend and remainder.
We will use results from the STL decomposition to model our data as well.
I am using stl() from the stats package, and before computation we must define the weekly seasonality of our time series object. Let’s look at the results:
data_ts <- ts(data_train$value, freq = period * 7)
decomp_ts <- stl(data_ts, s.window = "periodic", robust = TRUE)$time.series
decomp_stl <- data.table(Load = c(data_train$value, as.numeric(decomp_ts)),
Date = rep(data_train[,date_time], ncol(decomp_ts)+1),
Type = factor(rep(c("original data", colnames(decomp_ts)),
each = nrow(decomp_ts)),
levels = c("original data", colnames(decomp_ts))))
ggplot(decomp_stl, aes(x = Date, y = Load)) +
geom_line() +
facet_grid(Type ~ ., scales = "free_y", switch = "y") +
labs(x = "Date", y = NULL,
title = "Time Series Decomposition by STL") +
theme_ts
As expected from the previous picture, we can see that there is a “slight” trend, first increasing and then decreasing (by around 100 kW, so actually not that slight).
The remainder part (noise) fluctuates a lot and does not look like classical white noise (we are obviously missing additional information such as weather and other unexpected events).
In this section I will do the feature engineering needed to model a double-seasonal time series with trend as well as possible using only the available historical values.
The classical way to handle seasonality is to add seasonal features to a model as vectors of the form \( (1, \dots, DailyPeriod, 1, \dots, DailyPeriod, \dots) \) for the daily season or \( (1, \dots, 1, 2, \dots, 2, \dots, 7, 1, \dots) \) for the weekly season. I used this approach in my previous post about GAM, and something similar with multiple linear regression.
A better way to model seasonal features with nonlinear regression methods like trees is to transform them into Fourier terms (sines and cosines). This is more effective for tree models and also for other nonlinear machine learning methods. I will explain why later in this post.
Fourier daily signals (terms) are defined as:
\( \left( \sin\left(\frac{2\pi j t}{48}\right),\ \cos\left(\frac{2\pi j t}{48}\right) \right)_{j=1}^{ds}, \)
where \( ds \) is the number of daily-seasonality Fourier pairs, and Fourier weekly terms are defined as:
\( \left( \sin\left(\frac{2\pi j t}{336}\right),\ \cos\left(\frac{2\pi j t}{336}\right) \right)_{j=1}^{ws}, \)
where \( ws \) is the number of weekly-seasonality Fourier pairs.
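For intuition, the first daily Fourier pair can be written out by hand; the fourier() helper used below generates columns of exactly this form (here tt just indexes the half-hourly observations):
# First daily Fourier pair (j = 1, daily period 48), constructed manually for one week
tt <- 1:(48 * 7)
S1_48 <- sin(2 * pi * 1 * tt / 48)
C1_48 <- cos(2 * pi * 1 * tt / 48)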
Another great feature (often the most powerful one) is a lag of the original time series. We can use a lag of one day, one week, etc.
The lagged time series can be preprocessed by removing noise or trend, for example with the STL decomposition method, to ensure stability.
As mentioned earlier, regression trees can’t predict a trend because they only create rules and predict future values from the rules learned on the training set. Therefore, the original time series that enters the regression tree as the dependent variable must be detrended (the trend part of the time series removed). The acquired trend part can then be forecasted separately, for example by an ARIMA model.
Let’s go to constructing mentioned features and trend forecasting.
Double-seasonal Fourier terms can be simply extracted with the fourier function from the forecast package.
Firstly, we must create a multiple-seasonal time series object with the function msts.
data_msts <- msts(data_train$value, seasonal.periods = c(period, period*7))
Now use the fourier function, specifying the number of Fourier pairs K for each of the two seasonalities.
Set both K values, for example, to just 2.
K <- 2
fuur <- fourier(data_msts, K = c(K, K))
This creates 2 pairs (sine and cosine) of daily and weekly seasonal signals.
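We can peek at the generated matrix; the column names encode the pair index and the seasonal period (e.g. S1-48 and C1-48 for the daily season, S1-336 and C1-336 for the weekly one):
head(fuur, 3)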
If we compare them with the approach described in previous posts, i.e. simple periodic vectors, it looks like this:
Daily <- rep(1:period, 21) # simple daily vector
Weekly <- data_train[, week_num] # simple weekly vector
data_fuur_simple <- data.table(value = c(scale(Daily), fuur[,2], scale(Weekly), fuur[,6]),
date = rep(data_train$date_time, 4),
method = rep(c("simple-daily", "four-daily",
"simple-weekly", "four-weekly"),
each = nrow(fuur)),
type = rep(c("Daily season", "Weekly season"),
each = nrow(fuur)*2))
ggplot(data_fuur_simple, aes(x = date, y = value, color = method)) +
geom_line(size = 1.2, alpha = 0.7) +
facet_grid(type ~ ., scales = "free_y", switch = "y") +
labs(x = "Date", y = NULL,
title = "Features Comparison") +
theme_ts
where four-daily is the Fourier term for daily season, simple-daily is the simple feature for daily season, four-weekly is the Fourier term for weekly season, and simple-weekly is the simple feature for weekly season. The advantage of Fourier terms is that there is much more closeness between ending and starting of a day or a week, which is more natural.
Now, let’s use the data from the STL decomposition to forecast the trend part of the time series. I will use the auto.arima procedure from the forecast package for this.
trend_part <- ts(decomp_ts[,2])
trend_fit <- auto.arima(trend_part)
trend_for <- forecast(trend_fit, period)$mean
Let’s plot it:
trend_data <- data.table(Load = c(decomp_ts[,2], trend_for),
Date = c(data_train$date_time, data_test$date_time),
Type = c(rep("Real", nrow(data_train)), rep("Forecast",
nrow(data_test))))
ggplot(trend_data, aes(Date, Load, color = Type)) +
geom_line(size = 1.2) +
labs(title = paste(trend_fit)) +
theme_ts
The auto.arima function chose an ARIMA(0,2,0) model as the best one for trend forecasting.
Next, make the final feature for the model (the lag) and construct the train matrix (model matrix).
I create a lag of one day and take just the seasonal part from the STL decomposition (to have a smooth lag feature).
N <- nrow(data_train)
window <- (N / period) - 1 # number of days in train set minus lag
new_load <- rowSums(decomp_ts[, c(1,3)]) # detrended load
lag_seas <- decomp_ts[1:(period*window), 1] # seasonal part of time series as lag feature
matrix_train <- data.table(Load = tail(new_load, window*period),
fuur[(period + 1):N,],
Lag = lag_seas)
The accuracy of the forecast (or of the fitted values of a model) will be measured by MAPE; let’s define it:
mape <- function(real, pred){
return(100 * mean(abs((real - pred)/real))) # MAPE - Mean Absolute Percentage Error
}
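In formula form, this is \( \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \), where \( y_t \) are the real values and \( \hat{y}_t \) the predictions.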
In the next two sections, I will describe two regression tree methods. The first is RPART, or CART (Classification and Regression Trees); the second will be CTREE. RPART is a recursive partitioning method that builds a binary tree for classification or regression tasks. It performs a search over all possible splits, maximizing an information measure of node impurity and selecting the covariate showing the best split.
I’m using the rpart implementation from the package of the same name. Let’s go ahead with modeling and try the default settings of the rpart function:
tree_1 <- rpart(Load ~ ., data = matrix_train)
It produces many interesting outputs to check; for example, we can see a table of nodes and corresponding errors with printcp(tree_1), or a detailed summary of the created nodes with summary(tree_1). We will check the variable importance and the number of created splits:
tree_1$variable.importance
## Lag C2-48 C1-336 S1-48 C1-48 S1-336 C2-336
## 100504751 45918330 44310331 36245736 32359598 27831258 25385506
## S2-336 S2-48
## 15156041 7595266
paste("Number of splits: ", tree_1$cptable[dim(tree_1$cptable )[1], "nsplit"])
## [1] "Number of splits: 10"
We can see that the most important variables are the Lag and the cosine forms of the daily and weekly seasons. The number of splits is 10; hmm, is that enough for a time series of 1008 values?
Let’s plot the created rules with the fancy rpart.plot function from the package of the same name:
rpart.plot(tree_1, digits = 2,
box.palette = viridis::viridis(10, option = "D", begin = 0.85, end = 0),
shadow.col = "grey65", col = "grey99")
We can see the values, the rules, and the percentage of observations covered by each split. Pretty simple and interpretable.
Now plot the fitted values to see the results of the tree_1 model.
datas <- data.table(Load = c(matrix_train$Load,
predict(tree_1)),
Time = rep(1:length(matrix_train$Load), 2),
Type = rep(c("Real", "RPART"), each = length(matrix_train$Load)))
ggplot(datas, aes(Time, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
labs(y = "Detrended load", title = "Fitted values from RPART tree") +
theme_ts
And see the error of fitted values against real values.
mape(matrix_train$Load, predict(tree_1))
## [1] 180.6669
Whoops. It’s a little bit too simple (rectangular) and not really accurate, but that’s the logical result of a simple tree model.
The key to achieving better results and a more accurate fit is to manually set the control hyperparameters of rpart. Check ?rpart.control for more information.
The “hack” is to change the cp (complexity) parameter to a very low value to produce more splits (nodes). The cp is a threshold deciding whether a branch fulfills the conditions for further processing, so only nodes with a fitness larger than cp are processed. Other important parameters are the minimum number of observations needed in a node to attempt a split (minsplit) and the maximal depth of the tree (maxdepth).
Set minsplit to 2 and set maxdepth to its maximal value, 30.
tree_2 <- rpart(Load ~ ., data = matrix_train,
control = rpart.control(minsplit = 2,
maxdepth = 30,
cp = 0.000001))
Now make a simple plot to see the depth of the created tree…
plot(tree_2, compress = TRUE)
That’s a rather more impressive tree than the previous one, isn’t it? Let’s also check the number of splits.
tree_2$cptable[dim(tree_2$cptable )[1], "nsplit"] # Number of splits
## [1] 600
600 splits is a lot more than 10.
Let’s plot fitted values from the model tree_2
:
datas <- data.table(Load = c(matrix_train$Load,
predict(tree_2)),
Time = rep(1:length(matrix_train$Load), 2),
Type = rep(c("Real", "RPART"), each = length(matrix_train$Load)))
ggplot(datas, aes(Time, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
labs(y = "Detrended load", title = "Fitted values from RPART") +
theme_ts
And see the error of fitted values against real values.
mape(matrix_train$Load, predict(tree_2))
## [1] 16.0639
Much better, but obviously the model can be overfitted now.
Now let’s put together everything we have so far and forecast the load one day ahead.
Let’s create the testing data matrix:
test_lag <- decomp_ts[((period*window)+1):N , 1]
fuur_test <- fourier(data_msts, K = c(K, K), h = period)
matrix_test <- data.table(fuur_test,
Lag = test_lag)
Predict the detrended part of the time series with the tree_2 model and add the trend part forecasted by the ARIMA model.
for_rpart <- predict(tree_2, matrix_test) + trend_for
Let’s plot the results and compare them with the real values from data_test.
data_for <- data.table(Load = c(data_train$value, data_test$value, for_rpart),
Date = c(data_train$date_time, rep(data_test$date_time, 2)),
Type = c(rep("Train data", nrow(data_train)),
rep("Test data", nrow(data_test)),
rep("Forecast", nrow(data_test))))
ggplot(data_for, aes(Date, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
facet_zoom(x = Date %in% data_test$date_time, zoom.size = 1.2) +
labs(title = "Forecast from RPART") +
theme_ts
Not bad. For comparison, let’s look at the forecasting results of a model without separate trend forecasting and detrending.
matrix_train_sim <- data.table(Load = tail(data_train$value, window*period),
fuur[(period+1):N,],
Lag = lag_seas)
tree_sim <- rpart(Load ~ ., data = matrix_train_sim,
control = rpart.control(minsplit = 2,
maxdepth = 30,
cp = 0.000001))
for_rpart_sim <- predict(tree_sim, matrix_test)
data_for <- data.table(Load = c(data_train$value, data_test$value, for_rpart, for_rpart_sim),
Date = c(data_train$date_time, rep(data_test$date_time, 3)),
Type = c(rep("Train data", nrow(data_train)),
rep("Test data", nrow(data_test)),
rep("Forecast with trend", nrow(data_test)),
rep("Forecast simple", nrow(data_test))))
ggplot(data_for, aes(Date, Load, color = Type, linetype = Type)) +
geom_line(size = 0.8, alpha = 0.7) +
facet_zoom(x = Date %in% data_test$date_time, zoom.size = 1.2) +
labs(title = "Forecasts from RPARTs with and without trend forecasting") +
scale_linetype_manual(values = c(5,6,1,1)) +
theme_ts
We can see that the RPART model without trend handling produces higher forecast values.
Evaluate the results with the MAPE forecasting measure.
mape(data_test$value, for_rpart)
## [1] 3.727473
mape(data_test$value, for_rpart_sim)
## [1] 6.976259
We can see a large difference in MAPE. So detrending the original time series and forecasting its trend part separately really works, but let’s not generalize from a single result. You can read more about the RPART method in its great package vignette.
The second simple regression tree method that will be used is CTREE. Conditional inference trees (CTREE) are a statistical approach to recursive partitioning that takes into account the distributional properties of the data. CTREE performs multiple test procedures to determine whether there is no significant association between any of the features and the response (the load in our case), in which case the recursion stops.
In R CTREE is implemented in the package party
in the function ctree
.
Let’s try fitting a simple ctree with default values.
ctree_1 <- ctree(Load ~ ., data = matrix_train)
The constructed tree can again be plotted simply with the plot function, but it made many splits, so the plot is cluttered.
Let’s plot the fitted values from the ctree_1 model instead.
datas <- data.table(Load = c(matrix_train$Load,
predict(ctree_1)),
Time = rep(1:length(matrix_train$Load), 2),
Type = rep(c("Real", "CTREE"), each = length(matrix_train$Load)))
ggplot(datas, aes(Time, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
labs(y = "Detrended load", title = "Fitted values from CTREE") +
theme_ts
And see the error of fitted values against real values.
mape(matrix_train$Load, predict(ctree_1))
## [1] 87.85983
Actually, this is pretty nice, but again, it can be tuned.
For the available tuning hyperparameters, check ?ctree_control. I changed the hyperparameters minsplit and minbucket, which have a similar meaning to the cp parameter in RPART. The mincriterion can be tuned as well; it is the significance level (1 – p-value) that must be exceeded in order to implement a split. Let’s plot the results.
ctree_2 <- ctree(Load ~ ., data = matrix_train,
controls = party::ctree_control(teststat = "quad",
testtype = "Teststatistic",
mincriterion = 0.925,
minsplit = 1,
minbucket = 1))
datas <- data.table(Load = c(matrix_train$Load,
predict(ctree_2)),
Time = rep(1:length(matrix_train$Load), 2),
Type = rep(c("Real", "CTREE"), each = length(matrix_train$Load)))
ggplot(datas, aes(Time, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
labs(y = "Detrended load", title = "Fitted values from CTREE") +
theme_ts
And see the error of fitted values against real values.
mape(matrix_train$Load, predict(ctree_2))
## [1] 39.70532
That’s better. Now forecast values with the ctree_2 model.
for_ctree <- predict(ctree_2, matrix_test) + trend_for
And compare CTREE with RPART model.
data_for <- data.table(Load = c(data_train$value, data_test$value, for_rpart, for_ctree),
Date = c(data_train$date_time, rep(data_test$date_time, 3)),
Type = c(rep("Train data", nrow(data_train)),
rep("Test data", nrow(data_test)),
rep("RPART", nrow(data_test)),
rep("CTREE", nrow(data_test))))
ggplot(data_for, aes(Date, Load, color = Type, linetype = Type)) +
geom_line(size = 0.8, alpha = 0.7) +
facet_zoom(x = Date %in% data_test$date_time, zoom.size = 1.2) +
labs(title = "Forecasts from RPART and CTREE models") +
scale_linetype_manual(values = c(5,6,1,1)) +
theme_ts
mape(data_test$value, for_rpart)
## [1] 3.727473
mape(data_test$value, for_ctree)
## [1] 4.020834
A slightly better MAPE value with RPART, but again this is not something to generalize from. You can read more about the CTREE method in its great package vignette.
Now let’s try forecasting future values using all the available electricity load data with a sliding window approach (window length of three weeks) over a period of more than three months (98 days).
Define the functions that produce the forecasts, putting together everything we have learned so far.
RpartTrend <- function(data, set_of_date, K, period = 48){
data_train <- data[date %in% set_of_date]
N <- nrow(data_train)
window <- (N / period) - 1
data_ts <- msts(data_train$value, seasonal.periods = c(period, period*7))
fuur <- fourier(data_ts, K = c(K, K))
fuur_test <- as.data.frame(fourier(data_ts, K = c(K, K), h = period))
data_ts <- ts(data_train$value, freq = period*7)
decomp_ts <- stl(data_ts, s.window = "periodic", robust = TRUE)
new_load <- rowSums(decomp_ts$time.series[, c(1,3)])
trend_part <- ts(decomp_ts$time.series[,2])
trend_fit <- auto.arima(trend_part)
trend_for <- as.vector(forecast(trend_fit, period)$mean)
lag_seas <- decomp_ts$time.series[1:(period *window), 1]
matrix_train <- data.table(Load = tail(new_load, window*period),
fuur[(period+1):N,],
Lag = lag_seas)
tree_1 <- rpart(Load ~ ., data = matrix_train,
control = rpart.control(minsplit = 2,
maxdepth = 30,
cp = 0.000001))
test_lag <- decomp_ts$time.series[((period* (window))+1):N, 1]
matrix_test <- data.table(fuur_test,
Lag = test_lag)
# prediction
pred_tree <- predict(tree_1, matrix_test) + trend_for
return(as.vector(pred_tree))
}
CtreeTrend <- function(data, set_of_date, K, period = 48){
# subsetting the dataset by dates
data_train <- data[date %in% set_of_date]
N <- nrow(data_train)
window <- (N / period) - 1
data_ts <- msts(data_train$value, seasonal.periods = c(period, period*7))
fuur <- fourier(data_ts, K = c(K, K))
fuur_test <- as.data.frame(fourier(data_ts, K = c(K, K), h = period))
data_ts <- ts(data_train$value, freq = period*7)
decomp_ts <- stl(data_ts, s.window = "periodic", robust = TRUE)
new_load <- rowSums(decomp_ts$time.series[, c(1,3)])
trend_part <- ts(decomp_ts$time.series[,2])
trend_fit <- auto.arima(trend_part)
trend_for <- as.vector(forecast(trend_fit, period)$mean)
lag_seas <- decomp_ts$time.series[1:(period *window), 1]
matrix_train <- data.table(Load = tail(new_load, window*period),
fuur[(period+1):N,],
Lag = lag_seas)
tree_2 <- party::ctree(Load ~ ., data = matrix_train,
controls = party::ctree_control(teststat = "quad",
testtype = "Teststatistic",
mincriterion = 0.925,
minsplit = 1,
minbucket = 1))
test_lag <- decomp_ts$time.series[((period* (window))+1):N, 1]
matrix_test <- data.table(fuur_test,
Lag = test_lag)
pred_tree <- predict(tree_2, matrix_test) + trend_for
return(as.vector(pred_tree))
}
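These two functions are only defined here; a hedged sketch of how they might be driven in the sliding-window evaluation follows (the window length and indexing below are assumptions; the actual evaluation script is in the GitHub repository mentioned next):
win <- 21                              # three-week window, in days
n_days <- length(n_date) - win         # number of one-day-ahead forecasts
err_rpart <- sapply(0:(n_days - 1), function(i) {
  pred <- RpartTrend(DT, n_date[(i + 1):(i + win)], K = 2)
  real <- DT[date %in% n_date[i + win + 1], value]
  mape(real, pred)
})
summary(err_rpart)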
I created a plotly boxplot graph of the MAPE values from four models: CTREE simple, CTREE with detrending, RPART simple and RPART with detrending. The whole evaluation can be seen in the script stored in my GitHub repository.
We can see that detrending the electricity consumption time series improves the accuracy of the forecast for both regression tree methods, RPART and CTREE. My approach works as expected.
My posts have a habit of including an animation. So I prepared two animations (animated dashboards) for you, using the animation, grid, ggplot2 and ggforce (for zooming) packages, to visualize the forecasting results.
We can see that on many days the forecast is almost perfect, but on some days there is still room for improvement.
In this post, I showed you how to handle trend in seasonal time series when using a regression tree model. Detrending the time series is an important (indeed, necessary) step for regression tree methods, due to the character of decision trees. The trend part of the time series was extracted by STL decomposition and forecasted separately by a simple ARIMA model. I evaluated this approach on a dataset of smart meter measurements of electricity consumption. The regression (decision) tree is a great technique for getting simple and interpretable results in very fast computational time.
In a future post, I will focus on enhancing the predictive performance of simple regression tree methods with ensemble learning methods like Bagging, Random Forest, and similar.
Do you want to get your own simmer hexagon sticker? Just fill in this form and get one sent to you for free.
Check out r-simmer.org or CRAN for more information on simmer
, a discrete-event simulation package for R.