The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the following:
library("seplyr") names <- c('a', 'b') names := c('x', 'y') #> a b #> "x" "y"
This can be very useful when programming in R, as it allows indirection or abstraction on the left-hand side of inline name assignments (unlike c(a = 'x', b = 'y'), where all left-hand sides are concrete values even if not quoted).
A nifty property of the named map builder is that it commutes (in the sense of algebra or category theory) with R’s “c()” combine/concatenate function. That is, c('a' := 'x', 'b' := 'y') is the same as c('a', 'b') := c('x', 'y'). Roughly, this means the two operations play well with each other.
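Since := essentially wraps stats::setNames(), the same mapping can be built in base R; a quick sketch for comparison:
# Base-R equivalent of c('a', 'b') := c('x', 'y')
setNames(c("x", "y"), c("a", "b"))
#>   a   b
#> "x" "y"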
The second notation is an operator called the “anonymous function builder”. For technical reasons we use the same “:=” notation for this (and, as is common in R, pick the correct behavior based on runtime types). The function construction is written as “variables := { code }” (the braces are required), and the semantics are roughly the same as “function(variables) { code }”. This is derived from some of the work of Konrad Rudolph, who noted that most functional languages have a more concise “lambda syntax” than “function(){}” (please see here and here for some details, and be aware the seplyr notation is not as concise as is possible).
This notation allows us to write the squares of 1 through 4 as:
sapply(1:4, x:={x^2})
instead of writing:
sapply(1:4, function(x) x^2)
It is only a few characters of savings, but being able to choose notation can be a big deal. A real victory would be being able to directly use lambda-calculus notation such as “(λx.x^2)”. In the development version of seplyr we are experimenting with the following additional notations:
sapply(1:4, lambda(x)(x^2))
sapply(1:4, λ(x, x^2))
(Both of these currently work in the development version, though we are not sure about submitting source files with non-ASCII characters to CRAN.)
How to Create an Online Choice Simulator
A choice simulator is an online app or an Excel workbook that allows users to specify different scenarios and get predictions. Here is an example of a choice simulator.
Choice simulators have many names: decision support systems, market simulators, preference simulators, desktop simulators, conjoint simulators, and choice model simulators.
In this post, I show how to create an online choice simulator, with the calculations done in R and the simulator hosted in Displayr.
First of all, choice simulators are based on models. So, the first step in building a choice simulator is to obtain the model results that are to be used in the simulator. For example, here I use respondent-level parameters from a latent class model, but there are many other types of data that could have been used (e.g., parameters from a GLM, draws from the posterior distribution, beta draws from a maximum simulated likelihood model).
If practical, it is usually a good idea to have model results at the case level (e.g., respondent level), as the resulting simulator can then be easily automatically weighted and/or filtered. If you have case level data, the model results should be imported into Displayr as a Data Set. See Introduction to Displayr 2: Getting your data into Displayr for an overview of ways of getting data into Displayr.
The table below shows estimated parameters of respondents from a discrete choice experiment of the market for eggs. You can work your way through the choice simulator example used in this post here (the link will first take you to a login page in Displayr and then to a document that contains the data in the variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM).
Variable sets are a novel and very useful aspect of Displayr: a variable set is a group of related variables. We can simplify the calculations of a choice simulator by using variable sets, with one variable set for each attribute. In this step, we group the variables for each attribute into separate variable sets.
In my choice simulator, I have a separate column of controls (i.e., combo boxes) for each of the brands. The fast way to do this is to first create them for the first alternative (column), and then copy and paste them.
See also Adding a Combo Box to a Displayr Dashboard for an intro to creating combo boxes.
The code below can easily be modified for other models. The key aspects of the code are the utility calculation for each alternative, the logit transformation of utilities into preference shares, and the handling of filters and weights:
# Computing the utility for each alternative
u1 = Weight[, Weight.1] + Organic[, Organic.1] + Charity[, Charity.1] +
     Quality[, Quality.1] + Uniformity[, Uniformity.1] + Feed[, Feed.1] +
     Price * as.numeric(gsub("\\$", "", Price.1))
u2 = Weight[, Weight.2] + Organic[, Organic.2] + Charity[, Charity.2] +
     Quality[, Quality.2] + Uniformity[, Uniformity.2] + Feed[, Feed.2] +
     Price * as.numeric(gsub("\\$", "", Price.2))
u3 = Weight[, Weight.3] + Organic[, Organic.3] + Charity[, Charity.3] +
     Quality[, Quality.3] + Uniformity[, Uniformity.3] + Feed[, Feed.3] +
     Price * as.numeric(gsub("\\$", "", Price.3))
u4 = Weight[, Weight.4] + Organic[, Organic.4] + Charity[, Charity.4] +
     Quality[, Quality.4] + Uniformity[, Uniformity.4] + Feed[, Feed.4] +
     Price * as.numeric(gsub("\\$", "", Price.4))

# Computing preference shares
utilities = as.matrix(cbind(u1, u2, u3, u4))
eutilities = exp(utilities)
shares = prop.table(eutilities, 1)

# Filtering the shares, if a filter is applied.
shares = shares[QFilter, ]

# Filtering the weight variable, if required.
weight = if (is.null(QPopulationWeight)) rep(1, length(u1)) else QPopulationWeight
weight = weight[QFilter]

# Computing shares for the total sample
shares = sweep(shares, 1, weight, "*")
shares = as.matrix(apply(shares, 2, sum))
shares = 100 * prop.table(shares, 2)[1]
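Outside of Displayr, the core of this calculation is just a multinomial logit over the alternative utilities. Here is a minimal standalone sketch with made-up utilities (the respondent count and the random utilities are purely illustrative, not taken from the egg study):
# Hypothetical utilities: 5 respondents x 4 alternatives (illustrative only)
set.seed(42)
utilities <- matrix(rnorm(20), nrow = 5, ncol = 4)
shares <- prop.table(exp(utilities), 1)  # logit choice probabilities; rows sum to 1
round(100 * colMeans(shares), 1)         # average preference share of each alternative, in percent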
If you wish, you can make your choice simulator prettier. The R Outputs and the controls all have formatting options. In my example, I got our designer, Nat, to create the pretty background screen, which she did in Photoshop, and then added using Insert > Image.
If you have stored the data as variable sets, you can quickly create filters. Note that the calculations will automatically update when the viewer selects the filters.
To share the dashboard, go to the Export tab in the ribbon (at the top of the screen), and click on the black triangle under the Web Page button. Next, check the option for Hide page navigation on exported page and then click Export… and follow the prompts.
Note, the URL for the choice simulator I am using in this example is https://app.displayr.com/
You can see the choice simulator in View Mode here (as an end-user will see it), or you can create your own choice simulator here (first log into Displayr and then edit or modify a copy of the document used to create this post).
Understanding gender roles in movies with text mining
I have a new visual essay up at The Pudding today, using text mining to explore how women are portrayed in film.
The R code behind this analysis is publicly available on GitHub.
I was so glad to work with the talented Russell Goldenberg and Amber Thomas on this project, and many thanks to Matt Daniels for inviting me to contribute to The Pudding. I’ve been a big fan of their work for a long time!
Tidyer BLS data with the blscrapeR package
The recent release of the blscrapeR package brings the “tidyverse” into the fold. Inspired by my recent collaboration with Kyle Walker on his excellent tidycensus package, blscrapeR has been optimized for use within the tidyverse as of the current version 3.0.0. The key changes:
All data now returned as tibbles.
dplyr and purrr are now imported packages, along with magrittr and ggplot2, which were imported from the start.
No need to call any packages other than tidyverse and blscrapeR.
Switched from base R to dplyr in instances where performance could be increased.
Standard apply functions replaced with purrr map() functions where performance could be increased.
install.packages("blscrapeR")
The American Time Use Survey is one of the BLS’ more interesting data sets. Below is an API query that compares the time Americans spend watching TV on a daily basis compared to the time spent socializing and communicating.
It should be noted that some familiarity with BLS series ID numbers is required here. The BLS Data Finder is a nice tool for finding series ID numbers.
library(blscrapeR)
library(tidyverse)
tbl <- bls_api(c("TUU10101AA01014236", "TUU10101AA01013951")) %>%
spread(seriesID, value) %>%
dateCast() %>%
rename(watching_tv = TUU10101AA01014236, socializing_communicating = TUU10101AA01013951)
tbl
## # A tibble: 3 x 7
##    year socializing_communicating watching_tv       date
## 1  2014                      0.71        2.82 2014-01-01
## 2  2015                      0.68        2.78 2015-01-01
## 3  2016                      0.65        2.73 2016-01-01
## # (period, periodName and footnotes columns not shown)
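Because the result is an ordinary tibble, any further manipulation stays inside the tidyverse. For example, a quick illustrative ratio of TV time to socializing time by year (tv_to_social is just a throwaway column name):
tbl %>%
  mutate(tv_to_social = watching_tv / socializing_communicating) %>%
  select(year, tv_to_social)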
The main attraction of the BLS is its monthly employment and unemployment data. Below is an API query and plot of three of the major BLS unemployment rates.
U-3: The “official unemployment rate.” Total unemployed, as a percent of the civilian labor force.
U-5: Total unemployed, plus discouraged workers, plus all other marginally attached workers, as a percent of the civilian labor force plus all marginally attached workers.
U-6: Total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers.
library(blscrapeR)
library(tidyverse)
tbl <- bls_api(c("LNS14000000", "LNS13327708", "LNS13327709")) %>%
spread(seriesID, value) %>%
dateCast() %>%
rename(u3_unemployment = LNS14000000, u5_unemployment = LNS13327708, u6_unemployment = LNS13327709)
ggplot(data = tbl, aes(x = date)) +
geom_line(aes(y = u3_unemployment, color = "U-3 Unemployment")) +
geom_line(aes(y = u5_unemployment, color = "U-5 Unemployment")) +
geom_line(aes(y = u6_unemployment, color = "U-6 Unemployment")) +
labs(title = "Monthly Unemployment Rates") + ylab("value") +
theme(legend.position="top", plot.title = element_text(hjust = 0.5))
For more information and examples, please see the package vignettes.
Learning things we already know about stocks
This example groups stocks together in a network that highlights associations within and between the groups using only historical price data. The result is far from ground-breaking; you can already guess the output. For the most part, the stocks get grouped together into pretty obvious business sectors.
Despite the obvious result, the process of teasing out latent groupings from historic price data is interesting. That’s the focus of this example. A central idea of the approach taken here comes from the great paper of Ledoit and Wolf, “Honey, I Shrunk the Sample Covariance Matrix” (http://www.ledoit.net/honey.
This note follows an informal, how-to format. Rather than focus on mathematical analysis, which is well-detailed in the references, I try to spell out the hows and whys: how to do things step by step (using R), and a somewhat non-rigorous rationale for each step that’s hopefully at least convincing and intuitive.
For emphasis, allow me to restate the first sentence as an objective: group stocks together in a network that highlights associations within and between the groups, using only historical price data. That’s what the rest of this example will do, hopefully illuminating some key ideas about regularization along the way.
The example uses R, of course, and the following R packages, all available on CRAN (some of the packages themselves have dependencies):
quantmod (at least version 0.4-10)
igraph (at least version 1.1.2)
threejs (at least version 0.3.1)
NOTE: You can skip ahead to the Sample correlation section by simply downloading a sample copy of processed log(return) data as follows:
library(quantmod)
load(url("http://illposed.net/logreturns.rdata"))
Otherwise, follow the next two sections to download the raw stock daily price data and process those data into log(returns).
The quantmod
package (Ulrich and Ryan, http://www.quantmod.com/) makes it ridiculously easy to download (and visualize) financial time series data. The following code uses quantmod
to download daily stock price data for about 100 companies with the largest market capitalizations listed on the Standard & Poor’s 500 index at the time of this writing. The code downloads daily closing prices from 2012 until the present. Modify the code to experiment with different time periods or stocks as desired!
Because stock symbol names may change and companies may come and go, it’s possible that some of the data for some time periods are not available. The tryCatch()
block in the code checks for a download error and flags problems by returning NA
, later removed from the result. The upshot is that the output number of columns of stock price time series may be smaller than the input list of stock symbols.
The output of the following code is an xts time series matrix of stock prices called prices
, whose rows correspond to days and columns to stock symbols.
library(quantmod)
from="2012-05-17"
sym = c("AAPL", "ABBV", "ABT", "ACN", "AGN", "AIG", "ALL", "AMGN", "AMZN", "AXP",
"BA", "BAC", "BIIB", "BK", "BLK", "BMY", "BRK.B", "C", "CAT", "CELG", "CL",
"CMCSA", "COF", "COP", "COST", "CSCO", "CVS", "CVX", "DD", "DHR", "DIS", "DOW",
"DUK", "EMR", "EXC", "F", "FB", "FDX", "FOX", "FOXA", "GD", "GE", "GILD", "GM",
"GOOG", "GOOGL", "GS", "HAL", "HD", "HON", "IBM", "INTC", "JNJ", "JPM", "KHC",
"KMI", "KO", "LLY", "LMT", "LOW", "MA", "MCD", "MDLZ", "MDT", "MET", "MMM",
"MO", "MON", "MRK", "MS", "MSFT", "NEE", "NKE", "ORCL", "OXY", "PCLN", "PEP",
"PFE", "PG", "PM", "PYPL", "QCOM", "RTN", "SBUX", "SLB", "SO", "SPG", "T",
"TGT", "TWX", "TXN", "UNH", "UNP", "UPS", "USB", "UTX", "V", "VZ", "WBA",
"WFC", "WMT", "XOM")
prices = Map(function(n)
{
print(n)
tryCatch(getSymbols(n, src="google", env=NULL, from=from)[, 4], error = function(e) NA)
}, sym)
N = length(prices)
# identify symbols returning valid data
i = ! unlist(Map(function(i) is.na(prices[i]), seq(N)))
# combine returned prices list into a matrix, one column for each symbol with valid data
prices = Reduce(cbind, prices[i])
colnames(prices) = sym[i]
Not every stock symbol may have prices available for every day. Trading can be suspended for some reason, companies get acquired or go private, new companies form, etc.
Let’s fill in missing values going forward in time using the last reported price (piecewise constant interpolation) – a reasonable approach for stock price time series. After that, if there are still missing values, just remove those symbols that contain them, possibly further reducing the universe of stock symbols we’re working with.
for(j in 1:ncol(prices)) prices[, j] = na.locf(prices[, j]) # fill in
prices = prices[, apply(prices, 2, function(x) ! any(is.na(x)))] # omit stocks with missing data
Now that we have a universe of stocks with valid price data, convert those prices to log(returns) for the remaining analysis (by returns I mean simply the ratio of prices relative to the first price).
Why log(returns) instead of prices?
The log(returns) are closer to normally distributed than prices especially in the long run. Pat Burns wrote a note about this (with a Tom Waits soundtrack): http://www.portfolioprobe.com/
But why care about getting data closer to normally distributed?
That turns out to be important to us because later we’ll use a technique called partial correlation. That technique generally works better for normally distributed data than otherwise, see for example a nice technical discussion about this by Baba, Shibata, and Sibuya here: https://doi.org/10.1111%2Fj.
The following simple code converts our prices
matrix into a matrix of log(returns):
log_returns = apply(prices, 2, function(x) diff(log(x)))
It’s easy to convert the downloaded log(returns) data into a Pearson’s sample correlation matrix X
:
X = cor(log_returns)
The (i, j)th entry of the sample correlation matrix X
above is a measurement of the degree of linear dependence between the log(return) series for the stocks in columns i and j.
There exist at least two issues that can lead to serious problems with the interpretation of the sample correlation values:
1. The sample correlation coefficients are estimates computed from noisy, finite data, so they contain estimation error.
2. Two price series can appear correlated simply because both are driven by common trends or other confounding variables, rather than because they are directly associated.
A Nobel-prize winning approach to dealing with the second problem considers cointegration between series instead of correlation; see for example notes by Eric Zivot (https://faculty.washington.
Cointegration is a wonderful but fairly technical topic. Instead, let’s try a simpler approach.
We can try to address issue 2 above by controlling for confounding variables, at least partially. One approach considers partial correlation instead of correlation (see for example the nice description in Wikipedia https://en.wikipedia.org/wiki/
It’s worth stating that our simple approach basically treats the log(returns) series as a bunch of vectors and not so much bona fide time series, and can’t handle as many pathologies that might occur as well as cointegration can. But as we will see, this simple technique is still pretty effective at finding structure in our data. (And, indeed, related methods as discussed by Ledoit and Wolf and elsewhere are widely used in portfolio and risk analyses in practice.)
The partial correlation coefficients between all stock log(returns) series are the entries of the inverse of the sample correlation matrix (https://www.statlect.com/).
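As a toy sanity check of that relationship (strictly, the partial correlation is the suitably scaled negative off-diagonal entry of the inverse), here is a small sketch on three random series:
# Toy check: for P = solve(cor(M)), the partial correlation of columns i and j
# given the rest is -P[i, j] / sqrt(P[i, i] * P[j, j]).
set.seed(1)
M <- matrix(rnorm(300), ncol = 3)
P_toy <- solve(cor(M))
-P_toy[1, 2] / sqrt(P_toy[1, 1] * P_toy[2, 2])  # partial correlation of series 1 and 2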
Market trading of our universe of companies, with myriad known and unknown associations between them and the larger economy, produced the stock prices we downloaded. Our objective is a kind of inverse problem: given a bunch of historical stock prices, produce a network of associations.
You may recall from some long ago class that, numerically speaking, inverting matrices is generally a bad idea. Even worse, issue 1 above says that our estimated correlation coefficients contain error (noise). Even a tiny amount of noise can be hugely amplified if we invert the matrix. That’s because, as we will soon see, the sample correlation matrix contains tiny eigenvalues, and matrix inversion effectively divides the noise by those tiny values. Simply stated, dividing by a tiny number returns a big number; that is, matrix inversion tends to blow the noise up. This is a fundamental issue (in a sense, the fundamental issue) common to many inverse problems.
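One quick way to see the problem numerically is the condition number of the sample correlation matrix, which roughly measures how much inversion can amplify relative error (just a sketch; the exact value depends on the downloaded data):
# Condition number of X: large values mean a direct inverse would badly amplify noise
kappa(X, exact = TRUE)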
Ledoit and Wolf’s sensible answer to reducing the influence of noise is regularization. Regularization replaces models with different, but related, models designed to reduce the influence of noise on their output. LW use a form of regularization related to ridge regression (a.k.a., Tikhonov regularization) with a peculiar regularization operator based on a highly structured estimate of the covariance. We will use a simpler kind of regularization based on an eigenvalue decomposition of the sample correlation matrix X
.
Here is an eigenvalue decomposition of the sample correlation matrix:
L = eigen(X, symmetric=TRUE)
Note that R’s eigen()
function takes care to return the (real-valued) eigenvalues of a symmetric matrix in decreasing order for us. (Technically, the correlation matrix is symmetric positive semi-definite, and will have only non-negative real eigenvalues.)
Each eigenvector represents an orthogonal projection of the sample correlation matrix into a line (a 1-d shadow of the data). The first two eigenvectors define a projection of the sample correlation matrix into a plane (2-d), and so on. The eigenvalues estimate the proportion of information (or variability, if you prefer) from the original sample correlation matrix contained in each eigenvector. Because the eigenvectors are orthogonal, these measurements of projected information are additive.
Here is a plot of all the sample correlation matrix eigenvalues (along with a vertical line that will be explained in a moment):
plot(L$values, ylab="eigenvalues")
abline(v=10)
The eigenvalues fall off rather quickly in our example! That means that a lot of the information in the sample correlation matrix is contained in the first few eigenvectors.
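A quick numeric companion to the plot is the cumulative share of the spectrum captured by the leading eigenvalues (exact values will depend on your downloaded data):
# Cumulative proportion of total variability captured by the first k eigenvectors
round(cumsum(L$values) / sum(L$values), 2)[1:15]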
Let’s assume, perhaps unreasonably, that the errors in our estimate of the correlation matrix are equally likely to occur in any direction (that the errors are white noise, basically). As we can see above, most of the information is concentrated in the subspace corresponding to the first few eigenvectors. But white noise will have information content in all the dimensions more or less equally.
One regularization technique replaces the sample correlation matrix with an approximation defined by only its first few eigenvectors. Because they represent a large amount of the information content, the approximation can be pretty good. More importantly, because we assumed noise to be more or less equally represented across the eigenvector directions and we’re cutting most of those off, this approximation tends to damp the noise more than the underlying information. Most importantly, we’re cutting off the subspace associated with tiny eigenvalues, avoiding the problem of division by tiny values and significantly reducing amplified noise in the inverse of the sample correlation matrix (the precision matrix).
The upshot is, we regularize the sample correlation matrix by approximating it by a low-rank matrix that substantially reduces the influence of noise on the precision matrix. See Per Christian Hansen’s classic paperback, “Rank-Deficient and Discrete Ill-Posed Problems” (http://epubs.siam.org/doi/
There is substantial mathematical literature for just this topic (regularization parameter choice selection), complete with deep theory as well as lots of heuristics. Let’s keep things simple for this example and form our approximation by cutting off eigenvectors beyond where the eigenvalue plot starts to flatten out – close to the vertical line in the above plot.
Alternatively, consider the lovely short 2004 paper by Chris Ding and Xiaofeng He (http://dl.acm.org/citation.
Finally, we form the precision matrix P
from the regularized sample correlation matrix. The inversion is less numerically problematic now because of regularization. Feel free to experiment with the projected rank N
below!
N = 10 # (use 1st 10 eigenvectors, set N larger to reduce regularization)
P = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N]))
I’m not qualified to write about them, but you should be aware that Bayesian approaches to solving problems like this are also effectively (and effective!) regularization methods. I hope to someday better understand the connections between classical inverse problem solution methods that I know a little bit about, and Bayesian methods that I know substantially less about.
There is a carefully written R package for constructing regularized correlation and precision matrices: the corpcor package (https://cran.r-project.org/package=corpcor). The corpcor package, like Ledoit and Wolf, includes ways to use sophisticated regularization operators, and it applies more broadly than the simple approach taken in this post.
You can use the corpcor
package to form a Ledoit-Wolf-like regularized precision matrix P
, and you should try it! The result is pretty similar to what we get from our simple truncated eigenvalue decomposition regularization in this example.
The (i, j)th entry of the precision matrix P
is a measure of association between the log(return) time series for the stocks in columns i and j, with larger values corresponding to more association.
An interesting way to group related stocks together is to think of the precision matrix as an adjacency matrix defining a weighted, undirected network of stock associations. Thresholding entries of the precision matrix to include, say, only the top 10% results in a network of only the most strongly associated stocks.
Thinking in terms of networks opens up a huge and useful toolbox: graph theory. We gain access to all kinds of nifty ways to analyze and visualize data, including methods for clustering and community detection.
R’s comprehensive igraph package by Gábor Csárdi (https://cran.r-project.org/package=igraph) gives us the tools we need. The code below uses igraph’s cluster_louvain() function to segment the thresholded precision matrix of stocks into groups, and it produces an igraph graph object g, with vertices colored by group membership.
suppressMessages(library(igraph))
threshold = 0.90
Q = P * (P > quantile(P, probs=threshold)) # thresholded precision matrix
g = graph.adjacency(Q, mode="undirected", weighted=TRUE, diag=FALSE) # ...expressed as a graph
# The rest of the code lumps any singletons lacking edges into a single 'unassociated' group shown in gray
# (also assigning distinct colors to the other groups).
x = groups(cluster_louvain(g))
i = unlist(lapply(x, length))
d = order(i, decreasing=TRUE)
x = x[d]
i = i[d]
j = i > 1
s = sum(j)
names(x)[j] = seq(1, s)
names(x)[! j] = s + 1
grp = as.integer(rep(names(x), i))
clrs = c(rainbow(s), "gray")[grp[order(unlist(x))]]
g = set_vertex_attr(g, "color", value=clrs)
Use the latest threejs
package to make a nice interactive visualization of the network (you can use your mouse/trackpad to rotate, zoom and pan the visualization).
library(threejs)
graphjs(g, vertex.size=0.2, vertex.shape=colnames(X), edge.alpha=0.5)
The stock groups identified by this method are uncanny, but hardly all that surprising really. Look closely and you will see clusters made up of bank-like companies (AIG, BAC, BK, C, COF, GS, JPM, MET, MS, USB, WFC), pharmaceutical companies (ABT, AMGN, BIIB, BMY, CELG, GILD, JNJ, LLY, MRK, PFE), computer/technology-driven companies (AAPL, ACN, CSCO, IBM, INTC, MSFT, ORCL, QCOM, T, TXN, VZ – except oddly, the inclusion of CAT in this list), and so on. With the threshold value of 0.9 above, a few stocks aren’t connected to any others; they appear in gray.
The groups more or less correspond to what we already know!
The group that includes FB, GOOG, and AMZN (Facebook, Alphabet/Google, and Amazon) is interesting and a bit mysterious. It includes credit card companies V (Visa), MA (Mastercard) and American Express (AXP). Perhaps the returns of FB, GOOG and AMZN are more closely connected to consumer spending than technology! But oddly, this group also includes a few energy companies (DUK, EXC, NEE), and I’m not sure what to make of that…
This way of looking at things also nicely highlights connections between groups. For instance, we see that a group containing consumer products companies (PEP, KO, PG, CL, etc.) is connected to both the Pharma group, and the credit card company group. And see the appendix below for a visualization that explores different precision matrix threshold values, including lower values with far greater network connectivity.
We downloaded daily closing stock prices for 100 stocks from the S&P 500, and, using basic tools of statistics and analysis like correlation and regularization, we grouped the stocks together in a network that highlights associations within and between the groups. The structure teased out of the stock price data is reasonably intuitive.
Appendix: threejs tricks
The following self-contained example shows how the network changes with the threshold value. It performs the same steps as we did above, but uses some tricks in threejs and an experimental extension to the crosstalk package, plus a few additional R packages, to present an interactive animation. Enjoy!
suppressMessages({
library(quantmod)
library(igraph)
library(threejs)
library(crosstalk)
library(htmltools)
# using an experimental extension to crosstalk:
library(crosstool) # devtools::install_github('bwlewis/crosstool')
})
# Download the processed log(returns) data:
suppressMessages(load(url("http://illposed.net/logreturns.rdata")))
X = cor(log_returns)
L = eigen(X, symmetric=TRUE)
N = 10 # (use 1st 10 eigenvectors, set N larger to reduce regularization)
P = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N]))
colnames(P) = colnames(X)
# A function that creates a network for a given threshold and precision matrix
f = function(threshold, P)
{
Q = P * (P > quantile(P, probs=threshold)) # thresholded precision matrix
g = graph.adjacency(Q, mode="undirected", weighted=TRUE, diag=FALSE) # ...expressed as a graph
x = groups(cluster_louvain(g))
i = unlist(lapply(x, length))
d = order(i, decreasing=TRUE)
x = x[d]
i = i[d]
j = i > 1
s = sum(j)
names(x)[j] = seq(1, s)
names(x)[! j] = s + 1
grp = as.integer(rep(names(x), i))
clrs = c(rainbow(s), "gray")[grp[order(unlist(x))]]
g = set_vertex_attr(g, "color", value=clrs)
set_vertex_attr(g, "shape", value=colnames(P))
}
threshold = c(0.99, 0.95, 0.90, 0.85, 0.8)
g = Map(f, threshold, MoreArgs=list(P=P)) # list of graphs, one for each threshold
# Compute force-directed network layouts for each threshold value
# A bit expensive to compute, so run in parallel!
library(parallel)
l = mcMap(function(x) layout_with_fr(x, dim=3, niter=150), g, mc.cores=detectCores())
sdf = SharedData$new(data.frame(key=paste(seq(0, length(threshold) - 1))), key=~key)
slider = crosstool(sdf, "transmitter",
        # NOTE: the HTML template string passed to sprintf() was lost in extraction;
        # the <input type="range"> element below is a plausible reconstruction.
        sprintf("<input type='range' min='0' max='%d' value='0' style='width: 100%%;'/>",
                length(threshold) - 1), width="100%", height=20, channel="filter")
vis = graphjs(g, l, vertex.size=0.2, main=as.list(threshold), defer=TRUE, edge.alpha=0.5, deferfps=30,
crosstalk=sdf, width="100%", height=900)
browsable(div(list(HTML(""), tags$h3("Precision matrix quantile threshold (adjust slider to change)"), slider, vis)))
Using regression trees for forecasting double-seasonal time series with trend in R
After a blogging break caused by writing research papers, I managed to find time to write something new about time series forecasting. This time I want to share my experience with seasonal-trend time series forecasting using simple regression trees. Classification and regression trees (also called decision trees) are a broadly used machine learning method for modeling, popular because they are simple, interpretable and fast.
But! There is always some “but”: they adapt poorly when new, unexpected situations (values) appear. In other words, they cannot detect and adapt to change or concept drift well (if at all). This is because, during learning, a tree creates only simple rules based on the training data. A simple decision tree does not compute regression coefficients the way linear regression does, so trend modeling is not possible. You might ask, then, why talk about time series forecasting with regression trees at all? I will explain how to deal with this in more detail later in this post.
In this post you will learn how to detrend a double-seasonal time series, engineer seasonal and lag features, and forecast it one day ahead with regression trees.
As in previous posts, I will use smart meter data of electricity consumption to demonstrate forecasting of seasonal time series. I created a new dataset of aggregated electricity load of consumers from an anonymous area. The time series has a length of 17 weeks.
Firstly, let's load all of the packages needed for data analysis, modeling and visualization.
library(feather) # data import
library(data.table) # data handle
library(rpart) # decision tree method
library(rpart.plot) # tree plot
library(party) # decision tree method
library(forecast) # forecasting methods
library(ggplot2) # visualizations
library(ggforce) # visualization tools
library(plotly) # interactive visualizations
library(grid) # visualizations
library(animation) # gif
Now read the time series data with read_feather and convert it to a data.table.
DT <- as.data.table(read_feather("DT_load_17weeks"))
And store the vector of unique dates and the period of the time series, which is 48 (measurements per day).
n_date <- unique(DT[, date])
period <- 48
For data visualization, store my favorite ggplot theme settings using the theme function.
theme_ts <- theme(panel.border = element_rect(fill = NA,
colour = "grey10"),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey85"),
panel.grid.major = element_line(colour = "grey85"),
panel.grid.major.x = element_line(colour = "grey85"),
axis.text = element_text(size = 13, face = "bold"),
axis.title = element_text(size = 15, face = "bold"),
plot.title = element_text(size = 16, face = "bold"),
strip.text = element_text(size = 16, face = "bold"),
strip.background = element_rect(colour = "black"),
legend.text = element_text(size = 15),
legend.title = element_text(size = 16, face = "bold"),
legend.background = element_rect(fill = "white"),
legend.key = element_rect(fill = "white"))
Now, pick a range of dates of length three weeks from the dataset to split the data into train and test parts. The test set has a length of only one day, because we will perform a one-day-ahead forecast of electricity consumption.
data_train <- DT[date %in% n_date[43:63]]
data_test <- DT[date %in% n_date[64]]
Let’s plot the train set and corresponding average weekly values of electricity load.
averages <- data.table(value = rep(sapply(0:2, function(i)
mean(data_train[((i*period*7)+1 ):((i+1)*period*7), value])),
each = period * 7),
date_time = data_train$date_time)
ggplot(data_train, aes(date_time, value)) +
geom_line() +
geom_line(data = averages, aes(date_time, value),
linetype = 5, alpha = 0.75, size = 1.2, color = "firebrick2") +
labs(x = "Date", y = "Load (kW)") +
theme_ts
We can see a trend increasing over time; maybe air conditioning is used more as it gets hotter in summer. The double-seasonal (daily and weekly) character of the time series is obvious.
A very useful method for visualization and analysis of time series is STL decomposition.
STL decomposition is based on Loess regression, and it decomposes a time series into three parts: seasonal, trend and remainder.
We will use results from the STL decomposition to model our data as well.
I am using stl() from the stats package, and before computation we must define the weekly seasonality of our time series object. Let’s look at the results:
data_ts <- ts(data_train$value, freq = period * 7)
decomp_ts <- stl(data_ts, s.window = "periodic", robust = TRUE)$time.series
decomp_stl <- data.table(Load = c(data_train$value, as.numeric(decomp_ts)),
Date = rep(data_train[,date_time], ncol(decomp_ts)+1),
Type = factor(rep(c("original data", colnames(decomp_ts)),
each = nrow(decomp_ts)),
levels = c("original data", colnames(decomp_ts))))
ggplot(decomp_stl, aes(x = Date, y = Load)) +
geom_line() +
facet_grid(Type ~ ., scales = "free_y", switch = "y") +
labs(x = "Date", y = NULL,
title = "Time Series Decomposition by STL") +
theme_ts
As expected from the previous picture, we can see that there is a “slight” trend, first increasing and then decreasing (by around 100 kW, so actually not that slight).
The remainder part (noise) fluctuates a lot and does not look like classical white noise (we are obviously missing additional information such as weather and other unexpected events).
In this section I will do the feature engineering needed to model a double-seasonal time series with trend as well as possible using only the available historical values.
The classical way to handle seasonality is to add seasonal features to a model as vectors of the form \( (1, \dots, DailyPeriod, 1, \dots, DailyPeriod, \dots) \) for the daily season or \( (1, \dots, 1, 2, \dots, 2, \dots, 7, 1, \dots) \) for the weekly season. I used this approach in my previous post about GAM, and something similar with multiple linear regression.
A better way to model seasonal features with nonlinear regression methods like trees is to transform them into Fourier terms (sines and cosines). This is more effective for tree models and also for other nonlinear machine learning methods. I will explain why later in this post.
Fourier daily signals (terms) are defined as:
\( \left( \sin\left(\frac{2\pi j t}{48}\right),\ \cos\left(\frac{2\pi j t}{48}\right) \right)_{j=1}^{ds}, \)
where \( ds \) is the number of daily-seasonality Fourier pairs, and Fourier weekly terms are defined as:
\( \left( \sin\left(\frac{2\pi j t}{336}\right),\ \cos\left(\frac{2\pi j t}{336}\right) \right)_{j=1}^{ws}, \)
where \( ws \) is the number of weekly-seasonality Fourier pairs.
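For intuition, the first daily Fourier pair can be written out by hand; the fourier() helper used below generates columns of exactly this form (here tt just indexes the half-hourly observations):
# First daily Fourier pair (j = 1, daily period 48), constructed manually for one week
tt <- 1:(48 * 7)
S1_48 <- sin(2 * pi * 1 * tt / 48)
C1_48 <- cos(2 * pi * 1 * tt / 48)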
Another great feature (often the most powerful one) is a lag of the original time series. We can use a lag of one day, one week, etc.
The lagged time series can be preprocessed by removing noise or trend, for example with the STL decomposition method, to ensure stability.
As mentioned earlier, regression trees can’t predict a trend because they only create rules and predict future values from the rules learned on the training set. Therefore, the original time series that enters the regression tree as the dependent variable must be detrended (the trend part of the time series removed). The acquired trend part can then be forecasted separately, for example by an ARIMA model.
Let’s go to constructing mentioned features and trend forecasting.
Double-seasonal Fourier terms can be simply extracted with the fourier function from the forecast package.
Firstly, we must create a multiple-seasonal time series object with the function msts.
data_msts <- msts(data_train$value, seasonal.periods = c(period, period*7))
Now use the fourier function, specifying the number of Fourier pairs K for each of the two seasonalities.
Set both K values, for example, to just 2.
K <- 2
fuur <- fourier(data_msts, K = c(K, K))
This creates 2 pairs (sine and cosine) of daily and weekly seasonal signals.
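We can peek at the generated matrix; the column names encode the pair index and the seasonal period (e.g. S1-48 and C1-48 for the daily season, S1-336 and C1-336 for the weekly one):
head(fuur, 3)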
If we compare them with the approach described in previous posts, i.e. simple periodic vectors, it looks like this:
Daily <- rep(1:period, 21) # simple daily vector
Weekly <- data_train[, week_num] # simple weekly vector
data_fuur_simple <- data.table(value = c(scale(Daily), fuur[,2], scale(Weekly), fuur[,6]),
date = rep(data_train$date_time, 4),
method = rep(c("simple-daily", "four-daily",
"simple-weekly", "four-weekly"),
each = nrow(fuur)),
type = rep(c("Daily season", "Weekly season"),
each = nrow(fuur)*2))
ggplot(data_fuur_simple, aes(x = date, y = value, color = method)) +
geom_line(size = 1.2, alpha = 0.7) +
facet_grid(type ~ ., scales = "free_y", switch = "y") +
labs(x = "Date", y = NULL,
title = "Features Comparison") +
theme_ts
where four-daily is the Fourier term for daily season, simple-daily is the simple feature for daily season, four-weekly is the Fourier term for weekly season, and simple-weekly is the simple feature for weekly season. The advantage of Fourier terms is that there is much more closeness between ending and starting of a day or a week, which is more natural.
Now, let’s use the data from the STL decomposition to forecast the trend part of the time series. I will use the auto.arima procedure from the forecast package for this.
trend_part <- ts(decomp_ts[,2])
trend_fit <- auto.arima(trend_part)
trend_for <- forecast(trend_fit, period)$mean
Let’s plot it:
trend_data <- data.table(Load = c(decomp_ts[,2], trend_for),
Date = c(data_train$date_time, data_test$date_time),
Type = c(rep("Real", nrow(data_train)), rep("Forecast",
nrow(data_test))))
ggplot(trend_data, aes(Date, Load, color = Type)) +
geom_line(size = 1.2) +
labs(title = paste(trend_fit)) +
theme_ts
The auto.arima function chose an ARIMA(0,2,0) model as the best one for trend forecasting.
Next, make the final feature for the model (the lag) and construct the train matrix (model matrix).
I create a lag of one day and take just the seasonal part from the STL decomposition (to have a smooth lag feature).
N <- nrow(data_train)
window <- (N / period) - 1 # number of days in train set minus lag
new_load <- rowSums(decomp_ts[, c(1,3)]) # detrended load
lag_seas <- decomp_ts[1:(period*window), 1] # seasonal part of time series as lag feature
matrix_train <- data.table(Load = tail(new_load, window*period),
fuur[(period + 1):N,],
Lag = lag_seas)
The accuracy of the forecast (or of the fitted values of a model) will be measured by MAPE; let’s define it:
mape <- function(real, pred){
return(100 * mean(abs((real - pred)/real))) # MAPE - Mean Absolute Percentage Error
}
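In formula form, this is \( \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \), where \( y_t \) are the real values and \( \hat{y}_t \) the predictions.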
In the next two sections, I will describe two regression tree methods. The first is RPART, or CART (Classification and Regression Trees); the second will be CTREE. RPART is a recursive partitioning method that builds a binary tree for classification or regression tasks. It performs a search over all possible splits, maximizing an information measure of node impurity and selecting the covariate showing the best split.
I’m using the rpart implementation from the package of the same name. Let’s go ahead with modeling and try the default settings of the rpart function:
tree_1 <- rpart(Load ~ ., data = matrix_train)
It produces many interesting outputs to check; for example, we can see a table of nodes and corresponding errors with printcp(tree_1), or a detailed summary of the created nodes with summary(tree_1). We will check the variable importance and the number of created splits:
tree_1$variable.importance
## Lag C2-48 C1-336 S1-48 C1-48 S1-336 C2-336
## 100504751 45918330 44310331 36245736 32359598 27831258 25385506
## S2-336 S2-48
## 15156041 7595266
paste("Number of splits: ", tree_1$cptable[dim(tree_1$cptable )[1], "nsplit"])
## [1] "Number of splits: 10"
We can see that the most important variables are the Lag and the cosine forms of the daily and weekly seasons. The number of splits is 10; hmm, is that enough for a time series of 1008 values?
Let’s plot the created rules with the fancy rpart.plot function from the package of the same name:
rpart.plot(tree_1, digits = 2,
box.palette = viridis::viridis(10, option = "D", begin = 0.85, end = 0),
shadow.col = "grey65", col = "grey99")
We can see the values, the rules, and the percentage of observations covered by each split. Pretty simple and interpretable.
Now plot the fitted values to see the results of the tree_1 model.
datas <- data.table(Load = c(matrix_train$Load,
predict(tree_1)),
Time = rep(1:length(matrix_train$Load), 2),
Type = rep(c("Real", "RPART"), each = length(matrix_train$Load)))
ggplot(datas, aes(Time, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
labs(y = "Detrended load", title = "Fitted values from RPART tree") +
theme_ts
And see the error of fitted values against real values.
mape(matrix_train$Load, predict(tree_1))
## [1] 180.6669
Whoops. It’s a little bit too simple (rectangular) and not really accurate, but that’s the logical result of a simple tree model.
The key to achieving better results and a more accurate fit is to manually set the control hyperparameters of rpart. Check ?rpart.control for more information.
The “hack” is to change the cp (complexity) parameter to a very low value to produce more splits (nodes). The cp is a threshold deciding whether a branch fulfills the conditions for further processing, so only nodes with a fitness larger than cp are processed. Other important parameters are the minimum number of observations needed in a node to attempt a split (minsplit) and the maximal depth of the tree (maxdepth).
Set minsplit to 2 and set maxdepth to its maximal value, 30.
tree_2 <- rpart(Load ~ ., data = matrix_train,
control = rpart.control(minsplit = 2,
maxdepth = 30,
cp = 0.000001))
Now make a simple plot to see the depth of the created tree…
plot(tree_2, compress = TRUE)
That’s a rather more impressive tree than the previous one, isn’t it? Let’s also check the number of splits.
tree_2$cptable[dim(tree_2$cptable )[1], "nsplit"] # Number of splits
## [1] 600
600 splits is a lot more than 10.
Let’s plot fitted values from the model tree_2
:
datas <- data.table(Load = c(matrix_train$Load,
predict(tree_2)),
Time = rep(1:length(matrix_train$Load), 2),
Type = rep(c("Real", "RPART"), each = length(matrix_train$Load)))
ggplot(datas, aes(Time, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
labs(y = "Detrended load", title = "Fitted values from RPART") +
theme_ts
And see the error of fitted values against real values.
mape(matrix_train$Load, predict(tree_2))
## [1] 16.0639
Much better, but obviously the model can be overfitted now.
Now let’s put together everything we have so far and forecast the load one day ahead.
Let’s create the testing data matrix:
test_lag <- decomp_ts[((period*window)+1):N , 1]
fuur_test <- fourier(data_msts, K = c(K, K), h = period)
matrix_test <- data.table(fuur_test,
Lag = test_lag)
Predict the detrended part of the time series with the tree_2 model and add the trend part forecasted by the ARIMA model.
for_rpart <- predict(tree_2, matrix_test) + trend_for
Let’s plot the results and compare them with the real values from data_test.
data_for <- data.table(Load = c(data_train$value, data_test$value, for_rpart),
Date = c(data_train$date_time, rep(data_test$date_time, 2)),
Type = c(rep("Train data", nrow(data_train)),
rep("Test data", nrow(data_test)),
rep("Forecast", nrow(data_test))))
ggplot(data_for, aes(Date, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
facet_zoom(x = Date %in% data_test$date_time, zoom.size = 1.2) +
labs(title = "Forecast from RPART") +
theme_ts
Not bad. For comparison, let’s look at the forecasting results of a model without separate trend forecasting and detrending.
matrix_train_sim <- data.table(Load = tail(data_train$value, window*period),
fuur[(period+1):N,],
Lag = lag_seas)
tree_sim <- rpart(Load ~ ., data = matrix_train_sim,
control = rpart.control(minsplit = 2,
maxdepth = 30,
cp = 0.000001))
for_rpart_sim <- predict(tree_sim, matrix_test)
data_for <- data.table(Load = c(data_train$value, data_test$value, for_rpart, for_rpart_sim),
Date = c(data_train$date_time, rep(data_test$date_time, 3)),
Type = c(rep("Train data", nrow(data_train)),
rep("Test data", nrow(data_test)),
rep("Forecast with trend", nrow(data_test)),
rep("Forecast simple", nrow(data_test))))
ggplot(data_for, aes(Date, Load, color = Type, linetype = Type)) +
geom_line(size = 0.8, alpha = 0.7) +
facet_zoom(x = Date %in% data_test$date_time, zoom.size = 1.2) +
labs(title = "Forecasts from RPARTs with and without trend forecasting") +
scale_linetype_manual(values = c(5,6,1,1)) +
theme_ts
We can see that the RPART model without trend handling produces higher forecast values.
Evaluate the results with the MAPE forecasting measure.
mape(data_test$value, for_rpart)
## [1] 3.727473
mape(data_test$value, for_rpart_sim)
## [1] 6.976259
We can see a large difference in MAPE. So detrending the original time series and forecasting its trend part separately really works, but let’s not generalize from a single result. You can read more about the RPART method in its great package vignette.
The second simple regression tree method that will be used is CTREE. Conditional inference trees (CTREE) are a statistical approach to recursive partitioning that takes into account the distributional properties of the data. CTREE performs multiple test procedures to determine whether there is no significant association between any of the features and the response (the load in our case), in which case the recursion stops.
In R CTREE is implemented in the package party
in the function ctree
.
Let’s try fitting a simple ctree with default values.
ctree_1 <- ctree(Load ~ ., data = matrix_train)
The constructed tree can again be plotted simply with the plot function, but it made many splits, so the plot is cluttered.
Let’s plot the fitted values from the ctree_1 model instead.
datas <- data.table(Load = c(matrix_train$Load,
predict(ctree_1)),
Time = rep(1:length(matrix_train$Load), 2),
Type = rep(c("Real", "CTREE"), each = length(matrix_train$Load)))
ggplot(datas, aes(Time, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
labs(y = "Detrended load", title = "Fitted values from CTREE") +
theme_ts
And see the error of fitted values against real values.
mape(matrix_train$Load, predict(ctree_1))
## [1] 87.85983
Actually, this is pretty nice, but again, it can be tuned.
For the available tuning hyperparameters, check ?ctree_control. I changed the hyperparameters minsplit and minbucket, which have a similar meaning to the cp parameter in RPART. The mincriterion can be tuned as well; it is the significance level (1 – p-value) that must be exceeded in order to implement a split. Let’s plot the results.
ctree_2 <- ctree(Load ~ ., data = matrix_train,
controls = party::ctree_control(teststat = "quad",
testtype = "Teststatistic",
mincriterion = 0.925,
minsplit = 1,
minbucket = 1))
datas <- data.table(Load = c(matrix_train$Load,
predict(ctree_2)),
Time = rep(1:length(matrix_train$Load), 2),
Type = rep(c("Real", "CTREE"), each = length(matrix_train$Load)))
ggplot(datas, aes(Time, Load, color = Type)) +
geom_line(size = 0.8, alpha = 0.75) +
labs(y = "Detrended load", title = "Fitted values from CTREE") +
theme_ts
And see the error of fitted values against real values.
mape(matrix_train$Load, predict(ctree_2))
## [1] 39.70532
That’s better. Now forecast values with the ctree_2 model.
for_ctree <- predict(ctree_2, matrix_test) + trend_for
And compare CTREE with RPART model.
data_for <- data.table(Load = c(data_train$value, data_test$value, for_rpart, for_ctree),
Date = c(data_train$date_time, rep(data_test$date_time, 3)),
Type = c(rep("Train data", nrow(data_train)),
rep("Test data", nrow(data_test)),
rep("RPART", nrow(data_test)),
rep("CTREE", nrow(data_test))))
ggplot(data_for, aes(Date, Load, color = Type, linetype = Type)) +
geom_line(size = 0.8, alpha = 0.7) +
facet_zoom(x = Date %in% data_test$date_time, zoom.size = 1.2) +
labs(title = "Forecasts from RPART and CTREE models") +
scale_linetype_manual(values = c(5,6,1,1)) +
theme_ts
mape(data_test$value, for_rpart)
## [1] 3.727473
mape(data_test$value, for_ctree)
## [1] 4.020834
A slightly better MAPE value with RPART, but again this is not something to generalize from. You can read more about the CTREE method in its great package vignette.
Now let’s try forecasting future values using all the available electricity load data with a sliding window approach (window length of three weeks) over a period of more than three months (98 days).
Define the functions that produce the forecasts, putting together everything we have learned so far.
RpartTrend <- function(data, set_of_date, K, period = 48){
data_train <- data[date %in% set_of_date]
N <- nrow(data_train)
window <- (N / period) - 1
data_ts <- msts(data_train$value, seasonal.periods = c(period, period*7))
fuur <- fourier(data_ts, K = c(K, K))
fuur_test <- as.data.frame(fourier(data_ts, K = c(K, K), h = period))
data_ts <- ts(data_train$value, freq = period*7)
decomp_ts <- stl(data_ts, s.window = "periodic", robust = TRUE)
new_load <- rowSums(decomp_ts$time.series[, c(1,3)])
trend_part <- ts(decomp_ts$time.series[,2])
trend_fit <- auto.arima(trend_part)
trend_for <- as.vector(forecast(trend_fit, period)$mean)
lag_seas <- decomp_ts$time.series[1:(period *window), 1]
matrix_train <- data.table(Load = tail(new_load, window*period),
fuur[(period+1):N,],
Lag = lag_seas)
tree_1 <- rpart(Load ~ ., data = matrix_train,
control = rpart.control(minsplit = 2,
maxdepth = 30,
cp = 0.000001))
test_lag <- decomp_ts$time.series[((period* (window))+1):N, 1]
matrix_test <- data.table(fuur_test,
Lag = test_lag)
# prediction
pred_tree <- predict(tree_1, matrix_test) + trend_for
return(as.vector(pred_tree))
}
CtreeTrend <- function(data, set_of_date, K, period = 48){
# subsetting the dataset by dates
data_train <- data[date %in% set_of_date]
N <- nrow(data_train)
window <- (N / period) - 1
data_ts <- msts(data_train$value, seasonal.periods = c(period, period*7))
fuur <- fourier(data_ts, K = c(K, K))
fuur_test <- as.data.frame(fourier(data_ts, K = c(K, K), h = period))
data_ts <- ts(data_train$value, freq = period*7)
decomp_ts <- stl(data_ts, s.window = "periodic", robust = TRUE)
new_load <- rowSums(decomp_ts$time.series[, c(1,3)])
trend_part <- ts(decomp_ts$time.series[,2])
trend_fit <- auto.arima(trend_part)
trend_for <- as.vector(forecast(trend_fit, period)$mean)
lag_seas <- decomp_ts$time.series[1:(period *window), 1]
matrix_train <- data.table(Load = tail(new_load, window*period),
fuur[(period+1):N,],
Lag = lag_seas)
tree_2 <- party::ctree(Load ~ ., data = matrix_train,
controls = party::ctree_control(teststat = "quad",
testtype = "Teststatistic",
mincriterion = 0.925,
minsplit = 1,
minbucket = 1))
test_lag <- decomp_ts$time.series[((period* (window))+1):N, 1]
matrix_test <- data.table(fuur_test,
Lag = test_lag)
pred_tree <- predict(tree_2, matrix_test) + trend_for
return(as.vector(pred_tree))
}
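These two functions are only defined here; a hedged sketch of how they might be driven in the sliding-window evaluation follows (the window length and indexing below are assumptions; the actual evaluation script is in the GitHub repository mentioned next):
win <- 21                              # three-week window, in days
n_days <- length(n_date) - win         # number of one-day-ahead forecasts
err_rpart <- sapply(0:(n_days - 1), function(i) {
  pred <- RpartTrend(DT, n_date[(i + 1):(i + win)], K = 2)
  real <- DT[date %in% n_date[i + win + 1], value]
  mape(real, pred)
})
summary(err_rpart)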
I created a plotly boxplot graph of the MAPE values from four models: CTREE simple, CTREE with detrending, RPART simple and RPART with detrending. The whole evaluation can be seen in the script stored in my GitHub repository.
We can see that detrending the electricity consumption time series improves the accuracy of the forecast for both regression tree methods, RPART and CTREE. My approach works as expected.
My posts have a habit of including an animation. So I prepared two animations (animated dashboards) for you, using the animation, grid, ggplot2 and ggforce (for zooming) packages, to visualize the forecasting results.
We can see that on many days the forecast is almost perfect, but on some days there is still room for improvement.
In this post, I showed you how to handle trend in seasonal time series when using a regression tree model. Detrending the time series is an important (indeed, necessary) step for regression tree methods, due to the character of decision trees. The trend part of the time series was extracted by STL decomposition and forecasted separately by a simple ARIMA model. I evaluated this approach on a dataset of smart meter measurements of electricity consumption. The regression (decision) tree is a great technique for getting simple and interpretable results in very fast computational time.
In a future post, I will focus on enhancing the predictive performance of simple regression tree methods with ensemble learning methods like Bagging, Random Forest, and similar.
Do you want to get your own simmer hexagon sticker? Just fill in this form and get one sent to you for free.
Check out r-simmer.org or CRAN for more information on simmer
, a discrete-event simulation package for R.