One of the most prominent classical statistical techniques is the Analysis of Variance (ANOVA). ANOVA is an especially important tool in experimental analysis, where it is used as an omnibus test of the null hypothesis that mean outcomes across all groups are equal (or, stated differently, that the outcome variance between groups is no larger than the outcome variance within groups). This tutorial walks through the basics of using ANOVA in R. We'll start with some fake data from an imaginary four-group experiment:
set.seed(100)
tr <- rep(1:4, each = 30)
y <- numeric(length = 120)
y[tr == 1] <- rnorm(30, 5, 1)
y[tr == 2] <- rnorm(30, 4, 2)
y[tr == 3] <- rnorm(30, 4, 5)
y[tr == 4] <- rnorm(30, 1, 2)
The principal use of ANOVA is to partition the sum of squares from the data and test whether the variance across groups is larger than the variance within groups. The function to do this in R is aov. (Note: This should not be confused with the anova function, which is a model-comparison tool for regression models.)
ANOVA models can be expressed as formulae (like in regression, since the techniques are analogous):
aov(y ~ tr)
## Call:
## aov(formula = y ~ tr)
##
## Terms:
## tr Residuals
## Sum of Squares 251.2 1196.4
## Deg. of Freedom 1 118
##
## Residual standard error: 3.184
## Estimated effects may be unbalanced
The default output of the aov function is surprisingly uninformative. Note also that because tr is stored as a numeric vector, aov treated it as a single continuous predictor (hence only 1 degree of freedom above); we should convert it with factor() and use summary to see more meaningful output:
summary(aov(y ~ factor(tr)))
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(tr) 3 297 99.0 9.98 6.6e-06 ***
## Residuals 116 1151 9.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output is precisely what we would expect. It shows the "within" and "between" sums of squares, the F-statistic, and the p-value associated with that statistic. If the test is significant (as it is in this case), we also see significance stars on the right-hand side.
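We can reproduce the F statistic by hand from the table above (a quick sketch using the rounded sums of squares, so the result is only approximate):
ss_between <- 297; df_between <- 3  # between-groups sum of squares and degrees of freedom
ss_within <- 1151; df_within <- 116  # residual (within-groups) sum of squares and degrees of freedom
(f_stat <- (ss_between/df_between)/(ss_within/df_within))  # about 9.98
pf(f_stat, df_between, df_within, lower.tail = FALSE)  # p-value, about 6.6e-06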
Another way to see similar output is with the oneway.test function, which conducts a one-way ANOVA (by default without assuming equal variances), whereas aov can accommodate more complex experimental designs:
oneway.test(y ~ tr)
##
## One-way analysis of means (not assuming equal variances)
##
## data: y and tr
## F = 39.38, num df = 3.00, denom df = 54.25, p-value = 1.191e-13
The oneway.test function allows us to control whether equal variances are assumed across groups with the var.equal argument:
oneway.test(y ~ factor(tr), var.equal = TRUE)
##
## One-way analysis of means
##
## data: y and factor(tr)
## F = 9.983, num df = 3, denom df = 116, p-value = 6.634e-06
I always feel like the F-statistic is a bit of a letdown. It's a lot of calculation reduced to a single number, which really doesn't tell you much. Instead, we need to actually summarize the data - with a table or figure - in order to see what that F-statistic means in practice.
As a non-parametric alternative to the ANOVA, which invokes a normality assumption about the residuals, one can use the Kruskal-Wallis analysis of variance test. This does not assume normality of residuals, but it does assume that the treatment group outcome distributions have identical shape (other than a shift in median). To implement the Kruskal-Wallis ANOVA, we simply use kruskal.test:
kruskal.test(y ~ tr)
##
## Kruskal-Wallis rank sum test
##
## data: y by tr
## Kruskal-Wallis chi-squared = 36.96, df = 3, p-value = 4.702e-08
The output of this test is somewhat simpler than that from aov, presenting us with the test statistic and associated p-value immediately.
For more details on assumptions about distributions, look at the tutorial on variance tests.
Post-hoc comparisons are possible in R. The TukeyHSD function is available in the base stats package, but the multcomp add-on package offers much more. Other options include the psych package and the car package. In all, it's too much to cover in detail here. We'll look at the TukeyHSD function, which estimates Tukey's Honestly Significant Difference statistics (for all pairwise group comparisons in an aov object):
TukeyHSD(aov(y ~ factor(tr)))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = y ~ factor(tr))
##
## $`factor(tr)`
## diff lwr upr p adj
## 2-1 -0.8437 -2.963 1.2759 0.7278
## 3-1 -1.2482 -3.368 0.8714 0.4200
## 4-1 -4.1787 -6.298 -2.0590 0.0000
## 3-2 -0.4045 -2.524 1.7151 0.9595
## 4-2 -3.3350 -5.455 -1.2153 0.0004
## 4-3 -2.9304 -5.050 -0.8108 0.0026
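The returned object also has a plot method that draws each pairwise difference with its confidence interval, which can be easier to read than the table:
plot(TukeyHSD(aov(y ~ factor(tr))))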
One can always fall back on the trusty t-test (implemented with t.test) to compare treatment groups pairwise:
t.test(y[tr %in% 1:2] ~ tr[tr %in% 1:2])
##
## Welch Two Sample t-test
##
## data: y[tr %in% 1:2] by tr[tr %in% 1:2]
## t = 1.896, df = 34.2, p-value = 0.06646
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06051 1.74794
## sample estimates:
## mean in group 1 mean in group 2
## 5.029 4.185
t.test(y[tr %in% c(2, 4)] ~ tr[tr %in% c(2, 4)])
##
## Welch Two Sample t-test
##
## data: y[tr %in% c(2, 4)] by tr[tr %in% c(2, 4)]
## t = 5.98, df = 56.41, p-value = 1.6e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.218 4.452
## sample estimates:
## mean in group 2 mean in group 4
## 4.1851 0.8502
But the user should, of course, be aware of the problems associated with multiple comparisons.
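One convenient option in base R is pairwise.t.test, which runs all pairwise comparisons and adjusts the p-values for multiplicity (a quick sketch; the default adjustment is Holm's method, and others such as Bonferroni can be requested):
pairwise.t.test(y, factor(tr))  # Holm-adjusted p-values by default
pairwise.t.test(y, factor(tr), p.adjust.method = "bonferroni")  # Bonferroni adjustment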
The easiest way to summarize the information underlying an ANOVA procedure is to look at the treatment group means and variances (or standard deviations). Luckily, R makes it very easy to calculate these statistics for each group using the by function. If we want the mean of y for each level of tr, we simply call:
by(y, tr, FUN = mean)
## tr: 1
## [1] 5.029
## --------------------------------------------------------
## tr: 2
## [1] 4.185
## --------------------------------------------------------
## tr: 3
## [1] 3.781
## --------------------------------------------------------
## tr: 4
## [1] 0.8502
The result is an output that shows each treatment level and the associated mean. We can also obtain the same information in a slightly different format using tapply:
tapply(y, tr, FUN = mean)
## 1 2 3 4
## 5.0289 4.1851 3.7806 0.8502
This returns a named array, which is perhaps easier to work with. We can do the same for the treatment group standard deviations:
tapply(y, tr, FUN = sd)
## 1 2 3 4
## 0.702 2.334 5.464 1.970
And we could even bind them together:
out <- cbind(tapply(y, tr, FUN = mean), tapply(y, tr, FUN = sd))
colnames(out) <- c("mean", "sd")
out
## mean sd
## 1 5.0289 0.702
## 2 4.1851 2.334
## 3 3.7806 5.464
## 4 0.8502 1.970
The result is a nice matrix showing the mean and standard deviation for each group. If there were some other statistic we wanted to calculate for each group, we could easily use by or tapply to obtain it.
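For example, a quick sketch of the group medians and quartiles:
tapply(y, tr, FUN = median)  # median outcome in each treatment group
by(y, tr, FUN = quantile, probs = c(0.25, 0.75))  # group quartiles (extra arguments are passed to FUN)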
A perhaps more convenient way to see our data is to plot it. We can use plot to produce a simple scatterplot, and we can use our out matrix to highlight the treatment group means:
plot(y ~ tr, col = rgb(1, 0, 0, 0.5), pch = 16)
# highlight the means:
points(1:4, out[, 1], col = "blue", bg = "blue", pch = 23, cex = 2)
This is nice because it shows the distribution of the data, but we can also use a boxplot to summarize where points fall in each distribution. Specifically, a boxplot draws the five-number summary for each treatment group:
tapply(y, tr, fivenum)
## $`1`
## [1] 3.842 4.562 5.093 5.319 7.310
##
## $`2`
## [1] -0.5439 2.9554 3.9527 6.1308 7.7949
##
## $`3`
## [1] -6.3720 0.4349 3.7029 7.1900 16.9098
##
## $`4`
## [1] -1.9160 -0.5528 0.3361 1.8270 5.8914
boxplot(y ~ tr)
Another approach is to use our out object, containing the treatment group means and standard deviations, to draw a dotchart. We'll first divide the standard deviations by sqrt(30) to convert them to standard errors of the mean.
out[, 2] <- out[, 2]/sqrt(30)
dotchart(out[, 1], xlim = c(0, 6), xlab = "y", main = "Treatment group means",
pch = 23, bg = "black")
segments(out[, 1] - out[, 2], 1:4, out[, 1] + out[, 2], 1:4, lwd = 2)
segments(out[, 1] - 2 * out[, 2], 1:4, out[, 1] + 2 * out[, 2], 1:4, lwd = 1)
This plot nicely shows the means and both 1- and 2-standard errors of the mean.
R enables you to do basic math, using all the usual operators.
Addition
2 + 2
## [1] 4
1 + 2 + 3 + 4 + 5
## [1] 15
Subtraction
10 - 1
## [1] 9
5 - 6
## [1] -1
Multiplication
2 * 2
## [1] 4
1 * 2 * 3
## [1] 6
Division
4/2
## [1] 2
10/2/4
## [1] 1.25
Parentheses can be used to adjust order of operations:
10/2 + 2
## [1] 7
10/(2 + 2)
## [1] 2.5
(10/2) + 2
## [1] 7
Check your intuition about the order of operations by testing expressions like these on your own.
Exponents and square roots involve intuitive syntax:
2^2
## [1] 4
3^4
## [1] 81
1^0
## [1] 1
sqrt(4)
## [1] 2
So do logarithms:
log(0)
## [1] -Inf
log(1)
## [1] 0
and logarithms to other bases, including arbitrary ones:
log10(1)
## [1] 0
log10(10)
## [1] 1
log2(1)
## [1] 0
log2(2)
## [1] 1
logb(1, base = 5)
## [1] 0
logb(5, base = 5)
## [1] 1
The natural exponential function (powers of e) uses a similar syntax:
exp(0)
## [1] 1
exp(1)
## [1] 2.718
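As a quick check, exp and log are inverses of one another:
log(exp(3))  # recovers 3
exp(log(10))  # recovers 10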
There are also tons of other mathematical operations, like: Absolute value
abs(-10)
## [1] 10
Factorials:
factorial(10)
## [1] 3628800
and binomial coefficients, via choose:
choose(4, 1)
## [1] 4
choose(6, 3)
## [1] 20
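choose(n, k) is the binomial coefficient n!/(k!(n-k)!), which we can verify with factorial:
factorial(6)/(factorial(3) * factorial(6 - 3))  # same as choose(6, 3), i.e. 20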
R is obviously a statistical programming language and environment, so we can use it to do statistics. With any vector, we can calculate a number of statistics, including:
set.seed(1)
a <- rnorm(100)
minimum
min(a)
## [1] -2.215
maximum
max(a)
## [1] 2.402
We can get the minimum and maximum together with range:
range(a)
## [1] -2.215 2.402
We can also obtain the minimum by sorting the vector (using sort):
sort(a)[1]
## [1] -2.215
And we can obtain the maximum by sorting in the opposite order:
sort(a, decreasing = TRUE)[1]
## [1] 2.402
To calculate the central tendency, we have several options. mean
mean(a)
## [1] 0.1089
This is of course equivalent to:
sum(a)/length(a)
## [1] 0.1089
median
median(a)
## [1] 0.1139
In a vector with an even number of elements, this is equivalent to:
(sort(a)[length(a)/2] + sort(a)[length(a)/2 + 1])/2
## [1] 0.1139
In a vector with an odd number of elements, this is equivalent to taking the middle sorted value:
a2 <- a[-1]  #' drop first observation of `a` to get an odd-length vector
sort(a2)[(length(a2) + 1)/2]  # the middle (50th of 99) sorted value
## [1] 0.1533
We can also obtain measures of dispersion: Variance
var(a)
## [1] 0.8068
This is equivalent to:
sum((a - mean(a))^2)/(length(a) - 1)
## [1] 0.8068
Standard deviation
sd(a)
## [1] 0.8982
Which is equivalent to:
sqrt(var(a))
## [1] 0.8982
Or:
sqrt(sum((a - mean(a))^2)/(length(a) - 1))
## [1] 0.8982
There are also some convenience functions that provide multiple statistics. The fivenum function provides the five-number summary (minimum, lower hinge, median, upper hinge, and maximum):
fivenum(a)
## [1] -2.2147 -0.5103 0.1139 0.6934 2.4016
It is also possible to obtain arbitrary percentiles/quantiles from a vector:
quantile(a, 0.1) #' 10% quantile
## 10%
## -1.053
You can also specify a vector of quantiles:
quantile(a, c(0.025, 0.975))
## 2.5% 97.5%
## -1.671 1.797
quantile(a, seq(0, 1, by = 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80%
## -2.2147 -1.0527 -0.6139 -0.3753 -0.0767 0.1139 0.3771 0.5812 0.7713
## 90% 100%
## 1.1811 2.4016
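Note that the hinges reported by fivenum are close to, but not always identical to, the 25% and 75% quantiles, so the two summaries can differ slightly:
quantile(a, c(0, 0.25, 0.5, 0.75, 1))  # compare with the fivenum output above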
The summary function, applied to a numeric vector, provides those values and the mean:
summary(a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.210 -0.494 0.114 0.109 0.692 2.400
Note: The summary function returns different results if the vector is a logical, character, or factor. For a logical vector, summary returns some tabulations:
summary(as.logical(rbinom(100, 1, 0.5)))
## Mode FALSE TRUE NA's
## logical 62 38 0
For a character vector, summary returns just some basic information about the vector:
summary(sample(c("a", "b", "c"), 100, TRUE))
## Length Class Mode
## 100 character character
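If we instead want counts of each unique value in a character vector, we can use table (a quick sketch):
table(sample(c("a", "b", "c"), 100, TRUE))  # counts of each value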
For a factor, summary returns a table of all values in the vector:
summary(factor(a))
## -2.2146998871775 -1.98935169586337 -1.80495862889104
## 1 1 1
## -1.52356680042976 -1.47075238389927 -1.37705955682861
## 1 1 1
## -1.27659220845804 -1.2536334002391 -1.22461261489836
## 1 1 1
## -1.12936309608079 -1.04413462631653 -0.934097631644252
## 1 1 1
## -0.835628612410047 -0.820468384118015 -0.743273208882405
## 1 1 1
## -0.709946430921815 -0.70749515696212 -0.68875569454952
## 1 1 1
## -0.626453810742332 -0.621240580541804 -0.612026393250771
## 1 1 1
## -0.589520946188072 -0.573265414236886 -0.568668732818502
## 1 1 1
## -0.54252003099165 -0.47815005510862 -0.473400636439312
## 1 1 1
## -0.443291873218433 -0.41499456329968 -0.394289953710349
## 1 1 1
## -0.367221476466509 -0.305388387156356 -0.304183923634301
## 1 1 1
## -0.253361680136508 -0.164523596253587 -0.155795506705329
## 1 1 1
## -0.135178615123832 -0.135054603880824 -0.112346212150228
## 1 1 1
## -0.102787727342996 -0.0593133967111857 -0.0561287395290008
## 1 1 1
## -0.0538050405829051 -0.0449336090152309 -0.0392400027331692
## 1 1 1
## -0.0161902630989461 0.00110535163162413 0.0280021587806661
## 1 1 1
## 0.0743413241516641 0.0745649833651906 0.153253338211898
## 1 1 1
## 0.183643324222082 0.188792299514343 0.267098790772231
## 1 1 1
## 0.291446235517463 0.329507771815361 0.332950371213518
## 1 1 1
## 0.341119691424425 0.36458196213683 0.370018809916288
## 1 1 1
## 0.387671611559369 0.389843236411431 0.398105880367068
## 1 1 1
## 0.417941560199702 0.475509528899663 0.487429052428485
## 1 1 1
## 0.556663198673657 0.558486425565304 0.569719627442413
## 1 1 1
## 0.575781351653492 0.593901321217509 0.593946187628422
## 1 1 1
## 0.610726353489055 0.61982574789471 0.689739362450777
## 1 1 1
## 0.696963375404737 0.700213649514998 0.738324705129217
## 1 1 1
## 0.763175748457544 0.768532924515416 0.782136300731067
## 1 1 1
## 0.821221195098089 0.881107726454215 0.918977371608218
## 1 1 1
## 0.943836210685299 1.06309983727636 1.10002537198388
## 1 1 1
## 1.12493091814311 1.16040261569495 1.1780869965732
## 1 1 1
## 1.20786780598317 1.35867955152904 1.43302370170104
## 1 1 1
## 1.46555486156289 1.51178116845085 1.58683345454085
## 1 1 1
## 1.59528080213779 1.98039989850586 2.17261167036215
## 1 1 1
## 2.40161776050478
## 1
A summary of a dataframe will return the summary information separately for each column vector. This may produce a different result for each column, depending on the class of the column:
summary(data.frame(a = 1:10, b = 11:20))
## a b
## Min. : 1.00 Min. :11.0
## 1st Qu.: 3.25 1st Qu.:13.2
## Median : 5.50 Median :15.5
## Mean : 5.50 Mean :15.5
## 3rd Qu.: 7.75 3rd Qu.:17.8
## Max. :10.00 Max. :20.0
summary(data.frame(a = 1:10, b = factor(11:20)))
## a b
## Min. : 1.00 11 :1
## 1st Qu.: 3.25 12 :1
## Median : 5.50 13 :1
## Mean : 5.50 14 :1
## 3rd Qu.: 7.75 15 :1
## Max. :10.00 16 :1
## (Other):4
A summary of a list will return not very useful information:
summary(list(a = 1:10, b = 1:10))
## Length Class Mode
## a 10 -none- numeric
## b 10 -none- numeric
A summary of a matrix returns a summary of each column separately (like a dataframe):
summary(matrix(1:20, nrow = 4))
## V1 V2 V3 V4
## Min. :1.00 Min. :5.00 Min. : 9.00 Min. :13.0
## 1st Qu.:1.75 1st Qu.:5.75 1st Qu.: 9.75 1st Qu.:13.8
## Median :2.50 Median :6.50 Median :10.50 Median :14.5
## Mean :2.50 Mean :6.50 Mean :10.50 Mean :14.5
## 3rd Qu.:3.25 3rd Qu.:7.25 3rd Qu.:11.25 3rd Qu.:15.2
## Max. :4.00 Max. :8.00 Max. :12.00 Max. :16.0
## V5
## Min. :17.0
## 1st Qu.:17.8
## Median :18.5
## Mean :18.5
## 3rd Qu.:19.2
## Max. :20.0
This tutorial aims at making various binary outcome GLM models interpretable through the use of plots. As such, it begins by setting up some data (involving a few covariates) and then generates various versions of an outcome based upon data-generating processes with and without interaction. The aim of the tutorial is both to highlight the use of predicted probability plots for demonstrating effects and to demonstrate the challenge - even then - of clearly communicating the results of these types of models.
Let's begin by generating our covariates:
set.seed(1)
n <- 200
x1 <- rbinom(n, 1, 0.5)
x2 <- runif(n, 0, 1)
x3 <- runif(n, 0, 5)
Now, we'll build several models. Each model has an outcome that is a transformed linear function of the covariates (i.e., we calculate a y variable that is a linear function of the covariates, then rescale that outcome to [0,1], and use the rescaled version as the probability in generating draws from a binomial distribution).
# Simple multivariate model (no interaction):
y1 <- 2 * x1 + 5 * x2 + rnorm(n, 0, 3)
y1s <- rbinom(n, 1, (y1 - min(y1))/(max(y1) - min(y1))) # the math here is just to rescale to [0,1]
# Simple multivariate model (with interaction):
y2 <- 2 * x1 + 5 * x2 + 2 * x1 * x2 + rnorm(n, 0, 3)
y2s <- rbinom(n, 1, (y2 - min(y2))/(max(y2) - min(y2)))
# Simple multivariate model (with interaction and an extra term):
y3 <- 2 * x1 + 5 * x2 + 2 * x1 * x2 + x3 + rnorm(n, 0, 3)
y3s <- rbinom(n, 1, (y3 - min(y3))/(max(y3) - min(y3)))
We thus have three binary outcomes (y1s, y2s, and y3s), each constructed as a slightly different function of our three covariates.
We can then build models of each outcome. We'll build two versions of the models for y2s and y3s (one version a that does not model the interaction and another version b that does):
m1 <- glm(y1s ~ x1 + x2, family = binomial(link = "probit"))
m2a <- glm(y2s ~ x1 + x2, family = binomial(link = "probit"))
m2b <- glm(y2s ~ x1 * x2, family = binomial(link = "probit"))
m3a <- glm(y1s ~ x1 + x2 + x3, family = binomial(link = "probit"))
m3b <- glm(y1s ~ x1 * x2 + x3, family = binomial(link = "probit"))
We can look at the output of one of our models, e.g. m3b (the version estimated with the interaction and the additional covariate), but we know that the coefficients are not directly interpretable:
summary(m3b)
##
## Call:
## glm(formula = y1s ~ x1 * x2 + x3, family = binomial(link = "probit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.554 -1.141 0.873 1.011 1.365
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1591 0.2769 -0.57 0.57
## x1 0.4963 0.3496 1.42 0.16
## x2 0.3212 0.4468 0.72 0.47
## x3 -0.0315 0.0625 -0.50 0.61
## x1:x2 -0.0794 0.6429 -0.12 0.90
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 274.83 on 199 degrees of freedom
## Residual deviance: 266.59 on 195 degrees of freedom
## AIC: 276.6
##
## Number of Fisher Scoring iterations: 4
Instead we need to look at fitted values (specifically, the predicted probability of observing y==1) in each model. We can see these fitted values for our actual data using the predict function:
p3b.fitted <- predict(m3b, type = "response", se.fit = TRUE)
p3b.fitted
## $fit
## 1 2 3 4 5 6 7 8 9 10
## 0.4298 0.4530 0.6225 0.6029 0.4015 0.6364 0.6609 0.5970 0.6545 0.4695
## 11 12 13 14 15 16 17 18 19 20
## 0.4973 0.4273 0.6569 0.5082 0.6642 0.4463 0.6614 0.6982 0.5151 0.6141
## 21 22 23 24 25 26 27 28 29 30
## 0.6302 0.4258 0.6220 0.4929 0.5332 0.4765 0.4642 0.3856 0.6203 0.4908
## 31 32 33 34 35 36 37 38 39 40
## 0.4228 0.6397 0.4608 0.4837 0.6609 0.6472 0.6508 0.5043 0.6425 0.4370
## 41 42 43 44 45 46 47 48 49 50
## 0.6572 0.6667 0.6839 0.6086 0.6539 0.6261 0.4685 0.3956 0.6660 0.7103
## 51 52 53 54 55 56 57 58 59 60
## 0.4737 0.6890 0.4736 0.5032 0.4953 0.4100 0.4323 0.6188 0.6302 0.5521
## 61 62 63 64 65 66 67 68 69 70
## 0.6694 0.4475 0.4720 0.5097 0.6794 0.3998 0.4009 0.6893 0.4861 0.6063
## 71 72 73 74 75 76 77 78 79 80
## 0.4162 0.5845 0.4409 0.4336 0.4674 0.6353 0.6039 0.4327 0.6459 0.6608
## 81 82 83 84 85 86 87 88 89 90
## 0.3937 0.6701 0.4960 0.4253 0.6011 0.4232 0.6499 0.4563 0.3995 0.4565
## 91 92 93 94 95 96 97 98 99 100
## 0.4641 0.3956 0.7014 0.6519 0.6543 0.6663 0.4225 0.4199 0.5856 0.7099
## 101 102 103 104 105 106 107 108 109 110
## 0.6602 0.4064 0.4584 0.6347 0.6382 0.5027 0.4342 0.4874 0.5928 0.6369
## 111 112 113 114 115 116 117 118 119 120
## 0.5924 0.6219 0.5437 0.4831 0.4262 0.4882 0.6009 0.4480 0.5396 0.6509
## 121 122 123 124 125 126 127 128 129 130
## 0.6434 0.4324 0.4667 0.5559 0.6854 0.5004 0.6580 0.4891 0.4115 0.6425
## 131 132 133 134 135 136 137 138 139 140
## 0.6845 0.4723 0.4666 0.6879 0.6321 0.6621 0.6502 0.6263 0.6739 0.6822
## 141 142 143 144 145 146 147 148 149 150
## 0.6963 0.5908 0.4650 0.4652 0.6807 0.4638 0.4344 0.6252 0.4487 0.6544
## 151 152 153 154 155 156 157 158 159 160
## 0.6247 0.6036 0.5363 0.4336 0.6748 0.4489 0.7010 0.4441 0.4986 0.4451
## 161 162 163 164 165 166 167 168 169 170
## 0.4102 0.6543 0.5224 0.6137 0.6317 0.5000 0.4308 0.4861 0.6780 0.4361
## 171 172 173 174 175 176 177 178 179 180
## 0.6620 0.6581 0.6992 0.4602 0.4137 0.6391 0.6228 0.6851 0.5977 0.6332
## 181 182 183 184 185 186 187 188 189 190
## 0.4698 0.4494 0.6532 0.7151 0.6177 0.4415 0.6977 0.6755 0.5844 0.6767
## 191 192 193 194 195 196 197 198 199 200
## 0.6125 0.4399 0.5062 0.6395 0.4052 0.6253 0.4909 0.6311 0.4578 0.6187
##
## $se.fit
## 1 2 3 4 5 6 7 8 9
## 0.06090 0.07493 0.07213 0.07595 0.08505 0.05446 0.05024 0.08497 0.08736
## 10 11 12 13 14 15 16 17 18
## 0.08458 0.11678 0.07902 0.07268 0.10658 0.07666 0.05544 0.05338 0.08647
## 19 20 21 22 23 24 25 26 27
## 0.10526 0.06544 0.06385 0.06895 0.05878 0.07211 0.10434 0.05489 0.07976
## 28 29 30 31 32 33 34 35 36
## 0.09837 0.06283 0.09679 0.07274 0.09728 0.05479 0.06119 0.06930 0.06898
## 37 38 39 40 41 42 43 44 45
## 0.06506 0.08250 0.04903 0.06905 0.08040 0.05357 0.08155 0.07871 0.05722
## 46 47 48 49 50 51 52 53 54
## 0.07044 0.05457 0.08967 0.06453 0.09121 0.09065 0.08373 0.05496 0.07556
## 55 56 57 58 59 60 61 62 63
## 0.07928 0.07808 0.06067 0.07489 0.05318 0.12015 0.06053 0.05486 0.08607
## 64 65 66 67 68 69 70 71 72
## 0.08180 0.06173 0.08619 0.08555 0.07004 0.06089 0.07659 0.08207 0.09648
## 73 74 75 76 77 78 79 80 81
## 0.05605 0.06985 0.07491 0.07977 0.07469 0.06774 0.04755 0.07062 0.09192
## 82 83 84 85 86 87 88 89 90
## 0.06098 0.09897 0.07266 0.09254 0.07215 0.06856 0.09395 0.08548 0.07833
## 91 92 93 94 95 96 97 98 99
## 0.08700 0.08957 0.09002 0.06974 0.04870 0.05586 0.07756 0.07460 0.09843
## 100 101 102 103 104 105 106 107 108
## 0.09045 0.05616 0.08076 0.05338 0.05133 0.05233 0.11964 0.06895 0.09019
## 109 110 111 112 113 114 115 116 117
## 0.09251 0.05045 0.08785 0.07194 0.11253 0.07388 0.06328 0.06781 0.08778
## 118 119 120 121 122 123 124 125 126
## 0.05248 0.11138 0.05489 0.06888 0.05946 0.05557 0.12468 0.07418 0.11247
## 127 128 129 130 131 132 133 134 135
## 0.08133 0.08255 0.07904 0.09024 0.09377 0.08286 0.05628 0.07110 0.10181
## 136 137 138 139 140 141 142 143 144
## 0.07580 0.05096 0.08026 0.06026 0.09215 0.09064 0.08885 0.05537 0.05696
## 145 146 147 148 149 150 151 152 153
## 0.06283 0.07662 0.07562 0.07680 0.05194 0.06802 0.06453 0.09005 0.10586
## 154 155 156 157 158 159 160 161 162
## 0.06816 0.09976 0.05952 0.08116 0.09688 0.09872 0.06212 0.07796 0.07138
## 163 164 165 166 167 168 169 170 171
## 0.09198 0.06561 0.07026 0.08818 0.08680 0.06974 0.06071 0.07222 0.06292
## 172 173 174 175 176 177 178 179 180
## 0.04906 0.08438 0.08074 0.07343 0.04938 0.05798 0.07527 0.08185 0.05499
## 181 182 183 184 185 186 187 188 189
## 0.05549 0.07398 0.05759 0.09607 0.06677 0.06207 0.07768 0.07232 0.09640
## 190 191 192 193 194 195 196 197 198
## 0.09458 0.07770 0.07854 0.10283 0.05069 0.08024 0.05617 0.06491 0.05762
## 199 200
## 0.05557 0.07847
##
## $residual.scale
## [1] 1
We can even draw a small plot showing the predicted values separately for each level of x1 (recall that x1 is a binary/indicator variable):
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1")
points(x2[x1 == 0], p3b.fitted$fit[x1 == 0], col = rgb(1, 0, 0, 0.5))
points(x2[x1 == 1], p3b.fitted$fit[x1 == 1], col = rgb(0, 0, 1, 0.5))
But this graph doesn't show the fit of the model to all values of x1 and x2 (or x3) and doesn't communicate any of our uncertainty.
To get a better grasp on our models, we'll create some fake data representing the full scales of x1, x2, and x3:
newdata1 <- expand.grid(x1 = 0:1, x2 = seq(0, 1, length.out = 10))
newdata2 <- expand.grid(x1 = 0:1, x2 = seq(0, 1, length.out = 10), x3 = seq(0,
5, length.out = 25))
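As a quick sanity check, these grids contain every combination of the chosen covariate values:
nrow(newdata1)  # 20 rows: 2 values of x1 by 10 values of x2
nrow(newdata2)  # 500 rows: 2 x 10 x 25 values of x3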
We can then use these new fake data to generate predicted probabilities of each outcome at each combination of covariates:
p1 <- predict(m1, newdata1, type = "response", se.fit = TRUE)
p2a <- predict(m2a, newdata1, type = "response", se.fit = TRUE)
p2b <- predict(m2b, newdata1, type = "response", se.fit = TRUE)
p3a <- predict(m3a, newdata2, type = "response", se.fit = TRUE)
p3b <- predict(m3b, newdata2, type = "response", se.fit = TRUE)
We can look at one of these objects, e.g. p3b, to see that we have predicted probabilities and associated standard errors:
p3b
## $fit
## 1 2 3 4 5 6 7 8 9 10
## 0.4368 0.6320 0.4509 0.6421 0.4650 0.6521 0.4792 0.6619 0.4935 0.6717
## 11 12 13 14 15 16 17 18 19 20
## 0.5077 0.6814 0.5219 0.6909 0.5361 0.7003 0.5503 0.7096 0.5644 0.7187
## 21 22 23 24 25 26 27 28 29 30
## 0.4342 0.6295 0.4483 0.6396 0.4624 0.6496 0.4766 0.6595 0.4909 0.6693
## 31 32 33 34 35 36 37 38 39 40
## 0.5051 0.6790 0.5193 0.6886 0.5335 0.6980 0.5477 0.7073 0.5618 0.7165
## 41 42 43 44 45 46 47 48 49 50
## 0.4316 0.6271 0.4457 0.6372 0.4598 0.6472 0.4740 0.6571 0.4882 0.6670
## 51 52 53 54 55 56 57 58 59 60
## 0.5025 0.6767 0.5167 0.6863 0.5309 0.6957 0.5451 0.7051 0.5592 0.7143
## 61 62 63 64 65 66 67 68 69 70
## 0.4291 0.6246 0.4431 0.6347 0.4572 0.6448 0.4714 0.6547 0.4856 0.6646
## 71 72 73 74 75 76 77 78 79 80
## 0.4999 0.6743 0.5141 0.6839 0.5283 0.6934 0.5425 0.7028 0.5566 0.7120
## 81 82 83 84 85 86 87 88 89 90
## 0.4265 0.6221 0.4405 0.6323 0.4546 0.6423 0.4688 0.6523 0.4830 0.6622
## 91 92 93 94 95 96 97 98 99 100
## 0.4972 0.6719 0.5115 0.6816 0.5257 0.6911 0.5399 0.7005 0.5540 0.7098
## 101 102 103 104 105 106 107 108 109 110
## 0.4239 0.6196 0.4379 0.6298 0.4520 0.6399 0.4662 0.6499 0.4804 0.6598
## 111 112 113 114 115 116 117 118 119 120
## 0.4946 0.6696 0.5089 0.6792 0.5231 0.6888 0.5373 0.6982 0.5514 0.7075
## 121 122 123 124 125 126 127 128 129 130
## 0.4213 0.6171 0.4354 0.6273 0.4494 0.6374 0.4636 0.6475 0.4778 0.6574
## 131 132 133 134 135 136 137 138 139 140
## 0.4920 0.6672 0.5062 0.6769 0.5205 0.6865 0.5347 0.6959 0.5488 0.7053
## 141 142 143 144 145 146 147 148 149 150
## 0.4188 0.6146 0.4328 0.6248 0.4468 0.6350 0.4610 0.6450 0.4752 0.6550
## 151 152 153 154 155 156 157 158 159 160
## 0.4894 0.6648 0.5036 0.6745 0.5179 0.6842 0.5321 0.6936 0.5462 0.7030
## 161 162 163 164 165 166 167 168 169 170
## 0.4162 0.6121 0.4302 0.6223 0.4443 0.6325 0.4584 0.6426 0.4726 0.6525
## 171 172 173 174 175 176 177 178 179 180
## 0.4868 0.6624 0.5010 0.6722 0.5153 0.6818 0.5295 0.6913 0.5436 0.7007
## 181 182 183 184 185 186 187 188 189 190
## 0.4137 0.6096 0.4276 0.6198 0.4417 0.6300 0.4558 0.6401 0.4700 0.6501
## 191 192 193 194 195 196 197 198 199 200
## 0.4842 0.6600 0.4984 0.6698 0.5126 0.6795 0.5269 0.6890 0.5410 0.6985
## 201 202 203 204 205 206 207 208 209 210
## 0.4111 0.6071 0.4251 0.6173 0.4391 0.6275 0.4532 0.6377 0.4674 0.6477
## 211 212 213 214 215 216 217 218 219 220
## 0.4816 0.6576 0.4958 0.6674 0.5100 0.6771 0.5242 0.6867 0.5384 0.6962
## 221 222 223 224 225 226 227 228 229 230
## 0.4086 0.6045 0.4225 0.6148 0.4365 0.6251 0.4506 0.6352 0.4647 0.6453
## 231 232 233 234 235 236 237 238 239 240
## 0.4789 0.6552 0.4932 0.6650 0.5074 0.6748 0.5216 0.6844 0.5358 0.6939
## 241 242 243 244 245 246 247 248 249 250
## 0.4060 0.6020 0.4199 0.6123 0.4339 0.6226 0.4480 0.6327 0.4621 0.6428
## 251 252 253 254 255 256 257 258 259 260
## 0.4763 0.6528 0.4906 0.6627 0.5048 0.6724 0.5190 0.6821 0.5332 0.6916
## 261 262 263 264 265 266 267 268 269 270
## 0.4035 0.5995 0.4174 0.6098 0.4313 0.6201 0.4454 0.6303 0.4595 0.6404
## 271 272 273 274 275 276 277 278 279 280
## 0.4737 0.6504 0.4879 0.6603 0.5022 0.6700 0.5164 0.6797 0.5306 0.6893
## 281 282 283 284 285 286 287 288 289 290
## 0.4010 0.5969 0.4148 0.6073 0.4288 0.6176 0.4428 0.6278 0.4569 0.6379
## 291 292 293 294 295 296 297 298 299 300
## 0.4711 0.6479 0.4853 0.6579 0.4996 0.6677 0.5138 0.6774 0.5280 0.6869
## 301 302 303 304 305 306 307 308 309 310
## 0.3984 0.5944 0.4123 0.6048 0.4262 0.6151 0.4402 0.6253 0.4543 0.6354
## 311 312 313 314 315 316 317 318 319 320
## 0.4685 0.6455 0.4827 0.6554 0.4970 0.6653 0.5112 0.6750 0.5254 0.6846
## 321 322 323 324 325 326 327 328 329 330
## 0.3959 0.5919 0.4097 0.6023 0.4236 0.6126 0.4376 0.6228 0.4517 0.6330
## 331 332 333 334 335 336 337 338 339 340
## 0.4659 0.6431 0.4801 0.6530 0.4943 0.6629 0.5086 0.6726 0.5228 0.6823
## 341 342 343 344 345 346 347 348 349 350
## 0.3934 0.5893 0.4072 0.5997 0.4211 0.6101 0.4351 0.6203 0.4492 0.6305
## 351 352 353 354 355 356 357 358 359 360
## 0.4633 0.6406 0.4775 0.6506 0.4917 0.6605 0.5060 0.6703 0.5202 0.6799
## 361 362 363 364 365 366 367 368 369 370
## 0.3909 0.5868 0.4046 0.5972 0.4185 0.6075 0.4325 0.6178 0.4466 0.6280
## 371 372 373 374 375 376 377 378 379 380
## 0.4607 0.6382 0.4749 0.6482 0.4891 0.6581 0.5033 0.6679 0.5176 0.6776
## 381 382 383 384 385 386 387 388 389 390
## 0.3883 0.5842 0.4021 0.5946 0.4159 0.6050 0.4299 0.6153 0.4440 0.6256
## 391 392 393 394 395 396 397 398 399 400
## 0.4581 0.6357 0.4723 0.6457 0.4865 0.6557 0.5007 0.6655 0.5150 0.6752
## 401 402 403 404 405 406 407 408 409 410
## 0.3858 0.5816 0.3995 0.5921 0.4134 0.6025 0.4273 0.6128 0.4414 0.6231
## 411 412 413 414 415 416 417 418 419 420
## 0.4555 0.6332 0.4697 0.6433 0.4839 0.6533 0.4981 0.6631 0.5123 0.6729
## 421 422 423 424 425 426 427 428 429 430
## 0.3833 0.5791 0.3970 0.5896 0.4108 0.6000 0.4248 0.6103 0.4388 0.6206
## 431 432 433 434 435 436 437 438 439 440
## 0.4529 0.6308 0.4671 0.6408 0.4813 0.6508 0.4955 0.6607 0.5097 0.6705
## 441 442 443 444 445 446 447 448 449 450
## 0.3808 0.5765 0.3945 0.5870 0.4083 0.5974 0.4222 0.6078 0.4362 0.6181
## 451 452 453 454 455 456 457 458 459 460
## 0.4503 0.6283 0.4645 0.6384 0.4787 0.6484 0.4929 0.6583 0.5071 0.6681
## 461 462 463 464 465 466 467 468 469 470
## 0.3783 0.5739 0.3920 0.5845 0.4057 0.5949 0.4196 0.6053 0.4336 0.6156
## 471 472 473 474 475 476 477 478 479 480
## 0.4477 0.6258 0.4619 0.6359 0.4760 0.6460 0.4903 0.6559 0.5045 0.6657
## 481 482 483 484 485 486 487 488 489 490
## 0.3758 0.5714 0.3895 0.5819 0.4032 0.5924 0.4171 0.6027 0.4311 0.6131
## 491 492 493 494 495 496 497 498 499 500
## 0.4451 0.6233 0.4593 0.6335 0.4734 0.6435 0.4877 0.6535 0.5019 0.6634
##
## $se.fit
## 1 2 3 4 5 6 7 8 9
## 0.10906 0.11971 0.09785 0.10469 0.08926 0.09169 0.08427 0.08147 0.08364
## 10 11 12 13 14 15 16 17 18
## 0.07488 0.08750 0.07263 0.09527 0.07477 0.10602 0.08065 0.11883 0.08923
## 19 20 21 22 23 24 25 26 27
## 0.13296 0.09953 0.10626 0.11718 0.09465 0.10190 0.08564 0.08862 0.08033
## 28 29 30 31 32 33 34 35 36
## 0.07816 0.07956 0.07148 0.08352 0.06934 0.09157 0.07183 0.10267 0.07818
## 37 38 39 40 41 42 43 44 45
## 0.11584 0.08725 0.13030 0.09800 0.10365 0.11478 0.09163 0.09923 0.08220
## 46 47 48 49 50 51 52 53 54
## 0.08568 0.07653 0.07498 0.07562 0.06818 0.07968 0.06617 0.08801 0.06903
## 55 56 57 58 59 60 61 62 63
## 0.09947 0.07586 0.11299 0.08542 0.12779 0.09661 0.10123 0.11251 0.08881
## 64 65 66 67 68 69 70 71 72
## 0.09671 0.07895 0.08288 0.07291 0.07193 0.07184 0.06502 0.07599 0.06315
## 73 74 75 76 77 78 79 80 81
## 0.08461 0.06638 0.09642 0.07372 0.11030 0.08376 0.12542 0.09538 0.09903
## 82 83 84 85 86 87 88 89 90
## 0.11040 0.08621 0.09435 0.07591 0.08025 0.06949 0.06905 0.06824 0.06203
## 91 92 93 94 95 96 97 98 99
## 0.07249 0.06030 0.08139 0.06393 0.09355 0.07176 0.10777 0.08229 0.12320
## 100 101 102 103 104 105 106 107 108
## 0.09432 0.09705 0.10845 0.08386 0.09217 0.07312 0.07780 0.06631 0.06636
## 109 110 111 112 113 114 115 116 117
## 0.06485 0.05922 0.06920 0.05766 0.07838 0.06170 0.09088 0.07003 0.10543
## 118 119 120 121 122 123 124 125 126
## 0.08102 0.12115 0.09344 0.09530 0.10667 0.08176 0.09017 0.07060 0.07556
## 127 128 129 130 131 132 133 134 135
## 0.06339 0.06388 0.06173 0.05665 0.06614 0.05525 0.07560 0.05971 0.08843
## 136 137 138 139 140 141 142 143 144
## 0.06853 0.10328 0.07997 0.11927 0.09276 0.09381 0.10508 0.07994 0.08839
## 145 146 147 148 149 150 151 152 153
## 0.06838 0.07355 0.06077 0.06166 0.05889 0.05435 0.06337 0.05313 0.07307
## 154 155 156 157 158 159 160 161 162
## 0.05801 0.08621 0.06730 0.10134 0.07914 0.11758 0.09227 0.09257 0.10369
## 163 164 165 166 167 168 169 170 171
## 0.07842 0.08683 0.06650 0.07179 0.05850 0.05973 0.05639 0.05236 0.06091
## 172 173 174 175 176 177 178 179 180
## 0.05134 0.07084 0.05662 0.08424 0.06635 0.09963 0.07856 0.11608 0.09198
## 181 182 183 184 185 186 187 188 189
## 0.09160 0.10251 0.07722 0.08551 0.06497 0.07032 0.05662 0.05811 0.05427
## 190 191 192 193 194 195 196 197 198
## 0.05072 0.05881 0.04992 0.06892 0.05558 0.08255 0.06570 0.09815 0.07823
## 199 200 201 202 203 204 205 206 207
## 0.11479 0.09191 0.09091 0.10154 0.07633 0.08445 0.06382 0.06915 0.05515
## 208 209 210 211 212 213 214 215 216
## 0.05686 0.05259 0.04949 0.05711 0.04891 0.06735 0.05491 0.08115 0.06536
## 217 218 219 220 221 222 223 224 225
## 0.09691 0.07817 0.11370 0.09206 0.09049 0.10081 0.07578 0.08366 0.06306
## 226 227 228 229 230 231 232 233 234
## 0.06831 0.05415 0.05599 0.05137 0.04869 0.05583 0.04833 0.06615 0.05464
## 235 236 237 238 239 240 241 242 243
## 0.08006 0.06536 0.09594 0.07837 0.11284 0.09243 0.09035 0.10031 0.07557
## 244 245 246 247 248 249 250 251 252
## 0.08315 0.06272 0.06780 0.05362 0.05553 0.05065 0.04836 0.05502 0.04823
## 253 254 255 256 257 258 259 260 261
## 0.06533 0.05477 0.07929 0.06568 0.09524 0.07885 0.11220 0.09303 0.09049
## 262 263 264 265 266 267 268 269 270
## 0.10005 0.07570 0.08293 0.06280 0.06765 0.05358 0.05549 0.05045 0.04852
## 271 272 273 274 275 276 277 278 279
## 0.05468 0.04860 0.06492 0.05532 0.07886 0.06634 0.09480 0.07959 0.11180
## 280 281 282 283 284 285 286 287 288
## 0.09384 0.09091 0.10004 0.07617 0.08301 0.06328 0.06785 0.05403 0.05589
## 289 290 291 292 293 294 295 296 297
## 0.05078 0.04916 0.05483 0.04944 0.06492 0.05627 0.07876 0.06733 0.09465
## 298 299 300 301 302 303 304 305 306
## 0.08061 0.11162 0.09488 0.09159 0.10028 0.07695 0.08338 0.06417 0.06842
## 307 308 309 310 311 312 313 314 315
## 0.05496 0.05671 0.05163 0.05027 0.05547 0.05074 0.06533 0.05761 0.07900
## 316 317 318 319 320 321 322 323 324
## 0.06864 0.09478 0.08188 0.11168 0.09614 0.09252 0.10077 0.07806 0.08406
## 325 326 327 328 329 330 331 332 333
## 0.06543 0.06934 0.05633 0.05796 0.05295 0.05183 0.05657 0.05247 0.06615
## 334 335 336 337 338 339 340 341 342
## 0.05932 0.07957 0.07026 0.09519 0.08342 0.11198 0.09762 0.09371 0.10151
## 343 344 345 346 347 348 349 350 351
## 0.07945 0.08502 0.06705 0.07061 0.05812 0.05959 0.05473 0.05380 0.05811
## 352 353 354 355 356 357 358 359 360
## 0.05459 0.06735 0.06138 0.08048 0.07218 0.09587 0.08519 0.11251 0.09930
## 361 362 363 364 365 366 367 368 369
## 0.09513 0.10250 0.08113 0.08628 0.06900 0.07221 0.06028 0.06160 0.05691
## 370 371 372 373 374 375 376 377 378
## 0.05616 0.06004 0.05707 0.06891 0.06375 0.08170 0.07436 0.09682 0.08721
## 379 380 381 382 383 384 385 386 387
## 0.11328 0.10118 0.09678 0.10372 0.08306 0.08781 0.07124 0.07412 0.06277
## 388 389 390 391 392 393 394 395 396
## 0.06394 0.05945 0.05885 0.06234 0.05987 0.07082 0.06642 0.08322 0.07681
## 397 398 399 400 401 402 403 404 405
## 0.09804 0.08945 0.11426 0.10326 0.09863 0.10518 0.08524 0.08960 0.07374
## 406 407 408 409 410 411 412 413 414
## 0.07633 0.06555 0.06659 0.06230 0.06184 0.06496 0.06294 0.07304 0.06934
## 415 416 417 418 419 420 421 422 423
## 0.08503 0.07949 0.09951 0.09189 0.11548 0.10552 0.10067 0.10687 0.08762
## 424 425 426 427 428 429 430 431 432
## 0.09164 0.07649 0.07880 0.06858 0.06951 0.06541 0.06508 0.06787 0.06626
## 433 434 435 436 437 438 439 440 441
## 0.07554 0.07249 0.08710 0.08238 0.10122 0.09454 0.11690 0.10797 0.10289
## 442 443 444 445 446 447 448 449 450
## 0.10877 0.09021 0.09393 0.07944 0.08152 0.07183 0.07267 0.06875 0.06856
## 451 452 453 454 455 456 457 458 459
## 0.07101 0.06979 0.07829 0.07585 0.08942 0.08548 0.10316 0.09737 0.11853
## 460 461 462 463 464 465 466 467 468
## 0.11058 0.10528 0.11087 0.09297 0.09643 0.08258 0.08447 0.07527 0.07605
## 469 470 471 472 473 474 475 476 477
## 0.07228 0.07223 0.07437 0.07351 0.08127 0.07940 0.09197 0.08875 0.10531
## 478 479 480 481 482 483 484 485 486
## 0.10038 0.12036 0.11335 0.10782 0.11318 0.09588 0.09914 0.08587 0.08763
## 487 488 489 490 491 492 493 494 495
## 0.07886 0.07963 0.07598 0.07608 0.07791 0.07739 0.08445 0.08311 0.09473
## 496 497 498 499 500
## 0.09219 0.10767 0.10354 0.12238 0.11627
##
## $residual.scale
## [1] 1
It is then relatively straightforward to plot the predicted probabilities for all of our data. We'll start with the simple models, then look at the models with interactions and the additional covariate x3.
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1")
# `x1==0`
lines(newdata1$x2[newdata1$x1 == 0], p1$fit[newdata1$x1 == 0], col = "red")
lines(newdata1$x2[newdata1$x1 == 0], p1$fit[newdata1$x1 == 0] + 1.96 * p1$se.fit[newdata1$x1 ==
0], col = "red", lty = 2)
lines(newdata1$x2[newdata1$x1 == 0], p1$fit[newdata1$x1 == 0] - 1.96 * p1$se.fit[newdata1$x1 ==
0], col = "red", lty = 2)
# `x1==1`
lines(newdata1$x2[newdata1$x1 == 1], p1$fit[newdata1$x1 == 1], col = "blue")
lines(newdata1$x2[newdata1$x1 == 1], p1$fit[newdata1$x1 == 1] + 1.96 * p1$se.fit[newdata1$x1 ==
1], col = "blue", lty = 2)
lines(newdata1$x2[newdata1$x1 == 1], p1$fit[newdata1$x1 == 1] - 1.96 * p1$se.fit[newdata1$x1 ==
1], col = "blue", lty = 2)
The above plot shows two predicted probability curves with heavily overlapping confidence bands. While the effect of x2 is clearly different from zero for both x1==0 and x1==1, the difference between the two curves is not significant.
But this model is based on data with no underlying interaction. Let's look next at the outcome that is a function of an interaction between covariates.
Recall that the interaction model (with outcome y2s) was estimated in two different ways: the first estimated model did not account for the interaction, while the second did. Let's see the two models side-by-side to compare the inference we would draw about the interaction:
# Model estimated without interaction
layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1",
main = "Estimated without interaction")
# `x1==0`
lines(newdata1$x2[newdata1$x1 == 0], p2a$fit[newdata1$x1 == 0], col = "red")
lines(newdata1$x2[newdata1$x1 == 0], p2a$fit[newdata1$x1 == 0] + 1.96 * p2a$se.fit[newdata1$x1 ==
0], col = "red", lty = 2)
lines(newdata1$x2[newdata1$x1 == 0], p2a$fit[newdata1$x1 == 0] - 1.96 * p2a$se.fit[newdata1$x1 ==
0], col = "red", lty = 2)
# `x1==1`
lines(newdata1$x2[newdata1$x1 == 1], p2a$fit[newdata1$x1 == 1], col = "blue")
lines(newdata1$x2[newdata1$x1 == 1], p2a$fit[newdata1$x1 == 1] + 1.96 * p2a$se.fit[newdata1$x1 ==
1], col = "blue", lty = 2)
lines(newdata1$x2[newdata1$x1 == 1], p2a$fit[newdata1$x1 == 1] - 1.96 * p2a$se.fit[newdata1$x1 ==
1], col = "blue", lty = 2)
# Model estimated with interaction
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1",
main = "Estimated with interaction")
# `x1==0`
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0], col = "red")
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0] + 1.96 * p2b$se.fit[newdata1$x1 ==
0], col = "red", lty = 2)
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0] - 1.96 * p2b$se.fit[newdata1$x1 ==
0], col = "red", lty = 2)
# `x1==1`
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1], col = "blue")
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1] + 1.96 * p2b$se.fit[newdata1$x1 ==
1], col = "blue", lty = 2)
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1] - 1.96 * p2b$se.fit[newdata1$x1 ==
1], col = "blue", lty = 2)
The lefthand model leads us to some incorrect inference. Both predicted probability curves are essentially identical, suggesting that the influence of x2 is constant at both levels of x1. This is because our model did not account for any interaction.
The righthand model leads us to substantially different inference. When x1==0 (shown in red), there appears to be almost no effect of x2, but when x1==1, the effect of x2 is strongly positive.
When we add an additional covariate to the model, things become much more complicated. Recall that the predicted probabilities have to be calculated at some value of every covariate. In other words, we have to define the predicted probability in terms of all of the covariates in the model. Thus, when we add an additional covariate (even if it does not interact with our focal covariates x1 and x2), we need to account for it when estimating our predicted probabilities. We'll see this at work when we plot the predicted probabilities for our (incorrect) model estimated without the x1*x2 interaction and for our (correct) model estimated with that interaction.
No-Interaction model with an additional covariate
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1")
s <- sapply(unique(newdata2$x3), function(i) {
# `x1==0`
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3a$fit[newdata2$x1 ==
0 & newdata2$x3 == i], col = rgb(1, 0, 0, 0.5))
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3a$fit[newdata2$x1 ==
0 & newdata2$x3 == i] + 1.96 * p3a$se.fit[newdata2$x1 == 0 & newdata2$x3 ==
i], col = rgb(1, 0, 0, 0.5), lty = 2)
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3a$fit[newdata2$x1 ==
0 & newdata2$x3 == i] - 1.96 * p3a$se.fit[newdata2$x1 == 0 & newdata2$x3 ==
i], col = rgb(1, 0, 0, 0.5), lty = 2)
# `x1==1`
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3a$fit[newdata2$x1 ==
1 & newdata2$x3 == i], col = rgb(0, 0, 1, 0.5))
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3a$fit[newdata2$x1 ==
1 & newdata2$x3 == i] + 1.96 * p3a$se.fit[newdata2$x1 == 1 & newdata2$x3 ==
i], col = rgb(0, 0, 1, 0.5), lty = 2)
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3a$fit[newdata2$x1 ==
1 & newdata2$x3 == i] - 1.96 * p3a$se.fit[newdata2$x1 == 1 & newdata2$x3 ==
i], col = rgb(0, 0, 1, 0.5), lty = 2)
})
Note how the above code is much more complicated than previously, because we now need to draw a separate predicted probability curve (with associated confidence interval) at each level of x3, even though we're not particularly interested in x3. The result is a very confusing plot: the predicted probability curves at each level of x3 are essentially the same, but the confidence intervals vary widely because of differing levels of certainty due to the sparsity of the original data.
One common response is to simply draw the curves conditional on all other covariates (in this case x3) being at their means, but this is an arbitrary choice. We could also select the minimum or maximum, or any other value. Let's write a small function to redraw our curves at different values of x3 to see the impact of this choice:
ppcurve <- function(value_of_x3, title) {
tmp <- expand.grid(x1 = 0:1, x2 = seq(0, 1, length.out = 10), x3 = value_of_x3)
p3tmp <- predict(m3a, tmp, type = "response", se.fit = TRUE)
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1",
main = title)
# `x1==0`
lines(tmp$x2[tmp$x1 == 0], p3tmp$fit[tmp$x1 == 0], col = "red")
lines(tmp$x2[tmp$x1 == 0], p3tmp$fit[tmp$x1 == 0] + 1.96 * p3tmp$se.fit[tmp$x1 ==
0], col = "red", lty = 2)
lines(tmp$x2[tmp$x1 == 0], p3tmp$fit[tmp$x1 == 0] - 1.96 * p3tmp$se.fit[tmp$x1 ==
0], col = "red", lty = 2)
# `x1==1`
lines(tmp$x2[tmp$x1 == 1], p3tmp$fit[tmp$x1 == 1], col = "blue")
lines(tmp$x2[tmp$x1 == 1], p3tmp$fit[tmp$x1 == 1] + 1.96 * p3tmp$se.fit[tmp$x1 ==
1], col = "blue", lty = 2)
lines(tmp$x2[tmp$x1 == 1], p3tmp$fit[tmp$x1 == 1] - 1.96 * p3tmp$se.fit[tmp$x1 ==
1], col = "blue", lty = 2)
}
We can then draw a plot that shows the curves for the mean of x3, the minimum of x3, and the maximum of x3.
layout(matrix(1:3, nrow = 1))
ppcurve(mean(x3), title = "x3 at mean")
ppcurve(min(x3), title = "x3 at min")
ppcurve(max(x3), title = "x3 at max")
The above set of plots shows that while the inference about the predicted probability curves is the same, the choice of what value of x3 to condition on matters for the confidence intervals. The confidence intervals are much narrower when we condition on the mean value of x3 than on the minimum or maximum.
Recall that this model did not properly account for the x1*x2 interaction. Thus, while our inference is somewhat sensitive to the choice of conditioning value for the x3 covariate, it is unclear whether this minimal sensitivity holds when we properly account for the interaction. Let's take a look at our m3b model, which accounts for the interaction.
Let's start by drawing a plot showing the predicted values of the outcome for every combination of x1, x2, and x3:
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1")
s <- sapply(unique(newdata2$x3), function(i) {
# `x1==0`
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
0 & newdata2$x3 == i], col = rgb(1, 0, 0, 0.5))
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
0 & newdata2$x3 == i] + 1.96 * p3b$se.fit[newdata2$x1 == 0 & newdata2$x3 ==
i], col = rgb(1, 0, 0, 0.5), lty = 2)
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
0 & newdata2$x3 == i] - 1.96 * p3b$se.fit[newdata2$x1 == 0 & newdata2$x3 ==
i], col = rgb(1, 0, 0, 0.5), lty = 2)
# `x1==1`
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
1 & newdata2$x3 == i], col = rgb(0, 0, 1, 0.5))
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
1 & newdata2$x3 == i] + 1.96 * p3b$se.fit[newdata2$x1 == 1 & newdata2$x3 ==
i], col = rgb(0, 0, 1, 0.5), lty = 2)
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
1 & newdata2$x3 == i] - 1.96 * p3b$se.fit[newdata2$x1 == 1 & newdata2$x3 ==
i], col = rgb(0, 0, 1, 0.5), lty = 2)
})
This plot is incredibly messy. Now, not only are the confidence bands sensitive to what value of x3 we condition on, so too are the predicted probability curves themselves. It is therefore a fairly important decision what level of the additional covariates to condition on when estimating the predicted probabilities.
A different approach when dealing with interactions is to show marginal effects. Marginal effects, I think, are a bit abstract (i.e., a bit removed from the actual data because they attempt to summarize a lot of information in a single number). The marginal effect is the slope of the curve drawn by taking the difference between, e.g., the predicted probability that y==1 when x1==1 and the predicted probability that y==1 when x1==0, at each level of x2. Thus, the marginal effect is simply the slope of the difference between the two curves that we were drawing in the above graphs (i.e., the slope of the change in predicted probabilities). Of course, as we just saw, if any additional covariate(s) are involved in the data-generating process, then the marginal effect - like the predicted probabilities - is going to differ across levels of that covariate.
Let's see how this works by first returning to our simple interaction model (without x3) and then looking at the interaction model with the additional covariate.
To plot the change in predicted probabilities due to x1 across the values of x2, we simply need to take our predicted probabilities from above and difference the values predicted for x1==0 and x1==1. The predicted probabilities for our simple interaction model are stored in p2b, based on new data from newdata1. Let's separate out the values predicted for x1==0 and x1==1 and then take their difference. We'll create a new dataframe that binds newdata1 and the predicted probability and standard error values from p2b together, and then use the split function to divide that dataframe based upon the value of x1.
tmpdf <- newdata1
tmpdf$fit <- p2b$fit
tmpdf$se.fit <- p2b$se.fit
tmpsplit <- split(tmpdf, tmpdf$x1)
The result is a list of two dataframes, each containing values of x1, x2, and the associated predicted probabilities:
tmpsplit
## $`0`
## x1 x2 fit se.fit
## 1 0 0.0000 0.5014 0.09235
## 3 0 0.1111 0.5011 0.07665
## 5 0 0.2222 0.5007 0.06320
## 7 0 0.3333 0.5003 0.05373
## 9 0 0.4444 0.5000 0.05053
## 11 0 0.5556 0.4996 0.05470
## 13 0 0.6667 0.4992 0.06484
## 15 0 0.7778 0.4989 0.07867
## 17 0 0.8889 0.4985 0.09459
## 19 0 1.0000 0.4982 0.11171
##
## $`1`
## x1 x2 fit se.fit
## 2 1 0.0000 0.3494 0.09498
## 4 1 0.1111 0.3839 0.08187
## 6 1 0.2222 0.4194 0.06887
## 8 1 0.3333 0.4556 0.05769
## 10 1 0.4444 0.4921 0.05089
## 12 1 0.5556 0.5287 0.05093
## 14 1 0.6667 0.5650 0.05770
## 16 1 0.7778 0.6009 0.06862
## 18 1 0.8889 0.6358 0.08115
## 20 1 1.0000 0.6697 0.09362
To calculate the change in predicted probability of y==1 due to x1==1 at each value of x2, we simply difference the fit variable from each dataframe:
me <- tmpsplit[[2]]$fit - tmpsplit[[1]]$fit
me
## [1] -0.152032 -0.117131 -0.081283 -0.044766 -0.007877 0.029079 0.065793
## [8] 0.101966 0.137309 0.171555
We also want the standard error of that difference. Treating the two sets of predictions as approximately independent, the standard error of a difference is the square root of the sum of the squared standard errors:
me_se <- sqrt(tmpsplit[[2]]$se.fit^2 + tmpsplit[[1]]$se.fit^2)
Now let's plot the original predicted probability plot on the left and the change in predicted probability plot on the right:
layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1",
main = "Predicted Probabilities")
# `x1==0`
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0], col = "red")
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0] + 1.96 * p2b$se.fit[newdata1$x1 ==
0], col = "red", lty = 2)
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0] - 1.96 * p2b$se.fit[newdata1$x1 ==
0], col = "red", lty = 2)
# `x1==1`
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1], col = "blue")
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1] + 1.96 * p2b$se.fit[newdata1$x1 ==
1], col = "blue", lty = 2)
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1] - 1.96 * p2b$se.fit[newdata1$x1 ==
1], col = "blue", lty = 2)
# plot of change in predicted probabilities:
plot(NA, type = "l", xlim = c(0, 1), ylim = c(-1, 1), xlab = "x2", ylab = "Change in Predicted Probability of y=1",
main = "Change in Predicted Probability due to x1")
abline(h = 0, col = "gray") # gray line at zero
lines(tmpsplit[[1]]$x2, me, lwd = 2) # change in predicted probabilities
lines(tmpsplit[[1]]$x2, me - 1.96 * me_se, lty = 2)
lines(tmpsplit[[1]]$x2, me + 1.96 * me_se, lty = 2)
As should be clear, the plot on the right is simply a further information reduction of the lefthand plot. Where the separate predicted probabilities show the predicted probability of the outcome at each combination of x1 and x2, the righthand plot simply shows the difference between these two curves. The marginal effect of x2 is thus a further information reduction: it is the slope of the line showing the difference in predicted probabilities.
Because our x2 variable is scaled [0,1], we can see the marginal effect simply by subtracting the value of the change in predicted probabilities when x2==0 from the value of the change in predicted probabilities when x2==1, which is simply:
me[length(me)] - me[1]
## [1] 0.3236
Thus the marginal effect of x1 on the outcome is the slope of the line representing the change in predicted probabilities between x1==1 and x1==0 across the range of x2. I don't find that a particularly intuitive measure of effect and would instead prefer to draw some kind of plot rather than reduce that plot to a single number.
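Equivalently, because the change in predicted probability is itself a difference, this number is a difference-in-differences of the four "corner" predictions (a quick sketch using the split dataframes from above; rows 1 and 10 correspond to x2==0 and x2==1):
(tmpsplit[["1"]]$fit[10] - tmpsplit[["0"]]$fit[10]) - (tmpsplit[["1"]]$fit[1] - tmpsplit[["0"]]$fit[1])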
Things get more complicated, as we might expect, when we have to account for the additional covariate x3, which influenced our predicted probabilities above. Our predicted probabilities for these data are stored in p3b (based on input data in newdata2). We'll follow the same procedure just used to add those predicted probabilities into a dataframe with the variables from newdata2, then we'll split it based on x1:
tmpdf <- newdata2
tmpdf$fit <- p3b$fit
tmpdf$se.fit <- p3b$se.fit
tmpsplit <- split(tmpdf, tmpdf$x1)
The result is a list of two large dataframes:
str(tmpsplit)
## List of 2
## $ 0:'data.frame': 250 obs. of 5 variables:
## ..$ x1 : int [1:250] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ x2 : num [1:250] 0 0.111 0.222 0.333 0.444 ...
## ..$ x3 : num [1:250] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ fit : num [1:250] 0.437 0.451 0.465 0.479 0.493 ...
## ..$ se.fit: num [1:250] 0.1091 0.0979 0.0893 0.0843 0.0836 ...
## ..- attr(*, "out.attrs")=List of 2
## .. ..$ dim : Named int [1:3] 2 10 25
## .. .. ..- attr(*, "names")= chr [1:3] "x1" "x2" "x3"
## .. ..$ dimnames:List of 3
## .. .. ..$ x1: chr [1:2] "x1=0" "x1=1"
## .. .. ..$ x2: chr [1:10] "x2=0.0000" "x2=0.1111" "x2=0.2222" "x2=0.3333" ...
## .. .. ..$ x3: chr [1:25] "x3=0.0000" "x3=0.2083" "x3=0.4167" "x3=0.6250" ...
## $ 1:'data.frame': 250 obs. of 5 variables:
## ..$ x1 : int [1:250] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ x2 : num [1:250] 0 0.111 0.222 0.333 0.444 ...
## ..$ x3 : num [1:250] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ fit : num [1:250] 0.632 0.642 0.652 0.662 0.672 ...
## ..$ se.fit: num [1:250] 0.1197 0.1047 0.0917 0.0815 0.0749 ...
## ..- attr(*, "out.attrs")=List of 2
## .. ..$ dim : Named int [1:3] 2 10 25
## .. .. ..- attr(*, "names")= chr [1:3] "x1" "x2" "x3"
## .. ..$ dimnames:List of 3
## .. .. ..$ x1: chr [1:2] "x1=0" "x1=1"
## .. .. ..$ x2: chr [1:10] "x2=0.0000" "x2=0.1111" "x2=0.2222" "x2=0.3333" ...
## .. .. ..$ x3: chr [1:25] "x3=0.0000" "x3=0.2083" "x3=0.4167" "x3=0.6250" ...
Now we need to calculate the change in predicted probability within each of those dataframes, at each value of x3. That is tedious, so let's instead split by both x1 and x3:
tmpsplit <- split(tmpdf, list(tmpdf$x3, tmpdf$x1))
The result is a list of 50 dataframes, the first 25 of which contain data for x1==0 and the latter 25 of which contain data for x1==1:
length(tmpsplit)
## [1] 50
names(tmpsplit)
## [1] "0.0" "0.208333333333333.0" "0.416666666666667.0"
## [4] "0.625.0" "0.833333333333333.0" "1.04166666666667.0"
## [7] "1.25.0" "1.45833333333333.0" "1.66666666666667.0"
## [10] "1.875.0" "2.08333333333333.0" "2.29166666666667.0"
## [13] "2.5.0" "2.70833333333333.0" "2.91666666666667.0"
## [16] "3.125.0" "3.33333333333333.0" "3.54166666666667.0"
## [19] "3.75.0" "3.95833333333333.0" "4.16666666666667.0"
## [22] "4.375.0" "4.58333333333333.0" "4.79166666666667.0"
## [25] "5.0" "0.1" "0.208333333333333.1"
## [28] "0.416666666666667.1" "0.625.1" "0.833333333333333.1"
## [31] "1.04166666666667.1" "1.25.1" "1.45833333333333.1"
## [34] "1.66666666666667.1" "1.875.1" "2.08333333333333.1"
## [37] "2.29166666666667.1" "2.5.1" "2.70833333333333.1"
## [40] "2.91666666666667.1" "3.125.1" "3.33333333333333.1"
## [43] "3.54166666666667.1" "3.75.1" "3.95833333333333.1"
## [46] "4.16666666666667.1" "4.375.1" "4.58333333333333.1"
## [49] "4.79166666666667.1" "5.1"
We can then calculate our change in predicted probabilities at each level of x1 and x3. We'll use the mapply function to do this quickly:
change <- mapply(function(a, b) b$fit - a$fit, tmpsplit[1:25], tmpsplit[26:50])
The resulting object change is a matrix, each column of which is the change in predicted probability at each level of x3. We can then use this matrix to plot each change in predicted probability on a single plot.
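A quick check of the dimensions confirms this structure:
dim(change)  # 10 values of x2 by 25 values of x3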
Let's again draw this side-by-side with the predicted probability plot:
layout(matrix(1:2, nrow = 1))
# predicted probabilities
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1",
main = "Predicted Probabilities")
s <- sapply(unique(newdata2$x3), function(i) {
# `x1==0`
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
0 & newdata2$x3 == i], col = rgb(1, 0, 0, 0.5))
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
0 & newdata2$x3 == i] + 1.96 * p3b$se.fit[newdata2$x1 == 0 & newdata2$x3 ==
i], col = rgb(1, 0, 0, 0.5), lty = 2)
lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
0 & newdata2$x3 == i] - 1.96 * p3b$se.fit[newdata2$x1 == 0 & newdata2$x3 ==
i], col = rgb(1, 0, 0, 0.5), lty = 2)
# `x1==1`
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
1 & newdata2$x3 == i], col = rgb(0, 0, 1, 0.5))
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
1 & newdata2$x3 == i] + 1.96 * p3b$se.fit[newdata2$x1 == 1 & newdata2$x3 ==
i], col = rgb(0, 0, 1, 0.5), lty = 2)
lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 ==
1 & newdata2$x3 == i] - 1.96 * p3b$se.fit[newdata2$x1 == 1 & newdata2$x3 ==
i], col = rgb(0, 0, 1, 0.5), lty = 2)
})
# change in predicted probabilities
plot(NA, type = "l", xlim = c(0, 1), ylim = c(-1, 1), xlab = "x2", ylab = "Change in Predicted Probability of y=1",
main = "Change in Predicted Probability due to x1")
abline(h = 0, col = "gray")
apply(change, 2, function(a) lines(tmpsplit[[1]]$x2, a))
## NULL
As we can see, despite the craziness of the left-hand plot, the marginal effect of x1
is actually not affected by x3
(which makes sense because it is not interacted with x3
in the data-generating process). Thus, while the choice of value of x3
at which to estimate the predicted probabilities affects those predictions, the marginal effect of x1 is constant. We can estimate it by following the same procedure as above using any column of our change
matrix:
change[nrow(change), 1] - change[1, 1]
## 0.0
## -0.0409
The result here is a negligible difference in the marginal effect, which is what we would expect given the lack of an interaction between x1
and x3
in the underlying data. If such an interaction were in the actual data, then we should expect that this marginal effect would vary across values of x3
and we would need to further state the marginal effect as conditional on a particular value of x3
.
Unlike with linear models, interpreting GLMs requires looking at predicted values and this is often easiest to understand in the form of a plot. Let's start by creating some binary outcome data in a simple bivariate model:
set.seed(1)
n <- 100
x <- runif(n, 0, 1)
y <- rbinom(n, 1, x)
If we look at this data, we see that there is a relationship between x
and y
, where we are more likely to observe y==1
at higher values of x
. We can fit a linear model to these data, but that fit is probably inappropriate, as we can see in the linear fit shown here:
plot(y ~ x, col = NULL, bg = rgb(0, 0, 0, 0.5), pch = 21)
abline(lm(y ~ x), lwd = 2)
We can use the predict
function to obtain predicted probabilities from other model fits to see if they better fit the data.
We can start by fitting a logit model to the data:
m1 <- glm(y ~ x, family = binomial(link = "logit"))
As with OLS, we then construct the input data over which we want to predict the outcome:
newdf <- data.frame(x = seq(0, 1, length.out = 100))
Because GLM relies on a link function, predict
allows us to both extract the linear predictions as well as predicted probabilities through the inverse link. The default value for the type
argument (type='link'
) gives predictions on the scale of the linear predictor. For logit models, these are directly interpretable as log-odds, but we'll come back to that in a minute. When we set type='response'
, we can obtain predicted probabilities:
newdf$pout_logit <- predict(m1, newdf, se.fit = TRUE, type = "response")$fit
We also need to store the standard errors of the predicted probabilities and we can use those to build confidence intervals:
newdf$pse_logit <- predict(m1, newdf, se.fit = TRUE, type = "response")$se.fit
newdf$pupper_logit <- newdf$pout_logit + (1.96 * newdf$pse_logit) # 95% CI upper bound
newdf$plower_logit <- newdf$pout_logit - (1.96 * newdf$pse_logit) # 95% CI lower bound
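As an aside, a common refinement (a minimal sketch, assuming the m1 model and newdf from above) is to build the interval on the link scale and then transform it with the inverse link (plogis for a logit model), which keeps the bounds inside [0,1]:
lp <- predict(m1, newdf, se.fit = TRUE, type = "link")  # predictions on the log-odds scale
newdf$plower_alt <- plogis(lp$fit - 1.96 * lp$se.fit)   # transform the CI bounds to probabilities
newdf$pupper_alt <- plogis(lp$fit + 1.96 * lp$se.fit)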
With these data in hand, it is trivial to plot the predicted probability of y
for each value of x
:
with(newdf, plot(pout_logit ~ x, type = "l", lwd = 2))
with(newdf, lines(pupper_logit ~ x, type = "l", lty = 2))
with(newdf, lines(plower_logit ~ x, type = "l", lty = 2))
## Predicted probabilities for the probit model ##
We can repeat the above procedure exactly in order to obtain predicted probabilities for the probit model. All we have to change is the value of link
in our original call to glm
:
m2 <- glm(y ~ x, family = binomial(link = "probit"))
newdf$pout_probit <- predict(m2, newdf, se.fit = TRUE, type = "response")$fit
newdf$pse_probit <- predict(m2, newdf, se.fit = TRUE, type = "response")$se.fit
newdf$pupper_probit <- newdf$pout_probit + (1.96 * newdf$pse_probit)
newdf$plower_probit <- newdf$pout_probit - (1.96 * newdf$pse_probit)
Here's the resulting plot, which looks very similar to the one from the logit model:
with(newdf, plot(pout_probit ~ x, type = "l", lwd = 2))
with(newdf, lines(pupper_probit ~ x, type = "l", lty = 2))
with(newdf, lines(plower_probit ~ x, type = "l", lty = 2))
Indeed, we can overlay the logit model (in red) and the probit model (in blue) and see that both models provide essentially identical inference. It's also helpful to have the original data underneath to see how the predicted probabilities communicate information about the original data:
# data
plot(y ~ x, col = NULL, bg = rgb(0, 0, 0, 0.5), pch = 21)
# logit
with(newdf, lines(pout_logit ~ x, type = "l", lwd = 2, col = "red"))
with(newdf, lines(pupper_logit ~ x, type = "l", lty = 2, col = "red"))
with(newdf, lines(plower_logit ~ x, type = "l", lty = 2, col = "red"))
# probit
with(newdf, lines(pout_probit ~ x, type = "l", lwd = 2, col = "blue"))
with(newdf, lines(pupper_probit ~ x, type = "l", lty = 2, col = "blue"))
with(newdf, lines(plower_probit ~ x, type = "l", lty = 2, col = "blue"))
Clearly, the model does an adequate job predicting y
for high and low values of x
, but offers a less accurate prediction for middling values.
Note: You can see the influence of the logistic distribution's heavier tails in its higher predicted probabilities for y
at low values of x
(compared to the probit model) and the reverse at high values of x
.
Plotting is therefore superior to looking at coefficients in order to compare models. This is especially apparent when we compare the substantively identical plots to the values of the coefficients from each model, which seem (at face value) quite different:
summary(m1)$coef[, 1:2]
## Estimate Std. Error
## (Intercept) -2.449 0.5780
## x 4.311 0.9769
summary(m2)$coef[, 1:2]
## Estimate Std. Error
## (Intercept) -1.496 0.3289
## x 2.622 0.5568
As stated above, when dealing with logit
models, we can also directly interpret the log-odds predictions from predict
.
Let's take a look at these using the default type='link'
argument in predict
:
logodds <- predict(m1, newdf, se.fit = TRUE)$fit
Whereas the predicted probabilities (from above) are strictly bounded [0,1]:
summary(newdf$pout_logit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0795 0.2020 0.4270 0.4460 0.6870 0.8660
the log-odds are allowed to vary over any value:
summary(logodds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.450 -1.370 -0.294 -0.294 0.784 1.860
We can calculate standard errors, use those to build confidence intervals for the log-odds, and then plot to make a more direct interpretation:
logodds_se <- predict(m1, newdf, se.fit = TRUE)$se.fit
logodds_upper <- logodds + (1.96 * logodds_se)
logodds_lower <- logodds - (1.96 * logodds_se)
plot(logodds ~ newdf$x, type = "l", lwd = 2)
lines(logodds_upper ~ newdf$x, type = "l", lty = 2)
lines(logodds_lower ~ newdf$x, type = "l", lty = 2)
From this plot we can see that the log-odds of observing y==1
are positive when x>.5
and negative otherwise.
But operating in log-odds is itself confusing because logs are fairly difficult to directly understand. Thus we can translate log-odds to odds by taking exp
of the log-odds and redrawing the plot with the new data.
Recall that the odds of y==1 are the probability of y==1 divided by the probability of y==0 at each value of x. The odds are bounded below by 0. When the odds equal 1, we are equally likely to see an observation with y==1 as with y==0 at that value of x. We saw this in the earlier plot, where the log-odds changed from negative to positive at x==.5. Odds greater than 1 mean that y==1 is more likely than y==0; when the odds are less than 1, the opposite is true.
plot(exp(logodds) ~ newdf$x, type = "l", lwd = 2)
lines(exp(logodds_upper) ~ newdf$x, type = "l", lty = 2)
lines(exp(logodds_lower) ~ newdf$x, type = "l", lty = 2)
This plot shows that when x is low, the odds are between 0 and 1, but when x is high, the odds are quite large. At x==1, the odds are significantly greater than 1 and possibly higher than 6, suggesting that when x==1 a unit is about six times as likely to have y==1 as y==0.
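Relatedly, exponentiating the model coefficients themselves gives odds ratios for a one-unit change in each covariate. Here is a minimal sketch, assuming the m1 logit model from above (confint.default gives Wald-type intervals):
exp(coef(m1))             # odds ratios for the intercept and x
exp(confint.default(m1))  # Wald confidence intervals on the odds-ratio scale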
The easiest way to understand bivariate regression is to view it as equivalent to a two-sample t-test. Imagine we have a binary variable (like male/female or treatment/control):
set.seed(1)
bin <- rbinom(1000, 1, 0.5)
Then we have an outcome that is influenced by that group:
out <- 2 * bin + rnorm(1000)
We can use by
to calculate the treatment group means:
by(out, bin, mean)
## bin: 0
## [1] -0.01588
## --------------------------------------------------------
## bin: 1
## [1] 1.966
This translates to a difference of:
diff(by(out, bin, mean))
## [1] 1.982
A two-sample t-test shows us whether there is a significant difference between the two groups:
t.test(out ~ bin)
##
## Welch Two Sample t-test
##
## data: out by bin
## t = -30.3, df = 992.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.111 -1.854
## sample estimates:
## mean in group 0 mean in group 1
## -0.01588 1.96624
If we run a linear regression, we find that the mean-difference is the same as the regression slope:
lm(out ~ bin)
##
## Call:
## lm(formula = out ~ bin)
##
## Coefficients:
## (Intercept) bin
## -0.0159 1.9821
And the t-statistic (and its significance) for the regression slope matches (up to sign) that from the t.test:
summary(lm(out ~ bin))$coef[2, ]
## Estimate Std. Error t value Pr(>|t|)
## 1.982e+00 6.544e-02 3.029e+01 1.949e-143
It becomes quite easy to see this visually in a plot of the regression:
plot(out ~ bin, col = "gray")
points(0:1, by(out, bin, mean), col = "blue", bg = "blue", pch = 23)
abline(coef(lm(out ~ bin)), col = "blue")
A regression involving a continuous covariate is similar, but rather than representing the difference in means between two groups (as when the covariate is binary), it represents the conditional mean of the outcome at each level of the covariate. We can see this in some simple fake data:
set.seed(1)
x <- runif(1000, 0, 10)
y <- 3 * x + rnorm(1000, 0, 5)
Here, we'll cut our covariate into five levels and estimate the density of the outcome y
in each of those levels:
x1 <- ifelse(x < 2, 1, ifelse(x >= 2 & x < 4, 2, ifelse(x >= 4 & x < 6, 3, ifelse(x >=
6 & x < 8, 4, ifelse(x >= 8 & x < 10, 5, NA)))))
d1 <- density(y[x1 == 1])
d2 <- density(y[x1 == 2])
d3 <- density(y[x1 == 3])
d4 <- density(y[x1 == 4])
d5 <- density(y[x1 == 5])
We'll then use those values to show how the regression models the mean of y
conditional on x
. Let's start with the model:
m1 <- lm(y ~ x)
It is also worth highlighting that, in a bivariate regression model, the regression slope is simply a weighted version of the correlation coefficient. We can see this by calculating the correlation between x
and y
and then scaling it by the ratio of the standard deviation of y to the standard deviation of x (computed below as the square root of the ratio of their variances). You'll see that this is exactly the slope coefficient reported by R:
cor(y, x)
## [1] 0.8593
slope <- cor(y, x) * sqrt(cov(y, y)/cov(x, x)) # manually calculate coefficient as weighted correlation
coef(m1)[2] # coefficient on x
## x
## 3.011
slope
## [1] 3.011
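Equivalently (a quick check using the same x and y), the slope can also be written as the covariance of x and y divided by the variance of x:
cov(x, y)/var(x)  # same value as the coefficient on x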
But let's plot the data to get a better understanding of what it looks like:
plot(x, y, col = "gray")
# add the regression equation:
abline(coef(m1), col = "blue")
# add the conditional densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + d1$y * 10, d1$x, type = "l", col = "black")
points(3 + d2$y * 10, d2$x, type = "l", col = "black")
points(5 + d3$y * 10, d3$x, type = "l", col = "black")
points(7 + d4$y * 10, d4$x, type = "l", col = "black")
points(9 + d5$y * 10, d5$x, type = "l", col = "black")
# add points representing conditional means:
points(1, mean(y[x1 == 1]), col = "red", pch = 15)
points(3, mean(y[x1 == 2]), col = "red", pch = 15)
points(5, mean(y[x1 == 3]), col = "red", pch = 15)
points(7, mean(y[x1 == 4]), col = "red", pch = 15)
points(9, mean(y[x1 == 5]), col = "red", pch = 15)
As is clear, the regression line travels through the conditional means of y
at each level of x
. We can also see in the densities that y
is approximately normally distributed at each value of x
(because we made our data that way). These data thus nicely satisfy the assumptions for linear regression.
Obviously, our data rarely satisfy those assumptions so nicely. We can modify our fake data to have less desirable properties and see how that affects our inference. Let's put a discontinuity in our y
value by simply increasing it by 10 for all values of x
greater than 6:
y2 <- y
y2[x > 6] <- y[x > 6] + 10
We can build a new model for these data:
m2 <- lm(y2 ~ x)
Let's estimate the conditional densities, as we did above, but for the new data:
e1 <- density(y2[x1 == 1])
e2 <- density(y2[x1 == 2])
e3 <- density(y2[x1 == 3])
e4 <- density(y2[x1 == 4])
e5 <- density(y2[x1 == 5])
And then let's look at how that model fits the new data:
plot(x, y2, col = "gray")
# add the regression equation:
abline(coef(m2), col = "blue")
# add the conditional densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + e1$y * 10, e1$x, type = "l", col = "black")
points(3 + e2$y * 10, e2$x, type = "l", col = "black")
points(5 + e3$y * 10, e3$x, type = "l", col = "black")
points(7 + e4$y * 10, e4$x, type = "l", col = "black")
points(9 + e5$y * 10, e5$x, type = "l", col = "black")
# add points representing conditional means:
points(1, mean(y2[x1 == 1]), col = "red", pch = 15)
points(3, mean(y2[x1 == 2]), col = "red", pch = 15)
points(5, mean(y2[x1 == 3]), col = "red", pch = 15)
points(7, mean(y2[x1 == 4]), col = "red", pch = 15)
points(9, mean(y2[x1 == 5]), col = "red", pch = 15)
As should be clear in the plot, the line no longer goes through the conditional means (see, especially, the third density curve) because the outcome y
is not a linear function of x
.
To obtain a better fit, we can estimate two separate lines, one on each side of the discontinuity:
m3a <- lm(y2[x <= 6] ~ x[x <= 6])
m3b <- lm(y2[x > 6] ~ x[x > 6])
Now let's redraw our data and the plot for x<=6
in red and the plot for x>6
in blue:
plot(x, y2, col = "gray")
segments(0, coef(m3a)[1], 6, coef(m3a)[1] + 6 * coef(m3a)[2], col = "red")
segments(6, coef(m3b)[1] + (6 * coef(m3b)[2]), 10, coef(m3b)[1] + 10 * coef(m3b)[2],
col = "blue")
# redraw the densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + e1$y * 10, e1$x, type = "l", col = "black")
points(3 + e2$y * 10, e2$x, type = "l", col = "black")
points(5 + e3$y * 10, e3$x, type = "l", col = "black")
points(7 + e4$y * 10, e4$x, type = "l", col = "black")
points(9 + e5$y * 10, e5$x, type = "l", col = "black")
# redraw points representing conditional means:
points(1, mean(y2[x1 == 1]), col = "red", pch = 15)
points(3, mean(y2[x1 == 2]), col = "red", pch = 15)
points(5, mean(y2[x1 == 3]), col = "red", pch = 15)
points(7, mean(y2[x1 == 4]), col = "blue", pch = 15)
points(9, mean(y2[x1 == 5]), col = "blue", pch = 15)
Our two new models m3a
and m3b
are better fits to the data because they satisfy the requirement that the regression line travel through the conditional means of y
.
Thus, regardless of the form of our covariate(s), our regression models only provide valid inference if the regression line travels through the conditional mean of y
for every value of x
.
Binary and continuous covariates are easy to model, but we often have data that are not binary or continuous, but instead are categorical. Building regression models with these kinds of variables gives us many options to consider. In particular, developing a model that points through the conditional means of the outcome can be more complicated because the relationship between the outcome and a categorical covariate (if treated as continuous) is unlikely to be linear. We then have to decide how best to model the data. Let's start with some fake data to illustrate this:
a <- sample(1:5, 500, TRUE)
b <- numeric(length = 500)
b[a == 1] <- a[a == 1] + rnorm(sum(a == 1))
b[a == 2] <- 2 * a[a == 2] + rnorm(sum(a == 2))
b[a == 3] <- 2 * a[a == 3] + rnorm(sum(a == 3))
b[a == 4] <- 0.5 * a[a == 4] + rnorm(sum(a == 4))
b[a == 5] <- 2 * a[a == 5] + rnorm(sum(a == 5))
Let's treat a
as a continuous covariate, assume the a-b relationship is linear, and build the corresponding linear regression model:
n1 <- lm(b ~ a)
We can see the relationship in the data by plotting b
as a function of a
:
plot(a, b, col = "gray")
abline(coef(n1), col = "blue")
# draw points representing conditional means:
points(1, mean(b[a == 1]), col = "red", pch = 15)
points(2, mean(b[a == 2]), col = "red", pch = 15)
points(3, mean(b[a == 3]), col = "red", pch = 15)
points(4, mean(b[a == 4]), col = "red", pch = 15)
points(5, mean(b[a == 5]), col = "red", pch = 15)
Clearly, the regression line misses the conditional mean values of b
at all values of a
. Our model is therefore not very good.
To correct for this, we can either (1) attempt to transform our variables to force a straight-line (which probably isn't possible in this case, but might be if the relationship were curvilinear) or (2) convert the a
covariate to a factor and thus model the relationship as a series of indicator (or “dummy”) variables.
When we treat a discrete covariate as a factor, R automatically transforms the variable into a series of indicator variables during the estimation of the regression. Let's compare our original model to this new model:
# our original model (treating `a` as continuous):
summary(n1)
##
## Call:
## lm(formula = b ~ a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.392 -0.836 0.683 1.697 4.605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4679 0.2569 -1.82 0.069 .
## a 1.7095 0.0759 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.48 on 498 degrees of freedom
## Multiple R-squared: 0.505, Adjusted R-squared: 0.504
## F-statistic: 507 on 1 and 498 DF, p-value: <2e-16
# our new model:
n2 <- lm(b ~ factor(a))
summary(n2)
##
## Call:
## lm(formula = b ~ factor(a))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9332 -0.6578 -0.0772 0.7627 2.9459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.063 0.104 10.21 < 2e-16 ***
## factor(a)2 2.772 0.154 18.03 < 2e-16 ***
## factor(a)3 4.881 0.150 32.47 < 2e-16 ***
## factor(a)4 0.848 0.152 5.56 4.4e-08 ***
## factor(a)5 8.946 0.143 62.35 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared: 0.909, Adjusted R-squared: 0.908
## F-statistic: 1.23e+03 on 4 and 495 DF, p-value: <2e-16
Obviously, the regression output is quite different for the two models. For n1
, we see the slope of the line we drew in the plot above. For n2
, we instead see coefficients comparing the mean of b at each of the other levels of a to the mean of b at the baseline level a==1 (i.e., dummy-variable coefficients).
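To see this concretely (a minimal sketch using the n2 model and the a and b vectors from above), adding each dummy coefficient to the intercept recovers the conditional mean of b at the corresponding level of a:
coef(n2)[1] + c(0, coef(n2)[-1])          # intercept plus each dummy coefficient
sapply(1:5, function(i) mean(b[a == i]))  # conditional means of b at each level of a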
R defaults to taking the lowest factor level as the baseline, but we can change this by reordering the levels of the factor:
# a==5 as baseline:
summary(lm(b ~ factor(a, levels = 5:1)))
##
## Call:
## lm(formula = b ~ factor(a, levels = 5:1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9332 -0.6578 -0.0772 0.7627 2.9459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.0087 0.0987 101.4 <2e-16 ***
## factor(a, levels = 5:1)4 -8.0978 0.1487 -54.5 <2e-16 ***
## factor(a, levels = 5:1)3 -4.0642 0.1466 -27.7 <2e-16 ***
## factor(a, levels = 5:1)2 -6.1733 0.1501 -41.1 <2e-16 ***
## factor(a, levels = 5:1)1 -8.9456 0.1435 -62.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared: 0.909, Adjusted R-squared: 0.908
## F-statistic: 1.23e+03 on 4 and 495 DF, p-value: <2e-16
# a==4 as baseline:
summary(lm(b ~ factor(a, levels = c(4, 1, 2, 3, 5))))
##
## Call:
## lm(formula = b ~ factor(a, levels = c(4, 1, 2, 3, 5)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9332 -0.6578 -0.0772 0.7627 2.9459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.911 0.111 17.17 < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))1 -0.848 0.152 -5.56 4.4e-08
## factor(a, levels = c(4, 1, 2, 3, 5))2 1.925 0.159 12.13 < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))3 4.034 0.155 25.97 < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))5 8.098 0.149 54.45 < 2e-16
##
## (Intercept) ***
## factor(a, levels = c(4, 1, 2, 3, 5))1 ***
## factor(a, levels = c(4, 1, 2, 3, 5))2 ***
## factor(a, levels = c(4, 1, 2, 3, 5))3 ***
## factor(a, levels = c(4, 1, 2, 3, 5))5 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared: 0.909, Adjusted R-squared: 0.908
## F-statistic: 1.23e+03 on 4 and 495 DF, p-value: <2e-16
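As an aside (a minimal sketch; relevel is a base R function), we can change the baseline category without retyping the full level ordering:
# a==3 as baseline via relevel():
summary(lm(b ~ relevel(factor(a), ref = "3")))$coef[, 1:2]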
Another approach is to model the regression without an intercept:
# a==1 as baseline with no intercept:
n3 <- lm(b ~ 0 + factor(a))
summary(n3)
##
## Call:
## lm(formula = b ~ 0 + factor(a))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9332 -0.6578 -0.0772 0.7627 2.9459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## factor(a)1 1.0631 0.1042 10.2 <2e-16 ***
## factor(a)2 3.8354 0.1131 33.9 <2e-16 ***
## factor(a)3 5.9445 0.1084 54.9 <2e-16 ***
## factor(a)4 1.9109 0.1113 17.2 <2e-16 ***
## factor(a)5 10.0087 0.0987 101.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared: 0.968, Adjusted R-squared: 0.967
## F-statistic: 2.97e+03 on 5 and 495 DF, p-value: <2e-16
In this model, the coefficients are exactly the conditional means of b
for each value of a
:
coef(n3) # coefficients
## factor(a)1 factor(a)2 factor(a)3 factor(a)4 factor(a)5
## 1.063 3.835 5.944 1.911 10.009
sapply(1:5, function(x) mean(b[a == x])) # conditional means
## [1] 1.063 3.835 5.944 1.911 10.009
All of these models produce the same substantive inference, but one parameterization may simplify interpretation in a particular situation.
## Variable transformations ##
Sometimes, rather than forcing the categorical variable to be a set of indicators through the use of factor
, we can treat the covariate as continuous once we transform it or the outcome in some way.
Let's start with some fake data (based on our previous example):
c <- a^3
d <- 2 * a + rnorm(length(a))
These data have a curvilinear relationship that is not well represented by a linear regression line:
plot(c, d, col = "gray")
sapply(unique(c), function(x) points(x, mean(d[c == x]), col = "red", pch = 15))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
abline(lm(d ~ c))
As before, we can model this by treating the covariate c
as a factor and find that model gives us the conditional means of d
:
coef(lm(d ~ 0 + factor(c))) # coefficients
## factor(c)1 factor(c)8 factor(c)27 factor(c)64 factor(c)125
## 1.906 3.988 6.078 7.917 10.188
sapply(sort(unique(c)), function(x) mean(d[c == x])) # conditional means
## [1] 1.906 3.988 6.078 7.917 10.188
We can also obtain the same substantive inference by transforming the variable(s) to produce a linear fit. In this case, we know (because we made up the data) that there is a cubic relationship between c
and d
. If we make a new version of the covariate c
that is the cube-root of c
, we should be able to force a linear fit:
c2 <- c^(1/3)
plot(c2, d, col = "gray")
sapply(unique(c2), function(x) points(x, mean(d[c2 == x]), col = "red", pch = 15))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
abline(lm(d ~ c2))
We could also transform the outcome d
by cubing it:
d2 <- d^3
plot(c, d2, col = "gray")
sapply(unique(c), function(x) points(x, mean(d2[c == x]), col = "red", pch = 15))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
abline(lm(d2 ~ c))
Again, the plot shows this transformation also produces a linear fit. Thus we can reasonably model the relationship between a discrete covariate and a continuous outcome in a number of ways that satisfy the basic assumption of drawing the regression line through the conditional means of the outcome.
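Note, too, that such transformations can be written directly inside the model formula using I(), so we don't have to create new variables. A minimal sketch using the c and d data from above:
coef(lm(d ~ I(c^(1/3))))  # transform the covariate in the formula
coef(lm(I(d^3) ~ c))      # or transform the outcome instead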
Unlike other statistical packages, R has a robust and simple-to-use set of string manipulation functions. These functions become useful in a number of situations, including: dynamically creating variables, generating tabular and graphical output, reading and writing from text files and the web, and managing character data (e.g., recoding free response or other character data). This tutorial walks through some of the basic string manipulation functions.
The simplest and most important string manipulation function is paste
. It allows the user to concatenate character strings (and vectors of character strings) in a number of different ways.
The easiest way to use paste
is simply to concatenate several values together:
paste("1", "2", "3")
## [1] "1 2 3"
The result is a single string (i.e., one-element character vector) with the numbers separated by spaces (which is the default). We can also separate by other values:
paste("1", "2", "3", sep = ",")
## [1] "1,2,3"
A helpful feature of paste
is that it coerces objects to character before concatenating, so we can get the same result above using:
paste(1, 2, 3, sep = ",")
## [1] "1,2,3"
This also means we can combine objects of different classes (e.g., character and numeric):
paste("a", 1, "b", 2, sep = ":")
## [1] "a:1:b:2"
Another helpful feature of paste
is that it is vectorized, meaning that we can concatenate each element of two or more vectors in a single call:
a <- letters[1:10]
b <- 1:10
paste(a, b, sep = "")
## [1] "a1" "b2" "c3" "d4" "e5" "f6" "g7" "h8" "i9" "j10"
The result is a 10-element vector, where the first element of a
has been paste
d to the first element of b
and so forth.
We might want to collapse a multi-item vector into a single string and for this we can use the collapse
argument to paste
:
paste(a, collapse = "")
## [1] "abcdefghij"
Here, all of the elements of a
are concatenated into a single string.
We can also combine the sep
and collapse
arguments to obtain different results:
paste(a, b, sep = "", collapse = ",")
## [1] "a1,b2,c3,d4,e5,f6,g7,h8,i9,j10"
paste(a, b, sep = ",", collapse = ";")
## [1] "a,1;b,2;c,3;d,4;e,5;f,6;g,7;h,8;i,9;j,10"
The first result above concatenates corresponding elements from each vector without a space and then separates them by a comma. The second result concatenates corresponding elements with a comma between the elements and separates each pair of elements by semicolon.
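As a small convenience (a minimal sketch; paste0 is part of base R in recent versions), paste0 is shorthand for paste with sep = "":
paste0(a, b)
## [1] "a1" "b2" "c3" "d4" "e5" "f6" "g7" "h8" "i9" "j10"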
The strsplit
function offers essentially the reversal of paste
, by cutting a string into parts based on a separator. Here we can collapse our a
vector and then split it back into a vector:
a1 <- paste(a, collapse = ",")
a1
## [1] "a,b,c,d,e,f,g,h,i,j"
strsplit(a1, ",")
## [[1]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
Note: strsplit
returns a list of results, so accessing the elements requires using the [[]]
(double bracket) operator. To get the second element from the split vector, use:
strsplit(a1, ",")[[1]][2]
## [1] "b"
The reason for this return value is that strsplit
is also vectorized. So we can split multiple elements of a character vector in one call:
b1 <- paste(a, b, sep = ",")
b1
## [1] "a,1" "b,2" "c,3" "d,4" "e,5" "f,6" "g,7" "h,8" "i,9" "j,10"
strsplit(b1, ",")
## [[1]]
## [1] "a" "1"
##
## [[2]]
## [1] "b" "2"
##
## [[3]]
## [1] "c" "3"
##
## [[4]]
## [1] "d" "4"
##
## [[5]]
## [1] "e" "5"
##
## [[6]]
## [1] "f" "6"
##
## [[7]]
## [1] "g" "7"
##
## [[8]]
## [1] "h" "8"
##
## [[9]]
## [1] "i" "9"
##
## [[10]]
## [1] "j" "10"
The result is a list of split vectors.
Sometimes we want to get every single character from a character string, and for this we can use an empty separator:
strsplit(a1, "")[[1]]
## [1] "a" "," "b" "," "c" "," "d" "," "e" "," "f" "," "g" "," "h" "," "i"
## [18] "," "j"
The result is every letter and every separator split apart.
strsplit
also supports much more advanced character splitting using “regular expressions.” We address that in a separate tutorial.
Sometimes we want to know how many characters are in a string, or just get a subset of the characters. R provides two functions to help us with these operations: nchar
and substring
.
You can think of nchar
as analogous to length
but instead of telling you how many elements are in a vector it tells you how many characters are in a string:
d <- "abcd"
length(d)
## [1] 1
nchar(d)
## [1] 4
nchar
is vectorized, which means we can retrieve the number of characters in each element of a character vector in one call:
e <- c("abc", "defg", "hi", "jklmnop")
nchar(e)
## [1] 3 4 2 7
substring
lets you extract a part of a string based on the position of characters in the string and can be combined with nchar
:
f <- "hello"
substring(f, 1, 1)
## [1] "h"
substring(f, 2, nchar(f))
## [1] "ello"
substring
is also vectorized. For example we could extract the first character from each element of a vector:
substring(e, 1, 1)
## [1] "a" "d" "h" "j"
Or even the last character of elements with different numbers of characters:
e
## [1] "abc" "defg" "hi" "jklmnop"
nchar(e)
## [1] 3 4 2 7
substring(e, nchar(e), nchar(e))
## [1] "c" "g" "i" "p"
The difference between a simple graph and a visually stunning graph is of course a matter of many features. But one of the biggest contributors to the “wow” factors that often accompanies R graphics is the careful use of color. By default, R graphs tend to be black-and-white and, in fact, rather unattractive. But R provides many functions for carefully controlling the colors that are used in plots. This tutorial looks at some of these functions.
To start, we need to have a baseline graph. We'll use a simple scatterplot. Let's start with some x
and y
data vectors and a z
grouping factor that we'll use later:
set.seed(100)
z <- sample(1:4, 100, TRUE)
x <- rnorm(100)
y <- rnorm(100)
Let's draw the basic scatterplot:
plot(x, y, pch = 15)
By default, the points in this plot are black. But we can change that color by specifying a col
argument and a character string containing a color. For example, we could make the points red:
plot(x, y, pch = 15, col = "red")
or blue:
plot(x, y, pch = 15, col = "blue")
R comes with hundreds of colors, which we can see using the colors()
function. Let's see the first 25 colors in this:
colors()[1:25]
## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"
## [9] "aquamarine1" "aquamarine2" "aquamarine3" "aquamarine4"
## [13] "azure" "azure1" "azure2" "azure3"
## [17] "azure4" "beige" "bisque" "bisque1"
## [21] "bisque2" "bisque3" "bisque4" "black"
## [25] "blanchedalmond"
You can specify any of these colors as is.
An important aspect of R's use of the col
argument is the notion of vector recycling. R expects the col
argument to have the same length as the number of things it's plotting (in this case the number of points). So when we specify col='red'
, R actually “recycles” the color red for each point, effectively constructing a vector like c('red','red','red',...)
equal to the length of our data.
We can take advantage of recycling to specify multiple colors. For example, we can color the points in our data alternately red and blue:
plot(x, y, pch = 15, col = c("red", "blue"))
Of course, these colors are not substantively meaningful. Our data are not organized in an alternating fashion. We did, however, have a grouping factor z
that takes four levels. We can imagine that these are four substantively important groups in our data that we would like to highlight with different colors. To do that, we could specify a vector of four colors and index it using our z
vector:
plot(x, y, pch = 15, col = c("red", "blue", "green", "orange")[z])
Now, the four groups each have their own color in the resulting plot. Another strategy is to use the pch
(“point character”) argument to identify groups, which we can do using the same logic:
plot(x, y, pch = c(15, 16, 17, 18)[z])
But I think colors look better here than different shapes. Of course, sometimes we have to print in grayscale or monochrome, so finding the best combination of shapes and colors may take a bit of work.
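Whatever scheme we choose, a legend helps readers map the colors (or shapes) back to the groups in z. A minimal sketch using the same four colors as above:
plot(x, y, pch = 15, col = c("red", "blue", "green", "orange")[z])
legend("topright", legend = 1:4, col = c("red", "blue", "green", "orange"), pch = 15, title = "z")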
In addition to the named colors, R can also generate any other color pattern in the rainbow using one of several functions. For example, the rgb
function can generate a color based on levels of Red, Green, and Blue (thus the rgb
name). For example, the color red is simply:
rgb(1, 0, 0)
## [1] "#FF0000"
The result is the color red expressed in hexadecimal format. Two other functions - hsv
and hcl
- let you specify colors in other ways, but rgb
is the easiest, in part, because hexadecimal format is widely used in web publishing so there are many tools online for figuring out how to create the color you want as a combination of red, green, and blue. We can see that specifying col='red'
or col=rgb(1,0,0)
produce the same graphical result:
plot(x, y, pch = 15, col = "red")
plot(x, y, pch = 15, col = rgb(1, 0, 0))
But rgb
(and the other color-generation functions) are also “vectorized”, meaning that we can supply them with a vector of numbers in order to obtain different shades. For example, to get four shades of red, we can type:
rgb((1:4)/4, 0, 0)
## [1] "#400000" "#800000" "#BF0000" "#FF0000"
If we index this with z
(as we did above), we get a plot where the different groups are represented by different shades of red:
plot(x, y, pch = 15, col = rgb((1:4)/4, 0, 0)[z])
When we have to print in grayscale, R also supplies a function for building shades of gray, which is called - unsurprisingly - gray
. The gray
function takes a number between 0 and 1 that specifies a shade of gray between black (0) and white (1):
gray(0.5)
## [1] "#808080"
The response is, again, a hexadecimal color representation. Like rgb
, gray
is vectorized and we can use it to color our plot:
gray((1:4)/6)
## [1] "#2B2B2B" "#555555" "#808080" "#AAAAAA"
plot(x, y, pch = 15, col = gray((1:4)/6)[z])
But R doesn't restrict us to one color palette - just one color or just grayscale. We can also produce “rainbows” of color. For example, we could use the rainbow
function to get a rainbow of four different colors and use it on our plot.
plot(x, y, pch = 15, col = rainbow(4)[z])
rainbow
takes additional arguments, such as start
and end
that specify where on the rainbow (as measured from 0 to 1) the colors should come from. So, specifying low values for start
and end
will make a red/yellow-ish plot, middling values will produce a green/blue-ish plot, and high values will produce a blue/purple-ish plot:
plot(x, y, pch = 15, col = rainbow(4, start = 0, end = 0.25)[z])
plot(x, y, pch = 15, col = rainbow(4, start = 0.35, end = 0.6)[z])
plot(x, y, pch = 15, col = rainbow(4, start = 0.7, end = 0.9)[z])
Above we've used color to convey groups within the data. But we can also use color to convey a third variable on our two-dimensional plot. For example, we can imagine that we have some outcome val
to which x
and y
each contribute. We want to see the level of val
as it is affected by both x
and y
. Let's start by creating the val
vector as a function of x
and y
and then use it as a color value:
val <- x + y
Then let's rescale val
to be between 0 and 1 to make it easier to use in our color functions:
valcol <- (val + abs(min(val)))/max(val + abs(min(val)))
Now we can use the valcol
vector to color our plot using gray
:
plot(x, y, pch = 15, col = gray(valcol))
We could also use rgb
to create a spectrum of blues:
plot(x, y, pch = 15, col = rgb(0, 0, valcol))
There are endless other options, but this conveys the basic principles of plot coloring, which rely on named colors or a color-generation function, and the general R principles of recycling and vectorization.
Commenting is a way to describe the contents of an R script. Commenting is very important for reproducibility because it helps make sense of code to others and to a future you.
The scripts used in these materials include comments. Any text that follows a hash symbol (#
) becomes an R comment.
Anything can be in a comment. It is ignored by R.
You can comment an entire line or just the end of a line, like:
2 + 2 # This is a comment. The code before the `#` is still evaluated by R.
## [1] 4
Some languages provide multi-line comments. R doesn't have these. Every line has to be commented individually. Most script editors provide the ability to comment out multiple lines at once. This can be helpful if you change your mind about some code:
a <- 1:10
# b <- 1:10
b <- 10:1
In the above example, we comment out the line we don't want to run.
If there are large blocks of valid R code that you decide you don't want to run, you can wrap them in an if
statement:
a <- 10
b <- 10
c <- 10
if (FALSE) {
a <- 1
b <- 2
c <- 3
}
a
## [1] 10
b
## [1] 10
c
## [1] 10
The lines inside the if(FALSE){...}
block are not run.
If you decide you want to run them after all, you can just change FALSE
to TRUE
.
## The comment function ##
R also provides a quite useful function called comment that stores a hidden description of an object.
This can be useful in interactive sessions for keeping track of a large number of objects.
It also has other uses in modelling and plotting that are discussed elsewhere.
To add a comment to an object, we simply assign something to the object and then assign it a comment:
d <- 1:10
d
## [1] 1 2 3 4 5 6 7 8 9 10
comment(d) <- "This is my first vector"
d
## [1] 1 2 3 4 5 6 7 8 9 10
Adding a comment is similar to adding a names
attribute to an object, but the comment is not printed when we call d
.
To see a comment for an object, we need to use the comment
function again:
comment(d)
## [1] "This is my first vector"
If an object has no comment, we receive a NULL result:
e <- 1:5
comment(e)
## NULL
Note: Comments must be valid character vectors. It is not possible to store a numeric value as a comment, but one can have multiple comments:
comment(e) <- c("hi", "bye")
comment(e)
## [1] "hi" "bye"
And this means that they can be indexed:
comment(e)[2]
## [1] "bye"
And that we can add additional comments:
comment(e)[length(comment(e)) + 1] <- "hello again"
comment(e)
## [1] "hi" "bye" "hello again"
Because comments are not printed by default, it is easy to forget about them. But they can be quite useful.
The correlation coefficient speaks to the degree to which the relationship between two variables can be summarized by a straight line.
set.seed(1)
n <- 1000
x1 <- rnorm(n, -1, 10)
x2 <- rnorm(n, 3, 2)
y <- 5 * x1 + x2 + rnorm(n, 1, 2)
To obtain the correlation of two variables, we simply supply both variables as arguments to cor:
cor(x1, x2)
## [1] 0.006401
If we want to test the significance of the correlation, we need to use the cor.test
function:
cor.test(x1, x2)
##
## Pearson's product-moment correlation
##
## data: x1 and x2
## t = 0.2022, df = 998, p-value = 0.8398
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05561 0.06837
## sample estimates:
## cor
## 0.006401
To obtain a correlation matrix, we have to supply an input matrix:
cor(cbind(x1, x2, y))
## x1 x2 y
## x1 1.000000 0.006401 0.99838
## x2 0.006401 1.000000 0.04731
## y 0.998376 0.047312 1.00000
Correlation captures only linear association, however; a strong nonlinear relationship can produce a correlation near zero. Consider:
a <- rnorm(n)
b <- a^2 + rnorm(n)
If we plot the relationship of b
on a
, we see a strong (non-linear) relationship:
plot(b ~ a)
Yet the correlation between the two variables is low:
cor(a, b)
## [1] -0.01712
cor.test(a, b)
##
## Pearson's product-moment correlation
##
## data: a and b
## t = -0.541, df = 998, p-value = 0.5886
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07903 0.04492
## sample estimates:
## cor
## -0.01712
If we can identify the functional form of the relationship, however, we can recover it by transforming the data. Clearly a linear fit is inappropriate:
plot(b ~ a, col = "gray")
curve((x), col = "red", add = TRUE)
But what about a quadratic relationship, b ~ a^2
(of course, we know this is correct):
plot(b ~ a, col = "gray")
curve((x^2), col = "blue", add = TRUE)
The correlation between b
and a^2 is thus much higher:
cor(a^2, b)
## [1] 0.843
We can see this visually by plotting b
against the transformed a
variable:
plot(b ~ I(a^2), col = "gray")
If we now overlay a linear relationship, we see how well the transformed data are represented by a line:
plot(b ~ I(a^2), col = "gray")
curve((x), col = "blue", add = TRUE)
Now let's see this side-by-side to see how the transform works:
layout(matrix(1:2, nrow = 1))
plot(b ~ a, col = "gray")
curve((x^2), col = "blue", add = TRUE)
plot(b ~ I(a^2), col = "gray")
curve((x), col = "blue", add = TRUE)
An approach that is sometimes used to examine the effects of variables involves “partial correlations.”
A partial correlation measures the strength of the linear relationship between two variables, controlling for the influence of one or more covariates.
For example, the correlation of y
and z
is:
z <- x1 + rnorm(n, 0, 2)
cor(y, z)
## [1] 0.9813
This correlation might be inflated or deflated due to the common antecedent variable x1
in both y
and z
.
Thus we may want to remove the variation due to x1
from both y
and z
via linear regression:
part1 <- lm(y ~ x1)
part2 <- lm(z ~ x1)
The correlation of the residuals of those two models is thus the partial correlation:
cor(part1$residual, part2$residual)
## [1] 0.03828
As we can see, the correlation between these variables is actually much lower once we account for the variation attributable to x1
.
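If we do this often, we might wrap the residual-on-residual approach in a small helper function (a hypothetical convenience function, not part of base R):
partial_cor <- function(y, z, controls) {
    # correlate the parts of y and z not explained by the controls
    cor(resid(lm(y ~ controls)), resid(lm(z ~ controls)))
}
partial_cor(y, z, x1)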
We sometimes want to estimate models of count outcomes. Depending on substantive assumptions, we can model these using a linear model, an ordered outcome model, or a count-specific model. This tutorial talks about count models, specifically Poisson models and negative binomial models.
Poisson models can be estimated using R's base glm
function, but negative binomial regression requires the MASS add-on package, which is a recommended package and is therefore pre-installed; you simply need to load it.
# poisson(link = 'log')
library(MASS)
# glm.nb()
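Here is a minimal sketch of both models on simulated count data (the variable names and parameter values are illustrative, not taken from anything above; it assumes MASS is loaded as shown):
set.seed(10)
xc <- rnorm(500)
yc <- rnbinom(500, mu = exp(0.5 + 0.8 * xc), size = 1.5)  # overdispersed counts
mp <- glm(yc ~ xc, family = poisson(link = "log"))        # Poisson regression
mnb <- glm.nb(yc ~ xc)                                    # negative binomial regression (MASS)
summary(mp)$coef
summary(mnb)$coef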
As part of reproducible research, it is critical to make data and replication files publicly available. Within political science, The Dataverse Network is increasingly seen as the disciplinary standard for where and how to permanently archive data and replication files. This tutorial works through how to archive study data, files, and metadata at The Dataverse Network directly through R.
The Dataverse Network, created by the Institute for Quantitative Social Science at Harvard University, is software and an associated network of websites that permanently archive social data for posterity. The service is free to use, relatively simple, and strengthened by a recently added Application Programming Interface (API) that allows researchers to deposit into the Dataverse directly from R through the dvn package.
To deposit data in the Dataverse, you need to have an account. You can pick which dataverse website you want to use, but I recommend using the Harvard Dataverse, where much political science data is stored. Once you create an account and configure a personal dataverse, you can do almost everything else directly in R. To get started, install and load the dvn package:
install.packages("dvn", repos = "http://cran.r-project.org")
## Warning: package 'dvn' is in use and will not be installed
library(dvn)
Once installed, you'll need to setup your username and password using:
options(dvn.user = "username", dvn.pwd = "password")
Remember not to share your username and password with others. Since the remainder of this tutorial only works with a proper username and password, the following code is commented out, but it should run on your machine. You can check that your login credentials work by retrieving your Service Document:
# dvServiceDoc()
If that succeeds, then you can easily create a study by setting up some metadata (e.g., the title, author, etc. for your study) and then using dvCreateStudy
to create the study listing.
# writeLines(dvBuildMetadata(title='My Study'),'mystudy.xml') created <-
# dvCreateStudy('mydataverse','mystudy.xml')
Then, you need to add files. dvn is versatile with regard to how to do this, allowing you to submit filenames as character strings:
# dvAddFile(created$objectId,filename=c('file1.csv','file2.txt'))
dataframes currently loaded in memory:
# dvAddFile(created$objectId,dataframe=mydf)
or a .zip file containing multiple files:
# dvAddFile(created$objectId,filename='files.zip')
You can then confirm that everything has been uploaded successfully by examining the Study Statement:
# dvStudyStatement(created$objectId)
If everything looks good, you can then release the study publicly:
# dvReleaseStudy(created$objectId)
The dvn package also allows you to modify the metadata and delete files, but the above constitutes a complete workflow to making your data publicly available. See the package documentation for more details.
dvn additionally allows you to search for study data directly from R. For example, you can find all of my publicly available data using:
search <- dvSearch(list(authorName = "leeper"))
## 6 search results returned
Thus archiving your data on The Dataverse Network makes it readily accessible to R users everywhere, forever.
In addition to knowing how to index and view dataframes, as is discussed in other tutorials, it is also helpful to be able to adjust the arrangement of dataframes. By this I mean that it is sometimes helpful to split, sample, reorder, reshape, or otherwise change the organization of a dataframe. This tutorial explains a couple of functions that can help with these kinds of tasks. Note: One of the most important things to remember about R dataframes is that it rarely, if ever, matters what order observations or variables are in within a dataframe. Whereas in SPSS and SAS observations have to be sorted before performing operations, R does not require such sorting.
Sometimes we want to get dataframe columns in a different order from how they're read into the data. In most cases, though, we can just index the dataframe to see relevant columns rather than reordering, but we can do the reordering if we want. Say we have the following 5-column dataframe:
set.seed(50)
mydf <- data.frame(a = rep(1:2, each = 10), b = rep(1:4, times = 5), c = rnorm(20),
d = rnorm(20), e = sample(1:20, 20, FALSE))
head(mydf)
## a b c d e
## 1 1 1 0.5497 -0.3499 11
## 2 1 2 -0.8416 -0.5869 1
## 3 1 3 0.0330 -1.5899 7
## 4 1 4 0.5241 1.6896 8
## 5 1 1 -1.7276 0.5636 9
## 6 1 2 -0.2779 2.6676 19
To view the columns in a different order, we can simply index the dataframe differently either by name or column position:
head(mydf[, c(3, 4, 5, 1, 2)])
## c d e a b
## 1 0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869 1 1 2
## 3 0.0330 -1.5899 7 1 3
## 4 0.5241 1.6896 8 1 4
## 5 -1.7276 0.5636 9 1 1
## 6 -0.2779 2.6676 19 1 2
head(mydf[, c("c", "d", "e", "a", "b")])
## c d e a b
## 1 0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869 1 1 2
## 3 0.0330 -1.5899 7 1 3
## 4 0.5241 1.6896 8 1 4
## 5 -1.7276 0.5636 9 1 1
## 6 -0.2779 2.6676 19 1 2
We can save the adjusted column order if we want:
mydf <- mydf[, c(3, 4, 5, 1, 2)]
head(mydf)
## c d e a b
## 1 0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869 1 1 2
## 3 0.0330 -1.5899 7 1 3
## 4 0.5241 1.6896 8 1 4
## 5 -1.7276 0.5636 9 1 1
## 6 -0.2779 2.6676 19 1 2
Changing row order works the same way as changing column order. We can simply index the dataframe in a different way. For example, let's say we want to reverse the order of the dataframe, we can simply write:
mydf[nrow(mydf):1, ]
## c d e a b
## 20 -0.3234 0.39322 5 2 4
## 19 -1.1660 0.40619 4 2 3
## 18 -0.7653 1.83968 3 2 2
## 17 -0.1568 0.41620 20 2 1
## 16 -0.3629 0.01910 2 2 4
## 15 -0.4555 -1.09605 14 2 3
## 14 0.1957 0.59725 6 2 2
## 13 -0.4986 -1.13045 13 2 1
## 12 0.5548 -0.85142 10 2 4
## 11 0.2952 0.19902 12 2 3
## 10 -1.4457 0.02867 17 1 2
## 9 0.9756 0.56875 18 1 1
## 8 -0.5909 -0.36212 16 1 4
## 7 0.3608 0.35653 15 1 3
## 6 -0.2779 2.66763 19 1 2
## 5 -1.7276 0.56358 9 1 1
## 4 0.5241 1.68956 8 1 4
## 3 0.0330 -1.58988 7 1 3
## 2 -0.8416 -0.58690 1 1 2
## 1 0.5497 -0.34993 11 1 1
And then we can save this new order if we want:
mydf <- mydf[nrow(mydf):1, ]
Rarely, however, do we want to just reorder by hand. Instead we might want to reorder according to the values of a column. One's intuition might be to use the sort
function because it is used to sort a vector:
mydf$e
## [1] 5 4 3 20 2 14 6 13 10 12 17 18 16 15 19 9 8 7 1 11
sort(mydf$e)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
But trying to run sort(mydf)
will produce an error.
Instead, we need to use the order
function, which returns the indices that would put the vector in sorted order. Confusing? Let's see how it works:
order(mydf$e)
## [1] 19 5 3 2 1 7 18 17 16 9 20 10 8 6 14 13 11 12 15 4
That doesn't look like a sorted vector, but this is because the values being shown are the indices of the vector, not the values themselves. If we index the mydf$e
vector by the output of order(mydf$e)
, it will be in the order we're expecting:
mydf$e[order(mydf$e)]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We can apply this same logic to sorting a dataframe. We simply pick which column we want to order by and then use the output of order
as a row index. Let's compare the reordered dataframe to the original:
head(mydf[order(mydf$e), ])
## c d e a b
## 2 -0.8416 -0.5869 1 1 2
## 16 -0.3629 0.0191 2 2 4
## 18 -0.7653 1.8397 3 2 2
## 19 -1.1660 0.4062 4 2 3
## 20 -0.3234 0.3932 5 2 4
## 14 0.1957 0.5972 6 2 2
head(mydf) # original
## c d e a b
## 20 -0.3234 0.3932 5 2 4
## 19 -1.1660 0.4062 4 2 3
## 18 -0.7653 1.8397 3 2 2
## 17 -0.1568 0.4162 20 2 1
## 16 -0.3629 0.0191 2 2 4
## 15 -0.4555 -1.0960 14 2 3
Of course, we could save the reordered dataframe just as above:
mydf <- mydf[order(mydf$e), ]
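Note that order accepts multiple sorting keys as well as a decreasing argument, so we could, for example, sort by a and then by e within levels of a, or sort in reverse. A quick sketch:
head(mydf[order(mydf$a, mydf$e), ])             # sort by a, then by e within a
head(mydf[order(mydf$e, decreasing = TRUE), ])  # sort by e in descending order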
Another common operation is to look at a subset of dataframe rows. For example, we might want to look at just the rows where mydf$a==1
. Remembering the rules for indexing a dataframe, we can simply index according to a logical rule:
mydf[mydf$a == 1, ]
## c d e a b
## 2 -0.8416 -0.58690 1 1 2
## 3 0.0330 -1.58988 7 1 3
## 4 0.5241 1.68956 8 1 4
## 5 -1.7276 0.56358 9 1 1
## 1 0.5497 -0.34993 11 1 1
## 7 0.3608 0.35653 15 1 3
## 8 -0.5909 -0.36212 16 1 4
## 10 -1.4457 0.02867 17 1 2
## 9 0.9756 0.56875 18 1 1
## 6 -0.2779 2.66763 19 1 2
And to get the rows where mydf$a==2
, we can do quite the same operation:
mydf[mydf$a == 2, ]
## c d e a b
## 16 -0.3629 0.0191 2 2 4
## 18 -0.7653 1.8397 3 2 2
## 19 -1.1660 0.4062 4 2 3
## 20 -0.3234 0.3932 5 2 4
## 14 0.1957 0.5972 6 2 2
## 12 0.5548 -0.8514 10 2 4
## 11 0.2952 0.1990 12 2 3
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 17 -0.1568 0.4162 20 2 1
We can also combine logical rules to get a further subset of values:
mydf[mydf$a == 1 & mydf$b == 4, ]
## c d e a b
## 4 0.5241 1.6896 8 1 4
## 8 -0.5909 -0.3621 16 1 4
And we need not restrict ourselves to equality comparisons:
mydf[mydf$a == 1 & mydf$b > 2, ]
## c d e a b
## 3 0.0330 -1.5899 7 1 3
## 4 0.5241 1.6896 8 1 4
## 7 0.3608 0.3565 15 1 3
## 8 -0.5909 -0.3621 16 1 4
R also supplies a subset
function, which can be used to select subsets of rows, subsets of columns, or both. It works like so:
# subset of rows:
subset(mydf, a == 1)
## c d e a b
## 2 -0.8416 -0.58690 1 1 2
## 3 0.0330 -1.58988 7 1 3
## 4 0.5241 1.68956 8 1 4
## 5 -1.7276 0.56358 9 1 1
## 1 0.5497 -0.34993 11 1 1
## 7 0.3608 0.35653 15 1 3
## 8 -0.5909 -0.36212 16 1 4
## 10 -1.4457 0.02867 17 1 2
## 9 0.9756 0.56875 18 1 1
## 6 -0.2779 2.66763 19 1 2
subset(mydf, a == 1 & b > 2)
## c d e a b
## 3 0.0330 -1.5899 7 1 3
## 4 0.5241 1.6896 8 1 4
## 7 0.3608 0.3565 15 1 3
## 8 -0.5909 -0.3621 16 1 4
# subset of columns:
subset(mydf, select = c("a", "b"))
## a b
## 2 1 2
## 16 2 4
## 18 2 2
## 19 2 3
## 20 2 4
## 14 2 2
## 3 1 3
## 4 1 4
## 5 1 1
## 12 2 4
## 1 1 1
## 11 2 3
## 13 2 1
## 15 2 3
## 7 1 3
## 8 1 4
## 10 1 2
## 9 1 1
## 6 1 2
## 17 2 1
# subset of rows and columns:
subset(mydf, a == 1 & b > 2, select = c("c", "d"))
## c d
## 3 0.0330 -1.5899
## 4 0.5241 1.6896
## 7 0.3608 0.3565
## 8 -0.5909 -0.3621
Using indices and subset
are equivalent, but the indexing syntax is more general.
In one of the above examples, we extracted two separate dataframes: one for mydf$a==1
and one for mydf$a==2
. We can actually achieve that result using a single line of code involving the split
function, which returns a list of dataframes, separated out by a grouping factor:
split(mydf, mydf$a)
## $`1`
## c d e a b
## 2 -0.8416 -0.58690 1 1 2
## 3 0.0330 -1.58988 7 1 3
## 4 0.5241 1.68956 8 1 4
## 5 -1.7276 0.56358 9 1 1
## 1 0.5497 -0.34993 11 1 1
## 7 0.3608 0.35653 15 1 3
## 8 -0.5909 -0.36212 16 1 4
## 10 -1.4457 0.02867 17 1 2
## 9 0.9756 0.56875 18 1 1
## 6 -0.2779 2.66763 19 1 2
##
## $`2`
## c d e a b
## 16 -0.3629 0.0191 2 2 4
## 18 -0.7653 1.8397 3 2 2
## 19 -1.1660 0.4062 4 2 3
## 20 -0.3234 0.3932 5 2 4
## 14 0.1957 0.5972 6 2 2
## 12 0.5548 -0.8514 10 2 4
## 11 0.2952 0.1990 12 2 3
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 17 -0.1568 0.4162 20 2 1
We can also split by multiple factors, e.g., a dataframe for every unique combination of mydf$a
and mydf$b
:
split(mydf, list(mydf$a, mydf$b))
## $`1.1`
## c d e a b
## 5 -1.7276 0.5636 9 1 1
## 1 0.5497 -0.3499 11 1 1
## 9 0.9756 0.5687 18 1 1
##
## $`2.1`
## c d e a b
## 13 -0.4986 -1.1304 13 2 1
## 17 -0.1568 0.4162 20 2 1
##
## $`1.2`
## c d e a b
## 2 -0.8416 -0.58690 1 1 2
## 10 -1.4457 0.02867 17 1 2
## 6 -0.2779 2.66763 19 1 2
##
## $`2.2`
## c d e a b
## 18 -0.7653 1.8397 3 2 2
## 14 0.1957 0.5972 6 2 2
##
## $`1.3`
## c d e a b
## 3 0.0330 -1.5899 7 1 3
## 7 0.3608 0.3565 15 1 3
##
## $`2.3`
## c d e a b
## 19 -1.1660 0.4062 4 2 3
## 11 0.2952 0.1990 12 2 3
## 15 -0.4555 -1.0960 14 2 3
##
## $`1.4`
## c d e a b
## 4 0.5241 1.6896 8 1 4
## 8 -0.5909 -0.3621 16 1 4
##
## $`2.4`
## c d e a b
## 16 -0.3629 0.0191 2 2 4
## 20 -0.3234 0.3932 5 2 4
## 12 0.5548 -0.8514 10 2 4
Having our dataframes stored inside another object might seem inconvenient, but it actually is very useful because we can use functions like lapply
to perform an operation on every dataframe in the list. For example, we could get the summary of every variable in each of two subsets of the dataframe in a single line of code:
lapply(split(mydf, mydf$a), summary)
## $`1`
## c d e a
## Min. :-1.728 Min. :-1.590 Min. : 1.00 Min. :1
## 1st Qu.:-0.779 1st Qu.:-0.359 1st Qu.: 8.25 1st Qu.:1
## Median :-0.122 Median : 0.193 Median :13.00 Median :1
## Mean :-0.244 Mean : 0.299 Mean :12.10 Mean :1
## 3rd Qu.: 0.483 3rd Qu.: 0.568 3rd Qu.:16.75 3rd Qu.:1
## Max. : 0.976 Max. : 2.668 Max. :19.00 Max. :1
## b
## Min. :1.00
## 1st Qu.:1.25
## Median :2.00
## Mean :2.30
## 3rd Qu.:3.00
## Max. :4.00
##
## $`2`
## c d e a
## Min. :-1.166 Min. :-1.1304 Min. : 2.00 Min. :2
## 1st Qu.:-0.488 1st Qu.:-0.6338 1st Qu.: 4.25 1st Qu.:2
## Median :-0.343 Median : 0.2961 Median : 8.00 Median :2
## Mean :-0.268 Mean : 0.0793 Mean : 8.90 Mean :2
## 3rd Qu.: 0.108 3rd Qu.: 0.4137 3rd Qu.:12.75 3rd Qu.:2
## Max. : 0.555 Max. : 1.8397 Max. :20.00 Max. :2
## b
## Min. :1.00
## 1st Qu.:2.00
## Median :3.00
## Mean :2.70
## 3rd Qu.:3.75
## Max. :4.00
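When we only need a single statistic per group, sapply over a split vector (or the aggregate function) is a more compact alternative. A minimal sketch:
sapply(split(mydf$c, mydf$a), mean)        # mean of c within each level of a
aggregate(c ~ a, data = mydf, FUN = mean)  # the same result, as a small dataframe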
Another common task is random sampling or permutation of rows in a dataframe. For example, we might want to build a regression model on a random subset of cases (a “training set”) and then test the model on the remaining cases (a “test set”). Or, we might want to look at a random sample of the observations (e.g., perhaps to speed up a very time-consuming analysis).
Let's consider the case of sampling for “training” and “test” sets. To obtain a random sample, we have two choices. We can either sample a specified number of rows or we can use a logical index to sample rows based on a specified probability. Both use the sample
function.
To look at, e.g., exactly five randomly selected rows from our data frame as the training set, we can do the following:
s <- sample(1:nrow(mydf), 5, FALSE)
s
## [1] 11 19 12 1 17
Note: The third argument (FALSE
) refers to whether sampling should be done with replacement.
We can then use that directly as a row index:
mydf[s, ]
## c d e a b
## 1 0.5497 -0.34993 11 1 1
## 6 -0.2779 2.66763 19 1 2
## 11 0.2952 0.19902 12 2 3
## 2 -0.8416 -0.58690 1 1 2
## 10 -1.4457 0.02867 17 1 2
To see the test set, we simply drop all rows not in s
:
mydf[-s, ]
## c d e a b
## 16 -0.3629 0.0191 2 2 4
## 18 -0.7653 1.8397 3 2 2
## 19 -1.1660 0.4062 4 2 3
## 20 -0.3234 0.3932 5 2 4
## 14 0.1957 0.5972 6 2 2
## 3 0.0330 -1.5899 7 1 3
## 4 0.5241 1.6896 8 1 4
## 5 -1.7276 0.5636 9 1 1
## 12 0.5548 -0.8514 10 2 4
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 7 0.3608 0.3565 15 1 3
## 8 -0.5909 -0.3621 16 1 4
## 9 0.9756 0.5687 18 1 1
## 17 -0.1568 0.4162 20 2 1
An alternative is to get a random 20% of the rows but not require that to be exactly five observations. To do that, we make 20 random draws (i.e., a number of draws equal to the number of rows in our dataframe) from a binomial distribution with probability .2:
s2 <- rbinom(nrow(mydf), 1, 0.2)
s2
## [1] 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
Note that s2 is a numeric vector of 0s and 1s, not a logical vector. If we use it directly as a row index, R treats the values as row positions, so each 1 selects the first row (over and over) and each 0 selects nothing - which is not what we want:
mydf[s2, ]
## c d e a b
## 2 -0.8416 -0.5869 1 1 2
## 2.1 -0.8416 -0.5869 1 1 2
## 2.2 -0.8416 -0.5869 1 1 2
## 2.3 -0.8416 -0.5869 1 1 2
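To select the rows we actually want, we first need to convert s2 into a logical vector (a quick sketch; either form works):
mydf[as.logical(s2), ]   # TRUE/FALSE index selects the rows where s2 is 1
mydf[s2 == 1, ]          # equivalent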
And, again, we can see the test set as those observations not selected by s2:
mydf[!s2, ]
## c d e a b
## 16 -0.3629 0.01910 2 2 4
## 18 -0.7653 1.83968 3 2 2
## 20 -0.3234 0.39322 5 2 4
## 3 0.0330 -1.58988 7 1 3
## 4 0.5241 1.68956 8 1 4
## 5 -1.7276 0.56358 9 1 1
## 12 0.5548 -0.85142 10 2 4
## 1 0.5497 -0.34993 11 1 1
## 11 0.2952 0.19902 12 2 3
## 13 -0.4986 -1.13045 13 2 1
## 15 -0.4555 -1.09605 14 2 3
## 8 -0.5909 -0.36212 16 1 4
## 10 -1.4457 0.02867 17 1 2
## 9 0.9756 0.56875 18 1 1
## 6 -0.2779 2.66763 19 1 2
## 17 -0.1568 0.41620 20 2 1
Note: Here we use !s2, which coerces the 0/1 values in s2 to logical (and negates them), whereas above we used -s because s was a positional index.
Dataframes are integrally important to using R for any kind of data analysis. One of the most frustrating aspects of R for new users is that, unlike Excel, or even SPSS or Stata, it is not terribly easy to look at and modify data in a spreadsheet-like format. In the tutorial on dataframes as a class, you should have learned a bit about what dataframes are and how to index and modify them. Here we are going to discuss how to look at dataframes in a variety of ways.
Looking at dataframes in R is actually pretty easy. Because a dataframe is an R object, we can simply print it to the console by calling its name. Let's create a dataframe and try this:
mydf <- data.frame(a = rbinom(100, 1, 0.5), b = rnorm(100), c = rnorm(100),
d = rnorm(100), e = sample(LETTERS, 100, TRUE))
mydf
## a b c d e
## 1 0 -0.65302 1.617287 0.35947 O
## 2 0 -1.56067 1.374269 1.16150 U
## 3 0 -0.88265 0.561109 1.50842 Y
## 4 0 -0.64753 1.414186 -1.33762 N
## 5 0 -0.94923 -0.964017 -2.29660 E
## 6 0 1.12688 -0.616431 -1.96846 F
## 7 0 1.72761 0.008532 -0.73825 W
## 8 1 -0.29763 1.572682 -0.19632 T
## 9 0 -0.24442 0.053971 2.59850 Z
## 10 0 -0.84921 -0.189399 -1.13353 K
## 11 0 0.11510 -0.043527 -1.89618 B
## 12 1 0.70786 0.024526 -1.08325 S
## 13 0 -0.92021 -3.408887 -0.70295 E
## 14 0 1.13397 -0.029900 0.55542 V
## 15 0 0.04453 0.373467 -0.61795 S
## 16 1 1.47634 0.944661 -0.36271 Z
## 17 0 1.62780 -0.603154 -0.07608 J
## 18 0 0.78341 -0.591424 0.36601 Q
## 19 0 -0.07220 -1.497778 0.70145 Y
## 20 0 -1.32925 1.888501 1.05821 J
## 21 0 1.08259 -2.293813 0.49702 C
## 22 1 0.73256 -0.552174 -0.72288 B
## 23 1 -0.30210 0.576488 -0.94125 R
## 24 1 -0.39293 -0.186210 0.82022 J
## 25 0 -2.64254 1.022059 -1.40601 T
## 26 0 -0.22410 1.673398 -2.00373 Z
## 27 0 1.95346 -1.285846 1.67366 V
## 28 0 -0.58287 0.930812 -1.99689 P
## 29 1 1.06114 0.512845 -0.96299 N
## 30 1 0.75882 -0.544033 -0.87342 Z
## 31 0 0.58825 -0.537684 0.27048 H
## 32 0 -0.43292 0.762805 -0.18115 V
## 33 0 -0.09822 0.144783 -1.51522 O
## 34 1 1.38690 0.202230 -0.92736 S
## 35 0 -1.31040 -1.456198 2.06663 W
## 36 0 -0.67905 1.053445 0.11093 G
## 37 0 1.20022 -0.397525 0.10330 Q
## 38 1 0.99828 0.810732 -0.43627 I
## 39 1 -0.55813 0.300040 0.82089 C
## 40 1 0.19107 -0.732265 1.64319 L
## 41 1 -0.93658 -0.803333 0.65210 R
## 42 0 1.71818 -0.259426 1.72735 L
## 43 0 0.79274 -1.577459 -2.33531 Y
## 44 1 -0.17978 0.387909 -0.04763 T
## 45 0 -1.27127 -0.731157 -0.23587 J
## 46 1 0.36220 -1.182620 -1.58457 H
## 47 1 2.26727 1.503092 -1.20872 D
## 48 0 -0.56679 -1.205823 0.30645 Q
## 49 0 1.18184 0.274242 -0.25508 H
## 50 0 -0.43997 -1.203856 0.03733 Z
## 51 0 -0.21525 0.175392 1.54721 T
## 52 0 0.17862 2.041101 0.48442 Y
## 53 1 2.82008 1.209535 -0.67040 X
## 54 0 -0.02909 -0.379774 0.13640 X
## 55 1 -0.52543 -0.976383 -0.44816 W
## 56 1 0.92736 -0.066320 -1.38853 A
## 57 0 0.81235 -1.163808 0.02140 N
## 58 1 -1.63686 -0.670042 -0.55861 P
## 59 1 -1.45887 -0.257498 0.66978 K
## 60 1 0.36716 0.092494 -0.59397 M
## 61 1 0.50476 -1.691161 0.13602 U
## 62 1 -0.53350 -0.781128 0.39872 T
## 63 0 0.13419 -1.218642 0.43340 X
## 64 0 0.68213 -0.262076 -0.57323 U
## 65 0 -2.09181 1.600879 0.16202 L
## 66 1 -1.35759 0.271196 -1.45684 R
## 67 0 -0.64975 0.404372 -0.44506 V
## 68 0 -0.33656 -0.662692 0.20784 R
## 69 1 -1.19379 -1.547217 -1.40629 Y
## 70 1 0.48648 -1.117218 -0.12517 R
## 71 0 -1.03210 -0.369793 -0.74953 X
## 72 0 0.34542 0.494358 -1.19533 Z
## 73 1 0.41408 0.264469 -2.49834 O
## 74 1 -0.20288 -0.076575 0.29039 X
## 75 0 -0.18147 0.019607 -1.31953 K
## 76 0 -0.57495 0.778011 -2.20197 I
## 77 1 -1.69877 0.636596 -0.33592 L
## 78 1 -2.07330 1.766734 2.43636 C
## 79 0 0.29462 -0.991969 -0.66017 B
## 80 1 0.29372 -0.573212 0.46335 C
## 81 0 0.85411 -0.371477 -0.06186 W
## 82 1 0.70678 0.274230 0.14330 K
## 83 0 -0.86584 0.313496 -0.82688 W
## 84 1 0.84311 -1.478058 0.25956 S
## 85 0 -1.11050 -0.501903 -2.30398 H
## 86 1 0.23547 2.010354 -0.88391 R
## 87 1 0.04245 -0.928369 -0.75509 U
## 88 0 1.09768 -1.806275 -0.64789 B
## 89 0 -0.85865 1.339204 0.42920 W
## 90 1 0.49483 1.133309 0.51501 W
## 91 0 -2.17343 -1.207055 -0.43024 D
## 92 0 1.56411 0.560760 1.52356 Y
## 93 1 0.23590 -1.444402 -0.48720 Y
## 94 0 -0.58226 -0.188818 -0.26365 W
## 95 1 0.33818 -0.462813 -0.65003 P
## 96 1 -0.25738 1.953699 1.68336 O
## 97 0 1.15532 -0.168700 -0.48666 S
## 98 0 -0.88605 -0.596704 -0.39284 D
## 99 0 1.03949 0.944495 0.02210 K
## 100 0 0.47307 -0.616859 0.72329 M
This output is fine but kind of inconvenient. It doesn't fit on one screen, we can't modify anything, and - if we had more variables and/or more observations - it would be pretty difficult to do anything useful with the data in this way.
Note: Calling the dataframe by name is the same as print
-ing it. So mydf
is the same as print(mydf)
.
As we already know, we can use summary
to see a more compact version of the dataframe:
summary(mydf)
## a b c d
## Min. :0.0 Min. :-2.6425 Min. :-3.409 Min. :-2.498
## 1st Qu.:0.0 1st Qu.:-0.6506 1st Qu.:-0.731 1st Qu.:-0.876
## Median :0.0 Median : 0.0067 Median :-0.123 Median :-0.259
## Mean :0.4 Mean : 0.0081 Mean :-0.072 Mean :-0.231
## 3rd Qu.:1.0 3rd Qu.: 0.7650 3rd Qu.: 0.565 3rd Qu.: 0.406
## Max. :1.0 Max. : 2.8201 Max. : 2.041 Max. : 2.599
##
## e
## W : 8
## Y : 7
## R : 6
## Z : 6
## K : 5
## S : 5
## (Other):63
Now, instead of all the data, we see a compact numerical summary (the five-number summary plus the mean) for numeric or integer variables and a tabulation of mydf$e
, which is a factor variable (you can confirm this with class(mydf$e)
).
We can also use str
to see a different kind of compact summary:
str(mydf)
## 'data.frame': 100 obs. of 5 variables:
## $ a: int 0 0 0 0 0 0 0 1 0 0 ...
## $ b: num -0.653 -1.561 -0.883 -0.648 -0.949 ...
## $ c: num 1.617 1.374 0.561 1.414 -0.964 ...
## $ d: num 0.359 1.161 1.508 -1.338 -2.297 ...
## $ e: Factor w/ 26 levels "A","B","C","D",..: 15 21 25 14 5 6 23 20 26 11 ...
This output has the advantage of additionally showing variable classes and the first few values of each variable, but doesn't provide a numeric summary of the data. Thus summary
and str
complement each other rather than provide duplicate information.
Remember, too, that dataframes also carry a “names” attribute, so we can see just the names of our variables using:
names(mydf)
## [1] "a" "b" "c" "d" "e"
This is very important for when a dataframe is very wide (i.e., has a large number of variables) because even the compact output of summary
and str
can become unwieldy with more than 20 or so variables.
Two frequently neglected functions in R are head
and tail
. These offer exactly what their names suggest, the top and bottom few values of an object:
head(mydf)
## a b c d e
## 1 0 -0.6530 1.6173 0.3595 O
## 2 0 -1.5607 1.3743 1.1615 U
## 3 0 -0.8826 0.5611 1.5084 Y
## 4 0 -0.6475 1.4142 -1.3376 N
## 5 0 -0.9492 -0.9640 -2.2966 E
## 6 0 1.1269 -0.6164 -1.9685 F
Note the similarity between these values and those reported in str(mydf)
.
tail(mydf)
## a b c d e
## 95 1 0.3382 -0.4628 -0.6500 P
## 96 1 -0.2574 1.9537 1.6834 O
## 97 0 1.1553 -0.1687 -0.4867 S
## 98 0 -0.8861 -0.5967 -0.3928 D
## 99 0 1.0395 0.9445 0.0221 K
## 100 0 0.4731 -0.6169 0.7233 M
Both head
and tail
accept an additional argument referring to how many values to display:
head(mydf, 2)
## a b c d e
## 1 0 -0.653 1.617 0.3595 O
## 2 0 -1.561 1.374 1.1615 U
head(mydf, 15)
## a b c d e
## 1 0 -0.65302 1.617287 0.3595 O
## 2 0 -1.56067 1.374269 1.1615 U
## 3 0 -0.88265 0.561109 1.5084 Y
## 4 0 -0.64753 1.414186 -1.3376 N
## 5 0 -0.94923 -0.964017 -2.2966 E
## 6 0 1.12688 -0.616431 -1.9685 F
## 7 0 1.72761 0.008532 -0.7382 W
## 8 1 -0.29763 1.572682 -0.1963 T
## 9 0 -0.24442 0.053971 2.5985 Z
## 10 0 -0.84921 -0.189399 -1.1335 K
## 11 0 0.11510 -0.043527 -1.8962 B
## 12 1 0.70786 0.024526 -1.0832 S
## 13 0 -0.92021 -3.408887 -0.7029 E
## 14 0 1.13397 -0.029900 0.5554 V
## 15 0 0.04453 0.373467 -0.6180 S
These functions are therefore very helpful for looking quickly at a dataframe. They can also be applied to individual variables inside of a dataframe:
head(mydf$a)
## [1] 0 0 0 0 0 0
tail(mydf$e)
## [1] P O S D K M
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
R provides two ways to edit an R dataframe (or matrix) in a spreadsheet-like fashion. They look the same, but they are different! Both can be used to look at data in a spreadsheet-like way, but editing with them produces drastically different results.
Note: One point of confusion is that calling edit
or fix
on a non-dataframe object opens a completely different text editing window that can be used to modify vectors, functions, etc. If you try to edit
or fix
something and don't see a spreadsheet, the object you're trying to edit is not rectangular (i.e., not a dataframe or matrix).
The first of these is edit
, which opens an R dataframe as a spreadsheet. The data can then be directly edited.
When the spreadsheet window is closed, the resulting dataframe is returned to the user (and printed to the console).
The printed result is a reminder that edit didn't actually change the mydf
object. In other words, when we edit
a dataframe, we are actually copying the dataframe, changing its values, and then returning it to the console. The original mydf
is unchanged. If we want to use this modified dataframe, we need to save it as a new R object.
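For example (a sketch; mydf2 is a hypothetical name), to keep the edited copy we would assign the result of edit:
mydf2 <- edit(mydf)   # opens the spreadsheet editor; the edited copy is saved as mydf2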
The second data editing function is fix
. This is probably the more intuitive function.
Like edit
, fix
opens the spreadsheet editor. But, when the window is closed, the result is used to replace the dataframe. Thus fix(mydf)
replaces mydf
with the edited data.
edit
and fix
can seem like a good idea. And if they are used simply to look at data, they're a great additional tool (along with summary
, str
, head
, tail
, and indexing).
But (!!!!) using edit
and fix
is a non-reproducible way of conducting data analysis. If we want to replace values in a dataframe, it is better (from the perspective of reproducible science) to write out the code to perform those replacements so that you or someone else can run it in the future and achieve the same results. So, in short, use edit
and fix
, but don't abuse them.
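For example, instead of fixing a stray value by hand in the spreadsheet editor, a reproducible alternative (a sketch; the recoding rule here is purely hypothetical) is to write the replacement as code:
mydf_clean <- mydf                      # hypothetical copy, so the original is untouched
mydf_clean$a[mydf_clean$a == 0] <- NA   # hypothetical recode: treat zeros in a as missing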
When it comes to performing statistical analysis in R, the most important object type is a dataframe. When we load data into R or use R to conduct statistical tests or build models, we want to have our data as a dataframe. A dataframe is actually a special type of list that has some properties that facilitate using it for data analysis.
To create a dataframe, we use the data.frame
function:
a <- data.frame(1:3)
a
## X1.3
## 1 1
## 2 2
## 3 3
This example is a single vector coerced into being a dataframe.
Our input vector 1:3
is printed as a column and the dataframe has row names:
rownames(a)
## [1] "1" "2" "3"
And the vector has been automatically given a column name:
colnames(a)
## [1] "X1.3"
Note: We can also see the column names of a dataframe using names
:
names(a)
## [1] "X1.3"
Like a matrix, we can see that this dataframe has dimensions:
dim(a)
## [1] 3 1
Which we can observe as row and column dimensions:
nrow(a)
## [1] 3
ncol(a)
## [1] 1
But having a dataframe consisting of one column vector isn't very helpful. In general we want to have multiple columns, where each column is a variable and each row is an observation.
b <- data.frame(1:3, 4:6)
b
## X1.3 X4.6
## 1 1 4
## 2 2 5
## 3 3 6
You can see the similarity to building a list and indeed if we check whether our dataframe is a list, it is:
is.data.frame(b)
## [1] TRUE
is.list(b)
## [1] TRUE
Our new dataframe b
now has two column variables and the same number of rows.
The names of the dataframe are assigned automatically, but we can change them:
names(b)
## [1] "X1.3" "X4.6"
names(b) <- c("var1", "var2")
names(b)
## [1] "var1" "var2"
We can also assign names when we create a dataframe, just as we did with a list:
d <- data.frame(var1 = 1:3, var2 = 4:6)
names(d)
## [1] "var1" "var2"
d
## var1 var2
## 1 1 4
## 2 2 5
## 3 3 6
Indexing dataframes works similarly to both lists and matrices. Even though our dataframe isn't a matrix:
is.matrix(d)
## [1] FALSE
We can still index it in two dimensions like a matrix to extract rows, columns, or elements:
d[1, ] #' row
## var1 var2
## 1 1 4
d[, 2] #' column
## [1] 4 5 6
d[3, 2] #' element
## [1] 6
Because dataframes are actually lists, we can index them just like we would a list: For example, to get a dataframe containing only our first column variable, we can use single brackets:
d[1]
## var1
## 1 1
## 2 2
## 3 3
The same result is possible with named indexing:
d["var1"]
## var1
## 1 1
## 2 2
## 3 3
To get that column variable as a vector instead of a one-column dataframe, we can use double brackets:
d[[1]]
## [1] 1 2 3
And we can also use named indexing as we would in a list:
d[["var1"]]
## [1] 1 2 3
d$var1
## [1] 1 2 3
And, we can combine indexing like we did with a list to get the elements of a column vector:
d[["var2"]][3]
## [1] 6
d$var2[3]
## [1] 6
We can also use -
indexing to exclude columns:
d[, -1]
## [1] 4 5 6
or rows:
d[-2, ]
## var1 var2
## 1 1 4
## 3 3 6
Thus, it is very easy to extract different parts of a dataframe in different ways, depending on what we want to do.
With those indexing rules, it is also very easy to change dataframe elements. For example, to add a column variable, we just need to add a vector with a name:
d$var3 <- 7:9
d
## var1 var2 var3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
If we try to add a vector that is shorter than the number of dataframe rows, recycling is invoked (the shorter vector is repeated to fill every row):
d$var4 <- 1
d
## var1 var2 var3 var4
## 1 1 4 7 1
## 2 2 5 8 1
## 3 3 6 9 1
If we try to add a vector that is longer than the number of dataframe rows, we get an error:
d$var4 <- 1:4
## Error: replacement has 4 rows, data has 3
So even though a dataframe is like a list, it has the restriction that all columns must have the same length.
We can also remove dataframe columns by setting them equal to NULL:
d
## var1 var2 var3 var4
## 1 1 4 7 1
## 2 2 5 8 1
## 3 3 6 9 1
d$var4 <- NULL
d
## var1 var2 var3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
This permanently removes the column variable from the dataframe and reduces its dimensions. To remove rows, you simply use positional indexing as described above and assign the result back to the same name:
d
## var1 var2 var3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
d[-2, ]
## var1 var2 var3
## 1 1 4 7
## 3 3 6 9
d <- d[-2, ]
d
## var1 var2 var3
## 1 1 4 7
## 3 3 6 9
This highlights an important point. Unless we assign using <-
, we are not modifying the dataframe, only changing what is displayed.
If we want to preserve a dataframe and a modified version of it, we can simply assign the modified version a new name:
d
## var1 var2 var3
## 1 1 4 7
## 3 3 6 9
d2 <- d[, -1]
This leaves our original dataframe unchanged:
d
## var1 var2 var3
## 1 1 4 7
## 3 3 6 9
And gives us a new object reflecting the modified dataframe:
d2
## var2 var3
## 1 4 7
## 3 6 9
Another similarity between dataframes and matrices is that we can bind them columnwise:
e1 <- data.frame(1:3, 4:6)
e2 <- data.frame(7:9, 10:12)
cbind(e1, e2)
## X1.3 X4.6 X7.9 X10.12
## 1 1 4 7 10
## 2 2 5 8 11
## 3 3 6 9 12
To bind them rowwise, however, the two dataframes need to have matching names
:
names(e1) <- names(e2) <- c("Var1", "Var2")
rbind(e1, e2)
## Var1 Var2
## 1 1 4
## 2 2 5
## 3 3 6
## 4 7 10
## 5 8 11
## 6 9 12
Dataframes can also be combined using the merge
function.
merge
is powerful, but can also be confusing.
Let's imagine that our two dataframes contain observations for the same three individuals, but in different orders:
e1$id <- 1:3
e2$id <- c(2, 1, 3)
We should also rename the variables in e2
to show that these are unique variables:
names(e2)[1:2] <- c("Var3", "Var4")
If we use cbind
to combine the data, variables from observations in the two dataframes will be mismatched:
cbind(e1, e2)
## Var1 Var2 id Var3 Var4 id
## 1 1 4 1 7 10 2
## 2 2 5 2 8 11 1
## 3 3 6 3 9 12 3
This is where merge comes in handy because we can specify a by
parameter:
e3 <- merge(e1, e2, by = "id")
The result is a single dataframe, with a single id
variable, and observations from the two dataframes are matched appropriately.
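To see what the merge produced, we can print e3 (the values here follow directly from e1 and e2, and match the merge outputs shown below):
e3
##   id Var1 Var2 Var3 Var4
## 1  1    1    4    8   11
## 2  2    2    5    7   10
## 3  3    3    6    9   12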
That was a simple example, but what if our dataframes have different (but overlapping) sets of observations?
e4 <- data.frame(Var5 = 10:1, Var6 = c(5:1, 1:5), id = c(1:2, 4:11))
e4
## Var5 Var6 id
## 1 10 5 1
## 2 9 4 2
## 3 8 3 4
## 4 7 2 5
## 5 6 1 6
## 6 5 1 7
## 7 4 2 8
## 8 3 3 9
## 9 2 4 10
## 10 1 5 11
This new dataframe e4
has two observations common to the previous dataframes (1 and 2) but no observation for 3.
If we merge e3
and e4
, what do we get?
merge(e3, e4, by = "id")
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
The result is all variables (columns) for the two common observations (1 and 2). If we want to include observation 3, we can use:
merge(e3, e4, by = "id", all.x = TRUE)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
## 3 3 3 6 9 12 NA NA
Note: The all.x
argument refers to which observations from the first dataframe (e3
) we want to preserve.
If we want to include observations 4 to 11, we can use:
merge(e3, e4, by = "id", all.y = TRUE)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
## 3 4 NA NA NA NA 8 3
## 4 5 NA NA NA NA 7 2
## 5 6 NA NA NA NA 6 1
## 6 7 NA NA NA NA 5 1
## 7 8 NA NA NA NA 4 2
## 8 9 NA NA NA NA 3 3
## 9 10 NA NA NA NA 2 4
## 10 11 NA NA NA NA 1 5
Note: The all.y
argument refers to which observations from the second dataframe (e4
) we want to preserve.
Of course, we can preserve both with either:
merge(e3, e4, by = "id", all.x = TRUE, all.y = TRUE)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
## 3 3 3 6 9 12 NA NA
## 4 4 NA NA NA NA 8 3
## 5 5 NA NA NA NA 7 2
## 6 6 NA NA NA NA 6 1
## 7 7 NA NA NA NA 5 1
## 8 8 NA NA NA NA 4 2
## 9 9 NA NA NA NA 3 3
## 10 10 NA NA NA NA 2 4
## 11 11 NA NA NA NA 1 5
merge(e3, e4, by = "id", all = TRUE)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
## 3 3 3 6 9 12 NA NA
## 4 4 NA NA NA NA 8 3
## 5 5 NA NA NA NA 7 2
## 6 6 NA NA NA NA 6 1
## 7 7 NA NA NA NA 5 1
## 8 8 NA NA NA NA 4 2
## 9 9 NA NA NA NA 3 3
## 10 10 NA NA NA NA 2 4
## 11 11 NA NA NA NA 1 5
These two R statements are equivalent.
Note: If we set by=NULL
, we get a potentially unexpected result:
merge(e3, e4, by = NULL)
## id.x Var1 Var2 Var3 Var4 Var5 Var6 id.y
## 1 1 1 4 8 11 10 5 1
## 2 2 2 5 7 10 10 5 1
## 3 3 3 6 9 12 10 5 1
## 4 1 1 4 8 11 9 4 2
## 5 2 2 5 7 10 9 4 2
## 6 3 3 6 9 12 9 4 2
## 7 1 1 4 8 11 8 3 4
## 8 2 2 5 7 10 8 3 4
## 9 3 3 6 9 12 8 3 4
## 10 1 1 4 8 11 7 2 5
## 11 2 2 5 7 10 7 2 5
## 12 3 3 6 9 12 7 2 5
## 13 1 1 4 8 11 6 1 6
## 14 2 2 5 7 10 6 1 6
## 15 3 3 6 9 12 6 1 6
## 16 1 1 4 8 11 5 1 7
## 17 2 2 5 7 10 5 1 7
## 18 3 3 6 9 12 5 1 7
## 19 1 1 4 8 11 4 2 8
## 20 2 2 5 7 10 4 2 8
## 21 3 3 6 9 12 4 2 8
## 22 1 1 4 8 11 3 3 9
## 23 2 2 5 7 10 3 3 9
## 24 3 3 6 9 12 3 3 9
## 25 1 1 4 8 11 2 4 10
## 26 2 2 5 7 10 2 4 10
## 27 3 3 6 9 12 2 4 10
## 28 1 1 4 8 11 1 5 11
## 29 2 2 5 7 10 1 5 11
## 30 3 3 6 9 12 1 5 11
Setting by = NULL merges without a key, so we get every pairwise combination of rows (the Cartesian product of the two dataframes). If we instead leave by unspecified, the default is to merge on the variable names common to both dataframes.
We can also separately specify by
for each dataframe:
merge(e3, e4, by.x = "id", by.y = "id")
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
This would be helpful if the identifier variable had a different name in each dataframe.
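For instance (a hypothetical sketch; e5 is just a copy of e4 with the identifier renamed), we could merge on differently named key columns like this:
e5 <- e4
names(e5)[names(e5) == "id"] <- "ID"   # hypothetical copy whose identifier is called ID
merge(e3, e5, by.x = "id", by.y = "ID")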
Note: merge
only works with two dataframes. So, if multiple dataframes need to be merged, it must be done sequentially:
merge(merge(e1, e2), e4)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
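If several dataframes need to be merged, nesting calls quickly becomes awkward; one common pattern (a sketch using base R's Reduce) applies merge sequentially over a list of dataframes:
# merge a list of dataframes one pair at a time, keeping all observations
Reduce(function(x, y) merge(x, y, all = TRUE), list(e1, e2, e4))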
This tutorial walks through some basics for how to export results to Word format files or similar.
install.packages(c("rtf"), repos = "http://cran.r-project.org")
## Warning: package 'rtf' is in use and will not be installed
As a running example, let's build a regression model, whose coefficients we want to output:
set.seed(1)
x1 <- runif(100, 0, 1)
x2 <- rbinom(100, 1, 0.5)
y <- x1 + x2 + rnorm(100)
s1 <- summary(lm(y ~ x1))
s2 <- summary(lm(y ~ x1 + x2))
One of the easiest ways to move results from R to Word is simply to copy and paste them. R results are printed in ASCII, though, so the results don't necessarily copy well (e.g., tables tend to lose their formatting unless they're pasted into the Word document using a fixed-width font like Courier).
For example, we could just print coef(s2) to the console and manually copy and paste the results:
round(coef(s2), 2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.07 0.24 0.28 0.78
## x1 0.92 0.36 2.52 0.01
## x2 0.89 0.19 4.57 0.00
Another, perhaps easier, alternative is to write the results to the console in comma-separated value (CSV) format:
write.csv(round(coef(s2), 2))
## "","Estimate","Std. Error","t value","Pr(>|t|)"
## "(Intercept)",0.07,0.24,0.28,0.78
## "x1",0.92,0.36,2.52,0.01
## "x2",0.89,0.19,4.57,0
The output doesn't look very pretty in R and it won't look very pretty in Word, either, at least not right away. If you copy the CSV output and paste it into a Word document, it will look like a mess. But, if you select the pasted text, click the “Insert” menu, and press “Table”, a menu will open, one option of which is “Convert text to table…”. Clicking this option, choosing to “Separate text at” commas, and then pressing OK will produce a nicely formatted table resembling the original R output.
As long as you can convert an R object (or set of R objects) to a table-like structure, you can use write.csv
and follow the instructions above to easily move that object into Word.
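You can also skip the copy-and-paste step entirely by writing the same table to a file (a sketch; the filename is hypothetical) and then importing or opening that file from Word or Excel:
write.csv(round(coef(s2), 2), file = "coefficients.csv")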
Thus the biggest challenge for writing output to Word format is not the actual output, but the work of building a table-like structure that can easily be output. For example, let's build a nicer-looking results table that includes summary statistics and both of our regression models. First, we'll get all of our relevant statistics and then bind them together into a table:
r1 <- round(coef(s1), 2)
r2 <- round(coef(s2), 2)
# coefficients
c1 <- paste(r1[, 1], " (", r1[, 2], ")", sep = "")
c2 <- paste(r2[, 1], " (", r2[, 2], ")", sep = "")
# summary statistics
sigma <- round(c(s1$sigma, s2$sigma), 2)
rsq <- round(c(s1$adj.r.squared, s2$adj.r.squared), 2)
# sample sizes
n <- c(length(s1$residuals), length(s2$residuals))
Now let's bind this all together into a table and look at the resulting table:
outtab <- rbind(cbind(c(c1, ""), c2), sigma, rsq, n)
colnames(outtab) <- c("Model 1", "Model 2")
rownames(outtab) <- c("Intercept", "x1", "x2", "sigma", "Adj. R-Squared", "n")
outtab
## Model 1 Model 2
## Intercept "0.61 (0.23)" "0.07 (0.24)"
## x1 "0.79 (0.4)" "0.92 (0.36)"
## x2 "" "0.89 (0.19)"
## sigma "1.06" "0.97"
## Adj. R-Squared "0.03" "0.19"
## n "100" "100"
Then we can just write it to the console and follow the directions above to copy it to a nice table in Word:
write.csv(outtab)
## "","Model 1","Model 2"
## "Intercept","0.61 (0.23)","0.07 (0.24)"
## "x1","0.79 (0.4)","0.92 (0.36)"
## "x2","","0.89 (0.19)"
## "sigma","1.06","0.97"
## "Adj. R-Squared","0.03","0.19"
## "n","100","100"
Another way to output results directly from R to Word is to use the rtf package. This package is designed to write Rich-Text Format (RTF) files, but can also be used to write Word files. It's actually very simple to use. You simply need to have the package create a Word (.doc) or RTF file to write to, then you can add plain text paragraphs or anything that can be structured as a dataframe directly to the file. You can then open the file directly with Word, finding the resulting text, tables, etc. neatly embedded. A basic example pasting our regression coefficient table and the nicer looking table is shown below:
library(rtf)
rtffile <- RTF("rtf.doc") # this can be an .rtf or a .doc
addParagraph(rtffile, "This is the output of a regression coefficients:\n")
addTable(rtffile, as.data.frame(round(coef(s2), 2)))
addParagraph(rtffile, "\n\nThis is the nicer looking table we made above:\n")
addTable(rtffile, cbind(rownames(outtab), outtab))
done(rtffile)
You can then find the rtf.doc
file in your working directory. Open it to take a look at the results.
The rtf package also allows you to specify additional options about fonts and the like, making it possible to write a considerable amount of your results directly from R. See ? rtf
for full details.
To extract the (unique) levels of a factor, use levels
:
levels(factor(c(1, 2, 3, 2, 3, 2, 3)))
## [1] "1" "2" "3"
Note: the levels of a factor are always character:
class(levels(factor(c(1, 2, 3, 2, 3, 2, 3))))
## [1] "character"
To obtain just the number of levels, use nlevels
:
nlevels(factor(c(1, 2, 3, 2, 3, 2, 3)))
## [1] 3
If the factor contains only integers, we can use unclass
to convert it (back) to an integer class vector:
unclass(factor(c(1, 2, 3, 2, 3, 2, 3)))
## [1] 1 2 3 2 3 2 3
## attr(,"levels")
## [1] "1" "2" "3"
Note: The “levels” attribute is still being reported but the new object is not a factor.
But if the factor contains other numeric values, we can get unexpected results:
unclass(factor(c(1, 2, 1.5)))
## [1] 1 3 2
## attr(,"levels")
## [1] "1" "1.5" "2"
We might have expected this to produce a numeric vector of the form c(1, 2, 1.5). Instead, we have obtained an integer vector of the form c(1, 3, 2). This is because the underlying integer codes reflect each value's position in the sorted set of factor levels, not the values themselves.
We can see this at work if we unclass a factor that was created from a character vector:
unclass(factor(c("a", "b", "a")))
## [1] 1 2 1
## attr(,"levels")
## [1] "a" "b"
The result is an integer vector: c(1, 2, 1).
This can be especially confusing if we create a factor from a combination of numeric and character elements:
unclass(factor(c("a", "b", 1, 2)))
## [1] 3 4 1 2
## attr(,"levels")
## [1] "1" "2" "a" "b"
The result is an integer vector, c(3,4,1,2)
, which we can see in several steps:
(1) the numeric values are coerced to character
c("a", "b", 1, 2)
## [1] "a" "b" "1" "2"
(2) the levels of the factor are sorted numerically then alphabetically
factor(c("a", "b", 1, 2))
## [1] a b 1 2
## Levels: 1 2 a b
(3) the result is thus an integer vector, numbered according to the order of the factor levels
unclass(factor(c("a", "b", 1, 2)))
## [1] 3 4 1 2
## attr(,"levels")
## [1] "1" "2" "a" "b"
Changing factors is similar to changing other types of data, but it has some unique challenges. We can see this if we compare a numeric vector to a factor version of the same data:
a <- 1:4
b <- factor(a)
a
## [1] 1 2 3 4
b
## [1] 1 2 3 4
## Levels: 1 2 3 4
We can see in the way that the two variables are printed that the numeric vector and the factor look different. This is also true if we use indexing to see a subset of the vector:
a[1]
## [1] 1
b[1]
## [1] 1
## Levels: 1 2 3 4
If we try to change the value of an item in the numeric vector using positional indexing, there's no problem:
a[1] <- 5
a
## [1] 5 2 3 4
If we try to do the same thing with the factor, we get a warning:
b[1] <- 5
## Warning: invalid factor level, NA generated
b
## [1] <NA> 2 3 4
## Levels: 1 2 3 4
And the result isn't what we wanted. We get a missing value.
This is because 5 wasn't a valid level of our factor.
Let's restore our b
variable:
b <- factor(1:4)
Then we can add 5 to the levels by simply replacing the current levels with a vector of the current levels and 5:
levels(b) <- c(levels(b), 5)
Our variable hasn't changed, but its available levels have:
b
## [1] 1 2 3 4
## Levels: 1 2 3 4 5
Now we can change the value using positional indexing, just like before:
b[1] <- 5
b
## [1] 5 2 3 4
## Levels: 1 2 3 4 5
And we get the intended result. This can be quite useful if we want to change the label for all values at a given level. To see this, we need a vector containing repeated values:
c <- factor(c(1:4, 1:3, 1:2, 1))
c
## [1] 1 2 3 4 1 2 3 1 2 1
## Levels: 1 2 3 4
There are four levels to c
:
levels(c)
## [1] "1" "2" "3" "4"
If we want to change c
so that every 2 is now a 5, we can just change the appropriate level. This is easy here because 2 is the second level, but we'll see a different approach below:
levels(c)[2]
## [1] "2"
levels(c)[2] <- 5
levels(c)[2]
## [1] "5"
c
## [1] 1 5 3 4 1 5 3 1 5 1
## Levels: 1 5 3 4
Now c contains 5's in place of all of the 2's. But our replacement involved positional indexing, and the second factor level isn't always equal to the number 2 - it depends on the data we have. So we can also replace factor levels using logicals (e.g., to change 5 to 9):
levels(c) == "5"
## [1] FALSE TRUE FALSE FALSE
levels(c)[levels(c) == "5"]
## [1] "5"
levels(c)[levels(c) == "5"] <- 9
levels(c)
## [1] "1" "9" "3" "4"
c
## [1] 1 9 3 4 1 9 3 1 9 1
## Levels: 1 9 3 4
As you can see, factors are a potentially useful way of storing different kinds of data, and R uses them a lot!
We often need to analyze data that fails to satisfy assumptions of the statistical techniques we use. One common violation of assumptions in OLS regression is the assumption of homoskedasticity. This assumption requires that the error term have constant variance across all values of the independent variable(s). When this assumption fails, the standard errors from our OLS regression estimates are inconsistent. But, we can calculate heteroskedasticity-consistent standard errors, relatively easily. Unlike in Stata, where this is simply an option for regular OLS regression, in R, these SEs are not built into the base package, but instead come in an add-on package called sandwich, which we need to install and load:
install.packages("sandwich", repos = "http://cran.r-project.org")
## Warning: package 'sandwich' is in use and will not be installed
library(sandwich)
To see the sandwich package in action, let's generate some heteroskedastic data:
set.seed(1)
x <- runif(500, 0, 1)
y <- 5 * rnorm(500, x, x)
A simple plot of y
against x
(and the associated regression line) will reveal any heteroskedasticity:
plot(y ~ x, col = "gray", pch = 19)
abline(lm(y ~ x), col = "blue")
Clearly, the variance of y
and thus of the error term in an OLS model of y~x
will increase as x
increases.
Now let's run the OLS model and see the results:
ols <- lm(y ~ x)
s <- summary(ols)
s
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.17 -1.27 -0.15 1.31 9.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.259 0.273 0.95 0.34
## x 4.241 0.479 8.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared: 0.136, Adjusted R-squared: 0.134
## F-statistic: 78.5 on 1 and 498 DF, p-value: <2e-16
It may be particularly helpful to look just as the coefficient matrix from the summary object:
s$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2592 0.2732 0.9488 3.432e-01
## x 4.2414 0.4786 8.8615 1.402e-17
The second column shows the SEs. These SEs are themselves generated from the variance-covariance matrix for the coefficients, which we can see with:
vcov(ols)
## (Intercept) x
## (Intercept) 0.07463 -0.1135
## x -0.11355 0.2291
The variance estimates for the coefficients are on the diagonal:
diag(vcov(ols))
## (Intercept) x
## 0.07463 0.22909
To convert these to SEs, we simply take the square root:
sqrt(diag(vcov(ols)))
## (Intercept) x
## 0.2732 0.4786
Now that we know where the regular SEs are coming from, let's get the heteroskedasticity-consistent SEs for this model from sandwich.
The SEs come from the vcovHC
function and the resulting object is the variance-covariance matrix for the coefficients:
vcovHC(ols)
## (Intercept) x
## (Intercept) 0.03335 -0.08751
## x -0.08751 0.29242
This is, again, a variance-covariance matrix for the coefficients. So to get SEs, we take the square root of the diagonal, like we did above:
sqrt(diag(vcovHC(ols)))
## (Intercept) x
## 0.1826 0.5408
We can then compare the SE estimate from the standard formula to the heteroskedasticity-consistent formula:
sqrt(diag(vcov(ols)))
## (Intercept) x
## 0.2732 0.4786
sqrt(diag(vcovHC(ols)))
## (Intercept) x
## 0.1826 0.5408
One annoying thing about not having the heteroskedasticity-consistent formula built-in is that when we call summary
on ols
, it prints the default SEs rather than the ones we really want.
But, remember, everything in R is an object. So, we can overwrite the default SEs with the heteroskedasticity-consistent SEs quite easily. To do that, let's first look at the structure of our summary object s
:
str(s)
## List of 11
## $ call : language lm(formula = y ~ x)
## $ terms :Classes 'terms', 'formula' length 3 y ~ x
## .. ..- attr(*, "variables")= language list(y, x)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. ..$ : chr "x"
## .. ..- attr(*, "term.labels")= chr "x"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: 0x000000001c3d67b0>
## .. ..- attr(*, "predvars")= language list(y, x)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## $ residuals : Named num [1:500] 0.1231 0.7807 -0.0241 -0.6949 0.5952 ...
## ..- attr(*, "names")= chr [1:500] "1" "2" "3" "4" ...
## $ coefficients : num [1:2, 1:4] 0.259 4.241 0.273 0.479 0.949 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "x"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
## $ sigma : num 3.03
## $ df : int [1:3] 2 498 2
## $ r.squared : num 0.136
## $ adj.r.squared: num 0.134
## $ fstatistic : Named num [1:3] 78.5 1 498
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:2, 1:2] 0.00813 -0.01237 -0.01237 0.02497
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "x"
## .. ..$ : chr [1:2] "(Intercept)" "x"
## - attr(*, "class")= chr "summary.lm"
s
is a list, one element of which is coefficients
(which we saw above when we first ran our OLS model). The s$coefficients
object is a matrix, with four columns, the second of which contains the default standard errors. If we replace those standard errors with the heteroskedasticity-robust SEs, when we print s
in the future, it will show the SEs we actually want. Let's see the effect by comparing the current output of s
to the output after we replace the SEs:
s
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.17 -1.27 -0.15 1.31 9.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.259 0.273 0.95 0.34
## x 4.241 0.479 8.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared: 0.136, Adjusted R-squared: 0.134
## F-statistic: 78.5 on 1 and 498 DF, p-value: <2e-16
s$coefficients[, 2] <- sqrt(diag(vcovHC(ols)))
s
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.17 -1.27 -0.15 1.31 9.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.259 0.183 0.95 0.34
## x 4.241 0.541 8.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared: 0.136, Adjusted R-squared: 0.134
## F-statistic: 78.5 on 1 and 498 DF, p-value: <2e-16
The summary output now reflects the correct SEs. But remember, if we call summary(ols)
, again, we'll see the original SEs. We need to call our s
object to see the updated version.
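As an aside, if the lmtest package is installed (an assumption; it is not used elsewhere in this tutorial), its coeftest function produces a coefficient table using the sandwich variance estimate directly, without having to overwrite the summary object:
library(lmtest)                     # assumed to be installed; not part of base R
coeftest(ols, vcov = vcovHC(ols))   # coefficient table with heteroskedasticity-consistent SEs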
When we have continuous-by-continuous interactions in linear regression, it is impossible to directly interpret the coefficients on the interactions. In fact, it is just generally difficult to interpret these kinds of models. Often, a better approach is to translate one of the continuous variables into a factor and interpret the interaction-term coefficients on each level of that variable. Another approach is to visualize the interaction graphically. Both will give us the same inference.
Note: While interaction plots can help make effects interpretable, one of their major downsides is an inability to effectively convey statistical uncertainty. For this reason (and some of the other disadvantages that will become clear below), I would recommend these plots only for data summary but not for inference or prediction, or publication.
Let's start with some fake data:
set.seed(1)
x1 <- runif(100, 0, 1)
x2 <- sample(1:10, 100, TRUE)/10
y <- 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 + rnorm(100)
We've built a model that has a strong interaction between x1
and x2
. We can model this as a continuous interaction:
m <- lm(y ~ x1 * x2)
Alternatively, we can treat x2
as a factor (because, while approximately continuous, it only takes on 10 discrete values):
m2 <- lm(y ~ x1 * factor(x2))
Let's look at the output of both models and see if we can make sense of them:
summary(m)
##
## Call:
## lm(formula = y ~ x1 * x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.916 -0.603 -0.109 0.580 2.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.616 0.541 2.99 0.0036 **
## x1 0.783 0.878 0.89 0.3748
## x2 1.937 0.865 2.24 0.0274 *
## x1:x2 5.965 1.370 4.35 3.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.962 on 96 degrees of freedom
## Multiple R-squared: 0.802, Adjusted R-squared: 0.795
## F-statistic: 129 on 3 and 96 DF, p-value: <2e-16
summary(m2)
##
## Call:
## lm(formula = y ~ x1 * factor(x2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2086 -0.5368 -0.0675 0.5007 2.3648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5007 5.4427 0.64 0.52
## x1 -2.3644 10.4793 -0.23 0.82
## factor(x2)0.2 -1.5294 5.4711 -0.28 0.78
## factor(x2)0.3 -1.8828 5.5500 -0.34 0.74
## factor(x2)0.4 -1.2069 5.4991 -0.22 0.83
## factor(x2)0.5 0.2490 5.5039 0.05 0.96
## factor(x2)0.6 -0.8439 5.4614 -0.15 0.88
## factor(x2)0.7 -0.7045 5.4917 -0.13 0.90
## factor(x2)0.8 -0.3576 5.4764 -0.07 0.95
## factor(x2)0.9 -0.0365 5.5302 -0.01 0.99
## factor(x2)1 -0.3488 5.5207 -0.06 0.95
## x1:factor(x2)0.2 4.6778 10.5179 0.44 0.66
## x1:factor(x2)0.3 5.5067 10.5979 0.52 0.60
## x1:factor(x2)0.4 5.8601 10.5629 0.55 0.58
## x1:factor(x2)0.5 4.4851 10.6169 0.42 0.67
## x1:factor(x2)0.6 6.2043 10.5215 0.59 0.56
## x1:factor(x2)0.7 7.8928 10.5514 0.75 0.46
## x1:factor(x2)0.8 8.4370 10.5690 0.80 0.43
## x1:factor(x2)0.9 7.9411 10.5961 0.75 0.46
## x1:factor(x2)1 9.6787 10.5511 0.92 0.36
##
## Residual standard error: 1.01 on 80 degrees of freedom
## Multiple R-squared: 0.819, Adjusted R-squared: 0.776
## F-statistic: 19.1 on 19 and 80 DF, p-value: <2e-16
For our continuous-by-continuous interaction model, we have the interaction expressed as a single number: ~5.96. This doesn't tell us anything useful because its only interpretation is the additional expected value of y
as an amount added to the intercept plus the coefficients on each covariate (but only for the point at which x1==1
and x2==1
). Thus, while we might be inclined to talk about this as an interaction term, it really isn't…it's just a mostly meaningless number.
In the second, continuous-by-factor model, things are more interpretable. Here our factor dummies for x2
tell us the expected value of y
(if added to the intercept) when x1==0
. Similarly, the factor-“interaction” dummies tell us the expected value y
(if added to the intercept and the coefficient on x1
) when x1==1
. These seem more interpretable.
Another approach to understanding continuous-by-continuous interaction terms is to plot them. We saw above that the continuous-by-factor model, while intretable, required a lot of numbers (in a large table) to communicate the relationships between x1
, x2
, and y
. R offers a number of plotting functions to visualize these kinds of interaction “response surfaces”.
Let's start by estimating predicted values. Because x1
and x2
are scaled [0,1], we'll just create a single vector of values on the 0-1 scale and use that for both of our prediction values.
nx <- seq(0, 1, length.out = 10)
The use of the outer
function here is, again, a convenience because our input values are scaled [0,1]. Essentially, it builds a 10-by-10 matrix of input values and predicts y
for each combination of x1
and x2
.
z <- outer(nx, nx, FUN = function(a, b) predict(m, data.frame(x1 = a, x2 = b)))
We can look at the z
matrix to see what is going on:
z
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1.616 1.831 2.046 2.262 2.477 2.692 2.907 3.122 3.338 3.553
## [2,] 1.703 1.992 2.281 2.570 2.858 3.147 3.436 3.725 4.014 4.303
## [3,] 1.790 2.152 2.515 2.877 3.240 3.602 3.965 4.327 4.690 5.052
## [4,] 1.877 2.313 2.749 3.185 3.621 4.058 4.494 4.930 5.366 5.802
## [5,] 1.964 2.474 2.983 3.493 4.003 4.513 5.022 5.532 6.042 6.552
## [6,] 2.051 2.634 3.218 3.801 4.384 4.968 5.551 6.135 6.718 7.301
## [7,] 2.138 2.795 3.452 4.109 4.766 5.423 6.080 6.737 7.394 8.051
## [8,] 2.225 2.955 3.686 4.417 5.147 5.878 6.609 7.339 8.070 8.801
## [9,] 2.312 3.116 3.920 4.725 5.529 6.333 7.138 7.942 8.746 9.550
## [10,] 2.399 3.277 4.155 5.033 5.910 6.788 7.666 8.544 9.422 10.300
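If outer feels opaque, an equivalent way to build the same matrix (a sketch using base R's expand.grid) is to predict over the full grid of x1 and x2 values and reshape the result:
nd <- expand.grid(x1 = nx, x2 = nx)                 # all combinations of x1 and x2
z_alt <- matrix(predict(m, nd), nrow = length(nx))  # rows index x1, columns index x2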
All of the resulting functions require us to use this z
matrix as the “height” of the plot at each combination of x1
and x2
. Sounds a little crazy, but it will become clear once we do the plotting.
A perspective plot draws a “response surface” (i.e., the values of the z
matrix) across a two-dimensional grid. The plot is what you might typically think of when you hear “three-dimensional graph”.
Let's take a look:
persp(nx, nx, z, theta = 45, phi = 10, shade = 0.75, xlab = "x1", ylab = "x2",
zlab = "y")
Note: The theta
parameter refers to the horizontal rotation of the plot and the phi
parameter refers to the tilt of the plot (see ?persp
).
The plot shows us many things, especially:
1. The vertical height of the surface is the expected (predicted) value of y
at each combination of x1
and x2
.
2. The slope of the surface on each edge of the plot is a marginal effect. In other words, the shallow slope on the lefthand face of the plot is the marginal effect of x1
when x2==0
. Similarly, the steep slope on the righthand face of the plot is the marginal effect of x2
when x1==1
. The other marginal effects (x1|x2==1
and x2|x1==0
) are hidden from our view on the back of the plot.
There are two problems with perspective plots:
1. Because they are two-dimensional representations of three-dimensional objects, their scales are deceiving. Clearly the “height” of the plot is bigger in the front than in the back. The plot is therefore only a heuristic.
2. Because they are three-dimensional, we cannot see the entire plot at once (as evidenced by the two hidden marginal effects discussed above).
There is nothing we can do about the first point, unless you want to use a 3D printer to print out the response surface. On the second point, however, we can look at different rotations of the plot in order to get a better grasp on the various marginal effects.
Let's look at two different sets of rotations. One showing four plots on the diagonal (like above):
par(mai = rep(0.2, 4))
layout(matrix(1:4, nrow = 2, byrow = TRUE))
s <- sapply(c(45, 135, 225, 315), function(i) persp(nx, nx, z, theta = i, phi = 10,
shade = 0.75, xlab = "x1", ylab = "x2", zlab = "y"))
The plot in the upper-left corner is the same one we saw above. But now, we see three additional rotations (imagine the plots rotating 90 degrees each, right-to-left), so the lower-right plot highlights the two “hidden” marginal effects from above.
Another set of plots shows the same plot at right angles, thus highlighting the marginal effects at approximately true scale but masking much of the curvature of the response surface:
par(mai = rep(0.2, 4))
layout(matrix(1:4, nrow = 2))
sapply(c(90, 180, 270, 360), function(i) persp(nx, nx, z, theta = i, phi = 10,
shade = 0.75, xlab = "x1", ylab = "x2", zlab = "y"))
## [,1] [,2] [,3] [,4]
## [1,] 1.225e-16 -2.000e+00 -3.674e-16 2.000e+00
## [2,] 2.000e+00 2.449e-16 -2.000e+00 -4.898e-16
## [3,] -1.410e-17 -1.727e-33 1.410e-17 3.454e-33
## [4,] -1.000e+00 1.000e+00 1.000e+00 -1.000e+00
## [5,] -3.473e-01 -4.253e-17 3.473e-01 8.506e-17
## [6,] 1.419e-16 -3.473e-01 5.680e-17 3.473e-01
## [7,] 2.268e-01 2.268e-01 2.268e-01 2.268e-01
## [8,] -1.178e+00 -1.178e+00 -1.525e+00 -1.525e+00
## [9,] 1.970e+00 2.412e-16 -1.970e+00 -4.824e-16
## [10,] -9.934e-17 1.970e+00 3.831e-16 -1.970e+00
## [11,] 3.999e-02 3.999e-02 3.999e-02 3.999e-02
## [12,] -3.955e+00 -3.955e+00 -1.986e+00 -1.986e+00
## [13,] -1.970e+00 -2.412e-16 1.970e+00 4.824e-16
## [14,] 9.934e-17 -1.970e+00 -3.831e-16 1.970e+00
## [15,] -3.999e-02 -3.999e-02 -3.999e-02 -3.999e-02
## [16,] 4.955e+00 4.955e+00 2.986e+00 2.986e+00
While this highlights the marginal effects somewhat nicely, the two left-hand plots are quite difficult to actually look at due to the shape of the interaction.
Note: The plots can be colored in many interesting ways, but the details are complicated (see ? persp
).
Because the perspective plots are somewhat difficult to interpret, we might want to produce a two-dimensional representation that better highlights our interaction without the confusion of flattening a three-dimensional surface to two dimensions. The image
function supplies us with a way to use color (or grayscale, in the case below) to show the values of y
across the x1
-by-x2
matrix. We again supply arguments quite similar to above:
layout(1)
par(mai = rep(1, 4))
image(nx, nx, z, xlab = "x1", ylab = "x2", main = "Expected Y", col = gray(50:1/50))
Here, the darker colors represent higher values of y
. Because the mind can't interpret color differences as well as it can interpret differences in slope, the interaction becomes somewhat muddled. For example, the marginal effect of x2|x1==0
is much less steep than the marginal effect of x2|x1==1
, but it is difficult to quantify that by comparing the difference between white and gray on the left-hand side of the plot to the difference between gray and black on the right-hand side of the plot (those differences in color represent the marginal effects).
We could redraw the plot with some contour lines to try to better see things:
image(nx, nx, z, xlab = "x1", ylab = "x2", main = "Expected Y", col = gray(50:1/50))
contour(z = z, add = TRUE)
Here we see that when x1==0
, a change in x2
from 0 to 1 increases y
from about 1 to about 4. By contrast, when x1==1
, the same change in x2
is from about 3 to about 10, which is substantially larger.
Since the contours seemed to make all of the difference in terms of interpretability above, we could just draw those instead without the underlying image
matrix:
filled.contour(z = z, xlab = "x1", ylab = "x2", main = "Expected Y", col = gray(20:1/20))
Here we see the same relationship highlighted by the contour lines, but they are nicely scaled and the plot supplies a gradient scale (at right) to help quantify the different colors.
Thus we have several different ways to look at continuous-by-continuous interactions. All of these techniques have advantages and disadvantages, but all do a better job at clarifying the nature of the relationships between x1
, x2
, and y
than does the standard regression model or even the continuous-by-factor model.
Lists are a very helpful data structure, especially for large projects.
Lists allow us to store other types of R objects together inside another object.
For example, instead of having two vectors a
and b
, we could put those vectors in a list:
a <- 1:10
b <- 11:20
x <- list(a, b)
x
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 11 12 13 14 15 16 17 18 19 20
The result is a list with two elements, where the elements are the original vectors. We can also build lists without defining the list elements beforehand:
x <- list(1:5, 6:10)
x
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] 6 7 8 9 10
Positional indexing of lists is similar to positional indexing of vectors, with a few important differences.
If we index our list x
with []
, the result is a list:
x[1]
## [[1]]
## [1] 1 2 3 4 5
x[2]
## [[1]]
## [1] 6 7 8 9 10
x[1:2]
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] 6 7 8 9 10
If we try to index with 0, we get an empty list:
x[0]
## list()
And if we try to index with a value larger than length(x)
, we get a list with a NULL element:
length(x)
## [1] 2
x[length(x) + 1]
## [[1]]
## NULL
Lists also allow us to use a different kind of positional indexing involving two brackets (e.g., [[]]
):
x[[1]]
## [1] 1 2 3 4 5
Rather than returning a list, this returns the vector that is stored in list element 1.
Indexing like x[[1:2]] does not give us the first and second vectors combined. Double brackets extract a single element (a vector of indices is interpreted recursively), so we cannot use them to pull out multiple list elements at once.
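If we do want those two vectors combined into a single vector (a quick sketch), we can extract them separately and concatenate them, or unlist a single-bracket subset:
c(x[[1]], x[[2]])   # combine the first and second vectors
unlist(x[1:2])      # the same result via a single-bracket subset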
The double bracket indexing also lets us index elements of the vector stored in a list element. For example, if we want to get the third element of the second list item, we can use two sets of indices:
x[[2]][3]
## [1] 8
Just like vectors, list elements can have names.
y <- list(first = 4:6, second = 7:9, third = 1:3)
y
## $first
## [1] 4 5 6
##
## $second
## [1] 7 8 9
##
## $third
## [1] 1 2 3
The result is a list with three named elements, each of which is a vector. We can still index this list positionally:
y[1]
## $first
## [1] 4 5 6
y[[3]]
## [1] 1 2 3
But we can also index by the names, like we did with vectors. This can involve single bracket indexing, to return a single-element list:
y["first"]
## $first
## [1] 4 5 6
Or a subset of the original list elements:
y[c("first", "third")]
## $first
## [1] 4 5 6
##
## $third
## [1] 1 2 3
It can also involve double bracket indexing, to return a vector:
y[["second"]]
## [1] 7 8 9
We can then combine this named indexing of the list with the numeric indexing of one of the list's vectors:
y[["second"]][3]
## [1] 9
Named indexing also allows us to use a new operator, the dollar sign ($
).
The $
sign is equivalent to named indexing:
y[["first"]]
## [1] 4 5 6
y$first
## [1] 4 5 6
And, just as with named indexing in double brackets, we can combine $
indexing with vector positional indexing:
y[["first"]][2]
## [1] 5
y$first[2]
## [1] 5
We can easily modify the elements of a list using positional or named indexing.
w <- list(a = 1:5, b = 6:10)
w
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] 6 7 8 9 10
w[[1]] <- 5:1
w
## $a
## [1] 5 4 3 2 1
##
## $b
## [1] 6 7 8 9 10
w[["a"]] <- rep(1, 5)
w
## $a
## [1] 1 1 1 1 1
##
## $b
## [1] 6 7 8 9 10
We can also add new elements to a list using positions or names:
w[[length(w) + 1]] <- 1
w$d <- 2
w
## $a
## [1] 1 1 1 1 1
##
## $b
## [1] 6 7 8 9 10
##
## [[3]]
## [1] 1
##
## $d
## [1] 2
The result is a list with some named and some unnamed elements:
names(w)
## [1] "a" "b" "" "d"
We can fill in the empty (''
) name:
names(w)[3] <- "c"
names(w)
## [1] "a" "b" "c" "d"
w
## $a
## [1] 1 1 1 1 1
##
## $b
## [1] 6 7 8 9 10
##
## $c
## [1] 1
##
## $d
## [1] 2
Or we could change all the names entirely:
names(w) <- c("do", "re", "mi", "fa")
names(w)
## [1] "do" "re" "mi" "fa"
w
## $do
## [1] 1 1 1 1 1
##
## $re
## [1] 6 7 8 9 10
##
## $mi
## [1] 1
##
## $fa
## [1] 2
Lists are flexible and therefore important! The above exercises also showed that lists can contain different kinds of elements. Not every element in a list has to be the same length or the same class. Indeed, we can create a list that mixes many kinds of elements:
m <- list(a = 1, b = 1:5, c = "hello", d = factor(1:3))
m
## $a
## [1] 1
##
## $b
## [1] 1 2 3 4 5
##
## $c
## [1] "hello"
##
## $d
## [1] 1 2 3
## Levels: 1 2 3
This is important because many of the functions we will use to do analysis in R return lists with different kinds of information. To really use R effectively, we need to be able to extract information from those resulting lists.
It may at some point be helpful to have our list in the form of a vector.
For example, we may want to see all of the elements of every vector in the list as a single vector.
To get this, we unlist
the list, which converts it into a vector and automatically names the vector elements according to the names of the original list:
z1 <- unlist(y)
z1
## first1 first2 first3 second1 second2 second3 third1 third2 third3
## 4 5 6 7 8 9 1 2 3
We could also turn this back into a list, with every element of unlist(y)
being a separate element of a new list:
z2 <- as.list(z1)
z2
## $first1
## [1] 4
##
## $first2
## [1] 5
##
## $first3
## [1] 6
##
## $second1
## [1] 7
##
## $second2
## [1] 8
##
## $second3
## [1] 9
##
## $third1
## [1] 1
##
## $third2
## [1] 2
##
## $third3
## [1] 3
Here all of the elements of the vector are separate list elements and vector names are transferred to the new list. We can see that the names of the vector are the same as the names of the list:
names(z1)
## [1] "first1" "first2" "first3" "second1" "second2" "second3" "third1"
## [8] "third2" "third3"
names(z2)
## [1] "first1" "first2" "first3" "second1" "second2" "second3" "third1"
## [8] "third2" "third3"
In order to use R for data analysis, we need to get our data into R. Unfortunately, because R lacks a graphical user interface, loading data is not particularly intuitive for those used to working with other statistical software. This tutorial explains how to load data into R as a dataframe object.
As a preliminary note, one of the things about R that causes a fair amount of confusion is that R reads character data, by default, as factor. In other words, when your data contain alphanumeric character strings (e.g., names of countries, free response survey questions), R will read those data in as factor variables rather than character variables. This can be changed when reading in data using almost any of the following techniques by setting a stringsAsFactors=FALSE
argument.
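For example (a sketch; the filename is hypothetical), to read a CSV file while keeping character columns as character rather than factor:
mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE)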
A second point of difficulty for beginners to R is that R offers no obvious visual way to load data into R. Lacking a full graphical user interface, there is no “open” button to read in a dataset. The closest thing to this is the file.choose
function. If you don't know the name or location of a file you want to load, you can use file.choose()
to open a dialog window that will let you select a file. The response, however, is just a character string containing the name and full path of the file. No action is taken with regard to that file. If, for example, you want to load a comma-separated value file (described below), you could make a call like the following:
# read.csv(file.choose())
This will first open the file choose dialog window and, when you select a file, R will then process that file with read.csv
and return a dataframe.
While file.choose
is a convenient function for interactive work in R, it is generally better to write filenames into your code manually in order to maximize reproducibility.
One of the neat little features of R is that it comes with some built-in datasets, and many add-on packages supply additional datasets to demonstrate their functionality. We can access these datasets with the data()
function. Here we'll just print the first few datasets:
head(data()$results)
## Package LibPath Item
## [1,] "car" "C:/Program Files/R/R-3.0.2/library" "AMSsurvey"
## [2,] "car" "C:/Program Files/R/R-3.0.2/library" "Adler"
## [3,] "car" "C:/Program Files/R/R-3.0.2/library" "Angell"
## [4,] "car" "C:/Program Files/R/R-3.0.2/library" "Anscombe"
## [5,] "car" "C:/Program Files/R/R-3.0.2/library" "Baumann"
## [6,] "car" "C:/Program Files/R/R-3.0.2/library" "Bfox"
## Title
## [1,] "American Math Society Survey Data"
## [2,] "Experimenter Expectations"
## [3,] "Moral Integration of American Cities"
## [4,] "U. S. State Public-School Expenditures"
## [5,] "Methods of Teaching Reading Comprehension"
## [6,] "Canadian Women's Labour-Force Participation"
Datasets in the datasets package are pre-loaded with R and can simply be called by name from the R console. For example, we can see the “Monthly Airline Passenger Numbers 1949-1960” dataset by simply calling:
AirPassengers
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
To obtain detailed information about a dataset, you can simply access its documentation: ? AirPassengers
.
We generally want to work with our own data, however, rather than some arbitrary dataset, so we'll have to load data into R.
Because a dataframe is just a collection of data vectors, we can always enter data by hand into the R console. For example, let's say we have two variables (height
and weight
) measured on each of six observations. We can enter these simply by typing them into the console and combining them into a dataframe, like:
height <- c(165, 170, 163, 182, 175, 190)
weight <- c(45, 60, 70, 80, 63, 72)
mydf <- cbind.data.frame(height, weight)
We can then call our dataframe by name:
mydf
## height weight
## 1 165 45
## 2 170 60
## 3 163 70
## 4 182 80
## 5 175 63
## 6 190 72
R also provides a function called scan
that allows us to type data into a special prompt. For example, we might want to read in six values of gender for our observations above; we could do that by typing mydf$gender <- scan(n = 6, what = numeric())
and entering the six values, one per line, when prompted.
But entering data manually in this fashion is inefficient and doesn't make sense if we already have data saved in an external file.
The easiest data to load into R comes in tabular file formats, like comma-separated value (CSV) or tab-separated value (TSV) files. These can easily be created using a spreadsheet editor (like Microsoft Excel), a text editor (like Notepad), or exported from many other computer programs (including all statistical packages).
read.table and its variants
The general function for reading these kinds of data is called read.table
. Two other functions, read.csv
and read.delim
, provide convenient wrappers for reading CSV and TSV files, respectively. (Note: read.csv2
and read.delim2
provide slightly different wrappers designed for reading data that uses a semicolon rather than comma separator and a comma rather than a period as the decimal point.)
Reading in data that is in CSV format is easy. For example, let's read in the following file, which contains some data about patient admissions for five patients:
patient,dob,entry,discharge,fee,sex
001,10/21/1946,12/12/2004,12/14/2004,8000,1
002,05/01/1980,07/08/2004,08/08/2004,12000,2
003,01/01/1960,01/01/2004,01/04/2004,9000,2
004,06/23/1998,11/11/2004,12/25/2004,15123,1
We can read these data in from the console by copying and pasting them into a command like the following:
mydf <- read.csv(text = "\npatient,dob,entry,discharge,fee,sex\n001,10/21/1946,12/12/2004,12/14/2004,8000,1\n002,05/01/1980,07/08/2004,08/08/2004,12000,2\n003,01/01/1960,01/01/2004,01/04/2004,9000,2\n004,06/23/1998,11/11/2004,12/25/2004,15123,1")
mydf
## patient dob entry discharge fee sex
## 1 1 10/21/1946 12/12/2004 12/14/2004 8000 1
## 2 2 05/01/1980 07/08/2004 08/08/2004 12000 2
## 3 3 01/01/1960 01/01/2004 01/04/2004 9000 2
## 4 4 06/23/1998 11/11/2004 12/25/2004 15123 1
Or, we can read them from the local file directly:
mydf <- read.csv("../Data/patient.csv")
Reading them in either way will produce the exact same dataframe. If the data were tab- or semicolon-separated, the call would be exactly the same except for the use of read.delim
and read.csv2
, respectively.
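For instance, hypothetical tab- and semicolon-separated versions of the same file could be read as follows (the file names are assumptions for illustration):
# hypothetical tab-separated version of the patient data
mydf_tab <- read.delim("../Data/patient.tsv")
# hypothetical semicolon-separated version (comma as decimal point)
mydf_semi <- read.csv2("../Data/patient-semicolon.csv")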
Note: Any time we read data into R, we need to store it as a variable, otherwise it will simply be printed to the console and we won't be able to do anything with it. You can name dataframes whatever you want.
scan and readLines
Occasionally, we need to read in data as a vector of character strings rather than as delimited data to make a dataframe. For example, we might have a file that contains textual data (e.g., from a news story) and we want to read in each word or each line of the file as a separate element of a vector in order to perform some kind of text processing on it.
To do this kind of analysis we can use one of two functions. The scan
function we used above to manually enter data at the console can also be used to read data in from a file, as can another function called readLines
.
We can see how the two functions work by first writing some miscellaneous text to a file (using cat
) and then reading in that content:
cat("TITLE", "A first line of text", "A second line of text", "The last line of text",
file = "ex.data", sep = "\n")
We can use scan
to read in the data as a vector of words:
scan("ex.data", what = "character")
## [1] "TITLE" "A" "first" "line" "of" "text" "A"
## [8] "second" "line" "of" "text" "The" "last" "line"
## [15] "of" "text"
The scan
function accepts additional arguments such as n
to specify the number of lines to read from the file and sep
to specify how to divide the file into separate entries in the resulting vector:
scan("ex.data", what = "character", sep = "\n")
## [1] "TITLE" "A first line of text" "A second line of text"
## [4] "The last line of text"
scan("ex.data", what = "character", n = 1, sep = "\n")
## [1] "TITLE"
We can do the same thing with readLines
, which assumes that we want to read each line as a complete string rather than separating the file contents in some way:
readLines("ex.data")
## [1] "TITLE" "A first line of text" "A second line of text"
## [4] "The last line of text"
It also accepts an n
argument:
readLines("ex.data", n = 2)
## [1] "TITLE" "A first line of text"
Let's delete the file we created just to cleanup:
unlink("ex.data") # tidy up
R has its own file format called .RData that can be used to store data for use in R. It is fairly rare to encounter data in this format, but reading it into R is - as one might expect - very easy. You simply need to call load('thefile.RData')
and the objects stored in the file will be loaded into memory in R.
One context in which you might use an .RData file is when saving your R workspace. When you quit R (using q()
), R asks if you want to save your workspace. If you select “yes”, R stores all of the objects currently in memory to a .RData file. This file can then be load
ed in a subsequent R session to pick up quite literally exactly where you left off when you saved the file.
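As a quick sketch, we can also create an .RData file ourselves with save and restore its contents with load (the file name here is arbitrary):
save(mydf, file = "mydf.RData")  # write one or more named objects to disk
rm(mydf)                         # remove the object from the workspace
load("mydf.RData")               # 'mydf' is restored under its original name
unlink("mydf.RData")             # tidy up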
Because many people use statistical packages like SAS, SPSS, and Stata for statistical analysis, much of the data available in the world is saved in proprietary file formats created and owned by the companies that publish that software. This is bad because those data formats are frequently deprecated (i.e., made obsolete), as when Stata upgraded to version 11 and introduced a new file format that its older formats were no longer compatible with. This creates problems for reproducibility because not everyone has access to Stata (or to SPSS or SAS), and storing data in these formats makes it harder to share data and ties data to specific software owned by specific companies. Editorializing aside, R can import data from a variety of proprietary file formats. Doing so requires one of the recommended add-on packages called foreign. Let's load it here:
library(foreign)
The foreign package can be used to import data from a variety of proprietary formats, including Stata .dta files (using the read.dta function), Octave or Matlab .mat files (using read.octave), SPSS .sav files (using read.spss), SAS permanent .sas7bdat files (using read.ssd), SAS XPORT .stx or .xpt files (using read.xport), Systat .syd files (using read.systat), and Minitab .tmp files (using read.mtp).
Note: The **foreign** package sometimes has trouble with SPSS formats, but these files can also be opened with the spss.get function from the **Hmisc** package or one of several functions from the **memisc** package (spss.fixed.file, spss.portable.file, and spss.system.file).
We can try loading some “foreign” data stored in Stata format:
englebert <- read.dta("../Data/EnglebertPRQ2000.dta")
## Warning: cannot read factor labels from Stata 5 files
We can then look at the loaded data using any of our usual object examination functions:
dim(englebert) # dimensions
## [1] 50 27
head(englebert) # first few rows
## country wbcode indep paris london brussels lisbon commit exprop
## 1 ANGOLA AGO 1975 0 0 0 1 3.820 5.36
## 2 BENIN BEN 1960 1 0 0 0 4.667 6.00
## 3 BOTSWANA BWA 1966 0 1 0 0 6.770 7.73
## 4 BURKINA FASO BFA 1960 1 0 0 0 5.000 4.45
## 5 BURUNDI BDI 1962 0 0 1 0 6.667 7.00
## 6 CAMEROON CMR 1960 1 0 0 0 6.140 6.45
## corrupt instqual buroqual goodgov ruleolaw pubadmin growth lcon
## 1 5.000 2.7300 4.470 4.280 3.970 4.73 -0.0306405 6.594
## 2 1.333 3.0000 2.667 3.533 4.556 2.00 -0.0030205 6.949
## 3 6.590 8.3300 6.140 7.110 7.610 6.36 0.0559447 6.358
## 4 6.060 5.3000 4.170 5.000 4.920 5.11 -0.0000589 6.122
## 5 3.000 0.8333 4.000 4.300 4.833 3.50 -0.0036746 6.461
## 6 4.240 4.5500 6.670 5.610 5.710 5.45 0.0147910 6.463
## lconsq i g vlegit hlegit elf hieafvm hieafvs warciv language
## 1 43.49 3.273 34.22 0 0.5250 0.78 1.00 0.00 24 4.2
## 2 48.29 6.524 22.79 0 0.6746 0.62 2.67 0.47 0 5.3
## 3 40.42 22.217 27.00 1 0.9035 0.51 2.00 0.00 0 3.1
## 4 37.48 7.858 17.86 0 0.5735 0.68 1.25 0.97 0 4.8
## 5 41.75 4.939 13.71 1 0.9800 0.04 3.00 0.00 8 0.6
## 6 41.77 8.315 20.67 0 0.8565 0.89 1.50 0.76 0 8.3
names(englebert) # column/variable names
## [1] "country" "wbcode" "indep" "paris" "london" "brussels"
## [7] "lisbon" "commit" "exprop" "corrupt" "instqual" "buroqual"
## [13] "goodgov" "ruleolaw" "pubadmin" "growth" "lcon" "lconsq"
## [19] "i" "g" "vlegit" "hlegit" "elf" "hieafvm"
## [25] "hieafvs" "warciv" "language"
str(englebert) # object structure
## 'data.frame': 50 obs. of 27 variables:
## $ country : chr "ANGOLA" "BENIN" "BOTSWANA" "BURKINA FASO" ...
## $ wbcode : chr "AGO" "BEN" "BWA" "BFA" ...
## $ indep : num 1975 1960 1966 1960 1962 ...
## $ paris : num 0 1 0 1 0 1 0 1 1 1 ...
## $ london : num 0 0 1 0 0 0 0 0 0 0 ...
## $ brussels: num 0 0 0 0 1 0 0 0 0 0 ...
## $ lisbon : num 1 0 0 0 0 0 1 0 0 0 ...
## $ commit : num 3.82 4.67 6.77 5 6.67 ...
## $ exprop : num 5.36 6 7.73 4.45 7 ...
## $ corrupt : num 5 1.33 6.59 6.06 3 ...
## $ instqual: num 2.73 3 8.33 5.3 0.833 ...
## $ buroqual: num 4.47 2.67 6.14 4.17 4 ...
## $ goodgov : num 4.28 3.53 7.11 5 4.3 ...
## $ ruleolaw: num 3.97 4.56 7.61 4.92 4.83 ...
## $ pubadmin: num 4.73 2 6.36 5.11 3.5 ...
## $ growth : num -3.06e-02 -3.02e-03 5.59e-02 -5.89e-05 -3.67e-03 ...
## $ lcon : num 6.59 6.95 6.36 6.12 6.46 ...
## $ lconsq : num 43.5 48.3 40.4 37.5 41.8 ...
## $ i : num 3.27 6.52 22.22 7.86 4.94 ...
## $ g : num 34.2 22.8 27 17.9 13.7 ...
## $ vlegit : num 0 0 1 0 1 0 1 0 0 0 ...
## $ hlegit : num 0.525 0.675 0.904 0.573 0.98 ...
## $ elf : num 0.78 0.62 0.51 0.68 0.04 ...
## $ hieafvm : num 1 2.67 2 1.25 3 ...
## $ hieafvs : num 0 0.47 0 0.97 0 ...
## $ warciv : num 24 0 0 0 8 0 0 0 29 0 ...
## $ language: num 4.2 5.3 3.1 4.8 0.6 ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr "25 Mar 2000 18:07"
## - attr(*, "formats")= chr "%21s" "%9s" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 148 133 102 102 102 102 102 102 102 102 ...
## - attr(*, "val.labels")= chr "" "" "" "" ...
## - attr(*, "var.labels")= chr "Name of country" "World Bank three-letter code" "Date of independence" "Colonization by France" ...
## - attr(*, "version")= int 5
summary(englebert) # summary
## country wbcode indep paris
## Length:50 Length:50 Min. : -4 Min. :0.00
## Class :character Class :character 1st Qu.:1960 1st Qu.:0.00
## Mode :character Mode :character Median :1962 Median :0.00
## Mean :1921 Mean :0.38
## 3rd Qu.:1968 3rd Qu.:1.00
## Max. :1993 Max. :1.00
## NA's :2
## london brussels lisbon commit exprop
## Min. :0.00 Min. :0.00 Min. :0.0 Min. :1.68 Min. :2.00
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:4.00 1st Qu.:4.50
## Median :0.00 Median :0.00 Median :0.0 Median :5.00 Median :6.05
## Mean :0.34 Mean :0.06 Mean :0.1 Mean :4.94 Mean :5.90
## 3rd Qu.:1.00 3rd Qu.:0.00 3rd Qu.:0.0 3rd Qu.:6.04 3rd Qu.:6.88
## Max. :1.00 Max. :1.00 Max. :1.0 Max. :8.00 Max. :9.33
## NA's :7 NA's :7
## corrupt instqual buroqual goodgov
## Min. :0.00 Min. :0.833 Min. : 0.667 Min. :1.95
## 1st Qu.:3.00 1st Qu.:3.180 1st Qu.: 3.130 1st Qu.:3.99
## Median :4.39 Median :3.790 Median : 3.940 Median :4.87
## Mean :4.38 Mean :4.154 Mean : 4.239 Mean :4.72
## 3rd Qu.:5.79 3rd Qu.:5.340 3rd Qu.: 5.300 3rd Qu.:5.53
## Max. :8.71 Max. :8.330 Max. :10.000 Max. :7.40
## NA's :7 NA's :7 NA's :7 NA's :7
## ruleolaw pubadmin growth lcon
## Min. :2.33 Min. :1.25 Min. :-0.038 Min. :5.53
## 1st Qu.:4.33 1st Qu.:3.25 1st Qu.:-0.005 1st Qu.:6.32
## Median :5.02 Median :4.17 Median : 0.002 Median :6.60
## Mean :5.00 Mean :4.31 Mean : 0.004 Mean :6.67
## 3rd Qu.:5.97 3rd Qu.:5.49 3rd Qu.: 0.013 3rd Qu.:7.01
## Max. :7.61 Max. :9.36 Max. : 0.056 Max. :8.04
## NA's :7 NA's :7 NA's :6 NA's :6
## lconsq i g vlegit
## Min. :30.6 Min. : 1.40 Min. :11.1 Min. :0.000
## 1st Qu.:39.9 1st Qu.: 5.41 1st Qu.:18.7 1st Qu.:0.000
## Median :43.5 Median : 9.86 Median :22.9 Median :0.000
## Mean :44.8 Mean :10.25 Mean :23.9 Mean :0.213
## 3rd Qu.:49.1 3rd Qu.:14.32 3rd Qu.:27.8 3rd Qu.:0.000
## Max. :64.6 Max. :25.62 Max. :44.2 Max. :1.000
## NA's :6 NA's :6 NA's :6 NA's :3
## hlegit elf hieafvm hieafvs
## Min. :0.000 Min. :0.040 Min. :0.67 Min. :0.000
## 1st Qu.:0.330 1st Qu.:0.620 1st Qu.:1.52 1st Qu.:0.000
## Median :0.582 Median :0.715 Median :1.84 Median :0.480
## Mean :0.572 Mean :0.651 Mean :1.86 Mean :0.503
## 3rd Qu.:0.850 3rd Qu.:0.827 3rd Qu.:2.00 3rd Qu.:0.790
## Max. :1.000 Max. :0.930 Max. :3.00 Max. :1.490
## NA's :4 NA's :12 NA's :12 NA's :12
## warciv language
## Min. : 0.0 Min. : 0.10
## 1st Qu.: 0.0 1st Qu.: 1.90
## Median : 0.0 Median : 4.00
## Mean : 6.2 Mean : 6.53
## 3rd Qu.: 8.0 3rd Qu.: 8.30
## Max. :38.0 Max. :27.70
## NA's :9
If you ever encounter trouble importing foreign data formats into R, a good option is to use a piece of software called StatTransfer, which can convert between dozens of different file formats. Using StatTransfer to convert a file format into a CSV or R .RData format will essentially guarantee that it is readable by R.
Sometimes we need to read data in from Excel. In almost every situation, it is easiest to use Excel to convert this kind of file into a comma-separated CSV file first and then load it into R using read.csv
. That said, there are several packages designed to read Excel formats directly, but all have disadvantages.
For example, the gdata package provides a read.xls
function that can read Excel .xls files, but it requires having Perl installed on your machine.
Sometimes one encounters data in formats that are neither traditional, text-based tabular formats (like CSV or TSV) nor proprietary statistical formats (like .dta, .sav, etc.). For example, you sometimes encounter data that is recorded in an XML markup format or that is saved in “fixed-width format”, and so forth. So long as the data is human-readable (i.e., text), you will be able to find or write R code to deal with these files and convert them to an R dataframe. Depending on the file format, this may be time consuming, but everything is possible.
XML files can easily be read using the XML package. Indeed, its functions xmlToDataFrame
and xmlToList
easily convert almost any well-formed XML document into a dataframe or list, respectively.
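A minimal sketch, assuming a well-formed XML file of repeated records (the file name here is hypothetical):
library(XML)
# 'records.xml' is a hypothetical file containing repeated record nodes
xmldf <- xmlToDataFrame("records.xml")  # one row per record
xmllist <- xmlToList("records.xml")     # nested list mirroring the XML structure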
Fixed-width file formats are some of the hardest file formats to deal with. These files, typically built during the 20th Century, are digitized versions of data that was originally stored on punch cards. For example, much of the pre-2000 public opinion data archived at the Roper Center for Public Opinion Research's iPoll databank is stored in fixed width format. These formats store data as rows of numbers without variable names, value delimiters (like the comma or tab), and require a detailed codebook to translate them into human- or computer-readable data. For example, the following 14 lines represent the first two records of a public opinion data file from 1998:
000003204042898 248 14816722 1124 13122292122224442 2 522 1
0000032222222444444444444444144444444424424 2
000003 2 1 1 2 312922 3112422222121222 42115555 3
00000355554115 553722211212221122222222352 42 4567 4567 4
000003108 41 52 612211 1 229 5
000003 6
000003 20 01.900190 0198 7
000012212042898 248 14828523 1113 1312212111111411142 5213 1
0000122112221111141244412414114224444444144 2
000012 1 2 1 2 11212213123112232322113 31213335 3
00001255333115 666722222222221122222226642 72 4567 4567 4
000012101261 511112411 1 212 5
000012 6
000012 32 01.630163 0170 7
Clearly, these data are not easily interpretable despite the fact that there is some obvious pattern to the data. As long as we have a file indicating what each number means, we can use the read.fwf
function (from base R) to translate this file into a dataframe. The code for a real codebook like this one is tedious, so there isn't space to demonstrate it in full here, but the toy sketch below gives a flavor of how read.fwf works.
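Here is that toy sketch: a made-up three-line file in which each row holds a 3-digit id, a 2-digit age, and a 1-digit sex code (this example is unrelated to the survey file above):
# write a tiny fixed-width file for illustration
cat("001452", "002371", "003280", file = "fwf.data", sep = "\n")
# 'widths' gives the number of characters belonging to each variable
read.fwf("fwf.data", widths = c(3, 2, 1), col.names = c("id", "age", "sex"))
unlink("fwf.data")  # tidy up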
Sometimes we have bivariate data that are not well represented by a linear function (even if the variables are transformed). We might be able to see a relationship between the data in a scatterplot, but we are unable to fit a parametric model that properly describes the relationship between outcome and predictor. This might be particularly common when our predictor is a time variable and the outcome is a time-series. In these situations, one way to grasp and convey the relationship is with “local regression,” which fits a nonparametric curve to a scatterplot. Note: Local regression also works in multivariate contexts, but we'll focus on the bivariate form here for sake of simplicity.
Let's create a simple bivariate relationship and a complex one to see how local regression works in both cases.
set.seed(100)
x <- sample(1:50, 100, TRUE)
y1 <- 2 * x + rnorm(50, 0, 10)
y2 <- 5 + rnorm(100, 5 * abs(x - 25), abs(x - 10) + 10)
We can fit the local regression using the loess
function, which takes a formula object as its argument, just like any other regression:
localfit <- loess(y1 ~ x)
We can look at the summary
of the localfit
object, but - unlike parametric regression methods - the summary won't tell us much.
summary(localfit)
## Call:
## loess(formula = y1 ~ x)
##
## Number of Observations: 100
## Equivalent Number of Parameters: 4.51
## Residual Standard Error: 12
## Trace of smoother matrix: 4.92
##
## Control settings:
## normalize: TRUE
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
Local regression doesn't produce coefficients, so there's no way to see the model in tabular form. Instead we have to look at its predicted values and plot them visually.
We can calculate predicted values at each possible value of x
:
localp <- predict(localfit, data.frame(x = 1:50), se = TRUE)
The result is a list containing the predicted values (fit), their standard errors (se.fit), and some additional information:
localp
## $fit
## 1 2 3 4 5 6 7 8 9 10
## NA 1.817 3.919 6.025 8.136 10.255 12.383 14.522 16.675 18.843
## 11 12 13 14 15 16 17 18 19 20
## 21.035 23.249 25.475 27.699 29.910 32.156 34.459 36.770 39.038 41.213
## 21 22 23 24 25 26 27 28 29 30
## 43.297 45.333 47.332 49.304 51.261 53.158 54.964 56.707 58.418 60.124
## 31 32 33 34 35 36 37 38 39 40
## 61.856 63.577 65.254 66.916 68.595 70.319 72.120 73.977 75.854 77.754
## 41 42 43 44 45 46 47 48 49 50
## 79.680 81.636 83.624 85.649 87.711 89.808 91.937 94.099 96.292 98.515
##
## $se.fit
## 1 2 3 4 5 6 7 8 9 10 11 12
## NA 5.600 4.910 4.287 3.736 3.261 2.867 2.557 2.331 2.180 2.092 2.048
## 13 14 15 16 17 18 19 20 21 22 23 24
## 2.037 2.046 2.066 2.075 2.083 2.123 2.190 2.239 2.230 2.214 2.249 2.327
## 25 26 27 28 29 30 31 32 33 34 35 36
## 2.372 2.329 2.252 2.209 2.230 2.289 2.323 2.295 2.249 2.228 2.245 2.278
## 37 38 39 40 41 42 43 44 45 46 47 48
## 2.280 2.248 2.208 2.168 2.137 2.128 2.158 2.244 2.409 2.668 3.024 3.475
## 49 50
## 4.014 4.634
##
## $residual.scale
## [1] 12.04
##
## $df
## [1] 94.97
To see the loess curve, we can simply plot the fitted values. We'll do something a little more interesting though. We'll start by plotting our original data (in blue), then plot the standard errors as polygons (using the polygon function) for 1, 2, and 3 SEs, then overlay the fitted loess curve in white.
The plot nicely shows the fit to the data and the increasing uncertainty about the conditional mean at the tails of the independent variable. We also see that these data are easily modeled by a linear regression, which we could add to the plot.
plot(y1 ~ x, pch = 15, col = rgb(0, 0, 1, 0.5))
# one SE
polygon(c(1:50, 50:1), c(localp$fit - localp$se.fit, rev(localp$fit + localp$se.fit)),
col = rgb(1, 0, 0, 0.2), border = NA)
# two SEs
polygon(c(1:50, 50:1), c(localp$fit - 2 * localp$se.fit, rev(localp$fit + 2 *
localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# three SEs
polygon(c(1:50, 50:1), c(localp$fit - 3 * localp$se.fit, rev(localp$fit + 3 *
localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# loess curve:
lines(1:50, localp$fit, col = "white", lwd = 2)
# overlay a linear fit:
abline(lm(y1 ~ x), lwd = 2)
Loess works well in a linear situation, but in those cases we're better off fitting the linear model because then we can get directly interpretable coefficients. The major downside of local regression is that we can only see it and understand it as a graph.
We can repeat the above process for our second outcome, which lacks a clear linear relationship between predictor x
and outcome y2
:
localfit <- loess(y2 ~ x)
localp <- predict(localfit, data.frame(x = 1:50), se = TRUE)
plot(y2 ~ x, pch = 15, col = rgb(0, 0, 1, 0.5))
# one SE
polygon(c(1:50, 50:1), c(localp$fit - localp$se.fit, rev(localp$fit + localp$se.fit)),
col = rgb(1, 0, 0, 0.2), border = NA)
# two SEs
polygon(c(1:50, 50:1), c(localp$fit - 2 * localp$se.fit, rev(localp$fit + 2 *
localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# three SEs
polygon(c(1:50, 50:1), c(localp$fit - 3 * localp$se.fit, rev(localp$fit + 3 *
localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# loess curve:
lines(1:50, localp$fit, col = "white", lwd = 2)
# overlay a linear fit and associated standard errors:
lmfit <- lm(y2 ~ x)
abline(lmfit, lwd = 2)
lmp <- predict(lmfit, data.frame(x = 1:50), se.fit = TRUE)
lines(1:50, lmp$fit - lmp$se.fit, lty = 2)
lines(1:50, lmp$fit + lmp$se.fit, lty = 2)
In contrast to the data where y1
was a simple function of x
, these data are far messier. They are not well-represented by a straight line fit (as evidenced by our overlay of a linear fit to the data). Instead, the local regression approach shows how y2
is not a clean function of the predictor. In these situations, the local regression curve can be helpful for understanding the relationship between outcome and predictor and potentially for building a subsequent parametric model that approximates the data better than a straight line.
Logicals are a fundamental tool for using R in a sophisticated way. Logicals allow us to precisely select elements of an R object (e.g., a vector or dataframe) based upon criteria and to selectively perform operations.
R supports all of the typical mathematical comparison operators.
Equal to:
1 == 2
## [1] FALSE
Note: Double equals ==
is a logical test. Single equals =
means right-to-left assignment.
Greater than:
1 > 2
## [1] FALSE
Greater than or equal to:
1 >= 2
## [1] FALSE
Less than:
1 < 2
## [1] TRUE
Less than or equal to:
1 <= 2
## [1] TRUE
Note: Less than or equal to <=
looks like <-
, which means right-to-left assignment.
Spacing between the numbers and operators is not important:
1==2
## [1] FALSE
1 == 2
## [1] FALSE
But, spacing between multiple operators is! The following:
# 1 > = 2
produces an error!
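Spacing can also silently change the meaning of a statement when comparison and assignment collide. A quick sketch with a throwaway variable tmp:
tmp <- 10
tmp < -3  # with spaces: a comparison, which is FALSE here
tmp<-3    # without spaces: an assignment, so tmp is now 3
tmp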
The result of each of these comparisons is a logical value, which can be TRUE, FALSE, or NA:
is.logical(TRUE) #' valid logical
## [1] TRUE
is.logical(FALSE) #' valid logical
## [1] TRUE
is.logical(NA) #' valid logical
## [1] TRUE
is.logical(45) #' invalid
## [1] FALSE
is.logical("hello") #' invalid
## [1] FALSE
Because logicals only take values of TRUE or FALSE, values of 1 or 0 can be coerced to logical:
as.logical(0)
## [1] FALSE
as.logical(1)
## [1] TRUE
as.logical(c(0, 0, 1, 0, NA))
## [1] FALSE FALSE TRUE FALSE NA
And, conversely, logicals can be coerced back to integer using mathematical operators:
TRUE + TRUE + FALSE
## [1] 2
FALSE - TRUE
## [1] -1
FALSE + 5
## [1] 5
Logical comparisons can also be applied to vectors:
a <- 1:10
a > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
This produces a logical vector. This is often useful for indexing:
a[a > 5]
## [1] 6 7 8 9 10
We can also apply multiple logical conditions using boolean operators (AND and OR):
a > 4 & a < 9
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
a > 7 | a == 2
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Complex conditions can also be combined with parentheses to build a logical:
(a > 5 & a < 8) | (a < 3)
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
There is also a xor
function to enforce strict OR (but not AND) logic:
xor(TRUE, FALSE)
## [1] TRUE
xor(TRUE, TRUE)
## [1] FALSE
xor(FALSE, FALSE)
## [1] FALSE
Logical indexing becomes helpful, for example, if we want to create a new vector based on the values of an old vector:
b <- a
b[b > 5] <- 1
b
## [1] 1 2 3 4 5 1 1 1 1 1
It is also possible to convert a logical vector into a positional vector using which
:
a > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
which(a > 5)
## [1] 6 7 8 9 10
Of course, this is only helpful in some contexts because:
a[a > 5]
## [1] 6 7 8 9 10
a[which(a > 5)]
## [1] 6 7 8 9 10
produce the same result.
We can also invert a logical (turn TRUE into FALSE, and vice versa) using the exclamation point (!
):
!TRUE
## [1] FALSE
!FALSE
## [1] TRUE
b == 3
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
!b == 3
## [1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We can also use an if-else construction to define a new vector conditional on an old vector:
For example, we could produce our b
vector from above using the ifelse
function:
ifelse(a > 5, 1, a)
## [1] 1 2 3 4 5 1 1 1 1 1
This tests each element of a
. If that element meets the condition, ifelse returns the second argument (here, 1); otherwise it returns the corresponding element of a
.
We could modify this slightly to instead return 2 rather than the original value when an element fails the condition:
ifelse(a > 5, 1, 2)
## [1] 2 2 2 2 2 1 1 1 1 1
This gives us an indicator vector.
An especially helpful logical operator, %in%, checks whether the elements of one vector appear in another vector:
d <- 1:5
e <- 4:7
d %in% e
## [1] FALSE FALSE FALSE TRUE TRUE
e %in% d
## [1] TRUE TRUE FALSE FALSE
R has several other functions related to sets (e.g., union
, intersect
, and setdiff
), but these return the set elements themselves rather than logical values.
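For example, with the d and e vectors defined above:
union(d, e)      # all values appearing in either vector
intersect(d, e)  # values appearing in both vectors
setdiff(d, e)    # values in d that are not in e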
Note: The ifelse
function demonstrates an R feature called “vectorization.”
This means that the function operates on every element of the vector at once, rather than requiring us to write a loop to test each element separately.
Many R functions rely on vectorization, which makes them easy to write and fast for the computer to execute.
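To see what vectorization saves us, here is a rough hand-written loop that does the same work as ifelse(a > 5, 1, a):
out <- numeric(length(a))  # container for the result
for (i in seq_along(a)) {
    if (a[i] > 5) {
        out[i] <- 1
    } else {
        out[i] <- a[i]
    }
}
out  # same result as ifelse(a > 5, 1, a)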
Matrices are a two-dimensional data structure that are quite useful, especially for statistics in R.
Just like in mathematical notation, an R matrix is an m-by-n grid of elements.
To create a matrix, we use the matrix
function, which we supply with several parameters including the content of the matrix and its dimensions.
If we just give a matrix a data
parameter, it produces a column vector:
matrix(1:6)
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
## [6,] 6
If we want the matrix to have different dimensions we can specify nrow
and/or ncol
parameters:
matrix(1:6, nrow = 2)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
matrix(1:6, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
matrix(1:6, nrow = 2, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
By default, the data are filled into the resulting matrix “column-wise”.
If we specify byrow=TRUE
, the elements are instead filled in “row-wise”:
matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
Requesting a matrix smaller than the supplied data parameter will result in only some of the data being used and the rest discarded:
matrix(1:6, nrow = 2, ncol = 1)
## [,1]
## [1,] 1
## [2,] 2
Note: requesting a matrix with larger dimensions than the data produces a warning:
matrix(1:6, nrow = 2, ncol = 4)
## Warning: data length [6] is not a sub-multiple or multiple of the number
## of columns [4]
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 1
## [2,] 2 4 6 2
In this example, we still receive a matrix but the matrix elements outside of our data are filled in automatically. This process is called “recycling” in which R repeats the data until it fills in the requested dimensions of the matrix.
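Recycling is not unique to matrix; the same thing happens in ordinary vector arithmetic, as a quick sketch shows:
1:6 + c(10, 20)      # shorter vector recycled: 11 22 13 24 15 26
1:6 + c(10, 20, 30)  # recycled twice: 11 22 33 14 25 36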
Just as with using length
to count the elements in a vector, we can use several functions to measure a matrix object.
If we apply the function length
to a matrix, it still counts all of the elements in the matrix, but doesn't tell us anything about its dimensions:
a <- matrix(1:10, nrow = 2)
length(a)
## [1] 10
If we want to get the number of rows in the matrix, we can use nrow
:
nrow(a)
## [1] 2
If we want to get the number of columns in the matrix, we can use ncol
:
ncol(a)
## [1] 5
We can also get the number of rows and the number of columns in a single call to dim
:
dim(a)
## [1] 2 5
We can also combine (or bind) vectors and/or matrices together using cbind
and rbind
.
rbind
is used to “row-bind” by stacking vectors and/or matrices on top of one another vertically.
cbind
is used to “column-bind” by stacking vectors and/or matrices next to one another horizontally.
rbind(1:3, 4:6, 7:9)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
cbind(1:3, 4:6, 7:9)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
We can also easily transpose a matrix using t
:
rbind(1:3, 4:6, 7:9)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
t(rbind(1:3, 4:6, 7:9))
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Indexing a matrix is very similar to indexing a vector, except now we have to account for two dimensions. The first dimension is rows. The second dimension is columns.
b <- rbind(1:3, 4:6, 7:9)
b[1, ] #' first row
## [1] 1 2 3
b[, 1] #' first column
## [1] 1 4 7
b[1, 1] #' element in first row and first column
## [1] 1
Just as with vector indexing, we can extract multiple elements:
b[1:2, ]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
b[1:2, 2:3]
## [,1] [,2]
## [1,] 2 3
## [2,] 5 6
And we can also use -
indexing:
b[-1, 2:3]
## [,1] [,2]
## [1,] 5 6
## [2,] 8 9
We can also use logical indexing in the same way:
b[c(TRUE, TRUE, FALSE), ]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
b[, c(TRUE, FALSE, TRUE)]
## [,1] [,2]
## [1,] 1 3
## [2,] 4 6
## [3,] 7 9
It is sometimes helpful to extract the diagonal of a matrix (e.g., the diagonal of a variance-covariance matrix).
Diagonals can be extracted using diag
:
diag(b)
## [1] 1 5 9
It is also possible to use diag
to assign new values to the diagonal of a matrix.
For example, we might want to make all of the diagonal elements 0:
b
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
diag(b) <- 0
b
## [,1] [,2] [,3]
## [1,] 0 2 3
## [2,] 4 0 6
## [3,] 7 8 0
We can also extract the upper or lower triangle of a matrix (e.g., to extract one half of a correlation matrix).
upper.tri
and lower.tri
produce logical matrices of the same dimension as the original matrix, which can then be used to index:
upper.tri(b) #' upper triangle
## [,1] [,2] [,3]
## [1,] FALSE TRUE TRUE
## [2,] FALSE FALSE TRUE
## [3,] FALSE FALSE FALSE
b[upper.tri(b)]
## [1] 2 3 6
lower.tri(b) #' lower triangle
## [,1] [,2] [,3]
## [1,] FALSE FALSE FALSE
## [2,] TRUE FALSE FALSE
## [3,] TRUE TRUE FALSE
b[lower.tri(b)]
## [1] 4 7 8
Recall that vectors can have named elements. Matrices can have named dimensions. Each row and column of a matrix can have a name that is supplied when it is created or added/modified later.
c <- matrix(1:6, nrow = 2)
Row names are added with rownames
:
rownames(c) <- c("Row1", "Row2")
Column names are added with colnames
:
colnames(c) <- c("x", "y", "z")
Dimension names can also be added initially when the matrix is created using the dimnames
parameter in matrix
:
matrix(1:6, nrow = 2, dimnames = list(c("Row1", "Row2"), c("x", "y", "z")))
## x y z
## Row1 1 3 5
## Row2 2 4 6
Dimension names can also be created in this way for only the rows or columns by using a NULL
value for one of the dimensions:
matrix(1:6, nrow = 2, dimnames = list(c("Row1", "Row2"), NULL))
## [,1] [,2] [,3]
## Row1 1 3 5
## Row2 2 4 6
Scalar addition and subtraction on a matrix works identically to addition or subtraction on a vector.
We simply use the standard addition (+
) and subtraction (-
) operators.
a <- matrix(1:6, nrow = 2)
a
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
a + 1
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 3 5 7
a - 2
## [,1] [,2] [,3]
## [1,] -1 1 3
## [2,] 0 2 4
Scalar multiplication and division also work with the standard operators (*
and /
).
a * 2
## [,1] [,2] [,3]
## [1,] 2 6 10
## [2,] 4 8 12
a/2
## [,1] [,2] [,3]
## [1,] 0.5 1.5 2.5
## [2,] 1.0 2.0 3.0
As with a vector, it is possible to apply comparators to an entire matrix:
a > 2
## [,1] [,2] [,3]
## [1,] FALSE TRUE TRUE
## [2,] FALSE TRUE TRUE
We can then use the resulting logical matrix as an index:
a[a > 2]
## [1] 3 4 5 6
But the result is a vector, not a matrix. If we use the same statement to assign, however, the result is a matrix:
a[a > 2] <- 99
a
## [,1] [,2] [,3]
## [1,] 1 99 99
## [2,] 2 99 99
In statistics, an important operation is matrix multiplication. Unlike scalar multiplication, this procedure involves the multiplication of two matrices by one another.
Let's start by defining a function to demonstrate how matrix multiplication works:
mmdemo <- function(A, B) {
m <- nrow(A)
n <- ncol(B)
C <- matrix(NA, nrow = m, ncol = n)
for (i in 1:m) {
for (j in 1:n) {
C[i, j] <- paste("(", A[i, ], "*", B[, j], ")", sep = "", collapse = "+")
}
}
print(C, quote = FALSE)
}
Now let's generate two matrices, multiply them and see how it worked:
amat <- matrix(1:4, ncol = 2)
bmat <- matrix(1:6, nrow = 2)
amat
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
bmat
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
amat %*% bmat
## [,1] [,2] [,3]
## [1,] 7 15 23
## [2,] 10 22 34
mmdemo(amat, bmat)
## [,1] [,2] [,3]
## [1,] (1*1)+(3*2) (1*3)+(3*4) (1*5)+(3*6)
## [2,] (2*1)+(4*2) (2*3)+(4*4) (2*5)+(4*6)
Let's try it on a different set of matrices:
amat <- matrix(1:16, ncol = 4)
bmat <- matrix(1:32, nrow = 4)
amat
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
bmat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 5 9 13 17 21 25 29
## [2,] 2 6 10 14 18 22 26 30
## [3,] 3 7 11 15 19 23 27 31
## [4,] 4 8 12 16 20 24 28 32
amat %*% bmat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 90 202 314 426 538 650 762 874
## [2,] 100 228 356 484 612 740 868 996
## [3,] 110 254 398 542 686 830 974 1118
## [4,] 120 280 440 600 760 920 1080 1240
mmdemo(amat, bmat)
## [,1] [,2]
## [1,] (1*1)+(5*2)+(9*3)+(13*4) (1*5)+(5*6)+(9*7)+(13*8)
## [2,] (2*1)+(6*2)+(10*3)+(14*4) (2*5)+(6*6)+(10*7)+(14*8)
## [3,] (3*1)+(7*2)+(11*3)+(15*4) (3*5)+(7*6)+(11*7)+(15*8)
## [4,] (4*1)+(8*2)+(12*3)+(16*4) (4*5)+(8*6)+(12*7)+(16*8)
## [,3] [,4]
## [1,] (1*9)+(5*10)+(9*11)+(13*12) (1*13)+(5*14)+(9*15)+(13*16)
## [2,] (2*9)+(6*10)+(10*11)+(14*12) (2*13)+(6*14)+(10*15)+(14*16)
## [3,] (3*9)+(7*10)+(11*11)+(15*12) (3*13)+(7*14)+(11*15)+(15*16)
## [4,] (4*9)+(8*10)+(12*11)+(16*12) (4*13)+(8*14)+(12*15)+(16*16)
## [,5] [,6]
## [1,] (1*17)+(5*18)+(9*19)+(13*20) (1*21)+(5*22)+(9*23)+(13*24)
## [2,] (2*17)+(6*18)+(10*19)+(14*20) (2*21)+(6*22)+(10*23)+(14*24)
## [3,] (3*17)+(7*18)+(11*19)+(15*20) (3*21)+(7*22)+(11*23)+(15*24)
## [4,] (4*17)+(8*18)+(12*19)+(16*20) (4*21)+(8*22)+(12*23)+(16*24)
## [,7] [,8]
## [1,] (1*25)+(5*26)+(9*27)+(13*28) (1*29)+(5*30)+(9*31)+(13*32)
## [2,] (2*25)+(6*26)+(10*27)+(14*28) (2*29)+(6*30)+(10*31)+(14*32)
## [3,] (3*25)+(7*26)+(11*27)+(15*28) (3*29)+(7*30)+(11*31)+(15*32)
## [4,] (4*25)+(8*26)+(12*27)+(16*28) (4*29)+(8*30)+(12*31)+(16*32)
Note: matrix multiplication is noncommutative, so the order of matrices matters in a statement!
Another important operation is the crossproduct. See also: OLS in matrix form.
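Base R supplies crossprod for this: crossprod(A, B) computes t(A) %*% B (and crossprod(A) computes t(A) %*% A) without explicitly forming the transpose. A small sketch:
X <- matrix(1:6, ncol = 2)
crossprod(X)        # same result as t(X) %*% X
t(X) %*% X          # written out explicitly
crossprod(X, 1:3)   # same as t(X) %*% (1:3)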
Sometimes we want to calculate a sum or mean for each row or column of a matrix. R provides built-in functions for each of these operations:
cmat <- matrix(1:20, nrow = 5)
cmat
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
rowSums(cmat)
## [1] 34 38 42 46 50
colSums(cmat)
## [1] 15 40 65 90
rowMeans(cmat)
## [1] 8.5 9.5 10.5 11.5 12.5
colMeans(cmat)
## [1] 3 8 13 18
These functions can be helpful for aggregating multiple variables, and computing the sum or mean with them is much faster than manually adding columns (or taking their mean) using the +
and /
operators.
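As an illustration, the same row means could be computed by hand or with apply, though rowMeans is both shorter and faster:
(cmat[, 1] + cmat[, 2] + cmat[, 3] + cmat[, 4])/4  # manual, column by column
apply(cmat, 1, mean)                                # general-purpose but slower
rowMeans(cmat)                                      # optimized built-in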
Missing data values in R are a major point of confusion. This tutorial walks through some of the basics of missing data. Where some statistical packages have several different kinds of missing values, R has only one:
NA
## [1] NA
NA
means a missing value. For example, in a vector variable, we might be missing the third observation:
a <- c(1, 2, NA, 4, 5)
a
## [1] 1 2 NA 4 5
This impacts our ability to do calculations on the vector, like taking its sum:
sum(a)
## [1] NA
This is because R treats anything mathematically calculated with an NA as missing:
1 + NA
## [1] NA
0 + NA
## [1] NA
This can cause some confusion because many statistical packages omit missing values by default. The R approach is better because it forces you to be conscious about where data are missing.
Another point of confusion is that some things look like missing data but are not.
For example, the NULL
value is not missing. Note the difference between a
and b
:
a
## [1] 1 2 NA 4 5
b <- c(1, 2, NULL, 4, 5)
b
## [1] 1 2 4 5
b
has only four elements. NULL is not missing, it is simply dropped.
This can be especially confusing when a vector is of character class.
For example, compare c
to d
:
c <- c("do", "re", NA, "fa")
c
## [1] "do" "re" NA "fa"
d <- c("do", "re", "NA", "fa")
d
## [1] "do" "re" "NA" "fa"
The third element of c
is missing (NA
), whereas the third element of d
is a character string 'NA'
.
We can see this with the logical test is.na
:
is.na(c)
## [1] FALSE FALSE TRUE FALSE
is.na(d)
## [1] FALSE FALSE FALSE FALSE
This tests whether each element in a vector is missing. Similarly, an empty character string is not missing:
is.na("")
## [1] FALSE
It is simply a character string that has no contents.
For example, compare c
to e
:
c
## [1] "do" "re" NA "fa"
e <- c("do", "re", "", "fa")
e
## [1] "do" "re" "" "fa"
is.na(c)
## [1] FALSE FALSE TRUE FALSE
is.na(e)
## [1] FALSE FALSE FALSE FALSE
There may be situations in which we want to change missing NA values or remove them entirely. For example, to change all NA values in a vector to 0, we could use logical indexing:
f <- c(1, 2, NA, NA, NA, 6, 7)
f
## [1] 1 2 NA NA NA 6 7
f[is.na(f)] <- 0
f
## [1] 1 2 0 0 0 6 7
Alternatively, there may be situations where we want to drop the NA values entirely (as if they were NULL), and thus shorten our vector:
g1 <- c(1, 2, NA, NA, NA, 6, 7)
g2 <- na.omit(g1)
g2
## [1] 1 2 6 7
## attr(,"na.action")
## [1] 3 4 5
## attr(,"class")
## [1] "omit"
We now have a shorter vector:
length(g1)
## [1] 7
length(g2)
## [1] 4
But that vector has been given an additional attribute: a vector of positions of omitted missing values:
attributes(g2)$na.action
## [1] 3 4 5
## attr(,"class")
## [1] "omit"
Many functions also provide the ability to exclude missing values from a calculation.
For example, to calculate the sum of g1
we could either use the na.omit
function or an na.rm
parameter in sum
:
sum(na.omit(g1))
## [1] 16
sum(g1, na.rm = TRUE)
## [1] 16
Both provide the same answer.
Many functions in R allow an na.rm
parameter (or something similar).
Missing data is a pain. It creates problems for simple and complicated analyses. It also tends to undermine our ability to make valid inferences. Most statistical packages tend to “brush missing data under the rug” and simply delete missing cases on the fly. This is nice because it makes analysis simple: e.g., if you want a mean of a variable with missing data, most packages drop the missing data and report the mean of the remaining values. But a different view is also credible: the assumption that we should discard missing values may be a bad assumption. For example, let's say that we want to build a regression model to explain two outcomes, but those outcome variables have different patterns of missing data. If we engage in “on-the-fly” case deletion, then we end up with two models that are built on different, non-comparable subsets of the original data. We are then limited in our ability to compare, e.g., the coefficients from one model to the other because the models are estimated on different data. Choosing how to deal with missing values is thus better done as an intentional activity early in the process of data analysis rather than as an analysis-specific assumption.
This tutorial demonstrates some basic missing data handling procedures. A separate tutorial on multiple imputation covers advanced techniques.
When R encounters missing data, its typical behavior is to attempt to perform the requested procedure and then return a missing (NA
) value as a result. We can see this if we attempt to calculate the mean of a vector containing missing data:
x <- c(1, 2, 3, NA, 5, 7, 9)
mean(x)
## [1] NA
R is telling us here that our vector contains missing data, so the requested statistic - the mean - is undefined for these data. If we want to do what many statistical packages do by default and calculate the mean after dropping the missing value, we just need to request that R remove the missing values using the na.rm=TRUE
argument:
mean(x, na.rm = TRUE)
## [1] 4.5
na.rm
can be found in many R functions, such as mean
, median
, sd
, var
, and so forth.
One exception to this is the summary
function when applied to a vector of data. By default it counts the missing values and then reports the mean, median, and other statistics calculated from the remaining values:
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 2.25 4.00 4.50 6.50 9.00 1
Another common function that handles missing values atypically is the correlation (cor
) function. Rather than accepting an na.rm
argument, it has a use
argument that specifies what set of cases to use when calculating the correlation coefficient. Its default behavior - like mean
, median
, etc. - is to attempt to calculate the correlation coefficient with use="everything"
. This can result in an NA
result:
y <- c(3, 2, 4, 5, 1, 3, 4)
cor(x, y)
## [1] NA
The use
argument can take several values (see ?cor
), but the two most useful are use="complete.obs"
and use="pairwise.complete.obs"
. The former deletes all cases with missing values before calculating the correlation. The latter applies when building a correlation matrix (i.e., correlations among more than two variables): instead of dropping all cases with any missing data, it drops cases only from each pairwise correlation calculation. We can see this if we build a three-variable matrix:
z <- c(NA, 2, 3, 5, 4, 3, 4)
m <- data.frame(x, y, z)
m
## x y z
## 1 1 3 NA
## 2 2 2 2
## 3 3 4 3
## 4 NA 5 5
## 5 5 1 4
## 6 7 3 3
## 7 9 4 4
cor(m) # returns all NAs
## x y z
## x 1 NA NA
## y NA 1 NA
## z NA NA 1
cor(m, use = "complete.obs")
## x y z
## x 1.0000 0.34819 0.70957
## y 0.3482 1.00000 0.04583
## z 0.7096 0.04583 1.00000
cor(m, use = "pairwise.complete.obs")
## x y z
## x 1.0000 0.2498 0.7096
## y 0.2498 1.0000 0.4534
## z 0.7096 0.4534 1.0000
Under default settings, the response is a matrix of NA
values. With use="complete.obs"
, the matrix m
first has all cases with missing values removed, then the correlation matrix is produced. Whereas with use="pairwise.complete.obs"
, the cases with missing values are only removed during the calculation of each pairwise correlation. Thus we see that the correlation between x
and z
is the same in both matrices but the correlation between y
and both x
and z
depends on the use
method (with dramatic effect).
Another place where missing data are handled atypically is in regression modeling. If we estimate a linear regression model for our x
, z
, and y
data, R will default to casewise deletion. We can see this here:
lm <- lm(y ~ x + z, data = m)
summary(lm)
##
## Call:
## lm(formula = y ~ x + z, data = m)
##
## Residuals:
## 2 3 5 6 7
## -0.632 1.711 -1.237 -0.447 0.605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.316 3.399 0.98 0.43
## x 0.289 0.408 0.71 0.55
## z -0.632 1.396 -0.45 0.70
##
## Residual standard error: 1.65 on 2 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.203, Adjusted R-squared: -0.594
## F-statistic: 0.254 on 2 and 2 DF, p-value: 0.797
The model, obviously, can only be fit to the available data, so the resulting fitted values have a different length from the original data:
length(m$y)
## [1] 7
length(lm$fitted)
## [1] 5
Thus, if we tried to store our fitted values back into our m
dataframe (e.g., using m$fitted <- lm$fitted
) or plot our model residuals against the original outcome y
(e.g., with plot(lm$residuals ~ m$y)
), we would encounter an error.
This is typical of statistical packages, but highlights that we should really address missing data before we start any of our analysis.
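One way to sidestep this particular length mismatch (a convenience, not a substitute for thinking about the missing data) is to estimate the model with na.action = na.exclude, which pads the fitted values and residuals with NA in the positions of the dropped cases:
lm2 <- lm(y ~ x + z, data = m, na.action = na.exclude)
length(fitted(lm2))              # now 7, with NA where cases were dropped
cbind(m, fitted = fitted(lm2))   # lines up with the original rows without error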
How do we deal with missing data globally? Basically, we need to decide how we're going to use our missing data, if at all, then either remove cases from our data or impute missing values, and then proceed with our analysis. As mentioned, one strategy is multiple imputation, which is addressed in a separate tutorial.
Before we deal with missing data, it is helpful to know where it lies in our data:
We can look for missing data in a vector by simply wrapping it in is.na
:
is.na(x)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
We can also do the same for an entire dataframe:
is.na(m)
## x y z
## [1,] FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] TRUE FALSE FALSE
## [5,] FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE
That works fine in our small example, but in a very large dataset, that could get quite difficult to understand. Therefore, it is helpful to visualize missing data in a plot. We can use the image
function to visualize the is.na(m)
matrix:
image(is.na(m), main = "Missing Values", xlab = "Observation", ylab = "Variable",
xaxt = "n", yaxt = "n", bty = "n")
axis(1, seq(0, 1, length.out = nrow(m)), 1:nrow(m), col = "white")
axis(2, c(0, 0.5, 1), names(m), col = "white", las = 2)
Note: The syntax here is a little bit tricky, but it is simply to make the plot easier to understand. See ?image
for more details.
The plot shows we have two missing values: one in our z
variable for observation 1 and one in our x
variable for observation 4.
This plot can help us understand where our missing data is and if we systematically observe missing data for certain types of observations.
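A compact numeric companion to the plot is to count the missing values per variable and per observation, since is.na(m) is just a logical matrix:
colSums(is.na(m))  # number of missing values in each variable
rowSums(is.na(m))  # number of missing values in each observation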
Once we know where our missing data are, we can deal with them in some way.
Casewise deletion is the easiest way to deal with missing data. It simply removes all cases that have missing data anywhere in the data.
To do casewise deletion, we simply use the na.omit
function on our entire dataframe:
na.omit(m)
## x y z
## 2 2 2 2
## 3 3 4 3
## 5 5 1 4
## 6 7 3 3
## 7 9 4 4
In our example data, this procedure removes two rows that contain missing values.
Note: using na.omit(m)
does not affect our original object m
. To use the new dataframe, we need to save it as an object:
m2 <- na.omit(m)
This lets us easily go back to our original data:
m
## x y z
## 1 1 3 NA
## 2 2 2 2
## 3 3 4 3
## 4 NA 5 5
## 5 5 1 4
## 6 7 3 3
## 7 9 4 4
m2
## x y z
## 2 2 2 2
## 3 3 4 3
## 5 5 1 4
## 6 7 3 3
## 7 9 4 4
Another strategy is some kind of imputation. There is an endless number of options here - and the best approach is probably multiple imputation, which is described elsewhere - but two ways to do simple, single imputation are to replace missing values with the mean of the other values in the variable or to sample randomly from those values. The former approach (mean imputation) preserves the mean of the variable, whereas the latter approach (random imputation) approximately preserves both the mean and the variance. Both might be unreasonable, but it's worth seeing how to do them. To do mean imputation we simply need to identify the missing values, calculate the mean of the remaining values, and store that mean in the missing-value positions:
x2 <- x
x2
## [1] 1 2 3 NA 5 7 9
is.na(x2)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
x2[is.na(x2)]
## [1] NA
mean(x2, na.rm = TRUE)
## [1] 4.5
x2[is.na(x2)] <- mean(x2, na.rm = TRUE)
x2
## [1] 1.0 2.0 3.0 4.5 5.0 7.0 9.0
To do random imputation is a bit more complicated because we need to sample the non-missing values with the sample
function, but the process is otherwise similar:
x3 <- x
x3[!is.na(x3)] # values from which we can sample
## [1] 1 2 3 5 7 9
x3[is.na(x3)] <- sample(x3[!is.na(x3)], sum(is.na(x3)), TRUE)
x3
## [1] 1 2 3 5 5 7 9
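Before moving on, we can check which statistical properties each approach (approximately) preserves by comparing the observed mean and variance with those of the two imputed vectors:
c(observed = mean(x, na.rm = TRUE), mean.imp = mean(x2), random.imp = mean(x3))
c(observed = var(x, na.rm = TRUE), mean.imp = var(x2), random.imp = var(x3))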
Thus these two imputation strategies produce different resulting data (and those data will reflect the statistical properties of the original data to varying extents), but they mean that all subsequent analysis will not have to worry about missing values.
One of the most important object classes for statistics in R is the “formula” class. Formula objects, while unimportant for R in general, are critical to many statistical tests and plots in R (as well as many add-on packages). Formulae convey a relationship among a set of variables in a simple, intuitive way. They are also data-independent, meaning that a formula can be constructed and then applied to different dataframes or subsets of a dataframe. This means we can define formulae without having any data loaded. Note: We did not discuss formulae in the tutorials on object classes because they are not one of the fundamental classes needed throughout R. They are only needed for statistical procedures, which happen to matter a lot in academic research but are less critical in other uses of R.
The basic structure of a formula is the tilde symbol (~
) and at least one independent (righthand) variable. In most (but not all) situations, a single dependent (lefthand) variable is also needed.
Thus we can construct a formula quite simply by just typing:
~x
## ~x
## <environment: 0x000000001c3d67b0>
Note: Spaces in formulae are not important. And, like any other object, we can store this as an R variable and see that it is, in fact, a formula:
myformula <- ~x
class(myformula)
## [1] "formula"
More commonly, we want to express a formula as a relationship between an outcome (lefthand) variable and one or more independent/predictor/covariate (righthand) variables:
myformula <- y ~ x
We can use multiple independent variables by simply separating them with the plus (+
) symbol:
y ~ x1 + x2
## y ~ x1 + x2
## <environment: 0x000000001c3d67b0>
If we use a minus (-
) symbol, the variables it precedes are excluded from the analysis:
y ~ x1 - x2
## y ~ x1 - x2
## <environment: 0x000000001c3d67b0>
One particularly helpful feature when modelling with lots of variables is the .
operator. When used in a formula, .
refers to all other variables in the matrix not yet included in the model. So, if we plan to run a regression on a matrix (or dataframe) containing the variables y
, x1
, z3
, and areallylongvariablename
, we can simply use the formula:
y ~ .
## y ~ .
## <environment: 0x000000001c3d67b0>
and avoid having to type all of the variables.
In a regression modeling context, we often need to specify interaction terms. There are two ways to do this.
If we want to include two variables and their interaction, we use the star/asterisk (*
) symbol:
y ~ x1 * x2
## y ~ x1 * x2
## <environment: 0x000000001c3d67b0>
If we only want their interaction, but not the variables themselves, we use the colon (:
) symbol:
y ~ x1:x2
## y ~ x1:x2
## <environment: 0x000000001c3d67b0>
Note: We usually don't want to include an interaction without its constituent terms. Because * expands to the main effects plus their interaction, some formulae that look different are actually equivalent. The following formulae will produce the same regression:
y ~ x1 * x2
## y ~ x1 * x2
## <environment: 0x000000001c3d67b0>
y ~ x1 + x2 + x1:x2
## y ~ x1 + x2 + x1:x2
## <environment: 0x000000001c3d67b0>
In regression models, we may also want to know a few other tricks.
One trick is to drop the intercept, by either including a zero (0
) or a minus-one (-1
) in the formula:
y ~ -1 + x1 * x2
## y ~ -1 + x1 * x2
## <environment: 0x000000001c3d67b0>
y ~ 0 + x1 * x2
## y ~ 0 + x1 * x2
## <environment: 0x000000001c3d67b0>
We can also offset the intercept of a model using the offset
function. The use is kind of strange and not that common, but we can increase the intercept by, e.g., 2 using:
y ~ x1 + offset(rep(-2, n))
## y ~ x1 + offset(rep(-2, n))
## <environment: 0x000000001c3d67b0>
or reduce the intercept by, e.g., 3 using:
y ~ x1 + offset(rep(3, n))
## y ~ x1 + offset(rep(3, n))
## <environment: 0x000000001c3d67b0>
Note: The n
here would have to be tailored to the number of observations in the actual data. It's unclear in what context this functionality is really helpful, but it does mean that models can be adjusted in fairly sophisticated ways.
An important consideration in regression formulae is the handling of factor-class variables. When a factor is included in a regression model, it is automatically converted into a series of indicator (“dummy”) variables, with the factor's first level treated as a baseline. This also means that we can convert non-factor variables into a series of dummies, simply by wrapping them in factor
:
y ~ x
## y ~ x
## <environment: 0x000000001c3d67b0>
# to:
y ~ factor(x)
## y ~ factor(x)
## <environment: 0x000000001c3d67b0>
One trick to formulas is that they don't evaluate their contents. So, for example, if we wanted to include x
and x^2
in our model, we might intuit that we should type:
y ~ x + x^2
## y ~ x + x^2
## <environment: 0x000000001c3d67b0>
If we attempted to estimate a regression model using this formula, R would drop the x^2
term because it thinks it is a duplicate of x
. We therefore have to either calculate and store all of the variables we want to include in the model in advance, or we need to use the I()
“as-is” operator.
To obtain our desired two-term formula, we could use I()
as follows:
y ~ x + I(x^2)
## y ~ x + I(x^2)
## <environment: 0x000000001c3d67b0>
This tells R to calculate the values of x^2
before attempting to use the formula.
Aside from calculating powers, I()
can also be helpful when we want to rescale a variable for a model (e.g., to make two coefficients more comparable by using a common scale). Again, we simply wrap the relevant variable name in I()
:
y ~ I(2 * x)
## y ~ I(2 * x)
## <environment: 0x000000001c3d67b0>
This formula would, in a linear regression, produce a coefficient half as large as the model for y~x
.
One might be tempted to compare a formula to a character string. They look similar, but they are different. Their similarity means, however, that a character string containing a formula can often be used where a formula-class object is required. Indeed the following is true:
("y ~ x") == (y ~ x)
## [1] TRUE
And we can easily convert between formula and character class:
as.formula("y~x")
## y ~ x
## <environment: 0x000000001c3d67b0>
as.character(y ~ x)
## [1] "~" "y" "x"
Note: The result of the latter is probably not what you expected, but it relates to how formulae are indexed:
(y ~ x)[1]
## `~`()
## <environment: 0x000000001c3d67b0>
(y ~ x)[2]
## y()
(y ~ x)[3]
## x()
The ability to easily transform between formula and character class means that we can also build formulae on the fly using paste
. For example, if we want to add righthand variables to a formula, we can simply paste them:
paste("y~x", "x2", "x3", sep = "+")
## [1] "y~x+x2+x3"
One of the really nice features of formulae is that they have many methods. For example, we can use the terms
function to examine and compare different formulae:
terms(y ~ x1 + x2)
## y ~ x1 + x2
## attr(,"variables")
## list(y, x1, x2)
## attr(,"factors")
## x1 x2
## y 0 0
## x1 1 0
## x2 0 1
## attr(,"term.labels")
## [1] "x1" "x2"
## attr(,"order")
## [1] 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: 0x000000001c3d67b0>
terms(y ~ 0 + x1)
## y ~ 0 + x1
## attr(,"variables")
## list(y, x1)
## attr(,"factors")
## x1
## y 0
## x1 1
## attr(,"term.labels")
## [1] "x1"
## attr(,"order")
## [1] 1
## attr(,"intercept")
## [1] 0
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: 0x000000001c3d67b0>
terms(~x1 + x2)
## ~x1 + x2
## attr(,"variables")
## list(x1, x2)
## attr(,"factors")
## x1 x2
## x1 1 0
## x2 0 1
## attr(,"term.labels")
## [1] "x1" "x2"
## attr(,"order")
## [1] 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 0
## attr(,".Environment")
## <environment: 0x000000001c3d67b0>
The output above shows the formula itself, a list of its constituent variables, whether an intercept is included, whether a response is present, and so forth.
If we just want to know the names of the variables in the model, we can use all.vars
:
all.vars(y ~ x1 + x2)
## [1] "y" "x1" "x2"
We can also modify formulae without converting them to character (as we did above), using the update
function. This potentially saves a lot of typing:
update(y ~ x, ~. + x2)
## y ~ x + x2
## <environment: 0x000000001c3d67b0>
update(y ~ x, z ~ .)
## z ~ x
## <environment: 0x000000001c3d67b0>
This could be used, e.g., to estimate a “small” model and then a larger version:
myformula <- y ~ a + b + c
update(myformula, "~.+d+e+f")
## y ~ a + b + c + d + e + f
## <environment: 0x000000001c3d67b0>
Or to use the same righthand variables to predict two different outcomes:
update(myformula, "z~.")
## z ~ a + b + c
## <environment: 0x000000001c3d67b0>
We can also drop terms using update
:
update(myformula, "~.-a")
## y ~ b + c
## <environment: 0x000000001c3d67b0>
One important, but sometimes problematic, class of regression models deals with nominal or multinomial outcomes (i.e., outcomes that are not continuous or even ordered). Estimating these models is not possible with glm
, but they can be estimated using the nnet add-on package, which is a “recommended” package distributed with R and therefore simply needs to be loaded.
Let's start by loading the package, or installing then loading it if it isn't already on our system:
install.packages("nnet", repos = "http://cran.r-project.org")
## Warning: package 'nnet' is in use and will not be installed
library(nnet)
Then let's create some simple bivariate data where the outcome y
takes three values:
set.seed(100)
y <- sort(sample(1:3, 600, TRUE))
x <- numeric(length = 600)
x[1:200] <- -1 * x[1:200] + rnorm(200, 4, 2)
x[201:400] <- 1 * x[201:400] + rnorm(200)
x[401:600] <- 2 * x[401:600] + rnorm(200, 2, 2)
We can plot the data to see what's going on:
plot(y ~ x, col = rgb(0, 0, 0, 0.3), pch = 19)
abline(lm(y ~ x), col = "red") # a badly fitted regression line
Clearly, there is a relationship between x
and y
, but it's certainly not linear and if we tried to draw a line through the data (i.e., the straight red regression line), many of the predicted values would be problematic because y
can only take on the discrete values 1, 2, and 3 and, in fact, the line hardly fits the data at all.
We might therefore rely on a multinomial model, which will give us the coefficients for x
for each level of the outcome. In other words, the coefficients from a multinomial logistic model express effects in terms of moving from the baseline category of the outcome to the other levels of the outcome (essentially combining several binary logistic regression models into a single model).
Let's look at the output from the multinom
function to see what these results look like:
m1 <- multinom(y ~ x)
## # weights: 9 (4 variable)
## initial value 659.167373
## iter 10 value 535.823756
## iter 10 value 535.823754
## final value 535.823754
## converged
summary(m1)
## Call:
## multinom(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 2 1.849 -0.8620
## 3 1.126 -0.3208
##
## Std. Errors:
## (Intercept) x
## 2 0.1900 0.07096
## 3 0.1935 0.05141
##
## Residual Deviance: 1072
## AIC: 1080
Our model only consists of one covariate, but we now see two intercept coefficients and two slope coefficients because the model is telling us the relationship between x
and y
in terms of moving from category 1 to category 2 in y
and from category 1 to category 3 in y
, respectively. The standard errors are printed below the coefficients.
Unfortunately, it's almost impossible to interpret the coefficients here because a unit change in x
has some kind of negative impact on both levels of y
, but we don't know how much.
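One partial aid to interpretation (an extra step, not shown in the original output) is to exponentiate the coefficients, which expresses them as relative risk ratios for each contrast against the baseline category:
exp(coef(m1))  # relative risk ratios versus the baseline category (y == 1)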
A better way to examine the effects in a multinomial model is to look at predicted probabilities. We need to start with some new data representing the full scale of the x
variable:
newdata <- data.frame(x = seq(min(x), max(x), length.out = 100))
Like with binary models, we can extract different kinds of predictions from the model using predict
. The first type of prediction is simply the fitted “class” or level of the outcome:
p1 <- predict(m1, newdata, type = "class")
The second is a predicted probability of being in each category of y
. In other words, for each value of our new data, predict
with type="probs"
will return, in our example, three predicted probabilities.
p2 <- predict(m1, newdata, type = "probs")
These probabilities also all sum to a value of 1, which means that the model requires that the categories of y
be mutually exclusive and exhaustive. There's no opportunity for x
to predict a value outside of those included in the model.
You can verify this using rowSums
:
rowSums(p2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 1 1 1 1 1
If you want to relax this constraint, you can separately model your data using two or more binary logistic regressions comparing different categories. For example, we could model the data, predicting y==2
against y==1
and separately y==3
against y==1
. To do this, we can create two subsets of our original data, one to model y==2
against y==1 and the other to model y==3 against y==1
.
df1 <- data.frame(x = x, y = y)[y %in% c(1, 2), ]
df1$y <- df1$y - 1 # recode 2 to 1 and 1 to 0
df2 <- data.frame(x = x, y = y)[y %in% c(1, 3), ]
df2$y[df2$y == 1] <- 0 # recode 1 to 0
df2$y[df2$y == 3] <- 1 # recode 3 to 1
We can then model this and compare to the coefficients from the multinomial model:
coef(glm(y ~ x, data = df1, family = binomial)) # predict 2 against 1
## (Intercept) x
## 1.8428 -0.8839
coef(glm(y ~ x, data = df2, family = binomial)) # predict 3 against 1
## (Intercept) x
## 1.1102 -0.3144
coef(m1) # multinomial model
## (Intercept) x
## 2 1.849 -0.8620
## 3 1.126 -0.3208
Clearly, the coefficients from the two modeling strategies are similar, but not identical. The multinomial model probably imposes a more plausible assumption (that the predicted probabilities sum to 1), but you can easily try both approaches.
One way to visualize the results of a multinomial model is simply to plot our fitted values for y
on top of our original data:
plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 19)
lines(newdata$x, p1, col = rgb(1, 0, 0, 0.75), lwd = 5)
This plot shows that as we increase along x
, observations are first likely to be in y==2
, then y==3
, and finally y==1
. Unfortunately, using the lines
function gives a slight misrepresentation because of the vertical discontinuities. We can draw three separate lines to have a more accurate picture:
plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 19)
lines(newdata$x[p1 == 1], p1[p1 == 1], col = "red", lwd = 5)
lines(newdata$x[p1 == 2], p1[p1 == 2], col = "red", lwd = 5)
lines(newdata$x[p1 == 3], p1[p1 == 3], col = "red", lwd = 5)
Plotting fitted values is helpful, but doesn't give us a sense of uncertainty. Obviously the red lines in the previous plots show the category that we are most likely to observe for a given value of x
, but they don't show us how likely an observation is to be in the other categories.
To see that, we need to look at predicted probabilities. Let's start by looking at the predicted probabilities object p2
:
head(p2)
## 1 2 3
## 1 0.01083 0.9021 0.08702
## 2 0.01210 0.8949 0.09298
## 3 0.01350 0.8872 0.09929
## 4 0.01506 0.8790 0.10596
## 5 0.01678 0.8702 0.11300
## 6 0.01868 0.8609 0.12041
As stated above, this object contains a predicted probability of being in each category of y
for a given value of x
.
The simplest plot of this is simply three lines, each of which is color coded to represent categories 1,2, and 3 of y
, respectively:
plot(NA, xlim = c(min(x), max(x)), ylim = c(0, 1), xlab = "x", ylab = "Predicted Probability")
lines(newdata$x, p2[, 1], col = "red", lwd = 2)
lines(newdata$x, p2[, 2], col = "blue", lwd = 2)
lines(newdata$x, p2[, 3], col = "green", lwd = 2)
# some text labels help clarify things:
text(9, 0.75, "y==1", col = "red")
text(6, 0.4, "y==3", col = "green")
text(5, 0.15, "y==2", col = "blue")
This plot gives us a bit more information than simply plotting predicted classes (as above). We now see that middling values of x
are only somewhat more likely to be in category y==3
than in the other categories, whereas at extreme values of x
, the data are much more likely to be in categories y==1
and y==2
.
A slightly more attractive variant of this uses the polygon
plotting function rather than lines
. Text labels help identify which region corresponds to which category, and we can also optionally add a horizontal bar at the base of the plot to highlight the predicted class for each value of x
:
plot(NA, xlim = c(min(x), max(x)), ylim = c(0, 1), xlab = "x", ylab = "Predicted Probability",
bty = "l")
# polygons
polygon(c(newdata$x, rev(newdata$x)), c(p2[, 1], rep(0, nrow(p2))), col = rgb(1,
0, 0, 0.3), border = rgb(1, 0, 0, 0.3))
polygon(c(newdata$x, rev(newdata$x)), c(p2[, 2], rep(0, nrow(p2))), col = rgb(0,
0, 1, 0.3), border = rgb(0, 0, 1, 0.3))
polygon(c(newdata$x, rev(newdata$x)), c(p2[, 3], rep(0, nrow(p2))), col = rgb(0,
1, 0, 0.3), border = rgb(0, 1, 0, 0.3))
# text labels
text(9, 0.4, "y=1", font = 2)
text(2.5, 0.4, "y=3", font = 2)
text(-1.5, 0.4, "y=2", font = 2)
# optionally highlight predicted class:
lines(newdata$x[p1 == 1], rep(0, sum(p1 == 1)), col = "red", lwd = 3)
lines(newdata$x[p1 == 2], rep(0, sum(p1 == 2)), col = "blue", lwd = 3)
lines(newdata$x[p1 == 3], rep(0, sum(p1 == 3)), col = "green", lwd = 3)
This plot nicely highlights both the fitted class and the uncertainty associated with similar predicted probabilities at some values of x
.
Multinomial regression models can be difficult to interpret, but taking the few simple steps to estimate predicted probabilities and fitted classes and then plotting those estimates in some way can make the models much more intuitive.
This tutorial covers techniques of multiple imputation. Multiple imputation is a strategy for dealing with missing data. Whereas we typically (i.e., automatically) deal with missing data through casewise deletion of any observations that have missing values on key variables, imputation attempts to replace missing values with an estimated value. In single imputation, we guess that missing value one time (perhaps based on the means of observed values, or a random sampling of those values). In multiple imputation, we instead draw multiple values for each missing value, effectively building multiple datasets, each of which replaces the missing data in a different way. There are numerous algorithms for this, each of which builds those multiple datasets in different ways. We're not going to discuss the details here, but instead focus on executing multiple imputation in R. The main challenge of multiple imputation is not the analysis (it simply proceeds as usual on each imputed dataset) but instead the aggregation of those separate analyses. The examples below discuss how to do this.
To get a basic feel for the process, let's imagine that we're trying to calculate the mean of a vector of values that contains missing values. We can impute the missing values by drawing from the observed values, repeat the process several times, and then average across the estimated means to get an estimate of the mean with a measure of uncertainty that accounts for the uncertainty due to imputation. Let's create a vector of ten values, seven of which we observe and three of which are missing, and imagine that they are random draws from the population whose mean we're trying to estimate:
set.seed(10)
x <- c(sample(1:10, 7, TRUE), rep(NA, 3))
x
## [1] 6 4 5 7 1 3 3 NA NA NA
We can find the mean using case deletion:
mean(x, na.rm = TRUE)
## [1] 4.143
Our estimate of the sample standard error is then:
sd(x, na.rm = TRUE)/sqrt(sum(!is.na(x)))
## [1] 0.7693
Now let's impute several times to generate a list of imputed vectors:
imp <- replicate(15, c(x[!is.na(x)], sample(x[!is.na(x)], 3, TRUE)), simplify = FALSE)
imp
## [[1]]
## [1] 6 4 5 7 1 3 3 4 1 7
##
## [[2]]
## [1] 6 4 5 7 1 3 3 1 7 6
##
## [[3]]
## [1] 6 4 5 7 1 3 3 1 5 7
##
## [[4]]
## [1] 6 4 5 7 1 3 3 6 4 5
##
## [[5]]
## [1] 6 4 5 7 1 3 3 3 3 1
##
## [[6]]
## [1] 6 4 5 7 1 3 3 3 5 5
##
## [[7]]
## [1] 6 4 5 7 1 3 3 1 3 4
##
## [[8]]
## [1] 6 4 5 7 1 3 3 3 5 7
##
## [[9]]
## [1] 6 4 5 7 1 3 3 6 4 3
##
## [[10]]
## [1] 6 4 5 7 1 3 3 5 3 3
##
## [[11]]
## [1] 6 4 5 7 1 3 3 3 1 7
##
## [[12]]
## [1] 6 4 5 7 1 3 3 4 4 6
##
## [[13]]
## [1] 6 4 5 7 1 3 3 3 4 4
##
## [[14]]
## [1] 6 4 5 7 1 3 3 6 7 6
##
## [[15]]
## [1] 6 4 5 7 1 3 3 3 5 3
The result is a list of fifteen vectors. The first seven values of each are the same as our original data, but the three missing values have been replaced with different combinations of the observed values. To get our new estimated mean, we simply take the mean of each vector, and then average across them:
means <- sapply(imp, mean)
means
## [1] 4.1 4.3 4.2 4.4 3.6 4.2 3.7 4.4 4.2 4.0 4.0 4.3 4.0 4.8 4.0
grandm <- mean(means)
grandm
## [1] 4.147
The result is 4.147, about the same as our original estimate. To get the standard error of our multiple imputation estimate, we need to combine the standard errors of each of our estimates, so that means we need to start by getting the SEs of each imputed vector:
ses <- sapply(imp, sd)/sqrt(10)
Aggregating the standard errors is a bit complicated, but basically sums the mean of the SEs (i.e., the “within-imputation variance”) with the variance across the different estimated means (the “between-imputation variance”). To calculate the within-imputation variance, we simply average the SE estimates:
within <- mean(ses)
To calculate the between-imputation variance, we calculate the sum of squared deviations of each imputed mean from the grand mean estimate:
between <- sum((means - grandm)^2)/(length(imp) - 1)
Then we sum the within- and between-imputation variances (multiply the latter by a small correction):
grandvar <- within + ((1 + (1/length(imp))) * between)
grandse <- sqrt(grandvar)
grandse
## [1] 0.8387
The resulting standard error is interesting because we increase the precision of our estimate by using 10 rather than 7 values (and standard errors shrink as the sample size grows), but it is larger than our original standard error because we have to account for uncertainty due to imputation. Thus if our missing values are truly missing at random, we can get a better estimate that is actually representative of our original population. Most multiple imputation algorithms are, however, applied to multivariate data rather than a single data vector and thereby use additional information about the relationship between observed values and missingness to reach even more precise estimates of target parameters.
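For reuse, the steps above can be wrapped in a small helper function. This is only a sketch that repeats the calculation we just did by hand; it is not a function from any of the packages introduced below:
# sketch: pool per-imputation estimates and standard errors (mirrors the steps above)
pool_simple <- function(estimates, ses) {
    m <- length(estimates)
    within <- mean(ses)                                       # within-imputation component
    between <- sum((estimates - mean(estimates))^2)/(m - 1)   # between-imputation variance
    grandvar <- within + (1 + 1/m) * between
    c(estimate = mean(estimates), se = sqrt(grandvar))
}
pool_simple(means, ses)  # should reproduce the grandm and grandse values above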
There are three main R packages that offer multiple imputation techniques. Several other packages - described in the OfficialStatistics Task View - supply other imputation techniques, but packages Amelia (by Gary King and collaborators), mi (by Andrew Gelman and collaborators), and mice (by Stef van Buuren and collaborators) provide more than enough to work with. Let's start by installing these packages:
install.packages(c("Amelia", "mi", "mice"), repos = "http://cran.r-project.org")
## Warning: packages 'Amelia', 'mi', 'mice' are in use and will not be
## installed
Now, let's consider an imputation situation where we plan to conduct a regression analysis predicting y
by two covariates: x1
and x2
but we have missing data in x1
and x2
. Let's start by creating the dataframe:
x1 <- runif(100, 0, 5)
x2 <- rnorm(100)
y <- x1 + x2 + rnorm(100)
mydf <- cbind.data.frame(x1, x2, y)
Now, let's randomly remove some of the observed values of the independent variables:
mydf$x1[sample(1:nrow(mydf), 20, FALSE)] <- NA
mydf$x2[sample(1:nrow(mydf), 10, FALSE)] <- NA
The result is the removal of thirty values, 20 from x1
and 10 from x2
:
summary(mydf)
## x1 x2 y
## Min. :0.098 Min. :-2.321 Min. :-1.35
## 1st Qu.:1.138 1st Qu.:-0.866 1st Qu.: 1.17
## Median :2.341 Median : 0.095 Median : 2.39
## Mean :2.399 Mean :-0.038 Mean : 2.28
## 3rd Qu.:3.626 3rd Qu.: 0.724 3rd Qu.: 3.69
## Max. :4.919 Max. : 2.221 Max. : 6.26
## NA's :20 NA's :10
If we estimate the regression on these data, R will force casewise deletion of 28 cases:
lm <- lm(y ~ x1 + x2, data = mydf)
summary(lm)
##
## Call:
## lm(formula = y ~ x1 + x2, data = mydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5930 -0.7222 0.0018 0.7140 2.4878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0259 0.2196 0.12 0.91
## x1 0.9483 0.0824 11.51 < 2e-16 ***
## x2 0.7487 0.1203 6.23 3.3e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.969 on 69 degrees of freedom
## (28 observations deleted due to missingness)
## Multiple R-squared: 0.706, Adjusted R-squared: 0.698
## F-statistic: 82.9 on 2 and 69 DF, p-value: <2e-16
We should thus be quite skeptical of our results given that we're discarding a substantial portion of our observations (28%, in fact). Let's see how the various multiple imputation packages address this and affect our inference.
library(Amelia)
imp.amelia <- amelia(mydf)
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7 8 9
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6 7
##
## -- Imputation 3 --
##
## 1 2 3 4 5 6
##
## -- Imputation 4 --
##
## 1 2 3 4 5 6
##
## -- Imputation 5 --
##
## 1 2 3 4 5 6 7
Once we've run our multiple imputation, we can see where our missing data lie:
missmap(imp.amelia)
We can also run our regression model on each imputed dataset. We'll use the lapply
function to do this quickly on each of the imputed dataframes:
lm.amelia.out <- lapply(imp.amelia$imputations, function(i) lm(y ~ x1 + x2,
data = i))
If we look at lm.amelia.out
we'll see the results of the model run on each imputed dataframe separately:
lm.amelia.out
## $imp1
##
## Call:
## lm(formula = y ~ x1 + x2, data = i)
##
## Coefficients:
## (Intercept) x1 x2
## 0.247 0.854 0.707
##
##
## $imp2
##
## Call:
## lm(formula = y ~ x1 + x2, data = i)
##
## Coefficients:
## (Intercept) x1 x2
## 0.164 0.931 0.723
##
##
## $imp3
##
## Call:
## lm(formula = y ~ x1 + x2, data = i)
##
## Coefficients:
## (Intercept) x1 x2
## 0.0708 0.9480 0.8234
##
##
## $imp4
##
## Call:
## lm(formula = y ~ x1 + x2, data = i)
##
## Coefficients:
## (Intercept) x1 x2
## 0.0656 0.9402 0.6446
##
##
## $imp5
##
## Call:
## lm(formula = y ~ x1 + x2, data = i)
##
## Coefficients:
## (Intercept) x1 x2
## -0.064 0.956 0.820
Aggregating across the results is a little bit tricky because we have to extract the coefficients and standard errors from each model, format them in a particular way, and then feed that structure into the mi.meld
function:
coefs.amelia <- do.call(rbind, lapply(lm.amelia.out, function(i) coef(summary(i))[,
1]))
ses.amelia <- do.call(rbind, lapply(lm.amelia.out, function(i) coef(summary(i))[,
2]))
mi.meld(coefs.amelia, ses.amelia)
## $q.mi
## (Intercept) x1 x2
## [1,] 0.09683 0.926 0.7436
##
## $se.mi
## (Intercept) x1 x2
## [1,] 0.2291 0.08222 0.1267
Now let's compare these results to those of our original model:
t(do.call(rbind, mi.meld(coefs.amelia, ses.amelia)))
## [,1] [,2]
## (Intercept) 0.09683 0.22908
## x1 0.92598 0.08222
## x2 0.74359 0.12674
coef(summary(lm))[, 1:2] # original results
## Estimate Std. Error
## (Intercept) 0.02587 0.21957
## x1 0.94835 0.08243
## x2 0.74874 0.12026
library(mi)
Let's start by visualizing the missing data:
mp.plot(mydf)
We can then see some summary information about the dataset and the nature of the missingness:
mi.info(mydf)
## names include order number.mis all.mis type collinear
## 1 x1 Yes 1 20 No positive-continuous No
## 2 x2 Yes 2 10 No continuous No
## 3 y Yes NA 0 No continuous No
With that information confirmed, it is incredibly easy to conduct our multiple imputation using the mi
function:
imp.mi <- mi(mydf)
## Beginning Multiple Imputation ( Wed Nov 13 22:07:34 2013 ):
## Iteration 1
## Chain 1 : x1* x2*
## Chain 2 : x1* x2*
## Chain 3 : x1* x2*
## Iteration 2
## Chain 1 : x1* x2
## Chain 2 : x1* x2*
## Chain 3 : x1* x2*
## Iteration 3
## Chain 1 : x1* x2*
## Chain 2 : x1* x2
## Chain 3 : x1* x2
## Iteration 4
## Chain 1 : x1 x2
## Chain 2 : x1* x2
## Chain 3 : x1 x2
## Iteration 5
## Chain 1 : x1* x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2*
## Iteration 6
## Chain 1 : x1* x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2*
## Iteration 7
## Chain 1 : x1 x2*
## Chain 2 : x1* x2
## Chain 3 : x1 x2
## Iteration 8
## Chain 1 : x1 x2
## Chain 2 : x1* x2
## Chain 3 : x1* x2
## Iteration 9
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1* x2
## Iteration 10
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 11
## Chain 1 : x1* x2*
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 12
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 13
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 14
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## mi converged ( Wed Nov 13 22:07:37 2013 )
## Run 20 more iterations to mitigate the influence of the noise...
## Beginning Multiple Imputation ( Wed Nov 13 22:07:37 2013 ):
## Iteration 1
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 2
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 3
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 4
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 5
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 6
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 7
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 8
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 9
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 10
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 11
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 12
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 13
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 14
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 15
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 16
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 17
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 18
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 19
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## Iteration 20
## Chain 1 : x1 x2
## Chain 2 : x1 x2
## Chain 3 : x1 x2
## mi converged ( Wed Nov 13 22:07:41 2013 )
imp.mi
##
## Multiply imputed data set
##
## Call:
## .local(object = object, n.iter = ..3, R.hat = ..4, max.minutes = ..2,
## run.past.convergence = TRUE)
##
## Number of multiple imputations: 3
##
## Number and proportion of missing data per column:
## names type number.mis proportion
## 1 x1 positive-continuous 20 0.2
## 2 x2 continuous 10 0.1
## 3 y continuous 0 0.0
##
## Total Cases: 100
## Missing at least one item: 2
## Complete cases: 72
The results above report how many imputed datasets were produced and summarizes some of the results we saw above. For linear regression (and several other common models), the mi package includes functions that automatically run the model on each imputed dataset and aggregate the results:
lm.mi.out <- lm.mi(y ~ x1 + x2, imp.mi)
We can extract the results using the following:
coef.mi <- lm.mi.out@mi.pooled
# or see them quickly with:
display(lm.mi.out)
## =======================================
## Separate Estimates for each Imputation
## =======================================
##
## ** Chain 1 **
## lm(formula = formula, data = mi.data[[i]])
## coef.est coef.se
## (Intercept) -0.01 0.18
## x1 0.96 0.06
## x2 0.75 0.09
## ---
## n = 100, k = 3
## residual sd = 0.89, R-Squared = 0.75
##
## ** Chain 2 **
## lm(formula = formula, data = mi.data[[i]])
## coef.est coef.se
## (Intercept) 0.03 0.18
## x1 0.94 0.07
## x2 0.67 0.09
## ---
## n = 100, k = 3
## residual sd = 0.91, R-Squared = 0.74
##
## ** Chain 3 **
## lm(formula = formula, data = mi.data[[i]])
## coef.est coef.se
## (Intercept) -0.02 0.20
## x1 0.96 0.07
## x2 0.69 0.10
## ---
## n = 100, k = 3
## residual sd = 0.96, R-Squared = 0.71
##
## =======================================
## Pooled Estimates
## =======================================
## lm.mi(formula = y ~ x1 + x2, mi.object = imp.mi)
## coef.est coef.se
## (Intercept) 0.00 0.19
## x1 0.95 0.07
## x2 0.71 0.10
## ---
Let's compare these results to our original model:
do.call(cbind, coef.mi) # multiply imputed results
## coefficients se
## (Intercept) 0.00123 0.18878
## x1 0.95311 0.06901
## x2 0.70687 0.10411
coef(summary(lm))[, 1:2] # original results
## Estimate Std. Error
## (Intercept) 0.02587 0.21957
## x1 0.94835 0.08243
## x2 0.74874 0.12026
library(mice)
To conduct the multiple imputation, we simply need to run the mice
function:
imp.mice <- mice(mydf)
##
## iter imp variable
## 1 1 x1 x2
## 1 2 x1 x2
## 1 3 x1 x2
## 1 4 x1 x2
## 1 5 x1 x2
## 2 1 x1 x2
## 2 2 x1 x2
## 2 3 x1 x2
## 2 4 x1 x2
## 2 5 x1 x2
## 3 1 x1 x2
## 3 2 x1 x2
## 3 3 x1 x2
## 3 4 x1 x2
## 3 5 x1 x2
## 4 1 x1 x2
## 4 2 x1 x2
## 4 3 x1 x2
## 4 4 x1 x2
## 4 5 x1 x2
## 5 1 x1 x2
## 5 2 x1 x2
## 5 3 x1 x2
## 5 4 x1 x2
## 5 5 x1 x2
We can see some summary information about the imputation process:
summary(imp.mice)
## Multiply imputed data set
## Call:
## mice(data = mydf)
## Number of multiple imputations: 5
## Missing cells per column:
## x1 x2 y
## 20 10 0
## Imputation methods:
## x1 x2 y
## "pmm" "pmm" ""
## VisitSequence:
## x1 x2
## 1 2
## PredictorMatrix:
## x1 x2 y
## x1 0 1 1
## x2 1 0 1
## y 0 0 0
## Random generator seed value: NA
To run our regression we use the lm
function wrapped in a with
call, which estimates our model on each imputed dataframe:
lm.mice.out <- with(imp.mice, lm(y ~ x1 + x2))
summary(lm.mice.out)
##
## ## summary of imputation 1 :
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7893 -0.7488 0.0955 0.7205 2.3768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0636 0.1815 0.35 0.73
## x1 0.9528 0.0660 14.44 < 2e-16 ***
## x2 0.6548 0.0916 7.15 1.6e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.915 on 97 degrees of freedom
## Multiple R-squared: 0.735, Adjusted R-squared: 0.73
## F-statistic: 135 on 2 and 97 DF, p-value: <2e-16
##
##
## ## summary of imputation 2 :
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4456 -0.6911 0.0203 0.6839 2.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1618 0.1855 0.87 0.39
## x1 0.8849 0.0671 13.19 < 2e-16 ***
## x2 0.7424 0.0942 7.88 4.7e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.939 on 97 degrees of freedom
## Multiple R-squared: 0.721, Adjusted R-squared: 0.716
## F-statistic: 126 on 2 and 97 DF, p-value: <2e-16
##
##
## ## summary of imputation 3 :
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5123 -0.6683 0.0049 0.6717 2.5072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0800 0.1872 0.43 0.67
## x1 0.9402 0.0674 13.95 < 2e-16 ***
## x2 0.8150 0.0947 8.61 1.3e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.94 on 97 degrees of freedom
## Multiple R-squared: 0.721, Adjusted R-squared: 0.715
## F-statistic: 125 on 2 and 97 DF, p-value: <2e-16
##
##
## ## summary of imputation 4 :
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5739 -0.7310 0.0152 0.6534 2.4748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0787 0.1815 0.43 0.67
## x1 0.9438 0.0660 14.30 < 2e-16 ***
## x2 0.7833 0.0916 8.55 1.8e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.922 on 97 degrees of freedom
## Multiple R-squared: 0.732, Adjusted R-squared: 0.726
## F-statistic: 132 on 2 and 97 DF, p-value: <2e-16
##
##
## ## summary of imputation 5 :
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6130 -0.6085 0.0085 0.6907 2.4719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0477 0.1858 0.26 0.8
## x1 0.9463 0.0672 14.07 < 2e-16 ***
## x2 0.7436 0.0893 8.33 5.4e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.91 on 97 degrees of freedom
## Multiple R-squared: 0.738, Adjusted R-squared: 0.733
## F-statistic: 137 on 2 and 97 DF, p-value: <2e-16
The results above are for each separate dataset. But, to pool them, we use pool
:
pool.mice <- pool(lm.mice.out)
Let's compare these results to our original model:
summary(pool.mice) # multiply imputed results
## est se t df Pr(>|t|) lo 95 hi 95 nmis
## (Intercept) 0.08637 0.19056 0.4533 81.41 6.516e-01 -0.2927 0.4655 NA
## x1 0.93361 0.07327 12.7422 50.19 0.000e+00 0.7865 1.0808 20
## x2 0.74782 0.11340 6.5944 22.53 1.104e-06 0.5130 0.9827 10
## fmi lambda
## (Intercept) 0.08664 0.06447
## x1 0.20145 0.17025
## x2 0.38954 0.33765
coef(summary(lm))[, 1:2] # original results
## Estimate Std. Error
## (Intercept) 0.02587 0.21957
## x1 0.94835 0.08243
## x2 0.74874 0.12026
It is useful at this point to compare the coefficients from each of our multiple imputation methods. To do so, we'll pull out the coefficients from each of the three packages' results, our original observed results (with case deletion), and the results for the real data-generating process (before we introduced missingness).
Amelia package results
s.amelia <- t(do.call(rbind, mi.meld(coefs.amelia, ses.amelia)))
mi package results
s.mi <- do.call(cbind, coef.mi) # multiply imputed results
mice package results
s.mice <- summary(pool.mice)[, 1:2] # multiply imputed results
Original results (case deletion)
s.orig <- coef(summary(lm))[, 1:2] # original results
Real results (before missingness was introduced)
s.real <- summary(lm(y ~ x1 + x2))$coef[, 1:2]
Let's print the coefficients together to compare them:
allout <- cbind(s.real[, 1], s.amelia[, 1], s.mi[, 1], s.mice[, 1], s.orig[,
1])
colnames(allout) <- c("Real Relationship", "Amelia", "MI", "mice", "Original")
allout
## Real Relationship Amelia MI mice Original
## (Intercept) 0.04502 0.09683 0.00123 0.08637 0.02587
## x1 0.95317 0.92598 0.95311 0.93361 0.94835
## x2 0.82900 0.74359 0.70687 0.74782 0.74874
All three of the multiple imputation models - despite vast differences in underlying approaches to imputation in the three packages - yield strikingly similar inference. This was a relatively basic example, and all of the packages offer a number of options for more complicated situations than what we examined here. While executing multiple imputation requires choosing a package and typing some potentially tedious code, the results are almost always going to be better than doing the easier thing of deleting cases and ignoring the consequences thereof.
The bivariate OLS tutorial covers most of the details of model building and output, so this tutorial is comparatively short. It addresses some additional details about multivariate OLS models.
We'll begin by generating some fake data involving a few covariates. We'll then generate two outcomes, one that is a simple linear function of the covariates and one that involves an interaction.
set.seed(50)
n <- 200
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n)
x3 <- rnorm(n, 0, 4)
y1 <- x1 + x2 + x3 + rnorm(n)
y2 <- x1 + x2 + x3 + 2 * x1 * x2 + rnorm(n)
Now we can see how to model each of these processes.
As covered in the formulae tutorial, we can easily represent a multivariate model using a formula just like we did for a bivariate model. For example, a bivariate model might look like:
y1 ~ x1
## y1 ~ x1
## <environment: 0x000000001c3d67b0>
And a multivariate model would look like:
y1 ~ x1 + x2 + x3
## y1 ~ x1 + x2 + x3
## <environment: 0x000000001c3d67b0>
To include the interaction we need to use the *
operator, though we could also write out the main effects and the interaction term explicitly for the same result:
y1 ~ x1 * x2 + x3
## y1 ~ x1 * x2 + x3
## <environment: 0x000000001c3d67b0>
y1 ~ x1 + x2 + x1 * x2 + x3
## y1 ~ x1 + x2 + x1 * x2 + x3
## <environment: 0x000000001c3d67b0>
The order of variables in a regression formula doesn't matter. Generally, R will print out the regression results in the order that the variables are listed in the formula, but there are exceptions. For example, all interactions are listed after main effects, as we'll see below.
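As a quick illustration (not part of the original walkthrough), reordering the righthand side changes only the display order, not the estimates:
# same model, two orderings of the covariates
coef(lm(y1 ~ x1 + x2 + x3))
coef(lm(y1 ~ x3 + x2 + x1))  # identical estimates, different display order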
Estimating a multivariate model is just like a bivariate model:
lm(y1 ~ x1 + x2 + x3)
##
## Call:
## lm(formula = y1 ~ x1 + x2 + x3)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.111 0.924 1.012 0.993
We can do the same with an interaction model:
lm(y1 ~ x1 * x2 + x3)
##
## Call:
## lm(formula = y1 ~ x1 * x2 + x3)
##
## Coefficients:
## (Intercept) x1 x2 x3 x1:x2
## 0.116 0.917 0.892 0.992 0.228
By default, the estimated coefficients print to the console. If we want to do anything else with the model, we need to store it as an object and then we can perform further procedures:
m1 <- lm(y1 ~ x1 + x2 + x3)
m2 <- lm(y1 ~ x1 * x2 + x3)
To obtain just the coefficients themselves, we can use the coef
function applied to the model object:
coef(m1)
## (Intercept) x1 x2 x3
## 0.1106 0.9236 1.0119 0.9932
Similarly, we can use residuals
to see the model residuals. We'll just list the first 15 here:
residuals(m1)[1:15]
## 1 2 3 4 5 6 7 8
## 0.11464 -0.51139 -0.53711 -0.39099 -0.05157 0.46566 -0.11753 -1.25015
## 9 10 11 12 13 14 15
## -1.03919 -0.32588 -0.97016 0.89039 0.30830 -1.58698 0.83643
The model objects also include all of the data used to estimate the model in a sub-object called model
. Let's look at its first few rows:
head(m1$model)
## y1 x1 x2 x3
## 1 4.012 1 -1.524406 4.436
## 2 -10.837 0 0.043114 -10.551
## 3 -2.991 0 0.084210 -2.668
## 4 -7.166 1 0.354005 -8.223
## 5 4.687 1 1.104567 2.604
## 6 4.333 0 0.004345 3.778
There are lots of other things stored in a model object that don't concern us right now, but that you could see with str(m1)
or str(m2)
.
Much more information about a model becomes available when we use the summary
function:
summary(m1)
##
## Call:
## lm(formula = y1 ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5808 -0.6384 -0.0632 0.5966 2.8379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1106 0.0976 1.13 0.26
## x1 0.9236 0.1431 6.45 8.4e-10 ***
## x2 1.0119 0.0716 14.13 < 2e-16 ***
## x3 0.9932 0.0168 59.01 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01 on 196 degrees of freedom
## Multiple R-squared: 0.951, Adjusted R-squared: 0.951
## F-statistic: 1.28e+03 on 3 and 196 DF, p-value: <2e-16
Indeed, as with a bivariate model, a complete representation of the regression results is printed to the console, including coefficients, standard errors, t-statistics, p-values, some summary statistics about the regression residuals, and various model fit statistics. The summary object itself can be saved and objects extracted from it:
s1 <- summary(m1)
A look at the structure of s1
shows that there is considerable detail stored in the summary object:
str(s1)
## List of 11
## $ call : language lm(formula = y1 ~ x1 + x2 + x3)
## $ terms :Classes 'terms', 'formula' length 3 y1 ~ x1 + x2 + x3
## .. ..- attr(*, "variables")= language list(y1, x1, x2, x3)
## .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:4] "y1" "x1" "x2" "x3"
## .. .. .. ..$ : chr [1:3] "x1" "x2" "x3"
## .. ..- attr(*, "term.labels")= chr [1:3] "x1" "x2" "x3"
## .. ..- attr(*, "order")= int [1:3] 1 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: 0x000000001c3d67b0>
## .. ..- attr(*, "predvars")= language list(y1, x1, x2, x3)
## .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:4] "y1" "x1" "x2" "x3"
## $ residuals : Named num [1:200] 0.1146 -0.5114 -0.5371 -0.391 -0.0516 ...
## ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
## $ coefficients : num [1:4, 1:4] 0.1106 0.9236 1.0119 0.9932 0.0976 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:4] FALSE FALSE FALSE FALSE
## ..- attr(*, "names")= chr [1:4] "(Intercept)" "x1" "x2" "x3"
## $ sigma : num 1.01
## $ df : int [1:3] 4 196 4
## $ r.squared : num 0.951
## $ adj.r.squared: num 0.951
## $ fstatistic : Named num [1:3] 1278 3 196
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:4, 1:4] 0.009406 -0.009434 -0.000255 0.000116 -0.009434 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
## .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
## - attr(*, "class")= chr "summary.lm"
This includes all of the details that were printed to the console, which we extract separately, such as the coefficients:
coef(s1)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1106 0.09757 1.133 2.584e-01
## x1 0.9236 0.14311 6.454 8.374e-10
## x2 1.0119 0.07159 14.135 9.787e-32
## x3 0.9932 0.01683 59.008 9.537e-127
s1$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1106 0.09757 1.133 2.584e-01
## x1 0.9236 0.14311 6.454 8.374e-10
## x2 1.0119 0.07159 14.135 9.787e-32
## x3 0.9932 0.01683 59.008 9.537e-127
Model fit statistics:
s1$sigma
## [1] 1.006
s1$r.squared
## [1] 0.9514
s1$adj.r.squared
## [1] 0.9506
s1$fstatistic
## value numdf dendf
## 1278 3 196
And so forth. These details become useful to extract when we want to output our results to another format, such as Word, LaTeX, or something else.
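As a minimal sketch of that workflow (the file name here is arbitrary), we could round the coefficient table and write it to a CSV file that a word processor or LaTeX table tool can then import:
# sketch: export a rounded coefficient table for use in another program
out <- round(coef(s1), 3)
out
write.csv(out, file = "m1-coefficients.csv")  # hypothetical output file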
The output from a model that includes interactions is essentially the same as for a model without any interactions, but note that the interaction coefficients are printed at the end of the output:
coef(m2)
## (Intercept) x1 x2 x3 x1:x2
## 0.1161 0.9172 0.8923 0.9925 0.2278
s2 <- summary(m2)
s2
##
## Call:
## lm(formula = y1 ~ x1 * x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6387 -0.6031 -0.0666 0.6153 2.8076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1161 0.0973 1.19 0.23
## x1 0.9172 0.1426 6.43 9.5e-10 ***
## x2 0.8923 0.1035 8.62 2.2e-15 ***
## x3 0.9925 0.0168 59.17 < 2e-16 ***
## x1:x2 0.2278 0.1428 1.59 0.11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 195 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.951
## F-statistic: 967 on 4 and 195 DF, p-value: <2e-16
coef(s2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1161 0.09725 1.194 2.340e-01
## x1 0.9172 0.14261 6.431 9.536e-10
## x2 0.8923 0.10346 8.625 2.225e-15
## x3 0.9925 0.01677 59.171 1.549e-126
## x1:x2 0.2278 0.14282 1.595 1.123e-01
As with bivariate models, we can easily plot our observed data pairwise:
plot(y1 ~ x2)
And we can overlay predicted values of the outcome on that kind of plot:
plot(y1 ~ x2)
abline(m1$coef[1], m1$coef[3])
or plot the model residuals against included variables:
layout(matrix(1:2, nrow = 1))
plot(m1$residuals ~ x1)
plot(m1$residuals ~ x2)
For more details on plotting regressions, see the section of tutorials on Regression Plotting.
While most of R's default print settings are reasonable, it also provides fine-grained control over the display of output. This control is helpful not only for looking at data and results, but also for correctly interpreting them and then outputting them to other formats (e.g., for use in a publication).
One of the biggest errors made by users of statistical software is the abuse of “false precision.” The idea of “false precision” is that the analyst uses the output of a statistical algorithm directly rather than formatting that output in line with the precision of the actual data. Statistical algorithms, when executed by computers, will typically produce output to a finite but very large number of decimal places even though the underlying data only allow precision to a smaller number of decimals. Take, for example, the task of calculating the average height of a group of individuals. Perhaps we have a tool capable of measuring height to the nearest centimeter. Let's say this is our data for six individuals:
height <- c(167, 164, 172, 158, 181, 179)
We can then use R to calculate the mean height of this group:
mean(height)
## [1] 170.2
R calculates the result to many decimal places (170.1667, even if fewer digits are printed depending on your settings). But because our data are only precise to whole centimeters, the concept of “significant figures” applies. According to those rules, we can only report a result that is precise to the number of significant digits in our original data plus one. Our original data have three significant digits, so the result can only have one decimal place. The mean is thus 170.2, not 170.167. This is important because we might be tempted to compare our mean to another mean (as part of some analysis) and we can only detect differences at the tenths place but no further. A different group with a calculated mean height of 170.191 (by the same measurement tool) would therefore have a mean indistinguishable from that of our group. These kinds of calculations must often be done by hand, but R can do them for us using several different functions.
The most direct way to properly round our results is with either signif
or round
. signif
rounds to a specified number of significant digits. For our above example with four significant figures, we can use:
signif(mean(height), 4)
## [1] 170.2
An alternative approach is to use round
to specify a number of decimal places. For the above example, this would be 1:
round(mean(height), 1)
## [1] 170.2
round
also accepts negative values to round to, e.g., tens, hundreds, etc. places:
round(mean(height), -1)
## [1] 170
Figuring out significant figures can sometimes be difficult, particularly when the precision of original data is ambiguous. A good rule of thumb for social science data is two significant digits unless those data are known to have greater precision. As an example, surveys often measure constructs on an arbitrary scale (e.g., 1-7). There is one digit of precision in these data, so any results from them should have only two significant figures.
While R typically prints to a large number of digits (default on my machine is 7), the above reminds us that we shouldn't listen to R's defaults because they convey false precision. Rather than having to round everything that comes out of R each time, we can also specify a number of digits to round to globally. We might, for example, follow a rule of thumb of two decimal places for our results:
mean(height)
## [1] 170.2
sd(height)
## [1] 8.886
options(digits = 2)
mean(height)
## [1] 170
sd(height)
## [1] 8.9
But we can easily change this again to whatever value we so choose. Note: computers are limited in the number of decimals they can actually store, so requesting a large number of decimal places may produce unexpected results.
options(digits = 20)
mean(height)
## [1] 170.16666666666665719
sd(height)
## [1] 8.8863190729720411554
options(digits = 7)
Another useful global option is scipen
, which decides whether R reports results in scientific notation.
If we specify a negative value for scipen
, R will tend to report results in scientific notation.
And, if we specify a positive value for scipen
, R will tend to report results in fixed notation, even when they are very small or very large.
Its default value is 0 (meaning no tendency either way).
options(scipen = -10)
1e+10
## [1] 1e+10
1e-10
## [1] 1e-10
options(scipen = 10)
1e+10
## [1] 10000000000
1e-10
## [1] 0.0000000001
options(scipen = 0)
1e+10
## [1] 1e+10
1e-10
## [1] 1e-10
Another strategy for formatting output is the sprintf
function.
sprintf
is very flexible, so I won't explain all the details here, but it can be used to format a number (and other things) into any variety of formats as a character string. Here are some examples from ?sprintf
for the display of pi:
sprintf("%f", pi)
## [1] "3.141593"
sprintf("%.3f", pi)
## [1] "3.142"
sprintf("%1.0f", pi)
## [1] "3"
sprintf("%5.1f", pi)
## [1] " 3.1"
sprintf("%05.1f", pi)
## [1] "003.1"
sprintf("%+f", pi)
## [1] "+3.141593"
sprintf("% f", pi)
## [1] " 3.141593"
One of the things that I find most difficult about regression is visualizing what is actually happening when a regression model is fit. One way to better understand that process is to recognize that a regression is simply a curve through the conditional mean values of an outcome at each value of one or more predictors. Thus, we can actually estimate (i.e., “figure out”) the regression line simply by determining the conditional mean of our outcome at each value of our input. This is easiest to see in a bivariate regression, so let's create some data and build the model:
set.seed(100)
x <- sample(0:50, 10000, TRUE)
# xsq <- x^2 # a squared term just for fun
# the data-generating process:
y <- 2 + x + (x^2) + rnorm(10000, 0, 300)
Now let's calculate the conditional means of x
, xsq
, and y
:
condmeans_x <- by(x, x, mean)
condmeans_x2 <- by(x^2, x, mean)
condmeans_y <- by(y, x, mean)
If we run the regression on the original data (assuming we know the data-generating process), we'll get the following:
lm1 <- lm(y ~ x + I(x^2))
summary(lm1)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1229.5 -200.0 -0.7 200.6 1196.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0643 8.7796 0.92 0.36
## x -0.2293 0.8087 -0.28 0.78
## I(x^2) 1.0259 0.0156 65.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 300 on 9997 degrees of freedom
## Multiple R-squared: 0.871, Adjusted R-squared: 0.871
## F-statistic: 3.37e+04 on 2 and 9997 DF, p-value: <2e-16
If we run the regression instead on just the conditional means (i.e., one value of y
at each value of x
), we will get the following:
lm2 <- lm(condmeans_y ~ condmeans_x + condmeans_x2)
summary(lm2)
##
## Call:
## lm(formula = condmeans_y ~ condmeans_x + condmeans_x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.57 -17.20 0.85 17.21 50.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2157 9.7430 0.74 0.46
## condmeans_x -0.1806 0.9012 -0.20 0.84
## condmeans_x2 1.0250 0.0174 58.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.1 on 48 degrees of freedom
## Multiple R-squared: 0.999, Adjusted R-squared: 0.999
## F-statistic: 2.65e+04 on 2 and 48 DF, p-value: <2e-16
The results from the two models look very similar. Aside from some minor variations, they provide identical substantive inference about the process at hand. We can see this if we plot the original data (in gray) and overlay it with the conditional means of y
(in red):
plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 16)
points(condmeans_y ~ condmeans_x, col = "red", pch = 15)
We can add predicted output lines (one for each model) to this plot to see the similarity of the two models even more clearly. Indeed, the lines overlay each other perfectly:
plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 16)
points(condmeans_y ~ condmeans_x, col = "red", pch = 15)
lines(0:50, predict(lm1, data.frame(x = 0:50)), col = "green", type = "l", lwd = 2)
lines(0:50, predict(lm2, data.frame(x = 0:50)), col = "blue", type = "l", lwd = 2)
So, if you ever struggle to think about what regression is doing, just remember that it is simply drawing a (potentially multidimensional) curve through the conditional means of the outcome at every value of the covariate(s).
When building regression models, one of the biggest questions relates to “goodness-of-fit.” How well does our model of the data (i.e., the selected predictor variables) actually “fit” the outcome data? In other words, how much of the variation in an outcome can we explain with a particular model? R provides a number of useful ways of assessing model fit, some of which are common (but not necessarily good) and some of which are uncommon (but probably much better). To see these statistics in action, we'll build some fake data and then model them using a small, bivariate model that incompletely captures the data-generating process and a large, multivariate model that does so much more completely:
set.seed(100)
x1 <- runif(500, 0, 10)
x2 <- rnorm(500, -2, 1)
x3 <- rbinom(500, 1, 0.5)
y <- -1 + x1 + x2 + 3 * x3 + rnorm(500)
Let's generate the bivariate model m1
and store it and its output sm1
for later:
m1 <- lm(y ~ x1)
sm1 <- summary(m1)
Then let's do the same for the multivariate model m2
and its output sm2
:
m2 <- lm(y ~ x1 + x2 + x3)
sm2 <- summary(m2)
Below we'll look at some different ways of assessing model fit.
One measure commonly used - perhaps against better wisdom - for assessing model fit is R-squared. R-squared speaks to the proportion of variance in the outcome that can be accounted for by the model.
Looking at our simple bivariate model m1
, we can extract R-squared as a measure of model fit in a number of ways. The easiest is simply to extract it from the sm1
summary object:
sm1$r.squared
## [1] 0.6223
But we can also calculate R-squared from our data in a number of ways:
cor(y, x1)^2 # manually, as squared bivariate correlation
## [1] 0.6223
var(m1$fitted)/var(y) # manually, as ratio of variances
## [1] 0.6223
(coef(m1)[2]/sqrt(cov(y, y)/cov(x1, x1)))^2 # manually, as the squared standardized regression coefficient
## x1
## 0.6223
Commonly, we actually use the “Adjusted R-squared” because “regular” R-squared is sensitive to the number of independent variables in the model (i.e., as we put more variables into the model, R-squared increases even if those variables are unrelated to the outcome). Adjusted R-squared attempts to correct for this by deflating R-squared by the expected amount of increase from including irrelevant additional predictors. We can see this property of R-squared and Adjusted R-squared by adding some completely random variables unrelated to our other covariates or the outcome into our model and examining the impact on R-squared and Adjusted R-squared.
tmp1 <- rnorm(500, 0, 10)
tmp2 <- rnorm(500, 0, 10)
tmp3 <- rnorm(500, 0, 10)
tmp4 <- rnorm(500, 0, 10)
We can then compare the R-squared from our original bivariate model to that from the garbage dump model:
sm1$r.squared
## [1] 0.6223
summary(lm(y ~ x1 + tmp1 + tmp2 + tmp2 + tmp4))$r.squared
## [1] 0.6289
R-squared increased some, even though these variables are unrelated to y
. The adjusted R-squared value also changes, but less so than R-squared:
sm1$adj.r.squared
## [1] 0.6216
summary(lm(y ~ x1 + tmp1 + tmp2 + tmp2 + tmp4))$adj.r.squared
## [1] 0.6259
So, relying on adjusted R-squared is still imperfect and, more than anything, highlights the problems of relying on R-squared (in either form) as a reliable measure of model fit.
Of course, when we compare the R-squared and adjusted R-squared values from our bivariate model m1
to our fuller, multivariate model m2
, we see appropriate increases in R-squared:
# R-squared
sm1$r.squared
## [1] 0.6223
sm2$r.squared
## [1] 0.9166
# adjusted R-squared
sm1$adj.r.squared
## [1] 0.6216
sm2$adj.r.squared
## [1] 0.9161
In both cases, we see that R-squared and adjusted R-squared increase. The challenge is that because R-squared depends on factors other than model fit, it is an imperfect metric.
A very nice way to assess model fit is the standard error of the regression (SER), sometimes just sigma. In R regression output, the value is labeled “Residual standard error” and is stored in a model summary object as sigma
. You'll see it near the bottom of model output for the bivariate model m1
:
sm1
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.050 -1.634 -0.143 1.648 5.148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4499 0.1974 -7.35 8.4e-13 ***
## x1 0.9707 0.0339 28.65 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 498 degrees of freedom
## Multiple R-squared: 0.622, Adjusted R-squared: 0.622
## F-statistic: 821 on 1 and 498 DF, p-value: <2e-16
sm1$sigma
## [1] 2.113
This value captures the sum of squared residuals, over the number of degrees of freedom in the model:
sqrt(sum(residuals(m1)^2)/(m1$df.residual))
## [1] 2.113
In other words, the value is proportionate to the standard deviation of the model residuals. In large samples, it will converge on the standard deviation of the residuals:
sd(residuals(m1))
## [1] 2.111
We can also see it in the multivariate model m2
:
sm2$sigma
## [1] 0.9949
sqrt(sum(residuals(m2)^2)/(m2$df.residual))
## [1] 0.9949
Because sigma is a standard deviation (and not a variance), it is on the scale of the original outcome data. Thus, we can actually directly compare the standard deviation of the original outcome data sd(y)
to the sigma of any model attempting to account for the variation in y
. We see that our models reduce that standard deviation considerably:
sd(y)
## [1] 3.435
sm1$sigma
## [1] 2.113
sm2$sigma
## [1] 0.9949
Because of this inherent comparability of scale, sigma provides a much nicer measure of model fit than R-squared. It can be difficult to interpret how much better a given model fits compared to a baseline model when using R-squared (as we saw just above). By contrast, we can easily quantify the extra explanation done by a larger model by looking at sigma. We can, for example, see that the addition of several random, unrelated variables in our model does almost nothing to sigma:
sm1$sigma
## [1] 2.113
summary(lm(y ~ x1 + tmp1 + tmp2 + tmp2 + tmp4))$sigma
## [1] 2.101
While we can see in the sigma values that the multivariate model fit the outcome data better than the bivariate model, the statistics alone don't supply a formal test of that between-model comparison. Typically, such comparisons are - incorrectly - made by comparing the R-squared of different models. Such comparisons are problematic because of the sensitivity of R-squared to the number of included variables (in addition to model fit).
We can make a formal comparison between “nested” models (i.e., models with a common outcome, where one contains a subset of the covariates of the other model and no additional covariates). To do so, we conduct an F-test.
The F-test compares the fit of the larger model to the smaller model. By definition, the larger model (i.e., the one with more predictors) will always fit the data at least as well as the smaller model. Thus, just as with R-squared, we can be tempted to add more covariates simply to increase model fit even if that increase in fit is not particularly meaningful. The F-test therefore compares the residuals from the larger model to those from the smaller model and tests whether there is a statistically significant reduction in the sum of squared residuals. We execute the test using the anova
function:
anova(m1, m2)
## Analysis of Variance Table
##
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2 + x3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 498 2224
## 2 496 491 2 1733 876 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output suggests that our multivariate model does a much better job of fitting the data than the bivariate model, as indicated by the much larger RSS value for m1
(listed first) and the very large F-statistic (and associated very small, heavily-starred p-value).
To see what's going on “under the hood,” we can calculate the RSS for both models, the difference in their sums of squares, and the associated F-statistic:
sum(m1$residuals^2) # residual sum of squares for m1
## [1] 2224
sum(m2$residuals^2) # residual sum of squares for m2
## [1] 490.9
sum(m1$residuals^2)-sum(m2$residuals^2) # sum of squares
## [1] 1733
# the F-statistic
f <- ((sum(m1$residuals^2)-sum(m2$residuals^2))/sum(m2$residuals^2)) * # ratio of sum of squares
(m2$df.residual/(m1$df.residual-m2$df.residual)) # ratio of degrees of freedom
f
## [1] 875.5
# p-value (from the F distribution)
pf(f, m1$df.residual-m2$df.residual, m2$df.residual, lower.tail = FALSE)
## [1] 1.916e-163
Thus there are some complicated calculations being performed in order for the anova
F-test to tell us whether the models differ in their fit. But, those calculations give us very clear inference about any improvement in model fit. Using the F-test rather than an ad-hoc comparison between (adjusted) R-squared values is a much more appropriate comparison of model fits.
Another very nice way to assess goodness of fit is to do so visually using the QQ-plot. This plot, in general, compares the distributions of two variables. In the regression context, we can use it to compare the quantiles of the outcome distribution to the quantiles of the distribution of fitted values from the model. To do this in R, we need to extract the fitted values from our model using the fitted
function and the qqplot
function to do the plotting.
Let's compare the fit of our bivariate model to our multivariate model using two side-by-side qqplots.
layout(matrix(1:2, nrow = 1))
qqplot(y, fitted(m1), col = "gray", pch = 15, ylim = c(-5, 10), main = "Bivariate model")
curve((x), col = "blue", add = TRUE)
qqplot(y, fitted(m2), col = "gray", pch = 15, ylim = c(-5, 10), main = "Multivariate model")
curve((x), col = "blue", add = TRUE)
Note: The blue lines represent a y=x
line.
If the distribution of y
and the distribution of the fitted values matched perfectly, then the gray dots would line up perfectly along the y=x
line. We see, however, in the bivariate (underspecified) model (left panel) that the fitted values diverge considerably from the distribution of y
. By contrast, the fitted values from our multivariate model (right panel) match the distribution of y
much more closely. In both plots, however, the models clearly fail to precisely explain extreme values of y
.
While we cannot summarize the QQ-plot as a single numeric statistic, it provides a very rich characterization of fit that shows not only how well our model fits overall, but also where in the distribution of our outcome the model is doing a better or worse job of explaining the outcome.
Though different from a QQ-plot, we can also plot our fitted values directly against the outcome in order to see how well the model captures variation in the outcome. The closer this cloud of points is to a single straight line, the better the model fit. Such plots can also help us detect non-linearities and other forms of misspecification. Let's compare the fit of the two models side-by-side again:
layout(matrix(1:2, nrow = 1))
plot(y, fitted(m1), col = "gray", pch = 15, ylim = c(-5, 10), main = "Bivariate model")
curve((x), col = "blue", add = TRUE)
plot(y, fitted(m2), col = "gray", pch = 15, ylim = c(-5, 10), main = "Multivariate model")
curve((x), col = "blue", add = TRUE)
As above, we see that the bivariate model does a particularly poor job of explaining extreme cases in y
, whereas the multivariate model does much better but remains imperfect (due to random variation in y
from when we created the data).
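If we want a single number to accompany these fitted-versus-outcome plots, one option is the squared correlation between the fitted values and the outcome, which for an OLS model with an intercept reproduces R-squared. A minimal sketch using only the objects already created:
cor(y, fitted(m1))^2 # should match sm1$r.squared
cor(y, fitted(m2))^2 # should match sm2$r.squared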
The matrix representation of the OLS estimator is (X'X)^(-1)X'y. Representing this in R is simple. Let's start with some made-up data:
set.seed(1)
n <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
X <- cbind(x1, x2, x3)
y <- x1 + x2 + x3 + rnorm(n)
To transpose a matrix, we use the t
function:
X
## x1 x2 x3
## [1,] -0.62645 0.91898 -0.1645
## [2,] 0.18364 0.78214 -0.2534
## [3,] -0.83563 0.07456 0.6970
## [4,] 1.59528 -1.98935 0.5567
## [5,] 0.32951 0.61983 -0.6888
## [6,] -0.82047 -0.05613 -0.7075
## [7,] 0.48743 -0.15580 0.3646
## [8,] 0.73832 -1.47075 0.7685
## [9,] 0.57578 -0.47815 -0.1123
## [10,] -0.30539 0.41794 0.8811
## [11,] 1.51178 1.35868 0.3981
## [12,] 0.38984 -0.10279 -0.6120
## [13,] -0.62124 0.38767 0.3411
## [14,] -2.21470 -0.05381 -1.1294
## [15,] 1.12493 -1.37706 1.4330
## [16,] -0.04493 -0.41499 1.9804
## [17,] -0.01619 -0.39429 -0.3672
## [18,] 0.94384 -0.05931 -1.0441
## [19,] 0.82122 1.10003 0.5697
## [20,] 0.59390 0.76318 -0.1351
t(X)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## x1 -0.6265 0.1836 -0.83563 1.5953 0.3295 -0.82047 0.4874 0.7383
## x2 0.9190 0.7821 0.07456 -1.9894 0.6198 -0.05613 -0.1558 -1.4708
## x3 -0.1645 -0.2534 0.69696 0.5567 -0.6888 -0.70750 0.3646 0.7685
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## x1 0.5758 -0.3054 1.5118 0.3898 -0.6212 -2.21470 1.125 -0.04493
## x2 -0.4782 0.4179 1.3587 -0.1028 0.3877 -0.05381 -1.377 -0.41499
## x3 -0.1123 0.8811 0.3981 -0.6120 0.3411 -1.12936 1.433 1.98040
## [,17] [,18] [,19] [,20]
## x1 -0.01619 0.94384 0.8212 0.5939
## x2 -0.39429 -0.05931 1.1000 0.7632
## x3 -0.36722 -1.04413 0.5697 -0.1351
To multiply two matrices, we use the %*%
matrix multiplication operator:
t(X) %*% X
## x1 x2 x3
## x1 16.573 -3.314 4.711
## x2 -3.314 14.427 -3.825
## x3 4.711 -3.825 12.843
To invert a matrix, we use the solve
function:
solve(t(X) %*% X)
## x1 x2 x3
## x1 0.068634 0.009868 -0.02224
## x2 0.009868 0.076676 0.01922
## x3 -0.022236 0.019218 0.09175
Now let's put all of that together:
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## x1 0.7818
## x2 1.2857
## x3 1.4615
Now let's compare it to the lm
function:
lm(y ~ x1 + x2 + x3)$coef
## (Intercept) x1 x2 x3
## 0.08633 0.76465 1.27869 1.44705
The numbers are close, but they're not quite right.
The reason is that we forgot to include the intercept in our matrix calculation.
If we use lm
again but leave out the intercept, we'll see this is the case:
lm(y ~ 0 + x1 + x2 + x3)$coef
## x1 x2 x3
## 0.7818 1.2857 1.4615
To include the intercept in matrix form, we need to add a column of 1's to the matrix:
X2 <- cbind(1, X) # this uses vector recycling to create the column of 1's
Now we redo our math:
solve(t(X2) %*% X2) %*% t(X2) %*% y
## [,1]
## 0.08633
## x1 0.76465
## x2 1.27869
## x3 1.44705
And compare to our full model using lm
:
lm(y ~ x1 + x2 + x3)$coef
## (Intercept) x1 x2 x3
## 0.08633 0.76465 1.27869 1.44705
The result is exactly what we would expect.
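We can push the matrix approach one step further and recover the coefficient standard errors as well, using the usual OLS variance formula sigma^2 (X'X)^(-1). This is a minimal sketch built only from the objects created above; the object names (bhat, sig2, V) are just illustrative:
bhat <- solve(t(X2) %*% X2) %*% t(X2) %*% y # coefficients, as above
res <- y - X2 %*% bhat # residuals
sig2 <- sum(res^2)/(nrow(X2) - ncol(X2)) # estimated residual variance
V <- sig2 * solve(t(X2) %*% X2) # variance-covariance matrix of the coefficients
sqrt(diag(V)) # standard errors
These should match the “Std. Error” column of summary(lm(y ~ x1 + x2 + x3)).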
Interactions are important, but they're hard to understand without visualization. This script works through how to visualize interactions in linear regression models.
set.seed(1)
x1 <- rnorm(200)
x2 <- rbinom(200, 1, 0.5)
y <- x1 + x2 + (2 * x1 * x2) + rnorm(200)
Interactions (at least in fake data) tend to produce weird plots:
plot(y ~ x1)
plot(y ~ x2)
This means they also produce weird residual plots:
ols1 <- lm(y ~ x1 + x2)
plot(ols1$residuals ~ x1)
plot(ols1$residuals ~ x2)
For example, in the first plot we find that there are clearly two relationships between y
and x1
, one positive and one negative.
We thus want to model this using an interaction:
ols2 <- lm(y ~ x1 + x2 + x1:x2)
summary(ols2)
##
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.855 -0.680 -0.002 0.682 3.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0627 0.1103 -0.57 0.57
## x1 1.1199 0.1258 8.90 3.7e-16 ***
## x2 1.1303 0.1538 7.35 5.3e-12 ***
## x1:x2 1.9017 0.1672 11.37 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 196 degrees of freedom
## Multiple R-squared: 0.821, Adjusted R-squared: 0.818
## F-statistic: 299 on 3 and 196 DF, p-value: <2e-16
Note: This is equivalent to either of the following:
summary(lm(y ~ x1 + x2 + x1 * x2))
##
## Call:
## lm(formula = y ~ x1 + x2 + x1 * x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.855 -0.680 -0.002 0.682 3.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0627 0.1103 -0.57 0.57
## x1 1.1199 0.1258 8.90 3.7e-16 ***
## x2 1.1303 0.1538 7.35 5.3e-12 ***
## x1:x2 1.9017 0.1672 11.37 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 196 degrees of freedom
## Multiple R-squared: 0.821, Adjusted R-squared: 0.818
## F-statistic: 299 on 3 and 196 DF, p-value: <2e-16
summary(lm(y ~ x1 * x2))
##
## Call:
## lm(formula = y ~ x1 * x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.855 -0.680 -0.002 0.682 3.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0627 0.1103 -0.57 0.57
## x1 1.1199 0.1258 8.90 3.7e-16 ***
## x2 1.1303 0.1538 7.35 5.3e-12 ***
## x1:x2 1.9017 0.1672 11.37 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 196 degrees of freedom
## Multiple R-squared: 0.821, Adjusted R-squared: 0.818
## F-statistic: 299 on 3 and 196 DF, p-value: <2e-16
However, specifying only the interaction…
summary(lm(y ~ x1:x2))
##
## Call:
## lm(formula = y ~ x1:x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.390 -0.873 0.001 0.971 4.339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5305 0.0988 5.37 2.2e-07 ***
## x1:x2 3.0492 0.1415 21.55 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.4 on 198 degrees of freedom
## Multiple R-squared: 0.701, Adjusted R-squared: 0.7
## F-statistic: 464 on 1 and 198 DF, p-value: <2e-16
produces an incomplete (and thus invalid) model. Now let's figure out how to visualize this interaction based upon the complete/correct model.
Our example data are particularly simple. There are two groups (defined by x2
) and one covariate (x1
).
We can plot these two groups separately in order to see their distributions of y
as a function of x1
.
We can index our vectors in order to plot the groups separately in red and blue:
plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)),
ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
It is already clear that there is an interaction. Let's see what happens when we plot the estimated effects.
The easiest way of examining interactions is with predicted outcomes plots.
We simply want to show the predicted value of the outcome based upon combinations of input variables.
We know that we can do this with the predict
function applied to some new data.
The expand.grid
function is helpful for building the necessary new data:
xseq <- seq(-5, 5, length.out = 100)
newdata <- expand.grid(x1 = xseq, x2 = c(0, 1))
Let's build a set of predicted values for our no-interaction model:
fit1 <- predict(ols1, newdata, se.fit = TRUE, type = "response")
Then do the same for our full model with the interaction:
fit2 <- predict(ols2, newdata, se.fit = TRUE, type = "response")
Now let's plot the original data, again. Then we'll overlay it with the predicted values for the two groups.
plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)),
ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit1$fit[1:100], type = "l", col = "red")
points(xseq, fit1$fit[101:200], type = "l", col = "blue")
The result is a plot that differentiates the absolute levels of y
in the two groups, but forces them to have equivalent slopes.
We know this is wrong.
Now let's try to plot the data with the correct fitted values, accounting for the interaction:
plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)),
ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit2$fit[1:100], type = "l", col = "red")
points(xseq, fit2$fit[101:200], type = "l", col = "blue")
This looks better. The fitted values lines correspond nicely to the varying slopes in our two groups.
But, we still need to add uncertainty. Luckily, we have the necessary information in fit2$se.fit
.
plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)),
ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit2$fit[1:100], type = "l", col = "red")
points(xseq, fit2$fit[101:200], type = "l", col = "blue")
points(xseq, fit2$fit[1:100] - fit2$se.fit[1:100], type = "l", col = "red",
lty = 2)
points(xseq, fit2$fit[1:100] + fit2$se.fit[1:100], type = "l", col = "red",
lty = 2)
points(xseq, fit2$fit[101:200] - fit2$se.fit[101:200], type = "l", col = "blue",
lty = 2)
points(xseq, fit2$fit[101:200] + fit2$se.fit[101:200], type = "l", col = "blue",
lty = 2)
We can also produce the same plot through bootstrapping.
tmpdata <- data.frame(x1 = x1, x2 = x2, y = y)
myboot <- function() {
thisboot <- sample(1:nrow(tmpdata), nrow(tmpdata), TRUE)
coef(lm(y ~ x1 * x2, data = tmpdata[thisboot, ]))
}
bootcoefs <- replicate(2500, myboot())
plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)),
ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
apply(bootcoefs, 2, function(coefvec) {
points(xseq, coefvec[1] + (xseq * coefvec[2]), type = "l", col = rgb(1,
0, 0, 0.01))
points(xseq, coefvec[1] + (xseq * (coefvec[2] + coefvec[4])) + coefvec[3],
type = "l", col = rgb(0, 0, 1, 0.01))
})
## NULL
points(xseq, fit2$fit[1:100], type = "l")
points(xseq, fit2$fit[101:200], type = "l")
points(xseq, fit2$fit[1:100] - fit2$se.fit[1:100], type = "l", lty = 2)
points(xseq, fit2$fit[1:100] + fit2$se.fit[1:100], type = "l", lty = 2)
points(xseq, fit2$fit[101:200] - fit2$se.fit[101:200], type = "l", lty = 2)
points(xseq, fit2$fit[101:200] + fit2$se.fit[101:200], type = "l", lty = 2)
If we overlay our previous lines on top of this, we see that the two approaches produce the same result as above.
Of course, we may want to show confidence intervals rather than SEs. And this is simple.
We can reproduce the graph with 95% confidence intervals, using qnorm
to determine how much to multiply our SEs by.
plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)),
ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit2$fit[1:100], type = "l", col = "red")
points(xseq, fit2$fit[101:200], type = "l", col = "blue")
points(xseq, fit2$fit[1:100] - qnorm(0.975) * fit2$se.fit[1:100], type = "l",
lty = 2, col = "red")
points(xseq, fit2$fit[1:100] + qnorm(0.975) * fit2$se.fit[1:100], type = "l",
lty = 2, col = "red")
points(xseq, fit2$fit[101:200] - qnorm(0.975) * fit2$se.fit[101:200], type = "l",
lty = 2, col = "blue")
points(xseq, fit2$fit[101:200] + qnorm(0.975) * fit2$se.fit[101:200], type = "l",
lty = 2, col = "blue")
We can also use plots to visualize why we need to include constitutive terms in our interaction models. Recall that our model is defined as:
ols2 <- lm(y ~ x1 + x2 + x1:x2)
We can compare this to a model with only one term and the interaction:
ols3 <- lm(y ~ x1 + x1:x2)
fit3 <- predict(ols3, newdata, se.fit = TRUE, type = "response")
And plot its results:
plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)),
ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit3$fit[1:100], type = "l", col = "red")
points(xseq, fit3$fit[101:200], type = "l", col = "blue")
# We can compare these lines to those from the full model:
points(xseq, fit2$fit[1:100], type = "l", col = "red", lwd = 2)
points(xseq, fit2$fit[101:200], type = "l", col = "blue", lwd = 2)
By leaving out a term, we misestimate the effect of x1
in both groups.
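We can also see the problem directly in the coefficients: with the constitutive x2 term omitted, the remaining estimates shift to absorb the group difference. Just as a quick comparison (no new objects required):
coef(ols2) # full model: x1, x2, and their interaction
coef(ols3) # misspecified model: x2 omitted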
This tutorial focuses on ordered outcome regression models. R's base glm
function does not support these, but they're very easy to estimate using the MASS package, which is one of R's recommended packages.
library(MASS)
Ordered outcome data always create a bit of tension when it comes to analysis because they present many options for how to analyze them. For example, imagine we are looking at the effect of some independent variables on response to a survey question that measures opinion on a five-point scale from extremely supportive to extremely opposed. We could dichotomize the measure to compare support versus opposition with a binary model. We could also assume that the categories are spaced equidistant on a latent scale and simply model the outcome using a linear model. Or, finally, we could use an ordered model (e.g., ordered logit or ordered probit) to model the unobserved latent scale of the outcome without requiring that the outcome categories are equidistant on that scale. We'll focus on the last of these options here, with comparison to the binary and linear alternative specifications.
Let's start by creating some data that have a linear relationship between an outcome y
and two covariates x1
and x2
:
set.seed(500)
x1 <- runif(500, 0, 10)
x2 <- rbinom(500, 1, 0.5)
y <- x1 + x2 + rnorm(500, 0, 3)
The y
vector is our latent linear scale that we won't actually observe. Instead let's collapse the y
variable into a new variable y2
, which will serve as our observed data and has 5 categories. We can do this using the cut
function:
y2 <- as.numeric(cut(y, 5))
Now let's plot our “observed” data y2
against our independent variables. We'll plot the values for x2==1
and x2==0
separately just to visualize the data. And we'll additionally fit a linear model to the data and draw separate lines for predicting y2
for values of x2==0
and x2==1
(which will be parallel lines):
lm1 <- lm(y2 ~ x1 + x2)
plot(y2[x2 == 0] ~ x1[x2 == 0], col = rgb(1, 0, 0, 0.2), pch = 16)
points(y2[x2 == 1] ~ x1[x2 == 1], col = rgb(0, 0, 1, 0.2), pch = 16)
abline(coef(lm1)[1], coef(lm1)[2], col = "red", lwd = 2)
abline(coef(lm1)[1] + coef(lm1)[3], coef(lm1)[2], col = "blue", lwd = 2)
The plot actually seems like a decent fit, but let's remember that the linear model is trying to predict the conditional means of our outcome y2
for each value of x
but those conditional means can be misleading when our outcome can only take a handful of discrete values rather than any value. Let's redraw the plot with points for the conditional means (at 10 values of x1
) to see the problem:
plot(y2[x2 == 0] ~ x1[x2 == 0], col = rgb(1, 0, 0, 0.2), pch = 16)
points(y2[x2 == 1] ~ x1[x2 == 1], col = rgb(0, 0, 1, 0.2), pch = 16)
x1cut <- as.numeric(cut(x1, 10))
s <- sapply(unique(x1cut), function(i) {
points(i, mean(y2[x1cut == i & x2 == 0]), col = "red", pch = 15)
points(i, mean(y2[x1cut == i & x2 == 1]), col = "blue", pch = 15)
})
# redraw the regression lines:
abline(coef(lm1)[1], coef(lm1)[2], col = "red", lwd = 1)
abline(coef(lm1)[1] + coef(lm1)[3], coef(lm1)[2], col = "blue", lwd = 1)
Overall, then, the previous approach doesn't seem to be doing that great of a job and the output of the model will be continuous values that fall outside of the set of discrete values we actually observed for y2
. Instead, we should try an ordered model (either ordered logit or ordered probit).
To estimate these models we need to use the polr
function from the MASS package. We can use the same formula interface that we used for the linear model. The default is an ordered logit model, but we can easily specify probit using a method='probit'
argument.
Note: One important issue is that the outcome needs to be a “factor” class object. But we can apply the conversion directly in the call to polr
:
ologit <- polr(factor(y2) ~ x1 + x2)
oprobit <- polr(factor(y2) ~ x1 + x2, method = "probit")
Let's look at the summaries of these objects, just to get familiar with the output:
summary(ologit)
##
## Re-fitting to get Hessian
## Error: object 'y2' not found
summary(oprobit)
##
## Re-fitting to get Hessian
## Error: object 'y2' not found
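Note: The “object 'y2' not found” errors above arise because summary re-fits the model to obtain the Hessian and cannot find the variables when they are not stored in a data frame. One way to avoid this - sketched here under the assumption that we simply bundle the variables into a data frame (called odat for illustration) and request the Hessian at estimation time - is:
odat <- data.frame(y2 = factor(y2), x1 = x1, x2 = x2)
ologit2 <- polr(y2 ~ x1 + x2, data = odat, Hess = TRUE) # no re-fitting needed later
summary(ologit2)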
The output looks similar to a linear model but now instead of a single intercept, we have a set of intercepts listed separately from the other coefficients. These intercepts speak to the points (on a latent dimension) where the outcome transitions from one category to the next. Because they're on a latent scale, they're not particularly meaningful to us. Indeed, even the coefficients aren't particularly meaningful. Unlike in OLS, these are not directly interpretable. So let's instead look at some predicted probabilities.
Predicted probabilities can be estimated in the same way for ordered models as for binary GLMs. We simply need to create some covariate data over which we want to estimate predicted probabilities and then run predict
. We'll use expand.grid
to create our newdata
dataframe because we have two covariates and it simplifies creating data at each possible level of both variables.
newdata <- expand.grid(seq(0, 10, length.out = 100), 0:1)
names(newdata) <- c("x1", "x2")
When estimating outcomes we can actually choose between getting the discrete fitted class (i.e., which value of the outcome is most likely at each value of covariates) or the predicted probabilities. We'll get both for the logit model just to compare:
plogclass <- predict(ologit, newdata, type = "class")
plogprobs <- predict(ologit, newdata, type = "probs")
If we look at the head
of each object, we'll see that when type='class'
, the result is a single vector of discrete fitted values, whereas when type='probs'
, the response is a matrix where (for each observation in our new data) the predicted probability of being in each outcome category is specified.
head(plogclass)
## [1] 2 2 2 2 2 2
## Levels: 1 2 3 4 5
head(plogprobs)
## 1 2 3 4 5
## 1 0.3564 0.5585 0.07889 0.005966 0.0002858
## 2 0.3428 0.5673 0.08326 0.006330 0.0003034
## 3 0.3295 0.5756 0.08785 0.006715 0.0003220
## 4 0.3165 0.5833 0.09267 0.007124 0.0003418
## 5 0.3038 0.5906 0.09771 0.007558 0.0003627
## 6 0.2913 0.5973 0.10299 0.008018 0.0003850
Note: The predicted probabilities necessarily sum to 1 in ordered models:
rowSums(plogprobs)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 199 200
## 1 1
The easiest way to make sense of these predictions is through plotting.
Let's start by plotting the original data and then overlying, as horizontal lines, the predicted classes for each value of x1
and x2
:
plot(y2[x2 == 0] ~ x1[x2 == 0], col = rgb(1, 0, 0, 0.2), pch = 16, xlab = "x1",
ylab = "y")
points(y2[x2 == 1] ~ x1[x2 == 1], col = rgb(0, 0, 1, 0.2), pch = 16)
s <- sapply(1:5, function(i) lines(newdata$x1[plogclass == i & newdata$x2 ==
0], as.numeric(plogclass)[plogclass == i & newdata$x2 == 0] + 0.1, col = "red",
lwd = 3))
s <- sapply(1:5, function(i) lines(newdata$x1[plogclass == i & newdata$x2 ==
1], as.numeric(plogclass)[plogclass == i & newdata$x2 == 1] - 0.1, col = "blue",
lwd = 3))
Note: We've drawn the predicted classes separately for x2==0
(red) and x2==1
(blue) and offset them vertically to see their values and the underlying data.
The above plot shows, for each combination of values of x1
and x2
, what the most likely category to observe for y2
is. Thus, where one horizontal bar ends, the next begins (i.e., the blue bars do not overlap each other and neither do the red bars). You'll also note for these data that the predictions are never expected to be in y==1
or y==5
, even though some of our observed y
values are.
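We can confirm this numerically rather than eyeballing the plot by tabulating the fitted classes against x2, using only the plogclass object created above:
table(plogclass, newdata$x2) # which classes are ever predicted, separately by x2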
Now that we've seen the fitted classes, we should acknowledge that we have some uncertainty about those classes. It's simply the most likely class for each observation to take, but there is a defined probability that we'll see values in each of the other y
classes. We can see this if we plot our predicted probability object plogprobs
.
We'll plot predicted probabilities when x2==0
on the left and when x2==1
on the right. The colored lines represent the predicted probability of falling in each category of y2
(in rainbow order, so that red represents y2==1
and purple represents y2==5
). We'll draw a thick horizontal line at the bottom of the plot representing the predicted classes at each value of x1
and x2
:
layout(matrix(1:2, nrow = 1))
# plot for `x2==0`
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==0)", ylab = "Predicted Probability (Logit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 0], plogprobs[newdata$x2 ==
0, i], lwd = 1, col = col), 1:5, rainbow(5))
# optional horizontal line representing predicted class
s <- mapply(function(i, col) lines(newdata$x1[plogclass == i & newdata$x2 ==
0], rep(0, length(newdata$x1[plogclass == i & newdata$x2 == 0])), col = col,
lwd = 3), 1:5, rainbow(5))
# plot for `x2==1`
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==1)", ylab = "Predicted Probability (Logit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 1], plogprobs[newdata$x2 ==
1, i], lwd = 1, col = col), 1:5, rainbow(5))
# optional horizontal line representing predicted class
s <- mapply(function(i, col) lines(newdata$x1[plogclass == i & newdata$x2 ==
1], rep(0, length(newdata$x1[plogclass == i & newdata$x2 == 1])), col = col,
lwd = 3), 1:5, rainbow(5))
We can see that the predicted probability curves strictly follow the logistic distribution (due to our use of a logit model). The lefthand plot also shows what we noted in the earlier plot: when x2==0
, the model never predicts y2==5
.
Note: We can redraw the same plot using prediction values from our ordered probit model and obtain essentially the same inference:
oprobprobs <- predict(oprobit, newdata, type = "probs")
layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==0)", ylab = "Predicted Probability (Probit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 0], oprobprobs[newdata$x2 ==
0, i], lwd = 1, col = col), 1:5, rainbow(5))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==1)", ylab = "Predicted Probability (Probit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 1], oprobprobs[newdata$x2 ==
1, i], lwd = 1, col = col), 1:5, rainbow(5))
The predicted probability plots above communicate a lot of information, but we can also present predicted probabilities in a different way. Because we use ordered outcome regression models when we believe the outcome has a meaningful ordinal scale, it may make sense to present the predicted probabilities stacked on top of one another as a “stacked area chart” (since they sum to 1 for every combination of covariates) in order to communicate the relative probability of being in each outcome class at each combination of covariates. To do this, we need to write a little bit of code to prep our data.
Specifically, our plogprobs
object is a matrix where, for each row, the columns are predicted probabilities of being in each category of the outcome. In order to plot them stacked on top of one another, we need the value in each column to instead be the cumulative probability (calculated left-to-right across the matrix). Luckily, R has some nice built-in functions to do this. cumsum
returns the cumulative sum at each position of a vector. We can use apply
to calculate this cumulative sum for each row of the plogprobs
matrix, and then we simply need to transpose that result using the t
function to simplify some things later on. Let's try it out:
cumprobs <- t(apply(plogprobs, 1, cumsum))
head(cumprobs)
## 1 2 3 4 5
## 1 0.3564 0.9149 0.9937 0.9997 1
## 2 0.3428 0.9101 0.9934 0.9997 1
## 3 0.3295 0.9051 0.9930 0.9997 1
## 4 0.3165 0.8999 0.9925 0.9997 1
## 5 0.3038 0.8944 0.9921 0.9996 1
## 6 0.2913 0.8886 0.9916 0.9996 1
Note: The cumulative probabilities will always be 1 for category 5 because rows sum to 1.
To plot this, we simply need to draw these new values on our plot. We'll again separate data for x2==0
from x2==1
.
layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==0)", ylab = "Cumulative Predicted Probability (Logit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 0], cumprobs[newdata$x2 ==
0, i], lwd = 1, col = col), 1:5, rainbow(5))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==1)", ylab = "Cumulative Predicted Probability (Logit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 1], cumprobs[newdata$x2 ==
1, i], lwd = 1, col = col), 1:5, rainbow(5))
The result is a stacked area chart showing the cumulative probability of being in a set of categories of y
. If we think back to the first example at the top of this tutorial - about predicting opinions on a five-point scale - we could interpret the above plot as the cumulative probability of, e.g., opposing the issue. If y==1
(red) and y==2
(yellow) represent strong and weak opposition, respectively, we could interpret the above lefthand plot as saying that when x1==0
, there is about a 40% chance that an individual strongly opposes and an over 90% chance that they will oppose strongly or weakly.
This plot makes it somewhat more difficult to figure out what the most likely outcome category is, but it helps for making these kinds of cumulative prediction statements. To see the most likely category, we have to visually estimate the widest vertical distance between lines at any given value of x1
, which can be tricky.
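If we want the most likely category as a number rather than reading it off the plot, we can take the column with the largest predicted probability in each row of plogprobs; this reproduces the type = 'class' predictions. A minimal sketch:
mostlikely <- apply(plogprobs, 1, which.max) # column index of the most probable category
head(mostlikely)
head(as.numeric(plogclass)) # should agree with the fitted classes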
We can also use the polygon
plotting function to draw areas rather than lines, which produces a slightly different effect:
layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==0)", ylab = "Cumulative Predicted Probability (Logit)",
bty = "l")
s <- mapply(function(i, col) polygon(c(newdata$x1[newdata$x2 == 0], rev(newdata$x1[newdata$x2 ==
0])), c(cumprobs[newdata$x2 == 0, i], rep(0, length(newdata$x1[newdata$x2 ==
0]))), lwd = 1, col = col, border = col), 5:1, rev(rainbow(5)))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==1)", ylab = "Cumulative Predicted Probability (Logit)",
bty = "l")
s <- mapply(function(i, col) polygon(c(newdata$x1[newdata$x2 == 1], rev(newdata$x1[newdata$x2 ==
1])), c(cumprobs[newdata$x2 == 1, i], rep(0, length(newdata$x1[newdata$x2 ==
1]))), lwd = 1, col = col, border = col), 5:1, rev(rainbow(5)))
Note: We draw the polygons in reverse order so that the lower curves are drawn on top of the higher curves.
An increasingly common statistical tool for constructing sampling distributions is the permutation test (sometimes called a randomization test). Like bootstrapping, a permutation test builds - rather than assumes - a sampling distribution (called the “permutation distribution”) by resampling the observed data. Specifically, we can “shuffle” or permute the observed data (e.g., by assigning different outcome values to each observation from among the set of actually observed outcomes). Unlike bootstrapping, we do this without replacement.
Permutation tests are particularly relevant in experimental studies, where we are often interested in the sharp null hypothesis of no difference between treatment groups. In these situations, the permutation test perfectly represents our process of inference because our null hypothesis is that the two treatment groups do not differ on the outcome (i.e., that the outcome is observed independently of treatment assignment). When we permute the outcome values during the test, we therefore see all of the possible alternative treatment assignments we could have had and where the mean-difference in our observed data falls relative to all of the differences we could have seen if the outcome was independent of treatment assignment. While a permutation test requires that we see all possible permutations of the data (which can quickly become a very large number), we can easily conduct “approximate permutation tests” by simply conducting a very large number of resamples. That process should, in expectation, approximate the permutation distribution.
For example, if we have only n=20 units in our study, the number of permutations is:
factorial(20)
## [1] 2.433e+18
That number exceeds what we can reasonably compute. But we can randomly sample from that permutation distribution to obtain the approximate permutation distribution, simply by running a large number of resamples. Let's look at this as an example using some made up data:
set.seed(1)
n <- 100
tr <- rbinom(100, 1, 0.5)
y <- 1 + tr + rnorm(n, 0, 3)
The difference in means is, as we would expect (given we made it up), about 1:
diff(by(y, tr, mean))
## [1] 1.341
To obtain a single permutation of the data, we simply resample without replacement and calculate the difference again:
s <- sample(tr, length(tr), FALSE)
diff(by(y, s, mean))
## [1] -0.2612
Here we use the permuted treatment vector s
instead of tr
to calculate the difference and find a very small difference. If we repeat this process a large number of times, we can build our approximate permutation distribution (i.e., the sampling distribution for the mean-difference).
We'll use replicate
to repeat our permutation process. The result will be a vector of the differences from each permutation (i.e., our distribution):
dist <- replicate(2000, diff(by(y, sample(tr, length(tr), FALSE), mean)))
We can look at our distribution using hist
and draw a vertical line for our observed difference:
hist(dist, xlim = c(-3, 3), col = "black", breaks = 100)
abline(v = diff(by(y, tr, mean)), col = "blue", lwd = 2)
At face value, it seems that our null hypothesis can probably be rejected. Our observed mean-difference appears to be quite extreme in terms of the distribution of possible mean-differences observable were the outcome independent of treatment assignment.
But we can use the distribution to obtain a p-value for our mean-difference by counting how many permuted mean-differences are larger than the one we observed in our actual data. We can then divide this by the number of items in our permutation distribution (i.e., 2000 from our call to replicate
, above):
sum(dist > diff(by(y, tr, mean)))/2000 # one-tailed test
## [1] 0.009
sum(abs(dist) > abs(diff(by(y, tr, mean))))/2000 # two-tailed test
## [1] 0.018
Using either the one-tailed test or the two-tailed test, our difference is unlikely to be due to chance variation observable in a world where the outcome is independent of treatment assignment.
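One small refinement worth noting: because the observed assignment is itself one of the possible permutations, a common recommendation is to add it to both the numerator and the denominator so the p-value can never be exactly zero. A sketch of that adjustment, using the same objects as above:
(sum(abs(dist) >= abs(diff(by(y, tr, mean)))) + 1)/(length(dist) + 1) # adjusted two-tailed p-value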
We don't always need to build our own permutation distributions (though it is good to know how to do it). R has a package called coin that conducts permutation tests. We can compare our p-value (and associated inference) from above with the result from coin:
library(coin)
independence_test(y ~ tr, alternative = "greater") # one-tailed
##
## Asymptotic General Independence Test
##
## data: y by tr
## Z = 2.315, p-value = 0.01029
## alternative hypothesis: greater
independence_test(y ~ tr) # two-tailed
##
## Asymptotic General Independence Test
##
## data: y by tr
## Z = 2.315, p-value = 0.02059
## alternative hypothesis: two.sided
Clearly, our approximate permutation distribution provided the same inference and a nearly identical p-value. coin provides other permutation tests for different kinds of comparisons, as well. Almost anything that you can address in a parametric framework can also be done in a permutation framework (if substantively appropriate), and anything that coin doesn't provide, you can build by hand with the basic permutation logic of resampling.
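Note: By default, independence_test reports an asymptotic approximation (as in the output above). If we instead want coin to build the null distribution by resampling, much as we did by hand, we can request an approximate distribution; a sketch (the exact argument form may differ slightly across coin versions):
independence_test(y ~ tr, distribution = "approximate") # resampling-based null distribution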
While we can use tables and statistics to summarize data, it is often useful to summarize data visually. This script describes how to produce some common summary plots.
The simplest plot is a histogram, which shows the frequencies of different values in a distribution. Drawing a basic histogram in R is easy. First, let's generate a random vector:
set.seed(1)
a <- rnorm(30)
Then we can draw a histogram of the data:
hist(a)
This isn't the most attractive plot, though, and we can easily make it look different:
hist(a, col = "gray20", border = "lightgray")
Another approach to summarizing the distribution of a variable is a density plot. This visualization is basically a “smoothed” histogram and it's easy to plot, too.
plot(density(a))
Clearly, the two plots give us similar information. We can even overlay them. Doing so requires a few modifications to our code, though.
hist(a, freq = FALSE, col = "gray20", border = "lightgray")
lines(density(a), col = "red", lwd = 2)
One of the simplest data summaries is a barplot. Like a histogram, it shows bars. But those bars are statistics rather than just counts (though they could be counts). We can make a barplot from a vector of numeric values:
b <- c(3, 4.5, 5, 8, 3, 6)
barplot(b)
The result is something visually very similar to the histogram.
We can easily label the bars by specifying a names.arg
parameter:
barplot(b, names.arg = letters[1:6])
We can also turn the plot on its side, if that looks better:
barplot(b, names.arg = letters[1:6], horiz = TRUE)
We can also create a stacked barplot by providing a matrix rather than a vector of input data. Let's say we have counts of two types of objects (e.g., coins) from three groups:
d <- rbind(c(2, 4, 1), c(6, 1, 3))
d
## [,1] [,2] [,3]
## [1,] 2 4 1
## [2,] 6 1 3
barplot(d, names.arg = letters[1:3])
Instead of stacking the bars of each type, we can present them side by side using the beside
parameter:
barplot(d, names.arg = letters[1:3], beside = TRUE)
Rather than waste a lot of ink on bars, we can see the same kinds of relationships in dotcharts.
dotchart(b, labels = letters[1:6])
As we can see, the barplot and the dotchart communicate the same information in more or less the same way:
layout(matrix(1:2, nrow = 1))
barplot(b, names.arg = letters[1:6], horiz = TRUE, las = 2)
dotchart(b, labels = letters[1:6], xlim = c(0, 8))
It is often helpful to describe the distribution of data with a boxplot. The boxplot describes any continuous vector of data by showing the five-number summary and any outliers:
boxplot(a)
It can also compare distributions in two or more groups:
e <- rnorm(100, 1, 1)
f <- rnorm(100, 2, 4)
boxplot(e, f)
We can also use a “formula” description of data if one of our variables describes which group our observations fall into:
g1 <- c(e, f)
g2 <- rep(c(1, 2), each = 100)
boxplot(g1 ~ g2)
As we can see, the last two plots are identical. They're just different ways of telling boxplot
what to plot.
When we want to describe the relationships among variables, we often want a scatterplot.
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- x1 + x2
x4 <- x1 + x3
We can draw a scatterplot in one of two ways. (1) Naming vectors as sequential arguments:
plot(x1, x2)
(2) Using a “formula” interface:
plot(x2 ~ x1)
We can plot the relationship between x1
and the three other variables:
layout(matrix(1:3, nrow = 1))
plot(x1, x2)
plot(x1, x3)
plot(x1, x4)
We can also use the pairs
function to do this for all relationships between all variables:
pairs(~x1 + x2 + x3 + x4)
This allows us to visualize a lot of information very quickly.
The olsplots.r
script walked through plotting regression diagnostics.
Here we focus on plotting regression results.
Because the other script described plotting slopes to some extent, we'll start there.
Once we have a regression model, it's incredibly easy to plot slopes using abline
:
set.seed(1)
x1 <- rnorm(100)
y1 <- x1 + rnorm(100)
ols1 <- lm(y1 ~ x1)
plot(y1 ~ x1, col = "gray")
Note: plot(y1~x1)
is equivalent to plot(x1,y1)
, with reversed order of terms.
abline(coef(ols1)[1], coef(ols1)["x1"], col = "red")
## Error: plot.new has not been called yet
This is a nice plot, but it doesn't show uncertainty. To add uncertainty about our effect, let's try bootstrapping our standard errors.
To bootstrap, we resample our original data, re-estimate the model, and redraw our line. We're going to do some functional programming to make this happen.
myboot <- function() {
tmpdata <- data.frame(x1 = x1, y1 = y1)
thisboot <- sample(1:nrow(tmpdata), nrow(tmpdata), TRUE)
coef(lm(y1 ~ x1, data = tmpdata[thisboot, ]))
}
bootcoefs <- replicate(2500, myboot())
The result bootcoefs
is a matrix containing 2500 bootstrapped sets of OLS coefficient estimates.
We can add these all to our plot using a function called apply
:
plot(y1 ~ x1, col = "gray")
apply(bootcoefs, 2, abline, col = rgb(1, 0, 0, 0.01))
## NULL
The darkest parts of this plot show where we have the most certainty about our expected values. At the tails of the plot, because of the uncertainty about our slope, the range of plausible predicted values is greater.
We can also get a similar looking plot using mathematically calculated SEs.
The predict
function will help us determine the predicted values from a regression model at different inputs.
To use it, we generate some new data representing the range of observed values of our data:
new1 <- data.frame(x1 = seq(-3, 3, length.out = 100))
We then do the prediction, specifying our model (ols1
), the new data (new1
), that we want SEs, and that we want “response” predictions.
pred1 <- predict(ols1, newdata = new1, se.fit = TRUE, type = "response")
We can then plot our data:
plot(y1 ~ x1, col = "gray")
# Add the predicted line of best fit (i.e., the regression line):
points(pred1$fit ~ new1$x1, type = "l", col = "blue")
# Note: This is equivalent to `abline(coef(ols1)[1], coef(ols1)[2],
# col='red')` over the range (-3,3). Then we add our confidence intervals:
lines(new1$x1, pred1$fit + (1.96 * pred1$se.fit), lty = 2, col = "blue")
lines(new1$x1, pred1$fit - (1.96 * pred1$se.fit), lty = 2, col = "blue")
Note: The lty
parameter means “line type.” We've requested a dotted line.
We can then compare the two approaches by plotting them together:
plot(y1 ~ x1, col = "gray")
apply(bootcoefs, 2, abline, col = rgb(1, 0, 0, 0.01))
## NULL
points(pred1$fit ~ new1$x1, type = "l", col = "blue")
lines(new1$x1, pred1$fit + (1.96 * pred1$se.fit), lty = 2, col = "blue")
lines(new1$x1, pred1$fit - (1.96 * pred1$se.fit), lty = 2, col = "blue")
As should be clear, both give us essentially the same representation of uncertainty, but in stylistically different ways.
It is also possible to draw a shaded region rather than the blue lines in the above example.
To do this we use the polygon
function, which we have to feed some x and y positions of points:
plot(y1 ~ x1, col = "gray")
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))),
c(pred1$fit - (1.96 * pred1$se.fit), rev(pred1$fit + (1.96 * pred1$se.fit))),
col = rgb(0, 0, 1, 0.5), border = NA)
Alternatively, we might want to show different confidence intervals with this kind of polygon:
plot(y1 ~ x1, col = "gray")
# To draw the polygon, we have to specify the x positions of the
# points from our predictions. We do this first left to right (for the
# lower CI limit) and then right to left (for the upper CI limit). Then we
# specify the y positions, which are just the outputs from the `predict`
# function.
# 67% CI
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))),
c(pred1$fit - (qnorm(0.835) * pred1$se.fit), rev(pred1$fit + (qnorm(0.835) *
pred1$se.fit))), col = rgb(0, 0, 1, 0.2), border = NA)
# Note: The `qnorm` function tells us how much to multiply our SEs by to get
# Gaussian CIs.
# 95% CI
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))),
c(pred1$fit - (qnorm(0.975) * pred1$se.fit), rev(pred1$fit + (qnorm(0.975) *
pred1$se.fit))), col = rgb(0, 0, 1, 0.2), border = NA)
# 99% CI
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))),
c(pred1$fit - (qnorm(0.995) * pred1$se.fit), rev(pred1$fit + (qnorm(0.995) *
pred1$se.fit))), col = rgb(0, 0, 1, 0.2), border = NA)
# 99.9% CI
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))),
c(pred1$fit - (qnorm(0.9995) * pred1$se.fit), rev(pred1$fit + (qnorm(0.9995) *
pred1$se.fit))), col = rgb(0, 0, 1, 0.2), border = NA)
When designing an experiment, we generally want to be able to create an experiment that adequately tests our hypothesis. Accomplishing this requires having sufficient “power” to detect any effects. Power is sometimes also called “sensitivity.” Power refers to the ability of a test (i.e., an analysis of an experiment) to detect a “true effect” that is different from the null hypothesis (e.g., the ability to detect a difference between treatment and control when that difference actually exists).
There are four factors that influence power: sample size, the true effect size, the variance of the effect, and the alpha threshold (level of significance). The most important factor in power is sample size. Larger samples have more power than small samples, but the gain in power is non-linear. There is a declining marginal return (in terms of power) for each additional unit in the experiment, so designing an experiment trades off power against cost considerations. The alpha level (level of significance) also influences power. If we have a more liberal threshold (i.e., a higher alpha level), we have more power to detect the effect. But this higher power comes because the more liberal threshold also increases our “false positive” rate, where the analysis is more likely to say there is an effect when in fact there is not. So, again, there is a trade-off between detecting a true effect and avoiding false detections.
One of the simplest examples of power involves looking at a common statistical test for analyzing experiments: the t-test. The t-test looks at the difference in means for two groups or the difference between one group and a null hypothesis value (often zero).
t.test()
## Error: argument "x" is missing, with no default
power.t.test()
## Error: exactly one of 'n', 'delta', 'sd', 'power', and 'sig.level' must be
## NULL
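Those errors simply reflect that both functions require arguments. As an illustration of what power.t.test expects, we can leave n unspecified and ask it to solve for the per-group sample size needed to detect a standardized difference; the delta, sd, sig.level, and power values here are illustrative assumptions, not fixed recommendations:
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8) # solves for n per group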
Education researcher Howard Bloom has suggested that power is a difficult concept to grasp. He instead suggests we rely on a measure of “minimum detectable effect” (MDE) to discuss experiments. He's probably right. The MDE tells us the smallest true effect, in standard deviations of the outcome, that is detectable for a given level of power and statistical significance. Because the standard error shrinks as the sample size grows, the MDE incorporates all of the information of a power calculation but does so in a way that applies to all experiments. That is to say, as long as we can guess at the variance of our outcome, the same sample size considerations apply every time we conduct any experiment.
For a one-tailed test:
sigma <- 1
sigma * qnorm((1 - 0.05)) + sigma * qnorm(0.8)
## [1] 2.486
For a two-tailed test:
sigma * qnorm((1 - (0.5 * 0.05))) + sigma * qnorm(0.8)
## [1] 2.802
We can envision the MDE as the effect size at which, for power = .8, 80% of the sampling distribution under the alternative hypothesis lies beyond the critical value of the test:
curve(dnorm(x, 0, 1), col = "gray", xlim = c(-3, 8)) # null hypothesis
segments(0, 0, 0, dnorm(0, 0, 1), col = "gray") # mean
curve(dnorm(x, 4, 1), col = "blue", add = TRUE) # alternative hypothesis
segments(4, 0, 4, dnorm(4, 4, 1), col = "blue") # mean
p <- qnorm((1 - 0.05), 0, 1) + qnorm(0.8, 0, 1)
segments(p, 0, p, dnorm(p, 4, 1), lwd = 2)
## Error: plot.new has not been called yet
e <- qnorm((1 - 0.05), 0, 1)
segments(e, 0, e, dnorm(e), lwd = 2)
## Error: plot.new has not been called yet
As in standard power calculations, we still need to calculate the standard deviation of the outcome.
FORTHCOMING
A critical aspect of (parametric) statistical analysis is the use of probability distributions, like the normal (Gaussian) distribution. These distributions underlie all of our common (parametric) statistical tests, like t-tests, chi-squared tests, ANOVA, regression, and so forth. R has functions to draw values from all of the common distributions (normal, t, F, chi-squared, binomial, Poisson, etc.), as well as many others.
There are four families of functions that R implements uniformly across each of these distributions, which let users obtain the probability density, the cumulative density, the quantiles, and random draws from a distribution. For example, the dnorm
function provides the density of the normal distribution at a specific quantile. The pnorm
function provides the cumulative density of the normal distribution at a specific quantile. The qnorm
function provides the quantile of the normal distribution at a specified cumulative density. An additional function, rnorm
, draws random values from the normal distribution (but this is discussed in detail in the random sampling tutorial).
The same functions are also implemented for the other common distributions. For example, the functions for Student's t distribution are dt
, pt
, and qt
. For the chi-squared distribution, they are dchisq
, pchisq
, and qchisq
. Hopefully you see the pattern. The rest of this tutorial walks through how to use these functions.
The density functions provide the density of a specified distribution at a given quantile. This means that the d*
family of functions can extract the density not just from the standard version of a distribution but from any parameterization of it. For example, calling:
dnorm(0)
## [1] 0.3989
provides the density of the standard normal distribution (i.e., a normal distribution with mean 0 and standard deviation 1) at the point 0 (i.e., at the distribution's mean). We can retrieve the density at a different value (or vector of values) easily:
dnorm(1)
## [1] 0.242
dnorm(1:3)
## [1] 0.241971 0.053991 0.004432
We can also retrieve densities from a different normal distribution (e.g., one with a higher mean or larger SD):
dnorm(1, mean = -2)
## [1] 0.004432
dnorm(1, mean = 5)
## [1] 0.0001338
dnorm(1, mean = 0, sd = 3)
## [1] 0.1258
We are often much more interested in the cumulative distribution (i.e., how much of the distribution is to the left of the indicated value). For this, we can use the p*
family of functions.
As an example, let's obtain the cumulative distribution function's value from a standard normal distribution at the point 0 (i.e., the distribution's mean):
pnorm(0)
## [1] 0.5
Unsurprisingly, the value is .5 because half of the distribution is to the left of 0.
When we conduct statistical significance testing, we compare a value we observe to the cumulative distribution function. As one might recall, the value of 1.65 is the (approximate) critical value for a 90% normal confidence interval. We can see that by requesting:
pnorm(1.65)
## [1] 0.9505
The comparable value for a 95% CI is 1.96:
pnorm(1.96)
## [1] 0.975
Note how the values are ~.95 and ~.975, respectively, because those are critical values for two-tailed tests. If we plug a negative value into the pnorm
function, we'll receive the cumulative probability for the left side of the distribution:
pnorm(-1.96)
## [1] 0.025
Thus subtracting the output of pnorm
for the negative input from the output for the positive input, we'll see that 95% of the density is between -1.96 and 1.96 (in the standard normal distribution):
pnorm(1.96) - pnorm(-1.96)
## [1] 0.95
The examples just described relied on the heuristic values of 1.65 and 1.96 as the thresholds for 90% and 95% two-tailed tests. But to find the exact points at which the normal distribution has accumulated a particular cumulative density, we can use the qnorm
function. Essentially, qnorm
is the reverse of pnorm
.
To obtain the critical values for a two-tailed 95% confidence interval, we would plug .025 and .975 into qnorm
:
qnorm(c(0.025, 0.975))
## [1] -1.96 1.96
And we could actually nest that call inside a pnorm
function to see that pnorm
and qnorm
are opposites:
pnorm(qnorm(c(0.025, 0.975)))
## [1] 0.025 0.975
For one-tailed tests, we simply specify the cumulative density. So, for a one-tailed 95% critical value, we would specify:
qnorm(0.95)
## [1] 1.645
We could obtain the other tail by specifying:
qnorm(0.05)
## [1] -1.645
Or, we could obtain the same value by treating 0.95 as an upper-tail probability rather than a lower (left) tail probability (which is the default):
qnorm(0.95, lower.tail = FALSE)
## [1] -1.645
As with dnorm
, pnorm
, and qnorm
work on arbitrary normal distributions, but the results will be less familiar to us:
pnorm(1.96, mean = 3)
## [1] 0.1492
qnorm(0.95, mean = 3)
## [1] 4.645
As stated above, R supplies functions analogous to those just described for numerous distributions.
Details about all of the distributions can be found in the help files: ? Distributions
.
Here are a few examples:
t distribution
Note: The t distribution functions require a df
argument, specifying the degrees of freedom.
qt(0.95, df = 1000)
## [1] 1.646
qt(c(0.025, 0.975), df = 1000)
## [1] -1.962 1.962
Binomial distribution
The binomial distribution functions work as above, but require size
and prob
arguments, specifying the number of draws (trials) and the probability of success. So, if we are modelling fair coin flips:
dbinom(0, 1, 0.5)
## [1] 0.5
pbinom(0, 1, 0.5)
## [1] 0.5
qbinom(0.95, 1, 0.5)
## [1] 1
qbinom(c(0.025, 0.975), 1, 0.5)
## [1] 0 1
Note: Because the binomial is a discrete distribution, the values here might seem strange compared to the above.
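The same pattern extends to the other distributions listed in ?Distributions. For example, the Poisson functions take a lambda argument specifying the mean of the distribution; a quick sketch:
dpois(2, lambda = 3) # probability of exactly 2 events
ppois(2, lambda = 3) # probability of 2 or fewer events
qpois(0.95, lambda = 3) # 95th percentile of the distribution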
R objects can be of several different “classes.” A class essentially describes what kind of information is contained in the object.
Often an object contains “numeric” class data, like a number or vector of numbers.
We can test the class of an object using class
:
class(12)
## [1] "numeric"
class(c(1, 1.5, 2))
## [1] "numeric"
While most numbers are of class “numeric”, a subset are “integer”:
class(1:5)
## [1] "integer"
We can coerce numeric class objects to an integer class:
as.integer(c(1, 1.5, 2))
## [1] 1 1 2
But note that this modifies the second item in the vector (1.5 becomes 1).
Other common classes include “character” data. We see character class data in country names or certain survey responses:
class("United States")
## [1] "character"
If we try to coerce character to numeric, we get a warning and the result is a missing value:
as.numeric("United States")
## Warning: NAs introduced by coercion
## [1] NA
If we combine a numeric (or integer) and a character together in a vector, the result is character:
class(c(1, "test"))
## [1] "character"
You can see that the 1
is coerced to character:
c(1, "test")
## [1] "1" "test"
We can also coerce a numeric vector to character simply by changing its class:
a <- 1:4
class(a)
## [1] "integer"
class(a) <- "character"
class(a)
## [1] "character"
a
## [1] "1" "2" "3" "4"
Another class is “factor.”
Factors are very important to R, especially in regression modelling.
Factors combine characteristics of numeric and character classes.
We can create a factor from numeric data using factor
:
factor(1:3)
## [1] 1 2 3
## Levels: 1 2 3
We see that factor displays a special levels
attribute.
Levels describe the unique values in the vector.
For example, with the following factor, there are six values but only two levels:
factor(c(1, 2, 1, 2, 1, 2))
## [1] 1 2 1 2 1 2
## Levels: 1 2
To see just the levels, we can use the levels
function:
levels(factor(1:3))
## [1] "1" "2" "3"
levels(factor(c(1, 2, 1, 2, 1, 2)))
## [1] "1" "2"
We can also build factors from character data:
factor(c("a", "b", "b", "c"))
## [1] a b b c
## Levels: a b c
We can look at factors in more detail in the factors.R
script.
Another common class is “logical” data.
This class consists of TRUE and FALSE values.
We can look at that class in detail in the logicals.R
script.
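As a quick sketch of the logical class (the details are left to that script), logical values arise both from the literals TRUE and FALSE and from comparisons:
class(TRUE)
## [1] "logical"
class(c(1, 2, 3) > 2)
## [1] "logical"
c(1, 2, 3) > 2
## [1] FALSE FALSE  TRUE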
One of the most confusing aspects of R for users of other statistical software is the idea that one can have any number of objects available in the R environment. One need not be constrained to a single rectangular dataset. This also means that it can be confusing to see what data is actually loaded into memory at any point in time. Here we discuss some tools for understanding the R working environment.
Let's start by clearing our workspace:
# rm(list=ls())
This option should be available in RGui under menu Miscellaneous > Remove all objects. Then create some R objects:
set.seed(1)
x <- rbinom(50, 1, 0.5)
y <- ifelse(x == 1, rnorm(sum(x == 1), 1, 1), rnorm(sum(!x == 1), 2, 1))
mydf <- data.frame(x = x, y = y)
Once we have a number of objects stored in memory, we can look at all of them using ls
:
ls()
## [1] "a" "allout" "amat" "b"
## [5] "between" "bmat" "c" "change"
## [9] "cmat" "coef.mi" "coefs.amelia" "d"
## [13] "d2" "df1" "df2" "e"
## [17] "e1" "e2" "e3" "e4"
## [21] "englebert" "f" "FUN" "g1"
## [25] "g2" "grandm" "grandse" "grandvar"
## [29] "height" "imp" "imp.amelia" "imp.mi"
## [33] "imp.mice" "lm" "lm.amelia.out" "lm.mi.out"
## [37] "lm.mice.out" "lmfit" "lmp" "localfit"
## [41] "localp" "logodds" "logodds_lower" "logodds_se"
## [45] "logodds_upper" "m" "m1" "m2"
## [49] "m2a" "m2b" "m3a" "m3b"
## [53] "me" "me_se" "means" "mmdemo"
## [57] "mydf" "myformula" "n" "newdata"
## [61] "newdata1" "newdata2" "newdf" "newvar"
## [65] "out" "p1" "p2" "p2a"
## [69] "p2b" "p3a" "p3b" "p3b.fitted"
## [73] "part1" "part2" "pool.mice" "ppcurve"
## [77] "s" "s.amelia" "s.mi" "s.mice"
## [81] "s.orig" "s.real" "s2" "search"
## [85] "ses" "ses.amelia" "tmpdf" "tmpsplit"
## [89] "tr" "w" "weight" "within"
## [93] "x" "X" "x1" "x2"
## [97] "X2" "x3" "y" "y1"
## [101] "y1s" "y2" "y2s" "y3"
## [105] "y3s" "z" "z1" "z2"
This shows us all of the objects that are currently saved. If we do another operation but do not save the result:
2 + 2
## [1] 4
ls()
## [1] "a" "allout" "amat" "b"
## [5] "between" "bmat" "c" "change"
## [9] "cmat" "coef.mi" "coefs.amelia" "d"
## [13] "d2" "df1" "df2" "e"
## [17] "e1" "e2" "e3" "e4"
## [21] "englebert" "f" "FUN" "g1"
## [25] "g2" "grandm" "grandse" "grandvar"
## [29] "height" "imp" "imp.amelia" "imp.mi"
## [33] "imp.mice" "lm" "lm.amelia.out" "lm.mi.out"
## [37] "lm.mice.out" "lmfit" "lmp" "localfit"
## [41] "localp" "logodds" "logodds_lower" "logodds_se"
## [45] "logodds_upper" "m" "m1" "m2"
## [49] "m2a" "m2b" "m3a" "m3b"
## [53] "me" "me_se" "means" "mmdemo"
## [57] "mydf" "myformula" "n" "newdata"
## [61] "newdata1" "newdata2" "newdf" "newvar"
## [65] "out" "p1" "p2" "p2a"
## [69] "p2b" "p3a" "p3b" "p3b.fitted"
## [73] "part1" "part2" "pool.mice" "ppcurve"
## [77] "s" "s.amelia" "s.mi" "s.mice"
## [81] "s.orig" "s.real" "s2" "search"
## [85] "ses" "ses.amelia" "tmpdf" "tmpsplit"
## [89] "tr" "w" "weight" "within"
## [93] "x" "X" "x1" "x2"
## [97] "X2" "x3" "y" "y1"
## [101] "y1s" "y2" "y2s" "y3"
## [105] "y3s" "z" "z1" "z2"
Because we did not assign it to a name, this result is not visible with ls
. Essentially, it disappears into the ether.
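If we wanted to keep that result, we would simply assign it to a name (the name result below is just an example):
result <- 2 + 2
result
## [1] 4
"result" %in% ls()
## [1] TRUE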
Now we can look at any of these objects just by calling their name:
x
## [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1
## [36] 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1
y
## [1] 2.3411 0.8706 -0.4708 0.5218 1.6328 2.3587 0.8972 1.3877
## [9] 0.9462 1.9608 2.6897 2.0280 0.9407 2.1888 1.7632 3.4656
## [17] 0.7466 1.6970 2.4755 0.3112 0.2925 1.0659 1.7685 2.3411
## [25] 0.8706 3.4330 3.9804 1.6328 0.8442 2.5697 1.8649 1.4179
## [33] 1.9608 2.6897 1.3877 0.9462 -0.3771 0.1950 0.6057 2.1533
## [41] 2.1000 1.7632 0.8355 0.7466 1.6970 1.5567 2.3411 0.8706
## [49] 1.3646 1.7685
mydf
## x y
## 1 0 2.3411
## 2 0 0.8706
## 3 1 -0.4708
## 4 1 0.5218
## 5 0 1.6328
## 6 1 2.3587
## 7 1 0.8972
## 8 1 1.3877
## 9 1 0.9462
## 10 0 1.9608
## 11 0 2.6897
## 12 0 2.0280
## 13 1 0.9407
## 14 0 2.1888
## 15 1 1.7632
## 16 0 3.4656
## 17 1 0.7466
## 18 1 1.6970
## 19 0 2.4755
## 20 1 0.3112
## 21 1 0.2925
## 22 0 1.0659
## 23 1 1.7685
## 24 0 2.3411
## 25 0 0.8706
## 26 0 3.4330
## 27 0 3.9804
## 28 0 1.6328
## 29 1 0.8442
## 30 0 2.5697
## 31 0 1.8649
## 32 1 1.4179
## 33 0 1.9608
## 34 0 2.6897
## 35 1 1.3877
## 36 1 0.9462
## 37 1 -0.3771
## 38 0 0.1950
## 39 1 0.6057
## 40 0 2.1533
## 41 1 2.1000
## 42 1 1.7632
## 43 1 0.8355
## 44 1 0.7466
## 45 1 1.6970
## 46 1 1.5567
## 47 0 2.3411
## 48 0 0.8706
## 49 1 1.3646
## 50 1 1.7685
The first two objects (x
and y
) are vectors, so they simply print to the console.
The third object (mydf
) is a dataframe, so its contents are printed as columns with row numbers.
If we call one of the columns from the dataframe, it will look just like a vector:
mydf$x
## [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1
## [36] 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1
This looks the same as just calling the x
object and indeed they are the same:
mydf$x == x
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
But if we change one of the objects, it only affects the object we changed:
x <- rbinom(50, 1, 0.5)
mydf$x == x
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [12] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## [23] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [34] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [45] FALSE TRUE TRUE FALSE FALSE TRUE
So by storing something new into x
we change it, but not mydf$x
because that's a different object.
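The reverse also holds: because R copies on assignment, modifying a copy of the column leaves the dataframe untouched (x_copy below is just an illustrative name):
x_copy <- mydf$x   # copy the column into a new object
x_copy[1] <- 99    # change the copy
mydf$x[1]          # the dataframe column is unaffected
## [1] 0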
class
We sometimes want to know what kind of object something is. We can see this with class
:
class(x)
## [1] "integer"
class(y)
## [1] "numeric"
class(mydf)
## [1] "data.frame"
We can also use class
on the columns of a dataframe:
class(mydf$x)
## [1] "integer"
class(mydf$y)
## [1] "numeric"
This is helpful, but it doesn't tell us a lot about the objects (i.e., it's not a very good summary). We can, however, see more detail using some other functions.
One way to get very detailed information about an object is with str
(i.e., structure):
str(x)
## int [1:50] 1 1 0 0 1 0 1 0 0 0 ...
This output tells us that this is an object of class “integer”, with length 50, and it shows the first few values.
str(y)
## num [1:50] 2.341 0.871 -0.471 0.522 1.633 ...
This output tells us that this is an object of class “numeric”, with length 50, and it shows the first few values.
str(mydf)
## 'data.frame': 50 obs. of 2 variables:
## $ x: int 0 0 1 1 0 1 1 1 1 0 ...
## $ y: num 2.341 0.871 -0.471 0.522 1.633 ...
This output tells us that this is an object of class “data.frame”, with 50 observations on two variables. It then provides the same type of details for each variable that we would see by calling str(mydf$x)
, etc. directly.
Using str
on dataframes is therefore a very helpful and compact way to look at your data. More about this later.
To see more details we may want to use some other functions.
One particularly helpful function is summary
, which provides some basic details about an object.
For the two vectors, this will give us summary statistics.
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 1.00 0.56 1.00 1.00
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.471 0.871 1.630 1.550 2.140 3.980
For the dataframe, it will give us summary statistics for everything in the dataframe:
summary(mydf)
## x y
## Min. :0.00 Min. :-0.471
## 1st Qu.:0.00 1st Qu.: 0.871
## Median :1.00 Median : 1.633
## Mean :0.54 Mean : 1.549
## 3rd Qu.:1.00 3rd Qu.: 2.140
## Max. :1.00 Max. : 3.980
Note how the printed information is the same but looks different.
This is because R prints slightly different things depending on the class of the input object.
If you want to look “under the hood”, you will see that summary
is actually a set of multiple functions. When you type summary
you see that R is calling a “method” depending on the class of the object. For our examples, the methods called are summary.default
and summary.data.frame
, which differ in what they print to the console for vectors and dataframes, respectively.
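We can list the methods defined for a generic function such as summary with the methods function, and inspect a particular method with getS3method (the exact list of methods depends on which packages are attached, so no output is shown here):
methods(summary)                    # e.g., summary.data.frame, summary.default, ...
getS3method("summary", "default")   # show the code behind one specific method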
Conveniently, we can also save any output of a function as a new object. So here we can save the summary
of x
as a new object:
sx <- summary(x)
And do the same for mydf
:
smydf <- summary(mydf)
We can then see that these new objects also have classes:
class(sx)
## [1] "summaryDefault" "table"
class(smydf)
## [1] "table"
And, as you might be figuring out, an object's class determines how it is printed to the console.
Again, looking “under the hood”, this is because there are separate print
methods for each object class (see print.data.frame
for how a dataframe is printed and print.table
for how the summary of a dataframe is printed).
This can create some confusion, though, because it means that what is printed is a reflection of the underlying object but is not actually the object. A bit existential, right? Because calling objects shows a printed rendition of an object, we can sometimes get confused about what that object actually is.
This is where str
can again be helpful:
str(sx)
## Classes 'summaryDefault', 'table' Named num [1:6] 0 0 1 0.56 1 1
## ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
str(smydf)
## 'table' chr [1:6, 1:2] "Min. :0.00 " "1st Qu.:0.00 " ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:6] "" "" "" "" ...
## ..$ : chr [1:2] " x" " y"
Here we see that the summary of x
and summary of mydf
are both tables. summary(x)
is a one-dimensional table, whereas summary(mydf)
is a two-dimensional table (because it shows multiple variables).
Because these objects are tables, it actually means we can index them like any other table:
sx[1]
## Min.
## 0
sx[2:3]
## 1st Qu. Median
## 0 1
smydf[, 1]
##
## "Min. :0.00 " "1st Qu.:0.00 " "Median :1.00 " "Mean :0.54 "
##
## "3rd Qu.:1.00 " "Max. :1.00 "
smydf[1:3, ]
## x y
## "Min. :0.00 " "Min. :-0.471 "
## "1st Qu.:0.00 " "1st Qu.: 0.871 "
## "Median :1.00 " "Median : 1.633 "
This can be confusing because sx
and smydf
do not look like objects we can index, but that is because the way they are printed doesn't reflect the underlying structure of the objects.
It can be helpful to look at another example to see how what is printed can be confusing. Let's conduct a t-test on our data and see the result:
t.test(mydf$x, mydf$y)
##
## Welch Two Sample t-test
##
## data: mydf$x and mydf$y
## t = -6.745, df = 75.45, p-value = 2.714e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3067 -0.7109
## sample estimates:
## mean of x mean of y
## 0.540 1.549
The result is a bunch of details about the t.test. Like above, we can save this object:
myttest <- t.test(mydf$x, mydf$y)
Then we can call the object again whenever we want without repeating the calculation:
myttest
##
## Welch Two Sample t-test
##
## data: mydf$x and mydf$y
## t = -6.745, df = 75.45, p-value = 2.714e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3067 -0.7109
## sample estimates:
## mean of x mean of y
## 0.540 1.549
If we try to run summary
on this, we get some weirdness:
summary(myttest)
## Length Class Mode
## statistic 1 -none- numeric
## parameter 1 -none- numeric
## p.value 1 -none- numeric
## conf.int 2 -none- numeric
## estimate 2 -none- numeric
## null.value 1 -none- numeric
## alternative 1 -none- character
## method 1 -none- character
## data.name 1 -none- character
That is because there is no method for summarizing a t-test object. Why is this?
It is because of the class and structure of our myttest
object. Let's look:
class(myttest)
## [1] "htest"
This says it is of class “htest”. Not intuitive, but that's what it is.
str(myttest)
## List of 9
## $ statistic : Named num -6.75
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 75.4
## ..- attr(*, "names")= chr "df"
## $ p.value : num 2.71e-09
## $ conf.int : atomic [1:2] -1.307 -0.711
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 0.54 1.55
## ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "mydf$x and mydf$y"
## - attr(*, "class")= chr "htest"
This is more interesting. The output tells us that myttest
is a list of 9 objects. If we compare this to the output of myttest
, we will see that when we call myttest
, R is printing the underlying list in a pretty fashion for us.
But because myttest
is a list, it means that we can access any of the values in the list simply by calling them.
So the list consists of statistic
, parameter
, p.value
, etc. Let's look at some of them:
myttest$statistic
## t
## -6.745
myttest$p.value
## [1] 2.714e-09
The ability to extract these values from the underlying object (in addition to seeing them printed to the console in pretty form) means that we can easily reuse objects, e.g., to combine the results of multiple tests into a simplified table or to use values from one test elsewhere in our analysis. As a simple example, let's compare the p-values of the same t-test under different hypotheses (two-sided, which is the default, and each of the one-sided alternatives):
myttest2 <- t.test(mydf$x, mydf$y, "greater")
myttest3 <- t.test(mydf$x, mydf$y, "less")
myttest$p.value
## [1] 2.714e-09
myttest2$p.value
## [1] 1
myttest3$p.value
## [1] 1.357e-09
This is much easier than having to copy and paste the p-value from each of the outputs and because these objects are stored in memory, we can access them at any point later in this session.
Recoding is one of the most important tasks in preparing for an analysis. Often the data we have is not in the format we need to perform an analysis. Changing data in R is easy, as long as we understand indexing and assignment.
To recode values, we can either rely on positional or logical indexing. To change a particular value, we can rely on positions:
a <- 1:10
a[1] <- 99
a
## [1] 99 2 3 4 5 6 7 8 9 10
But this does not scale well; it is no better than recoding by hand. Logical indexing makes it much easier to change multiple values at once:
a[a < 5] <- 99
a
## [1] 99 99 99 99 5 6 7 8 9 10
We can use multiple logical indices to change all of our values. For example, we could turn a vector into groups based on its values:
b <- 1:20
c <- b
c[b < 6] <- 1
c[b >= 6 & b <= 10] <- 2
c[b >= 11 & b <= 15] <- 3
c[b > 15] <- 4
Looking at the two vectors as a matrix, we can see how our input values translated to outputs:
cbind(b, c)
## b c
## [1,] 1 1
## [2,] 2 1
## [3,] 3 1
## [4,] 4 1
## [5,] 5 1
## [6,] 6 2
## [7,] 7 2
## [8,] 8 2
## [9,] 9 2
## [10,] 10 2
## [11,] 11 3
## [12,] 12 3
## [13,] 13 3
## [14,] 14 3
## [15,] 15 3
## [16,] 16 4
## [17,] 17 4
## [18,] 18 4
## [19,] 19 4
## [20,] 20 4
We can obtain the same result with nested ifelse
functions:
d <- ifelse(b < 6, 1, ifelse(b >= 6 & b <= 10, 2, ifelse(b >= 11 & b <= 15,
3, ifelse(b > 15, 4, NA))))
cbind(b, c, d)
## b c d
## [1,] 1 1 1
## [2,] 2 1 1
## [3,] 3 1 1
## [4,] 4 1 1
## [5,] 5 1 1
## [6,] 6 2 2
## [7,] 7 2 2
## [8,] 8 2 2
## [9,] 9 2 2
## [10,] 10 2 2
## [11,] 11 3 3
## [12,] 12 3 3
## [13,] 13 3 3
## [14,] 14 3 3
## [15,] 15 3 3
## [16,] 16 4 4
## [17,] 17 4 4
## [18,] 18 4 4
## [19,] 19 4 4
## [20,] 20 4 4
Another way to write this that is sometimes more convenient involves the package car
, which we would need to load:
library(car)
From this package we can use the function recode
to recode a vector:
e <- recode(b, "1:5=1; 6:10=2; 11:15=3; 16:20=4; else=NA")
The recode
function can also infer the minimum ('lo') and maximum ('hi') values in a vector:
f <- recode(b, "lo:5=1; 6:10=2; 11:15=3; 16:hi=4; else=NA")
All of these techniques produce the same result:
cbind(b, c, d, e, f)
## b c d e f
## [1,] 1 1 1 1 1
## [2,] 2 1 1 1 1
## [3,] 3 1 1 1 1
## [4,] 4 1 1 1 1
## [5,] 5 1 1 1 1
## [6,] 6 2 2 2 2
## [7,] 7 2 2 2 2
## [8,] 8 2 2 2 2
## [9,] 9 2 2 2 2
## [10,] 10 2 2 2 2
## [11,] 11 3 3 3 3
## [12,] 12 3 3 3 3
## [13,] 13 3 3 3 3
## [14,] 14 3 3 3 3
## [15,] 15 3 3 3 3
## [16,] 16 4 4 4 4
## [17,] 17 4 4 4 4
## [18,] 18 4 4 4 4
## [19,] 19 4 4 4 4
## [20,] 20 4 4 4 4
Instead of checking this visually, we can use an all.equal
function to compare two vectors:
all.equal(c, d)
## [1] TRUE
all.equal(c, e)
## [1] TRUE
all.equal(c, f)
## [1] TRUE
Note: if we instead used the ==
double equals comparator, the result would be a logical vector that compares corresponding values in each vector:
c == d
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE
If all.equal
does not return TRUE, the ==
(double equals) comparator shows where the vectors differ.
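For example, if we deliberately change one value in a copy (d_alt below is just an illustrative name):
d_alt <- d
d_alt[5] <- 99               # introduce a single discrepancy
isTRUE(all.equal(c, d_alt))  # all.equal no longer returns TRUE
## [1] FALSE
which(c != d_alt)            # the element-wise comparison locates the difference
## [1] 5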
Missing values are handled somewhat differently from other values.
If our vector has missing values, we need to use the is.na
logical function to identify them:
g <- c(1:5, NA, 7:13, NA, 15)
h <- g
g
## [1] 1 2 3 4 5 NA 7 8 9 10 11 12 13 NA 15
g[is.na(g)] <- 99
The recode
function can also handle missing values to produce the same result:
h <- recode(h, "NA=99")
all.equal(g, h)
## [1] TRUE
Often we want to recode based on two variables (e.g., age and sex) to produce categories.
This is easy using the right logical statements.
Let's create some fake data (in the form of a dataframe) using a function called expand.grid
:
i <- expand.grid(1:4, 1:2)
i
## Var1 Var2
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 1 2
## 6 2 2
## 7 3 2
## 8 4 2
This dataframe has two variables (columns), one with four categories ('Var1') and one with two ('Var2').
Perhaps we want to create a variable that reflects each unique combination of the two variables.
We can do this with ifelse
:
ifelse(i$Var2 == 1, i$Var1, i$Var1 + 4)
## [1] 1 2 3 4 5 6 7 8
This statement says that if an element from i$Var2
is equal to 1,
then the value of the corresponding element in our new variable is equal to the value in i$Var1
.
Otherwise, the value of the corresponding element in the new vector is set to i$Var1+4
.
That solution requires us to know something about the data (that it's okay to simply add 4 to get unique values).
A more general solution is to use the interaction
function:
interaction(i$Var1, i$Var2)
## [1] 1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2
## Levels: 1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2
This produces a factor vector with eight unique values. The names are a little strange, but it will always give us every unique combination of the two (or more) vectors.
There are lots of different ways to recode vectors, but these are the basic tools that can be combined to do almost anything.
R's graphical capabilities are very strong. This is particularly helpful when dealing with regression.
If we have a regression model, there are a number of ways we can plot the relationships between variables. We can also use plots for checking model specification and assumptions.
Let's start with a basic bivariate regression:
set.seed(1)
x <- rnorm(1000)
y <- 1 + x + rnorm(1000)
ols1 <- lm(y ~ x)
The easiest plot we can draw is the relationship between y and x.
plot(y ~ x)
We can also add a line representing this relationship to the plot.
To get the coefficients from the model, we can use coef
:
coef(ols1)
## (Intercept) x
## 0.9838 1.0064
We can then use those coefficients in the line-plotting function abline
:
plot(y ~ x)
abline(a = coef(ols1)[1], b = coef(ols1)[2])
We can specify “graphical parameters” in both plot
and abline
to change the look.
For example, we could change the color:
plot(y ~ x, col = "gray")
abline(a = coef(ols1)[1], b = coef(ols1)[2], col = "red")
We can also use plot
to extract several diagnostics for our model.
Almost all of these help us to identify outliers or other irregularities.
If we type:
plot(ols1)
We are given a series of plots describing the model. We can also see two other plots that are not displayed by default.
To obtain a given plot, we use the which
parameter inside plot
:
plot(ols1, which = 4)
The six available plots are:
(1) A residual plot
(2) A Quantile-Quantile plot to check the distribution of our residuals
(3) A scale-location plot
(4) Cook's distance, to identify potential outliers
(5) A residual versus leverage plot, to identify potential outliers
(6) A Cook's distance versus leverage plot
Besides the default plot(ols1, which=1)
to get residuals, we can also plot residuals manually:
plot(ols1$residuals ~ x)
We might want to do this to check whether another variable should be in our model:
x2 <- rnorm(1000)
plot(ols1$residuals ~ x2)
Obviously, in this case x2
doesn't belong in the model.
Let's see a case where the plot would help us:
y2 <- x + x2 + rnorm(1000)
ols2 <- lm(y2 ~ x)
plot(ols2$residuals ~ x2)
Clearly, x2
is strongly related to our residuals, so it belongs in the model.
We can also use residuals plots to check for nonlinear relationships (i.e., functional form):
y3 <- x + (x^2) + rnorm(1000)
ols3 <- lm(y3 ~ x)
plot(ols3$residuals ~ x)
Even though x
is in our model, it is not in the correct form.
Let's try fixing that and see what happens to our plot:
ols3b <- lm(y3 ~ x + I(x^2))
Note: We need to use the I()
operator inside formulae in order to have R generate the x^2
variable!
Note (continued): This saves us from having to define a new variable: xsq <- x^2
and then running the model.
plot(ols3b$residuals ~ x)
Clearly, the model now incorporates x
in the correct functional form.
Of course, if we had plotted our data originally:
plot(y3 ~ x)
We would have seen the non-linear relationship and could have skipped the incorrect model entirely.
Residual plots can also reveal heteroskedasticity:
x3 <- runif(1000, 1, 10)
y4 <- (3 * x3) + rnorm(1000, 0, x3)
ols4 <- lm(y4 ~ x3)
plot(ols4$residuals ~ x3)
Here we see that x3
is correctly specified in the model: there is no systematic relationship between x3
and the residuals
.
But, the variance of the residuals is much higher at higher levels of x3
.
We might need to rely on a different estimate of our regression SEs than the default provided by R.
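One common option, sketched below, is to compute heteroskedasticity-consistent (“robust”) standard errors using the add-on packages sandwich and lmtest (this assumes both packages are installed; it is one option among several, not the only fix):
library(sandwich)  # heteroskedasticity-consistent covariance matrix estimators
library(lmtest)    # coeftest() re-tests coefficients with an alternative vcov
# compare these robust standard errors with the default summary(ols4)
coeftest(ols4, vcov. = vcovHC(ols4, type = "HC1"))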
And, again, this is a problem we could have identified by plotting our original data:
plot(y4 ~ x3)
If our model has more than one independent variable, these plotting tools all still work.
set.seed(1)
x5 <- rnorm(1000)
z5 <- runif(1000, 1, 5)
y5 <- x5 + z5 + rnorm(1000)
ols5 <- lm(y5 ~ x5 + z5)
We can see all six of our diagnostic plots:
plot(ols5, 1:6)
We can plot our outcome against the input variables:
plot(y5 ~ x5)
plot(y5 ~ z5)
We can see residual plots:
plot(ols5$residuals ~ x5)
plot(ols5$residuals ~ z5)
We might also want to check for colinearity between our input variables.
We could do this with cor
:
cor(x5, z5)
## [1] 0.03504
Or we could see it visually with a scatterplot:
plot(x5, z5)
In either case, there's no relationship.
We can also plot our effects from our model against our input data:
coef(ols5)
## (Intercept) x5 z5
## -0.0639 1.0183 1.0182
Let's plot the two input variables together using layout
:
layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5)[1] + mean(z5), b = coef(ols5)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
abline(a = coef(ols5)[1] + mean(x5), b = coef(ols5)["z5"], col = "red")
Note: We add the mean of the other input variable (strictly, its mean multiplied by its coefficient, which here is approximately 1) so that the lines are drawn at the right height. If we plot each bivariate relationship separately, we'll see how we get the lines of best fit:
ols5a <- lm(y5 ~ x5)
ols5b <- lm(y5 ~ z5)
layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5a)[1], b = coef(ols5a)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
abline(a = coef(ols5b)[1], b = coef(ols5b)["z5"], col = "red")
If we regress the residuals from ols5a
on z5
we'll see some magic happen.
The estimated coefficient for z5
is almost identical to that from our full y5 ~ x5 + z5
model:
tmpz <- lm(ols5a$residuals ~ z5)
coef(tmpz)["z5"]
## z5
## 1.017
coef(ols5)["z5"]
## z5
## 1.018
The same pattern works if we repeat this process for our x5
input variable:
tmpx <- lm(ols5b$residuals ~ x5)
coef(tmpx)["x5"]
## x5
## 1.017
coef(ols5)["x5"]
## x5
## 1.018
In other words, the coefficients from our full model ols5
reflect the regression of the part of y
left unexplained by the other input variable(s) (i.e., the residuals)
on each input variable.
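This is the logic of the Frisch-Waugh-Lovell theorem. The exact version partials the other input out of both the outcome and the variable of interest; here is a brief sketch (the object name z5_resid is just illustrative):
# residuals of z5 after removing the part explained by x5
z5_resid <- lm(z5 ~ x5)$residuals
# regressing the y5-residuals (from y5 ~ x5) on the z5-residuals
# recovers the full-model coefficient on z5 exactly
coef(lm(ols5a$residuals ~ z5_resid))["z5_resid"]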
Let's see this visually by drawing the bivariate regression lines in blue.
And then overlapping these with the full model estimates in red:
layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5a)[1], b = coef(ols5a)["x5"], col = "blue")
abline(a = coef(ols5)[1] + mean(z5), b = coef(ols5)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
coef(lm(ols5a$residuals ~ z5))["z5"]
## z5
## 1.017
abline(a = coef(ols5b)[1], b = coef(ols5b)["z5"], col = "blue")
abline(a = coef(ols5)[1] + mean(x5), b = coef(ols5)["z5"], col = "red")
In that example, x5
and z5
were uncorrelated, so there was no bias from excluding one variable.
Let's look at a situation where we find omitted variable bias due to correlation between input variables.
set.seed(1)
x6 <- rnorm(1000)
z6 <- x6 + rnorm(1000, 0, 1.5)
y6 <- x6 + z6 + rnorm(1000)
We can see from a correlation and a plot that our two input variables are correlated:
cor(x6, z6)
## [1] 0.5565
plot(x6, z6)
Let's estimate some models:
ols6 <- lm(y6 ~ x6 + z6)
ols6a <- lm(y6 ~ x6)
ols6b <- lm(y6 ~ z6)
And then let's compare the bivariate estimates (blue) to the multivariate estimates (red):
layout(matrix(1:2, ncol = 2))
plot(y6 ~ x6, col = "gray")
abline(a = coef(ols6a)[1], b = coef(ols6a)["x6"], col = "blue")
abline(a = coef(ols6)[1] + mean(z6), b = coef(ols6)["x6"], col = "red")
plot(y6 ~ z6, col = "gray")
coef(lm(ols6a$residuals ~ z6))["z6"]
## z6
## 0.7004
abline(a = coef(ols6b)[1], b = coef(ols6b)["z6"], col = "blue")
abline(a = coef(ols6)[1] + mean(x6), b = coef(ols6)["z6"], col = "red")
As we can see, the estimates from our bivariate models overestimate the impact of each input. We could of course see this in the raw coefficients, as well:
coef(ols6)
## (Intercept) x6 z6
## 0.01624 1.03437 1.01468
coef(ols6a)
## (Intercept) x6
## -0.008398 2.058839
coef(ols6b)
## (Intercept) z6
## 0.01563 1.33198
These plots show, however, that omitted variable bias can be dangerous even when it seems our estimates are correct. The blue lines seem to fit the data, but those simple plots (and regressions) fail to account for correlations between inputs.
And the problem is that you can't predict omitted variable bias a priori. Let's repeat that last analysis but simply change the data generating process slightly:
set.seed(1)
x6 <- rnorm(1000)
z6 <- x6 + rnorm(1000, 0, 1.5)
y6 <- x6 - z6 + rnorm(1000) #' this is the only difference from the previous example
cor(x6, z6)
## [1] 0.5565
ols6 <- lm(y6 ~ x6 + z6)
ols6a <- lm(y6 ~ x6)
ols6b <- lm(y6 ~ z6)
layout(matrix(1:2, ncol = 2))
plot(y6 ~ x6, col = "gray")
abline(a = coef(ols6a)[1], b = coef(ols6a)["x6"], col = "blue")
abline(a = coef(ols6)[1] + mean(z6), b = coef(ols6)["x6"], col = "red")
plot(y6 ~ z6, col = "gray")
coef(lm(ols6a$residuals ~ z6))["z6"]
## z6
## -0.6802
abline(a = coef(ols6b)[1], b = coef(ols6b)["z6"], col = "blue")
abline(a = coef(ols6)[1] + mean(x6), b = coef(ols6)["z6"], col = "red")
The blue lines seem to fit the data, but they're biased estimates.
A contemporary way of presenting regression results involves converting a regression table into a figure.
set.seed(500)
x1 <- rnorm(100, 5, 5)
x2 <- rnorm(100, -2, 10)
x3 <- rnorm(100, 0, 20)
y <- (1 * x1) + (-2 * x2) + (3 * x3) + rnorm(100, 0, 20)
ols2 <- lm(y ~ x1 + x2 + x3)
Conventionally, we would present results from this regression as a table:
summary(ols2)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.89 -12.52 2.67 11.24 46.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0648 2.6053 -0.02 0.980
## x1 1.2211 0.3607 3.39 0.001 **
## x2 -2.0941 0.1831 -11.44 <2e-16 ***
## x3 3.0086 0.1006 29.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.1 on 96 degrees of freedom
## Multiple R-squared: 0.913, Adjusted R-squared: 0.91
## F-statistic: 335 on 3 and 96 DF, p-value: <2e-16
Or just:
coef(summary(ols2))[, 1:2]
## Estimate Std. Error
## (Intercept) -0.06483 2.6053
## x1 1.22113 0.3607
## x2 -2.09407 0.1831
## x3 3.00856 0.1006
It might be helpful to see the size and significance of these effects as a figure. To do so, we have to draw the regression slopes as points and the SEs as lines.
slopes <- coef(summary(ols2))[c("x1", "x2", "x3"), 1] #' slopes
ses <- coef(summary(ols2))[c("x1", "x2", "x3"), 2] #' SEs
We'll draw the slopes of the three input variables. Note: The interpretation of the following plot depends on input variables that have comparable scales. Note (continued): Comparing dissimilar variables with this visualization can be misleading!
Let's construct a plot that draws 1 and 2 SEs for each coefficient:
We'll start with a blank plot (like a blank canvas):
plot(NA, xlim = c(-3, 3), ylim = c(0, 4), xlab = "Slope", ylab = "", yaxt = "n")
# We can add a title:
title("Regression Results")
# We'll add a y-axis labelling our variables:
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
# We'll add a vertical line for zero:
abline(v = 0, col = "gray")
# Then we'll draw our slopes as points (`pch` tells us what type of point):
points(slopes, 1:3, pch = 23, col = "black", bg = "black")
# Then we'll add thick line segments for each 1 SE:
segments((slopes - ses)[1], 1, (slopes + ses)[1], 1, col = "black", lwd = 2)
segments((slopes - ses)[2], 2, (slopes + ses)[2], 2, col = "black", lwd = 2)
segments((slopes - ses)[3], 3, (slopes + ses)[3], 3, col = "black", lwd = 2)
# Then we'll add thin line segments for the 2 SEs:
segments((slopes - (2 * ses))[1], 1, (slopes + (2 * ses))[1], 1, col = "black",
lwd = 1)
segments((slopes - (2 * ses))[2], 2, (slopes + (2 * ses))[2], 2, col = "black",
lwd = 1)
segments((slopes - (2 * ses))[3], 3, (slopes + (2 * ses))[3], 3, col = "black",
lwd = 1)
We can draw a similar plot with confidence intervals instead of SEs.
plot(NA, xlim = c(-3, 3), ylim = c(0, 4), xlab = "Slope", ylab = "", yaxt = "n")
title("Regression Results")
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
abline(v = 0, col = "gray")
points(slopes, 1:3, pch = 23, col = "black", bg = "black")
# Then we'll add thick line segments for each 67% CI: Note: The `qnorm`
# function tells us how much to multiply our SEs by to get Gaussian CIs.
# Note: We'll also use vectorization here to save having to retype the
# `segments` command for each line:
segments((slopes - (qnorm(0.835) * ses)), 1:3, (slopes + (qnorm(0.835) * ses)),
1:3, col = "black", lwd = 3)
# Then we'll add medium line segments for the 95%:
segments((slopes - (qnorm(0.975) * ses)), 1:3, (slopes + (qnorm(0.975) * ses)),
1:3, col = "black", lwd = 2)
# Then we'll add thin line segments for the 99%:
segments((slopes - (qnorm(0.995) * ses)), 1:3, (slopes + (qnorm(0.995) * ses)),
1:3, col = "black", lwd = 1)
Both of these plots are similar, but show how the size, relative size, and significance of regression slopes can easily be summarized visually.
Note: We can also extract confidence intervals for our model terms directly using the confint
function applied to our model object and then plot those CIs using segments
:
ci67 <- confint(ols2, c("x1", "x2", "x3"), level = 0.67)
ci95 <- confint(ols2, c("x1", "x2", "x3"), level = 0.95)
ci99 <- confint(ols2, c("x1", "x2", "x3"), level = 0.99)
Now draw the plot:
plot(NA, xlim = c(-3, 3), ylim = c(0, 4), xlab = "Slope", ylab = "", yaxt = "n")
title("Regression Results")
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
abline(v = 0, col = "gray")
points(slopes, 1:3, pch = 23, col = "black", bg = "black")
# add the confidence intervals:
segments(ci67[, 1], 1:3, ci67[, 2], 1:3, col = "black", lwd = 3)
segments(ci95[, 1], 1:3, ci95[, 2], 1:3, col = "black", lwd = 2)
segments(ci99[, 1], 1:3, ci99[, 2], 1:3, col = "black", lwd = 1)
One of the major problems (noted above) with these kinds of plots is that in order for them to make visual sense, the underlying covariates have to be inherently comparable. By showing slopes, the plot shows the effect of a unit change in each covariate on the outcome, but unit changes may not be comparable across variables. We could probably come up with an infinite number of ways of presenting the results, but let's focus on two here: plotting standard deviation changes in covariates and plotting minimum to maximum changes in scale of covariates.
Let's recall the values of our coefficients on x1
, x2
, and x3
:
coef(summary(ols2))[, 1:2]
## Estimate Std. Error
## (Intercept) -0.06483 2.6053
## x1 1.22113 0.3607
## x2 -2.09407 0.1831
## x3 3.00856 0.1006
At face value, x3
has the largest effect, but what happens when we account for different standard deviations of the covariates:
sd(x1)
## [1] 5.311
sd(x2)
## [1] 10.48
sd(x3)
## [1] 19.07
x3
clearly also has the largest variance, so it may make more sense to compare a standard deviation change across the variables.
To do that is relatively simple because we're working in a linear model, so we simply need to calculate the standard deviation of each covariate and multiply that by the respective coefficient:
c1 <- coef(summary(ols2))[-1, 1:2] # drop the intercept
c2 <- numeric(length = 3)
c2[1] <- c1[1, 1] * sd(x1)
c2[2] <- c1[2, 1] * sd(x2)
c2[3] <- c1[3, 1] * sd(x3)
Then we'll get standard errors for those changes:
s2 <- numeric(length = 3)
s2[1] <- c1[1, 2] * sd(x1)
s2[2] <- c1[2, 2] * sd(x2)
s2[3] <- c1[3, 2] * sd(x3)
Then we can plot the results:
plot(c2, 1:3, pch = 23, col = "black", bg = "black", xlim = c(-25, 65), ylim = c(0,
4), xlab = "Slope", ylab = "", yaxt = "n")
title("Regression Results")
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
abline(v = 0, col = "gray")
# Then we'll add medium line segments for the 95%:
segments((c2 - (qnorm(0.975) * s2)), 1:3, (c2 + (qnorm(0.975) * s2)), 1:3, col = "black",
lwd = 2)
# Then we'll add thin line segments for the 99%:
segments((c2 - (qnorm(0.995) * s2)), 1:3, (c2 + (qnorm(0.995) * s2)), 1:3, col = "black",
lwd = 1)
By looking at standard deviation changes (focus on the scale of the x-axis), we can see that x3
actually has the largest effect by a much larger factor than we saw in the raw slopes. Moving the same relative amount up each covariate's distribution produces substantially different effects on the outcome.
Another way to visualize effect sizes is to examine the effect of full scale changes in covariates. This is especially useful when dealing with covariates that differ dramatically in scale (e.g., a mix of discrete and continuous variables).
The basic calculations for these kinds of plots are the same as in the previous plot, but instead of using sd
, we use diff(range())
, which tells us what a full scale change is in the units of each covariate:
c3 <- numeric(length = 3)
c3[1] <- c1[1, 1] * diff(range(x1))
c3[2] <- c1[2, 1] * diff(range(x2))
c3[3] <- c1[3, 1] * diff(range(x3))
Then we'll get standard errors for those changes:
s3 <- numeric(length = 3)
s3[1] <- c1[1, 2] * diff(range(x1))
s3[2] <- c1[2, 2] * diff(range(x2))
s3[3] <- c1[3, 2] * diff(range(x3))
Then we can plot the results:
plot(c3, 1:3, pch = 23, col = "black", bg = "black", xlim = c(-150, 300), ylim = c(0,
4), xlab = "Slope", ylab = "", yaxt = "n")
title("Regression Results")
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
abline(v = 0, col = "gray")
# Then we'll add medium line segments for the 95%:
segments((c3 - (qnorm(0.975) * s3)), 1:3, (c3 + (qnorm(0.975) * s3)), 1:3, col = "black",
lwd = 2)
# Then we'll add thin line segments for the 99%:
segments((c3 - (qnorm(0.995) * s3)), 1:3, (c3 + (qnorm(0.995) * s3)), 1:3, col = "black",
lwd = 1)
Focusing on the x-axes of the last three plots, we see how differences in scaling of the covariates can lead to vastly different visual interpretations of effect sizes. Plotting the slopes directly suggested that x3
had an effect about three times larger than the effect of x1
. Plotting standard deviation changes suggested that x3
had an effect about 10 times larger than the effect of x1
and plotting full scale changes in covariates showed a similar substantive conclusion. While each showed that x3
had the largest effect, interpreting the relative contribution of the different variables depends upon how much variance we would typically see in each variable in our data. The unit-change effect (represented by the slope) may not be the effect size that we ultimately care about for each covariate.
For working with character data, the relevant functions include as.character(), toupper(), tolower(), match(), pmatch(), grep(), grepl(), regexpr(), gregexpr(), regexec(), agrep(), regmatches(), sub(), and gsub(); see ? regex and [Regular Expressions (Wikipedia)](http://en.wikipedia.org/wiki/Regular_expression) for details on regular expression syntax.
We frequently need to save our data after we have worked on it for some time (e.g., because we've created, scaled, or deleted variables, created a subset of our original data, modified the data in a time- or processor-intensive way, or simply need to share a subset of the data). In most statistical packages, this is done automatically: those packages open a file and “destructively” make changes to the original file. This can be convenient, but it is also problematic. If I change a file and don't save the original, my work is no longer reproducible from the original file. It essentially builds a step into the scientific workflow that is not explicitly recorded.
R does things differently. When opening a data file in R, the data are read into memory and the link between those data in memory and the original file is severed. Changes made to the data are kept only in R and they are lost if R is closed without the data being saved. This is usually fine because good workflow involves writing scripts that work from the original data, make any necessary changes, and then produce output. But, for the reasons stated above, we might want to save our working data for use later on. R provides at least four ways to do this.
Note: All of these methods overwrite an existing file of the same name by default. This means that writing over an existing file is “destructive,” so it's a good idea to make sure that you're not overwriting a file by checking that your filename isn't already in use with list.files()
. By default, the file is written to your working directory (getwd()
) but can be written elsewhere if you supply a file path rather than name.
All of these methods work with an R dataframe, so we'll create a simple one just for the sake of demonstration:
set.seed(1)
mydf <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
save
The most flexible way to save data objects from R uses the save
function. By default, save
writes an R object (or multiple R objects) to an R-readable binary file that can be opened using load
. Because save
can store multiple objects (including one's entire current workspace), it provides a very flexible way to “pick up where you left off.” For example, using save.image('myworkspace.RData')
, you could save everything about your current R workspace, and then load('myworkspace.RData')
later and be exactly where you were before.
But it is also a convenient way to write data to a file that you plan to use again in R. Because it saves R objects “as-is,” there's no need to worry about problems reading in the data or needing to change structure or variable names because the file is saved (and will load) exactly as it looks in R. The dataframe will even have the same name (i.e., in our example, the loaded object will be called mydf
). The .RData file format is also very space-saving, thus taking up less room than a comparable comma-separated variable file containing the same data.
To write our dataframe using save
, we simply supply the name of the dataframe and the destination file:
save(mydf, file = "saveddf.RData")
Note that the file name is not important (so long as it does not overwrite another file in your working directory). If you load the file using load
, the R object mydf
will appear in your workspace.
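For example, in a later session (or after removing the object) we could restore it (a brief sketch):
# rm(mydf)               # e.g., if the object had been removed
load("saveddf.RData")    # the dataframe reappears under its original name
"mydf" %in% ls()
## [1] TRUE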
Let's remove the file just to not leave a mess:
unlink("saveddf.RData")
Sometimes we want to be able to write our data in a way that makes it exactly reproducible (like save
), but we also want to be able to read the file. Because save
creates a binary file, we can only open the file in R (or another piece of software that reads .RData files). If we want, for example, to be able to look at or change the file in a text editor, we need it in another format. One R-specific solution for this is dput
.
The dput
function saves data as an R expression. This means that the resulting file can actually be copied and pasted into the R console. This is especially helpful if you want to share (part of) your data with someone else. Indeed, when you ask data-related questions on StackOverflow, you are generally expected to supply your data using dput
to make it easy for people to help you.
We can also simply write the output of dput
to the console to see what it looks like. Let's try that before writing it to a file:
dput(mydf)
## structure(list(x = c(-0.626453810742332, 0.183643324222082, -0.835628612410047,
## 1.59528080213779, 0.329507771815361, -0.820468384118015, 0.487429052428485,
## 0.738324705129217, 0.575781351653492, -0.305388387156356, 1.51178116845085,
## 0.389843236411431, -0.621240580541804, -2.2146998871775, 1.12493091814311,
## -0.0449336090152309, -0.0161902630989461, 0.943836210685299,
## 0.821221195098089, 0.593901321217509, 0.918977371608218, 0.782136300731067,
## 0.0745649833651906, -1.98935169586337, 0.61982574789471, -0.0561287395290008,
## -0.155795506705329, -1.47075238389927, -0.47815005510862, 0.417941560199702,
## 1.35867955152904, -0.102787727342996, 0.387671611559369, -0.0538050405829051,
## -1.37705955682861, -0.41499456329968, -0.394289953710349, -0.0593133967111857,
## 1.10002537198388, 0.763175748457544, -0.164523596253587, -0.253361680136508,
## 0.696963375404737, 0.556663198673657, -0.68875569454952, -0.70749515696212,
## 0.36458196213683, 0.768532924515416, -0.112346212150228, 0.881107726454215,
## 0.398105880367068, -0.612026393250771, 0.341119691424425, -1.12936309608079,
## 1.43302370170104, 1.98039989850586, -0.367221476466509, -1.04413462631653,
## 0.569719627442413, -0.135054603880824, 2.40161776050478, -0.0392400027331692,
## 0.689739362450777, 0.0280021587806661, -0.743273208882405, 0.188792299514343,
## -1.80495862889104, 1.46555486156289, 0.153253338211898, 2.17261167036215,
## 0.475509528899663, -0.709946430921815, 0.610726353489055, -0.934097631644252,
## -1.2536334002391, 0.291446235517463, -0.443291873218433, 0.00110535163162413,
## 0.0743413241516641, -0.589520946188072, -0.568668732818502, -0.135178615123832,
## 1.1780869965732, -1.52356680042976, 0.593946187628422, 0.332950371213518,
## 1.06309983727636, -0.304183923634301, 0.370018809916288, 0.267098790772231,
## -0.54252003099165, 1.20786780598317, 1.16040261569495, 0.700213649514998,
## 1.58683345454085, 0.558486425565304, -1.27659220845804, -0.573265414236886,
## -1.22461261489836, -0.473400636439312), y = c(-0.620366677224124,
## 0.0421158731442352, -0.910921648552446, 0.158028772404075, -0.654584643918818,
## 1.76728726937265, 0.716707476017206, 0.910174229495227, 0.384185357826345,
## 1.68217608051942, -0.635736453948977, -0.461644730360566, 1.43228223854166,
## -0.650696353310367, -0.207380743601965, -0.392807929441984, -0.319992868548507,
## -0.279113302976559, 0.494188331267827, -0.177330482269606, -0.505957462114257,
## 1.34303882517041, -0.214579408546869, -0.179556530043387, -0.100190741213562,
## 0.712666307051405, -0.0735644041263263, -0.0376341714670479,
## -0.681660478755657, -0.324270272246319, 0.0601604404345152, -0.588894486259664,
## 0.531496192632572, -1.51839408178679, 0.306557860789766, -1.53644982353759,
## -0.300976126836611, -0.528279904445006, -0.652094780680999, -0.0568967778473925,
## -1.91435942568001, 1.17658331201856, -1.664972436212, -0.463530401472386,
## -1.11592010504285, -0.750819001193448, 2.08716654562835, 0.0173956196932517,
## -1.28630053043433, -1.64060553441858, 0.450187101272656, -0.018559832714638,
## -0.318068374543844, -0.929362147453702, -1.48746031014148, -1.07519229661568,
## 1.00002880371391, -0.621266694796823, -1.38442684738449, 1.86929062242358,
## 0.425100377372448, -0.238647100913033, 1.05848304870902, 0.886422651374936,
## -0.619243048231147, 2.20610246454047, -0.255027030141015, -1.42449465021281,
## -0.144399601954219, 0.207538339232345, 2.30797839905936, 0.105802367893711,
## 0.456998805423414, -0.077152935356531, -0.334000842366544, -0.0347260283112762,
## 0.787639605630162, 2.07524500865228, 1.02739243876377, 1.2079083983867,
## -1.23132342155804, 0.983895570053379, 0.219924803660651, -1.46725002909224,
## 0.521022742648139, -0.158754604716016, 1.4645873119698, -0.766081999604665,
## -0.430211753928547, -0.926109497377437, -0.17710396143654, 0.402011779486338,
## -0.731748173119606, 0.830373167981674, -1.20808278630446, -1.04798441280774,
## 1.44115770684428, -1.01584746530465, 0.411974712317515, -0.38107605110892
## ), z = c(0.409401839650934, 1.68887328620405, 1.58658843344197,
## -0.330907800682766, -2.28523553529247, 2.49766158983416, 0.667066166765493,
## 0.5413273359637, -0.0133995231459087, 0.510108422952926, -0.164375831769667,
## 0.420694643254513, -0.400246743977644, -1.37020787754746, 0.987838267454879,
## 1.51974502549955, -0.308740569225614, -1.25328975560769, 0.642241305677824,
## -0.0447091368939791, -1.73321840682484, 0.00213185968026965,
## -0.630300333928146, -0.340968579860405, -1.15657236263585, 1.80314190791747,
## -0.331132036391221, -1.60551341225308, 0.197193438739481, 0.263175646405474,
## -0.985826700409291, -2.88892067167955, -0.640481702565115, 0.570507635920485,
## -0.05972327604261, -0.0981787440052344, 0.560820728620116, -1.18645863857947,
## 1.09677704427424, -0.00534402827816569, 0.707310667398079, 1.03410773473746,
## 0.223480414915304, -0.878707612866019, 1.16296455596733, -2.00016494478548,
## -0.544790740001725, -0.255670709156989, -0.166121036765006, 1.02046390878411,
## 0.136221893102778, 0.407167603423836, -0.0696548130129049, -0.247664341619331,
## 0.69555080661964, 1.1462283572158, -2.40309621489187, 0.572739555245841,
## 0.374724406778655, -0.425267721556076, 0.951012807576816, -0.389237181718379,
## -0.284330661799574, 0.857409778079803, 1.7196272991206, 0.270054900937229,
## -0.42218400978764, -1.18911329485959, -0.33103297887901, -0.939829326510021,
## -0.258932583118785, 0.394379168221572, -0.851857092023863, 2.64916688109488,
## 0.156011675665079, 1.13020726745494, -2.28912397984011, 0.741001157195439,
## -1.31624516045156, 0.919803677609141, 0.398130155451956, -0.407528579269772,
## 1.32425863017727, -0.70123166924692, -0.580614304240536, -1.00107218102542,
## -0.668178606753393, 0.945184953373082, 0.433702149545162, 1.00515921767704,
## -0.390118664053679, 0.376370291774648, 0.244164924486494, -1.42625734238254,
## 1.77842928747545, 0.134447660933676, 0.765598999157864, 0.955136676908982,
## -0.0505657014422701, -0.305815419766971)), .Names = c("x", "y",
## "z"), row.names = c(NA, -100L), class = "data.frame")
As you can see, the output is a complicated R expression (using the structure
function), which includes all of the data values, the variable names, row names, and the class of the object. If you were to copy and paste this output into a new R session, you would have the exact same dataframe as the one we created here.
We can write this to a file (with any extension) by specifying a file
argument:
dput(mydf, "saveddf.txt")
I would tend to use the .txt (text file) extension, so that it will be easily openable in any text editor, but you can use any extension.
Note: Unlike save
and load
, which store an R object and then restore it using the save name, dput
does not store the name of the R object. So, if we want to load the dataframe again (using dget
), we need to store the dataframe as a variable:
mydf2 <- dget("saveddf.txt")
Additionally, and again unlike save
, dput only stores values up to a finite level of precision. So while our original mydf
and the read-back-in dataframe mydf2
look very similar, they differ slightly due to the rules of floating-point precision (a basic element of computer programming that is not important to understand in detail):
head(mydf)
## x y z
## 1 -0.6265 -0.62037 0.4094
## 2 0.1836 0.04212 1.6889
## 3 -0.8356 -0.91092 1.5866
## 4 1.5953 0.15803 -0.3309
## 5 0.3295 -0.65458 -2.2852
## 6 -0.8205 1.76729 2.4977
head(mydf2)
## x y z
## 1 -0.6265 -0.62037 0.4094
## 2 0.1836 0.04212 1.6889
## 3 -0.8356 -0.91092 1.5866
## 4 1.5953 0.15803 -0.3309
## 5 0.3295 -0.65458 -2.2852
## 6 -0.8205 1.76729 2.4977
mydf == mydf2
## x y z
## [1,] FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE
## [7,] FALSE FALSE TRUE
## [8,] FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE
## [10,] TRUE FALSE FALSE
## [11,] FALSE TRUE FALSE
## [12,] FALSE TRUE FALSE
## [13,] FALSE FALSE FALSE
## [14,] TRUE FALSE FALSE
## [15,] FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE
## [19,] FALSE TRUE FALSE
## [20,] FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE
## [23,] TRUE FALSE FALSE
## [24,] FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE
## [43,] FALSE FALSE FALSE
## [44,] FALSE FALSE FALSE
## [45,] FALSE FALSE FALSE
## [46,] FALSE FALSE FALSE
## [47,] FALSE FALSE FALSE
## [48,] FALSE FALSE FALSE
## [49,] FALSE FALSE FALSE
## [50,] FALSE FALSE FALSE
## [51,] FALSE FALSE FALSE
## [52,] FALSE FALSE FALSE
## [53,] FALSE FALSE FALSE
## [54,] FALSE FALSE FALSE
## [55,] FALSE FALSE FALSE
## [56,] TRUE FALSE FALSE
## [57,] FALSE FALSE FALSE
## [58,] FALSE FALSE FALSE
## [59,] FALSE FALSE FALSE
## [60,] FALSE FALSE FALSE
## [61,] FALSE FALSE FALSE
## [62,] FALSE FALSE FALSE
## [63,] FALSE TRUE FALSE
## [64,] FALSE FALSE FALSE
## [65,] FALSE FALSE FALSE
## [66,] FALSE FALSE FALSE
## [67,] FALSE FALSE FALSE
## [68,] FALSE FALSE FALSE
## [69,] FALSE FALSE FALSE
## [70,] FALSE FALSE FALSE
## [71,] FALSE FALSE FALSE
## [72,] FALSE FALSE FALSE
## [73,] FALSE FALSE FALSE
## [74,] FALSE FALSE TRUE
## [75,] FALSE FALSE FALSE
## [76,] FALSE FALSE FALSE
## [77,] TRUE FALSE FALSE
## [78,] FALSE FALSE FALSE
## [79,] FALSE FALSE FALSE
## [80,] FALSE FALSE FALSE
## [81,] FALSE FALSE FALSE
## [82,] FALSE FALSE FALSE
## [83,] FALSE FALSE FALSE
## [84,] FALSE FALSE FALSE
## [85,] FALSE FALSE FALSE
## [86,] FALSE TRUE FALSE
## [87,] FALSE FALSE FALSE
## [88,] FALSE FALSE FALSE
## [89,] FALSE FALSE FALSE
## [90,] FALSE FALSE FALSE
## [91,] FALSE FALSE FALSE
## [92,] FALSE FALSE FALSE
## [93,] FALSE FALSE FALSE
## [94,] FALSE FALSE FALSE
## [95,] FALSE FALSE FALSE
## [96,] FALSE FALSE FALSE
## [97,] FALSE FALSE FALSE
## [98,] FALSE FALSE FALSE
## [99,] FALSE FALSE FALSE
## [100,] FALSE FALSE TRUE
Thus, a dataframe saved using save
is exactly the same when reloaded into R whereas the one saved using dput
is the same up to a lesser degree of precision.
Let's clean up that file so as not to leave a mess:
unlink("saveddf.txt")
Similar to dput
, the dump
function writes the dput
output to a file. Indeed, it writes the exact same representation we saw above on the console. But, instead of writing an R expression that we have to save to a variable name later, dump
preserves the name of our dataframe. Thus it is a blend between dput
and save
(but mostly it is like dput
).
dump
also uses a default filename: "dumpdata.R"
, making it a shorter command to write and one that is less likely to be destructive (except to previous data dumps). Let's see how it works:
dump("mydf")
Note: We specify the dataframe name as a character string because this is written to the file so that when we load the "dumpdata.R"
file, the dataframe has the same name as it does right now.
We can load this dataframe into memory from the file using source
:
source("dumpdata.R", echo = TRUE)
##
## > mydf <-
## + structure(list(x = c(-0.626453810742332, 0.183643324222082, -0.835628612410047,
## + 1.59528080213779, 0.329507771815361, -0.820468384118015 .... [TRUNCATED]
As you'll see in the (truncated) output of source
, the file looks just like dput
but includes mydf <-
at the beginning, meaning it is storing the dput
-like output into the mydf
object in R memory.
Note: dump
can also take arbitrary file names to its file
argument (like save
and dput
).
Let's clean up that file so as not to leave a mess:
unlink("dumpdata.R")
One of the easiest ways to save an R dataframe is to write it to a comma-separated value (CSV) file. CSV files are human-readable (e.g., in a text editor) and can be opened by essentially any statistical software (Excel, Stata, SPSS, SAS, etc.) making them one of the best formats for data sharing.
To save a dataframe as CSV is easy. You simply need to use the write.csv
function with the name of the dataframe and the name of the file you want to write to. Let's see how it works:
write.csv(mydf, file = "saveddf.csv")
That's all there is to it.
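To read the file back into R we would use read.csv. One thing to watch for: write.csv stores row names by default, so the file gains an extra first column (typically read back in as a column named X). A brief sketch (mydf_csv is just an illustrative name):
mydf_csv <- read.csv("saveddf.csv")
str(mydf_csv)   # note the extra row-name column
# writing with row.names = FALSE avoids the extra column:
write.csv(mydf, file = "saveddf.csv", row.names = FALSE)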
R also allows you to save files in other CSV-like formats. For example, sometimes we want to save data using a different separator such as a tab (i.e., to create a tab-separated value file or TSV). The TSV is, for example, the default file format used by The Dataverse Network online data repository. To write to a TSV we use a related function write.table
and specify the sep
argument:
write.table(mydf, file = "saveddf.tsv", sep = "\t")
Note: We use the \t
symbol to represent a tab (a standard common to many programming languages).
We could also specify any character as a separator, such as |, ;, or ., but commas and tabs are the most common.
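For instance, a pipe-delimited file is just another write.table call (the file name here is only an example):
write.table(mydf, file = "saveddf_pipe.txt", sep = "|")
unlink("saveddf_pipe.txt")  # clean up straight away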
Note: Just like dput
, writing to a CSV or another delimited-format file necessarily includes some loss of precision, which may or may not be problematic for your particular use case.
Let's clean up our files just so we don't leave a mess:
unlink("savedf.csv")
unlink("savedf.tsv")
The foreign package, which we can use to load “foreign” file formats, also includes a write.foreign
function that can be used to write an R dataframe to a foreign, proprietary data format. Supported formats include SPSS, Stata, and SAS.
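For example, write.foreign writes a plain-text data file plus a syntax (code) file that the target package runs to read the data; foreign also provides write.dta for writing Stata files directly. A sketch, with illustrative file names, assuming the foreign package is installed:
library(foreign)
# a text data file plus an SPSS syntax file that reads it
write.foreign(mydf, datafile = "mydf.txt", codefile = "mydf.sps", package = "SPSS")
# a Stata .dta file, written directly
write.dta(mydf, file = "mydf.dta")
# clean up
unlink(c("mydf.txt", "mydf.sps", "mydf.dta"))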
One of the most common analytic tasks is creating variables. For example, we have some variable that we need to use in the analysis, but we want it to have a mean of zero or be confined to [0,1]. Alternatively, we might have a large number of indicators that we need to aggregate into a single variable.
When we used R as a calculator, we learned that R is “vectorized”. This means that when we call a function like add (+
), it adds each respective element of two vectors together. For example:
(1:3) + (10:12)
## [1] 11 13 15
This returns a three-element vector that adds each corresponding element of the two vectors together. We should also remember R's tendency to use “recycling”:
(1:3) + 10
## [1] 11 12 13
Here, the second vector only has one element, so R assumes that you want to add 10 to each element of the first vector (as opposed to adding 10 to the first element and nothing to the second and third elements). This is really helpful for preparing data vectors because it means we can use mathematical operators (addition, subtraction, multiplication, division, powers, logs, etc.) for their intuitive purposes when trying to create new variables rather than having to rely on obscure function names. But R also has a number of other functions for building variables.
Let's examine all of these features using some made-up data. In this case, we'll create a dataframe of indicator variables (coded 0 and 1) and build them into various scales.
set.seed(1)
n <- 30
mydf <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.1), x3 = rbinom(n,
1, 0.5), x4 = rbinom(n, 1, 0.8), x5 = 1, x6 = sample(c(0, 1, NA), n, TRUE))
Let's use str
and summary
to get a quick sense of the data:
str(mydf)
## 'data.frame': 30 obs. of 6 variables:
## $ x1: int 0 0 1 1 0 1 1 1 1 0 ...
## $ x2: int 0 0 0 0 0 0 0 0 0 0 ...
## $ x3: int 1 0 0 0 1 0 0 1 0 1 ...
## $ x4: int 1 1 1 0 1 1 1 1 0 1 ...
## $ x5: num 1 1 1 1 1 1 1 1 1 1 ...
## $ x6: num NA 1 1 0 NA 1 1 0 0 1 ...
summary(mydf)
## x1 x2 x3 x4 x5
## Min. :0.000 Min. :0 Min. :0.0 Min. :0.000 Min. :1
## 1st Qu.:0.000 1st Qu.:0 1st Qu.:0.0 1st Qu.:1.000 1st Qu.:1
## Median :0.000 Median :0 Median :0.0 Median :1.000 Median :1
## Mean :0.467 Mean :0 Mean :0.4 Mean :0.833 Mean :1
## 3rd Qu.:1.000 3rd Qu.:0 3rd Qu.:1.0 3rd Qu.:1.000 3rd Qu.:1
## Max. :1.000 Max. :0 Max. :1.0 Max. :1.000 Max. :1
##
## x6
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :0.591
## 3rd Qu.:1.000
## Max. :1.000
## NA's :8
All variables are coded 0 or 1, x5
is all 1's, and x6
contains some missing data (NA
) values.
The easiest scales are those that add or subtract variables. Let's try that quickly:
mydf$x1 + mydf$x2
## [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0
mydf$x1 + mydf$x2 + mydf$x3
## [1] 1 0 1 1 1 1 1 2 1 1 0 1 1 0 1 1 2 1 1 2 1 1 1 0 1 0 1 0 1 0
mydf$x1 + mydf$x2 - mydf$x3
## [1] -1 0 1 1 -1 1 1 0 1 -1 0 -1 1 0 1 -1 0 1 -1 0 1 -1 1
## [24] 0 -1 0 -1 0 1 0
One way to save some typing is to use the with
command, which simply tells R which dataframe to look in for variables:
with(mydf, x1 + x2 - x3)
## [1] -1 0 1 1 -1 1 1 0 1 -1 0 -1 1 0 1 -1 0 1 -1 0 1 -1 1
## [24] 0 -1 0 -1 0 1 0
A faster way to take a rowsum is to use rowSums
:
rowSums(mydf)
## [1] NA 3 4 2 NA 4 4 4 2 4 3 3 3 2 NA 4 5 4 NA 5 NA 4 3
## [24] 2 NA 3 3 NA 3 NA
Because we have missing data, any row that has an NA results in a sum of NA
. We could either skip that column:
rowSums(mydf[, 1:5])
## [1] 3 2 3 2 3 3 3 4 2 3 2 3 3 1 3 3 4 3 2 4 2 3 3 2 3 2 3 2 3 2
or use the na.rm=TRUE
argument to skip NA
values when calculating the sum:
rowSums(mydf, na.rm = TRUE)
## [1] 3 3 4 2 3 4 4 4 2 4 3 3 3 2 3 4 5 4 2 5 2 4 3 2 3 3 3 2 3 2
or we could look at a reduced dataset, eliminating all rows from the result that have a missing value:
rowSums(na.omit(mydf))
## 2 3 4 6 7 8 9 10 11 12 13 14 16 17 18 20 22 23 24 26 27 29
## 3 4 2 4 4 4 2 4 3 3 3 2 4 5 4 5 4 3 2 3 3 3
but this last option can create problems if we try to store the result back into our original data (since it has fewer elements than the original dataframe has rows).
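If we do want to store a row-wise scale back into the data, computing it with na.rm = TRUE keeps one value per row so the lengths match (a small sketch on a copy, so mydf itself is left unchanged; mycopy is a hypothetical name):
mycopy <- mydf
mycopy$rowsum <- rowSums(mycopy, na.rm = TRUE)  # same length as nrow(mycopy)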
We can also multiply (or divide) across variables. For these indicator variables, that applies an AND logic to tell us if all of the variables are 1:
with(mydf, x3 * x4 * x5)
## [1] 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0
We might also want to take an average value across all the columns, which we could do by hand:
with(mydf, x1 + x2 + x3 + x4 + x5 + x6)/6
## [1] NA 0.5000 0.6667 0.3333 NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333 NA 0.6667 0.8333 0.6667 NA 0.8333
## [21] NA 0.6667 0.5000 0.3333 NA 0.5000 0.5000 NA 0.5000 NA
or use the rowSums
function from earlier:
rowSums(mydf)/6
## [1] NA 0.5000 0.6667 0.3333 NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333 NA 0.6667 0.8333 0.6667 NA 0.8333
## [21] NA 0.6667 0.5000 0.3333 NA 0.5000 0.5000 NA 0.5000 NA
or use the even simpler rowMeans
function:
rowMeans(mydf)
## [1] NA 0.5000 0.6667 0.3333 NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333 NA 0.6667 0.8333 0.6667 NA 0.8333
## [21] NA 0.6667 0.5000 0.3333 NA 0.5000 0.5000 NA 0.5000 NA
If we want to calculate some other kind of function, like the variance, we can use the apply
function:
apply(mydf, 1, var) # the `1` refers to rows
## [1] NA 0.3000 0.2667 0.2667 NA 0.2667 0.2667 0.2667 0.2667 0.2667
## [11] 0.3000 0.3000 0.3000 0.2667 NA 0.2667 0.1667 0.2667 NA 0.1667
## [21] NA 0.2667 0.3000 0.2667 NA 0.3000 0.3000 NA 0.3000 NA
We can also make calculations for columns (though this is less common in rectangular data unless we're trying to create summary statistics). The column-wise analogues of rowSums
and rowMeans
are colSums
and colMeans
; similarly, apply
operates on columns when its second argument is 2:
apply(mydf, 2, var) # the `2` refers to columns
## x1 x2 x3 x4 x5 x6
## 0.2575 0.0000 0.2483 0.1437 0.0000 NA
sapply(mydf, var) # another way to apply a function to columns
## x1 x2 x3 x4 x5 x6
## 0.2575 0.0000 0.2483 0.1437 0.0000 NA
Sometimes we need to build a scale with a different formula for subsets of a dataset. For example, we want to calculate a scale in one way for men and a different way for women (or something like that). We can use indexing to achieve this. We can start by creating an empty variable with the right number of elements (i.e., the number of rows in our dataframe):
newvar <- numeric(nrow(mydf))
Then we can store values into this conditional on a variable from our dataframe:
newvar[mydf$x1 == 1] <- with(mydf[mydf$x1 == 1, ], x2 + x3)
newvar[mydf$x1 == 0] <- with(mydf[mydf$x1 == 0, ], x3 + x4 + x5)
The key to making that work is using the same index on the new variable as on the original data. Doing otherwise would produce a warning about mismatched lengths:
newvar[mydf$x1 == 1] <- with(mydf, x2 + x3)
## Warning: number of items to replace is not a multiple of replacement
## length
Scatterplots are one of the best ways to understand a bivariate relationship. They neatly show the form of the relationship between x
and y
. But they are really only effective when both variables are continuous. When one of the variables is discrete, boxplots, conditional density plots, and other visualization techniques often do a better job communicating relationships.
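As a brief aside (a hedged sketch with made-up variables grpx and grpy that are not used elsewhere), a boxplot of a continuous outcome at each level of a discrete predictor might be drawn like this:
grpx <- rep(1:3, each = 20)  # discrete predictor with three levels
grpy <- rnorm(60, mean = grpx)  # continuous outcome that shifts with the group
boxplot(grpy ~ grpx)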
But sometimes we have discrete data that is almost continuous (e.g., years of formal education). These kinds of variables might be nearly continuous and have approximately linear relationships with other variables. Summarizing a continuous outcome (e.g., income) with a boxplot at every level of education can be pretty tedious and produces a graph that is difficult to read. In these situations, we might want to rely on a scatterplot, but we need to preprocess the data in order to visualize it clearly.
Let's start with some example data (where the predictor variable is discrete and the outcome is continuous), look at the problems with plotting these kinds of data using R's defaults, and then look at the jitter
function to draw a better scatterplot.
set.seed(1)
x <- sample(1:10, 200, TRUE)
y <- 3 * x + rnorm(200, 0, 5)
Here's what a standard scatterplot of these data looks like:
plot(y ~ x, pch = 15)
Because the independent variable is only observed at a few levels, it can be difficult to get a sense of the “cloud” of points. We can use jitter
to add a little random noise to the data in order to see the cloud more clearly:
plot(y ~ jitter(x, 1), pch = 15)
We can add even more random noise to see an even more “cloud”-like representation:
plot(y ~ jitter(x, 2), pch = 15)
If both our independent and dependent variables are discrete, the value of jitter
is even greater. Let's look at some data like this:
x2 <- sample(1:10, 500, TRUE)
y2 <- sample(1:5, 500, TRUE)
plot(y2 ~ x2, pch = 15)
Here the data simply look like a grid of points. It is impossible to infer the density of the data anywhere in the plot. jitter
will be quite useful.
Let's start by applying jitter
just to the x2
variable (as we did above):
plot(y2 ~ jitter(x2), pch = 15)
Here we start to see the data a little more clearly. Let's try it just on the outcome:
plot(jitter(y2) ~ x2, pch = 15)
That's a similar level of improvement, but let's use jitter
on both the outcome and predictor to get a much more cloud-like effect:
plot(jitter(y2) ~ jitter(x2), pch = 15)
Adding even more noise will make an even fuller cloud:
plot(jitter(y2, 2) ~ jitter(x2, 2), pch = 15)
We now clearly see that our data are evenly dense across the entire matrix. Of course, adding this kind of noise probably isn't appropriate for analyzing data, but we could, for example, fit a regression model to the original data and then use the jittered values only when plotting, in order to more clearly convey the underlying descriptive relationship.
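For example (a sketch; fit is a hypothetical object name), we might estimate the regression on the original values and jitter only for display:
fit <- lm(y ~ x)  # the fit uses the raw data
plot(jitter(x, 1), y, pch = 15, col = "gray")  # jitter only in the plot
abline(fit, lwd = 2)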
When we want to compare the distributions of two variables in a scatterplot, sometimes it is hard to see the marginal distributions.
To observe the marginal distributions more clearly, we can add “rugs” using the rug
function.
A rug is a one-dimensional density plot drawn on the axis of a plot.
Let's start with some data for two groups.
set.seed(1)
x1 <- rnorm(1000)
x2 <- rbinom(1000, 1, 0.7)
y <- x1 + 5 * x2 + 3 * (x1 * x2) + rnorm(1000, 0, 3)
We can plot the scatterplot for each group separately in red and blue.
We can then add some marginal “rugs” to each side. We could do this for all the data or separately for each group.
To do it separately for each group, we need to specify the line
parameter so that the rugs don't overwrite each other.
plot(x1[x2 == 1], y[x2 == 1], col = "tomato3", xaxt = "n", yaxt = "n", xlab = "",
ylab = "", bty = "n")
points(y[x2 == 0] ~ x1[x2 == 0], col = "royalblue3")
# x-axis rugs for each group
rug(x1[x2 == 1], side = 1, line = 0, col = "tomato1", tck = 0.01)
rug(x1[x2 == 0], side = 1, line = 0.5, col = "royalblue1", tck = 0.01)
# y-axis rugs for each group
rug(y[x2 == 1], side = 2, line = 0, col = "tomato1", tck = 0.01)
rug(y[x2 == 0], side = 2, line = 0.5, col = "royalblue1", tck = 0.01)
# Note: The `tck` parameter specifies how tall the rug is. A shorter rug
# uses less ink to communicate the same information.
axis(1, line = 1)
axis(2, line = 1)
The last two lines add some axes a little farther out than they normally would be on the plots.
We might also want to add some more descriptive information to the plot, for example the marginal mean for each group drawn as a small black tick on the rugs:
plot(x1[x2 == 1], y[x2 == 1], col = "tomato3", xaxt = "n", yaxt = "n", xlab = "",
ylab = "", bty = "n")
points(y[x2 == 0] ~ x1[x2 == 0], col = "royalblue3")
rug(x1[x2 == 1], side = 1, line = 0, col = "tomato1", tck = 0.01)
rug(x1[x2 == 0], side = 1, line = 0.5, col = "royalblue1", tck = 0.01)
rug(y[x2 == 1], side = 2, line = 0, col = "tomato1", tck = 0.01)
rug(y[x2 == 0], side = 2, line = 0.5, col = "royalblue1", tck = 0.01)
axis(1, line = 1)
axis(2, line = 1)
# means(on x-axis rugs)
Axis(at = mean(x1[x2 == 1]), side = 1, line = 0, labels = "", col = "black",
lwd.ticks = 3, tck = 0.01)
Axis(at = mean(x1[x2 == 0]), side = 1, line = 0.5, labels = "", col = "black",
lwd.ticks = 3, tck = 0.01)
# means(on y-axis rugs)
Axis(at = mean(y[x2 == 1]), side = 2, line = 0, labels = "", col = "black",
lwd.ticks = 3, tck = 0.01)
Axis(at = mean(y[x2 == 0]), side = 2, line = 0.5, labels = "", col = "black",
lwd.ticks = 3, tck = 0.01)
As should be clear, the means of x1
are similar in both groups, but the means of y
in each group differ considerably.
By combining the scatterplot with the rug, we are able to communicate considerable information with little ink.
Sometimes people standardize regression coefficients in order to make them comparable. Gary King thinks this produces apples-to-oranges comparisons. He's right: it is only in rare contexts that standardized coefficients are helpful.
Let's start with some data:
set.seed(1)
n <- 1000
x1 <- rnorm(n, -1, 10)
x2 <- rnorm(n, 3, 2)
y <- 5 * x1 + x2 + rnorm(n, 1, 2)
Then we can build and summarize a standard linear regression model.
model1 <- lm(y ~ x1 + x2)
The summary shows us unstandardized coefficients that we typically deal with:
summary(model1)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.230 -1.313 -0.045 1.363 5.626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9762 0.1139 8.57 <2e-16 ***
## x1 5.0098 0.0063 795.00 <2e-16 ***
## x2 1.0220 0.0314 32.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.06 on 997 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.998
## F-statistic: 3.17e+05 on 2 and 997 DF, p-value: <2e-16
We might want standardized coefficients in order to make comparisons across the two input variables, which have different means and variances. To do this, we multiply the coefficients by the standard deviation of the input over the standard deviation of the output.
b <- summary(model1)$coef[2:3, 1]
sy <- apply(model1$model[1], 2, sd)
sx <- apply(model1$model[2:3], 2, sd)
betas <- b * (sx/sy)
The result is a pair of coefficients for x1
and x2
that we can interpret in the form:
“the change in y (in standard deviations) for every standard deviation change in x”
betas
## x1 x2
## 0.99811 0.04092
We can obtain the same results by standardizing our variables to begin with:
yt <- (y - mean(y))/sd(y)
x1t <- (x1 - mean(x1))/sd(x1)
x2t <- (x2 - mean(x2))/sd(x2)
model2 <- lm(yt ~ x1t + x2t)
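An equivalent shortcut (a side note rather than part of the comparison below; model2b is a hypothetical name) is to standardize inside the formula with the built-in scale() function, which centers and divides by the standard deviation by default; its slope coefficients should match those of model2:
model2b <- lm(scale(y) ~ scale(x1) + scale(x2))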
If we compare the results of the original model to the results from our manual calculation or our pre-standardized model, we see that the latter two sets of coefficients are identical, but different from the first.
rbind(model1$coef, model2$coef, c(NA, betas))
## (Intercept) x1 x2
## [1,] 9.762e-01 5.0098 1.02202
## [2,] 2.864e-17 0.9981 0.04092
## [3,] NA 0.9981 0.04092
We can see how these produce the same inference by examining the change in y
predicted by a one-SD change in x1
from model1
:
sd(x1) * model1$coef["x1"]
## x1
## 51.85
Dividing that value by the standard deviation of y
, we obtain our standardized regression coefficient:
sd(x1) * model1$coef["x1"]/sd(y)
## x1
## 0.9981
And the same is true for x2
:
sd(x2) * model1$coef["x2"]/sd(y)
## x2
## 0.04092
Thus, we obtain the same substantive inference from standardized coefficients. Using them is a matter of what produces the most intuitive story from the data.
We often want to tabulate data (e.g., categorical data).
R supplies tabulation functionality with the table
function:
set.seed(1)
a <- sample(1:5, 25, TRUE)
a
## [1] 2 2 3 5 2 5 5 4 4 1 2 1 4 2 4 3 4 5 2 4 5 2 4 1 2
table(a)
## a
## 1 2 3 4 5
## 3 8 2 7 5
The result is a table, showing the names of each possible value and a frequency count for each value. The output looks similar regardless of the class of the vector. Note: If the vector contains continuous data, the result may be unexpected:
table(rnorm(100))
##
## -2.22390027400994 -1.563782051071 -1.43758624082998
## 1 1 1
## -1.4250983947325 -1.28459935387219 -1.28074943178832
## 1 1 1
## -1.26361438497058 -1.23753842192996 -1.16657054708471
## 1 1 1
## -1.13038577760069 -1.0655905803883 -0.97228683550556
## 1 1 1
## -0.940649162618608 -0.912068366948338 -0.891921127284569
## 1 1 1
## -0.880871723252545 -0.873262111744435 -0.832043296117832
## 1 1 1
## -0.814968708869917 -0.797089525071965 -0.795339117255372
## 1 1 1
## -0.776776621764597 -0.69095383969683 -0.649471646796233
## 1 1 1
## -0.649010077708898 -0.615989907707918 -0.542888255010254
## 1 1 1
## -0.500696596002705 -0.452783972553158 -0.433310317456782
## 1 1 1
## -0.429513109491881 -0.424810283377287 -0.418980099421959
## 1 1 1
## -0.412519887482398 -0.411510832795067 -0.376702718583628
## 1 1 1
## -0.299215117897316 -0.289461573688223 -0.282173877322451
## 1 1 1
## -0.279346281854269 -0.275778029088027 -0.235706556439501
## 1 1 1
## -0.227328691424755 -0.224267885278309 -0.21951562675344
## 1 1 1
## -0.172623502645857 -0.119168762418038 -0.117753598165951
## 1 1 1
## -0.115825322156954 -0.0571067743838088 -0.0548774737115786
## 1 1 1
## -0.0110454784656636 0.00837095999603331 0.0191563916602738
## 1 1 1
## 0.0253828675878054 0.0465803028049967 0.046726172188352
## 1 1 1
## 0.0652881816716207 0.119717641289537 0.133336360814841
## 1 1 1
## 0.14377148075807 0.229019590694692 0.242263480859686
## 1 1 1
## 0.248412648872596 0.250141322854153 0.252223448156132
## 1 1 1
## 0.257338377155533 0.266137361672105 0.358728895971352
## 1 1 1
## 0.36594112304922 0.377395645981701 0.435683299355719
## 1 1 1
## 0.503607972233726 0.560746090888056 0.576718781896486
## 1 1 1
## 0.59625901661066 0.618243293566247 0.646674390495345
## 1 1 1
## 0.66413569989411 0.726750747385451 0.77214218580453
## 1 1 1
## 0.781859184600258 0.804189509744908 0.83204712857239
## 1 1 1
## 0.992160365445798 0.996543928544126 0.996986860909106
## 1 1 1
## 1.08576936214569 1.10096910219409 1.1519117540872
## 1 1 1
## 1.15653699715018 1.23830410085338 1.25408310644997
## 1 1 1
## 1.2560188173061 1.29931230256343 1.45598840106634
## 1 1 1
## 1.62544730346494 1.67829720781629 1.75790308981071
## 1 1 1
## 2.44136462889459
## 1
We also often want to obtain proportions or percentages (i.e., the share of observations falling into each category).
We can obtain this information by wrapping our table
function in a prop.table
function:
prop.table(table(a))
## a
## 1 2 3 4 5
## 0.12 0.32 0.08 0.28 0.20
The result is a “proportion” table, showing the proportion of observations in each category. If we want percentages, we can simply multiply the resulting table by 100:
prop.table(table(a)) * 100
## a
## 1 2 3 4 5
## 12 32 8 28 20
To get frequencies and proportions (or percentages) together, we can bind the two tables:
cbind(table(a), prop.table(table(a)))
## [,1] [,2]
## 1 3 0.12
## 2 8 0.32
## 3 2 0.08
## 4 7 0.28
## 5 5 0.20
rbind(table(a), prop.table(table(a)))
## 1 2 3 4 5
## [1,] 3.00 8.00 2.00 7.00 5.0
## [2,] 0.12 0.32 0.08 0.28 0.2
In addition to these basic (univariate) tabulation functions, we can also tabulate in two or more dimensions.
To obtain simple crosstabulations, we can still use table
:
b <- rep(c(1, 2), length = 25)
table(a, b)
## b
## a 1 2
## 1 0 3
## 2 5 3
## 3 1 1
## 4 5 2
## 5 2 3
The result is a crosstable with the first requested variable (a
) as rows and the second (b
) as columns.
With more than two variables, the table is harder to read:
c <- rep(c(3, 4, 5), length = 25)
table(a, b, c)
## , , c = 3
##
## b
## a 1 2
## 1 0 1
## 2 3 1
## 3 0 1
## 4 1 0
## 5 1 1
##
## , , c = 4
##
## b
## a 1 2
## 1 0 0
## 2 2 2
## 3 0 0
## 4 2 2
## 5 0 0
##
## , , c = 5
##
## b
## a 1 2
## 1 0 2
## 2 0 0
## 3 1 0
## 4 2 0
## 5 1 2
R supplies two additional functions that make reading these kinds of tables easier.
The ftable
function attempts to collapse the previous result into a more readable format:
ftable(a, b, c)
## c 3 4 5
## a b
## 1 1 0 0 0
## 2 1 0 2
## 2 1 3 2 0
## 2 1 2 0
## 3 1 0 0 1
## 2 1 0 0
## 4 1 1 2 2
## 2 0 2 0
## 5 1 1 0 1
## 2 1 0 2
The xtabs
function provides an alternative way of requesting tabulations.
This uses R's formula data structure (see 'formulas.r').
A righthand-only formula produces the same result as table
:
xtabs(~a + b)
## b
## a 1 2
## 1 0 3
## 2 5 3
## 3 1 1
## 4 5 2
## 5 2 3
xtabs(~a + b + c)
## , , c = 3
##
## b
## a 1 2
## 1 0 1
## 2 3 1
## 3 0 1
## 4 1 0
## 5 1 1
##
## , , c = 4
##
## b
## a 1 2
## 1 0 0
## 2 2 2
## 3 0 0
## 4 2 2
## 5 0 0
##
## , , c = 5
##
## b
## a 1 2
## 1 0 2
## 2 0 0
## 3 1 0
## 4 2 0
## 5 1 2
With a crosstable, we can also add table margins using addmargins
:
x <- table(a, b)
addmargins(x)
## b
## a 1 2 Sum
## 1 0 3 3
## 2 5 3 8
## 3 1 1 2
## 4 5 2 7
## 5 2 3 5
## Sum 13 12 25
As with a one-dimensional table, we can calculate proportions from a k-dimensional table:
prop.table(table(a, b))
## b
## a 1 2
## 1 0.00 0.12
## 2 0.20 0.12
## 3 0.04 0.04
## 4 0.20 0.08
## 5 0.08 0.12
The default result is a table with proportions of the entire table.
We can calculate row percentages with the margin
parameter set to 1:
prop.table(table(a, b), 1)
## b
## a 1 2
## 1 0.0000 1.0000
## 2 0.6250 0.3750
## 3 0.5000 0.5000
## 4 0.7143 0.2857
## 5 0.4000 0.6000
We can calculate column percentages with the margin
parameter set to 2:
prop.table(table(a, b), 2)
## b
## a 1 2
## 1 0.00000 0.25000
## 2 0.38462 0.25000
## 3 0.07692 0.08333
## 4 0.38462 0.16667
## 5 0.15385 0.25000
One of the many handy, and perhaps underappreciated, functions in R is curve
. It is a neat little function for mathematical plotting, i.e., for drawing the curve of a function directly. This tutorial shows some basic functionality.
The curve
function takes, as its first argument, an R expression. That expression should be a mathematical function in terms of x
. For example, if we wanted to plot the line y=x
, we would simply type:
curve((x))
Note: We have to type (x)
rather than just x
.
We can also specify an add
parameter to indicate whether to draw the curve on a new plotting device or add to a previous plot. For example, if we wanted to overlay the function y=x^2
on top of y=x
we could type:
curve((x))
curve(x^2, add = TRUE)
We aren't restricted to using curve
by itself either. We could plot some data and then use curve
to draw a y=x
line on top of it:
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
plot(y ~ x)
curve((x), add = TRUE)
And, like all other plotting functions, curve
accepts graphical parameters. So we could redraw our previous graph with gray points and a thick red curve
:
plot(y ~ x, col = "gray", pch = 15)
curve((x), add = TRUE, col = "red", lwd = 2)
We could also call these in the opposite order (replacing plot
with points
):
curve((x), col = "red", lwd = 2)
points(y ~ x, col = "gray", pch = 15)
Note: The plots are different because calling curve
without xlim
and ylim
means that R doesn't know we're going to add data outside the plotting region when we later call points
.
We can also use curve
(as we would lines
or points
) to draw points rather than a line:
curve(x^2, type = "p")
We can also specify to
and from
arguments to determine over what range the curve will be drawn. These are independent of xlim
and ylim
. So we could draw a curve over a small range on a much larger plotting region:
curve(x^3, from = -2, to = 2, xlim = c(-5, 5), ylim = c(-9, 9))
Because curve
accepts any R expression as its first argument (as long as that expression resolves to a mathematical function of x
), we can overlay all kinds of different curve
s:
curve((x), from = -2, to = 2, lwd = 2)
curve(0 * x, add = TRUE, col = "blue")
curve(0 * x + 1.5, add = TRUE, col = "green")
curve(x^3, add = TRUE, col = "red")
curve(-3 * (x + 2), add = TRUE, col = "orange")
These are some relatively basic examples, but they highlight the utility of curve
when we simply want to plot a function: it is much easier than generating data vectors that correspond to the function just for the purpose of plotting.
Working with objects in R will become tedious if we don't give those objects names to refer to them in subsequent analysis.
In R, we can “assign” an object a name that we can then reference subsequently.
For example, rather than see the result of the expression 2+2
, we can store the result of this expression and look at it later:
a <- 2 + 2
To see the value of the result, we simply call our variable's name:
a
## [1] 4
Thus the <-
(the less-than and minus symbols together) means “assign the right-hand side to the name on the left-hand side.”
We can get the same result using =
(an equal sign):
a = 2 + 2
a
## [1] 4
We can also, much more uncommonly, produce the same result by reversing the order of the statement and using a different symbol:
2 + 2 -> a
a
## [1] 4
This is very uncommon, though. The <-
is the preferred assignment operator.
When we assign an expression to a variable name, the result of the evaluated expression is saved.
Thus, when we call a
again later, we don't see 2+2
but instead see 4
.
We can overwrite the value stored in a variable by simply assigning something new to that variable:
a <- 2 + 2
a <- 3
a
## [1] 3
We can also copy a variable into a different name:
b <- a
b
## [1] 3
We may decide we don't need a variable any more and it is possible to remove that variable from the R environment using rm
:
rm(a)
Sometimes we forget what we've done and want to see what variables we have floating around in our R environment. We can see them with ls
:
ls()
## [1] "a1" "a2" "allout" "amat"
## [5] "b" "b1" "betas" "between"
## [9] "bin" "bmat" "bootcoefs" "c"
## [13] "c1" "c2" "c3" "change"
## [17] "ci67" "ci95" "ci99" "cmat"
## [21] "coef.mi" "coefs.amelia" "condmeans_x" "condmeans_x2"
## [25] "condmeans_y" "cumprobs" "d" "d1"
## [29] "d2" "d3" "d4" "d5"
## [33] "df1" "df2" "dist" "e"
## [37] "e1" "e2" "e3" "e4"
## [41] "e5" "englebert" "f" "fit1"
## [45] "fit2" "fit3" "FUN" "g"
## [49] "g1" "g2" "grandm" "grandse"
## [53] "grandvar" "h" "height" "i"
## [57] "imp" "imp.amelia" "imp.mi" "imp.mice"
## [61] "lm" "lm.amelia.out" "lm.mi.out" "lm.mice.out"
## [65] "lm1" "lm2" "lmfit" "lmp"
## [69] "localfit" "localp" "logodds" "logodds_lower"
## [73] "logodds_se" "logodds_upper" "m" "m1"
## [77] "m2" "m2a" "m2b" "m3a"
## [81] "m3b" "me" "me_se" "means"
## [85] "mmdemo" "model1" "model2" "myboot"
## [89] "mydf" "mydf2" "myformula" "myttest"
## [93] "myttest2" "myttest3" "n" "n1"
## [97] "n2" "n3" "new1" "newdata"
## [101] "newdata1" "newdata2" "newdf" "newvar"
## [105] "nx" "ologit" "ols" "ols1"
## [109] "ols2" "ols3" "ols3b" "ols4"
## [113] "ols5" "ols5a" "ols5b" "ols6"
## [117] "ols6a" "ols6b" "oprobit" "oprobprobs"
## [121] "out" "p" "p1" "p2"
## [125] "p2a" "p2b" "p3a" "p3b"
## [129] "p3b.fitted" "part1" "part2" "plogclass"
## [133] "plogprobs" "pool.mice" "ppcurve" "pred1"
## [137] "s" "s.amelia" "s.mi" "s.mice"
## [141] "s.orig" "s.real" "s1" "s2"
## [145] "s3" "search" "ses" "ses.amelia"
## [149] "sigma" "slope" "slopes" "sm1"
## [153] "sm2" "smydf" "sx" "sy"
## [157] "tmp1" "tmp2" "tmp3" "tmp4"
## [161] "tmpdata" "tmpdf" "tmpsplit" "tmpx"
## [165] "tmpz" "tr" "val" "valcol"
## [169] "w" "weight" "within" "x"
## [173] "X" "x1" "x1cut" "x1t"
## [177] "x2" "X2" "x2t" "x3"
## [181] "x4" "x5" "x6" "xseq"
## [185] "y" "y1" "y1s" "y2"
## [189] "y2s" "y3" "y3s" "y4"
## [193] "y5" "y6" "yt" "z"
## [197] "z1" "z2" "z5" "z6"
This returns a character vector containing all of the names for all named objects currently in our R environment. It is also possible to remove ALL variables in our current R session. You can do that with the following:
# rm(list=ls())
Note: This is usually an option on the RGui dropdown menus and should only be done if you really want to remove everything. Sometimes you can also see an expression like:
b <- NULL
This expression does not remove the object, but instead makes its value NULL. NULL is different from missing (NA) because R (generally) ignores a NULL value whenever it sees it. You can see this in the difference between the following two vectors:
c(1, 2, NULL)
## [1] 1 2
c(1, 2, NA)
## [1] 1 2 NA
The first has two elements and the second has three.
It is also possible to use the assign
function to assign a value to a name:
assign("x", 3)
x
## [1] 3
This is not common in interactive use of R but can be helpful at more advanced levels.
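For instance (a small sketch; var1 through var3 are hypothetical names), assign pairs naturally with get when object names are built programmatically:
for (i in 1:3) assign(paste0("var", i), i * 10)  # creates var1, var2, var3
get("var2")
## [1] 20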
R has some relatively simple rules governing how objects can be named:
(1) R object names are case sensitive, so a
is not the same as A
. This applies to objects and functions.
(2) R object names (generally) must start with a letter or a period.
(3) R object names can contain letters, numbers, periods (.
), and underscores (_
).
(4) The names of R objects can be just about any length, but anything over about 10 characters gets annoying to type.
CAUTION: We can violate some of these restrictions by naming things with backticks, but this can be confusing:
`f` <- 2
f
## [1] 2
f <- 3
`f`
## [1] 3
That makes sense: a backtick-quoted name refers to the same object as the bare name. Backticks also allow us to name variables that start with a number, but to call objects with these noncompliant names, we need to use the backticks:
`1f` <- 3
# Then try typing `1f` (with the backticks)
If we just called 1f, we would get an error. But this also means we can name objects with just a number as a name:
`4` <- 5
4
## [1] 4
# Then try typing `4` (with the backticks)
Which is kind of weird. It is best avoided.
An important aspect of working with R objects is knowing how to “index” them. Indexing means selecting a subset of the elements in order to use them in further analysis or possibly change them. Here we focus just on three kinds of vector indexing: positional, named reference, and logical. Any of these indexing techniques works the same for all classes of vectors.
If we start with a simple vector, we can extract each element from the vector by placing its position in brackets:
c("a", "b", "c")[1]
## [1] "a"
c("a", "b", "c")[2]
## [1] "b"
c("a", "b", "c")[3]
## [1] "c"
Indices in R start at 1 for the first item in the vector and continue up to the length of the vector. (Note: In some languages, indices start with the first item being indexed as 0.) This means that we can even index a one-element vector:
4[1]
## [1] 4
But, we will get a missing value if we try to index outside the length of a vector:
length(c(1:3))
## [1] 3
c(1:3)[9]
## [1] NA
Positional indices can also involve an R expression.
For example, you may want to extract the last element of a vector of unknown length.
To do that, you can embed the length
function in the []
brackets.
a <- 4:12
a[length(a)]
## [1] 12
Or, you can express any other R expression, for example to get the second-to-last element:
a[length(a) - 1]
## [1] 11
It is also possible to extract multiple elements from a vector, such as the first two elements:
a[1:2]
## [1] 4 5
You can use any vector of element positions:
a[c(1, 3, 5)]
## [1] 4 6 8
This means that you could also return the same element multiple times:
a[c(1, 1, 1, 2, 2, 1)]
## [1] 4 4 4 5 5 4
But note that positions outside of the length of the vector will be returned as missing values:
a[c(5, 25, 26)]
## [1] 8 NA NA
It is also possible to index a vector, less a vector of specified elements, using the -
symbol.
For example, to get all elements except the first, one could simply index with -1
:
a[-1]
## [1] 5 6 7 8 9 10 11 12
Or, to obtain all elements except the last element, we can combine -
with length
:
a[-length(a)]
## [1] 4 5 6 7 8 9 10 11
Or, to obtain all elements except the second and third:
a[-c(2, 3)]
## [1] 4 7 8 9 10 11 12
Note: While in general 2:3
is the same as c(2,3)
, be careful when negating a range in an index: -2:3 is the sequence from -2 to 3, so to drop the second and third elements you should write -c(2, 3) or -(2:3).
A second approach to indexing, one that is not particularly common for vectors, is named indexing. Vector elements can be assigned names, such that each element has a value but also a name attached to it:
b <- c(x = 1, y = 2, z = "4")
b
## x y z
## "1" "2" "4"
This is the same as creating the vector first and then assigning the names separately:
b <- c(1, 2, "4")
names(b) <- c("x", "y", "z")
b
## x y z
## "1" "2" "4"
In this type of vector we can still use positional indexing:
b[1]
## x
## "1"
But we can also index based on the names of the vector elements:
b["x"]
## x
## "1"
And, just as with positional indexing, we can extract multiple elements at once:
b[c("x", "z")]
## x z
## "1" "4"
But, it's not possible to use the -
indexing that we used with element positions.
For example, b[-'x']
would return an error.
If a vector has names, this provides a way to extract elements without knowing their relative position in the order of vector elements.
If you want to know which name is in which position, we can also get just the names of the vector elements:
names(b)
## [1] "x" "y" "z"
And we can use positional indexing on the names(b)
vector, e.g. to get the first element's name:
names(b)[1]
## [1] "x"
The final way to index a vector involves logicals. Positional indexing allowed us to use any R expression to extract one or more elements. Logical indexing allows us to extract elements that meet specified criteria, as specified by an R logical expression. Thus, with a given vector, we could, for example, extract elements that are equal to a particular value:
c <- 10:3
c[c == 5]
## [1] 5
This works by first constructing a logical vector and then using that to return elements where the logical is TRUE:
c == 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
c[c == 5]
## [1] 5
We can use an exclamation point (!
) to negate the logical and thus return an opposite set of vector elements.
This is similar to the -
indexing from positional indexing:
!c == 5
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
c[!c == 5]
## [1] 10 9 8 7 6 4 3
We do not need to restrict ourselves to logical equivalences. We can also use other comparators:
c[c > 5]
## [1] 10 9 8 7 6
c[c <= 7]
## [1] 7 6 5 4 3
We can also use boolean operators (i.e., AND &
, OR |
) to combine multiple criteria:
c < 9 & c > 4
## [1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
c[c < 9 & c > 4]
## [1] 8 7 6 5
c > 8 | c == 3
## [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
c[c > 8 | c == 3]
## [1] 10 9 3
Here we can see how different logical criteria translate into a logical vector that is then used to index our target vector. Some potentially unexpected behavior can happen if we try to index with a logical vector of a different length than our target vector:
c[TRUE] #' returns all elements
## [1] 10 9 8 7 6 5 4 3
c[c(TRUE, TRUE)] #' returns all elements
## [1] 10 9 8 7 6 5 4 3
c[FALSE] #' returns an empty vector
## integer(0)
Just as with positional indexing, if the logical vector is longer than our target vector, missing values will be appended to the end:
d <- 1:3
d[c(TRUE, TRUE, TRUE, TRUE)]
## [1] 1 2 3 NA
Because 0 and 1 values can be coerced to logicals, we can also use some shorthand to get the same indices as logical values:
as.logical(c(1, 1, 0))
## [1] TRUE TRUE FALSE
d[c(TRUE, TRUE, FALSE)]
## [1] 1 2
d[as.logical(c(1, 1, 0))]
## [1] 1 2
Note: A blank index like e[]
is treated specially in R.
It refers to all elements in a vector.
e <- 1:10
e[]
## [1] 1 2 3 4 5 6 7 8 9 10
This is of course redundant to just saying e
, but might produce unexpected results during assignment:
e[] <- 0
e
## [1] 0 0 0 0 0 0 0 0 0 0
This replaces all values of e
with 0, which may or may not be intended.
An important, if not the most important, object in the R language is the vector.
A vector is a set of items connected together.
Building a vector is easy using the c
operator:
c(1, 2, 3)
## [1] 1 2 3
This combines three items - 1 and 2 and 3 - into a vector.
The same result is possible with the :
(colon) operator:
1:3
## [1] 1 2 3
The two can also be combined:
c(1:3, 4)
## [1] 1 2 3 4
c(1:2, 4:5, 6)
## [1] 1 2 4 5 6
1:4
## [1] 1 2 3 4
And colon-built sequences can be in any direction:
4:1
## [1] 4 3 2 1
10:2
## [1] 10 9 8 7 6 5 4 3 2
And we can also reverse the order of a vector using rev
:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
rev(1:10)
## [1] 10 9 8 7 6 5 4 3 2 1
Arbitrary numeric sequences can also be built with seq
:
seq(from = 1, to = 10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(2, 25)
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## [24] 25
seq
accepts a number of optional arguments, including:
by, which controls the spacing between vector elements
seq(1, 10, by = 2)
## [1] 1 3 5 7 9
seq(0, 1, by = 0.1)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
length.out, which controls the length of the resulting sequence
seq(0, 1, length.out = 11)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
A related function seq_along
produces a sequence the length of another vector:
seq_along(c(1, 4, 5))
## [1] 1 2 3
This is shorthand for combining seq
with the length
function:
length(c(1, 4, 5))
## [1] 3
seq(1, length(c(1, 4, 5)))
## [1] 1 2 3
It's also possible to create repeated sequences using rep
:
rep(1, times = 5)
## [1] 1 1 1 1 1
This also allows us to repeat shorter vectors into longer vectors:
rep(c(1, 2), times = 4)
## [1] 1 2 1 2 1 2 1 2
If we use an each
parameter instead of a times
parameter, we can get a different result:
rep(c(1, 2), each = 4)
## [1] 1 1 1 1 2 2 2 2
Finally, we might want to repeat a vector to a length that is not a multiple of the original vector's length.
For example, we might want to alternate 1 and 2 for five values. We can use the length.out
parameter:
rep(c(1, 2), length.out = 5)
## [1] 1 2 1 2 1
These repetitions can be helpful when we need to categorize data into groups.
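For example (a small sketch), we might assign the first three observations to group 1 and the next three to group 2:
rep(1:2, each = 3)
## [1] 1 1 1 2 2 2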
The above vectors are numeric, but vectors can be other classes, like character:
c("a", "b")
## [1] "a" "b"
Sequences of dates are also possible, using Date classes:
seq(as.Date("1999/1/1"), as.Date("1999/3/5"), "week")
## [1] "1999-01-01" "1999-01-08" "1999-01-15" "1999-01-22" "1999-01-29"
## [6] "1999-02-05" "1999-02-12" "1999-02-19" "1999-02-26" "1999-03-05"
seq(as.Date("1999/1/1"), as.Date("1999/3/5"), "day")
## [1] "1999-01-01" "1999-01-02" "1999-01-03" "1999-01-04" "1999-01-05"
## [6] "1999-01-06" "1999-01-07" "1999-01-08" "1999-01-09" "1999-01-10"
## [11] "1999-01-11" "1999-01-12" "1999-01-13" "1999-01-14" "1999-01-15"
## [16] "1999-01-16" "1999-01-17" "1999-01-18" "1999-01-19" "1999-01-20"
## [21] "1999-01-21" "1999-01-22" "1999-01-23" "1999-01-24" "1999-01-25"
## [26] "1999-01-26" "1999-01-27" "1999-01-28" "1999-01-29" "1999-01-30"
## [31] "1999-01-31" "1999-02-01" "1999-02-02" "1999-02-03" "1999-02-04"
## [36] "1999-02-05" "1999-02-06" "1999-02-07" "1999-02-08" "1999-02-09"
## [41] "1999-02-10" "1999-02-11" "1999-02-12" "1999-02-13" "1999-02-14"
## [46] "1999-02-15" "1999-02-16" "1999-02-17" "1999-02-18" "1999-02-19"
## [51] "1999-02-20" "1999-02-21" "1999-02-22" "1999-02-23" "1999-02-24"
## [56] "1999-02-25" "1999-02-26" "1999-02-27" "1999-02-28" "1999-03-01"
## [61] "1999-03-02" "1999-03-03" "1999-03-04" "1999-03-05"
But vectors can only have one class, so elements will be coerced, such that:
c(1, 2, "c")
## [1] "1" "2" "c"
produces a character vector.
We can create vectors of different classes using the appropriate functions:
(1) The function numeric
produces numeric vectors:
numeric()
## numeric(0)
The result is an empty numeric vector. If we supply a length
parameter:
numeric(length = 10)
## [1] 0 0 0 0 0 0 0 0 0 0
The result is a vector of zeroes.
(2) The function character
produces an empty character vector:
character()
## character(0)
We can again supply a length
argument to produce a vector of empty character strings:
character(length = 10)
## [1] "" "" "" "" "" "" "" "" "" ""
(3) The function logical
produces an empty logical vector:
logical()
## logical(0)
Or, with a length
parameter, a vector of FALSE values:
logical(length = 10)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
These functions may seem kind of pointless right now. But they are useful in large projects.
Filling in the values of a vector that has been “initialized” (e.g., with numeric
, character
, or logical
) is much faster than growing a vector with repeated calls to c()
.
This is hard to observe at this scale (a few elements) but matters with bigger data.
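A common pattern (a minimal sketch) is to initialize a vector of the needed length and then fill in its elements, for example inside a loop:
out <- numeric(5)  # pre-allocated with zeroes
for (i in 1:5) {
    out[i] <- i^2  # fill each element in place
}
out
## [1]  1  4  9 16 25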