table | Contingency table from count data
xtabs | Contingency table from count data in frequency-weighted format
as.data.frame.table | Data frame in frequency-weighted format from a contingency table
ftable | "Flat" contingency table
prop.table | Scale table cells
addmargins | Add margins to a table
margin.table | Get marginal sums
chisq.test | Chi-squared contingency table tests and goodness-of-fit tests
fisher.test | Fisher's exact test for 2x2 contingency tables
# Data frame with some dummy count data
d1 = data.frame(
g1=factor(rep(1:2, c(73,47)), labels=c("yes","no")),
g2=factor(c(rep(1:3, c(21,19,33)), rep(1:3, c(16,12,19))), labels=c("low","med","high")),
g3=rep(gl(4,5, labels=LETTERS[1:4]),6) )
# Same data in frequency-weighted format
d2 = as.data.frame.table(table(d1))
names(d2) = c("f1","f2","f3","freq") # Make names of d2 different from d1 so we can attach both
attach(d1)
attach(d2)
The table and xtabs functions construct contingency tables (cross-tabulations) of count data. The table function is designed for "long" format count data, with one row per observation. It takes one or more factors as arguments, or a data frame of factors, and returns a table whose cells are the counts at each combination of the factor levels. The xtabs function is designed for frequency-weighted data. It takes a formula with the frequency variable on the left-hand side and the factors on the right-hand side. The as.data.frame.table function is the inverse of xtabs in that it takes a contingency table and returns a data frame in frequency-weighted format.
# 1-dimensional tables
table(g1)
xtabs(freq~f1)
# 2-dimensional tables
table(g1,g2)
xtabs(freq~f1+f2)
# 3-dimensional tables (using ftable to make "flat" tables)
ftable(table(g1,g2,g3), row.vars=1:2)
ftable(xtabs(freq~f1+f2+f3), row.vars=1:2)
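As a rough check that the two formats carry the same information, the sketch below builds a table from long-format data, converts it to frequency-weighted form, and rebuilds it with xtabs. It is self-contained and avoids attach; the names long, freq, tab_long and tab_freq are made up for this illustration.

```r
# Long-format data: one row per observation (same dummy factors as above)
long <- data.frame(
  g1 = factor(rep(1:2, c(73, 47)), labels = c("yes", "no")),
  g2 = factor(c(rep(1:3, c(21, 19, 33)), rep(1:3, c(16, 12, 19))),
              labels = c("low", "med", "high")))

tab_long <- table(long)                         # counts from long-format data
freq     <- as.data.frame(tab_long)             # columns g1, g2 and Freq
tab_freq <- xtabs(Freq ~ g1 + g2, data = freq)  # counts rebuilt from the weights

# Both routes should give the same cell counts
all(as.vector(tab_long) == as.vector(tab_freq))
```

Passing data= to xtabs keeps the variables out of the search path, which may be preferable to attach in longer scripts.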
See also tapply and other aggregators for tables of aggregated data. For example, tables of group means:
x = runif(120)
tapply(x, list(g1,g2), mean) # Table of group means
ftable(tapply(x, list(g1,g2,g3), mean), row.vars=1:2) # Flat table of group means
The prop.table function takes a table (or matrix), scales the values in its cells, and returns the scaled table. By default each cell is divided by the sum of all cells, which for a contingency table converts the counts to relative frequencies. The optional margin argument instead sets the scale factor to a marginal sum (i.e. a row or column sum): with margin=1 each cell is divided by the sum of its row, and with margin=2 by the sum of its column.
m = table(g1,g2)
prop.table(m) * 100 # Relative frequency scaled up to a percentage
prop.table(m,1) # Scale cells by the row sum
prop.table(m,2) # Scale cells by the column sum
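The scaling can also be written out by hand, which may make the meaning of the margin argument clearer. The sketch below is self-contained; the matrix holds the same counts as table(g1,g2) above, and the names p_all, p_row and p_col are made up for this illustration.

```r
# Same counts as table(g1, g2), entered directly as a matrix
m <- matrix(c(21, 16, 19, 12, 33, 19), nrow = 2,
            dimnames = list(g1 = c("yes", "no"), g2 = c("low", "med", "high")))

p_all <- m / sum(m)                     # divide by the grand total
p_row <- sweep(m, 1, rowSums(m), "/")   # divide each row by its row sum
p_col <- sweep(m, 2, colSums(m), "/")   # divide each column by its column sum

all.equal(p_all, prop.table(m))         # same as the default scaling
all.equal(p_row, prop.table(m, 1))      # same as margin=1
all.equal(p_col, prop.table(m, 2))      # same as margin=2

rowSums(p_row)                          # each row of p_row sums to 1
```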
The margin.table function takes a table (or matrix) and returns the sum total or the marginal sums. See also functions: rowsum, rowSums, colSums, rowMeans, colMeans.
margin.table(m) # Sum total
margin.table(m, 1) # Row sums
margin.table(m, 2) # Column sums
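For arrays with more than two dimensions the margin argument can name several dimensions at once, in which case the remaining dimensions are summed out. A minimal sketch, using a made-up 2x3x4 array a filled with the numbers 1 to 24 rather than real counts:

```r
# Made-up 3-D array standing in for a three-way table of counts
a <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(g1 = c("yes", "no"),
                           g2 = c("low", "med", "high"),
                           g3 = LETTERS[1:4]))

margin.table(a)            # grand total
margin.table(a, 1)         # one sum per level of g1 (summed over g2 and g3)
margin.table(a, c(1, 2))   # 2 x 3 table of sums over g3

# The same two-way margin computed with apply
apply(a, c(1, 2), sum)
```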
The addmargins function takes a table (or matrix) and returns it with additional margins containing the marginal sums. The optional FUN argument can be used to pass a summary function.
addmargins(m) # Marginal sums
addmargins(m, margin=1) # Margin containing column sums
addmargins(m, margin=2) # Margin containing row sums
addmargins(m, FUN=mean) # Marginal means
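The margin argument names the dimension that is extended, so margin=1 adds a row, and that row holds the column sums. The sketch below checks this on the same counts as above; it is self-contained, and the names am and am1 are made up for this illustration.

```r
# Same counts as table(g1, g2), stored as a table object
m <- as.table(matrix(c(21, 16, 19, 12, 33, 19), nrow = 2,
                     dimnames = list(g1 = c("yes", "no"),
                                     g2 = c("low", "med", "high"))))

am  <- addmargins(m)               # extra "Sum" row and "Sum" column
am1 <- addmargins(m, margin = 1)   # extra "Sum" row only

am["Sum", "Sum"] == sum(m)                          # corner cell is the grand total
all(as.vector(am1["Sum", ]) == as.vector(colSums(m)))  # added row holds the column sums
```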
The write.table function writes 1 and 2-dimensional tables. For example:
m = table(g1,g2) # Contingency table
write.table(m, file="", sep="\t", col.names=NA, quote=FALSE) # See ?write.table for the meaning of col.names=NA
To label the row and column variables, set the names of the dimnames attribute and use the ftable and write.ftable functions. For example:
names(attr(m,"dimnames")) = c("Cond1","Cond2")
write.ftable(ftable(m), file="")
Tables of aggregated statistics may need to be rounded and formatted. For example:
m = tapply(x, list(g1,g2), mean) # Table of group means
write.table(round(m,3), file="", sep="\t", col.names=NA, quote=FALSE) # Round each entry to 3dp
write.table(format(round(m,3)), file="", sep="\t", col.names=NA, quote=FALSE) # Format to pad to 3dp
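The difference between the two calls is that round returns numbers, so trailing zeros are dropped on output, whereas format returns character strings padded to a common number of decimal places. A small self-contained sketch with made-up values (the names rounded and padded are for illustration only):

```r
# Hypothetical 2 x 2 table of group means (made-up values)
m <- matrix(c(0.5, 0.25, 0.12345, 0.6789), nrow = 2,
            dimnames = list(c("yes", "no"), c("low", "high")))

rounded <- round(m, 3)       # still numeric: 0.5 is written as 0.5
padded  <- format(rounded)   # character: "0.500", "0.250", "0.123", "0.679"

write.table(rounded, file = "", sep = "\t", col.names = NA, quote = FALSE)
write.table(padded,  file = "", sep = "\t", col.names = NA, quote = FALSE)
```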
For 3 or 4-dimensional tables use the ftable and write.ftable functions. For example:
m = ftable(tapply(x, list(g1,g2,g3), mean), row.vars=1:2) # 3-way table of group means
names(attr(m, "row.vars")) = c("Cond1","Cond2") # Add names for the row variables
names(attr(m, "col.vars")) = "Subject" # ...and the column variable
write.ftable(m, file="", digits=3, quote=FALSE) # Pretty-print the table (file="" writes to the console)
More general tables may be structured as an array with 'dimnames' and displayed using ftable. For example:
t1 = tapply(x, list(g1,g2,g3), mean)
t2 = tapply(x, list(g1,g2,g3), sd)
t3 = tapply(x, list(g1,g2,g3), length)
a1 = array(t1, dim=c(2,3,4), dimnames=list(levels(g1),levels(g2),levels(g3)))
a2 = array(c(t1,t2), dim=c(2,3,4,2), dimnames=list(levels(g1),levels(g2),levels(g3),c("Mean","Std Dev")))
a3 = array(c(t1,t2,t3), dim=c(2,3,4,3), dimnames=list(levels(g1),levels(g2),levels(g3),c("Mean","Std Dev","N")))
ftable(a1)
ftable(a2)
ftable(a3)
The chi-squared statistic can be used to test for an association between two (or more) categorical variables represented by factors. The test compares the "observed frequencies" at each combination of factor levels with the "expected frequencies" that would arise if the factors were independent. For a two-way table the expected frequency of a cell is its row total multiplied by its column total, divided by the grand total. The null hypothesis is that the factors are independent, so a small p-value indicates that the observed frequencies differ from the expected frequencies by more than chance alone would explain.
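For a two-way table the calculation can be sketched by hand in base R and compared with chisq.test. The block below is self-contained; the matrix obs holds the same counts as table(g1,g2) in the dummy data, and the names obs, expected, stat, df, p and res are made up for this illustration.

```r
# Observed counts (same as table(g1, g2) in the dummy data)
obs <- matrix(c(21, 16, 19, 12, 33, 19), nrow = 2,
              dimnames = list(g1 = c("yes", "no"), g2 = c("low", "med", "high")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic, degrees of freedom and upper-tail p-value
stat <- sum((obs - expected)^2 / expected)
df   <- (nrow(obs) - 1) * (ncol(obs) - 1)
p    <- pchisq(stat, df, lower.tail = FALSE)

# Compare with chisq.test (no continuity correction is applied to a 2 x 3 table)
res <- chisq.test(obs)
all.equal(unname(res$statistic), stat)
all.equal(res$p.value, p)
```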
The chisq.test function performs chi-squared contingency table tests and goodness-of-fit tests. Given a 1-D table (based on one factor) it performs a goodness-of-fit test. Given a 2-D table (based on two factors) it performs a chi-squared test for independence of the two factors. The function is designed only for 1-D and 2-D tables. However, the summary method for table objects (returned by table or xtabs) also performs a chi-squared test for independence of factors, and this can handle tables based on more than two factors.
t2 = table(g1,g2) # 2-D contingency table
barplot(t2, beside=TRUE, legend.text=TRUE, ylim=c(0,40)) # Barplot of the contingency table
chisq.test(t2) # Pearson Chi-Square test
summary(t2) # ...(also performed by the table summary method)
chisq.test(t2)$observed # The observed frequencies, (same as t2)
chisq.test(t2)$expected # The expected frequencies
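The 1-D and multi-way cases can be sketched in the same way. The block below is self-contained and rebuilds the dummy factors without attach; the names gof, gof2 and s3, and the proportions 0.6 and 0.4, are made up for this illustration.

```r
# Dummy factors, as in the data frame d1 above
g1 <- factor(rep(1:2, c(73, 47)), labels = c("yes", "no"))
g2 <- factor(c(rep(1:3, c(21, 19, 33)), rep(1:3, c(16, 12, 19))),
             labels = c("low", "med", "high"))
g3 <- rep(gl(4, 5, labels = LETTERS[1:4]), 6)

# 1-D table: goodness-of-fit test, by default against equal proportions
gof <- chisq.test(table(g1))

# Goodness-of-fit against hypothesised proportions (0.6 and 0.4 are made up)
gof2 <- chisq.test(table(g1), p = c(0.6, 0.4))

# 3-D table: chi-squared test for independence via the summary method
s3 <- summary(table(g1, g2, g3))
s3$statistic   # chi-squared statistic
s3$parameter   # degrees of freedom
s3$p.value     # p-value
```

The summary method also reports approx.ok, which flags whether the chi-squared approximation is likely to be adequate for the table.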
The test statistic is only approximately chi-squared distributed, and the approximation becomes inaccurate when expected frequencies are "small". Cochran's rule of thumb treats the approximation as unreliable if, for a 2x2 table, any cell has expected frequency < 5, or, for a larger table, any cell has expected frequency < 1 or more than 20% of the cells have expected frequency < 5.
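One way to apply the rule of thumb is to inspect the expected frequencies returned by chisq.test. The helper below is a hypothetical sketch, not part of base R: the name cochran_ok is made up, it handles 2-D tables only, and it encodes the common reading of the rule (every expected count at least 5 for a 2x2 table; otherwise none below 1 and at most 20% below 5).

```r
# Hypothetical helper (not part of base R): TRUE if the rule of thumb is met
cochran_ok <- function(tab) {
  stopifnot(length(dim(tab)) == 2)                  # 2-D tables only
  e <- suppressWarnings(chisq.test(tab)$expected)   # expected frequencies
  if (nrow(e) == 2 && ncol(e) == 2) {
    all(e >= 5)                                     # 2 x 2: every cell >= 5
  } else {
    all(e >= 1) && mean(e < 5) <= 0.20              # larger: none < 1, at most 20% < 5
  }
}

# Counts from the dummy data: all expected frequencies are well above 5
big <- matrix(c(21, 16, 19, 12, 33, 19), nrow = 2)
cochran_ok(big)

# A sparse 2 x 2 table: every expected frequency is 2
small <- matrix(c(3, 1, 1, 3), nrow = 2)
cochran_ok(small)
```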
In the special case of a 2x2 table the chisq.test function applies Yates' continuity correction by default, which attempts to make the Pearson chi-squared statistic more accurate when the expected frequencies are small. However, in that case it may be better to use Fisher's exact test instead, which is performed by the fisher.test function.
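The three options can be compared on a small 2x2 table. The sketch below is self-contained and uses made-up counts; the names small, with_yates, without_yates and exact are for illustration only. The chisq.test calls are wrapped in suppressWarnings because the expected frequencies are deliberately small.

```r
# Made-up 2 x 2 table with small counts
small <- matrix(c(3, 1, 1, 3), nrow = 2,
                dimnames = list(treatment = c("a", "b"),
                                outcome = c("success", "failure")))

with_yates    <- suppressWarnings(chisq.test(small))                   # default: corrected
without_yates <- suppressWarnings(chisq.test(small, correct = FALSE))  # uncorrected Pearson
exact         <- fisher.test(small)                                    # Fisher's exact test

with_yates$statistic      # smaller than the uncorrected statistic
without_yates$statistic
exact$p.value             # exact two-sided p-value
exact$estimate            # odds ratio (conditional maximum likelihood estimate)
```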
The chi-squared statistic is not a correlation coefficient in the sense that it cannot usefully be squared to produce a measure of effect size for comparison purposes. Measures of the strength of association between categorical variables are provided by the phi function in package psych and by the assocstats function in package vcd, which reports Cramér's V among other coefficients. In the 2x2 case of association between dichotomous variables the odds ratio calculated by fisher.test is also useful.
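Cramér's V can also be computed directly from the uncorrected chi-squared statistic as sqrt(X2 / (n * (min(rows, columns) - 1))), which for a 2x2 table equals the absolute value of phi. The helper below is a hypothetical base-R sketch (the name cramers_v is made up and is not a function from any package).

```r
# Hypothetical helper (not from any package): Cramér's V for a 2-D table
cramers_v <- function(tab) {
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE)$statistic))
  n <- sum(tab)              # total number of observations
  k <- min(dim(tab)) - 1     # smaller of (rows - 1) and (columns - 1)
  sqrt(chi2 / (n * k))
}

# Counts from the dummy data
m <- matrix(c(21, 16, 19, 12, 33, 19), nrow = 2)
cramers_v(m)

# Limiting cases: perfect association and no association
cramers_v(matrix(c(10, 0, 0, 10), nrow = 2))    # 1
cramers_v(matrix(c(10, 10, 10, 10), nrow = 2))  # 0
```

V ranges from 0 (no association) to 1 (perfect association), so unlike the raw chi-squared statistic it can be compared across tables of different sizes and sample sizes.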
detach(d1) # Clean up
detach(d2)