The two answers from @MarkMiller shows how confusing statistics based on counts can be. Before we can get confidence intervals, it is essential that we identify clearly what is being measured.
When examining counts for a desired outcome (let's call them success arbitrarily), and want to convert them to percentages, two scenarios are possible. (a) if the count for one cell was to increase by one, would one of the other counts decrease by one? (b) could the count for one cell increase while leaving the other counts unchanged? We examine these two scenarios with the following example data taken from the OP:
count <- c(20, 1, 9, 1)
X5employf <- c(1, 2, 3, 4)
data <- data.frame(X5employf, count)
In this scenario, the counts show a classification of the outcome into one of four categories. In the data.frame, there were 31 observations (20+9+1+9) and they were assigned to some class 1, 2, 3 or 4. In that situation, when an observation is misclassified, the count where it was classified must be decreased by 1 and necessarily, the correct class's count must be increased by 1 so that all the 39 observations are taken into account.
In that situation, we have what is called a multinomial
distribution of class memberships, and such data must be analysed with classification models assuming the multinomial distribution. One of @MarkMiller 's answer referred to R's package MultinomialCI
which should be best (see Sison and Glaz). I also propose one solution below using another R package of mine, ANOPA
(see Laurencelle and Cousineau). Such measures are called frequencies.
In this scenario, a subject or a group of subjects (either humans or any other unit of interest) is given a certain number of trials
, and each trial may result in a success of failure. Then, we count the number of success. Other groups are likewise given trials (there may be the same number of trials, or all could have variable amounts of trials). If ever we notice that we miss-read a trial's result, the count may go up or down by one unit, but it has no influence on the other group's counts.
In that situation, each group has a proportion of success and the correct model is based on a binomial distribution. The other response from @MarkMiller was based on a R's package binom
which is adequate in that second scenario (based on Clopper & Pearson). I will also propose an alternative based on the R package ANOFA
, also a package of mine. When the count of success is divided by the number of trials, measures like this are called proportions.
The two scenarios are mutually exclusive. The two approaches cannot be correct at the same time. One is correct necessarily implies that the second is incorrect because they are based on different, mutually exclusive, assumptions.
The ANOFA framework is analogous to ANOVAs in the sense that it examines frequencies in situations with one or many factors.
# we load some relevant packages:
library(dplyr)
library(ANOFA)
# we run an analysis with count as the dependent variable
w = anofa(count ~ X5employf, data)
summary(w) # result of the analysis
# G df Gcorrected pvalue etasq
# X5employf 32.42 3 31.57 1e-06 0.5112
# plot of the scores (this is a ggplot graph)
anofaPlot(w)
The frequencies, not at all surprisingly, as significantly different across groups (G(3) = 31.57, p < .001). This is the plot
To respond to OP, here are the 95% confidence intervals
anofaPlot(w, showPlot = F)
# X5employf center lowerwidth upperwidth
# 1 1 20 -11.870873 10.079419
# 2 2 1 -1.949385 8.355309
# 3 3 9 -9.181831 11.782361
# 4 4 1 -1.949385 8.355309
Although it is not necessary in the present analyze, the OP asked how we can compute the total count per category of "X5employf". This should do it:
data$grp <- 1 # some factor encompassing the groups...
totals <- group_by(data, grp) %>% summarise(n = sum(count) )
data <- merge(data, totals, by = "grp")
We can compare the CI from ANOPA
with those from MultinomialCI
:
library(MultinomialCI)
uu <- multinomialCI(x,alpha=0.05,verbose=FALSE)
uu
# [,1] [,2]
# [1,] 0.5161290 0.8332777
# [2,] 0.0000000 0.2203745
# [3,] 0.1612903 0.4784390
# [4,] 0.0000000 0.2203745
ww <- anofaPlot(w, showPlot = F)
data.frame(lowlim = ww$center/31+ww$lowerwidth/31/2,
higlim = ww$center/31+ww$upperwidth/31/2)
# lowlim higlim
# 1 0.4536956 0.8077326
# 2 0.0008164 0.1670211
# 3 0.1422285 0.4803607
# 4 0.0008164 0.1670211
The widths in ww
were adjusted to allow pair-wise comparisons, as is typically the purpose of a plot. By dividing their width by 2, we get stand-alone confidence intervals.
As seen, the confidence intervals returned by ANOPA are not identical to those returned by multinomialCI
(sometimes shorter, sometimes wider). This is because ANOPA does not use exact formulas for their computations, contrary to the package MultinomialCI
.
The ANOPA framework is also analogous to ANOVAs in the sense that it examines proportions in situations with one or many factors; also, the same subjects could be tested in multiple conditions (within-subject factors). The ANOFA is described in this text.
To compute proportions, we need the number of trials. Let's suppose they are
ntrials <- c(31,4,11,3)
data$ntrials <- ntrials
data
# grp X5employf count ntrials
# 1 1 1 20 31
# 2 1 2 1 4
# 3 1 3 9 11
# 4 1 4 1 3
We also assume that the design has three groups, identified with the variable "X5employf". We can launch an analysis with
library(ANOPA)
w <- anopa({count;ntrials} ~ X5employf , data)
summary(w) # to see the effect of the factor (n.s.)
# X5employf 0.061618 3 1.573304 0.193494 1.147188 1.371444 0.24938
# Error 0.039165 Inf
# to see a plot
anopaPlot(w)
The proportions across X5employf
levels are not significantly different (F(3, inf) = 1.57, p = .19). The plot is
We get the summary statistics with
# lowerwidth is the width of the lower branch of the 95% CI relative to center
# upperwidth is the width of the upper branch
anopaPlot(w, showPlot = FALSE)$summaryStatistics
# X5employf center lowerwidth upperwidth
# 1 1 0.6451613 -0.2501391 0.2154477
# 2 2 0.2500000 -0.3437500 0.7045993
# 3 3 0.8181818 -0.4133632 0.2123664
# 4 4 0.3333333 -0.4583333 0.7517923
where center is the proportion.
library(binom)
tt <- binom.confint(x, n, conf.level = 0.95, methods = "exact")
ww <- anopaPlot(w, showPlot = FALSE)$summaryStatistics[1:4,]
tt$lower
tt$upper
# [1] 0.4536956 0.0063094 0.4822441 0.0084038
# [1] 0.8077326 0.8058796 0.9771688 0.9057007
ww$center+ww$lowerwidth/sqrt(2)
ww$center+ww$upperwidth/sqrt(2)
# [1] 0.4682863 0.0069320 0.5258899 0.0092427
# [1] 0.7975058 0.7482270 0.9683476 0.8649307
The ANOPA confidence intervals are slightly shorter. In the present case, it is related to the very small number of trials (4 and 3) in two of the categories. As a rule of thumb, proportions should be based on at least 20 observations, more so if the proportions are quite different from 50%.