confidence interval

A confidence interval provides a range of plausible values for the parameter being estimated.

> [!Example]+ Formulas
> - [[confidence interval for a mean]]
> - [[confidence interval for a difference in proportions]]
> - [[confidence interval for the variance]]
> - [[confidence interval for the ratio of two variances]]

## General approach
The general approach to constructing a confidence interval is to define the [[estimator]] and then find a function of the estimator and the "target" whose distribution is known and free of unknown parameters. Such a function is known as a pivotal quantity.

1. Define the estimator
2. Find a function of the estimator and the target (the pivotal quantity) whose distribution is known
3. Identify critical values that meet the confidence level from that distribution (or some transformation, like the transformation to the standard normal distribution)
4. Solve for the true parameter "in the middle" of the inequality

## Intuition
We know that we can capture a random variable within an upper and lower bound with some probability $P$ based on its [[probability density function|pdf]]. Let's say we want to capture the random variable with $95\%$ probability. In that case, we would want to leave $5\%$ in the tails, or $2.5\%$ in each tail. The upper bound would be the value at which the area to the left includes $95\% + 2.5\% = 97.5\%$ of the total area of the distribution. For the [[standard normal distribution]], we can find this value in a lookup table or with R using `qnorm(0.975)`, which gives us approximately $1.96$. By [[symmetry]], we know the lower bound is $-1.96$, which we could confirm with `qnorm(0.025)`.
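
As a quick check of the critical values described above, the following sketch computes them for an arbitrary confidence level and confirms the area captured between them. The variable names (`confidence`, `alpha`, `z.lower`, `z.upper`) are illustrative choices, not part of the original note.

```R
# Illustrative sketch: two-sided critical values from the standard normal
confidence <- 0.95
alpha <- 1 - confidence

z.lower <- qnorm(alpha / 2)       # leaves alpha/2 in the left tail
z.upper <- qnorm(1 - alpha / 2)   # leaves alpha/2 in the right tail

round(c(z.lower, z.upper), 2)
# -1.96  1.96

# The area between the two bounds recovers the confidence level
pnorm(z.upper) - pnorm(z.lower)
```

Changing `confidence` to another level, such as `0.90` or `0.99`, gives the corresponding critical values without a lookup table.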

#diagram

If we standardize our estimator, we then want to solve the inequality

$P(-1.96 < \frac{\bar X - \mu}{\sigma / \sqrt{n}} < 1.96) = 0.95$

We can multiply all three parts by $\sigma/\sqrt{n}$, subtract $\bar X$ from each part, then multiply by $-1$ and flip the inequalities to get

$P(\bar X -1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar X + 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95$

The left and right sides form the endpoints of our confidence interval. If $\sigma$ is not known, we can use an estimator in its place, such as the square root of the [[sample variance]], provided the sample size is large enough.

## Interpretation of a confidence interval
The correct interpretation of a confidence interval for a given level of confidence, for example 95%, is that 95% of the confidence intervals constructed will capture the true parameter value if the sampling were repeated. Once an interval has been computed from data, it either contains the true parameter or it does not, so the 95% describes the procedure rather than any single interval. Do not say "I'm 95% confident..." or "there is 95% probability that the true parameter is within the confidence interval".

## Example confidence interval for difference in population means
Let's consider the difference between population means at a $95\%$ confidence level as an example, using two large samples that each have size $n$. The estimator in this case is simply $\bar X_1 - \bar X_2$. Each sample mean is (at least approximately, for large samples) normally distributed, so this estimator, as a [[linear combination of normal random variables]], also has the normal distribution. Assuming the two samples are independent,

$\bar X_1 - \bar X_2 \sim N \Big (\mu_1 - \mu_2, \frac{\sigma^2_1}{n} + \frac{\sigma^2_2}{n} \Big )$

We can standardize this estimator by subtracting the mean and dividing by the square root of the variance to get a [[standard normal distribution]].

$Z = \frac{\bar X_1 - \bar X_2 -(\mu_1 - \mu_2)}{\sqrt{\frac{\sigma^2_1}{n} + \frac{\sigma^2_2}{n}}} \sim N(0,1)$

Placing this value between two critical values such that the probability equals $95\%$ gives us

$P(-z_{\alpha/2} < \frac{\bar X_1 - \bar X_2 -(\mu_1 - \mu_2)}{\sqrt{\frac{\sigma^2_1}{n} + \frac{\sigma^2_2}{n}}} < z_{\alpha/2}) = 95\%$

Finally, we solve for the true parameter in the middle by multiplying all three parts of the inequality by the denominator, subtracting the estimator, and then multiplying by $-1$ and flipping the inequalities.

$\bar X_1 - \bar X_2 - z_{\alpha/2} \sqrt{\frac{\sigma^2_1}{n} + \frac{\sigma^2_2}{n}} < \mu_1 - \mu_2 < \bar X_1 - \bar X_2 + z_{\alpha/2} \sqrt{\frac{\sigma^2_1}{n} + \frac{\sigma^2_2}{n}}$

Simplifying the notation, we have our confidence interval

$\bar X_1 - \bar X_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma^2_1}{n} + \frac{\sigma^2_2}{n}}$

If we don't know $\sigma^2_1$ and $\sigma^2_2$, we can replace them with estimators, such as the [[sample variance]] of each sample.

## Example 2
Suppose that $X_1, X_2, \dots, X_n$ is a random sample from the [[exponential distribution]] with rate $\lambda>0$. Construct a 95% confidence interval for $\lambda$.

First, we'll choose an estimator for $\lambda$. We know that the sum of independent exponential random variables with a common rate has the [[gamma distribution]].

$\sum_{i=1}^n X_i \sim \Gamma(n, \lambda)$

The mean is just the sum of the random variables times $1/n$. We know that when we multiply a gamma random variable by a constant $c$, the rate parameter $\beta$ becomes $\beta/c$. With $c = 1/n$, the rate $\lambda$ becomes $n\lambda$.

$\frac1n \sum_{i=1}^n X_i \sim \Gamma(n, n\lambda)$

Given this, we can use the sample mean as our estimator, but we need to define a function of the estimator and the parameter of interest whose distribution is known and does not depend on the parameter of interest. The gamma distribution above still depends on $\lambda$. However, we can remove $\lambda$ from the distribution by multiplying $\bar X$ by $\lambda$: by the same scaling rule with $c = \lambda$, the rate $n\lambda$ becomes $n\lambda/\lambda = n$.

$\bar X \sim \Gamma(n, n\lambda) \implies \lambda \bar X \sim \Gamma(n, n)$

We can express the confidence interval in terms of probability, where $\Gamma(n,n)_{p}$ denotes the value with area $p$ to its left under the $\Gamma(n,n)$ distribution.

$P \Big ( \Gamma(n,n)_{\alpha/2} < \lambda \bar X < \Gamma(n,n)_{1 - \alpha/2} \Big ) = 95\%$

Solving for $\lambda$ in the middle we have

$P \Big (\frac{1}{\bar X} \Gamma(n,n)_{\alpha/2} < \lambda< \frac{1}{\bar X} \Gamma(n,n)_{1 - \alpha/2} \Big ) = 95\%$

We can look up these critical values in [[R]] and solve for the confidence interval.

Here is the code in R to sample an exponential distribution with `rate=4` and `n=100` and construct a confidence interval using the expression above. Notice our confidence interval captures the true rate parameter.

```R
set.seed(1)
n <- 100
data <- rexp(n = n, rate = 4)  # sample of size n with true rate 4
data.mean <- mean(data)
# qgamma(p, shape, rate): critical values from Gamma(n, n)
lower.bound <- (1 / data.mean) * qgamma(0.025, n, n)
upper.bound <- (1 / data.mean) * qgamma(0.975, n, n)
round(c(lower.bound, upper.bound), 3)
# > (3.158, 4.678)
```

Historically, statisticians have transformed the gamma distribution to the [[chi-squared distribution]] because lookup tables were not available for the gamma distribution with two parameters. Recall the chi-squared distribution is a gamma distribution parameterized as $\chi^2(k) = \Gamma(k/2, 1/2)$, and what we have is $\Gamma(n, n)$. We can multiply by $2n$ to pull the $n$ term out of the $\beta$ parameter and transform to $\beta = 1/2$.

$2n\lambda \bar X \sim \Gamma(n, \frac12)$

The $\alpha$ parameter is $n$ rather than $n/2$, but since $n = 2n/2$, this is exactly a $\chi^2(2n)$ distribution, so we can use a $\chi^2(2n)$ random variable for the critical values. Expressing what we have in terms of probability we get

$P \Big (\chi^2(2n)_{\alpha/2} < 2n\lambda \bar X < \chi^2(2n)_{1 - \alpha/2} \Big) = 95\%$

Solving for $\lambda$ in the middle we get

$P \Big (\frac{\chi^2(2n)_{\alpha/2}}{2n\bar X} < \lambda < \frac{\chi^2(2n)_{1 - \alpha/2}}{2n\bar X} \Big) = 95\%$

Using R we can find the critical values and solve this confidence interval.

Here is the code in R to sample an exponential distribution with `rate=4` and `n=100` and construct a confidence interval using the chi-squared expression above. Notice that the interval captures the true rate parameter and is the same as the one calculated from the gamma distribution without first transforming to the chi-squared distribution.

```R
set.seed(1)
n <- 100
data <- rexp(n, 4)  # sample of size n with true rate 4
data.mean <- mean(data)
# qchisq(p, df): critical values from chi-squared with 2n degrees of freedom
lower.bound <- qchisq(0.025, 2 * n) / (2 * n * data.mean)
upper.bound <- qchisq(0.975, 2 * n) / (2 * n * data.mean)
round(c(lower.bound, upper.bound), 3)
# > (3.158, 4.678)
```

## R
In [[R]], use `qnorm(1 - alpha/2)` to get the critical value from the standard normal distribution and `qt(1 - alpha/2, n-1)` to get the critical value from the t-distribution.

```R
data <- c(2781, 2900, 3013, 2856, 2888, ...) # dataset as a vector
n <- length(data)
data.mean <- mean(data)
# sample variance (equivalent to var(data))
data.var <- sum((data - data.mean)^2) / (n - 1)
confidence <- 0.95
alpha <- 1 - confidence

if (n >= 30) {
  # large sample: standard normal critical value
  cv <- qnorm(1 - alpha / 2)
} else {
  # only valid if population is normally distributed
  cv <- qt(1 - alpha / 2, n - 1)
}

# endpoints: mean plus or minus critical value times the standard error
ci_lower <- data.mean - cv * sqrt(data.var / n)
ci_upper <- data.mean + cv * sqrt(data.var / n)
```

```mermaid
---
config:
  themeVariables:
    xyChart:
      backgroundColor: "#1c1b1a"
---
xychart-beta
    title "Standard Normal"
    x-axis [jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec]
    y-axis "Revenue (in $)" 4000 --> 11000
    bar [5000, 6000, 7500, 8200, 9500, 10500, 11000, 10200, 9200, 8500, 7000, 6000]
    line [5000, 6000, 7500, 8200, 9500, 10500, 11000, 10200, 9200, 8500, 7000, 6000]
```
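
To make the repeated-sampling interpretation of a confidence interval concrete, the sketch below draws many samples and counts how often the interval captures the true mean. Everything in it is an assumption chosen for illustration: a normal population with `mu = 10` and `sigma = 2`, a sample size of `n = 50`, `reps = 10000` repetitions, and $\sigma$ treated as known.

```R
# Illustrative simulation: long-run coverage of a 95% z-interval for a mean
set.seed(1)
mu <- 10       # true mean (assumed for the simulation)
sigma <- 2     # true standard deviation, treated as known
n <- 50        # size of each sample
reps <- 10000  # number of repeated samples
cv <- qnorm(0.975)

covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half.width <- cv * sigma / sqrt(n)
  # TRUE when this sample's interval contains the true mean
  (mean(x) - half.width < mu) && (mu < mean(x) + half.width)
})

# proportion of intervals that captured mu; expected to be close to 0.95
mean(covered)
```

Each individual interval either contains `mu` or does not. It is the proportion across repetitions that approaches the confidence level.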