A generalized linear model (GLM) extends the concepts of linear regression to cases that violate the assumptions of standard [[base/Linear Regression/linear regression|linear regression]], for example when the response is categorical (i.e., a non-normal response with non-constant variance).

A GLM has three components:

1. **random component**: the response.
2. **systematic component**: a linear combination of predictor measurements and parameters.
3. **[[link function]]**: connects the random response to the linear combination of predictor variables.

The link function must be a monotone, continuous, and differentiable function. There are convenient choices for certain data types. When the response is from the [[normal distribution]], the link function is simply $g(\mu) = \mu$ and so the generalized linear model is equivalent to the standard linear model. However, when the response is from some other distribution, this will not be true.

The response must be from the [[exponential family of distributions]] (which is not the same thing as the exponential distribution). For these distributions, there exists a well-behaved link function that allows the parameters to be estimated with the [[maximum likelihood estimator]]. In many cases the MLE cannot be calculated exactly, and so the result is a [[numerical analysis]] estimate computed with the **iteratively re-weighted least squares** algorithm in [[R]].

Let $x_j = (x_{1,j}, \dots, x_{n,j})^T$, $j = 1, \dots, p$ be a set of predictors; $Y = (Y_1, \dots, Y_n)^T$ be the response, and $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ be a vector of parameters.

## link function

The link function $g$ in a [[generalized linear model|GLM]] is a function of the mean $g(\mu)$ that connects the random response to a linear combination of predictor variables.

$g(\mu_i) = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i, 2} + \dots + \beta_p x_{i,p}$

## logit

The logit link function is a common link function used in binomial regression, given by:

$g(\mu) = \ln \Big( \frac{\mu}{1-\mu} \Big) = \theta$

## exponential family of distributions

$Y$ is a random variable from the exponential family if its distribution can be written as:

$f(y; \ \theta, \phi) = \exp\Big \{ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \Big \}$

$\theta$ is called the canonical or natural parameter and $\phi$ is called the dispersion parameter. If $Y$ is a [[random variable]] from the exponential family, then the mean is the first derivative of $b(\theta)$ and the variance is its second derivative multiplied by the dispersion parameter:

- $\mu = E(Y) = b'(\theta)$
- $\sigma^2 = Var(Y) = b''(\theta)\phi$

Many of the common distributions are from the exponential family, including

- [[Binomial distribution]]
- [[Poisson distribution]]
- [[exponential distribution]]
- [[gamma distribution]]
- [[normal distribution]] (yes, even the normal distribution is in the exponential family, but we'll use linear regression rather than generalized linear regression in this case!)
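As a quick worked sketch of the general form, the normal distribution with mean $\mu$ and variance $\sigma^2$ can be written as

$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \Big \{ -\frac{(y-\mu)^2}{2\sigma^2} \Big \} = \exp \Big \{ \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2) \Big \}$

so that $\theta = \mu$, $b(\theta) = \theta^2/2$, $a(\phi) = \phi = \sigma^2$, and $c(y, \phi) = -\frac{y^2}{2\phi} - \frac{1}{2}\ln(2\pi\phi)$. Then $b'(\theta) = \mu = E(Y)$ and $b''(\theta)\phi = \sigma^2 = Var(Y)$, as expected.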
## proving the binomial distribution is from the exponential family

The [[Binomial distribution]] is from the exponential family. To demonstrate this, we can re-write the [[pmf]] in the general form of the exponential family. The pmf is

$f(y; n,p) = P(Y=y; n,p) = \binom{n}{y}p^y(1-p)^{n-y}$

We can simultaneously exponentiate and take the log of the pmf to preserve the equality, then simplify using [[logarithm rules]] to get

$\begin{align} &= \exp \Big \{ \ln \Big [ \binom{n}{y} p^y (1-p)^{n-y} \Big ] \Big \} \\ &= \exp \Big \{ \ln \binom{n}{y} + y \ln p + (n-y) \ln (1-p) \Big \} \\ &= \exp \Big \{ \ln \binom{n}{y} + y \ln p - y \ln (1-p) + n \ln (1-p) \Big \} \\ &= \exp \Big \{ y \ln \Big ( \frac{p}{1-p} \Big ) + n \ln (1-p) + \ln \binom{n}{y} \Big \} \\ \end{align}$

This is in the form of the exponential family of distributions where

$\begin{align} \theta = \ln \Big ( \frac{p}{1-p} \Big ), && b(\theta) = -n \ln(1-p) = n \ln (1 + e^\theta), && c(y, \phi) = \ln \binom{n}{y} \end{align}$

The term $a(\phi)$ is simply $1$. To get $p$, we can exponentiate $\theta$ and solve for $p$ to get

$p = \frac{e^\theta}{1 + e^{\theta}}$

## binomial regression

Binomial regression is a type of [[generalized linear model|GLM]] where the response is assumed to be a realization from the [[Binomial distribution]]. Note that each $y_i$ could be drawn from a different binomial distribution; in other words, the response does not need to be [[identically distributed]], but it is assumed to be [[independent]].

$Y_i \overset{i}{\sim} \text{binomial}(n_i,p_i)$

The goal of binomial regression is to [[predict]] or [[explain]] $E(Y_i) = \mu_i = n_i p_i$ using the covariate class $x_{i,1},\dots,x_{i,p}$. Since each $n_i$ is known, what we're really after is $p_i$, the probability of success for a given $y_i$.

In binomial regression, the most popular [[link function]] is the logit link, which is the default in [[R]].

$g(p_i) = \ln \Big( \frac{p_i}{1-p_i} \Big) = \theta$

Note that we've already demonstrated this definition of $\theta$ by [[proving the binomial distribution is from the exponential family]]. The probit link function is also available, which uses the inverse of the [[cumulative density function|cdf]] of the standard normal distribution.

$g(p_i) = \Phi^{-1}(p_i)$

Using the logit link, the parameters can be interpreted as

- $\beta_0$ represents the average **log-odds** of success when all predictors are equal to zero.
- A unit increase in $x_j$ with all other predictors held constant increases the log-odds of success by $\beta_j$ (additive) and the [[odds]] of success by a factor of $\exp\{\beta_j\}$ (multiplicative).

When the predictors are standardized,

- $\beta_0$ represents the mean log-odds of success when all predictors are at their sample means.
- A one standard deviation increase in $x_j$ increases the log-odds of success by $\beta_j$ (additive) and the odds by a multiplicative factor of $e^{\beta_j}$, on average.

To understand why the increase in odds is multiplicative, rather than additive, consider the increase in the context of the link function in this simplified example.

$\begin{align} \beta_0 + \beta_1 x_1 &= \ln \Big( \frac{p}{1-p} \Big) \\ e^{ \beta_0 + \beta_1 x_1} &= \frac{p}{1-p} \end{align}$

If we increase $x_1$ by one unit, we have on the left side of the equation

$e^{ \beta_0 + \beta_1 (x_1 + 1) }$

We can rearrange the terms to get

$e^{ \beta_0 + \beta_1 x_1 + \beta_1} = e^{\beta_1} \cdot e^{ \beta_0 + \beta_1 x_1 }$

Note that the term $e^{ \beta_0 + \beta_1 x_1 }$ is the original odds, and so a one unit increase in $x_1$ increases the odds by a multiplicative factor of $e^{\beta_1}$ on average.
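As a concrete (hypothetical) illustration: if $\hat\beta_1 = 0.5$, then a one unit increase in $x_1$ adds $0.5$ to the log-odds and multiplies the odds by $e^{0.5} \approx 1.65$; if $\hat\beta_1 = -0.5$, the odds are multiplied by $e^{-0.5} \approx 0.61$.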
In [[R]], use the function `glm` with the parameter `family = binomial` to run binomial regression, optionally specifying the [[link function]] to use. The response is provided as a two-column matrix with successes in the first column and failures in the second column.

```R
glmod = glm(cbind(successes, trials - successes) ~ predictors,
            data = data, family = binomial(link="logit"))
```

When interpreting the coefficient estimates, recall that the intercept is the average log-odds when the other covariates are zero. The estimate for each covariate can be interpreted as the additive increase in log-odds for a one unit increase of that variable. To get the multiplicative increase in odds, exponentiate the estimate.

The [[deviance]] of the binomial regression is

$D = -2\ell (\hat{\beta}) = -2\sum^n_{i=1} \Big[ y_i\eta_i - n_i \ln \Big (1 + e^{\eta_i} \Big) + \ln{n_i \choose y_i} \Big]$

A smaller deviance implies a better fit. For $n_i > 5$ for all $i = 1,\dots,n$, it can be shown that the residual deviance (and more generally, any difference in deviances) follows a $\chi^2$ distribution with $n - (p+1)$ degrees of freedom. Thus an upper-tailed chi-squared test can be used to test the hypotheses:

$\begin{align} H_0&: \text{The model with } p \text{ parameters fits well enough.} \\ H_1&: \text{The model with } p \text{ parameters does not fit well enough.} \end{align}$

In [[R]], use

```R
pchisq(deviance(glmod), df.residual(glmod), lower.tail = FALSE)
```

A large p-value means the model fits the data well enough, because we do not have evidence against the null hypothesis.

## logistic regression

Logistic regression is a special case of binomial regression where $y_i$ is a realization from $Y_i \overset{i}{\sim} \text{binomial}(1,p_i)$ (which is to say the response is a realization of the [[Bernoulli distribution]]). In this case, each outcome is either $1$ or $0$. Logistic regression is a type of [[classification]] problem. Other popular approaches to classification are Bayesian networks, neural networks, and decision trees.

$P(Y = 1|X) = \frac{e^\eta}{1 + e^\eta}$

where $\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$.

In [[R]], use the function `glm` with the parameter `family = binomial` to run logistic regression, optionally specifying the [[link function]] to use.

```R
glmod = glm(success ~ predictors, data = data, family = binomial(link="logit"))
```

Note that goodness of fit metrics like deviance are not trustworthy for logistic regression. A common alternative is McFadden's pseudo-$R^2$ (see the sketch after the statsmodels example below).

### sklearn

```python
from sklearn.linear_model import LogisticRegression

# X, y: predictor matrix and binary response (assumed to be defined)
model = LogisticRegression().fit(X, y)

# Get coefficients
model.coef_

# Get intercept
model.intercept_

# Get predictions (classes)
model.predict(X_test)

# Get prediction probabilities
model.predict_proba(X_test)
```

### statsmodels

```python
import statsmodels.api as sm

# x_train should include a constant column (e.g., via sm.add_constant)
# if an intercept is desired
logit_model = sm.Logit(y_train, x_train).fit()
print(logit_model.summary())
```
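A rough sketch of McFadden's pseudo-$R^2$ (assuming the fitted `glmod` from the R logistic example above): it compares the log-likelihood of the fitted model to that of an intercept-only model.

```R
# McFadden's pseudo-R^2 = 1 - logLik(model) / logLik(intercept-only model);
# glmod is the (hypothetical) fitted logistic model from the example above
null_mod = update(glmod, . ~ 1)
1 - as.numeric(logLik(glmod)) / as.numeric(logLik(null_mod))
```

Values closer to $1$ indicate a larger improvement over the intercept-only model.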
## Poisson regression model

In Poisson regression, we assume the response variable $Y_i$ is a realization of the [[Poisson distribution]]. The model takes the form

$E(Y_i) = \lambda_i = \exp(\beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p})$

- $e^{\beta_0}$ can be interpreted as the mean of the response when each predictor is set to zero
- $e^{\beta_j}$ can be interpreted as the multiplicative increase in the mean of the response for a one unit increase in $x_{i,j}$, holding all other predictors constant

The rate parameter $\lambda_i$ may be different for each observation and potentially depends on the covariate class for the observation. Thus, Poisson regression can be thought of as an effort to estimate $\lambda_i$ for each observation $i$. The canonical parameter is $\theta = \ln(\lambda_i)$, so the log link is typically used as the [[link function]].

$g(\lambda_i) = \ln (\lambda_i) = \theta$

The parameters of the model are estimated using maximum likelihood estimation, but because the likelihood equations cannot be solved in closed form, [[R]] will use [[numerical analysis]] to approximate the solution.

### offset

Recall that a Poisson random variable is a discrete random variable that can be used to model counts or rates (where a rate is just a count over some period of time). For rate models, especially where the period of time differs across observations, the period is important to our model. In Poisson regression, we call the period of time an **offset**.

Consider that the rate $\lambda_i$ can be expressed as the expected count $\mu_i$ over the period $t_i$. Thus the [[link function]] can be written as

$\ln(\lambda_i) = \ln \Big( \frac{\mu_i}{t_i} \Big) = \ln(\mu_i) - \ln(t_i)$

The term $\ln(t_i)$ is the offset term. To include an offset term in the model, use the following syntax in [[R]].

```R
# Poisson model without an offset term
glm(response ~ predictors, data=data, family=poisson(link="log"))

# Poisson model with an offset term
glm(response ~ predictors + offset(log(time)), data=data, family=poisson(link="log"))
```

### deviance

Generally, the **deviance** of a GLM is $-2$ times the log likelihood of the GLM evaluated at the MLEs.

$D = -2\ell (\hat{\beta}) = -2\sum^n_{i=1} \Big[ y_i \hat \eta_i - e^{\hat \eta_i} - \ln(y_i!) \Big]$

where $\hat \eta_i = \hat\beta_0 + \hat\beta_1 x_{i,1} + \dots + \hat\beta_p x_{i,p}$. For a Poisson regression this equates to

$D = -2\ell (\hat{\beta}) = -2\sum^n_{i=1} \Big[ y_i \ln(\hat \lambda_i) - \hat \lambda_i - \ln(y_i!) \Big]$

where $\hat \lambda_i = e^{\hat \eta_i} = \exp(\hat\beta_0 + \hat\beta_1 x_{i,1} + \dots + \hat\beta_p x_{i,p})$. A smaller deviance implies a better goodness of fit.

#### null deviance

The null deviance is the deviance of the model with no predictors (equivalently, where all predictor coefficients are set to zero). For the Poisson regression, the fitted mean for every observation is then the sample mean $\bar y$ of the responses. This is analogous to the [[total sum of squares]] for standard [[linear regression]].

$D_{\text{null}} = -2\sum \Big [ y_i \ln(\bar y) - \bar y - \ln (y_i !) \Big ]$

#### saturated deviance

The saturated deviance is the deviance for a model where there is a parameter for every data point and the model can perfectly predict the response.

$D_{\text{sat}} = -2\sum \Big [ y_i \ln(y_i) - y_i - \ln (y_i !) \Big ]$

The saturated deviance is used to construct the residual deviance.

#### residual deviance

The residual deviance is the difference between the deviance of the model as fit, $D_p$, and the saturated deviance, $D_{\text{sat}}$. Residual deviance is analogous to the [[residual sum of squares|RSS]] in a standard [[linear regression]].

$D_{\text{resid}} = D_p - D_{\text{sat}} = 2\sum \Big [ y_i \ln \Big (\frac{y_i}{\hat \lambda_i} \Big) - (y_i - \hat \lambda_i) \Big ]$

Under the hypothesis that the model fits the data, the residual deviance follows a [[chi-squared distribution]]: $D_{\text{resid}} \sim \chi^2 (n - (p+1))$. Use an upper-tailed chi-squared test to calculate a [[p-value]] for a goodness of fit test in [[R]].

```R
pchisq(deviance(glm_model), df.residual(glm_model), lower.tail = FALSE)
```

If the p-value is small, you can reject the null hypothesis that this model fits well enough and infer that you may need more predictors or a model of a different form.
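As a sketch connecting this test to the residual deviance formula above (assuming a fitted Poisson model `glm_model` and its response vector `y`, both hypothetical names), the residual deviance can be reproduced by hand:

```R
# Residual deviance computed from the formula; observations with y_i = 0
# contribute nothing to the y * log(y / lambda) term
lambda_hat = fitted(glm_model)
dev_terms  = ifelse(y == 0, 0, y * log(y / lambda_hat)) - (y - lambda_hat)
2 * sum(dev_terms)    # should match deviance(glm_model)
```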
To compare two models, a chi-squared test can also be used. The test statistic is the difference between the deviance of the reduced (smaller) model and the full model.

```R
pchisq(model.small$deviance - model$deviance,
       df = model.small$df.residual - model$df.residual,
       lower.tail = FALSE)
```

#### deviance residuals

The deviance residuals can be calculated for each observation to provide a metric analogous to the [[residuals]] in standard [[linear regression]].

$d_i = \text{sign}(y_i - \hat \lambda_i) \sqrt{2 \Big [ y_i \ln \Big (\frac{y_i}{\hat \lambda_i} \Big) - (y_i - \hat \lambda_i) \Big ]}$

where $\text{sign}(y_i - \hat \lambda_i)$ gives the direction ($\pm$) of the deviance. Note that the deviance residuals are approximately normally distributed, $d_i \sim N(0, 1)$, because the residual deviance follows a [[chi-squared distribution]] under the hypothesis that the model fits the data, and we construct chi-squared random variables from sums of squares of [[standard normal distribution|standard normal distributions]].

If a plot of the deviance residuals against the predicted values shows a non-linear structure, it may be the case that additional predictors are needed. This is analogous to the [[residuals versus fitted plot]] in standard linear regression.

```R
residuals(glm.mod, type="deviance")
```

### Pearson's chi-squared

Pearson's chi-squared test statistic is another method for assessing goodness of fit for the Poisson regression.

$\chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^n \frac{(y_i - \hat \lambda_i)^2}{\hat \lambda_i} \sim \chi^2 (n - (p + 1))$

A large $\chi^2$ (and small associated p-value) is evidence against the null hypothesis that the model fit is sufficient.

For the Poisson regression, because the mean and variance are the same ($\lambda$), the square root of each term is the observation minus its mean divided by its standard deviation, which is how we [[standardize]] a random variable; thus these values approximately follow a [[standard normal distribution]]. These are called **Pearson's residuals**.

$P_i = \frac{y_i - \hat \lambda_i}{\sqrt{\hat \lambda_i}}$

Plot the Pearson's residuals against the fitted values to check for structure that might indicate additional parameters are needed, as in the sketch below.

```R
residuals(glm.mod, type="pearson")
```
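A minimal plotting sketch, reusing the (hypothetical) fitted model `glm.mod` from above:

```R
# Pearson residuals against fitted values; systematic structure suggests
# missing predictors or a poor choice of model form
plot(fitted(glm.mod), residuals(glm.mod, type="pearson"),
     xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
```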
### overdispersion

Count response data suffer from **overdispersion** when the variance of the response is greater than the mean (note that these two values should be equal under the Poisson model).

$Var(Y_i) > E(Y_i)$

Real overdispersion arises when this condition is systematic in the data. An example is zero-inflation, when the observed response includes many more zero values than would be expected under the assumed distribution (use hurdle models or mixture models). Spatial or temporal correlation can also lead to overdispersion; in these cases it's important to include the spatial or temporal variables in the model. Any dependent response may result in overdispersion.

Apparent overdispersion can occur when there are outliers or missing predictors. It can also arise under an inappropriate link function. To remedy apparent overdispersion, remove the outlier, add a predictor, or try a different link function.

To quantify overdispersion in Poisson regression, we can add a dispersion parameter $\phi$ to the model for the variance. The expected value will still be $E(Y_i) = \lambda_i$, but the variance will now be $Var(Y_i) = \lambda_i \phi$. The parameter $\phi$ can be estimated with the Pearson's residuals.

$\hat \phi = \frac{\sum \big( (y_i - \hat \lambda_i)^2 / \hat \lambda_i \big)}{n - (p + 1)}$

If this estimate $\hat \phi$ is much larger than $1$, there is evidence of overdispersion. A statistically significant chi-squared test would also be evidence of overdispersion, and a rule of thumb is that overdispersion is present if the residual deviance is significantly greater than the degrees of freedom.

The presence of overdispersion does not impact the validity of maximum likelihood estimation for fitting the model. However, the estimated variances of the coefficients can be severely underestimated, which can result in statistically insignificant predictors being treated as significant. To adapt to overdispersion, use quasilikelihood methods (which scale the standard errors by $\sqrt{\hat \phi}$) or a negative binomial model.
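A minimal sketch of both remedies, assuming the same hypothetical `response`, `predictors`, and `data` names used in the earlier examples:

```R
# Quasi-Poisson: same coefficient estimates as the Poisson fit, but standard
# errors are scaled by the estimated dispersion
quasi_mod = glm(response ~ predictors, data = data, family = quasipoisson(link="log"))

# Negative binomial: models the extra-Poisson variation with its own parameter
library(MASS)
nb_mod = glm.nb(response ~ predictors, data = data)
```

In the quasi-Poisson fit, the dispersion parameter reported by `summary()` corresponds to $\hat \phi$ above.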