Linear regression is used to explain or model the relationship between a single variable $Y$ and one or more variables $x_1, x_2, \dots, x_p$. $Y$ is called the response, outcome, output, or dependent variable. The $x_j$'s are called predictors, inputs, independent variables, explanatory variables, covariates, or features.

We can leverage the basic idea of [[statistical inference]] by considering that our response variable $Y$ is normally distributed with some mean $\mu$ that is a linear function of the vectors $\vec X$ and $\vec \beta$:

$\vec Y \overset{iid}{\sim} N \Big( \mu(\vec X; \vec \beta), \sigma^2 \Big)$

Regression analysis has two main objectives:

- [[prediction]]: predict an unmeasured or unseen $Y$ using the observed $x_1, x_2, \dots, x_p$.
- [[explanation]]: assess the effect of, or explain the relationship between, $Y$ and $x_1, x_2, \dots, x_p$.

Simple linear regression refers to a model with a single predictor variable. Multiple linear regression refers to a model with multiple predictor variables. The [[least squares estimator]] provides a common technique for fitting the linear regression model.

> [!NOTE]
> **Predictors** are the measured variables that we are interested in using to predict some output. **Parameters** are the coefficients we estimate: one for each predictor plus the intercept. Thus, a linear model has $p + 1$ parameters, where $p$ is the number of predictors.

## assumptions

To use linear regression, four assumptions are required. These are listed in order of importance.

- **Linearity**: there is a linear relationship between the response and the model parameters. Predictor and/or response variables may be transformed to meet this assumption, but take care in interpreting the results.
- **Independence**: the observations of the response variable are [[independent]] (this will also be true of their error terms). In other words, the errors are uncorrelated.
- **Homoskedasticity** (constant variance): the variance of $Y_i$ does not depend on $i$; each observation of the response carries roughly the same information; the variance of each $Y_i$ is the same, $Var(Y_i) = Var(\epsilon_i) = \sigma^2$.
- **Normality**: the response variable is normally distributed, $Y_i \sim N(\beta_0 + \beta_1 X_{i,1} + \dots + \beta_p X_{i,p}, \sigma^2)$. This is equivalent to saying $\epsilon_i \sim N(0, \sigma^2)$.

Note that while the response variable should be normally distributed (although some regression analyses are fairly robust to deviations from this assumption), the predictor variables do not need to be normally distributed.

## when to use linear regression

Linear regression can be used even when the relationship between the response and predictor variables is not linear, provided there is some transformation that establishes a linear relationship. For example, $Y_i = \exp(\beta_0 + \beta_1 X_i)$ can be transformed by taking the $\ln$ of both sides to get $\ln(Y_i) = \beta_0 + \beta_1 X_i$, for which a linear model can be fit (see the sketch at the end of this section).

When a [[mechanistic]] understanding indicates a linear relationship, linear regression is a good choice. For example, Hooke's law states that the force of a spring is equal to a spring constant times the displacement of the spring, $F = kx$. Thus, we can use linear regression to estimate the spring constant $k$ from measurements of a spring.

If past studies suggest that the relationship is linear, linear regression may be a good choice. Finally, exploration of the dataset can be used to discover whether the relationship might be linear. However, this can be tricky. In our data-scarce recent past, data were collected after hypotheses were constructed, in a controlled way, to address research questions. Now, with abundant data, we can explore the data to generate hypotheses. However, it's important to avoid [[circular analysis]] after data exploration.
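Below is a minimal sketch of the transformation idea above, using simulated data; the variable names and the multiplicative noise model are assumptions for illustration, not part of the original example.

```R
# minimal sketch: simulate Y = exp(beta0 + beta1 * x) with multiplicative noise
# (an assumption for illustration), then fit a linear model on the log scale
set.seed(1)
x <- runif(50, 0, 2)
y <- exp(1 + 0.5 * x) * exp(rnorm(50, sd = 0.1))

fit <- lm(log(y) ~ x)  # linear in the parameters after the log transform
summary(fit)           # estimates of beta0 and beta1 on the log scale
```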
## diagnostic methods

Diagnostics for linear regression involve checking that the statistical modeling assumptions hold. There are two kinds of diagnostics: graphical and numerical. Graphical methods are often preferred. In [[R]], use `plot(lm_model)` to get four diagnostic plots for any linear model: residuals vs fitted, QQ plot, scale-location, and residuals vs leverage.

### checking linearity

For the linearity assumption to hold, the effect of each individual predictor on the response should be linear, and the effect of each additional predictor on the response should be additive. If this is not the case,

- The [[least squares estimator|LSE]] will be biased
- The model will have low explanatory power
- The inferences (e.g., t-tests, F-test) based on the mis-specified regression model will be biased and misleading
- The predictions from the model are less likely to be accurate

A scatter plot of the fitted versus observed points should follow the line $y=x$. #diagram

A scatter plot of the fitted values versus the residuals should show no relationship. #diagram

Note that the mean of the residuals is always $0$, so this cannot be evidence that the linearity assumption holds. You're looking for patterns (often cyclical) in the data.

### checking independence

Violations of independence are common for spatial or time-series data. Plot the residuals according to any inherent order you suspect in the data (e.g., dates if the data are dated). Again, this should show no relationship. #diagram

A successive residual plot compares the residual at $x_{i+1}$ to the residual at $x_i$. This should show no relationship. Again, the order of the dataset matters: a trend in an ordered dataset may disappear if the dataset is randomized. Further, if the dependence is something more complicated (e.g., $x_{i+2}$ and $x_i$ are somehow correlated), this plot will miss it.

### Durbin-Watson test

The Durbin-Watson test is a formal hypothesis test for independence in linear regression. It models the errors as $\epsilon_i = \rho \epsilon_{i-1} + \gamma_i$, where $\gamma_i$ is a normally distributed error term. The hypotheses are $H_0: \rho=0$ and $H_1: \rho \ne 0$. The test statistic is

$d = \frac{\sum_{i=2}^n(\hat \epsilon_i - \hat \epsilon_{i-1})^2}{\sum_{i=1}^n \hat \epsilon_i^2} \approx 2(1-\hat \rho)$

If the test statistic is substantially less than 2, it's possible there is a violation of the independence assumption.

Additionally, one should critically analyze the [[concept validity]] of the model. This of course requires critical thinking and cannot be easily demonstrated by numerical or graphical techniques.
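The `lmtest` package in [[R]] provides `dwtest()` for running this test on a fitted model; a minimal sketch, assuming a hypothetical model of `y` on `x`:

```R
# minimal sketch, assuming hypothetical variables y and x in a data frame `data`
library(lmtest)

fit <- lm(y ~ x, data = data)
dwtest(fit, alternative = "two.sided")  # H0: rho = 0; d near 2 is consistent with independence
```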
### Generalized least squares

Generalized least squares (GLS) can be used to fit a line of best fit when the errors of the model are not [[independent and identically distributed]]. GLS assumes that the error terms have a known covariance matrix that is not diagonal. By taking the covariance structure of the errors into account, GLS can provide more efficient and accurate estimates of the regression parameters than ordinary least squares (OLS), which assumes that the error terms are independent and have constant variance.

GLS involves transforming the original regression model into an equivalent form that has uncorrelated errors with constant variance. This is achieved by pre-multiplying both the dependent variable and the independent variables by a "whitening matrix", the inverse of the square root of the covariance matrix of the errors. After the transformation, the new regression model can be estimated using OLS, and the resulting estimates can be transformed back to the original scale using the inverse of the whitening matrix.

GLS is a powerful and flexible technique that can handle a wide range of correlation structures in the errors, including serial correlation, heteroskedasticity, and spatial dependence. However, GLS requires knowledge of the true covariance structure of the errors, which may not be available in practice and must be estimated from the data.

The `nlme` package in [[R]] has a function to perform GLS:

```R
library(nlme)

# GLS fit with an AR(1) correlation structure ordered by year
glmod <- gls(y ~ x, correlation = corAR1(form = ~ year), data = data)

# confidence interval for the estimated correlation parameter (phi)
glmod_intervals <- intervals(glmod, which = "var-cov")

summary(glmod)
```

> [!Tip]- Additional Resources
> [Andrew Rothman - Generalized Least Squares (GLS): Relations to OLS & WLS](https://towardsdatascience.com/generalized-least-squares-gls-mathematical-derivations-intuition-2b7466832c2c)

### Checking for homoskedasticity

The variance of the error terms should be constant. However, even under this assumption the [[residuals]] do not have exactly constant variance. If the spread of the residuals vs fitted plot is not consistent across fitted values, there is evidence of heteroskedasticity. Solutions include transforming the response (e.g., a log transformation), adding a predictor that might be missing, or using weighted/generalized least squares.

### Checking for normality

In linear regression, we use the normality assumption on the error terms for inference, namely calculating confidence intervals and hypothesis testing. A [[QQ plot]] of the [[residuals]] can be used to check for normality. #diagram

Residual versus fitted plots should show no relationship or pattern.

The Shapiro-Wilk test can be used to check for normality of the error terms in a linear regression: run it on the [[residuals]] of the model. If normality is violated, transformations or generalized linear models can be used.

## Shapiro-Wilk normality test

The Shapiro-Wilk test can be used to check whether a dataset is normally distributed. With large datasets the test becomes very sensitive and will flag even small departures from normality. In general, we prefer plot methods like the [[QQ plot]] over formal hypothesis testing to check for normality. In [[R]], use `shapiro.test` to run a Shapiro-Wilk test. The test reports a test statistic and a p-value.

```R
# `data` should be a numeric vector; the null hypothesis is normality
shapiro.test(data)
```
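To combine the graphical and formal checks, here is a minimal sketch assuming a fitted `lm` object named `fit` (a hypothetical name):

```R
# minimal sketch, assuming `fit` is a fitted lm object (hypothetical)
res <- resid(fit)

# graphical check: points should fall close to the reference line
qqnorm(res)
qqline(res)

# formal check: a small p-value suggests the residuals deviate from normality
shapiro.test(res)
```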
## formalization

Let $\vec Y = (Y_1, \dots, Y_n)^T$ be the response variable and

$\vec x_1 = \begin{pmatrix} x_{1,1} \\ x_{2,1} \\ \vdots \\ x_{n,1} \end{pmatrix}, \ \vec x_2 = \begin{pmatrix} x_{1,2} \\ x_{2,2} \\ \vdots \\ x_{n,2} \end{pmatrix}, \ \dots, \ \vec x_p = \begin{pmatrix} x_{1,p} \\ x_{2,p} \\ \vdots \\ x_{n,p} \end{pmatrix}$

be the predictors. We collect the predictors in a matrix $X = (\vec 1 \ \vec x_1 \ \vec x_2 \ \dots \ \vec x_p)$, where $\vec 1 = (1, 1, \dots, 1)^T$. Let $\vec \beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ be a vector of parameters. Finally, let $\vec \epsilon = (\epsilon_1, \dots, \epsilon_n)^T$ be a vector of error terms.

We can write a linear model as $Y_i = \beta_0 + \beta_1 X_{i,1} + \dots + \beta_p X_{i,p} + \epsilon_i$ for all $i=1, \dots, n$. Note that this implies $n$ such equations to fully specify our model. Instead of writing $n$ equations, a more compact notation is **matrix-vector form**:

$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,p} \\ 1 & x_{2,1} & \dots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,p} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$

An even more compact notation is simply $\vec Y = X \vec \beta + \vec \epsilon$.

> [!NOTE] Notation
> In linear algebra, a [[vector]] is typically written vertically, as a column. For ease of notation, we will sometimes instead write the vector in line and use the superscript $^T$ to indicate that it is transposed.

A good practice in linear algebra is to confirm that the dimensions of the matrices line up for the intended operation. $X$ has dimensions $n \times (p+1)$ and $\vec \beta$ has dimensions $(p+1) \times 1$. When we multiply matrices, the inner dimensions must match, and the result has the outer dimensions. Thus $X \vec \beta$ has dimensions $n \times 1$. The vector $\vec \epsilon$ also has dimensions $n \times 1$, so $X \vec \beta + \vec \epsilon$ has dimensions $n \times 1$, which are also the dimensions of $\vec Y$. The dimensions all work out.

## interpreting linear regression models

$\beta_0$ is the average value of $Y$ when the predictor $x$ (or all of the $x_j$'s) is zero. This is usually called the "baseline average". $\beta_i$ is the average change in $Y$ associated with a one-unit increase in the value of $x_i$, assuming all other predictors are held constant. These "slope" parameters are called **partial** or **adjusted** regression parameters or coefficients.

> [!Tip]- Additional Resources
> - [Linear Models with R by Faraway](https://a.co/d/351Qk2T)
> - [Extending Linear Models with R by Faraway](https://a.co/d/8is2EnC)
> - [Regression Analysis by Example by Chatterjee](https://a.co/d/gvdrRad)
> - [Regression and Other Stories](https://a.co/d/402B3b0)

[[model selection]]
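As a concrete check of the matrix-vector form and the coefficient interpretation above, here is a minimal sketch in R with simulated data (all variable names and values are made up for illustration):

```R
# minimal sketch: build Y = X beta + epsilon by hand, then recover beta with lm()
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)

X    <- cbind(1, x1, x2)           # n x (p+1) design matrix, here p = 2
beta <- c(2, 0.5, -1)              # (p+1) x 1 parameter vector (made-up values)
eps  <- rnorm(n, sd = 0.3)         # n x 1 error vector
y    <- drop(X %*% beta + eps)     # n x 1 response, matching the dimension check above

fit <- lm(y ~ x1 + x2)
coef(fit)  # intercept estimates the baseline average; slopes are partial/adjusted coefficients
```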