Least squares estimation (LSE) is one technique to find a **best fit line** in a [[linear regression]]. LSE is often alternatively called [[ordinary least squares]] (OLS). The least squares estimator is the best linear unbiased estimator ([[BLUE]]) under the [[Gauss-Markov]] assumptions.

The line of best fit to the data is the line that minimizes the sum of the squared vertical distances (the [[residual sum of squares]]) between the line and the observed points:

$\underset{\beta_0, \beta_1}{\arg\min} \sum_{i=1}^n \Big (y_i - [\beta_0 + \beta_1 x_i] \Big )^2$

This line is called the **least squares line** in LSE. Written in matrix-vector form,

$\hat \beta = (X^T X)^{-1} X^T Y$

The `lm` function in [[R]] uses a numerical method, the QR factorization, to compute this quantity because calculating the inverse of a matrix is expensive.

In higher dimensions, the surface of best fit to the data is the surface that minimizes the sum of the squared vertical distances between the surface and the observed points:

$\underset{\vec \beta}{\arg\min} \ ||\vec Y - X \vec \beta||^2 \quad \text{where} \quad ||\vec Y - X \vec \beta||^2 = (\vec Y - X \vec \beta)^T (\vec Y - X \vec \beta)$

The problem is to find a $\vec \beta$ so that $X \vec \beta$ is as close as possible to $\vec Y$ for a [[design matrix]] $X$, without [[overfitting]] by modeling the noise in the error term. An over-determined linear system does not have an exact solution. An over-determined linear system is one with more observations than predictor variables, or more rows in the matrix than columns, which is to say the matrix is tall rather than wide. Another way to say this is that there are more sample points than parameters in the model.

Under the assumption of normally distributed errors, the least squares estimator is equivalent to the [[maximum likelihood estimator]]. As with MLE, the observed response and predictor values are treated as fixed while the parameters vary. To demonstrate, first write the marginal pdf under the assumption of normally distributed errors $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$:

$f(y_i; \vec \beta) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \Big [-\frac{1}{2 \sigma^2}(y_i - \mu_i)^2 \Big ]$

where $\mu_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$. The joint pdf is

$f(\vec y; \vec \beta) = (2 \pi \sigma^2)^{-n/2} \exp \Big [ -\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu_i)^2 \Big]$

The log-likelihood is

$l(\vec \beta) = -\frac{n}{2} \ln(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu_i)^2$

Note that we can ignore the first term, as it is constant with respect to $\beta$. The second term is simply a negative constant times the [[residual sum of squares]], where $RSS = \sum_{i=1}^n (y_i - \mu_i)^2$. Maximizing the negative of the RSS is the same as minimizing the RSS, thus the least squares estimator, which minimizes the RSS, is the same as the maximum likelihood estimator for $\beta$.
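A minimal sketch in R (with simulated data; the variable names and true coefficients are illustrative, not from the source) comparing the closed-form estimator $(X^T X)^{-1} X^T Y$, a QR-based least squares solve like the one `lm` relies on, and a direct numerical minimization of the RSS, which under normal errors is the MLE:

```r
# Minimal sketch with simulated data: compare the closed-form normal-equations
# estimator, a QR-based least squares solve, and a direct numerical
# minimization of the RSS (equivalent to maximizing the normal likelihood).
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)   # assumed true beta0 = 2, beta1 = 3
X <- cbind(1, x)                      # design matrix with an intercept column

# Closed-form solution from the normal equations: (X^T X)^{-1} X^T Y
beta_normal <- solve(t(X) %*% X) %*% t(X) %*% y

# QR-based least squares solution (avoids forming the explicit inverse)
beta_qr <- qr.solve(X, y)

# Numerically minimize the RSS; under normal errors this is the MLE for beta
rss <- function(beta) sum((y - X %*% beta)^2)
beta_rss <- optim(c(0, 0), rss)$par

rbind(normal_eq = as.vector(beta_normal),
      qr        = as.vector(beta_qr),
      rss_optim = beta_rss,
      lm        = unname(coef(lm(y ~ x))))
```

All four rows should agree (the `optim` row only up to its convergence tolerance), illustrating both the QR point and the equivalence between minimizing the RSS and maximizing the normal likelihood.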
## assumptions
- $E(\epsilon_i) = 0$ for all $i=1, \dots, n$
- $E(Y_i) = \beta_0 + \beta_1X_{i,1} + \dots + \beta_p X_{i,p}$ for all $i=1, \dots, n$
- $\displaystyle Cov(\epsilon_i, \epsilon_j) = \begin{cases} 0 & : i \ne j \\ \sigma^2 & : i = j \end{cases}$
- $(X^T X)^{-1}$ exists (if not, this is the problem of [[non-identifiability]])
- $Y_i \overset{iid}{\sim} N(\vec x_i^T \vec \beta, \sigma^2)$

## definitions
[[residuals]]
[[fitted values]]
[[hat matrix]]
[[explained sum of squares]]
[[total sum of squares]]
[[coefficient of determination]]
[[hypothesis test for individual regression parameters]]

## derivation
Least squares estimation requires finding the minimum of the [[residual sum of squares]]. Recall that the RSS can be written in matrix-vector notation and expanded as

$\begin{align}RSS &= (Y - X \beta)^T(Y - X \beta) \\ &= (Y^T - \beta^T X^T)(Y - X \beta) \\ &= Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta \end{align}$

Note that the dimensions of the factors in $Y^T X \beta$ are $1 \times n$, $n \times (p + 1)$, and $(p + 1) \times 1$, which means that the product is a $1 \times 1$ scalar. Since the transpose of a scalar is the scalar again, and $(Y^T X \beta)^T = \beta^T X^T Y$, the second and third terms above can be combined:

$RSS = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta$

To find the minimum, differentiate with respect to $\beta$ and set the result equal to 0 (see [[linear algebra for least squares estimation]] for the lemmas used to find this partial derivative):

$\begin{align}\frac{\partial RSS}{\partial \beta} = 0 - 2X^T Y + 2 X^T X \beta &\overset{set}{=} 0 \\ 2 X^T X \beta &= 2X^T Y \\ X^T X \beta &= X^T Y \end{align}$

If we assume that the inverse of $X^TX$ exists, we can multiply both sides by it to get $\beta$ by itself:

$\hat \beta = (X^T X)^{-1} X^T Y$

See [[example deriving the least squares estimator for simple linear regression]] for an example of deriving this without linear algebra.

## for simple linear regression
The minimizers of the least squares criterion in the case of simple [[base/Linear Regression/linear regression|linear regression]] are:

$\hat \beta_0 = \bar y - \hat \beta_1 \bar x$

$\hat \beta_1 = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sum(x_i - \bar x)^2}$
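A minimal sketch in R (again with simulated data and illustrative names) checking these closed-form simple linear regression estimators against `lm`:

```r
# Minimal sketch with simulated data: the closed-form simple linear regression
# estimators compared against lm().
set.seed(42)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)            # assumed true beta0 = 1, beta1 = 2

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))                       # should match the closed-form estimates
```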