The `lm` function performs [[base/Linear Regression/linear regression|linear regression]] in [[R]]. Pass a [[tibble]] (or similar data format) as `data` and specify the model as $Y \sim x_1 + x_2 + \dots + x_p$.
```R
lm_marketing = lm(sales ~ facebook + youtube, data = marketing)
```
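Formula syntax goes beyond a plain sum of predictors. A few illustrative variants, fit to the built-in `mtcars` data (a stand-in here, since `marketing` comes from the datarium package):

```R
# Illustrative formula variants on the built-in mtcars data
fit_all <- lm(mpg ~ ., data = mtcars)        # regress mpg on every other column
fit_int <- lm(mpg ~ wt * hp, data = mtcars)  # main effects plus wt:hp interaction
fit_noi <- lm(mpg ~ wt - 1, data = mtcars)   # suppress the intercept
```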
## summary
Use `summary` to print a summary of the model and key statistics.
```R
summary(lm_marketing)
```
The output will look something like this:
```R
Call:
lm(formula = cost ~ pup.tch.ratio + avg.salary, data = school.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8290  -5.2752  -0.8332   3.8253  19.6986 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   120.23756   17.73230   6.781 3.79e-08 ***
pup.tch.ratio  -2.82585    0.37714  -7.493 3.90e-09 ***
avg.salary      0.24061    0.08396   2.866   0.0066 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.721 on 40 degrees of freedom
Multiple R-squared:  0.6372,	Adjusted R-squared:  0.6191 
F-statistic: 35.13 on 2 and 40 DF,  p-value: 1.559e-09
```
Access individual elements of the summary object:
```R
summary(lm_marketing)$r.squared
```
Use `?summary.lm` to discover all elements accessible from the summary object.
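For instance, on a hypothetical model fit to the built-in `mtcars` data:

```R
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model on built-in data
s <- summary(fit)
s$coefficients                  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$coefficients[, "Std. Error"]  # one column of that matrix
s$sigma                         # residual standard error
s$adj.r.squared                 # adjusted R-squared
```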
## anova
Use `anova` to obtain an ANOVA table with a sequential sum of squares for each term and the residual sum of squares, from which [[residual sum of squares|RSS]], [[explained sum of squares|ESS]], and [[total sum of squares|TSS]] can be computed.
```R
anova(lm_marketing)
```
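The returned table can be indexed like a data frame; here is one way to recover the three quantities (again on a stand-in `mtcars` model):

```R
fit <- lm(mpg ~ wt + hp, data = mtcars)                  # stand-in model
tab <- anova(fit)
rss <- tab["Residuals", "Sum Sq"]                        # residual sum of squares
ess <- sum(tab[rownames(tab) != "Residuals", "Sum Sq"])  # explained sum of squares
tss <- ess + rss                                         # total sum of squares
```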
## prediction
Use `predict` to generate predictions from the fitted model.
```R
predict(lm_marketing)
```
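Without `newdata`, `predict` returns fitted values for the training data; pass `newdata` for new observations and `interval=` for confidence or prediction intervals. For example:

```R
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model on built-in data
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predict(fit)                                               # fitted values for training data
predict(fit, newdata = new_cars)                           # point predictions
predict(fit, newdata = new_cars, interval = "confidence")  # CI for the mean response
predict(fit, newdata = new_cars, interval = "prediction")  # PI for a new observation
```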
## manual calculations
Common summary metrics can also be computed by hand:
```R
b = coef(lm_marketing)
ess = sum((fitted(lm_marketing) - mean(marketing$sales))**2)  # explained sum of squares
rss = sum(residuals(lm_marketing)**2)                         # residual sum of squares
tss = with(marketing, sum((sales - mean(sales))**2))          # total sum of squares
r2 = 1 - rss/tss
dof = nrow(marketing) - length(b)  # n - (p + 1); length(b) includes the intercept
var = rss/dof                      # estimated error variance
sigma = sqrt(var)                  # residual standard error
predicted_values <- predict(lm_marketing)
residuals <- marketing$sales - predicted_values
mse <- mean(residuals**2)
```
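A useful sanity check is that the hand calculations agree with what `summary` reports; for example:

```R
# Check hand calculations against summary() on a stand-in mtcars model
fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
all.equal(1 - rss/tss, summary(fit)$r.squared)               # TRUE
all.equal(sqrt(rss / df.residual(fit)), summary(fit)$sigma)  # TRUE
```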
Prediction intervals
```R
# assumes `marketing` has been split into `train` and `test` data frames
# and the model fit on `train`
index = sample(seq_len(nrow(test)), size = 5)
star = test[index, ]
alpha <- 0.05
b = coef(lm_marketing)
dof = length(train$sales) - length(b)
var = sum(residuals(lm_marketing)**2) / dof  # estimated error variance
X = model.matrix(lm_marketing)               # design matrix
xstar_matrix = data.matrix(star[, c("facebook", "youtube")])  # predictors, in coefficient order
y_star_hat = cbind(1, xstar_matrix) %*% b    # predicted values
# for the first row
x1 = cbind(1, t(xstar_matrix[1, ]))          # row vector c(1, x*)
se = sqrt(var * (x1 %*% solve(t(X) %*% X) %*% t(x1) + 1))
l = y_star_hat[1] - qt(1 - alpha/2, dof) * se
u = y_star_hat[1] + qt(1 - alpha/2, dof) * se
pred_interval = round(c(l, u), 2)
```
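The manual interval can be checked against `predict(..., interval = "prediction")`; for example:

```R
# Stand-in model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
xstar <- mtcars[1, c("wt", "hp")]                  # one "new" observation

built_in <- predict(fit, newdata = xstar, interval = "prediction", level = 0.95)

# Manual version of the same interval
X     <- model.matrix(fit)
b     <- coef(fit)
s2    <- sum(residuals(fit)^2) / df.residual(fit)  # error variance estimate
x0    <- c(1, as.numeric(xstar))                   # c(1, x*) in coefficient order
yhat  <- sum(x0 * b)
se    <- sqrt(s2 * (1 + t(x0) %*% solve(t(X) %*% X) %*% x0))
tcrit <- qt(0.975, df.residual(fit))
manual <- c(yhat - tcrit * se, yhat + tcrit * se)  # matches built_in lwr/upr
```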
## Plotting
Plot the data, the fitted line (via `geom_smooth`), and the prediction-interval bounds (this example assumes a separate `lm_octo` model fit to an `octo.data.reduced` data frame):
```R
predictions <- predict(lm_octo, newdata=octo.data.reduced, interval="prediction")
plot_df <- cbind(octo.data.reduced, predictions)
octo.plot <- ggplot(data=plot_df) +
  aes(x=age, y=weight) +
  geom_point() +
  geom_line(aes(y=lwr), color="red", linetype="dashed") +
  geom_line(aes(y=upr), color="red", linetype="dashed") +
  geom_smooth(method=lm, se=TRUE, formula=y~x) +
  labs(title="Octopi Growth", x="Age (days)", y="Weight (grams)")
octo.plot
```
> [!Tip]- Additional Resources
> - [Confidence and Prediction Intervals for Linear Regression](https://rpubs.com/Bio-Geek/71339) (RPubs)