The `lm` function performs [[base/Linear Regression/linear regression|linear regression]] in [[R]]. Pass a [[tibble]] (or similar data format) as `data` and specify the model as $Y \sim x_1 + x_2 + \dots + x_p$.

```R
lm_marketing = lm(sales ~ facebook + youtube, data = marketing)
```

## summary

Use `summary` to print a summary of the model and key statistics.

```R
summary(lm_marketing)
```

The output will look like the example below.

```R
Call:
lm(formula = cost ~ pup.tch.ratio + avg.salary, data = school.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8290  -5.2752  -0.8332   3.8253  19.6986 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   120.23756   17.73230   6.781 3.79e-08 ***
pup.tch.ratio  -2.82585    0.37714  -7.493 3.90e-09 ***
avg.salary      0.24061    0.08396   2.866   0.0066 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.721 on 40 degrees of freedom
Multiple R-squared:  0.6372,    Adjusted R-squared:  0.6191 
F-statistic: 35.13 on 2 and 40 DF,  p-value: 1.559e-09
```

Access elements of the summary like

```R
summary(lm_marketing)$r.squared
```

Use `?summary.lm` to discover all elements accessible from the summary object.

## anova

Use `anova` to calculate [[residual sum of squares|RSS]], [[explained sum of squares|ESS]], and [[total sum of squares|TSS]].

```R
anova(lm_marketing)
```

## prediction

Use `predict` to generate predictions from the fitted model.
```R
predict(lm_marketing)
```

## manual calculations

Common summary metrics

```R
b = coef(lm_marketing)

ess = sum((fitted(lm_marketing) - mean(marketing$sales))**2)
rss = sum(residuals(lm_marketing)**2)
tss = with(marketing, sum((sales - mean(sales))**2))

r2 = 1 - rss/tss

n = nrow(marketing)   # number of observations
p = length(b) - 1     # number of predictors (excluding intercept)
dof = n - (p + 1)     # residual degrees of freedom
var = rss/dof         # residual variance, RSS/(n - p - 1)
sigma = sqrt(var)     # residual standard error

predicted_values <- predict(lm_marketing)
resids <- marketing$sales - predicted_values
mse <- mean(resids**2)
```

Prediction intervals

```R
index = sample(seq_len(nrow(test)), size=5)
star = test[index,]

alpha <- 0.05
b = coef(lm_marketing)
dof = length(train$sales) - length(b)

X = model.matrix(lm_marketing)            # design matrix
xstar_matrix = data.matrix(star[1:3])     # predictors from subset of test set
y_star_hat = cbind(1, xstar_matrix) %*% b # predicted values

# for first row
l = y_star_hat[1] - qt(1-alpha/2, dof) * sqrt(var * (cbind(1, t(xstar_matrix[1, ])) %*% solve(t(X) %*% X) %*% t(cbind(1, t(xstar_matrix[1, ]))) + 1))
u = y_star_hat[1] + qt(1-alpha/2, dof) * sqrt(var * (cbind(1, t(xstar_matrix[1, ])) %*% solve(t(X) %*% X) %*% t(cbind(1, t(xstar_matrix[1, ]))) + 1))
pred_interval = round(c(l, u), 2)
```

## Plotting

```R
predictions <- predict(lm_octo, newdata=octo.data.reduced, interval="prediction")
plot_df <- cbind(octo.data.reduced, predictions)

octo.plot <- ggplot(data=plot_df) +
  aes(x=age, y=weight) +
  geom_point() +
  geom_line(aes(y=lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y=upr), color = "red", linetype = "dashed") +
  geom_smooth(method=lm, se=TRUE, formula=y~x) +
  labs(title="Octopi Growth", x="Age (days)", y="Weight (grams)")
octo.plot
```

> [!Tip]- Additional Resources
> - [Confidence and Prediction Intervals for Linear Regression](https://rpubs.com/Bio-Geek/71339) (RPubs)
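As a sanity check on the manual prediction-interval formula, `predict(..., interval="prediction")` should reproduce the same bounds. A minimal sketch using the built-in `mtcars` data (a stand-in, since the `marketing`/`test` data above aren't included here):

```R
# Fit a small model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Built-in prediction interval for the first observation
auto <- predict(fit, newdata = mtcars[1, ], interval = "prediction", level = 0.95)

# Manual computation, mirroring the formula used in the note
X    <- model.matrix(fit)
b    <- coef(fit)
dof  <- df.residual(fit)                 # n - p - 1
s2   <- sum(residuals(fit)^2) / dof      # residual variance
x0   <- X[1, , drop = FALSE]             # first row, including the intercept column
yhat <- drop(x0 %*% b)
half <- qt(1 - 0.05/2, dof) * sqrt(s2 * (x0 %*% solve(t(X) %*% X) %*% t(x0) + 1))
manual <- c(yhat - half, yhat + half)

all.equal(unname(c(auto[, "lwr"], auto[, "upr"])), unname(manual))
```

The two approaches agree up to numerical tolerance; in practice `predict` with `interval=` is the less error-prone route.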