Multicollinearity occurs when two or more predictor variables capture largely the same information. When multicollinearity is present, the standard errors of the coefficient estimates $\hat \beta$ can become inflated, making individual coefficients unstable and hard to interpret. For collinearity between two variables, [[correlation coefficient]]s can be used for detection; correlations above roughly 0.7 are commonly considered problematic. However, three or more variables can be jointly collinear even when no single pair shows high correlation.

Correlation between variables may arise from redundant information, underlying (confounding) factors, or simply natural association between the quantities. Perfect multicollinearity can occur when two variables are the same measurement in different units. For example, a column of weight in pounds and a column of weight in kilograms will be perfectly correlated, introducing multicollinearity into the model.

To diagnose multicollinearity:

1. Examine the correlation matrix of the predictors
2. Examine the signs and standard errors of $\hat \beta$
3. Compute and examine [[variance inflation factor]]s (especially when multicollinearity may involve three or more variables)
4. Compute the [[condition number]] of the design matrix

To address multicollinearity, drop one of the offending variables (which shouldn't greatly affect model accuracy, since its information is largely contained in the remaining variables) or combine the correlated variables into a single variable, for example by averaging standardized versions of each.

In [[R]], `corrplot` can be used to visualize pairwise relationships and check for multicollinearity between two variables (note this may not detect multicollinearity among three or more variables).

```R
library(corrplot)
corrplot(cor(df))
```

Another approach is `pairs`, which plots the pairwise scatterplots of all variables in a data frame.

```R
pairs(df)
```
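
For the VIF and condition-number diagnostics, and for the remedy of averaging standardized variables, a minimal sketch is below. It assumes a data frame `df` with a response `y` and two hypothetical, highly correlated predictors `x1` and `x2`; `vif()` comes from the `car` package, while `kappa()`, `model.matrix()`, and `scale()` are base R.

```R
library(car)  # provides vif()

# Fit the full model
fit <- lm(y ~ ., data = df)

# Variance inflation factors: values above roughly 5-10 are
# commonly taken to indicate problematic multicollinearity
vif(fit)

# Condition number of the design matrix; large values
# indicate near-linear dependence among its columns
kappa(model.matrix(fit), exact = TRUE)

# One remedy: replace the two correlated predictors with the
# average of their standardized versions, then refit without them
df$x12 <- rowMeans(cbind(scale(df$x1), scale(df$x2)))
fit2 <- lm(y ~ . - x1 - x2, data = df)
```

Note that `vif()` typically refuses to run when predictors are perfectly collinear (aliased coefficients), so a case like the pounds/kilograms columns above has to be resolved before this diagnostic can even be computed.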