Understanding Multicollinearity: Definitions and Examples

By Indeed Editorial Team

Published May 14, 2022

The Indeed Editorial Team comprises a diverse and talented team of writers, researchers and subject matter experts equipped with Indeed's data and insights to deliver useful tips to help guide your career journey.

Multicollinearity is a statistical concept that relates to how independent variables correlate with one another in a multiple regression model. Many statisticians and analysts notice this phenomenon when examining the patterns and variance in the datasets they use to model real-life behaviours, such as the relationship between age and education level. Learning what these correlations mean and how they can affect a data sample can help you improve your analyses and derive conclusions from your dataset. In this article, we explain what multicollinearity is, explore the different types, discuss their causes, and share some examples.

Related: How to Find Critical Value in Two Steps (With Examples)

What is multicollinearity?

Multicollinearity refers to the occurrence of high intercorrelations between two or more independent variables in a regression model. Intercorrelation means that as one independent variable changes value, the other independent variables also change in a specific direction. For example, a data sample with a positive correlation shows that as one variable increases in value, the other variables also increase in value. Multiple correlations can also occur when you derive one independent variable from other variables in your data sample or when two or more independent variables provide similar or repetitive results.

Depending on how strongly the independent variables correlate, multicollinearity may pose a challenge for your analysis. For example, it may lead to skewed or misleading results. High intercorrelation between variables that seem independent of each other makes it harder to understand the behaviour and patterns of the dependent variable during statistical analysis. It can also make your results less accurate because the estimates of how each independent variable affects your overall regression model become less reliable.
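To make this concrete, here is a minimal Python sketch (not part of the original article) that builds two intercorrelated independent variables and inspects their pairwise correlations. The variable names, sample size, and data are illustrative assumptions, and a correlation matrix is just one simple way to spot intercorrelation:

```python
import numpy as np
import pandas as pd

# Hypothetical example: two independent variables that move together.
rng = np.random.default_rng(0)
age = rng.uniform(20, 65, size=200)                       # age in years
education_years = 0.2 * age + rng.normal(0, 1, size=200)  # built to correlate with age

X = pd.DataFrame({"age": age, "education_years": education_years})

# A pairwise correlation matrix is a simple first check for intercorrelation.
# Values near +1 or -1 between two independent variables suggest multicollinearity.
print(X.corr().round(2))
```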

Related: How to Calculate the Median of a Data Set in Statistics

Types of multicollinearity

There are four types of multicollinearity that you may find in your data sample when conducting statistical analysis. The four types are:

Perfect multicollinearity

Perfect multicollinearity means two or more independent variables in a regression model have a perfectly predictable linear relationship. On a graph, you may find the data points of a perfectly collinear model create a straight line. For example, including dummy variables for every group in your data can create a perfectly collinear relationship between variables. If your data has a perfect correlation, it may be challenging for you to make structural inferences about the original model using your sample data. This usually means your regression coefficients, which describe how each independent variable relates to the dependent variable, are indeterminate and have infinite standard errors.

It's rare for a data sample to show perfect correlation. You can typically prevent this by removing one or more highly correlated independent variables. Most statistical programs remove one or more variables from your model when they detect perfect multicollinearity before providing the estimation results. There are a few ways you can avoid perfect multicollinearity. One way is to choose carefully which variables you include as independent variables in your analysis. You can also combine correlated independent variables into a single variable or perform an analysis designed for highly correlated variables.
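As a rough illustration of why perfect multicollinearity makes coefficients indeterminate, and how removing a redundant variable resolves it, here is a small Python sketch. The data and the rank check are illustrative assumptions, not a prescribed procedure:

```python
import numpy as np

# Hypothetical predictors where the third column is an exact linear
# combination of the first two, so the set is perfectly collinear.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
x3 = 2.0 * x1 + 3.0 * x2  # derived entirely from the other predictors

X = np.column_stack([x1, x2, x3])

# If the rank is lower than the number of columns, at least one predictor
# is perfectly predictable from the others and its coefficient is indeterminate.
print(np.linalg.matrix_rank(X), "independent columns out of", X.shape[1])

# Dropping the redundant column, as described above, restores full rank.
X_reduced = X[:, :2]
print(np.linalg.matrix_rank(X_reduced), "independent columns out of", X_reduced.shape[1])
```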

High multicollinearity

High multicollinearity means there's a strong, but not perfect, linear relationship between two or more independent variables. This phenomenon is more common than perfect correlation. The difference between high and perfect correlation is that, with high correlation, some variation in each independent variable remains unexplained by variation in the other independent variables. A high correlation between variables can occur if you use lagged variables, variables that express a common time trend, or variables that capture similar or the same behaviour.

As with perfectly collinear models, you typically want to avoid high correlations. They can cause large standard errors, coefficients that are sensitive to changes in specification, and coefficient estimates with extreme and unreliable signs and magnitudes. They may also leave you unsure which independent variable explains the variation and patterns in your regression model. You can resolve high correlation issues by removing some of the highly correlated independent variables or by using a partial least squares regression, which reduces the independent variables to a smaller set of uncorrelated components.
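Here is a hedged sketch of the "remove some of the highly correlated variables" approach, assuming your predictors sit in a pandas DataFrame; the helper name and the 0.8 correlation cut-off are illustrative assumptions rather than standard values:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop one column from each pair whose absolute correlation exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Hypothetical data: x2 is nearly a copy of x1, while x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(scale=0.1, size=100),
                  "x3": rng.normal(size=100)})

print(drop_highly_correlated(X).columns.tolist())  # expected: ['x1', 'x3']
```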

Data-based multicollinearity

Issues with experimental design often lead to data-based multicollinearity between variables. Other causes of this phenomenon include reliance on purely observational data or challenges in manipulating the environment where you collect the data. Poorly designed experiments may create skewed or misleading data. This can make it challenging to calculate accurate estimators or understand how the dependent variable behaves when you change independent factors. You can reduce the collinearity by removing one or more of the offending predictors from your model. Another way to reduce data-based collinearity is to conduct additional experiments under other observational conditions to collect more data.

Structural multicollinearity

Structural multicollinearity occurs when the analyst or researcher creates new independent variables and adds them to the regression model. For example, you may use mathematical operations, such as squaring a variable, to create new variables from the existing independent variables in your dataset. The new variables correlate with the variables from which you derived them. If you have structural multicollinearity in your equation, you can centre the predictors to reduce it in your regression. Centring a predictor means subtracting its mean from each of its values before you create the new variable.
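The following Python sketch illustrates centring with a hypothetical predictor and a squared term derived from it. The numbers are made up, but they show how centring before creating the new variable reduces the correlation:

```python
import numpy as np
import pandas as pd

# Hypothetical predictor and a squared term derived from it
# (a structural source of multicollinearity).
rng = np.random.default_rng(2)
x = rng.uniform(10, 20, size=200)

raw = pd.DataFrame({"x": x, "x_squared": x ** 2})
print(round(raw.corr().loc["x", "x_squared"], 3))      # very close to 1

# Centring: subtract the predictor's mean before creating the new variable.
x_centred = x - x.mean()
centred = pd.DataFrame({"x": x_centred, "x_squared": x_centred ** 2})
print(round(centred.corr().loc["x", "x_squared"], 3))  # much closer to 0
```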

Related: What is Quantitative Analysis?

Causes for multicollinearity

There are many reasons your regression model and equation may show multiple correlations. Here are some common causes analysts or statisticians can encounter:

Insufficient data

It's sometimes challenging for analysts or statisticians to work with small data samples, especially when the sample represents a large population. You can resolve this issue by designing additional experiments to collect more data points for your model. It's important to design your experiments well so you can collect reliable information in which correlation between independent variables is low or at an easily manageable level.

Dummy variables

Dummy variables are numerical variables that allow you to use one regression equation for multiple groups or categories in your data sample. They represent different treatment groups or qualitative characteristics, such as gender, race, favourite colour, or level of education. If you exclude some subgroups or add a dummy variable for every category in a subgroup, you may encounter a high or perfect correlation in your regression model.
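Here is a brief Python sketch of this issue, using pandas and a hypothetical education column. Dropping one category is one common way to avoid the perfect correlation described above:

```python
import pandas as pd

# Hypothetical categorical column with three education levels.
df = pd.DataFrame({"education": ["high school", "college", "university",
                                 "college", "high school"]})

# One dummy per category: the dummy columns always sum to 1 across a row,
# which is perfectly collinear with a regression intercept.
all_dummies = pd.get_dummies(df["education"])
print(all_dummies.sum(axis=1).unique())  # every row sums to 1

# Dropping one category keeps the same information and avoids the problem.
safe_dummies = pd.get_dummies(df["education"], drop_first=True)
print(safe_dummies.columns.tolist())
```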

Related: Types of Variables in Statistics and Research (With FAQs)

Variables that are a combination of other variables

You encounter multicollinearity when one independent variable is formed from one or more other variables used in the regression. For example, your total annual income derives from adding your employment income to the income from your investment accounts. As the variables corresponding to your employment and investment income change, your total annual income most likely changes in the same direction.
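A short Python sketch of the income example above, with made-up figures, shows how a variable derived from two others correlates strongly with its components:

```python
import numpy as np
import pandas as pd

# Hypothetical annual incomes in dollars.
rng = np.random.default_rng(3)
employment_income = rng.normal(60_000, 10_000, size=100)
investment_income = rng.normal(5_000, 2_000, size=100)

X = pd.DataFrame({
    "employment_income": employment_income,
    "investment_income": investment_income,
    # Derived by adding the other two, so it moves with them by construction.
    "total_income": employment_income + investment_income,
})

# total_income correlates very strongly with its components.
print(X.corr().round(2))
```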

Similar variables

You may include variables that are very similar or nearly identical. For example, your equation may use distance in kilometres, metres, and centimetres. Near-identical variables track the same values in your data sample, so you may encounter high degrees of correlation. It's important to eliminate these repetitions so you can conduct your analysis correctly.

Repetitive variables

You may accidentally include the same variable multiple times in your regression model or equation. Repeating a variable reduces the reliability of your model because identical variables behave in exactly the same way when the values in your equation change. You can remove the multicollinearity by eliminating the repeated variables so that each relevant variable appears only once.

Examples of multicollinearity

Here are two scenarios where multiple correlations may exist in your regression model:

Investing

Multicollinearity is a common concern for investors performing technical analysis to predict the future movements of a financial product, such as a stock or a bond. Using similar variables or indicators generates highly correlated results, making it hard to predict how the dependent variable responds to price movements. For example, including two or more technical indicators of the same type may cause them to make similar predictions in your model. To resolve issues of multiple correlations, you may analyze the model using different types of indicators, such as a momentum indicator and a volatility indicator. You may also remove similar indicators or combine them.

Biology

Biologists often collect data samples and conduct regression analysis where high correlations can skew the results. For example, illnesses may correlate with age, gender, weight, stress, exercise, diet, and blood pressure. When you consider the variables you included in your equation, you may find a case of multicollinearity because many of the variables relate to each other. Your equation may contain a variable that explains weight, but a person's weight also correlates with their diet and exercise habits. It's important to remove closely related variables, combine them, or include additional variables that provide independent variance.
