
Multicollinearity in Regression

Regression analysis is a powerful tool for uncovering relationships between variables. But what happens when those variables become too friendly with each other? This is where multicollinearity throws a wrench in the works.
Let’s dive in and try to understand what multicollinearity in regression is.

What is Multicollinearity?

Imagine you’re a detective investigating a robbery. Two witnesses come forward, both claiming they saw a red car speeding away. Sounds helpful, right? Not necessarily. If their descriptions of the car and the culprit are virtually identical, it becomes difficult to determine the unique contribution of each witness. This is analogous to multicollinearity in regression.

Multicollinearity occurs when two or more independent variables (predictors) in your model are highly correlated with each other. This means they share a significant amount of explanatory power. In essence, they’re telling a very similar story, making it challenging to isolate the true effect of each variable on the dependent variable (what you’re trying to predict).
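
A quick way to see this in practice is to look at the pairwise correlations among your predictors before fitting anything. Here is a minimal sketch using pandas; the data and column names (x1, x2, x3) are made up purely for illustration, with x2 deliberately built to track x1.

```python
import pandas as pd

# Hypothetical predictors; x2 is roughly twice x1, so the two move together.
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "x3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

# Pairwise correlations between predictors; values near +1 or -1
# between two predictors are a warning sign of multicollinearity.
print(X.corr().round(2))
```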

Why is Multicollinearity a Potential Problem?

While correlated variables might seem like a good thing (more information!), multicollinearity can wreak havoc on your regression analysis:

  • Unreliable Coefficients: When variables are highly correlated, it becomes tricky to pinpoint the individual impact of each on the dependent variable. The estimated coefficients for these variables become unreliable and can even flip signs (go from positive to negative) with slight changes in the data. Imagine two witnesses telling you essentially the same story – how do you decide how much weight each account deserves on its own?
  • Increased Variance: Multicollinearity inflates the variance of the estimated coefficients, making it difficult to assess their statistical significance (how likely it is that the observed effect is due to chance). This makes it hard to determine whether a relationship between a variable and the dependent variable is truly meaningful (a small simulation after this list illustrates the effect).
  • Confounding Effects: It becomes challenging to understand the true relationship between the independent variables and the dependent variable because their effects are intertwined. Think back to the witness example – you can’t tell if one witness simply copied the other’s story, or if they genuinely saw the same thing.
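
To make the first two points concrete, here is a small simulation sketch using NumPy and scikit-learn. The data are synthetic and the exact numbers are illustrative: x2 is constructed to be almost a copy of x1, and the model is refit on repeated bootstrap resamples to show how much the individual coefficients move around.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 3 * x1 + 2 * x2 + rng.normal(size=n)   # combined effect of the pair is 5

coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)        # bootstrap resample
    X = np.column_stack([x1[idx], x2[idx]])
    coefs.append(LinearRegression().fit(X, y[idx]).coef_)

coefs = np.array(coefs)
# Individual coefficients swing widely from resample to resample (large std),
# while their sum stays close to the stable combined effect of about 5.
print("std of each coefficient:", coefs.std(axis=0).round(2))
print("mean of coefficient sum:", coefs.sum(axis=1).mean().round(2))
```

If you rerun the sketch with x2 removed, the variance of the remaining coefficient drops sharply, which is the intuition behind the remedies discussed later in this post.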

Do I Have to Fix Multicollinearity?

Not always. If your model still provides good predictions despite some correlation, it might be acceptable. However, be aware of its potential impact, especially if:

  • Your coefficients are unexpectedly large or have the wrong sign.
  • The variance of your coefficients is very high.
  • Your model’s explanatory power (R-squared) is high, but the individual variable significance tests are weak.

Testing for Multicollinearity with Variance Inflation Factors (VIF)

A common method to detect multicollinearity is the Variance Inflation Factor (VIF). VIF measures how much the variance of an estimated coefficient is inflated due to multicollinearity. Here’s a general guideline:

  • VIF < 5: No significant multicollinearity.
  • 5 ≤ VIF < 10: Moderate multicollinearity; investigate further.
  • VIF ≥ 10: High multicollinearity; consider corrective actions.
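
For predictor j, VIF is 1 / (1 − R_j²), where R_j² comes from regressing predictor j on all the other predictors. As a rough sketch of how you might compute it in Python, here is one way using statsmodels; the predictors are the same made-up x1, x2, x3 as above, with x2 nearly a multiple of x1.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors; x2 is roughly twice x1, so they are highly correlated.
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "x3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

# Add an intercept so each auxiliary regression includes a constant term.
X = add_constant(X)

# VIF for each predictor (skipping the constant column itself).
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 1))
```

With these numbers, x1 and x2 get very large VIFs while x3 stays low, matching the guideline above.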

Multicollinearity Example

Imagine you’re building a model to predict student grades. You include factors like study hours, number of absences, and difficulty of the course. Study hours and absences might be highly correlated (students who miss class tend to study less). This multicollinearity can make it difficult to isolate the true effect of each factor on grades.

Here are a couple more examples of multicollinearity, with explanations:

1. Marketing Example:

Imagine you’re analyzing factors affecting sales of a new smartphone model. You include variables like:

  • Marketing budget: Amount spent on advertising the phone.
  • Social media engagement: Number of likes, shares, and comments on social media posts about the phone.
  • Average income in the target market: The purchasing power of the target audience.

Here, marketing budget and social media engagement might be highly correlated. Companies with larger marketing budgets are likely to have more active social media campaigns. This multicollinearity can make it difficult to isolate the unique effect of each factor on phone sales. It might be challenging to determine if strong sales are due to a high marketing budget or the resulting social media buzz.

2. Finance Example:

Let’s say you’re building a model to predict stock prices. You consider factors like:

  • Price-to-earnings ratio (P/E ratio): A valuation metric comparing a company’s stock price to its earnings per share.
  • Earnings per share (EPS): A company’s profit divided by the number of outstanding shares.

In this case, the P/E ratio and EPS are inherently connected. A higher EPS (more profit per share) typically leads to a higher P/E ratio (investors willing to pay more for the stock). This multicollinearity can make it difficult to assess the independent effect of each factor on stock price. You might not be able to tell if a high stock price is due to strong company performance (high EPS) or simply a reflection of investor sentiment already captured in the P/E ratio.

These examples highlight how multicollinearity can arise from inherently linked variables, making it crucial to carefully consider the relationships between factors before building a regression model.

How to Deal with Multicollinearity

If you suspect multicollinearity, there are ways to address it:

  • Remove a Variable: If one variable can be logically excluded (e.g., absences might be redundant with study hours), consider removing it. However, do this with justification and avoid removing important information.
  • Combine Variables: Can you create a new variable by combining the highly correlated ones? For example, you could create a “study engagement” variable that incorporates both study hours and absences.
  • Dimensionality Reduction Techniques: Statistical methods like principal component analysis (PCA) can help reduce the number of variables while preserving the most relevant information (a brief sketch of both of these ideas follows this list).
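
As a rough illustration of the “combine variables” and PCA ideas, here is a sketch using scikit-learn. The feature names and the idea of collapsing study hours and absences into a single “study engagement” score follow the grades example above; treat this as an assumption-laden sketch rather than a prescription.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two highly correlated predictors (students who miss class tend to study less).
study_hours = np.array([12, 8, 15, 3, 10, 6, 14, 4], dtype=float)
absences = np.array([1, 4, 0, 9, 2, 6, 1, 8], dtype=float)
X = np.column_stack([study_hours, absences])

# Standardize, then collapse the correlated pair into one "study engagement" score.
X_scaled = StandardScaler().fit_transform(X)
engagement = PCA(n_components=1).fit_transform(X_scaled)

# Use `engagement` in place of the two original columns in your regression.
print(engagement.ravel().round(2))
```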

Conclusion

Multicollinearity can be a stealthy foe in regression analysis. By understanding its signs, employing detection methods like VIF, and taking appropriate actions, you can ensure your models provide clear and reliable insights. Remember, the goal is to build a model that accurately reflects the relationships between variables, not one where variables are simply echoing each other.

OK, that’s it, we are done now. If you have any questions or suggestions, please feel free to comment. I’ll come up with more Machine Learning and Data Engineering topics soon. Please also comment and subscribe if you like my work; any suggestions are welcome and appreciated.
