
Assumptions of Linear Regression

Linear regression

Linear Regression is a technique to find the relationship between an independent variable and a dependent variable. Regression is a parametric machine learning algorithm, which means the algorithm can be described and summarized as a learning function.

Example: Y = f(x)

Linear Regression also explains how the dependent variable changes with a unit change in the independent variable. In Simple Linear Regression we model the response with a single independent variable, whereas in Multiple Linear Regression we model it with several independent variables at the same time.

f(x) = Y = B0 + B1*X1 (Simple Linear Regression)

f(x) = Y = B0 + B1*X1 + B2*X2 + B3*X3 (Multiple Linear Regression)
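
As a rough sketch of what fitting such a model looks like in code (assuming scikit-learn is available; the data below is made up purely for illustration), the coefficients B0 through B3 can be estimated like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 100 samples with 3 independent variables (X1, X2, X3)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
# True relationship Y = 4 + 2*X1 - 1*X2 + 0.5*X3, plus a little noise
y = 4 + 2 * X[:, 0] - 1 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print("B0 (intercept):", model.intercept_)  # should be close to 4
print("B1, B2, B3:", model.coef_)           # should be close to [2, -1, 0.5]
```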

Assumptions that are made for linear regression are as follows:

  1. Linearity
  2. Outliers
  3. Autocorrelation
  4. Multicollinearity
  5. Heteroskedasticity

Linearity

The relationship between the independent and dependent variables must be linear in nature for Linear Regression to work. If we build a linear model on a nonlinear dataset, the model will fail to capture the trend mathematically and will also predict poorly on unseen data.

We can create a scatter plot of the data to check for linearity, fit the line that best fits the given scatter plot, and then calculate the residuals.

To find the best fit line, or regression line, we must minimize the cost function, which is the sum of the squared residuals:

Cost = Σ (Yi - Ŷi)²

Residual = |actual value - predicted value| OR Residual = |Yi - Ŷi|

where Ŷi is the value the line predicts for point i.

[Figure: scatter plot of sales vs. marketing budget]

In the scatter plot above we can see the linearity of the data, so a best fit line can be drawn through it.

We start with a scatter plot, find the best fit line, and calculate the residual at each point. The residual sum of squares (RSS) tells us how well our data points fit the given regression line:

RSS = (Y1 - Ŷ1)² + (Y2 - Ŷ2)² + ... + (Yn - Ŷn)² = Σ (Yi - Ŷi)²
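
As a minimal sketch (using NumPy on made-up sales and budget numbers), fitting the best fit line and computing the residuals and RSS might look like this:

```python
import numpy as np

# Hypothetical sales vs. marketing-budget data (the numbers are illustrative)
budget = np.array([10, 20, 30, 40, 50, 60], dtype=float)
sales = np.array([25, 44, 58, 81, 95, 122], dtype=float)

# Fit the best fit line Y = B0 + B1*X by least squares
# (np.polyfit returns coefficients from highest degree down, so slope first)
b1, b0 = np.polyfit(budget, sales, deg=1)
predicted = b0 + b1 * budget

# Residual at each point, and the residual sum of squares (RSS)
residuals = sales - predicted
rss = np.sum(residuals ** 2)
print("best fit line: Y = %.2f + %.2f*X" % (b0, b1))
print("RSS:", rss)
```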

Outliers

A point that falls far outside the rest of the data set is called an outlier. In statistics, an outlier is a data point that differs significantly from the other observations in a sample.

[Figure: scatter plot with outliers indicated]

As we can clearly see in the picture above, there are two data points significantly different from the rest of the data.

Outliers will also affect the model. If there is a substantial number of them, the best approach is to split the data set into two samples and create two different models, one with the outliers and one without them.

Outliers can also affect a coefficient by flipping its sign, and they inflate the residuals, which increases the error.

The IQR (interquartile range) and the box plot are common techniques for identifying outliers.

IQR = Q3 - Q1 (where Q3 is the third quartile and Q1 is the first quartile)

Data points should fall within the range given by the formula below; anything outside it is treated as an outlier:

Acceptable range => (Q1 - 1.5*IQR) to (Q3 + 1.5*IQR)
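
A small sketch of the IQR rule in practice (using NumPy on a made-up sample that contains two obvious outliers):

```python
import numpy as np

# Hypothetical sample: the last two values are clearly out of pattern
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 55, 60], dtype=float)

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Anything outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR) is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print("IQR:", iqr)
print("acceptable range: (%.2f, %.2f)" % (lower, upper))
print("outliers:", outliers)  # expected: [55. 60.]
```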

Multicollinearity

If there is correlation between the predictor variables, we say there is multicollinearity, which means the independent variables are correlated with each other. If we find multicollinearity in the sample, it will obscure the true relationship between the dependent and independent variables.

To check for multicollinearity, we can look at the Variance Inflation Factor (VIF) of each predictor. As a common rule of thumb, a VIF of 1 means a variable is uncorrelated with the other predictors, and values up to about 5 indicate only moderate multicollinearity.

When a variable's VIF is greater than 5, the correlation is high and we can consider dropping that variable, because it becomes difficult to estimate the true relationship between the dependent and independent variables.
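
A minimal sketch of computing VIF values, assuming statsmodels is available; the predictors below are fabricated so that X3 is nearly a copy of X1, which should push both of their VIFs well above 5 while X2 stays near 1:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Fabricated predictors: X3 is almost a duplicate of X1 (high correlation),
# while X2 is generated independently
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)

# VIF is computed on the full design matrix, including the intercept column
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# Skip index 0 (the intercept) and report VIF for each predictor
for i, name in enumerate(["X1", "X2", "X3"], start=1):
    print(name, "VIF =", round(variance_inflation_factor(X, i), 2))
```

Here the high VIFs for X1 and X3 signal that one of the two should be dropped before fitting the regression.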


OK, that's it, we are done for now. If you have any questions or suggestions, please feel free to reach out to me. We will come up with more Machine Learning algorithms soon.
