Linear regression — How many and which features to include?

A guide on selecting independent variables for regression models.

M Adel
4 min read · Sep 17, 2021

Formulating a simple linear regression model requires little more than understanding the fundamentals and interpreting the model output. For multivariable regression models, however, selecting which features to include requires more care: the goal is a model with good accuracy that does not carry unnecessary parameters.

We will discuss multiple linear regression models using a case study from the MITx Analytics Edge course: predicting wine prices from a set of quality-related variables. Click Here to Access the Complete Project on Github. MITx Analytics Edge is a course offered by MITx on the edX platform; it covers data analytics from the basics through model development and evaluation.

The work is structured in 5 main steps:

  1. Reading the dataset

Understand the existing parameters, and assess the completeness of the data and the data types.
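A first pass over the data might look like the sketch below. The column names follow the wine dataset used in the course, but the values here are a small synthetic stand-in; the real project would load the full file with `pd.read_csv` (the file name is an assumption).

```python
import pandas as pd

# Synthetic stand-in for the wine dataset; in the actual project the data
# would be loaded with something like pd.read_csv("wine.csv") (name assumed).
wine = pd.DataFrame({
    "Year": [1952, 1953, 1955],
    "Price": [7.495, 8.039, 7.685],
    "AGST": [17.1, 16.7, 17.2],
    "HarvestRain": [160, 80, 130],
})

print(wine.dtypes)          # data type of each column
print(wine.isnull().sum())  # missing values per column
print(wine.shape)           # number of rows and columns
```

Checking `dtypes` and null counts up front catches columns that were read as strings or rows with missing measurements before any modelling starts.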

2. Exploratory Data Analysis (EDA)

This step involved visualizing the dataset using:

Linear plots

Correlation matrix

Price has a positive correlation with WinterRain, AGST, and Age, and a negative correlation with Year, HarvestRain, and FrancePop.

Note that FrancePop is highly correlated with the age of the wine, and so are Year and Age.
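A correlation matrix like the one used in this step can be computed directly with pandas. The frame below is a small illustrative sample, not the real course data:

```python
import pandas as pd

# Small illustrative sample; the real wine dataset has more rows and columns.
wine = pd.DataFrame({
    "Price": [7.5, 8.0, 7.7, 6.9, 7.2],
    "AGST": [17.1, 16.7, 17.2, 16.1, 16.5],
    "HarvestRain": [160, 80, 130, 200, 170],
})

# Pairwise Pearson correlations between all numeric columns
corr = wine.corr()

# Rank the candidate predictors by their correlation with the target
print(corr["Price"].sort_values(ascending=False))
```

Sorting the `Price` column of the matrix gives exactly the ranking used later to decide the order in which predictors enter the models.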

3. Apply linear regression modelling

Applying multiple linear regression models using 1, 2, 3 and 4 independent variables to predict wine price.

# import statsmodels
import statsmodels.api as sm
# add an intercept column to the design matrix
wine['intercept'] = 1
# fit a linear regression model of Price on AGST (Model1)
lm = sm.OLS(wine['Price'], wine[['AGST', 'intercept']])
result1 = lm.fit()

result1.summary()

The models were structured as follows:

  • Model1: one independent variable (AGST, the variable most correlated with the target variable “Price”).
  • Model2: two independent variables (AGST and HarvestRain, the two with the highest absolute correlation with Price).
  • Model3: three independent variables (AGST, HarvestRain, Age).
  • Model4: four independent variables (AGST, HarvestRain, Age, WinterRain).

As mentioned, the model independent variables were selected based on the ranking of the absolute value of their correlation with the dependent variable. But why did we not take the FrancePop and Year variables into account?

We did not consider them because the Age variable is highly correlated with both of them. Adding highly correlated predictors to a model inflates the standard errors of the model coefficients; this is known as multicollinearity.

4. Model evaluation

The models were assessed based on their R-squared scores, and the predicted outcomes on the training dataset were then visualized with density and line plots, as below.

Density plot of model outcome:

Line plot of model outcome:

The model represented by the red line (Model4) performed better than the other models, based on its closeness to the actual price line.
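A comparison plot like the one described can be produced with matplotlib. The arrays below are placeholder values; in the project they would be `wine["Price"]` and the fitted values from each model (`result.fittedvalues`):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values standing in for actual prices and Model4's fit
actual = np.array([7.5, 8.0, 7.7, 6.9, 7.2])
fitted_model4 = np.array([7.4, 7.9, 7.8, 7.0, 7.1])

plt.plot(actual, label="Actual price")
plt.plot(fitted_model4, color="red", label="Model4 fit")  # red line, as in the article
plt.legend()
plt.savefig("model_fit.png")
```

The closer the red fitted line tracks the actual price line, the better the in-sample fit; the same pattern can be shown as a density plot with `pd.Series(actual).plot.density()`.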

5. Analysis and Conclusion

Look at the summary statistics of two models, each with two independent variables. Model A was built from independent variables whose correlations with Price are 0.65 and -0.56; Model B from variables whose correlations with Price are 0.65 and 0.44. The R-squared and adjusted R-squared scores tell the two models apart.

Model4 that has 4 independent variables with an R-squared score of 0.829 performed better than the other models.

Adding the independent variables with the highest absolute correlation to the target variable tends to improve model performance.
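One caveat worth making explicit: plain R-squared never decreases when a predictor is added, so adjusted R-squared is the fairer yardstick for "too many parameters". It can be computed from R-squared directly (the n = 25 and R-squared values below are illustrative, chosen to be near the article's Model4 score):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors
    (excluding the intercept): 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A tiny R-squared gain from a 5th predictor may not justify the parameter:
print(adjusted_r2(0.829, 25, 4))  # four predictors
print(adjusted_r2(0.832, 25, 5))  # slightly higher R2, but one more predictor
```

Here the five-predictor model has the higher raw R-squared but the lower adjusted R-squared, so the extra variable is not pulling its weight.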

Watch out for correlation between the independent variables you choose, to avoid multicollinearity.

Written by M Adel

Engineering - Data - Thinking Tools - Continuous Education . https://www.linkedin.com/in/mustafa-adel-amer
