Logistic regression — How to investigate your model performance?

Guide to logistic regression with a step-by-step example from MITx Analytics Edge course using Python

Mustafa Adel Amer
Sep 26, 2021

We previously discussed linear regression and how to read the summary statistics of linear regression models, which aim to predict a continuous quantity such as the price of an item, a temperature, savings, or the performance of an engine.

But what if the target variable is not continuous? What should we do if we want to predict whether something is good or bad, strong or weak, fast or slow, 0 or 1?

Logistic regression is the answer. It extends linear regression to the case where the target variable is categorical and coded as 0 or 1.

The output of a logistic regression model is a probability: the probability that the outcome is 1, computed with the logistic response function.

What is the logistic response function?

The logistic response function takes values between 0 and 1, as shown in the graphical representation of the function below.

Logistic regression function — Image by author
Graphical representation of the function — Image by author
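To make the function concrete, here is a minimal Python sketch of the logistic response function, P(y = 1) = 1 / (1 + e^(-z)), where z is the linear combination of the independent variables. The helper names, coefficients, and feature values are made up for illustration and are not taken from the course notebook.

```python
import math

def logistic(z):
    """Logistic response function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, features):
    """P(y = 1) for one observation, given already-fitted coefficients."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return logistic(z)

# Hypothetical coefficients and feature values, for illustration only
p = predict_probability(-1.5, [0.8, 0.3], [2.0, 1.0])
print(f"P(y = 1) = {p:.3f}")
```

A value of z equal to 0 gives a probability of exactly 0.5; large positive values of z push the probability toward 1 and large negative values push it toward 0.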

The work presented here uses the claims dataset, which was generated from the physical claims of patients visiting doctors and pharmacies. Expert physicians then assessed the quality of care by reviewing the claims. The aim of the exercise is to predict whether the healthcare provided is good or poor. The dataset was used as a case example of logistic regression in the MITx Analytics Edge course.

The dataset and the Python notebook file are available on GitHub.

How to control the outcome of logistic regression models?

In logistic regression, you control the predicted class through a threshold value (t): an observation is labeled 1 when its predicted probability is at least t, and 0 otherwise. But how do you select the threshold value, and how does it link to the model results?

The answer is the confusion matrix (classification matrix), which compares predicted labels with actual labels and from which we can compute the accuracy, specificity, and sensitivity of the model. Choose the threshold value whose errors are the least costly, that is, errors that do not create the risks we want to avoid.

In our case, we are assessing the quality of healthcare service. We have no preference between good care and bad care, so we can choose a threshold value that gives a balanced outcome.

Confusion matrix and evaluation metrics
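As a rough sketch of how these metrics follow from the confusion matrix, the snippet below counts true/false positives and negatives at a given threshold and derives accuracy, sensitivity, and specificity. The labels and probabilities are invented for illustration (they are not the claims dataset), and the function names are my own.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when predicting 1 for probabilities >= threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Made-up actual labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 0, 0, 1, 0, 1]
y_prob = [0.9, 0.4, 0.35, 0.2, 0.6, 0.7, 0.1, 0.55]

tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold=0.5)
print(tp, fp, tn, fn)
print(metrics(tp, fp, tn, fn))
```

Rerunning the last lines with a different threshold changes the four counts, and with them all three metrics.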

How does the threshold value affect the confusion matrix and hence the logistic regression outcome?

The ROC curve (Receiver Operating Characteristic curve) can help decide which threshold value is best. It shows how the true positive rate and the false positive rate vary across threshold values.

ROC and threshold vs false positive rate to show the impact of threshold on the ROC curve

Let's read the ROC curve above by examining four points on it.

At point (a), the threshold value is 1 and the false positive rate is zero. The model labels no case as 1, so it produces no false positives, but it also catches none of the actual positive cases: the true positive rate is zero as well.

At point (b), the threshold value is 0.6. The model catches 40% of the actual positive cases at a false positive rate of 5%.

At point (c), the threshold value is 0.2. The model catches 85% of the actual positive cases but at a false positive rate of 65%.

At point (d), the threshold value is 0. The model labels every case as 1, so it catches all the actual 1 values but also mislabels all 0 values as 1.

In general: A model with a higher threshold will have a lower sensitivity and a higher specificity. A model with a lower threshold will have a higher sensitivity and a lower specificity.
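The trade-off can be checked numerically. The sketch below computes one ROC point, the pair (false positive rate, true positive rate), per threshold on a small made-up dataset; the data and the function name are assumptions for illustration, not values from the claims dataset.

```python
def roc_point(y_true, y_prob, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    tp = sum(1 for a, p in zip(y_true, y_prob) if a == 1 and p >= threshold)
    fp = sum(1 for a, p in zip(y_true, y_prob) if a == 0 and p >= threshold)
    return fp / negatives, tp / positives

# Made-up actual labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 0, 0, 1, 0, 1]
y_prob = [0.9, 0.4, 0.35, 0.2, 0.6, 0.7, 0.1, 0.55]

# Sweep from a strict threshold to a permissive one
for t in [1.0, 0.6, 0.2, 0.0]:
    fpr, tpr = roc_point(y_true, y_prob, t)
    print(f"threshold={t:.1f}  sensitivity={tpr:.2f}  specificity={1 - fpr:.2f}")
```

As the threshold drops, sensitivity rises and specificity falls. If scikit-learn is installed, `sklearn.metrics.roc_curve` computes the same rates across all relevant thresholds in one call.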

Which threshold should you pick, then? It depends on the trade-off you want to make and on the cost of each kind of error. If you are predicting whether a patient is in danger, you should not miss a true positive (a person who really is in danger), so you should pick a low threshold. In our case, a threshold of 0.25 gives a balanced result.

How to show that the model did well in predicting the outcomes?

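The Area Under the Curve (AUC) measures the power of the model to discriminate between the two outcomes. It is the area under the ROC curve. The best score is 100% and pure guessing scores 50%. The model in our case has an AUC of 0.77, which means that, given a randomly chosen positive case and a randomly chosen negative case, the model assigns the higher predicted probability to the positive case 77% of the time. This is not the same as classifying 77% of the cases correctly; that quantity is the accuracy, which depends on the threshold. The AUC changes with the set of independent variables used to build the model.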

ROC curve and the AUC score
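One way to build intuition for the AUC is to compute it directly as a ranking score: the share of (positive, negative) pairs in which the positive case receives the higher predicted probability, with ties counted as half. The sketch below does this on made-up data; it is an illustration of the idea, not the code used to obtain the 0.77 reported here.

```python
def auc_by_ranking(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [p for a, p in zip(y_true, y_prob) if a == 1]
    neg = [p for a, p in zip(y_true, y_prob) if a == 0]
    score = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                score += 1.0   # positive ranked above negative
            elif p == n:
                score += 0.5   # tie counts as half
    return score / (len(pos) * len(neg))

# Made-up actual labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 0, 0, 1, 0, 1]
y_prob = [0.9, 0.4, 0.35, 0.2, 0.6, 0.7, 0.1, 0.55]

auc = auc_by_ranking(y_true, y_prob)
print(f"AUC = {auc:.4f}")
```

The pairwise loop is quadratic in the number of observations, which is fine for a small example. On real data, `sklearn.metrics.roc_auc_score` is the usual choice if scikit-learn is available.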

Conclusion

  • A confusion matrix assesses the ability of the model to predict labels correctly. It changes when the model threshold value changes.
  • Threshold value selection, together with the selection of independent variables, has a key influence on the performance of a logistic regression model.
  • The right threshold value depends on the cost of errors in the field where you are applying the model.
  • The AUC score assesses the quality of prediction of a logistic regression model. The higher the better, and it should be greater than 50%.
