In this tutorial, we will learn about binary logistic regression and its application to real-life data using Python. We have also covered binary logistic regression in R in another tutorial. Binary logistic regression remains one of the most widely used predictive modeling methods. Logistic regression is a classification algorithm used to predict the probability of a categorical dependent variable. The method models a binary variable that takes two possible values, typically coded as 0 and 1.
You can download the data files for this tutorial here.
We’ll first recap a few aspects of binary logistic regression and then focus on statistical modeling, hypothesis testing and classification tables using Python. We’ll use a case study in the banking domain to demonstrate the method.
Binary Logistic Regression in Python
Binary logistic regression models the relationship between a set of independent variables and a binary dependent variable. It is useful when the dependent variable is dichotomous in nature, such as death or survival, absence or presence, or pass or fail. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X. Independent variables can be categorical or continuous, for example, gender, age, income, geographical region and so on. Binary logistic regression models the dependent variable through the logit of p, where p is the probability that the dependent variable takes the value one.
Statistical Model – For k Predictors
So what does the statistical model in binary logistic regression look like? For k predictors, the model is:

ln(p / (1 − p)) = b0 + b1X1 + b2X2 + … + bkXk

In this equation, p is the probability that Y equals one given X, where Y is the dependent variable and the X's are independent variables. The parameters b0 to bk are estimated using the maximum likelihood method. The left-hand side of the equation, the log odds, ranges from minus infinity to plus infinity.
where,
p : Probability that Y=1 given X
Y : Dependent variable
X1, X2, …, Xk : Independent variables
b0, b1, …, bk : Parameters of the model
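The model above can be sketched directly in Python: given parameter values and predictor values, the probability p is recovered by inverting the logit. The parameter and predictor values below are illustrative only, not estimates from the bank loan data.

```python
import numpy as np

def logistic_probability(x, b):
    """P(Y=1 | X) from the binary logistic regression model.
    x : predictor values (x1, ..., xk); b : parameters (b0, b1, ..., bk)."""
    log_odds = b[0] + np.dot(b[1:], x)   # b0 + b1*x1 + ... + bk*xk
    return 1 / (1 + np.exp(-log_odds))   # invert the logit link

# Illustrative values only
p = logistic_probability(np.array([2.0, 1.5]), np.array([-1.0, 0.5, 0.3]))
print(round(p, 4))  # log odds = 0.45, so p ≈ 0.6106
```

Note that whatever the value of the log odds, the resulting probability always lands between zero and one.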
Case Study – Modeling Loan Defaults
Let’s explain the concept of binary logistic regression using a case study from the banking sector. Our bank has demographic and transactional data for its loan customers. It wants to develop a model that predicts defaulters and helps the bank in its loan disbursal decision making. The objective here is to predict whether customers applying for a loan will default or not. We will use a sample of size 700 to develop the model. The independent variables are age group, years at current address, years with current employer, debt-to-income ratio, credit card debt and other debt. All of these variables are collected at the time of the loan application and will be used as independent variables. The dependent variable is the status observed after the loan is disbursed: one if the customer defaults and zero otherwise.
BLR Data Snapshot
Here’s a snapshot of the data. Our dependent variable is binary, whereas the independent variables are either categorical or continuous in nature.
Binary Logistic Regression in Python
Let’s import our data and check the data structure in Python. As usual, we import the data using the read_csv function in the pandas library, and use the info function to check the data structure. We can see here that the AGE variable is an integer type.
# Import data and check data structure before running the model
import pandas as pd
bankloan = pd.read_csv('BANK LOAN.csv')
bankloan.info()
AGE should be a categorical variable, and therefore needs to be converted into a category type. If it isn’t converted, Python will interpret it as a numeric variable, which is not correct, as we are considering age groups in our model.
# Change ‘AGE’ variable into categorical
bankloan['AGE'] = bankloan['AGE'].astype('category')
bankloan.info()

AGE is an integer and needs to be converted to type ‘category’ for modeling purposes.
Logistic regression uses the logit link function. As with the linear regression model, the dependent and independent variables are separated using the tilde sign, and independent variables are separated by the plus sign.
So which independent variables impact customers turning into defaulters? After fitting the logistic regression model, we carry out individual hypothesis testing to identify significant variables. We then use the summary function on the model object to get detailed output. Variables whose p-value is less than 0.05 are considered statistically significant. Since the p-value is < 0.05 for EMPLOY, ADDRESS, DEBTINC and CREDDEBT, these independent variables are significant.
Logistic Regression using logit function
import statsmodels.formula.api as smf
riskmodel = smf.logit(formula='DEFAULTER ~ AGE + EMPLOY + ADDRESS + DEBTINC + CREDDEBT + OTHDEBT', data=bankloan).fit()

logit() fits a logistic regression model to the data.

BLR Model summary
riskmodel.summary()

summary() generates a detailed summary of the model.
Re-run the BLR Model in Python
Once the variables to be retained are finalized, we re-run the binary logistic regression model, including only the significant variables. Again, the output of the summary function provides the revised coefficients for the model.

riskmodel = smf.logit(formula='DEFAULTER ~ EMPLOY + ADDRESS + DEBTINC + CREDDEBT', data=bankloan).fit()
riskmodel.summary()
In this output, all independent variables are statistically significant and the signs are logical, so this model can be used for further diagnostics.
Odds Ratios in Python
After substituting the values of the parameter estimates, this is how the final model will appear. The probability of defaulting can be predicted by entering the values of the X variables into the equation.
We use the odds ratio to measure the association between an independent variable and the dependent variable. Once the parameters are estimated with confidence intervals, simply taking the antilog gives the odds ratios with confidence intervals. In Python, the conf_int function calculates the confidence intervals for the parameters, and then the parameter estimates are added to the object. The antilog values are printed to give a table of odds ratios.
import numpy as np
conf = riskmodel.conf_int()
conf['OR'] = riskmodel.params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))

conf_int() calculates confidence intervals for the parameters; riskmodel.params gives the model parameter estimates.
Odds Ratios in Python
From the output here, we can see that none of the confidence intervals for the odds ratios includes one, which indicates that all the variables included in the model are significant. The odds ratio for CREDDEBT is approximately 1.77, so for a one-unit change in CREDDEBT, the odds of being a defaulter change 1.77-fold.
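To make the odds-ratio arithmetic concrete: the odds ratio is the antilog (exponential) of the coefficient, and a one-unit increase in the predictor multiplies the odds by that factor. The coefficient and baseline odds below are illustrative, chosen so that exp(coef) ≈ 1.77.

```python
import numpy as np

coef_creddebt = 0.57                # illustrative coefficient, not the exact estimate
odds_ratio = np.exp(coef_creddebt)  # antilog of the coefficient, ≈ 1.77

odds_before = 0.20                     # some baseline odds of default
odds_after = odds_before * odds_ratio  # odds after a one-unit increase in CREDDEBT
print(round(odds_ratio, 2), round(odds_after, 3))  # 1.77 0.354
```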
Predicting Probabilities in Python
We obtain predicted probabilities from the final model using the predict function. Predicted probabilities are saved in the same bankloan dataset in the new variable ‘pred’.
The last column in the data gives predicted probabilities using the final model.
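The probability calculation behind the predict function can be sketched by hand. The coefficient values and applicant records below are hypothetical, for illustration only; in practice you would simply write bankloan['pred'] = riskmodel.predict() and use the fitted riskmodel.params.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for the final model (in practice, use riskmodel.params)
b = {'Intercept': -0.79, 'EMPLOY': -0.24, 'ADDRESS': -0.08,
     'DEBTINC': 0.09, 'CREDDEBT': 0.57}

# Two made-up applicants: one high-risk, one low-risk
applicants = pd.DataFrame({'EMPLOY': [2, 17], 'ADDRESS': [6, 3],
                           'DEBTINC': [9.3, 5.5], 'CREDDEBT': [11.36, 0.86]})

log_odds = (b['Intercept'] + b['EMPLOY'] * applicants['EMPLOY']
            + b['ADDRESS'] * applicants['ADDRESS']
            + b['DEBTINC'] * applicants['DEBTINC']
            + b['CREDDEBT'] * applicants['CREDDEBT'])
applicants['pred'] = 1 / (1 + np.exp(-log_odds))  # inverse logit, as predict() does
print(applicants)
```

With these illustrative coefficients, the short-tenure, high-debt applicant gets a predicted default probability near one, while the long-tenure, low-debt applicant gets a probability near zero.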
Classification Table
It’s important to measure the goodness of fit of any fitted model. Based on some cut off value of probability, the dependent variable Y is estimated to be either one or zero. A cross tabulation of observed values of Y and predicted values of Y is known as a classification table.
The accuracy percentage measures how accurate a model is in predicting the outcomes.
In the table, the dependent variable was observed and predicted to be zero 478 times, whereas it was observed and predicted to be one 92 times. The accuracy rate is therefore calculated as (478 + 92) divided by the total sample size of 700, giving an accuracy of 81.43%. The misclassification rate is the percentage of wrongly predicted observations. In this example, the misclassification rate is (39 + 91) divided by 700, giving 18.57%.
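The accuracy arithmetic can be checked directly from the classification-table counts (using off-diagonal counts consistent with the reported totals and the sensitivity and specificity values given later):

```python
# Counts from the classification table: rows = observed, columns = predicted
tn, fp = 478, 39   # observed 0: correctly predicted, wrongly predicted
fn, tp = 91, 92    # observed 1: wrongly predicted, correctly predicted

total = tn + fp + fn + tp              # 700
accuracy = (tn + tp) / total           # 570 / 700
misclassification = (fp + fn) / total  # 130 / 700
print(round(accuracy * 100, 2), round(misclassification * 100, 2))  # 81.43 18.57
```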
Classification Table TerminologyDifferent terminologies are used for observations in a classification table. These are sensitivity, specificity, false positive rate and false negative rate. The sensitivity of a model is the percentage of correctly predicted occurrences or events. It is the probability that the predicted value of Y is one, given the observed value of Y being one. On the contrary, specificity is the percentage of non-occurrences being correctly predicted – that is the probability that the predicted value of Y is zero, given that the observed value of Y is also zero. The false positive rate is the percentage of non-occurrences that are predicted wrongly as events. Similarly, the false negative rate is the percentage of occurrences which are predicted incorrectly.
Sensitivity and Specificity calculations
This table represents the accuracy, sensitivity and specificity values for different cut-off values. On the basis of these values, we can deduce that 0.3 is the best cut-off value for the model.
Classification table in Python
Let’s now obtain the classification table in Python. The predict function gives predicted probabilities. We set the threshold to 0.5, and the predicted class is assigned a value of 1 if the predicted probability is greater than this threshold. Finally, we use the confusion_matrix function to obtain a classification table from the observed defaulter status and the predicted class.
Predicting Probabilities
from sklearn.metrics import confusion_matrix
predicted_values1 = riskmodel.predict()
threshold=0.5
predicted_class1=np.zeros(predicted_values1.shape)
predicted_class1[predicted_values1>threshold]=1
cm1 = confusion_matrix(bankloan['DEFAULTER'],predicted_class1)
print('Confusion Matrix : \n', cm1)
The confusion_matrix function creates a cross table of observed Y (defaulter) vs. predicted Y.
Sensitivity and Specificity in Python
Now let’s calculate the sensitivity and specificity values in Python, using the formulas discussed earlier. The sensitivity of the model is 50.27%, whereas the specificity is 92.46%. The sensitivity value is definitely lower than desired, so we can try a different threshold and obtain the optimum threshold as explained earlier.
Sensitivity and Specificity
sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Sensitivity : ', sensitivity)
specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Specificity : ', specificity )
# Output:
Sensitivity : 0.5027322404371585
Specificity : 0.9245647969052224
Interpretation: The sensitivity is 50.27% and the specificity is 92.46%. Note that the threshold is set at 0.5.
Precision & Recall values of the model
Precision and recall are routinely assessed for a classification model. Precision tells us what percentage of predicted positive cases are correctly predicted. Recall tells us what percentage of actual positive cases are correctly predicted.
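From the confusion-matrix counts, precision and recall can be computed directly (the counts below are consistent with the classification table and the sensitivity reported above):

```python
tp, fp, fn = 92, 39, 91   # counts from the classification table

precision = tp / (tp + fp)  # share of predicted defaulters who actually defaulted
recall = tp / (tp + fn)     # share of actual defaulters who were flagged
print(round(precision, 4), round(recall, 4))  # 0.7023 0.5027
```

Note that recall for the positive class is the same quantity as the sensitivity computed earlier.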
Classification Report
The classification_report function in Python is also very useful. We import it from the sklearn.metrics library. It accepts the observed Y and the predicted class of Y as two arguments. The output shows the recall, precision and accuracy of the model.
#Classification Report
from sklearn.metrics import classification_report
print(classification_report(bankloan['DEFAULTER'], predicted_class1))

classification_report() gives recall, precision and accuracy along with other measures.
Quick Recap
Let’s quickly recap. In this tutorial, we learned about binary logistic regression modeling and its application. We then used Python code to estimate model parameters and obtain a classification report.
Continue to the follow on tutorial on Binary Logistic Regression in Python Part II