We’re now ready to delve deeper into logistic regression! In this lesson, we will cover the assumptions behind logistic regression, model hyperparameters, how to evaluate a classifier, ROC curves, and what to do when there’s a class imbalance in the classification problem we’re working with.
For this lesson, we will be using the Wisconsin Breast Cancer Data Set (Diagnostic) to predict whether a tumor is benign or malignant based on characteristics of the cells, such as radius, texture, smoothness, etc. Like a lot of real-world data sets, the distribution of outcomes is uneven (benign diagnoses are more common than malignant) and the outcomes are not equally important (classifying all malignant cases correctly is of the utmost importance).
We’re going to begin with the primary assumptions about the data that need to be checked before implementing a logistic regression model.
1. Target variable is binary
One of the most basic assumptions of logistic regression is that the outcome variable needs to be binary (or in the case of multinomial LR, discrete).
2. Independent observations
While often overlooked, checking for independent observations in a data set is important for the theory behind LR. In this case, the assumption can be violated if patients are biopsied multiple times (repeated sampling of the same individual).
3. Large enough sample size
Since logistic regression is fit using maximum likelihood estimation instead of least squares minimization, there needs to be a large enough sample for the fit to converge. What “large enough” means is often a matter of interpretation, but a common rule of thumb is that there should be at least 10 samples per feature per class.
4. No influential outliers
Logistic regression is sensitive to outliers, so we also need to check whether there are any influential outliers before building the model. Outliers are a broad topic with many different definitions (z-scores, a scalar multiple of the interquartile range, Cook’s distance/influence/leverage, etc.), so there are many ways to identify them. Here, we will use visual tools to rule out obvious outliers.
Verify that the values of “diagnosis” are binary classes by printing the distinct diagnosis values and their frequency in the dataset.
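A minimal sketch of this check, assuming the data is in a pandas DataFrame with a `diagnosis` column. The tiny DataFrame below is a hypothetical stand-in for the lesson's data, and the labels `B` and `M` are assumed, so the real column names and values may differ.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's data; real column names and labels may differ.
df = pd.DataFrame({
    "id": [101, 102, 103, 104, 105, 106],
    "diagnosis": ["B", "M", "B", "B", "M", "B"],
})

# Distinct diagnosis values and how often each one appears
counts = df["diagnosis"].value_counts()
print(counts)

# The target is binary if exactly two distinct values are present
is_binary = df["diagnosis"].nunique() == 2
print("Binary target:", is_binary)
```

The same `value_counts()` output also gives a first look at how imbalanced the classes are.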
Check the dataset to see if the observations are unique by ID – i.e. print whether or not the number of unique patient IDs is equal to the number of samples.
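One way to sketch this comparison, assuming the patient identifier is stored in a column named `id` (an assumption; the lesson's file may use a different name). The small DataFrame is made up for illustration and deliberately contains one repeated ID.

```python
import pandas as pd

# Hypothetical stand-in data with one patient ID appearing twice
df = pd.DataFrame({
    "id": [101, 102, 103, 103],
    "diagnosis": ["B", "M", "B", "M"],
})

n_unique = df["id"].nunique()   # number of distinct patient IDs
n_samples = len(df)             # number of rows (samples)
ids_unique = n_unique == n_samples
print("IDs are unique:", ids_unique)

# If the check fails, list the rows that share an ID
repeats = df[df["id"].duplicated(keep=False)]
print(repeats)
```

Unique IDs do not prove the observations are independent, but repeated IDs are a clear sign of repeated sampling.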
If classes are very imbalanced, it is important that the smallest class still meets the rule of thumb of at least 10 samples per feature. Based on this, what is the maximum number of features we should have? Define a variable max_features and print its value.
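The arithmetic can be sketched as follows, assuming the class counts commonly reported for the diagnostic data set (357 benign, 212 malignant). In practice these counts would come from the data itself, for example from `value_counts()` on the diagnosis column; the labels `B` and `M` are assumptions.

```python
import pandas as pd

# Assumed class counts; in practice, compute these from the diagnosis column
class_counts = pd.Series({"B": 357, "M": 212})

samples_per_feature = 10  # rule of thumb: at least 10 samples per feature per class

# The smallest class limits how many features the rule of thumb allows
max_features = int(class_counts.min() // samples_per_feature)
print(max_features)  # 21
```

Using the smallest class rather than the total sample size keeps the estimate conservative when the classes are imbalanced.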
We’re going to make a pairplot of the mean features to exclude any obvious outliers. Alternatively, you can use a boxplot of z-scores, but because the features are not normally distributed (all the values must be greater than zero and are right-skewed), it is more informative to look at a boxplot of the z-scores of the log-transformed features. Uncomment the relevant lines here and press Run to see the outputs.
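As a rough sketch of the z-score alternative, the snippet below log-transforms two features and standardizes them. The column names `radius_mean` and `texture_mean` and all the values are hypothetical, with one deliberately extreme radius in the last row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of two mean features; the last radius is deliberately extreme
features = pd.DataFrame({
    "radius_mean": [12.0, 13.5, 11.2, 14.1, 12.8, 40.0],
    "texture_mean": [17.0, 19.5, 21.0, 18.2, 20.1, 19.0],
})

# Log-transform first (all values are positive), then standardize each column
log_features = np.log(features)
z = (log_features - log_features.mean()) / log_features.std()

# Largest absolute z-score per feature, and the row where radius_mean is most extreme
print(z.abs().max())
print("Most extreme radius_mean row:", z["radius_mean"].idxmax())
```

Passing `z` to a plotting call such as `z.boxplot()` (which requires matplotlib) draws one box per feature, so points far from the rest stand out visually.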