Class imbalance occurs when the two classes in a binary problem are not evenly split. Technically, anything other than a 50/50 distribution is imbalanced and warrants appropriate care; however, the more imbalanced the data, the more important it is to handle. For rare events, the positive class can sometimes make up less than 1% of the total.
The first issue that can arise is in randomly splitting your data: the smaller your dataset and the more imbalanced the classes, the more likely it is that the class distributions of the training and test sets will differ significantly. One way to mitigate this is to split randomly while stratifying on the class labels, which ensures a nearly identical class distribution in the train and test sets.
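As a minimal sketch, stratification is passed directly to scikit-learn's train_test_split; the test_size and random_state values here are illustrative choices, not part of the exercise scaffold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the breast cancer dataset used throughout this lesson
X, y = load_breast_cancer(return_X_y=True)

# Stratify on the class labels so train and test keep nearly the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```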
In addition to stratified sampling, class imbalance can be handled by undersampling the majority class or oversampling the minority class. For oversampling, repeated samples (with replacement) are drawn from the minority class until its size equals that of the majority class. The downside is that the same data points are used multiple times, effectively giving those samples a higher weight. Undersampling, on the other hand, uses only a subset of the majority class so that its size matches the minority class. The downside here is that less data is used to build the model.
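As a sketch, undersampling can be done by hand with scikit-learn's resample utility; the variable names and the assumption that X_train and y_train come from the split above are mine, and oversampling is covered in the exercise below:

```python
import numpy as np
from sklearn.utils import resample

# In the scikit-learn breast cancer data, 0 = malignant (minority), 1 = benign (majority)
X_min, y_min = X_train[y_train == 0], y_train[y_train == 0]
X_maj, y_maj = X_train[y_train == 1], y_train[y_train == 1]

# Undersample the majority class (without replacement) down to the minority size
X_maj_down, y_maj_down = resample(
    X_maj, y_maj, replace=False, n_samples=len(y_min), random_state=42
)

# Recombine into a balanced training set
X_bal = np.vstack([X_min, X_maj_down])
y_bal = np.concatenate([y_min, y_maj_down])
```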
When training a model, the default setting is that every sample is weighted equally. In the case of class imbalance, however, this can result in poor predictive power for the smaller of the two classes. One way to counteract this in logistic regression is to set the parameter class_weight='balanced'. This applies a weight inversely proportional to the class frequency, thereby giving higher weight to instances of the smaller class when they are misclassified. While overall accuracy may not increase, accuracy in the smaller class often does. In the breast cancer dataset, the most important class to get right is the underrepresented malignant class.
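To see what 'balanced' does, scikit-learn's compute_class_weight exposes the same formula, n_samples / (n_classes * count of each class); this small sketch assumes y_train from the split above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weights are inversely proportional to class frequency: the minority class gets the larger weight
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
print(dict(zip(np.unique(y_train), weights)))
```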
Instructions
Stratified Sampling
Calculate the percentage of malignancy in the train and test sets when using a random split. Repeat using the stratify parameter in train_test_split on the appropriate variable. Uncomment the relevant lines of code and press run to see how the positivity rate changes and how the new model performs.
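A hedged sketch of that comparison, assuming the scikit-learn breast cancer data and a 30% test split; the helper function and split parameters are illustrative, not the exercise's own scaffold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

def malignancy_rate(y_part):
    # Class 0 is malignant in the scikit-learn breast cancer dataset
    return (y_part == 0).mean()

for strat, label in [(None, 'random'), (y, 'stratified')]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=strat, random_state=42
    )
    print(f"{label}: train {malignancy_rate(y_tr):.2%}, test {malignancy_rate(y_te):.2%}")
```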
Model weighting
Similar to above, set class_weight='balanced' in the model argument and compare the accuracy and recall scores of the logistic regression model again.
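One possible way to make that comparison, assuming X_train, X_test, y_train, and y_test from the stratified split above; max_iter and the metric choices are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Compare an unweighted model against a balanced-weight model;
# recall is computed for the malignant class (label 0)
for label, weight in [('unweighted', None), ('balanced', 'balanced')]:
    lr = LogisticRegression(class_weight=weight, max_iter=10000)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred, pos_label=0)
    print(f"{label}: accuracy {acc:.3f}, malignant recall {rec:.3f}")
```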
Use over/under-sampling
Use an oversampling technique on the minority class to balance the percentages of malignant and benign samples. Verify that the classes are balanced.
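A sketch of oversampling with scikit-learn's resample, again assuming X_train and y_train from the split above; the balance check at the end uses np.bincount:

```python
import numpy as np
from sklearn.utils import resample

# 0 = malignant (minority), 1 = benign (majority)
X_min, y_min = X_train[y_train == 0], y_train[y_train == 0]
X_maj, y_maj = X_train[y_train == 1], y_train[y_train == 1]

# Oversample the minority class with replacement up to the majority size
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=42
)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])

# Verify the classes are now balanced: both counts should be equal
print(np.bincount(y_bal))
```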