Model Training and Hyperparameters
Now that we have checked the assumptions of logistic regression, we can drop the appropriate features, then fit a model and make predictions with scikit-learn. To reproduce the same results, some of the model's hyperparameters will have to be changed from their default values. Hyperparameters are model settings that are chosen before the model is fit and can be tuned later to improve model performance. They differ from the parameters of a model (in the case of logistic regression, the feature coefficients and the intercept) in that they are not learned from the data during fitting.
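The distinction between hyperparameters and parameters can be sketched in a few lines. This is a minimal illustration on synthetic data from make_classification (an assumption made for the example, not the lesson's dataset); max_iter=1000 is likewise an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (not the lesson's dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(fit_intercept=True, max_iter=1000)

# Hyperparameters: set when the model is created, available before fitting.
hyperparams = model.get_params()
print("C:", hyperparams["C"], "fit_intercept:", hyperparams["fit_intercept"])

# Parameters: learned from the data, so they do not exist until fit() runs.
has_coef_before_fit = hasattr(model, "coef_")
model.fit(X, y)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Here C and fit_intercept are readable straight after construction, while coef_ and intercept_ appear only once the model has seen data.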
Despite the name, logistic regression is a classifier, so any evaluation metric for classification tasks applies. The simplest metric is accuracy: the number of correct predictions divided by the total number of predictions. However, when classes are imbalanced, accuracy can be a misleading measure of model performance. Similarly, if we care more about correctly predicting a certain class, other metrics such as precision, recall, or F1-score may be more appropriate. All of these metrics are available in scikit-learn. Check out the Evaluation Metrics lesson if you'd like to brush up on them.
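To see why accuracy can mislead on imbalanced classes, consider a toy example (the labels below are made up for illustration). A "model" that always predicts the majority class scores well on accuracy while never finding a single positive case.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Toy labels: 1 = positive (rare) class, 0 = negative (common) class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "model" that always predicts the majority class.
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# zero_division=0 reports 0 instead of warning when no positives are predicted.
precision = precision_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print("accuracy:", accuracy)    # 8 of 10 predictions are correct
print("precision:", precision)
print("recall:", recall)        # neither positive case is found
print("f1:", f1)
```

Accuracy comes out at 0.8 even though recall is 0, which is the kind of gap the other metrics are meant to expose.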
Which metrics matter most?
For our breast cancer dataset, predicting ALL malignant cases as malignant is of the utmost importance. Even if there are some false positives (benign cases that are marked as malignant), these will likely be discovered by follow-up tests, whereas missing a malignant case (classifying it as benign) could have deadly consequences. Thus, we want to minimize false negatives, which maximizes the ratio true-positives/(true-positives+false-negatives). This ratio is recall (also called sensitivity or true positive rate).
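The recall ratio can be checked by hand from a confusion matrix. The sketch below uses made-up labels in which 1 stands for malignant and 0 for benign (an encoding assumed for this example).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 1 = malignant (positive), 0 = benign (negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns the counts as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = true-positives / (true-positives + false-negatives).
recall_by_hand = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("recall by hand:", recall_by_hand)
print("recall_score:  ", recall_sklearn)
```

One malignant case is missed out of four, so recall is 3/4; the single false positive does not affect it.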
Using the mean predictor variables defined in the Workspace, train a logistic regression classifier on the train set using scikit-learn with no regularization (i.e. no penalty) and an intercept term. This means setting:
- No penalty: penalty='none' (the default penalty is 'l2'). In scikit-learn 1.2 and later, pass penalty=None instead; the string 'none' was removed in version 1.4.
- An intercept term: fit_intercept=True (this is also the default).
Print the model coefficients and intercept.
Print the confusion matrix and the accuracy, precision, recall, and F1-score. Which metric gives the highest value, and which gives the lowest?
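One possible way to work through these steps is sketched below. It does not reproduce the lesson's Workspace: it assumes the copy of the breast cancer data bundled with scikit-learn, keeps every column whose name starts with "mean", recodes malignant as 1, and picks its own train/test split, so the numbers it prints may differ from the lesson's. The helper fit_unpenalized is hypothetical and exists only to cope with the penalty=None versus penalty='none' difference between scikit-learn versions. Because the features are unscaled, the solver may emit a convergence warning; max_iter is raised here as a precaution.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
# Assumption: use all "mean ..." columns; the Workspace may use a subset.
mean_cols = [i for i, name in enumerate(data.feature_names)
             if name.startswith("mean")]
X = data.data[:, mean_cols]
# The bundled dataset codes malignant as 0; recode so malignant = 1 (positive).
y = (data.target == 0).astype(int)

# Assumption: a 70/30 stratified split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)


def fit_unpenalized(X, y):
    """Hypothetical helper: fit with no penalty across scikit-learn versions."""
    for penalty in (None, "none"):  # None for >= 1.2, "none" for older releases
        try:
            model = LogisticRegression(penalty=penalty, fit_intercept=True,
                                       max_iter=10000)
            return model.fit(X, y)
        except (ValueError, TypeError):
            continue
    raise RuntimeError("could not fit an unpenalized model")


model = fit_unpenalized(X_train, y_train)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(cm)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Recoding the target matters for the metrics: precision_score, recall_score, and f1_score treat label 1 as the positive class by default, so leaving the bundled encoding unchanged would report recall for the benign class instead.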