1. Features linearly related to log odds
Similar to linear regression, an underlying assumption of logistic regression is that the features are linearly related to the logit (log odds) of the outcome. To test this visually, we can use seaborn’s regplot with the parameter ‘logistic=True’ and our feature of interest as the x value. If this condition is met, the model fit will resemble a sigmoidal curve (as is the case when ‘x=radius_mean’). We’ve written code to create a second plot using the feature ‘fractal_dimension_mean’. Press Run on the code in the workspace. How do the curves compare?
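The visual check has a simple numerical counterpart: split the feature into bins, compute the observed log odds of the outcome in each bin, and see whether those values fall roughly on a straight line. The sketch below is only an illustration under stated assumptions. It uses synthetic data rather than the lesson's dataset, and the helper name `empirical_logits` and the choice of 8 bins are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the lesson's dataset): the true log odds
# of the outcome are linear in x, so the check below should pass.
n = 5000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p)

def empirical_logits(feature, outcome, n_bins=8):
    """Return (mean feature value, empirical log odds) for each quantile bin."""
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    # Assign each observation to a quantile bin numbered 0..n_bins-1
    idx = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_bins - 1)
    centers, logits = [], []
    for b in range(n_bins):
        mask = idx == b
        # Add 0.5 to each count so a bin with only 0s or only 1s stays finite
        pos = outcome[mask].sum() + 0.5
        neg = mask.sum() - outcome[mask].sum() + 0.5
        centers.append(feature[mask].mean())
        logits.append(np.log(pos / neg))
    return np.array(centers), np.array(logits)

centers, logits = empirical_logits(x, y)

# A straight-line fit and its correlation summarize how linear the trend is
slope, intercept = np.polyfit(centers, logits, 1)
r = np.corrcoef(centers, logits)[0, 1]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```

A correlation close to 1 (or -1) between the bin centers and the binned log odds suggests the assumption is plausible for that feature. A curved or flat pattern suggests it is not.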
As in linear regression, another assumption is that there is no multicollinearity in the data. There are many ways to check for this; the most common are a correlation matrix of the features and the variance inflation factor (VIF). With a correlation plot, we can identify features that are highly correlated with one another and drop the redundant ones from the model to reduce duplication.
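To make the VIF concrete, here is a minimal sketch that computes it with NumPy alone. It relies on the fact that the VIF of each feature equals the corresponding diagonal entry of the inverse of the feature correlation matrix. The data is synthetic and the feature names `a`, `b`, and `c` are hypothetical; they are not columns from the lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in features (NOT the lesson's data):
# b is nearly a copy of a, while c is independent of both.
n = 1000
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)
names = ["a", "b", "c"]
X = np.column_stack([a, b, c])

def vif(features):
    """VIF per column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features, rowvar=False)
    return np.diag(np.linalg.inv(corr))

vifs = dict(zip(names, vif(X)))
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

Here `a` and `b` get very large VIFs because each is almost fully explained by the other, while `c` stays close to 1. A commonly quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity, though the exact cutoff is a judgment call.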
We’re going to look at the “mean” features and see which are highly correlated with each other using a heatmap. Uncomment the relevant lines of code and press Run to see the heatmap. There are two features that are highly positively correlated with radius. Can you spot them?*
*The heatmap shows that radius, perimeter, and area are all highly positively correlated (think of the formula for the area of a circle!).
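The circle intuition can be reproduced on made-up data. The sketch below is an illustration only: it generates synthetic circles (not the lesson's dataset), builds a correlation matrix with pandas, and lists the feature pairs whose absolute correlation exceeds a cutoff. The column names and the 0.9 threshold are hypothetical choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic circles (NOT the lesson's data): perimeter and area are derived
# from radius with a little noise, while texture is generated independently.
n = 500
radius = rng.uniform(5, 25, size=n)
df = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 1, size=n),
    "area": np.pi * radius**2 + rng.normal(0, 10, size=n),
    "texture": rng.normal(20, 4, size=n),
})

corr = df.corr()

# Keep only the upper triangle (excluding the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

threshold = 0.9  # hypothetical cutoff; choose one that suits your data
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

The same `corr` matrix is what a heatmap displays, so a listing like this is a quick way to confirm what you think you see in the plot.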
There is another pair of features that’s highly correlated too. Identify the pair and create an array named correlated_pair containing the two features.