In this lesson you will learn how to implement the Logistic Regression algorithm and use it to make predictions on your data.

When an email lands in your inbox, how does your email service know whether it's real or spam? This evaluation is made billions of times per day, and one possible method is logistic regression. 

_**Logistic regression**_ is a supervised machine learning algorithm that predicts the probability, ranging from 0 to 1, of a datapoint belonging to a specific category, or class. These probabilities can then be used to assign, or *classify*, observations to the more probable group.

For example, we could use a logistic regression model to predict the probability that an incoming email is spam. If that probability is greater than `0.5`, we could automatically send it to a spam folder. This is called _binary classification_ because there are only two groups (e.g., spam or not spam).

Some other examples of problems that we could solve using logistic regression:
* Disease identification &mdash; Is a tumor malignant? 
* Customer conversion &mdash; Will a customer arriving on a sign-up page enroll in a service?

In this lesson you will learn how to perform logistic regression and use it to make predictions!

If you are unfamiliar with linear regression, we recommend you review it before proceeding. Otherwise, let's dive in!

Introduction

With the data from Codecademy University, we want to predict whether each student will pass their final exam. Recall that in linear regression, we fit a line of the following form to the data:

```tex
y = b_{0} + b_{1}x_{1} + b_{2}x_{2} +\cdots + b_{n}x_{n}
``` 
where
* `y` is the value we are trying to predict
* `b_0` is the intercept of the regression line
* `b_1`, `b_2`, … `b_n` are the coefficients
* `x_1`, `x_2`, … `x_n` are the predictors (also sometimes called *features*)

For our data, `y` is a binary variable, equal to either `1` (passing), or `0` (failing). We have only one predictor (`x_1`): `num_hours_studied`. Below we've fitted a linear regression model to our data and plotted the results. The best fit line is in red.
 
<img src="https://static-assets.codecademy.com/Courses/logistic-regression/linear_regression_ccu.svg" title="Linear Regression Model on Exam Data" />

We see that the linear model does not fit the data well. Our goal is to predict whether a student passes or fails; however, a best fit line allows predictions between negative and positive infinity.
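
To see the problem numerically, here is a minimal sketch that fits a linear regression to a small, made-up pass/fail dataset and checks its predictions. The `hours` and `passed` arrays are illustrative assumptions, not the Codecademy University data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 20 students who studied 0..19 hours;
# in this toy example, everyone with 10 or more hours passed
hours = np.arange(20).reshape(-1, 1)
passed = (hours.ravel() >= 10).astype(int)

linear_model = LinearRegression()
linear_model.fit(hours, passed)

# Predict for a few study times, including one outside the observed range
new_hours = np.array([[0], [10], [19], [30]])
predictions = linear_model.predict(new_hours)
print(predictions)
```

Some of the printed values fall below 0 or above 1, so they cannot be read as probabilities of passing.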



Linear Regression Approach

We saw that predicted outcomes from a linear regression model range from negative to positive infinity. These predictions don't really make sense for a classification problem. This is where _**logistic regression**_ steps in!

To build a logistic regression model, we apply a _**logit link function**_ to the left-hand side of our linear regression function. Remember the equation for a linear model looks like this:

```tex
y = b_{0} + b_{1}x_{1} + b_{2}x_{2} +\cdots + b_{n}x_{n}
``` 

When we apply the logit function, we get the following:

```tex
ln(\frac{y}{1-y}) = b_{0} + b_{1}x_{1} + b_{2}x_{2} +\cdots + b_{n}x_{n}
```

For the Codecademy University example, this means that we are fitting the curve shown below to our data &mdash; instead of a line, like in linear regression:

![sigmoid function imposed on the plot of passing vs. hours studied](https://static-assets.codecademy.com/Courses/logistic-regression/sigmoid_hours_studied_basic.svg)

Notice that the red line stays between 0 and 1 on the y-axis. It now makes sense to interpret this value as a probability of group membership, whereas that would have been nonsensical for regular linear regression.

Note that this is a pretty nifty trick for adapting a linear regression model to solve classification problems! There are actually many other kinds of link functions that we can use for different adaptations.

Logistic Regression

So far, we've learned that the equation for a logistic regression model looks like this:

```tex
ln(\frac{p}{1-p}) = b_{0} + b_{1}x_{1} + b_{2}x_{2} +\cdots + b_{n}x_{n}
```

Note that we've replaced *y* with the letter *p* because we are going to interpret it as a probability (e.g., the probability of a student passing the exam). The whole left-hand side of this equation is called _**log-odds**_ because it is the natural logarithm (*ln*) of odds (*p/(1-p)*). The right-hand side of this equation looks exactly like regular linear regression!

In order to understand how this link function works, let's dig into the interpretation of _**log-odds**_ a little more. The _odds_ of an event occurring is:

```tex
Odds = \frac{p}{1-p} = \frac{P(event\ occurring)}{P(event\ not\ occurring)}
```

For example, suppose that the probability a student passes an exam is *0.7*. That means the probability of failing is *1 - 0.7 = 0.3*. Thus, the odds of passing are:

```tex
Odds\ of\ passing = \frac{0.7}{0.3} = 2.\overline{3}
```

This means that students are about 2.33 times as likely to pass as to fail.

Odds can only be a positive number. When we take the natural log of odds (the log odds), we transform the odds from a positive value to a number between negative and positive infinity &mdash; which is exactly what we need! The logit function (log odds) transforms a probability (which is a number between 0 and 1) into a continuous value that can be positive or negative.
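
As a quick check of the arithmetic, the sketch below computes the odds and log odds for the passing probability of `0.7` used above, plus one lower probability (`0.2`, chosen only for illustration) to show that log odds can be negative.

```python
import math

p = 0.7                      # probability of passing, from the example above
odds = p / (1 - p)           # odds of passing
log_odds = math.log(odds)    # natural log of the odds (the logit of p)

print(round(odds, 2))        # 2.33
print(round(log_odds, 2))    # 0.85

# A probability below 0.5 gives odds below 1 and therefore negative log odds
p_low = 0.2                  # illustrative value, not from the lesson data
log_odds_low = math.log(p_low / (1 - p_low))
print(round(log_odds_low, 2))  # -1.39
```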

Log-Odds

Let's return to the logistic regression equation and demonstrate how this works by fitting a model in sklearn. The equation is:

```tex
ln(\frac{p}{1-p}) = b_{0} + b_{1}x_{1} + b_{2}x_{2} +\cdots + b_{n}x_{n}
```

Suppose that we want to fit a model that predicts whether a visitor to a website will make a purchase. We'll use the number of minutes they spent on the site as a predictor. The following code fits the model:

```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(min_on_site, purchase)
```

Next, just like linear regression, we can use the right-hand side of our regression equation to make predictions for each of our original datapoints as follows:

```python
log_odds = model.intercept_ + model.coef_ * min_on_site 
print(log_odds)
```

Output:

```
[[-3.28394203]
 [-1.46465328]
 [-0.02039445]
 [ 1.22317391]
 [ 2.18476234]]
```

Notice that these predictions range from negative to positive infinity: these are log odds. In other words, for the first datapoint, we have:

```tex
ln(\frac{p}{1-p}) = -3.28394203
```

We can turn log odds into a probability as follows:

```tex
\begin{aligned}
ln(\frac{p}{1-p}) = -3.28 \\
\frac{p}{1-p} = e^{-3.28} \\
p = e^{-3.28} (1-p) \\
p = e^{-3.28} - e^{-3.28}*p \\
p + e^{-3.28}*p = e^{-3.28} \\
p * (1 + e^{-3.28}) = e^{-3.28} \\
p = \frac{e^{-3.28}}{1 + e^{-3.28}} \\
p = 0.04
\end{aligned}
```

In Python, we can do this simultaneously for all of the datapoints using NumPy (loaded as `np`):

```python
np.exp(log_odds)/(1+ np.exp(log_odds))
```

Output:

```
array([[0.0361262 ],
       [0.18775665],
       [0.49490156],
       [0.77262162],
       [0.89887279]])
```
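
The steps above can be run end to end on a small dataset. The sketch below uses made-up values for `min_on_site` and `purchase` (they are assumptions, so the numbers will not match the output shown above) and compares the hand-computed probabilities with the ones `sklearn` reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: minutes on site and whether a purchase was made (1 = yes)
min_on_site = np.array([1, 2, 3, 5, 6, 8, 9, 11, 12, 14]).reshape(-1, 1)
purchase = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(min_on_site, purchase)

# Right-hand side of the regression equation: the log odds for each datapoint
log_odds = model.intercept_ + model.coef_ * min_on_site

# Convert the log odds into probabilities, as in the derivation above
probabilities = np.exp(log_odds) / (1 + np.exp(log_odds))

# sklearn's own probabilities for the positive class, for comparison
sklearn_probabilities = model.predict_proba(min_on_site)[:, 1]
print(np.allclose(probabilities.ravel(), sklearn_probabilities))
```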

The calculation that we just did required us to use something called the _sigmoid function_, which is the inverse of the logit function. The sigmoid function produces the S-shaped curve we saw previously:

<img src="https://content.codecademy.com/programs/data-science-path/logistic-regression/sigmoid.png" title="Sigmoid Function" />

Sigmoid Function

Now that we've learned a little bit about how logistic regression works, let's fit a model using [`sklearn`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

To do this, we'll begin by importing the `LogisticRegression` class and creating a `LogisticRegression` object:

```py
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
```

After creating the object, we need to fit our model on the data. We can accomplish this using the `.fit()` method, which takes two parameters: a matrix of features and a vector of class labels (the outcome we are trying to predict).

```py
model.fit(features, labels)
```

Now that the model is trained, we can access a few useful attributes:
* `model.coef_` is a vector of the coefficients of each feature 
* `model.intercept_` is the intercept

The coefficients can be interpreted as follows: 

- Large positive coefficient: a one unit increase in that feature is associated with a large **increase** in the log odds (and therefore probability) of a datapoint belonging to the positive class (the outcome group labeled as `1`)
- Large negative coefficient: a one unit increase in that feature is associated with a large **decrease** in the log odds/probability of belonging to the positive class. 
- Coefficient of 0: The feature is not associated with the outcome.

One important note: `sklearn`'s logistic regression implementation applies regularization by default, so the features should be standardized before fitting. Otherwise, the penalty affects features on different scales unevenly.
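
Here is a minimal sketch of standardizing features and then fitting the model. The feature names, the random data, and the rule used to generate `labels` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
hours_studied = rng.uniform(0, 20, size=100)
practice_score = rng.uniform(0, 1000, size=100)
features = np.column_stack([hours_studied, practice_score])

# In this toy setup, the label depends on hours studied plus some noise
labels = (hours_studied + rng.normal(0, 3, size=100) > 10).astype(int)

# Standardize each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

model = LogisticRegression()
model.fit(features_scaled, labels)

print(model.coef_)       # one coefficient per feature, shape (1, 2)
print(model.intercept_)  # the intercept, shape (1,)
```

Because both features are on the same scale after standardization, their coefficients can be compared more directly.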

Fitting a model in sklearn

Using a trained model, we can predict whether new datapoints belong to the positive class (the group labeled as `1`) using the `.predict()` method. The input is a matrix of features and the output is a vector of predicted labels, `1` or `0`. 

```py
print(model.predict(features))
# Sample output: [0 1 1 0 0]
```
If we are more interested in the predicted probability of group membership, we can use the `.predict_proba()` method. The input to `predict_proba()` is also a matrix of features and the output is an array of probabilities, ranging from `0` to `1`:

```py
print(model.predict_proba(features)[:,1])
# Sample output: [0.32 0.75  0.55 0.20 0.44]
```
By default, `.predict_proba()` returns the probability of class membership for both possible groups. In the example code above, we've only printed out the probability of belonging to the positive class. Notice that datapoints with predicted probabilities greater than 0.5 (the second and third datapoints in this example) were classified as `1`s by the `.predict()` method. This is a process known as thresholding. As we can see here, sklearn sets the default classification threshold probability as 0.5.
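
The relationship between the two methods can be checked directly. This sketch uses a small made-up dataset (the `features` and `labels` values are assumptions) and applies the `0.5` cutoff by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset
features = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
labels = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(features, labels)

both_columns = model.predict_proba(features)  # shape (10, 2): P(class 0), P(class 1)
positive_proba = both_columns[:, 1]           # probability of the positive class

# Applying a 0.5 cutoff by hand should reproduce .predict()
manual_labels = (positive_proba > 0.5).astype(int)
print(np.array_equal(manual_labels, model.predict(features)))
```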

Predictions in sklearn

As we've seen, logistic regression is used to predict the probability of group membership. Once we have this probability, we need to make a decision about what class a datapoint belongs to. This is where the _**classification threshold**_ comes in!

The default threshold for `sklearn` is `0.5`. If the predicted probability of an observation belonging to the positive class is greater than the threshold, `0.5`, the datapoint is assigned to the positive class.

<img src="https://content.codecademy.com/programs/data-science-path/logistic-regression/Threshold-01.svg" title="Threshold at 0.5" />

We can choose to change the threshold of classification based on the use-case of our model. For example, if we are creating a logistic regression model that classifies whether or not an individual has cancer, we may want to be more sensitive to the positive cases. We wouldn't want to tell someone they don't have cancer when they actually do!

In order to ensure that most patients with cancer are identified, we can move the classification threshold down to `0.3` or `0.4`, increasing the sensitivity of our model to predicting a positive cancer classification. While this might result in more overall misclassifications, we are now missing fewer of the cases we are trying to detect: actual cancer patients.
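
`.predict()` always uses the default cutoff, so one simple way to apply a different threshold is to compare the predicted probabilities against it ourselves. The sketch below reuses the sample probabilities shown earlier; with a fitted model they would come from `model.predict_proba(features)[:, 1]`.

```python
import numpy as np

# Predicted probabilities of the positive class (sample values from earlier)
probabilities = np.array([0.32, 0.75, 0.55, 0.20, 0.44])

default_labels = (probabilities > 0.5).astype(int)  # default cutoff
lowered_labels = (probabilities > 0.4).astype(int)  # more sensitive cutoff

print(default_labels)  # [0 1 1 0 0]
print(lowered_labels)  # [0 1 1 0 1]
```

Lowering the threshold to `0.4` moves the last datapoint (probability `0.44`) into the positive class.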

<img src="https://content.codecademy.com/programs/data-science-path/logistic-regression/Threshold-02.svg" title="Threshold at 0.4" />

Classification Thresholding

When we fit a machine learning model, we need some way to evaluate it. Often, we do this by splitting our data into training and test datasets. We use the training data to fit the model; then we use the test set to see how well the model performs with new data.

As a first step, data scientists often look at a confusion matrix, which shows the number of true positives, false positives, true negatives, and false negatives. 

For example, suppose that the true and predicted classes for a logistic regression model are:

```python
y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
```

We can create a confusion matrix as follows:

```python
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_true, y_pred))
```

Output:

```
[[3 2]
 [1 4]]
```
This output tells us that there are `3` true negatives, `1` false negative, `4` true positives, and `2` false positives. Ideally, we want the numbers on the main diagonal (in this case, `3` and `4`, which are the true negatives and true positives, respectively) to be as large as possible.
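
If we want the four counts as separate variables, one common approach is to flatten the matrix. This sketch uses the same `y_true` and `y_pred` as above.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Rows are true classes and columns are predicted classes, so flattening
# gives: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 2 1 4
```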

Confusion matrix

Once we have a confusion matrix, there are a few different statistics we can use to summarize the four values in the matrix. These include accuracy, precision, recall, and F1 score. We won't go into much detail about these metrics here, but a quick summary is shown below (T = true, F = false, P = positive, N = negative). For all of these metrics, a value closer to 1 is better and closer to 0 is worse.

- Accuracy = (TP + TN)/(TP + FP + TN + FN)
- Precision = TP/(TP + FP)
- Recall = TP/(TP + FN)
- F1 score = 2 * (Precision * Recall)/(Precision + Recall), the harmonic mean of precision and recall

In `sklearn`, we can calculate these metrics as follows:

```python
# accuracy:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_true, y_pred))
# output: 0.7

# precision:
from sklearn.metrics import precision_score
print(precision_score(y_true, y_pred))
# output: 0.67

# recall: 
from sklearn.metrics import recall_score
print(recall_score(y_true, y_pred))
# output: 0.8

# F1 score
from sklearn.metrics import f1_score
print(f1_score(y_true, y_pred))
# output: 0.73
```
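
To connect these functions back to the formulas, the sketch below recomputes each metric by hand from the confusion matrix counts for the same `y_true` and `y_pred` (TP = 4, TN = 3, FP = 2, FN = 1).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Counts taken from the confusion matrix for these labels
tp, tn, fp, fn = 4, 3, 2, 1

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Rounded to two decimals, matching the outputs listed above
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```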

Accuracy, Recall, Precision, F1 Score

Congratulations! You just learned how a logistic regression model works and how to fit one to a dataset. Here are some of the things you learned:
* Logistic regression is used to perform binary classification.
* Logistic regression is an extension of linear regression where we use a logit link function to fit a sigmoid curve to the data, rather than a line.
* We can use the coefficients from a logistic regression model to estimate the log odds that a datapoint belongs to the positive class. We can then transform the log odds into a probability.
* The coefficients of a logistic regression model can be used to estimate relative feature importance.
* A classification threshold is used to determine the probabilistic cutoff for where a data sample is classified as belonging to a positive or negative class. The default cutoff in sklearn is `0.5`.
* We can evaluate a logistic regression model using a confusion matrix or summary statistics such as accuracy, precision, recall, and F1 score.


Review

Find the probability of data samples belonging to a specific class with one of the most popular classification algorithms.

`(0.43 * 0.23) + (1.24 * -0.12) + (-0.85 * 0.33) = -0.3304`

`(0.43 * 0.23) + (1.24 * -0.12) + (-0.85 * 0.33) + 0.42 = 0.0896`

`(0.43 + 0.23) * (1.24 + -0.12) * (-0.85 + 0.33) + 0.42 = 0.0356`

`(0.43 + 0.23) * (1.24 + -0.12) * (-0.85 + 0.33) - 0.42 = -0.8044`

detects credit card fraud, where "fraudulent" is the positive class.

predicts flight delays, where "delayed" is the positive class.

predicts customer conversion on a checkout page, where "converted" is the positive class.

predicts cancer diagnoses, where "has cancer" is the positive class.

Greatest impact: `recent_purchase`
Least impact: `international`

Greatest impact: `charge_amount`
Least impact: `previous_purchase`

Greatest impact: `previous_purchase`
Least impact: `international`

Greatest impact: `previous_purchase`
Least impact: `previous_purchase`

Practice what you've learned about logistic regression with this multiple choice quiz!

In this project you will use a Logistic Regression model to predict whether or not a passenger survived the sinking of the RMS Titanic.

The file `passengers.csv` contains the data of `892` passengers onboard the Titanic when it sank that fateful day. Let's begin by loading the data into a pandas DataFrame named `passengers`. Print `passengers` and inspect the columns. What features could we use to predict survival?

Given the saying, "women and children first," `Sex` and `Age` seem like good features to predict survival. Let's map the text values in the `Sex` column to a numerical value. Update `Sex` such that all values `female` are replaced with `1` and all values `male` are replaced with `0`.

Let's take a look at `Age`. Print `passengers['Age'].values`. You can see we have multiple missing values, or `nan`s. Fill all the empty `Age` values in passengers with the mean age.

Given the strict class system onboard the Titanic, let's utilize the `Pclass` column, or the passenger class, as another feature. Create a new column named `FirstClass` that stores `1` for all passengers in first class and `0` for all other passengers.

Create a new column named `SecondClass` that stores `1` for all passengers in second class and `0` for all other passengers.

Print `passengers` and inspect the DataFrame to ensure all the updates have been made.
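
The cleaning steps above could be sketched as follows. To keep the example self-contained, it builds a tiny made-up DataFrame instead of loading `passengers.csv`; the four rows are assumptions, and only the column names come from the instructions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for passengers.csv
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Pclass": [3, 1, 2, 1],
    "Survived": [0, 1, 1, 0],
})

# Encode Sex as 1 for female and 0 for male
passengers["Sex"] = passengers["Sex"].map({"female": 1, "male": 0})

# Replace missing ages with the mean of the known ages
passengers["Age"] = passengers["Age"].fillna(passengers["Age"].mean())

# Indicator columns for first and second class
passengers["FirstClass"] = (passengers["Pclass"] == 1).astype(int)
passengers["SecondClass"] = (passengers["Pclass"] == 2).astype(int)

print(passengers)
```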

Now that we have cleaned our data, let's select the columns we want to build our model on. Select columns `Sex`, `Age`, `FirstClass`, and `SecondClass` and store them in a variable named `features`. Select column `Survived` and store it in a variable named `survival`.

Split the data into training and test sets using `sklearn`'s `train_test_split()` method. We'll use the training set to train the model and the test set to evaluate the model.

Since `sklearn`'s Logistic Regression implementation uses _Regularization_, we need to scale our feature data. Create a `StandardScaler` object, `.fit_transform()` it on the training features, and `.transform()` the test features.
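
A minimal sketch of the split-then-scale pattern is shown below. The feature matrix is randomly generated stand-in data, and the `test_size` and `random_state` values are illustrative choices rather than requirements of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with columns Sex, Age, FirstClass, SecondClass
rng = np.random.default_rng(42)
pclass = rng.integers(1, 4, size=40)
features = np.column_stack([
    rng.integers(0, 2, size=40),   # Sex
    rng.uniform(1, 70, size=40),   # Age
    (pclass == 1).astype(int),     # FirstClass
    (pclass == 2).astype(int),     # SecondClass
])
survival = rng.integers(0, 2, size=40)

train_features, test_features, train_labels, test_labels = train_test_split(
    features, survival, test_size=0.2, random_state=1
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training set only
train_features = scaler.fit_transform(train_features)
# Reuse those training statistics to scale the test set
test_features = scaler.transform(test_features)

print(train_features.shape, test_features.shape)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.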

Create a `LogisticRegression` model with `sklearn` and `.fit()` it on the training data.

Fitting the model will run an iterative optimizer (by default `sklearn` uses the `lbfgs` solver, rather than plain gradient descent) to find the feature coefficients that minimize the regularized log-loss for the training data.

`.score()` the model on the training data and print the training score.

Scoring the model on the training data will run the data through the model and make final classifications on survival for each passenger in the training set. The score returned is the proportion of correct classifications (a value between 0 and 1), or the accuracy.

`.score()` the model on the test data and print the test score.

Similarly, scoring the model on the testing data will run the data through the model and make final classifications on survival for each passenger in the test set.

How well did your model perform?

Print the feature coefficients determined by the model. Which feature is most important in predicting survival on the sinking of the Titanic?

Let's use our model to make predictions on the survival of a few fateful passengers. Provided in the code editor is information for 3rd class passenger `Jack` and 1st class passenger `Rose`,  stored in `NumPy` arrays. The arrays store 4 feature values, in the following order:
* `Sex`, represented by a `0` for male and `1` for female
* `Age`, represented as an integer in years
* `FirstClass`, with a `1` indicating the passenger is in first class
* `SecondClass`, with a `1` indicating the passenger is in second class

A third array, `You`, is also provided in the code editor with empty feature values. Uncomment the line containing `You` and update the array with your information, or the information for some fictitious passenger. Make sure to enter all values as floats with a `.`!

Combine `Jack`, `Rose`, and `You` into a single `NumPy` array named `sample_passengers`.

Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on. Using the `StandardScaler` object created earlier, apply its `.transform()` method to `sample_passengers` and save the result to `sample_passengers`.

Print `sample_passengers` to view the scaled features.

Who will survive, and who will sink? Use your model's `.predict()` method on `sample_passengers` and print the result to find out.

Want to see the probabilities that led to these predictions? Call your model's `.predict_proba()` method on `sample_passengers` and print the result. The 1st column is the probability of a passenger perishing on the Titanic, and the 2nd column is the probability of a passenger surviving the sinking (which was calculated by our model to make the final classification decision).

Predict Titanic Survival

Use Logistic Regression to classify income levels of adults using Census Income Data!

The dataset has been loaded for you in **script.py** and saved as a dataframe named `df`. The outcome variable here is income. Check if the dataset is imbalanced.

Notice we have created a variable named `feature_cols`. This contains a list of the variables we will use as our predictor variables.

Transform the dataset of predictor variables to dummy variables and save this in a new DataFrame called `X` (upper case "X").

Using `X`, create a heatmap of the correlation values.

Determine if scaling is needed for `X` prior to modeling. 

Then create the `y` output variable, which is binary: `0` when income is at most $50K and `1` when it is greater than $50K.

Split the data into a training and testing set. Set the `random_state` to `1` and `test_size` to `.2`.

Then using `x_train, x_test, y_train, y_test`, fit a logistic regression model in scikit-learn on the training set with parameters `C=0.05, penalty='l1', solver='liblinear'`.

Lastly, use `.predict()` to create the y predictions and save this as `y_pred`.

Print the model parameters (intercept and coefficients).

Evaluate the predictions of the model on the test set.  Print the confusion matrix and accuracy score.

Create a new DataFrame of the model coefficients and variable names.  Sort values based on coefficient and exclude any that are equal to zero.  Print the values of the DataFrame.
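
One way to build such a table is sketched below. The data and column names (`age`, `hours_per_week`, `noise`) are made up for illustration; only the model settings `C=0.05, penalty='l1', solver='liblinear'` come from the instructions above. Newer `scikit-learn` releases may emit a deprecation warning for the `penalty` argument.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical predictors standing in for the dummy-encoded census features
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.uniform(18, 70, size=200),
    "hours_per_week": rng.uniform(10, 60, size=200),
    "noise": rng.normal(size=200),
})
# Toy outcome that depends on hours_per_week plus some noise
y = (X["hours_per_week"] + rng.normal(0, 5, size=200) > 40).astype(int)

model = LogisticRegression(C=0.05, penalty="l1", solver="liblinear")
model.fit(X, y)

# Pair each coefficient with its column name, drop zeros, and sort ascending
coef_df = pd.DataFrame({"var": X.columns, "coef": model.coef_[0]})
coef_df = coef_df[coef_df["coef"] != 0].sort_values("coef")
print(coef_df)
```

An L1 penalty can shrink some coefficients to exactly zero, which is why filtering out zeros is a useful step here.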

Create a barplot of the coefficients sorted in ascending order.

Plot the ROC curve and print the AUC value.

Income Classification using Logistic Regression

Congratulations, you’ve successfully completed the Machine Learning: Logistic Regression course! You've learned how to use machine learning to classify data.

Your learning journey into Machine Learning isn't over yet! Here is our roadmap to mastering Machine Learning:

* [Machine Learning: Logistic Regression](https://www.codecademy.com/machine-learning-logistic-regression) <-- Completed!
* [Machine Learning: Random Forests & Decision Trees](https://www.codecademy.com/machine-learning-random-forests-decision-trees) <-- Up next!
* [Machine Learning: Clustering with K-Means](https://www.codecademy.com/machine-learning-clustering-with-k-means)
* [Machine Learning: Perceptrons](https://www.codecademy.com/machine-learning-perceptrons)

Once again, congratulations on finishing the Machine Learning: Logistic Regression course! We are excited to see what you accomplish next.

You've completed Machine Learning: Logistic Regression! What's next?

Next Steps

Get ready to dive deeper into logistic regression! Learn about the assumptions that go into it, what to do when there's class imbalance and how to set and evaluate prediction thresholds!

We're now ready to delve deeper into Logistic Regression! In this lesson, we will cover the different assumptions that go into logistic regression, model hyperparameters, how to evaluate a classifier, ROC curves, and what to do when there's a class imbalance in the classification problem we're working with. 

For this lesson, we will be using the [Wisconsin Breast Cancer Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) (Diagnostic) to predict whether a tumor is benign (`0`) or malignant (`1`) based on characteristics of the cells, such as radius, texture, smoothness, etc.  Like a lot of real-world data sets, the distribution of outcomes is uneven (benign diagnoses are more common than malignant) and there is a bias in terms of the importance of the outcomes (classifying all malignant cases correctly is of the utmost importance). 

We're going to begin with the primary assumptions about the data that need to be checked before implementing a logistic regression model.

###### 1. The target variable is binary 
One of the most basic assumptions of logistic regression is that the outcome variable needs to be binary, which means there are two possible outcomes. Multinomial logistic regression is an exception to this assumption and is beyond the scope of this lesson.

###### 2. Independent observations
While often overlooked, checking for independent observations in a data set is important for logistic regression.  This can be violated if, in this case, patients are biopsied multiple times (repeated sampling of the same individual).  

###### 3. Large enough sample size
Since logistic regression is fit using maximum likelihood estimation instead of least squares minimization, there must be a large enough sample to get convergence. When a model fails to converge, this causes the estimates to be extremely inaccurate.  Now, what does a "large enough" sample mean? Often a rule of thumb is that there should be at least 10 samples per feature for the smallest class in the outcome variable. 

For example, if there were 100 samples and the outcome variable `diagnosis` had 60 benign tumors and 40 malignant tumors, then the max number of features allowed would be 4. To get 4 we took the smallest of the classes in the outcome variable, 40, and divided it by 10.
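
The same rule of thumb, written as a small calculation using the class counts from the example:

```python
# Class counts from the example above
class_counts = {"benign": 60, "malignant": 40}

samples_per_feature = 10  # rule-of-thumb minimum per feature

# Divide the size of the smallest class by the samples needed per feature
max_features = min(class_counts.values()) // samples_per_feature
print(max_features)  # 4
```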

###### 4. No influential outliers
Logistic regression is sensitive to outliers, so we must remove any extremely influential outliers before model building.  Outliers are a broad topic with many different definitions -- z-scores, a scalar multiple of the interquartile range, Cook's distance/influence/leverage, etc. -- so there are many ways to identify them.  But here, we will use visual tools to rule out obvious outliers.  

Assumptions of Logistic Regression I

##### 1. Features linearly related to log odds
Similar to linear regression, the underlying assumption of logistic regression is that the features are linearly related to the logit of the outcome.  To test this visually, we can use Seaborn's `regplot`, with the parameter `logistic=True` and the x value as our feature of interest.  If this condition is met, the fitted curve will resemble a sigmoidal curve (as is the case when `x=radius_mean`). 

We've added code to create another plot using the feature `fractal_dimension_mean`. Press **Run** in the workspace. How do the curves compare?

##### 2. Multicollinearity
Like in linear regression, one of the assumptions is that there is no multicollinearity in the data, meaning the features should not be highly correlated with one another. Multicollinearity can cause the coefficients and p-values to be inaccurate. With a correlation plot, we can see which features are highly correlated and then drop one feature from each highly correlated pair.

We're going to look at the "mean" features which are highly correlated with each other using a heatmap correlation plot.
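
Alongside the heatmap, the correlations can be inspected numerically. This sketch assumes `scikit-learn`'s bundled copy of the data set, whose column names (`mean radius`, `mean perimeter`, ...) differ from the Kaggle CSV (`radius_mean`, `perimeter_mean`, ...). The `0.9` cutoff is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled version of the breast cancer data
data = load_breast_cancer(as_frame=True)
mean_features = data.data.loc[:, data.data.columns.str.startswith("mean")]

# Absolute pairwise correlations between the "mean" features
corr = mean_features.corr().abs()

# List each pair of features whose correlation exceeds the cutoff
threshold = 0.9
pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
for pair in pairs:
    print(pair)
```

Features such as radius, perimeter, and area are geometrically related, so it is not surprising to see them appear together in this list.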


Assumptions of Logistic Regression II

#### Model Training and Hyperparameters
Now that we have checked the assumptions of Logistic Regression, we can train a model and make predictions using `scikit-learn`.  We will first set the _hyperparameters_ of our model. Hyperparameters are set before the model is fit and tuned later to improve model performance. Conversely, parameters, such as the intercept and coefficients, are the result of fitting the model.

_Note_: Within `scikit-learn` these hyperparameters are often referred to as "parameters" which might cause some confusion. It is worth noting that the meaning within `scikit-learn` documentation refers to these being "parameters" _of the function_ and not of the model itself.
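
The distinction can be seen in code. This sketch assumes `scikit-learn`'s bundled copy of the breast cancer data, which encodes malignant as `0` and benign as `1`, so the labels are flipped to match this lesson's convention. The hyperparameter values are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: bundled data set, flipped so that malignant = 1 and benign = 0
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y
X = StandardScaler().fit_transform(X)

# Hyperparameters: chosen by us before fitting (illustrative values)
model = LogisticRegression(C=0.5, fit_intercept=True, max_iter=1000)
print(model.get_params()["C"])

# Parameters: learned from the data during fitting
model.fit(X, y)
print(model.coef_.shape, model.intercept_.shape)
```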

#### Evaluation Metrics
Despite the name, logistic regression is being used as a classifier here, so any evaluation metric for classification tasks applies. The simplest metric is accuracy -- how many correct predictions did we make out of the total?  However, when classes are imbalanced, accuracy can be a misleading measure of model performance.  Similarly, if we care more about accurately predicting a certain class, other metrics such as precision, recall, or F1-score may be more appropriate. All of these metrics are available in `scikit-learn`. Check out the [Evaluation Metrics](https://www.codecademy.com/paths/machine-learning-engineer/tracks/mle-machine-learning-fundamentals/modules/mlecp-supervised-learning-i-regressors-classifiers-and-trees/lessons/mlfun-evaluation-metrics-classification/exercises/confusion-matrix) lesson if you'd like to brush up on them.
```
Accuracy = (TP + TN)/Total

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)

F1 score = 2*((Precision*Recall)/(Precision+Recall))
```
#### Which metrics matter most?
For our breast cancer dataset, predicting ALL malignant cases as malignant is of the utmost importance -- and even if there are some false positives (benign cases that are marked as malignant), these likely will be discovered by follow-up tests.  Whereas missing a malignant case (classifying it as benign) could have deadly consequences.  Thus, we want to minimize false negatives. This in turn will maximize the recall ratio (also known as the sensitivity or true positive rate). 

`scikit-learn` Implementation

Logistic regression not only predicts the class of a sample, but also the probability of a sample belonging to each class. It provides us with a measure of certainty associated with each prediction. In the default implementation in `scikit-learn`, a probability greater than 50% means that the predicted outcome will belong to the positive class. This is referred to as a prediction threshold. If two samples have predicted probabilities of 51% and 99%, both will be considered positive with the default threshold.  However, if the threshold is increased to 60%, a predicted probability of 51% will be assigned the negative class.

![thresholding](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/logistic-regression/thresholding.png)

Consider the histogram of the predicted probabilities for the logistic regression classifier shown above.  The benign (or negative class) is depicted in blue, and the malignant (or positive class) in orange for the breast cancer data set.  The benign cases are heavily clustered around zero, which is good as they will be correctly classified as benign, whereas malignant cases are heavily clustered around one.  The vertical lines depict hypothetical threshold values at 25%, 50%, and 75%.  For the highest threshold, almost all the samples above 75% belong to the malignant class, but there will be some benign cases that are misdiagnosed as malignant (false positives).  In addition, there are a number of malignant cases that are missed (false negatives).  If instead the lowest threshold value is used, almost all the malignant cases are identified, but there are more false positives.  

Therefore, the value of the threshold is an additional lever that can be used to tune a model's predictions. A higher value is generally associated with fewer false positives and more false negatives, whereas a lower value is associated with fewer false negatives and more false positives.
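
The trade-off can be counted directly. This is a sketch under stated assumptions: it uses `scikit-learn`'s bundled breast cancer data (flipped so malignant is `1`), an illustrative train/test split, and the three hypothetical thresholds from the histogram.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: bundled data set, flipped so that malignant = 1 and benign = 0
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
malignant_proba = model.predict_proba(X_test)[:, 1]

# Count false positives and false negatives at each threshold
results = {}
for threshold in (0.25, 0.5, 0.75):
    y_pred = (malignant_proba > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    results[threshold] = {"FP": int(fp), "FN": int(fn)}
    print(threshold, results[threshold])
```

Raising the threshold can only keep the false negative count the same or increase it, while the false positive count moves in the opposite direction.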

Prediction Thresholds

We have examined how changing the threshold can affect the logistic regression predictions.  There is a continuum of predictions available in a single model by varying the threshold incrementally from zero to one.  For each of these thresholds, the True Positive Rate (TPR) and the False Positive Rate (FPR) can be calculated and then plotted.  The resulting curve these points form is known as the Receiver Operating Characteristic (ROC) curve.

![There is a two-dimensional plot with "FPR" on the x-axis and "TPR" on the y-axis. Both axes range from 0 to 1. There is an orange curve on the plot that is positively sloped with the words ROC labeled above it. There are three data points on this ROC curve with the labels .25, .5, and .75. There is also a straight dashed blue diagonal line that starts at zero and has a slope of 1.](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/logistic-regression/ROC_curve_thresh.svg)

In the ROC curve plotted above, the True Positive Rate (`TPR = TP / (TP + FN)`) is on the y-axis and the False Positive Rate (`FPR = FP / (TN + FP)`) is on the x-axis. The ROC curve is the orange line and the dashed blue line is the Dummy Classification line, which is the equivalent of random guessing.
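Both rates divide by a sum, so the denominators are `TP + FN` and `FP + TN` taken as a whole. Here is a small sketch of the two formulas; the confusion-matrix counts are hypothetical and the function names are chosen for this example.

```python
def true_positive_rate(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of actual negatives that were predicted positive."""
    return fp / (fp + tn)


# Hypothetical confusion-matrix counts for one threshold.
tp, fn, fp, tn = 40, 10, 20, 80

tpr = true_positive_rate(tp, fn)   # 40 / (40 + 10)
fpr = false_positive_rate(fp, tn)  # 20 / (20 + 80)

# Without parentheses, Python divides first: tp / tp + fn is 1 + fn.
unparenthesized = tp / tp + fn

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Computing this pair of values for each threshold gives one point on the ROC curve.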

Notice there are three data points on the ROC curve, each labeled with its threshold value. The classification threshold of `.5` will give us a TPR of about `.65` with an FPR of about `.28`. For our specific data, we want a higher TPR so that we catch as many malignant tumors as possible. We might select a lower threshold of `.25` so that our TPR is about `.8`, even though this may give us an FPR of about `.4`. The ROC curve can help us decide on a threshold that best fits our specific classification problem.

While the ROC curve shows how the TPR and FPR trade off across every threshold, the AUC (Area Under the Curve) summarizes the curve in a single metric for separability. The AUC tells us how well our model can distinguish between the two classes. An AUC score close to 1 is a near-perfect classifier, whereas a value of 0.5 is equivalent to random guessing. To visualize different AUC scores, look at the ROC curve plots below:
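In `scikit-learn`, the points of the ROC curve and the AUC can be obtained with `roc_curve` and `roc_auc_score` from `sklearn.metrics`. The sketch below uses four hypothetical labels and scores rather than the lesson's data.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve.
auc = roc_auc_score(y_true, y_score)

# Scores that rank every positive above every negative separate the classes perfectly.
perfect_auc = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)
print("AUC with perfect separation:", perfect_auc)
```

Here one positive sample (0.35) scores below one negative sample (0.4), so the AUC is 0.75 rather than 1.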

![There are two plots in this image that both have "FPR" on the x-axis and "TPR" on the y-axis. The plot on the left has an orange line that is labeled "ROC" that goes straight up at 0 on the x-axis and then goes directly to the right at the top of the plot. There are words inside the plot that says "AUC = 1". The plot on the right has a straight orange line labeled "ROC" that starts at 0 and has a slope of 1 (making the line 45 degrees away from the x-axis). The words inside this plot state, "AUC = .5".](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/logistic-regression/ROC_AUC.svg)

ROC Curve and AUC

Class imbalance is when your binary classes for the outcome variable are not evenly split.  Technically, anything different from a 50/50 distribution would be imbalanced and need appropriate care. In the case of rare events, sometimes the positive class can be less than 1% of the total. If your classes are significantly imbalanced, this could create a bias towards the majority class since the model learns that it can have a higher accuracy if it predicts the majority class more often.

###### Positivity Rate
We can use the positivity rate to tell us how balanced our classes are. The positivity rate is the rate of occurrence for the positive class. With our breast cancer data, the formula is `Positivity Rate = Total Malignant Cases / Total Cases`. If our positivity rate is close to .5, then our classes are balanced.
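As a quick illustration, the positivity rate of a list of 0/1 labels is simply its mean. The labels below are hypothetical, and `positivity_rate` is a helper defined for this sketch.

```python
def positivity_rate(labels):
    """Fraction of samples in the positive class (labels are 0 or 1)."""
    return sum(labels) / len(labels)


# Hypothetical labels: 3 malignant (1) and 7 benign (0) cases.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

rate = positivity_rate(labels)
print(f"Positivity rate: {rate:.2f}")
```

A value of 0.3 means the positive class makes up 30% of the data, so the classes are moderately imbalanced.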

###### Stratification
If your classes are imbalanced (more likely to happen with smaller datasets), then this difference can become even greater after you split your data into training and testing datasets. One way to mitigate this is to randomly split using stratification on the class labels. Stratification is when the data is sorted into subgroups to ensure a nearly equal class distribution in your train and test sets. After using stratification, the training and testing datasets should have a very similar positivity rate (but stratification does not necessarily move the positivity rate of the dataset closer to .5).
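One way to stratify in `scikit-learn` is the `stratify` argument of `train_test_split`. The sketch below uses a made-up dataset with a positivity rate of 0.2; the feature values are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 positive and 80 negative samples (positivity rate 0.2).
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)  # placeholder feature values

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print("Train positivity rate:", y_train.mean())
print("Test positivity rate:", y_test.mean())
```

Both splits keep the original positivity rate of 0.2, which shows that stratification preserves the imbalance rather than correcting it.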

###### Undersampling/Oversampling
To bring the positivity rate of the dataset closer to .5, we can undersample the majority class or oversample the minority class. For oversampling, repeated samples (with replacement) are taken from the minority class until its size is equal to that of the majority class. This causes the same data to be used multiple times, giving a higher weight to these samples. Alternatively, undersampling leaves out some of the majority class data to have the same number of samples as the minority class, leaving less data to build the model. We will not be practicing undersampling/oversampling in this exercise even though it can be a useful way to balance the classes.
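Although the exercise does not practice it, the idea can be sketched with the standard library alone. The tiny dataset below is hypothetical; in practice resampling is usually applied only to the training data.

```python
import random

random.seed(0)  # make the sketch reproducible

# Hypothetical samples as (label, id) pairs: 8 benign, 2 malignant.
majority = [("benign", i) for i in range(8)]
minority = [("malignant", i) for i in range(2)]

# Oversampling: draw from the minority class WITH replacement
# until it matches the size of the majority class.
oversampled_minority = random.choices(minority, k=len(majority))
balanced_by_oversampling = majority + oversampled_minority

# Undersampling: keep only a random subset of the majority class
# (without replacement) that matches the size of the minority class.
undersampled_majority = random.sample(majority, k=len(minority))
balanced_by_undersampling = undersampled_majority + minority

print("Oversampled dataset size:", len(balanced_by_oversampling))
print("Undersampled dataset size:", len(balanced_by_undersampling))
```

Oversampling grows the dataset to 16 rows by repeating minority samples, while undersampling shrinks it to 4 rows by discarding majority samples.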

###### Balance the Class Weight
When training a model, it is the default for every sample to be weighted equally. However, in the case of class imbalance, this can result in poor predictive power for the smaller of the two classes. A way to counteract this in logistic regression is to use the parameter `class_weight='balanced'`. This applies a weight inversely proportional to the class frequency, therefore supplying higher weight to misclassified instances in the smaller class. While overall accuracy may not increase, this can increase the accuracy for the smaller class (e.g. increase the number of malignant cases correctly diagnosed). 
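Here is a minimal sketch of the parameter in use, assuming synthetic imbalanced data from `make_classification`. The manual weight calculation follows the formula given in the `scikit-learn` documentation for the `'balanced'` mode, `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (an assumption): roughly 90% class 0, 10% class 1.
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)

# Weights implied by 'balanced': inversely proportional to class frequency.
class_counts = np.bincount(y)
balanced_weights = len(y) / (2 * class_counts)
print("Class counts:", class_counts)
print("Balanced weights:", balanced_weights)

# Errors on the smaller class now cost more during training.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)
```

The smaller class receives the larger weight, so the model is penalized more heavily for misclassifying it.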

Keep in mind that we want the recall score (also known as the True Positive Rate) to be as high as we can get it for our breast cancer data.
```
Recall = TP / (TP + FN)
```


Class Imbalance

Nice work! Let's review all the concepts you've learned in this lesson:

- The logistic regression primary assumptions include: 
    1. The target variable is binary
    2. The observations are independent of one another
    3. The sample size must be large enough
    4. There should not be any extreme outliers in the data

- Additional assumptions of LR include: 
    1. Features are linearly related to the logit of the outcome (sigmoidal curve)
    2. No multicollinearity (can cause the coefficients and p-values to be inaccurate)

- We can train a model and make predictions with it using `scikit-learn`. 

- Hyperparameters are set before the model implementation step and tuned later to improve model performance.

- We can use metrics (accuracy, precision, recall, or F1-score) to evaluate our Logistic Regression model.

- The prediction threshold (the probability cutoff above which a sample is assigned to the positive class) can be an additional lever to tune a model’s predictions. In our breast cancer data, we wanted a lower threshold to get fewer false negatives (a malignant tumor classified as benign).

- For each prediction threshold, the True Positive Rate (TPR) and the False Positive Rate (FPR) can be calculated and then plotted. The resulting curve these points form is known as the Receiver Operating Characteristic (ROC) curve.

- The AUC (Area Under the Curve) tells us how well our model can distinguish between the two classes. An AUC score close to 1 is a near-perfect classifier, whereas a value of 0.5 is equivalent to random guessing.

- If your classes are significantly imbalanced, this could create a bias towards the majority class since the model learns that it can have a higher accuracy if it predicts the majority class more often (not good).

- To bring the positivity rate of the dataset closer to .5, we can undersample the majority class, oversample the minority class, or balance the weights with `class_weight='balanced'`.

Logistic Regression II Review

Logistic Regression II

Include patients who completed multiple quit-smoking programs. Person #1234 has two rows in the dataset, one from when they completed the program in 2020 and one from when they completed the program in 2021.

Exclude patients who previously participated in a quit-smoking program.

Exclude patients who did not complete the quit-smoking program.

Randomly select a set of patients who completed the quit-smoking program and had not previously participated in one.

| | Predicted False | Predicted True |
|--| --- | ----------- |
| Actual False |190 | 30 |
| Actual True|5 | 55 |

| | Predicted False | Predicted True |
|--| --- | ----------- |
| Actual False |210 | 10 |
| Actual True|30 | 30 |

| | Predicted False | Predicted True |
|--| --- | ----------- |
| Actual False |190 | 10 |
| Actual True|20 | 60 |

| | Predicted False | Predicted True |
|--| --- | ----------- |
| Actual False |200 | 10 |
| Actual True|10 | 40 |

Outcome classes are evenly balanced and correctly predicting positives and negatives are of equal importance.

Outcome classes are unbalanced and correctly predicting positives and negatives are of equal importance.

Outcome classes are balanced and correctly predicting positives is more important than correctly predicting negatives.

Outcome classes are unbalanced and correctly predicting negatives is more important than correctly predicting positives.

### About this course
Continue your Machine Learning journey with Machine Learning: Logistic Regression. Learn how to implement and evaluate Logistic Regression models, and interpret the probabilities they return. Use these skills to predict the class of new data points.

### Skills you'll gain
* Prepare data for a Logistic Regression model
* Implement and assess Logistic Regression models
* Solve problems like disease identification and customer conversion


Learn about the assumptions behind the logistic regression algorithm, prediction thresholds, ROC curves and class imbalance.

Predict the probability that a datapoint belongs to a given class with Logistic Regression.