When working with nominal categorical variables in Python, it can be useful to apply one-hot encoding, a technique that creates one binary variable for each nominal category. This encodes the variable without imposing an order among the categories. To one-hot encode a variable in a pandas DataFrame, we can use the `pd.get_dummies()` function.

    df = pd.get_dummies(data=df, columns=['column1', 'column2'])
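To make the idea concrete, here is a minimal plain-Python sketch of what one-hot encoding produces. It is not how pandas implements it, and the `colors` data is invented for illustration.

```python
# Plain-Python sketch of one-hot encoding (illustrative only; in practice
# pd.get_dummies does this for you). The "colors" data is made up.
colors = ["red", "green", "blue", "green"]

# One binary column per distinct category, in sorted order.
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Each row gets a 1 in the column of its own category and 0 elsewhere.
encoded = [[1 if value == category else 0 for category in categories]
           for value in colors]

for value, row in zip(colors, encoded):
    print(value, row)
```

Every row contains exactly one `1`, so no category is ranked above another.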

Before diving into deep learning, it is best practice to investigate your dataset to get acquainted with the features, size, and structure of the information you are working with. You can investigate your data with pandas, using properties such as `.shape` and methods like `.describe()`.

Neural networks cannot work with string data. Therefore, if upon inspection you find that your data contains strings, you can use *one-hot encoding* to convert categorical features into numerical features. An example of this is pictured below. To do this in Python, you can use the pandas `pd.get_dummies()` function.

    import pandas as pd

    # load the dataset
    dataset = pd.read_csv('dataset.csv')
    # choose the first 6 columns as features (the slice 0:6 excludes index 6)
    features = dataset.iloc[:, 0:6]
    # choose the final column for prediction
    labels = dataset.iloc[:, -1]
    # see useful summary statistics for numeric features
    print(features.describe())
    # shape and summary statistics of labels
    print(labels.shape)
    print(labels.describe())
    # use one-hot encoding
    numerical_features = pd.get_dummies(features)

When training a deep learning model (or any other machine learning model), split your data into *train* and *test* sets. The train set is used during the learning process, while the test set is used to evaluate the results of your model.

To perform this in Python, we use the `train_test_split()` function from the scikit-learn library.

    from sklearn.model_selection import train_test_split

    # Here we chose the test size to be 33% of the total data, and random_state
    # controls the shuffling applied to the data before applying the split.
    features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.33, random_state=42)
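The following plain-Python sketch illustrates the idea behind a seeded, shuffled split. The helper `simple_train_test_split` is hypothetical and is not scikit-learn's implementation; it only shows why a fixed seed makes the split repeatable.

```python
import random

def simple_train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 data points
train, test = simple_train_test_split(rows, test_fraction=0.33, seed=42)
print(len(train), len(test))  # 67 33
```

Calling the helper again with the same seed returns the same two lists, which is the role `random_state` plays in the scikit-learn call.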

When preprocessing our data, we want to make sure all our features have similar scales. This is because deep learning models (like many other learning models) tend to train better when no feature dominates simply because of its scale. *Standardization* and *normalization* are both common scaling methods.

Standardization rescales each feature to have a mean of zero and unit variance (equal to one). Normalization rescales each feature to a fixed range, usually between `0` and `1`. Both are viable options when preparing your data for the learning process.

    # Standardization can be implemented in the following way with scikit-learn:
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import ColumnTransformer

    ct = ColumnTransformer([("scale", StandardScaler(), ['age', 'bmi', 'children'])], remainder='passthrough')
    # fit on the training set only, then reuse the same scaling on the test set
    features_train = ct.fit_transform(features_train)
    features_test = ct.transform(features_test)
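The arithmetic behind both scaling methods can be sketched in plain Python. The `values` column is made up, and the population standard deviation is used here as an assumption about the usual convention.

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up feature column

# Standardization: subtract the mean, divide by the standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Normalization (min-max): map the smallest value to 0 and the largest to 1.
low, high = min(values), max(values)
normalized = [(v - low) / (high - low) for v in values]

print([round(z, 3) for z in standardized])
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After standardization the column has mean zero and unit variance. After normalization it lies in the range from 0 to 1.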

A *sequential* deep learning model is a linear stack of *layers* with one *input* layer where data enters the neural network and one *output* layer where data exits the neural network. Each of these stacked layers contains at least one neuron, and they are the building blocks of our neural networks.

Here is an example layer diagram with three neurons. The `W` and `b` labels in the diagram represent weights and bias.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras import layers

    # initializing a sequential model
    model = Sequential()
    # creating a layer with 3 neurons
    layer = layers.Dense(3)
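As a rough illustration of what the weights and biases do, the sketch below computes the output of a three-neuron layer by hand, with no activation function. The numbers and the helper `dense_forward` are invented, and storing one row of weights per neuron is a layout chosen for readability, not necessarily how Keras stores them.

```python
def dense_forward(inputs, weights, biases):
    """Return one output per neuron: dot(weights_for_neuron, inputs) + bias."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]

inputs = [1.0, 2.0]                 # two input features (made up)
W = [[0.5, -1.0],                   # one row of weights per neuron
     [1.5, 0.5],
     [-0.5, 0.25]]
b = [0.0, 1.0, -1.0]                # one bias per neuron

outputs = dense_forward(inputs, W, b)
print(outputs)  # [-1.5, 3.5, -1.0]
```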

When compiling a deep learning model, *loss* is measured to evaluate the success of the results. A lower loss means better performance. Since the goal is to achieve the best performance possible (without overfitting or underfitting), *optimizers* are used to continuously update the weights and parameters and improve loss metrics.

In the case of regression, the most commonly used loss function is the mean squared error, `mse` (the average squared difference between the estimated values and the actual values).

Additionally, we want to observe the progress of the mean absolute error (`mae`) while training the model, because `mae` can give us a better idea than `mse` of how far off we are from the true values in the units we are predicting.

    from tensorflow.keras.optimizers import Adam

    # compiling our deep learning model with the following parameters:
    # mean squared error as the loss function
    # mean absolute error as the metric
    # Adam as the optimizer -- a widely used one
    opt = Adam(learning_rate=0.01)
    my_model.compile(loss='mse', metrics=['mae'], optimizer=opt)
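To see why the two metrics read differently, here is a small hand computation on made-up targets and predictions. It is a sketch of the definitions, not of how Keras computes them internally.

```python
y_true = [3.0, 5.0, 2.0, 7.0]   # made-up targets
y_pred = [2.0, 5.0, 4.0, 4.0]   # made-up predictions

errors = [p - t for p, t in zip(y_pred, y_true)]  # [-1.0, 0.0, 2.0, -3.0]
mse = sum(e ** 2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)
print(mse, mae)  # 3.5 1.5
```

The MAE of 1.5 is in the same units as the target, while the MSE of 3.5 is in squared units and is pulled up by the single large error.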

Once a deep learning model is compiled, it is time to fit it to the training data and evaluate it on the test data. Using the Keras `.fit()` method on the training data, we specify the following parameters:

- the training set of the data
- the true labels for the training set of data
- `epochs`, which is the number of cycles through the full training dataset
- `batch_size`, which is the number of data points to work through before updating the model parameters

After we fit the model, we evaluate it using the Keras `.evaluate()` method on the test set of data.

    # fitting our model
    my_model.fit(train_data, train_labels, epochs=50, batch_size=3, verbose=1)
    # evaluating our model
    val_mse, val_mae = my_model.evaluate(test_data, test_labels, verbose=0)
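The interaction between `epochs` and `batch_size` can be sketched with a little arithmetic. The helper `count_updates` is hypothetical and assumes one parameter update per batch, with a smaller final batch when the data does not divide evenly; the sample count of 100 is invented.

```python
import math

def count_updates(n_samples, batch_size, epochs):
    """Number of parameter updates when every batch triggers one update."""
    batches_per_epoch = math.ceil(n_samples / batch_size)  # last batch may be smaller
    return batches_per_epoch * epochs

print(count_updates(n_samples=100, batch_size=3, epochs=50))  # 1700
```

Smaller batches mean more updates per epoch, which is one reason the two settings are usually tuned together.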

In a sequential deep learning model, we have three different types of layers:

- *Input Layer*: A placeholder for data to enter the neural network
- *Output Layer*: The final layer of the neural network where results are outputted
- *Hidden Layer*: An intermediate layer that adds more complexity and captures non-linear interactions among inputs and outputs in a neural network

There is always exactly one input layer and one output layer, while there can be as many hidden layers as desired (even zero). Together, all these layers create neural networks like the one shown here:

    from tensorflow.keras.layers import InputLayer
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.models import Sequential

    my_model = Sequential()
    # adding an input layer for a dataframe with 15 columns
    my_model.add(InputLayer(input_shape=(15,)))
    # hidden layer with 64 neurons and relu activation function
    my_model.add(Dense(64, activation='relu'))
    # adding an output layer to our model
    my_model.add(Dense(1))
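As a sanity check on the size of such a network, the sketch below counts the trainable parameters of fully connected layers with widths 15, 64, and 1. The helper `dense_param_count` is hypothetical and assumes each dense layer has one weight per input-unit pair plus one bias per unit.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [15, 64, 1]  # input width, hidden units, output units
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [1024, 65] 1089
```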

After training and evaluating a neural network model, one must start the process of *hyperparameter tuning*, which involves tweaking hyperparameter values to continuously improve results.

In the image, you’ll see how we use the three datasets and our hyperparameters to adjust and evaluate our model’s performance:

- We use training data to adjust the weights and biases of our model to change its fit.
- We use validation data to evaluate the model’s performance.
- If the validation performance is good, we can use our test data to check if our model still performs well with a completely new set of data.
- If the validation performance isn’t good, we tweak our hyperparameters before retraining the model:
  - the learning rate
  - batch size
  - number of epochs
  - number of hidden layers
  - optimizer

When going through the process of hyperparameter tuning, there are several common parameters to adjust:

- *learning rate*: determines how big of a change is applied to the weights as a consequence of the error gradient calculated on a batch of training data
- *batch size*: determines how many training samples are seen before updating the network’s parameters (weight and bias matrices)
- *epochs*: represents the number of complete passes through the training dataset
- *layers*: the number of hidden layers we decide to put in our model

Tuning these hyperparameters is key to strong model performance. Making slight changes to them can alter performance in major ways, so hyperparameter tuning is often the longest process of building a model.
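A toy example can show how much a single hyperparameter matters. The sketch below runs plain gradient descent on an invented one-parameter loss, `(w - 3)**2`, with two learning rates; it stands in for the idea only and is not a neural network.

```python
def gradient_descent(learning_rate, steps, start=0.0):
    """Minimize the toy loss (w - 3)**2, whose gradient is 2 * (w - 3)."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w -= learning_rate * gradient
    return w

w_fast = gradient_descent(learning_rate=0.1, steps=50)
w_slow = gradient_descent(learning_rate=0.01, steps=50)
print(round(w_fast, 3), round(w_slow, 3))
```

With the same number of steps, the larger learning rate lands very close to the minimum at 3, while the smaller one is still more than a full unit away.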

While in the process of hyperparameter tuning for a deep learning model, a good rule of thumb is to start by adding one hidden layer and add as many parameters as there are features existing in the dataset.

To avoid overfitting in a deep learning model, one can specify early stopping in TensorFlow with Keras by creating an `EarlyStopping` callback and adding it as a parameter when we fit our model. An implementation of `EarlyStopping` is shown with the following settings:

- `monitor='val_loss'`, which means we are monitoring the validation loss to decide when to stop the training
- `mode='min'`, which means we seek minimal loss
- `patience=40`, which means that if the learning reaches a plateau, it will continue for 40 more epochs in case the plateau leads to improved performance

    from tensorflow.keras.callbacks import EarlyStopping

    stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=40)
    history = model.fit(features_train, labels_train, epochs=num_epochs, batch_size=16, verbose=0, validation_split=0.2, callbacks=[stop])
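The patience logic can be sketched in plain Python. The helper `early_stopping_epoch` is hypothetical and simplified (it ignores options such as a minimum improvement threshold), and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stopped_epoch, best_epoch, best_loss) for a list of losses."""
    best_loss, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:          # improvement: remember it, reset counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:                         # no improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch, best_epoch, best_loss
    return len(val_losses) - 1, best_epoch, best_loss  # never triggered

losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]  # made-up validation losses
print(early_stopping_epoch(losses, patience=3))  # (5, 2, 0.6)
```

With a patience of 3, training stops at epoch 5 and never reaches the later loss of 0.5. That is the trade-off a larger patience value is meant to soften.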

When tuning a deep learning model, one can use *grid search*, also called *exhaustive search*, to try every combination of desired hyperparameter values.

If, for example, we want to try learning rates of 0.01 and 0.001 and batch sizes of 10, 30, and 50, grid search will try six combinations of parameters (0.01 and 10, 0.01 and 30, 0.01 and 50, 0.001 and 10, and so on).
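The six combinations from this example can be enumerated with the standard library. This sketch only lists the grid; it does not train or score any model.

```python
from itertools import product

learning_rates = [0.01, 0.001]
batch_sizes = [10, 30, 50]

# Every (learning rate, batch size) pair: 2 * 3 = 6 combinations.
combinations = list(product(learning_rates, batch_sizes))
for learning_rate, batch_size in combinations:
    print(learning_rate, batch_size)
```

The grid grows multiplicatively with each added hyperparameter, which is why exhaustive search becomes slow quickly.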

To implement this in Python, we use `GridSearchCV` from scikit-learn. For regression, we first need to wrap our neural network model in a `KerasRegressor`. Then, we need to set up the desired hyperparameter grid (we don’t use many values for the sake of speed). Finally, we initialize a `GridSearchCV` object and fit our model to the data. The implementation of this is shown in the code snippet.

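    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import make_scorer, mean_squared_error
    # KerasRegressor is assumed to come from a Keras scikit-learn wrapper,
    # and design_model to be a function that builds and compiles the model

    model = KerasRegressor(build_fn=design_model)
    # batch sizes and epochs to test
    batch_size = [4, 8, 16, 64]
    epochs = [10, 50, 100, 200]
    # setting up our grid of parameters
    param_grid = dict(batch_size=batch_size, epochs=epochs)
    # initializing a grid search
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=make_scorer(mean_squared_error, greater_is_better=False))
    # fitting the results
    grid_result = grid.fit(features_train, labels_train, verbose=0)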

When tuning a deep learning model, one can use *random search* to go through random combinations of hyperparameters over a specific interval.

Randomized search will sample values for `batch_size` and `epochs` from discrete uniform distributions on specified intervals. For example, in the code snippet shown, we sample random batch sizes from 2 to 15 and random epoch counts from 10 to 99 (the upper bounds passed to `sp_randint` are exclusive) for a fixed number of iterations. In our case, 12 iterations:

    from scipy.stats import randint as sp_randint
    from sklearn.model_selection import RandomizedSearchCV

    # distributions over batch sizes from 2 to 15 and epochs from 10 to 99 (upper bounds are exclusive)
    param_grid = {'batch_size': sp_randint(2, 16), 'epochs': sp_randint(10, 100)}
    # initializing random search
    # score is using mse as the metric and looking for lower scores
    # 12 iterations
    grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid, scoring=make_scorer(mean_squared_error, greater_is_better=False), n_iter=12)
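The sampling step on its own can be sketched with the standard library. This is only an illustration of drawing 12 random configurations, not of how scikit-learn samples internally; the seed and the dictionary keys are chosen for the example.

```python
import random

rng = random.Random(0)  # seeded only so the sketch is repeatable
n_iter = 12

# randrange(low, high) excludes high, like a half-open interval [low, high).
samples = [{"batch_size": rng.randrange(2, 16), "epochs": rng.randrange(10, 100)}
           for _ in range(n_iter)]
for sample in samples:
    print(sample)
```

Unlike grid search, the number of configurations tried is fixed by `n_iter` rather than by the size of the search space.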

*Regularization* is a set of techniques that help avoid overfitting by preventing the learning process from fitting the training data too closely.

*Dropout* is a regularization technique that randomly ignores, or “drops out”, a number of outputs of a layer by setting them to zeros.

The dropout rate is the fraction of layer outputs set to zero (usually between 20% and 50%). In Keras, we can apply dropout by adding a `Dropout` layer.

    import tensorflow as tf
    from tensorflow.keras import layers
    from tensorflow.keras.models import Sequential

    # A model with two dropout layers
    # setting up model and input layer
    model = Sequential()
    my_input = tf.keras.Input(shape=(20,))
    model.add(my_input)
    model.add(layers.Dense(128, activation='relu'))
    # dropout layer with dropout rate of 0.1
    model.add(layers.Dropout(0.1))
    model.add(layers.Dense(64, activation='relu'))
    # dropout layer with dropout rate of 0.2
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(24, activation='relu'))
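What a dropout layer does to its inputs during training can be sketched in plain Python. The helper `dropout` is hypothetical; it zeroes each value with probability `rate` and rescales the survivors by `1 / (1 - rate)` so the expected sum is unchanged. The activations are made up.

```python
import random

def dropout(outputs, rate, rng):
    """Zero each output with probability `rate`; rescale the survivors."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale
            for value in outputs]

rng = random.Random(0)            # seeded only so the sketch is repeatable
layer_outputs = [1.0] * 10        # made-up activations
dropped = dropout(layer_outputs, rate=0.2, rng=rng)
print(dropped)
```

Each surviving value becomes 1.25 and each dropped one becomes 0.0. At inference time dropout is switched off and the values pass through unchanged.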